CS240A Final, Project II:
Data Mining in SQL and Datalog
In this project, you will gain experience with, and an understanding
of, the problems that DB query languages face in supporting predictive
analytics, even when the task is as simple as 1R and NBC classifiers,
which do not require recursion. In fact, you are expected to build
generic classifiers, i.e., classifiers that operate on tables with an
arbitrary number of columns, once these are verticalized into column
form.
Your specific tasks are as follows:
TASK A
• Build 1R and NBC classifiers using DeAL and test them on the verticalized
representation in the DeAL tutorial. However, you should not use the rules in the
DeAL tutorial. You should instead simplify the training rules by using a Laplace
estimator where missing examples are counted as one (see the note below). Also, in
the decision rules do not rely on the user-defined aggregate given in the notes: use
the standard aggregates instead.
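A common form of the Laplace estimator (an assumption here; use the variant
from the notes if it differs) estimates the conditional probability of value a
of attribute A given class c as

   P(A = a | C = c) = (count(a, c) + 1) / (count(c) + V)

where V is the number of distinct values of A. Thus a combination that never
occurs in the training set still receives the nonzero estimate
1 / (count(c) + V) instead of zero.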
TASK B
• Using DB2, build an NBC classifier for a dataset used in Task A and one or more
datasets of your choice. See if you can find interesting datasets to classify
(perhaps some that you have used in other projects). You can find interesting ones
at the following sites:
• http://www.cs.toronto.edu/~delve/data/
• http://kdd.ics.uci.edu/summary.data.alphabetical.html
You are encouraged to try new datasets and applications, and if you have your own
interesting application you should use it!
Task B consists of the following subtasks (you should try to implement them using
clean and compact SQL):
1. Select a dataset and load it into DB2 as tables called, say, DataSet1, DataSet2,
etc. Then for each dataset do the following:
2. Randomly partition your DataSet into a TrainSet and a TestSet (the first
containing about three times as many tuples as the second); see the partitioning
sketch after this list.
3. If your data contain numerical attributes, then you should represent them in the
TrainSet by either (i) discretizing them, or (ii) approximating their probability,
e.g., by assuming that they follow a simple Gaussian distribution. (Of course,
different columns might be better treated by different methods.)
4. Devise a strategy for dealing with missing values.
5. Build a Naive Bayesian Classifier using DB2's SQL aggregates and (preferably)
table functions, and store it in a table called NBC; see the training sketch after
this list.
--Try to provide general-purpose solutions and code, expecting that they will be
used on data sets with an arbitrary number of columns.
6. Write SQL queries that take the tuples in TestSet (without class labels)
and predict those class labels using your NBC; see the prediction sketch after
this list.
7. First build a 1R classifier (single column) and compare its results to those of
the NBC classifier.
8. [Boosting of single classifier] Find the misclassified samples from TrainSet, and
increase their weights (e.g., by simply duplicating them, as in the boosting sketch
after this list) to implement the boosting step. Repeat steps 5 and 6 above, but
stop as soon as you see the accuracy stop improving.
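The following is a minimal sketch of step 2, assuming each dataset has an
integer key column rid (a hypothetical schema). In DB2, RAND() is evaluated
once per row, so about three quarters of the tuples land in TrainSet:

   CREATE TABLE TrainSet LIKE DataSet1;
   CREATE TABLE TestSet  LIKE DataSet1;

   -- About three quarters of the tuples go to TrainSet ...
   INSERT INTO TrainSet
   SELECT * FROM DataSet1 WHERE RAND() < 0.75;

   -- ... and the remaining tuples go to TestSet.
   INSERT INTO TestSet
   SELECT d.* FROM DataSet1 d
   WHERE NOT EXISTS (SELECT 1 FROM TrainSet t WHERE t.rid = d.rid);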
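For step 5, one possible shape of the training query is sketched below,
assuming the training tuples have already been verticalized into a table
TrainSetV(rid, att, val, class) (again a hypothetical schema, not the one in
the notes). Each conditional probability carries the Laplace correction;
class priors can be stored the same way in a second table:

   CREATE TABLE NBC(class VARCHAR(32), att VARCHAR(32),
                    val VARCHAR(64), prob DOUBLE);

   INSERT INTO NBC
   SELECT t.class, t.att, t.val,
          (COUNT(*) + 1.0) / (c.ctotal + v.nvals)   -- Laplace estimator
   FROM TrainSetV t,
        -- training tuples per class
        (SELECT class, COUNT(DISTINCT rid) AS ctotal
         FROM TrainSetV GROUP BY class) c,
        -- distinct values per attribute
        (SELECT att, COUNT(DISTINCT val) AS nvals
         FROM TrainSetV GROUP BY att) v
   WHERE t.class = c.class AND t.att = v.att
   GROUP BY t.class, t.att, t.val, c.ctotal, v.nvals;

Note that (att, val, class) combinations that never occur in TrainSetV are
absent from NBC; at prediction time they should receive the floor probability
1/(ctotal + nvals) rather than being silently skipped.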
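A matching sketch for step 6, with the test tuples verticalized into an
assumed TestSetV(rid, att, val): products of probabilities become sums of
logarithms, and the best-scoring class is kept per record. For brevity this
version omits the class prior and the floor probability for unseen (att, val)
pairs; a complete solution should add both:

   WITH scores(rid, class, score) AS
     (SELECT t.rid, n.class, SUM(LN(n.prob))
      FROM TestSetV t, NBC n
      WHERE t.att = n.att AND t.val = n.val
      GROUP BY t.rid, n.class)
   SELECT rid, class                 -- predicted label per test record
   FROM scores s
   WHERE score = (SELECT MAX(score) FROM scores s2
                  WHERE s2.rid = s.rid);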
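Finally, a sketch of the duplication trick in step 8, assuming the step-6
query was also run on the training records and its output saved in a
hypothetical table Predicted(rid, class): misclassified records are
re-inserted under fresh rids, doubling their weight in the counts of the
next training round:

   INSERT INTO TrainSetV(rid, att, val, class)
   SELECT t.rid + m.maxrid, t.att, t.val, t.class   -- fresh rids
   FROM TrainSetV t, Predicted p,
        (SELECT MAX(rid) AS maxrid FROM TrainSetV) m
   WHERE t.rid = p.rid
     AND p.class <> t.class;          -- misclassified records only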
Task C. [Ensemble-based bagging for extra credit] See if you can get better decisions
from your DB2 or Datalog classifiers by using voting ensembles of classifiers (possibly
by assigning weights to each classifier on the basis of its accuracy). Alternatively,
build a better voting ensemble by using boosting.
Task D. [Back to Datalog, for extra Credit]. Write a K-means classifier in DeAL using an
XY-stratified program. Test and demonstrate it on a dataset of your choice (e.g.,
http://web.cs.ucla.edu/classes/fall16/cs240A/notes/decision/NYtaxiCalls.txt)
Task E: Write a nice report about your work.
Respective credit for Tasks A, B, and E is 30%, 40%, and 30%. Tasks C and D get 10%
extra credit each. The credit in each task depends on the complexity of the dataset
and mining methods selected, and on the quality of the analysis and solutions
proposed. Focus on those and on writing an interesting report before you work on
C and D, which are meant for extra credit.
More on Data Sets:
Good results were reported in the past with the datasets led, mushrooms,
splice, titanic, waveform, abalone, letter, and census. But datasets are
continuously being revised and upgraded, and you are encouraged to
try new data sets. However, make sure that your data set is not too
small; otherwise your experiments with performance will not be
interesting.
It is important that you write generic classifiers, i.e., classifiers that will
work for data sets having different numbers of columns and data types (e.g.,
discrete and continuous). In fact, you must test and demonstrate your
classifiers on different data sets. The best way to achieve genericity is
to work with tables in a verticalized format, as in Task A (e.g., by using
DB2's table functions to transform your example tables into vertical form).
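For instance, a hypothetical horizontal table DataSet1(rid, outlook, temp, play)
can be verticalized with one query branch per attribute; a DB2 table function can
produce the same triples without spelling out a branch per column:

   INSERT INTO TrainSetV(rid, att, val, class)
   SELECT rid, 'outlook', outlook,    play FROM DataSet1
   UNION ALL
   SELECT rid, 'temp',    CHAR(temp), play FROM DataSet1;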
Verticalized Representation