Presentation - Villanova Computer Science


Sharanya Thandra
TABLE OF CONTENTS

• Datasets definition
• Iris flower dataset
• Weka, the Weka tool
• Handling different datasets with Weka
• Techniques for managing large data sets
• Compression
• Indexing
• Summarization
• Large data sets with Python
• Results and conclusions (Python)
• References
What are datasets?

 A dataset or data set is a collection of data.
 It typically corresponds to a single database table or a single statistical matrix.
 Each column represents a variable, and each row corresponds to a given member of the dataset in question.
 The singular form of data is datum.
Dataset properties

 Characteristics that define a dataset's structure and properties: the number and types of its attributes.
 Statistical measures such as standard deviation and kurtosis.
 Values may be real numbers, integers, or nominal (categorical) data.
 Datasets may also be generated by algorithms.
Classic Datasets?

 Iris flower data set
 Categorical Data Analysis
 Robust Statistics
 Time series
 Extreme Values
 Bayesian Data Analysis
 The Bupa Liver data
 Anscombe's Quartet
Iris Flower Dataset

 A multivariate data set, used as a classic example of discriminant analysis.
 A typical test case for classification techniques in machine learning.
 A good example for explaining the difference between supervised and unsupervised techniques in data mining.
Example Image

An example of the so-called "metro map" for the Iris data set.
Weka (Weh-kah)

 Weka is a popular suite of machine learning software written in Java.
 It contains a collection of tools and algorithms for data analysis and predictive modeling.
 The original version was a non-Java version written in TCL/TK, with preprocessing utilities in C.
 The more recent version is fully Java-based.
 All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation.
 Each data point is described by a fixed number of attributes.
Uses of Weka

 The original version was used mostly in agricultural domains.
 The latest version is used in many different application areas, in research and in education.
 Data mining tasks such as:
• preprocessing,
• clustering,
• classification,
• regression,
• visualization.
Panels of Weka

 Weka's main interface is the Explorer.
The Explorer contains several panels:
• Preprocess panel: imports data from a database or a CSV file and preprocesses the data using filtering algorithms. Filters are used to transform the data and help delete instances and attributes according to specific criteria.
Panels of Weka

• Classify panel: lets the user apply classification and regression algorithms (classifiers) to the resulting dataset, estimate the accuracy of the predictive model, and visualize erroneous predictions and ROC curves.
• Associate panel: provides access to association rule learners that attempt to identify important interrelationships between attributes in the data.
Panels of Weka

• Cluster panel: gives access to clustering techniques in Weka, for example the k-means algorithm. It also provides an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions.
• Select attributes panel: algorithms for identifying the most predictive attributes in a dataset.
• Visualize panel: provides a scatter plot matrix; these plots can be further enlarged and analyzed using different selection operators.
Weka Tool

 Written in Java and runs on any platform.
 Available to users under the GNU General Public License.
• Advantages of using Weka:
• Freely available to all users.
• Highly portable, since it is fully implemented in the Java programming language and therefore runs on any platform.
• Provides a comprehensive collection of data preprocessing and modeling techniques.
• Easy to use thanks to its graphical user interface.
Some links for information on Weka

 http://arxiv.org/ftp/arxiv/papers/1310/1310.4647.pdf
The following link is where the tool can be downloaded:
http://www.cs.waikato.ac.nz/~ml/weka/
Handling large datasets using Weka

 Data characteristics: if the data contains many zeros, we can make use of Weka's sparse data format, which saves a lot of memory. Every algorithm in Weka can take advantage of this memory saving to speed up computation.
 An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.
 It has two distinct sections: header information followed by data information.
Datasets with Weka

 The ARFF header section contains the relation declaration and the attribute declarations.
 The @relation declaration format: @relation <relation-name>
 The @attribute declaration format: @attribute <attribute-name> <datatype>
 Datatypes include numeric and nominal attributes.
 Nominal attributes: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
Handling large data sets with Weka

 String attributes: @ATTRIBUTE LCC string
 Date attributes: @attribute <name> date [<date-format>]
 Relational attributes:
@attribute <name> relational
<further attribute definitions>
@end <name>
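For illustration, here is a minimal sketch of what an ARFF file could look like, loosely modeled on the Iris data (the attribute names are an assumption for the example, not taken from this presentation):

@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa

In the sparse format mentioned two slides earlier, a data row stores only the non-zero values as index-value pairs, so an instance could instead be written as {0 5.1, 1 3.5, 2 1.4, 3 0.2, 4 Iris-setosa}.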
Techniques for managing large datasets

 As we deal with huge amounts of data, the memory required for every datum becomes a very important factor.
 Two features, compression and indexing, help decrease the amount of room needed for datasets.
 Summarization is a third technique.
Techniques for handling large data sets

 It is highly important to know the data before using any of these techniques.
 Ask yourself questions such as:
How dense is the data?
Upon which variables are users and applications likely to process or classify the data?
How evenly distributed are the values of the variables?
How is the data set being used? How often do we refresh it?
 Apart from knowing your data, always benchmark your results before you apply any of these techniques.
Techniques for handling large data sets

 Data set compression: a technique for "squeezing out" excess blanks and abbreviating repeated strings of values, to decrease the data set's size and therefore lower the storage space it requires.
 Uncompressed data might look like this:
ADAMS MIKE
BARNET MARYBETH
whereas the same example after compression is:
@ADAMS#@MIKE#
@BARNET#@MARYBETH#
Compression

 One way to compress data is to use the COMPRESS=YES data set option whenever a new data set is created, for example:
data sasuser.new(compress=yes);
  set sasuser.master;
  ...other statements...
run;
 The data set's descriptor portion records that the data set is compressed.
Compression

 The data set's descriptor portion records the compression of the data set:
Observations:          466560
Variables:             12
Indexes:               0
Observation Length:    149
Deleted Observations:  0
Compressed:            Yes
Reuse Space:           No
Sorted:                No
Data set indexing

 A dataset index is a data structure that specifies the location of observations based on the values of one or more key variables.
 Consider the following example:
December  1(1,2,3,…)  2(…)
February  1(22,23,…)  4(…)
 In the above example, the numbers 1, 2 and 1, 4 indicate the data set pages.
 The numbers in parentheses are the relative observation numbers within each page matching each distinct key value.
Data set indexing

 The data set's descriptor portion records the indexes associated with the data set:
Observations:          466560
Variables:             12
Indexes:               1
Observation Length:    149
Deleted Observations:  0
Compressed:            No
Reuse Space:           No
Sorted:                No
Dataset indexing

Indexing guidelines: index only when
• there is a fairly large number of distinct values for the variables to be indexed,
• the data is not randomly scattered throughout the dataset, and
• the data set is frequently queried and subset on the indexed values.
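The same idea is not specific to SAS. As a hedged sketch, here is what creating and using an index could look like with Python's sqlite3 module (introduced later in this presentation); the database, table, and column names are hypothetical:

import sqlite3

conn = sqlite3.connect('flights.db')
c = conn.cursor()
# hypothetical detail table: one row per flight record
c.execute('CREATE TABLE flights (flight_date TEXT, origin TEXT, dest TEXT)')
# create an index on a column with many distinct values that is frequently queried
c.execute('CREATE INDEX idx_origin ON flights (origin)')
# this query can now locate matching rows via the index instead of a full table scan
c.execute("SELECT * FROM flights WHERE origin = 'PHL'")
conn.commit()
conn.close()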
Summarization

 Questions to ask before summarization:
• To what level of detail must the data be accessible?
• How often is the data refreshed?
• Is it possible to anticipate how best to summarize the data?
 The most important element of efficiency is thoughtful and careful consideration of how the above techniques can best be used, together with thorough testing of the techniques before launching them into production.
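As a hedged sketch of summarization in practice, again using Python's sqlite3 module rather than SAS (the database, table, and column names are hypothetical): a detailed table is pre-aggregated into a much smaller summary table that answers the common queries directly.

import sqlite3

conn = sqlite3.connect('sales.db')
c = conn.cursor()
# hypothetical detail table: one row per individual sale
c.execute('CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)')
# pre-summarize: one row per region per month instead of one row per sale
c.execute('''CREATE TABLE sales_summary AS
             SELECT substr(sale_date, 1, 7) AS month,
                    region,
                    COUNT(*)    AS n_sales,
                    SUM(amount) AS total_amount
             FROM sales
             GROUP BY month, region''')
conn.commit()
conn.close()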
Examples of large data sets available for free

 Million Song Dataset:
• The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary music tracks.
Its purpose is to:
 encourage research on algorithms that scale to commercial sizes,
 provide a reference dataset for evaluating research,
 serve as a shortcut to creating a large dataset with APIs,
 help new researchers get started in the MIR field.
Large data sets using Python

With Python

 Consider an example: the task is to screen a huge set of large data files in text format, with billions of entries each.
 We want a unified database structure that combines all the columns, representing different features, that are listed in the separate text files.
 The goal is a database that is extendable, and the workflow requires that one can efficiently pull entries with combined features for further computation.
 This can be achieved with the sqlite3 Python module, which works with SQLite database structures.
SQLite 3 Module

 SQLite is a C library that provides a lightweight, disk-based database and doesn't require a separate server process.
 It allows accessing the database using a nonstandard variant of the SQL query language.
 The sqlite3 module provides an SQL interface compliant with the DB-API 2.0 specification.
Getting started…

 To use the module, first create a Connection object that represents the database; here the data will be stored in the example.db file:
import sqlite3
conn = sqlite3.connect('example.db')
 Once we have the connection, we can create a Cursor object and call its execute() method to perform SQL commands.
With Python

 c = conn.cursor()
# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')
# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
# Save (commit) the changes
conn.commit()
# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()
With Python

 The data we saved is persistent and can be used later:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
Note: our SQL operations will often need values from Python variables. We shouldn't assemble queries using Python's string operations; doing so is insecure and makes the program vulnerable to SQL injection attacks.
http://xkcd.com/327/
With Python

 # Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

 # Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print c.fetchone()

 # Larger example that inserts many records at a time
purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
             ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
             ('2006-04-06', 'SELL', 'IBM', 500, 53.00)]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
With Python

import sqlite3

# create a new db and make the connection
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# create table
c.execute('''CREATE TABLE my_db (id TEXT, my_var1 TEXT, my_var2 INT)''')

# insert one row of data
c.execute("INSERT INTO my_db VALUES ('ID_2352532','YES', 4)")

# insert multiple lines of data
multi_lines = [('ID_2352533','YES', 1),
               ('ID_2352534','NO', 0),
               ('ID_2352535','YES', 3),
               ('ID_2352536','YES', 9),
               ('ID_2352537','YES', 10)]
c.executemany('INSERT INTO my_db VALUES (?,?,?)', multi_lines)

# save (commit) the changes
conn.commit()

# close connection
conn.close()
With Python

import sqlite3

# make connection to the existing db
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# update field
t = ('NO', 'ID_2352533', )
c.execute("UPDATE my_db SET my_var1=? WHERE id=?", t)
print "Total number of rows changed:", conn.total_changes

# delete rows
t = ('NO', )
c.execute("DELETE FROM my_db WHERE my_var1=?", t)
print "Total number of rows deleted: ", conn.total_changes

# add column
c.execute("ALTER TABLE my_db ADD COLUMN 'my_var3' TEXT")

# save changes
conn.commit()
Benchmarking

 How fast is SQLite?
 For a simple speed comparison, here is an example measuring CPU time for three steps (step (a) is shown on the next slide, and a sketch of steps (b) and (c) follows it):
a) Read in the text file line by line with plain Python.
b) Read in the text file to create an SQLite database.
c) Query the whole database.
read_lines.py

import time

start_time = time.clock()

lines = 0
with open("feature1.txt", "rb") as fileobj:
    for line in fileobj:
        lines += 1

elapsed_time = time.clock() - start_time
print "Time elapsed: {} seconds".format(elapsed_time)
print "Read {} lines".format(lines)
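As a hedged sketch of steps (b) and (c), assuming feature1.txt holds one whitespace-separated id and value per line (the file layout and the table name are assumptions made only for this example):

import sqlite3
import time

# step (b): read the text file into an SQLite database
start_time = time.clock()
conn = sqlite3.connect('benchmark.db')
c = conn.cursor()
c.execute('CREATE TABLE features (id TEXT, value TEXT)')
with open("feature1.txt", "rb") as fileobj:
    for line in fileobj:
        # assumes each line holds exactly two whitespace-separated fields
        c.execute('INSERT INTO features VALUES (?,?)', line.split())
conn.commit()
print "Build time: {} seconds".format(time.clock() - start_time)

# step (c): query the whole database
start_time = time.clock()
rows = c.execute('SELECT * FROM features').fetchall()
print "Query time: {} seconds".format(time.clock() - start_time)
print "Read {} rows".format(len(rows))

conn.close()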
Results and conclusions:

Links for the references

 http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
 http://en.wikipedia.org/wiki/Comma-separated_values
 http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
 http://labrosa.ee.columbia.edu/millionsong/
 http://faculty.washington.edu/kenrice/sisg-adv/sisg09.pdf
 http://sebastianraschka.com/Articles/sqlite3_database.html
 http://www.mssqltips.com/sqlservertip/1200/handling-large-sql-server-tables-with-data-partitioning/
Questions??
