CS186: Introduction to Database Systems
Download
Report
Transcript CS186: Introduction to Database Systems
EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Review
Classification
Training data set
Testing data set
Classification Models
Model evaluation
7/20/2015
Luke Huan Univ. of Kansas
2
Metrics for Performance Evaluation…
PREDICTED CLASS
Class=Yes
ACTUAL
CLASS
Class=No
Class=Yes
a
(TP)
b
(FN)
Class=No
c
(FP)
d
(TN)
Most widely-used metric:
ad
TP TN
Accuracy
a b c d TP TN FP FN
7/20/2015
Luke Huan Univ. of Kansas
3
Cost-Sensitive Measures
PREDICTED CLASS
Class=Y Class=N
es
o
ACTUA Class=Y
L
es
CLASS
Class=N
o
7/20/2015
a
(TP)
b
(FN)
c
(FP)
d
(TN)
a
Precision (p)
ac
a
Recall (r)
ab
2rp
2a
F - measure (F)
r p 2a b c
Luke Huan Univ. of Kansas
4
Today’s Topic
Clustering
XML
7/20/2015
Luke Huan Univ. of Kansas
5
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
Inter-cluster
distances are
maximized
Intra-cluster
distances are
minimized
7/20/2015
Luke Huan Univ. of Kansas
6
Applications of Cluster Analysis
Understanding
Group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks
with similar price fluctuations
Summarization
Reduce the size of large data sets
7/20/2015
Luke Huan Univ. of Kansas
7
Multidisciplinary Efforts of Clustering
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Bioinfo:
Phylogenetic tree
Microarray analysis
7/20/2015
Luke Huan Univ. of Kansas
8
What is not Cluster Analysis?
Supervised classification
Simple segmentation
Dividing students into different registration groups
alphabetically, by last name
Results of a query
Have class label information
Groupings are a result of an external specification
Graph partitioning
7/20/2015
Some mutual relevance and synergy, but areas are not identical
Luke Huan Univ. of Kansas
9
Terms in Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
So what?
Clustering can be used as a stand-alone tool to get insight into data
distribution
Clustering can be used as a preprocessing step for other algorithms
such as discretization
7/20/2015
Luke Huan Univ. of Kansas
10
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
7/20/2015
Luke Huan Univ. of Kansas
11
Types of Clusters: Well-Separated
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than to
any point not in the cluster.
3 well-separated clusters
7/20/2015
Luke Huan Univ. of Kansas
12
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a
cluster, than to the center of any other cluster
The center of a cluster is often a centroid, the average
of all the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
7/20/2015
Luke Huan Univ. of Kansas
13
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
7/20/2015
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Luke Huan Univ. of Kansas
14
Partitional Clustering
Original Points
7/20/2015
A Partitional Clustering
Luke Huan Univ. of Kansas
15
Hierarchical Clustering
p1
p3
p4
p2
p1 p2
Traditional Hierarchical Clustering
p3 p4
Traditional Dendrogram
p1
p3
p4
p2
p1 p2
Non-traditional Hierarchical Clustering
7/20/2015
p3 p4
Non-traditional Dendrogram
Luke Huan Univ. of Kansas
16
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters, s.t., min sum of squared distance
km1tm iKm (Cm tmi )2
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
7/20/2015
Luke Huan Univ. of Kansas
17
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of the
current partition (the centroid is the mean of the cluster)
Assign each object to the cluster with the nearest seed point
Go back to Step 2, stop when no more new assignment
7/20/2015
Luke Huan Univ. of Kansas
18
The K-Means Clustering Method
10
9
8
7
6
5
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
4
3
2
1
0
0
10
10
9
9
8
8
7
7
6
6
5
4
3
2
1
0
0
7/20/2015
Update
the
cluster
means
1
2
3
4
5
6
7
8
9
10
Luke Huan Univ. of Kansas
Update
the
cluster
means
1
2
3
4
5
6
7
8
9
10
5
4
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
19
Comments on the K-Means Method
Strength
Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum.
7/20/2015
The global optimum may be found using techniques such
as genetic algorithms
Luke Huan Univ. of Kansas
20
Comments on the K-Means Method
Weakness
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
7/20/2015
Luke Huan Univ. of Kansas
21
A Problem K-means: Differing Density
K-means (3 Clusters)
Original Points
7/20/2015
Luke Huan Univ. of Kansas
22
A Problem of K-means:
Non-globular Shapes
Original Points
7/20/2015
K-means (2 Clusters)
Luke Huan Univ. of Kansas
23
From HTML to XML (eXtensible Markup Language)
HTML describes presentation of content
<h1>Bibliography</h1>
<p><i>Foundations of Databases</i>
Abiteboul, Hull, and Vianu
<br>Addison Wesley, 1995
<p>…
XML describes only the content
<bibliography>
<book>
<title>Foundations of Databases</title>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publisher>
<year>1995</year>
</book>
<book>…</book>
</bibliography>
Separation of content from presentation simplifies content extraction and
allows the same content to be presented easily in different looks
7/20/2015
Luke Huan Univ. of Kansas
24
Other nice features of XML
Portability: Just like HTML, you can ship XML data
across platforms
Flexibility: You can represent any information
(structured, semi-structured, documents, …)
Relational data requires heavy-weight protocols, e.g.,
JDBC
Relational data is best suited for structured data
Extensibility: Since data describes itself, you can change
the schema easily
Relational schema is rigid and difficult to change
7/20/2015
Luke Huan Univ. of Kansas
25
XML terminology
<bibliography>
<book ISBN=”ISBN-10” price=”80.00”>
<title>Foundations of Databases</t
<is_textbook/>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publish
<year>1995</year>
</book>…
</bibliography>
Tag names: book, title, …
Start tags: <book>, <title>, …
End tags: </book>, </title>, …
An element is enclosed by a pair of start and end tags:
<book>…</book>
Elements can be nested:
<book>…<title>…</title>…</book>
Empty elements: <is_textbook></is_textbook>
Can be abbreviated: <is_textbook/>
Elements can also have attributes: <book ISBN=”…”
price=”80.00”>
7/20/2015
Luke Huan Univ. of Kansas
26
Well-formed XML documents
A well-formed XML document
Follows XML lexical conventions
Wrong: <section>We show that x < 0…</section>
Right: <section>We show that x < 0…</section>
Other special entities: > becomes > and & becomes &
Contains a single root element
Has tags that are properly matched and elements that are properly
nested
Right:
<section>…<subsection>…</subsection>…</sectio
n>
Wrong:
<section>…<subsection>…</section>…</subsectio
n>
7/20/2015
Luke Huan Univ. of Kansas
27
More XML features
Comments: <!-- Comments here -->
CDATA: <![CDATA[Tags: <book>,…]]>
ID’s and references
<person id=”o12”><name>Homer</name>…</person>
<person id=”o34”><name>Marge</name>…</person>
<person id=”o56” father=”o12”
mother=”o34”><name>Bart</name>…</person>…
Namespaces allow external schemas and qualified names
<book xmlns:myCitationStyle=”http://…/mySchema”>
<myCitationStyle:title>…</myCitationStyle:title>
<myCitationStyle:author>…</myCitationStyle:author>…
</book>
Processing instructions for apps: <? …java applet… ?>
And more…
7/20/2015
Luke Huan Univ. of Kansas
28
Valid XML documents
A valid XML document conforms to a Document Type Definition
(DTD)
A DTD specifies
A DTD is optional
A grammar for the document
Constraints on structures and values of elements, attributes, etc.
Example
<!DOCTYPE bibliography [
<!ELEMENT bibliography (book+)>
<!ELEMENT book (title, author*, publisher?, year?, section*)>
<!ATTLIST book ISBN CDATA #REQUIRED>
<!ATTLIST book price CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT section (title, (#PCDATA)?, section*)>
]>
7/20/2015
Luke Huan Univ. of Kansas
29
DTD explained
<!DOCTYPE bibliography [
bibliography is the root element of the document
<!ELEMENT bibliography (book+)>
One or more
bibliography consists of a sequence of one or more book elements
<!ELEMENT book (title, author*, publisher?, year?,
Zero or one
section*)>
Zero or more
book consists of a title, zero or more authors,
an optional publisher, and zero or more sections, in sequence
<!ATTLIST
book
ISBNISBN
ID #REQUIRED>
book has
a required
attribute which is a unique identifier
<bibliography>
<book ISBN=”ISBN-10” price=”80.00”>
<title>Foundations of Databases</
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publis
<year>1995</year>
</book>…
</bibliography>
<!ATTLIST
book
price(#IMPLIED)
CDATA #IMPLIED>
book has
an optional
price attribute which contains
character data
7/20/2015
Luke Huan Univ. of Kansas
30
DTD explained (cont’d)
<!ELEMENT
<!ELEMENT
<!ELEMENT
<!ELEMENT
title (#PCDATA)>
PCDATA is text that will be parsed
author (#PCDATA)>
publisher (#PCDATA)> (<…> will be treated as a markup tag
and < etc. will be treated as entities
year (#PCDATA)>
CDATA is unparsed character data
title, author, publisher, and year all
contain parsed character data (#PCDATA)
<!ELEMENT section (title, (#PCDATA)?, section*)>
]>
7/20/2015
Each section starts with a title,
followed by some optional text and then
zero or more subsections
<section><title>Introduction</title>
Luke Huan Univ. of Kansas
In this section we introduce XML and
<section><title>XML</title>
XML stands for…
</section>
<section><title>DTD</title>
<section><title>Definition</title>
DTD stands for…
</section>
<section><title>Usage</title>
You can use DTD to…
</section>
</section>
</section>
31
“Deterministic” content declaration
Catch: the following declaration does not work:
<!ELEMENT pub-venue
( (name, address, month, year) |
(name, volume, number, year) )>
Because when looking at name, the XML processor
would not know which way to go without looking further
ahead
Requirement: content declaration must be
“deterministic” (i.e., no look-ahead required)
7/20/2015
Luke Huan Univ. of Kansas
32
XML versus relational data
Relational data
Schema is always fixed in advance
and difficult to change
Simple, flat table structures
Ordering of rows and columns is
unimportant
Data exchange is problematic
“Native” support in all serious
commercial DBMS
7/20/2015
XML data
Well-formed XML does not require
predefined, fixed schema
Nested structure; ID/IDREF(S)
permit arbitrary graphs
Ordering forced by document
format; may or may not be
important
Designed for easy exchange
Often implemented as an “add-on”
on top of relations
Luke Huan Univ. of Kansas
33
Query languages for XML
XPath
XQuery
Path expressions with conditions
Building block of other standards (XQuery, XSLT, XLink,
XPointer, etc.)
XPath + full-fledged SQL-like query language
XSLT
XPath + transformation templates
7/20/2015
Luke Huan Univ. of Kansas
34