CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Review

Classification




Training data set
Testing data set
Classification Models
Model evaluation
7/20/2015
Luke Huan Univ. of Kansas
2
Metrics for Performance Evaluation…
PREDICTED CLASS
Class=Yes
ACTUAL
CLASS

Class=No
Class=Yes
a
(TP)
b
(FN)
Class=No
c
(FP)
d
(TN)
Most widely-used metric:
ad
TP  TN
Accuracy 

a  b  c  d TP  TN  FP  FN
7/20/2015
Luke Huan Univ. of Kansas
3
Cost-Sensitive Measures
PREDICTED CLASS
Class=Y Class=N
es
o
ACTUA Class=Y
L
es
CLASS
Class=N
o
7/20/2015
a
(TP)
b
(FN)
c
(FP)
d
(TN)
a
Precision (p) 
ac
a
Recall (r) 
ab
2rp
2a
F - measure (F) 

r  p 2a  b  c
Luke Huan Univ. of Kansas
4
Today’s Topic


Clustering
XML
7/20/2015
Luke Huan Univ. of Kansas
5
What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
Inter-cluster
distances are
maximized
Intra-cluster
distances are
minimized
7/20/2015
Luke Huan Univ. of Kansas
6
Applications of Cluster Analysis

Understanding
 Group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks
with similar price fluctuations

Summarization
 Reduce the size of large data sets
7/20/2015
Luke Huan Univ. of Kansas
7
Multidisciplinary Efforts of Clustering

Pattern Recognition

Spatial Data Analysis

Create thematic maps in GIS by clustering feature spaces

Detect spatial clusters or for other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW


Document classification

Cluster Weblog data to discover groups of similar access patterns
Bioinfo:

Phylogenetic tree

Microarray analysis
7/20/2015
Luke Huan Univ. of Kansas
8
What is not Cluster Analysis?

Supervised classification


Simple segmentation


Dividing students into different registration groups
alphabetically, by last name
Results of a query


Have class label information
Groupings are a result of an external specification
Graph partitioning

7/20/2015
Some mutual relevance and synergy, but areas are not identical
Luke Huan Univ. of Kansas
9
Terms in Cluster Analysis


Cluster: a collection of data objects

Similar to one another within the same cluster

Dissimilar to the objects in other clusters
Cluster analysis

Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

So what?

Clustering can be used as a stand-alone tool to get insight into data
distribution

Clustering can be used as a preprocessing step for other algorithms
such as discretization
7/20/2015
Luke Huan Univ. of Kansas
10
Types of Clusters



Well-separated clusters
Center-based clusters
Contiguous clusters
7/20/2015
Luke Huan Univ. of Kansas
11
Types of Clusters: Well-Separated

Well-Separated Clusters:

A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than to
any point not in the cluster.
3 well-separated clusters
7/20/2015
Luke Huan Univ. of Kansas
12
Types of Clusters: Center-Based

Center-based


A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a
cluster, than to the center of any other cluster
The center of a cluster is often a centroid, the average
of all the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
7/20/2015
Luke Huan Univ. of Kansas
13
Major Clustering Approaches (I)

Partitioning approach:

Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors


Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:

Create a hierarchical decomposition of the set of data (or objects)
using some criterion

7/20/2015
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Luke Huan Univ. of Kansas
14
Partitional Clustering
Original Points
7/20/2015
A Partitional Clustering
Luke Huan Univ. of Kansas
15
Hierarchical Clustering
p1
p3
p4
p2
p1 p2
Traditional Hierarchical Clustering
p3 p4
Traditional Dendrogram
p1
p3
p4
p2
p1 p2
Non-traditional Hierarchical Clustering
7/20/2015
p3 p4
Non-traditional Dendrogram
Luke Huan Univ. of Kansas
16
Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters, s.t., min sum of squared distance
km1tm iKm (Cm  tmi )2

Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center
of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
7/20/2015
Luke Huan Univ. of Kansas
17
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

Partition objects into k nonempty subsets

Compute seed points as the centroids of the clusters of the
current partition (the centroid is the mean of the cluster)

Assign each object to the cluster with the nearest seed point

Go back to Step 2, stop when no more new assignment
7/20/2015
Luke Huan Univ. of Kansas
18
The K-Means Clustering Method
10
9
8
7
6
5
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
4
3
2
1
0
0
10
10
9
9
8
8
7
7
6
6
5
4
3
2
1
0
0
7/20/2015
Update
the
cluster
means
1
2
3
4
5
6
7
8
9
10
Luke Huan Univ. of Kansas
Update
the
cluster
means
1
2
3
4
5
6
7
8
9
10
5
4
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
19
Comments on the K-Means Method

Strength


Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum.

7/20/2015
The global optimum may be found using techniques such
as genetic algorithms
Luke Huan Univ. of Kansas
20
Comments on the K-Means Method

Weakness

Applicable only when mean is defined, then what about
categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes
7/20/2015
Luke Huan Univ. of Kansas
21
A Problem K-means: Differing Density
K-means (3 Clusters)
Original Points
7/20/2015
Luke Huan Univ. of Kansas
22
A Problem of K-means:
Non-globular Shapes
Original Points
7/20/2015
K-means (2 Clusters)
Luke Huan Univ. of Kansas
23
From HTML to XML (eXtensible Markup Language)

HTML describes presentation of content
<h1>Bibliography</h1>
<p><i>Foundations of Databases</i>
Abiteboul, Hull, and Vianu
<br>Addison Wesley, 1995
<p>…

XML describes only the content
<bibliography>
<book>
<title>Foundations of Databases</title>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publisher>
<year>1995</year>
</book>
<book>…</book>
</bibliography>

Separation of content from presentation simplifies content extraction and
allows the same content to be presented easily in different looks
7/20/2015
Luke Huan Univ. of Kansas
24
Other nice features of XML

Portability: Just like HTML, you can ship XML data
across platforms


Flexibility: You can represent any information
(structured, semi-structured, documents, …)


Relational data requires heavy-weight protocols, e.g.,
JDBC
Relational data is best suited for structured data
Extensibility: Since data describes itself, you can change
the schema easily

Relational schema is rigid and difficult to change
7/20/2015
Luke Huan Univ. of Kansas
25
XML terminology




<bibliography>
<book ISBN=”ISBN-10” price=”80.00”>
<title>Foundations of Databases</t
<is_textbook/>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publish
<year>1995</year>
</book>…
</bibliography>
Tag names: book, title, …
Start tags: <book>, <title>, …
End tags: </book>, </title>, …
An element is enclosed by a pair of start and end tags:
<book>…</book>


Elements can be nested:
<book>…<title>…</title>…</book>
Empty elements: <is_textbook></is_textbook>


Can be abbreviated: <is_textbook/>
Elements can also have attributes: <book ISBN=”…”
price=”80.00”>
7/20/2015
Luke Huan Univ. of Kansas
26
Well-formed XML documents
A well-formed XML document
 Follows XML lexical conventions


Wrong: <section>We show that x < 0…</section>
Right: <section>We show that x &lt; 0…</section>



Other special entities: > becomes &gt; and & becomes &amp;
Contains a single root element
Has tags that are properly matched and elements that are properly
nested

Right:
<section>…<subsection>…</subsection>…</sectio
n>

Wrong:
<section>…<subsection>…</section>…</subsectio
n>
7/20/2015
Luke Huan Univ. of Kansas
27
More XML features

Comments: <!-- Comments here -->
CDATA: <![CDATA[Tags: <book>,…]]>

ID’s and references

<person id=”o12”><name>Homer</name>…</person>
<person id=”o34”><name>Marge</name>…</person>
<person id=”o56” father=”o12”
mother=”o34”><name>Bart</name>…</person>…

Namespaces allow external schemas and qualified names
<book xmlns:myCitationStyle=”http://…/mySchema”>
<myCitationStyle:title>…</myCitationStyle:title>
<myCitationStyle:author>…</myCitationStyle:author>…
</book>

Processing instructions for apps: <? …java applet… ?>

And more…
7/20/2015
Luke Huan Univ. of Kansas
28
Valid XML documents

A valid XML document conforms to a Document Type Definition
(DTD)


A DTD specifies



A DTD is optional
A grammar for the document
Constraints on structures and values of elements, attributes, etc.
Example
<!DOCTYPE bibliography [
<!ELEMENT bibliography (book+)>
<!ELEMENT book (title, author*, publisher?, year?, section*)>
<!ATTLIST book ISBN CDATA #REQUIRED>
<!ATTLIST book price CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT section (title, (#PCDATA)?, section*)>
]>
7/20/2015
Luke Huan Univ. of Kansas
29
DTD explained
<!DOCTYPE bibliography [
bibliography is the root element of the document
<!ELEMENT bibliography (book+)>
One or more
bibliography consists of a sequence of one or more book elements
<!ELEMENT book (title, author*, publisher?, year?,
Zero or one
section*)>
Zero or more
book consists of a title, zero or more authors,
an optional publisher, and zero or more sections, in sequence
<!ATTLIST
book
ISBNISBN
ID #REQUIRED>
book has
a required
attribute which is a unique identifier
<bibliography>
<book ISBN=”ISBN-10” price=”80.00”>
<title>Foundations of Databases</
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publis
<year>1995</year>
</book>…
</bibliography>
<!ATTLIST
book
price(#IMPLIED)
CDATA #IMPLIED>
book has
an optional
price attribute which contains
character data
7/20/2015
Luke Huan Univ. of Kansas
30
DTD explained (cont’d)
<!ELEMENT
<!ELEMENT
<!ELEMENT
<!ELEMENT
title (#PCDATA)>
PCDATA is text that will be parsed
author (#PCDATA)>
publisher (#PCDATA)> (<…> will be treated as a markup tag
and &lt; etc. will be treated as entities
year (#PCDATA)>
CDATA is unparsed character data
title, author, publisher, and year all
contain parsed character data (#PCDATA)
<!ELEMENT section (title, (#PCDATA)?, section*)>
]>
7/20/2015
Each section starts with a title,
followed by some optional text and then
zero or more subsections
<section><title>Introduction</title>
Luke Huan Univ. of Kansas
In this section we introduce XML and
<section><title>XML</title>
XML stands for…
</section>
<section><title>DTD</title>
<section><title>Definition</title>
DTD stands for…
</section>
<section><title>Usage</title>
You can use DTD to…
</section>
</section>
</section>
31
“Deterministic” content declaration

Catch: the following declaration does not work:



<!ELEMENT pub-venue
( (name, address, month, year) |
(name, volume, number, year) )>
Because when looking at name, the XML processor
would not know which way to go without looking further
ahead
Requirement: content declaration must be
“deterministic” (i.e., no look-ahead required)
7/20/2015
Luke Huan Univ. of Kansas
32
XML versus relational data
Relational data
 Schema is always fixed in advance
and difficult to change

Simple, flat table structures

Ordering of rows and columns is
unimportant

Data exchange is problematic
“Native” support in all serious
commercial DBMS

7/20/2015
XML data
 Well-formed XML does not require
predefined, fixed schema
 Nested structure; ID/IDREF(S)
permit arbitrary graphs
 Ordering forced by document
format; may or may not be
important
 Designed for easy exchange
 Often implemented as an “add-on”
on top of relations
Luke Huan Univ. of Kansas
33
Query languages for XML

XPath



XQuery


Path expressions with conditions
Building block of other standards (XQuery, XSLT, XLink,
XPointer, etc.)
XPath + full-fledged SQL-like query language
XSLT

XPath + transformation templates
7/20/2015
Luke Huan Univ. of Kansas
34