Data Mining - Mount Holyoke College

Download Report

Transcript Data Mining - Mount Holyoke College

Data Mining
CS 341, Spring 2007
Final Project: presentation &
report & codes
Judgment File for Testing Data is
Ready to Download
http://www.mtholyoke.edu/courses/xli/341/
project/test04-complete-rel.txt
2
What’s Uue on Monday (May 7th)

Your output files for the 25 testing queries
– by 12:00am Monday (two electronic files via email)

15-minute PowerPoint presentation
– In class

A written (typed) report
– In class

Codes (a hard copy of the programs you
wrote for this project)
– In class
3
http://www.mtholyoke.edu/courses/xli/341/project/
sameple_output.txt
76 Q0 NYT19980722.0197-19 1 24.5831 1
76 Q0 NYT19991019.0443-14 2 23.4941 1
76 Q0 APW20000405.0059-15 3 21.891 1
76 Q0 APW20000405.0092-17 4 21.891 0
76 Q0 APW20000406.0056-26 5 21.891 0
76 Q0 NYT19980722.0197-22 6 21.7889 0
...
100 Q0 APW20000115.0009-6 12 19.3636 1
100 Q0 APW20000114.0238-6 13 19.3636 1
100 Q0 APW20000114.0263-6 14 19.3636 1
100 Q0 NYT20000421.0179-10 15 19.2492 1
100 Q0 APW20000406.0056-6 16 18.963 0
4
A Written Report on the Final
Project
Problem Statement
 Problem Analysis
 Techniques/Approaches (what, why and
how)
 Implementation
 Experimental Results and Analysis
 Lessons Learned

5
Good Luck!
6
7
Final project

Goal: Apply available data mining techniques to
solve real world problem.

Requirement:
– Apply two techniques/algorithm and implement at least
one algorithm. (Find existing codes online or team up
with your classmates)

Problem: Relevant sentence retrieval
– Retrieve the set of relevant sentences given a query
and a collection of documents
8
An Example Query:

<title> India and Pakistan Nuclear Tests

<desc>Description:
» On May 11 and 13, 1998 India conducted five nuclear
tests; Pakistan responded by detonating six nuclear
tests on May 28 and 30th. This nuclear testing was
condemned by the international community.

<narr>Narrative:
» Relevant documents mention the nuclear testing
conducted in May 1998 by both India and Pakistan.
Historical information about the antagonism and rivalry
between the two countries is not relevant. Mention of
the furor created around the world by these
detonations is relevant.
9
An Example Document
















<DOCNO>XIE19980529.0045-1</DOCNO>
<TEXT>
XIE19980529.0045
</TEXT>
<DOCNO>XIE19980529.0045-2</DOCNO>
<TEXT>
1998-05-29
</TEXT>
<DOCNO>XIE19980529.0045-3</DOCNO>
<TEXT>
France, Canada Condemn Pakistani Nuclear Tests
</TEXT>
<DOCNO>XIE19980529.0045-4</DOCNO>
<TEXT>
NEW YORK, May 28 (Xinhua) -- More countries have come out to
condemn Pakistan's nuclear tests.
</TEXT>
10
An Example Document








<DOCNO>XIE19980529.0045-5</DOCNO>
<TEXT>
The French Foreign Ministry issued a communique
Thursday to deplore and condemn the nuclear tests
conducted by Pakistan on the same day.
</TEXT>
<DOCNO>XIE19980529.0045-6</DOCNO>
<TEXT>
France calls on both India and Pakistan not to conduct any
more nuclear tests but to sign the Comprehensive Test Ban
Treaty and join talks on the banning of production of fissile
materials that can be used to produce nuclear arms, said
the communique.
</TEXT>
11
An Example Document








<DOCNO>XIE19980529.0045-7</DOCNO>
<TEXT>
Canadian Foreign Affairs Minister Lloyd Axworthy said in a
statement released Thursday, "We continue to urge
Pakistan and India to renounce their nuclear weapons
programs and to sign the Nuclear Non-Proliferation Treaty
and the Comprehensive Test Ban Treaty."
</TEXT>
<DOCNO>XIE19980529.0045-8</DOCNO>
<TEXT>
He also announced a series of sanctions against Pakistan,
which he said are consistent with those imposed on India
after its nuclear tests.
</TEXT>
12
Training and Testing

Training set
– 25 queries
– Collections of documents
– Relevance judgment of each sentence

Testing set
– Another 25 queries
– Collections of documents
13
Questions to Be Considered:

How do you represent queries and
sentences?

What features may be considered for the
task?

What data mining techniques can be used
and how?

How do you evaluate the performance of your
system?
14
Evaluation:
Precision and Recall
 The F measure

– F = 2*Precision*Recall/(Precision + Recall)

Average Precision for query
– Calculate by averaging precision as recall
increases.

Mean Average Precision
15
Data Mining Techniques
Classification
 Clustering
 Association Rules

16
Fuzzy c-means clustering

Each point has a degree of belonging to
clusters, rather than belonging completely
to just one cluster

For each point x we have a coefficient
giving the degree of being in the kth cluster
uk(x)
17
Fuzzy c-means clustering

The centroid of a cluster is the mean of all points, weighted
by their degree of belonging to the cluster:

The degree of belonging is related to the inverse of the
distance to the cluster

The coefficients are normalized with a real parameter m >
1 so that their sum is 1.
18
Fuzzy c-means clustering



Choose a number of clusters.
Assign randomly to each point coefficients for
being in the clusters.
Repeat until the algorithm has converged
(that is, the coefficients' change between two
iterations is no more than ε, the given
sensitivity threshold) :
– Compute the centroid for each cluster, using the
formula above.
– For each point, compute its coefficients of being in
the clusters, using the formula above.
19
Next Class (Wednesday April 11th)

A 10-minute presentation on your
project design, including
– (1) Problem statement(s)
– (2) Your technical approaches (I & II)
– (3) A work plan
– (4) Anticipated results
– (5) Evaluation metrics
20