Transcript Data Mining - Mount Holyoke College
Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes Judgment File for Testing Data is Ready to Download http://www.mtholyoke.edu/courses/xli/341/ project/test04-complete-rel.txt 2 What’s Uue on Monday (May 7th) Your output files for the 25 testing queries – by 12:00am Monday (two electronic files via email) 15-minute PowerPoint presentation – In class A written (typed) report – In class Codes (a hard copy of the programs you wrote for this project) – In class 3 http://www.mtholyoke.edu/courses/xli/341/project/ sameple_output.txt 76 Q0 NYT19980722.0197-19 1 24.5831 1 76 Q0 NYT19991019.0443-14 2 23.4941 1 76 Q0 APW20000405.0059-15 3 21.891 1 76 Q0 APW20000405.0092-17 4 21.891 0 76 Q0 APW20000406.0056-26 5 21.891 0 76 Q0 NYT19980722.0197-22 6 21.7889 0 ... 100 Q0 APW20000115.0009-6 12 19.3636 1 100 Q0 APW20000114.0238-6 13 19.3636 1 100 Q0 APW20000114.0263-6 14 19.3636 1 100 Q0 NYT20000421.0179-10 15 19.2492 1 100 Q0 APW20000406.0056-6 16 18.963 0 4 A Written Report on the Final Project Problem Statement Problem Analysis Techniques/Approaches (what, why and how) Implementation Experimental Results and Analysis Lessons Learned 5 Good Luck! 6 7 Final project Goal: Apply available data mining techniques to solve real world problem. Requirement: – Apply two techniques/algorithm and implement at least one algorithm. (Find existing codes online or team up with your classmates) Problem: Relevant sentence retrieval – Retrieve the set of relevant sentences given a query and a collection of documents 8 An Example Query: <title> India and Pakistan Nuclear Tests <desc>Description: » On May 11 and 13, 1998 India conducted five nuclear tests; Pakistan responded by detonating six nuclear tests on May 28 and 30th. This nuclear testing was condemned by the international community. <narr>Narrative: » Relevant documents mention the nuclear testing conducted in May 1998 by both India and Pakistan. Historical information about the antagonism and rivalry between the two countries is not relevant. Mention of the furor created around the world by these detonations is relevant. 9 An Example Document <DOCNO>XIE19980529.0045-1</DOCNO> <TEXT> XIE19980529.0045 </TEXT> <DOCNO>XIE19980529.0045-2</DOCNO> <TEXT> 1998-05-29 </TEXT> <DOCNO>XIE19980529.0045-3</DOCNO> <TEXT> France, Canada Condemn Pakistani Nuclear Tests </TEXT> <DOCNO>XIE19980529.0045-4</DOCNO> <TEXT> NEW YORK, May 28 (Xinhua) -- More countries have come out to condemn Pakistan's nuclear tests. </TEXT> 10 An Example Document <DOCNO>XIE19980529.0045-5</DOCNO> <TEXT> The French Foreign Ministry issued a communique Thursday to deplore and condemn the nuclear tests conducted by Pakistan on the same day. </TEXT> <DOCNO>XIE19980529.0045-6</DOCNO> <TEXT> France calls on both India and Pakistan not to conduct any more nuclear tests but to sign the Comprehensive Test Ban Treaty and join talks on the banning of production of fissile materials that can be used to produce nuclear arms, said the communique. </TEXT> 11 An Example Document <DOCNO>XIE19980529.0045-7</DOCNO> <TEXT> Canadian Foreign Affairs Minister Lloyd Axworthy said in a statement released Thursday, "We continue to urge Pakistan and India to renounce their nuclear weapons programs and to sign the Nuclear Non-Proliferation Treaty and the Comprehensive Test Ban Treaty." </TEXT> <DOCNO>XIE19980529.0045-8</DOCNO> <TEXT> He also announced a series of sanctions against Pakistan, which he said are consistent with those imposed on India after its nuclear tests. </TEXT> 12 Training and Testing Training set – 25 queries – Collections of documents – Relevance judgment of each sentence Testing set – Another 25 queries – Collections of documents 13 Questions to Be Considered: How do you represent queries and sentences? What features may be considered for the task? What data mining techniques can be used and how? How do you evaluate the performance of your system? 14 Evaluation: Precision and Recall The F measure – F = 2*Precision*Recall/(Precision + Recall) Average Precision for query – Calculate by averaging precision as recall increases. Mean Average Precision 15 Data Mining Techniques Classification Clustering Association Rules 16 Fuzzy c-means clustering Each point has a degree of belonging to clusters, rather than belonging completely to just one cluster For each point x we have a coefficient giving the degree of being in the kth cluster uk(x) 17 Fuzzy c-means clustering The centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: The degree of belonging is related to the inverse of the distance to the cluster The coefficients are normalized with a real parameter m > 1 so that their sum is 1. 18 Fuzzy c-means clustering Choose a number of clusters. Assign randomly to each point coefficients for being in the clusters. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) : – Compute the centroid for each cluster, using the formula above. – For each point, compute its coefficients of being in the clusters, using the formula above. 19 Next Class (Wednesday April 11th) A 10-minute presentation on your project design, including – (1) Problem statement(s) – (2) Your technical approaches (I & II) – (3) A work plan – (4) Anticipated results – (5) Evaluation metrics 20