Machine learning techniques for detecting topics in research papers
Amy Dai
The Goal
Build a web application that allows users to easily browse and search papers
Project Overview
1. Part I – Data Processing
  • Convert PDF to text
  • Extract information from documents
2. Part II – Discovering topics
  • Index documents
  • Group documents by similarity
  • Learn underlying topics
Part I – Data Processing
How do we extract information from PDF documents?
PDF to Text
• Research papers are distributed as PDFs
• PDFs are effectively images
• The computer sees only colored lines and dots
• The conversion to text loses some of the formatting
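The talk lists pdftotext under Useful Tools; a minimal sketch of driving that conversion from Python (the file names and the -layout flag are illustrative choices, not necessarily the speaker's exact setup):

    import subprocess

    def pdf_to_text(pdf_path, txt_path):
        # -layout asks pdftotext to preserve the physical page layout,
        # which helps the heuristic rules in the next step
        subprocess.run(["pdftotext", "-layout", pdf_path, txt_path], check=True)

    pdf_to_text("paper.pdf", "paper.txt")  # hypothetical file names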
Getting what we need
• Construct heuristic rules to extract each field:
  • Title: the first line
  • Authors: between the title and the abstract
  • Abstract: preceded by “Abstract”
  • Keywords: preceded by “Keywords”
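A sketch of what such heuristic rules might look like with Python regular expressions; the patterns and the extract_fields helper are illustrative assumptions, not the speaker's actual script:

    import re

    def extract_fields(text):
        # Heuristic rules from the slide above; real papers vary in
        # layout, so these patterns are illustrative, not definitive
        lines = text.strip().splitlines()
        title = lines[0].strip()                           # rule: first line
        # rule: authors sit between the title and the word "Abstract"
        authors = " ".join(lines[1:]).split("Abstract")[0].strip()
        m = re.search(r"Abstract\s*(.*?)(?=Keywords|$)", text, re.S)
        abstract = m.group(1).strip() if m else ""         # preceded by "Abstract"
        m = re.search(r"Keywords\s*[:\-]?\s*(.+)", text)
        keywords = m.group(1).strip() if m else ""         # preceded by "Keywords"
        return {"title": title, "authors": authors,
                "abstract": abstract, "keywords": keywords}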
Finding Names
Can we predict names?
• Named Entity Tagger
  • by the Cognitive Computation Group at the University of Illinois Urbana-Champaign
Example – the header of “Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages” by Dennis Fetterly, Mark Manasse, and Marc Najork (all Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA). In the converted text, the three side-by-side author columns interleave line by line, which is what makes rule-based name extraction hard.
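As an illustration, NLTK (used elsewhere in this project) also ships a named-entity chunker; this sketch uses it in place of the Illinois tagger the talk actually used:

    import nltk

    # One-time model downloads (assumption: standard NLTK data packages)
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    def find_person_names(text):
        # Tokenize, POS-tag, then chunk named entities and keep PERSON spans
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
        return [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "PERSON"]

    print(find_person_names("Dennis Fetterly Mark Manasse Marc Najork"))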
Accuracy
• To measure how well my extraction script worked:
• (# correct + # needing minor changes) / total # of documents
• Example:
  • 30 documents were correctly extracted
  • 10 needed minor changes
  • 60 total documents
  • (30 + 10) / 60 = 66.7%
Accuracy and Error
Perfect
Match (%)
Partial
Match (%)
No Match
(%)
Title
78
5
17
Abstract
63
12
35
Keywords
68.75
12.5
18.75
Authors
38
31
31
Part II – Learning Topics
Can we use machine learning to discover underlying topics?
Indexing Documents
• Index documents
• Removing common words leaves better descriptors for clustering
• Compare word frequencies to a reference corpus
  • Brown Corpus: A Standard Corpus of Present-Day Edited American English
  • From the Natural Language Toolkit
• Common word removal reduces the index from 19,100 to 12,400 words
• Documents contain between 100 and 1,700 words after common word removal
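A sketch of this cutoff-based filtering, assuming “common” means appearing more than some cutoff number of times in the Brown corpus; the index_document helper and default cutoff are hypothetical:

    import nltk
    from nltk.corpus import brown

    # nltk.download("brown")  # assumption: Brown corpus fetched via NLTK
    BROWN_FREQ = nltk.FreqDist(w.lower() for w in brown.words())

    def index_document(doc_words, freq_cutoff=10):
        # Drop words whose Brown-corpus frequency exceeds the cutoff;
        # the surviving distinct words form the document's index
        return sorted({w.lower() for w in doc_words
                       if BROWN_FREQ[w.lower()] <= freq_cutoff})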
Effect on Index Size
• Changes in document index size for “Defining quality in web search results”:

Common Word Frequency Cutoff    Index Size
            20                     357
            15                     318
            10                     276
             5                     230
Keeping What’s Important
• Words kept from the abstract of “Defining quality in web search results” at each common word frequency cutoff:

Cutoff  5: querying, google, yahoo, metrics, retrieval
Cutoff 10: web, querying, google, yahoo, evaluating, metrics, retrieval
Cutoff 15: web, querying, google, controversial, yahoo, evaluating, metrics, retrieval
Cutoff 20: web, querying, google, controversial, engines, yahoo, evaluating, metrics, retrieval
Documents as Vectors
• Represent each document as a numerical vector by weighting its words with tf-idf
• Vectors are normalized to unit length
• Vector dimensionality equals the size of the corpus index
• Vectors are mostly sparse
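A minimal sketch of this representation, assuming the standard tf-idf weighting (the talk does not specify the exact variant), with unit-length normalization and sparse dicts:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: list of token lists; returns one sparse dict per document,
        # L2-normalized so every vector has unit length
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vec = {w: count * math.log(n / df[w]) for w, count in tf.items()}
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            vectors.append({w: v / norm for w, v in vec.items()})
        return vectors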
Clustering using Machine Learning
• Cluster documents with unsupervised machine learning algorithms:
  • K-means
  • Group Average Agglomerative (GAA)
• Both compare documents using cosine similarity
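NLTK, listed under Useful Tools, provides both algorithms; a toy sketch assuming its KMeansClusterer and GAAClusterer are the ones used (the vectors below are made-up stand-ins for real tf-idf vectors):

    import numpy
    from nltk.cluster import KMeansClusterer, GAAClusterer, cosine_distance

    vectors = [numpy.array(v) for v in
               [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
                [0.0, 0.1, 1.0], [0.1, 0.0, 0.8]]]

    # K-means with cosine distance, K = 2
    kmeans = KMeansClusterer(2, cosine_distance, repeats=5,
                             avoid_empty_clusters=True)
    print(kmeans.cluster(vectors, assign_clusters=True))

    # Group Average Agglomerative clustering, cut to 2 clusters
    gaac = GAAClusterer(num_clusters=2)
    print(gaac.cluster(vectors, assign_clusters=True))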
Clustering Results

Documents:
  A: SpamRank – Fully Automatic Link Spam Detection
  B: An Approach to Confidence Based Page Ranking for User Oriented Web Search
  C: Spam, Damn Spam, and Statistics
  D: Web Spam, Propaganda and Trust
  E: Detecting Spam Web Pages through Content Analysis
  F: A Survey of Trust and Reputation Systems for Online Service Provision

K-Means                  GAA
  Group 1: A               Group 1: B
  Group 2: B, C, D, E      Group 2: A, C, D, E
  Group 3: F               Group 3: F
Challenges
• K-Means
  • Choosing K
• Group Average Agglomerative
  • Choosing the depth at which to cut the dendrogram
Labeling Clusters
• Compare a term’s frequency within a cluster to its frequency across the whole collection
• A word that is frequent both in the cluster and in the collection isn’t a good discriminative label
• A good label is frequent in the cluster but infrequent in the rest of the collection
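One hypothetical way to formalize this comparison is a “lift” score: relative frequency in the cluster divided by relative frequency in the collection. The scoring choice is an assumption, not the speaker's stated method:

    from collections import Counter

    def label_cluster(cluster_docs, all_docs, top_n=3):
        # Words scoring high are frequent in the cluster but rare overall,
        # exactly the discriminative labels described above
        cluster_tf = Counter(w for doc in cluster_docs for w in doc)
        collection_tf = Counter(w for doc in all_docs for w in doc)
        c_total = sum(cluster_tf.values())
        a_total = sum(collection_tf.values())
        scores = {w: (n / c_total) / (collection_tf[w] / a_total)
                  for w, n in cluster_tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]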
Summary
1. Part I – Data Processing
  • PDF-to-text conversion isn’t perfect, and its imperfections make it difficult to extract text
  • Documents don’t follow one formatting standard, so heuristic rules are needed to extract info
2. Part II – Discovering topics
  • Indexes are large; to keep the important words we need a good reference corpus to compare against
  • There are many clustering algorithms, and each has limitations
  • How do I choose the best label?
Ongoing work
• Use bigrams
  • e.g., keyword phrases such as “Web search”, “adversarial information retrieval”, “web spam”
• Limit the number of topic labels by ranking them
• Use an algorithm that clusters based on probability distributions
  • e.g., the logistic normal distribution
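For the bigram idea, NLTK already provides an n-gram helper; a tiny example:

    import nltk

    tokens = ["adversarial", "information", "retrieval", "web", "spam"]
    print(list(nltk.bigrams(tokens)))
    # [('adversarial', 'information'), ('information', 'retrieval'),
    #  ('retrieval', 'web'), ('web', 'spam')]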
Useful Tools
1. pdftotext – Unix command for converting PDF to text
2. Python libraries
  • Unicode
  • re – regular expressions
3. NLTK – Natural Language Toolkit
  • Software and datasets for natural language processing
  • Used for the clustering algorithms and the reference corpus