Machine learning techniques for detecting topics in research papers
Amy Dai
The Goal
Build a web application that allows users to easily
browse and search papers
Project Overview
Part I – Data Processing
• Convert PDF to text
• Extract information from documents
Part II – Discovering Topics
• Index documents
• Group documents by similarity
• Learn underlying topics
Part I – Data Processing
How do we extract information from PDF
documents?
PDF to Text
• Research papers are distributed as PDFs
• To the computer, a PDF is essentially an image: colored lines and dots
• The conversion process loses some of the formatting
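A minimal sketch of the conversion step, shelling out to the pdftotext command listed later under Useful Tools (the file names here are placeholders):

    import subprocess

    def pdf_to_text(pdf_path, txt_path):
        # pdftotext emits a plain-text rendering of the PDF; column
        # layout, fonts, and other formatting are only partly preserved.
        subprocess.run(["pdftotext", pdf_path, txt_path], check=True)

    pdf_to_text("paper.pdf", "paper.txt")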
Getting what we need
Construct heuristic rules to extract info (a sketch follows the list):
• Title – the first line
• Authors – between the title and the abstract
• Abstract – preceded by “Abstract”
• Keywords – preceded by “Keywords”
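A hedged sketch of what such rules might look like in Python; the talk's exact script isn't shown, so these regular expressions are illustrative assumptions:

    import re

    def extract_fields(text):
        # Title: the first non-blank line of the converted text.
        lines = [l.strip() for l in text.splitlines() if l.strip()]
        title = lines[0] if lines else ""
        # Authors: text between the title and the word "Abstract".
        authors = re.search(re.escape(title) + r"\s*(.+?)\s*Abstract", text, re.S)
        # Abstract: text preceded by "Abstract", up to "Keywords".
        abstract = re.search(r"Abstract[:.]?\s*(.+?)\s*Keywords", text, re.S | re.I)
        # Keywords: text preceded by "Keywords".
        keywords = re.search(r"Keywords[:.]?\s*(.+)", text, re.I)
        return {
            "title": title,
            "authors": authors.group(1).strip() if authors else "",
            "abstract": abstract.group(1).strip() if abstract else "",
            "keywords": keywords.group(1).strip() if keywords else "",
        }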
Finding Names
Can we predict names?
Named Entity Tagger
by the Cognitive Computation Group at the University of Illinois Urbana-Champaign
Example paper header:
Spam, Damn Spam, and Statistics
Using statistical analysis to locate spam web pages
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA
[email protected]
[email protected]
[email protected]
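The talk uses the UIUC tagger above; as an illustration of the same idea with a tool already in this project's toolchain, here is NLTK's built-in named entity chunker (a different tagger, swapped in for the sketch):

    import nltk

    # One-time downloads for the tokenizer, POS tagger, and NE chunker.
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg, quiet=True)

    def find_person_names(text):
        # Tokenize, POS-tag, chunk, then keep the PERSON subtrees.
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
        return [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "PERSON"]

    print(find_person_names("Dennis Fetterly, Mark Manasse and Marc Najork"))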
Accuracy
To measure how well the extraction script worked:

Accuracy = (# correct + # needing minor changes) / total # of documents

Example: of 60 total documents, 30 were correctly extracted and 10 needed minor changes:
(30 + 10) / 60 = 66.7%
Accuracy and Error

Field      Perfect Match (%)   Partial Match (%)   No Match (%)
Title             78                  5                 17
Abstract          63                 12                 35
Keywords       68.75               12.5              18.75
Authors           38                 31                 31
Part II – Learning Topics
Can we use machine learning to discover
underlying topics?
Indexing Documents
• Index documents
• Remove common words, leaving better descriptors for clustering (see the sketch after this list)
• Compare against a reference corpus: the Brown Corpus (A Standard Corpus of Present-Day Edited American English), from the Natural Language Toolkit
• Common word removal reduces the index from 19,100 to 12,400 words
• Documents contain between 100 and 1,700 words after common word removal
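A minimal sketch of the common word removal, assuming a word counts as "common" when its Brown Corpus frequency exceeds a cutoff (the exact criterion isn't stated in the talk; the cutoffs of 5–20 on the next slides suggest this kind of count threshold):

    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)

    # Words whose count in the Brown Corpus exceeds the cutoff are
    # treated as common and dropped from the index.
    CUTOFF = 10
    brown_freq = nltk.FreqDist(w.lower() for w in brown.words())
    common_words = {w for w, count in brown_freq.items() if count > CUTOFF}

    def index_terms(tokens):
        # Keep lowercase alphabetic tokens that are not common words.
        return [t.lower() for t in tokens
                if t.isalpha() and t.lower() not in common_words]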
Effect on Index Size
Changes in document index size for “Defining quality in web search results”:

Common Word Frequency Cutoff   Index Size
            20                    357
            15                    318
            10                    276
             5                    230
Keeping What’s Important
Words kept from the abstract of “Defining quality in web search results” at each common word frequency cutoff:

Cutoff 5:  querying, google, yahoo, metrics, retrieval
Cutoff 10: web, querying, google, yahoo, evaluating, metrics, retrieval
Cutoff 15: web, querying, google, controversial, yahoo, evaluating, metrics, retrieval
Cutoff 20: web, querying, google, controversial, engines, yahoo, evaluating, metrics, retrieval
Documents as Vectors
• Represent documents as numerical vectors by transforming words to numbers using tf-idf
• Vectors are normalized to unit length
• Vector dimension equals the size of the corpus index
• Vectors are mostly sparse
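A self-contained tf-idf sketch under the assumptions above; this is standard tf-idf with unit-length normalization, since the talk doesn't specify the exact weighting scheme:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # `docs` is a list of token lists from the indexing step.
        vocab = sorted({t for doc in docs for t in doc})
        # idf: log of (number of documents / documents containing the term).
        df = Counter(t for doc in docs for t in set(doc))
        idf = {t: math.log(len(docs) / df[t]) for t in vocab}
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vec = [tf[t] * idf[t] for t in vocab]    # mostly zeros (sparse)
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])  # normalize to unit length
        return vocab, vectors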
Clustering using Machine Learning
Use unsupervised machine learning algorithms to cluster documents:
• K-means
• Group Average Agglomerative (GAA)
Similarity between document vectors is measured with cosine similarity.
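Since NLTK is listed under Useful Tools as the source of the clustering algorithms, here is a sketch with its K-means and GAA clusterers; k=3 is chosen here only to match the three groups in the results below, and choosing it is itself one of the challenges noted later:

    import numpy
    from nltk.cluster import KMeansClusterer, GAAClusterer, cosine_distance

    def cluster_documents(vectors, k=3):
        vecs = [numpy.array(v) for v in vectors]
        # K-means with cosine distance; k must be fixed up front.
        kmeans = KMeansClusterer(k, distance=cosine_distance, repeats=10)
        km_labels = kmeans.cluster(vecs, assign_clusters=True)
        # GAA builds a dendrogram bottom-up, then cuts it to k clusters.
        gaa = GAAClusterer(num_clusters=k)
        gaa_labels = gaa.cluster(vecs, assign_clusters=True)
        return km_labels, gaa_labels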
Clustering Results

Documents:
A: SpamRank – Fully Automatic Link Spam Detection
B: An Approach to Confidence Based Page Ranking for User Oriented Web Search
C: Spam, Damn Spam, and Statistics
D: Web Spam, Propaganda and Trust
E: Detecting Spam Web Pages through Content Analysis
F: A Survey of Trust and Reputation Systems for Online Service Provision

          K-Means    GAA
Group 1   A          B
Group 2   B,C,D,E    A,C,D,E
Group 3   F          F
Challenges
• K-Means: finding K
• Group Average Agglomerative: choosing the depth at which to cut the dendrogram
Labeling Clusters
Compare term frequency within a cluster to term frequency in the whole collection:
• A word that is frequent both within the cluster and in the collection isn’t a good discriminative label
• A good label is frequent within the cluster but infrequent in the collection
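A sketch of that comparison; scoring terms by the ratio of cluster frequency to collection frequency is an assumption, since the talk only says to compare the two:

    from collections import Counter

    def label_cluster(cluster_docs, all_docs, n_labels=3):
        # Term counts inside the cluster vs. across the full collection.
        cluster_tf = Counter(t for doc in cluster_docs for t in doc)
        collection_tf = Counter(t for doc in all_docs for t in doc)
        # High score = frequent in the cluster, infrequent overall.
        scores = {t: cluster_tf[t] / collection_tf[t] for t in cluster_tf}
        return sorted(scores, key=scores.get, reverse=True)[:n_labels]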
Summary
Part I – Data Processing
• PDF to text conversion isn’t perfect, and the imperfections make it difficult to extract text
• Documents don’t follow one formatting standard, so heuristic rules are needed to extract info
Part II – Discovering topics
• Indexes are large; to keep what’s important, we need a good reference corpus to compare against
• There are many clustering algorithms, and each has limitations
• How do I choose the best label?
Ongoing work
• Use bigrams, since keywords are often multi-word phrases (e.g., “Web search, adversarial information retrieval, web spam”); see the sketch below
• Limit the number of topic labels by ranking
• Use an algorithm that clusters based on probabilistic distributions, such as the logistic normal distribution
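For the bigram item, a quick illustration with nltk.bigrams, which would let multi-word keywords like “web spam” survive as single index terms:

    import nltk

    # Adjacent word pairs become candidate index terms.
    tokens = ["adversarial", "information", "retrieval", "web", "spam"]
    print(list(nltk.bigrams(tokens)))
    # [('adversarial', 'information'), ('information', 'retrieval'),
    #  ('retrieval', 'web'), ('web', 'spam')]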
Useful Tools
1. pdftotext – Unix command for converting PDF to text
2. Python libraries
• Unicode handling
• re – regular expressions
3. NLTK – Natural Language Toolkit
• Software and datasets for natural language processing
• Used for the clustering algorithms and the reference corpus