CS 601R: Advanced NLP


This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 679: Advanced NLP
Lecture #1: Introduction to Text Mining
Objectives for Today
1. Quick course info.
2. Overview of Text Mining
3. Discuss your applications of Text Mining
4. Elements of Text Mining
5. Introduce course objectives
Course Info.
• Office Hours:
  • Tue & Thu, 3-4pm (without appointment)
  • OR by appointment
• TA: TBD
• Web page: https://facwiki.cs.byu.edu/cs679
  • Syllabus
  • Regularly updated schedule: due dates, reading assignments, project guidelines, lecture notes
• Google Group “BYU CS 679”
• Email: ringger AT cs DOT byu DOT edu
• Grades: http://gradebook.byu.edu
Assignments
• Readings, with max. one-page reports
  • Mostly research papers (see course web page for all hyperlinks)
  • Usually one reading report per week
• Intro. Projects
  • Presentation
  • Report
• Semester Project
  • Proposal
  • Presentation
  • Report
Course Policies
• Early
• Late
• Grades
• Other
See Syllabus for details
Text Mining
The process of discovering previously unknown information in large text collections
Paraphrased from M. Hearst
Other Definitions
• Looking for patterns in unstructured text (Nahm)
• Text mining applies the same analytical functions of data mining to the domain of textual information (Doore)
“Search” versus “Discover”
                          Structured Data    Unstructured Data (Text)
Search (goal-oriented)    Data Retrieval     Information Retrieval
Discover (opportunistic)  Data Mining        Text Mining
Credit: adapted from slide by Nathan Treloar, AvaQuest
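A minimal sketch of the search/discover contrast on unstructured text, under stated assumptions: the toy corpus and the function names (`build_index`, `search`, `discover_pairs`) are invented for illustration and are not from the lecture. Search starts from a query the user already has in mind; discovery surfaces patterns (here, term pairs that co-occur across documents) without any query.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus (invented for illustration; not from the lecture).
DOCS = [
    "enron email fraud investigation",
    "email spam filter",
    "fraud investigation report",
    "movie review sentiment",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Goal-oriented: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def discover_pairs(docs, min_count=2):
    """Opportunistic: term pairs co-occurring in at least min_count documents."""
    counts = Counter()
    for text in docs:
        for pair in combinations(sorted(set(tokenize(text))), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(search(build_index(DOCS), "fraud investigation"))
    print(discover_pairs(DOCS))
```

The two functions differ mainly in their inputs: `search` needs a query, while `discover_pairs` takes only the collection and a frequency threshold.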
Your Exciting Applications
F2011: Your Exciting Applications
W2011: Exciting Applications
2010: Exciting Applications
2009: Exciting Applications
Additional Applications
• News Mining
• Sentiment Detection
• Summarization
• Trend Analysis
• Association Detection
Course Objectives
• Acquire experience conducting exploratory data analysis on large collections of text
• Gain in-depth experience with and understanding of approaches to
  • document classification
  • sentiment classification
  • feature engineering
  • feature selection
  • document clustering
  • unsupervised topic identification
  • visualization, including document summarization
• Build a foundation of techniques for approximate Bayesian reasoning for unsupervised text analysis
Course Objectives (2)
• Obtain experience with techniques for evaluating and visualizing the results of unsupervised learning processes
• Independent investigation of methods of your choice!
• Application of your methods to learn something important from a significant text corpus of your choice
Simplistic Text Mining Process
Credit: NCSA
Methods
• Feature Engineering
• Feature Selection
• Information Extraction
• Categorization (Supervised)
• Clustering (Unsupervised)
• Topic Identification / Topic Modeling
• Visualization
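To make three of the methods above concrete, here is a small sketch that chains feature engineering, feature selection, and supervised categorization. Everything in it is an assumption chosen for illustration, not the course's prescribed approach: the toy training data, bag-of-words counts as features, a document-frequency cutoff (`max_df`) for selection, and a nearest-centroid classifier with cosine similarity. Clustering, topic modeling, information extraction, and visualization are not shown.

```python
import math
from collections import Counter

# Toy labeled documents (invented for illustration; not from the lecture).
TRAIN = [
    ("the game was fun and the graphics were great", "games"),
    ("great game with fun levels", "games"),
    ("the senate passed the budget bill", "politics"),
    ("the president signed the bill", "politics"),
]


def features(text):
    """Feature engineering: bag-of-words term counts."""
    return Counter(text.lower().split())


def select_vocabulary(docs, max_df=0.5):
    """Feature selection: keep terms found in at most max_df of the documents."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    limit = max_df * len(docs)
    return {term for term, n in df.items() if n <= limit}


def vectorize(text, vocab):
    """Restrict a document's term counts to the selected vocabulary."""
    return {t: c for t, c in features(text).items() if t in vocab}


def centroid(vectors):
    """Average a list of sparse count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: c / len(vectors) for t, c in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs, vocab):
    """Categorization (supervised): one centroid per label."""
    by_label = {}
    for text, label in labeled_docs:
        by_label.setdefault(label, []).append(vectorize(text, vocab))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(text, centroids, vocab):
    """Assign the label whose centroid is most similar to the document."""
    vec = vectorize(text, vocab)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    vocab = select_vocabulary([text for text, _ in TRAIN])
    model = train(TRAIN, vocab)
    print(classify("a fun new game", model, vocab))
```

In this toy setup the document-frequency cutoff removes "the", which appears in three of the four training documents and so carries little signal about the category.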
Some Available Data Sets
• 20 Newsgroups -- Usenet
• Reuters (1990s) newswire
• Del.icio.us bookmarked web pages
• Enron Email
• Movie Reviews
• Gamespot game reviews
• General Conference
• State of the Union
• Campaign Speeches
• …
• Yours!
Assignment
• Reading for next time:
  • Course Syllabus
  • "Tapping the Power of Text Mining" by Fan et al. (CACM 9/2006)
  • "Text-Mining the Voice of the People" by Evangelopoulos et al. (CACM 2/2012)
  • Skim: Alta Plana Text Analytics Report
• Reading Report #1
  • % Completed
  • Questions