Transcript: Recommender Systems
Introduction to Text Mining
Zou Quan (邹权)
Ph.D., Assistant Professor
Outline
Introduction
TF-IDF
Similarity
Introduction
Why?
Text mining ≈ Web mining
How?
Classification or Clustering
Retrieval
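General Text Classification Process
Preprocessing
Represent the document collection in a form that is easy for a computer to process
Feature representation, selection, and dimensionality reduction
Express the importance of each term in a document using a suitable weighting method
Model learning
Build the classifier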
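Text Classification Preprocessing
Remove punctuation, extra whitespace, and (optionally) digits
Normalize case
Remove stop words
Words that carry no real meaning on their own, such as "and", "you", "have"
Stemming (reduce words to a common root)
PorterStemmer
Word segmentation (tokenization)
English? Chinese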
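The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list `STOP_WORDS` and the function name `preprocess` are assumptions made for this example, stemming is left out, and splitting on whitespace only works for languages such as English (Chinese text needs a dedicated word segmenter).

```python
import re

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"and", "you", "have", "the", "a", "of", "to"}

def preprocess(text, remove_digits=True):
    """Lowercase, strip punctuation (optionally digits), split, drop stop words."""
    text = text.lower()                                  # normalize case
    pattern = r"[^a-z\s]" if remove_digits else r"[^a-z0-9\s]"
    text = re.sub(pattern, " ", text)                    # remove punctuation / digits
    tokens = text.split()                                # also collapses extra whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # remove stop words

print(preprocess("You have 2 new messages, and the  OFFER expires!"))
# ['new', 'messages', 'offer', 'expires']
```

A stemmer such as PorterStemmer would be applied to each surviving token as a final step.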
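Feature Representation
Vector space model
Terms serve as features, forming a high-dimensional feature vector
Weights are obtained with TF-IDF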
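As a rough illustration of the vector space model, the sketch below turns tokenized documents into raw count vectors over a shared vocabulary. The function name `build_vectors` and the sample documents are made up for this example; in practice the raw counts would be replaced by TF-IDF weights.

```python
def build_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one count vector per document)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)          # one dimension per term in the vocabulary
        for term in doc:
            vec[index[term]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = build_vectors([["text", "mining", "text"], ["web", "mining"]])
print(vocab)    # ['mining', 'text', 'web']
print(vectors)  # [[1, 2, 0], [1, 0, 1]]
```

With a realistic vocabulary the vectors have thousands of dimensions and are mostly zeros, which is why feature selection and dimensionality reduction follow this step.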
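TF-IDF
TF (Term Frequency)
Term frequency: how often term i occurs in document j, normalized by the most frequent term in that document
$TF_{ij} = f_{ij} / \max_k f_{kj}$, where $f_{ij}$ is the number of occurrences of term i in document j
IDF (Inverse Document Frequency)
Inverse document frequency
$IDF_i = \log_2 (N / n_i)$, where N is the total number of documents and $n_i$ is the number of documents containing term i
TF-IDF value: $TF_{ij} \times IDF_i$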
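The two formulas can be checked with a small Python sketch. It assumes TF is the term count divided by the largest term count in the same document and IDF is log2(N / n_i); the function name `tf_idf` and the four toy documents are invented for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["text", "mining", "text"],
    ["web", "mining"],
    ["text", "retrieval"],
    ["web", "search"],
]
for w in tf_idf(docs):
    print(w)
```

In the first document, "text" has TF = 2/2 = 1 and IDF = log2(4/2) = 1, giving a weight of 1.0, while "mining" has TF = 1/2 and the same IDF, giving 0.5. A term that occurs in every document gets IDF = 0 and therefore weight 0.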
Similarity Applications
Many Web-mining problems can be expressed as finding “similar” sets:
Plagiarism, mirror pages, articles from the same source, duplicate removal
Collaborative Filtering as a Similar-Sets Problem
Recommend to users items that were liked by other users who have exhibited similar tastes
Measurement
Edit distance
Short text, words
For personal text
Jaccard distance
Long text, ignoring the word similarity
For government text
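Both measures are easy to state in code. The sketch below is one possible implementation, with edit distance taken to mean Levenshtein distance (insertions, deletions, substitutions) and Jaccard distance taken as one minus the ratio of intersection size to union size; the function names are chosen for this example.

```python
def edit_distance(s, t):
    """Levenshtein distance: fewest single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))          # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of items in a and b."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(edit_distance("Argyrios Zymnis", "Argyris Zymnis"))   # 1
print(edit_distance("Kenneth De Jong", "Kenneth Dejong"))   # 2
print(jaccard_distance({"text", "mining", "web"}, {"web", "mining", "search"}))  # 0.5
```

Edit distance works character by character, which suits short strings such as names; Jaccard distance only looks at which tokens are shared, which suits long documents where word order and small spelling differences matter less.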
Real-world Data is Rather Dirty!
Microsoft Academic Search
Kenneth De Jong: http://academic.research.microsoft.com/Author/2037349.aspx
vs.
Kenneth Dejong: http://academic.research.microsoft.com/Author/3054641.aspx
2016/4/10
Trie-Join @ VLDB2010
Real-world Data is Rather Dirty!
DBLP Complete Search
Typo in “author”
Argyrios Zymnis
Argyris Zymnis
relaxed
related
Similarity Joins
The similarity join is an essential operation for
data integration and cleaning
Table R
Id        Name              Univ.
2037349   Kenneth De Jong   George …
…         …                 …
3054641   Kenneth Dejong    George …
…         …                 …
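A similarity self-join over such a table can be sketched as a brute-force scan that compares every pair of names and keeps those within an edit-distance threshold. This is only a naive baseline for illustration, not the Trie-Join or Pass-Join algorithm; the threshold `tau`, the function names, and the ids 1 and 2 for the two "Zymnis" rows are hypothetical.

```python
from itertools import combinations

def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def similarity_self_join(records, tau):
    """Return id pairs whose names are within edit distance tau (O(n^2) comparisons)."""
    return [(id1, id2)
            for (id1, name1), (id2, name2) in combinations(records, 2)
            if edit_distance(name1, name2) <= tau]

records = [
    (2037349, "Kenneth De Jong"),
    (3054641, "Kenneth Dejong"),
    (1, "Argyrios Zymnis"),   # hypothetical id, name from the DBLP example
    (2, "Argyris Zymnis"),    # hypothetical id, name from the DBLP example
]
print(similarity_self_join(records, tau=2))
# [(2037349, 3054641), (1, 2)]
```

The quadratic number of comparisons is what makes the naive approach impractical on large tables, and is the cost that dedicated similarity-join algorithms aim to avoid.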
Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to the
former world No. 1 - the man who owns the record of
14 Grand Slams he wants.
03/11/2008 | 11:28 AM
By JAY COHEN, AP Sports Writer
Mar 11, 4:23 am EDT
Similarity Join
Tokenize:
Each record is a set of tokens from a finite universe.
Suppose each record is a single text document
• x = “yes as soon as possible”
• y = “as soon as possible please”
word    yes   as   soon   as1   possible   please
token   A     B    C      D     E          F
• x = {A, B, C, D, E}
• y = {B, C, D, E, F}
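The tokenization above can be reproduced with a short Python sketch. It assumes that a repeated word is made distinct by appending its repeat count (the second "as" becomes "as1"), which matches the word row of the table; the function name `tokenize` is chosen for this example, and the tokens are kept as words rather than mapped to the letters A to F.

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace; the k-th repeat of a word (k >= 1) becomes word + str(k)."""
    seen = Counter()
    tokens = set()
    for word in text.split():
        k = seen[word]                       # how many times this word appeared before
        tokens.add(word if k == 0 else f"{word}{k}")
        seen[word] += 1
    return tokens

x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
print(sorted(x))   # ['as', 'as1', 'possible', 'soon', 'yes']
print(sorted(y))   # ['as', 'as1', 'please', 'possible', 'soon']

# Jaccard similarity of the two token sets: 4 shared tokens out of 6 distinct ones.
print(len(x & y) / len(x | y))
```

Numbering repeats lets a record be treated as a plain set without losing the fact that "as" occurs twice, so set-based measures such as Jaccard can be applied directly.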
References
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. Pass-Join: A Partition based Method for Similarity Joins. VLDB 2012.