Computer Science (cont.)

Computational
Linguistics
WTLAB (Web Technology Laboratory)
Mohsen Kamyar
Computer Science

In recent years the main “data source” has been the
World Wide Web and other sources of text data

Autonomous data generation


We can’t force people to use a specific data format.
People want to present data in as few words as
possible.


We will see structures that are ungrammatical in the
language, or tokens that are not even words.
We will see rapid language change, so we can’t use
static models for language.
Computer Science (cont.)


In this view, measuring the precision of language
processing is based on the frequency of words (in
Linguistics, by contrast, we work with distinct
words).
Some examples of such applications

American governmental programs:



Total Information Awareness (TIA), during 2003
Computer-Assisted Passenger Prescreening System
(CAPPS II), until 2004, which assigned a color to each passenger
Analysis, Dissemination, Visualization, Insight, Semantic
Enhancement (ADVISE), during 2004-2006, as a
component of a program with a $47 million budget.
Computer Science (cont.)


Multistate Anti-Terrorism Information Exchange (MATRIX), until
2005.

And many software vendors (based on 2008 reports):

Angoss Software, Infor CRM Epiphany, Kxen, Portrait
Software, SAS, SPSS, ThinkAnalytics, Unica, Viscovery, …
However, some applications are closer to
Linguistics:




Machine Translation
Human-Computer interaction applications
Text to Speech
Text Simplification
Data Mining

Viewed as “text data” processing, Data Mining has three
main steps:

Pre-processing


Preparing a representation of the data that is suitable for
the next steps.
Data Mining

Determining the relevance of data from the following views:


Classification: arranging the data into predefined groups
Clustering: arranging the data into groups that are not
predefined and must be discovered.
Data Mining (cont.)




Regression: finding an equation that describes the data
model
Association Rule Learning: finding relations between
concepts or main objects in the data model.
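
As a minimal sketch of the regression task, the following fits a straight line y = a*x + b by ordinary least squares. The function name fit_line and the sample points are illustrative assumptions and do not come from the slides.

```python
def fit_line(points):
    """Fit y = a*x + b to (x, y) pairs by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
```

The equation found here is the simplest possible one; real text-mining data would usually need more variables and a noise model.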
Interpreting the results
The common research areas between “Computer Science”
and “Linguistics” in this process are likely
steps 1 and 3 (mainly step 1).

An example makes this concrete.
Search Engine
[Diagram: search engine architecture. Components: Web, Crawler, URL Queue, Web Cache, Indexer, Stemmer, WordNet, Indexes, Ranking]
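
The slides do not say how the Indexer stores its Indexes. A common choice is an inverted index, sketched below under that assumption. The names build_index and search, and the sample pages, are hypothetical.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical crawled pages: URL -> page text.
pages = {
    "a.example/1": "web crawler fetches pages",
    "a.example/2": "indexer builds indexes from pages",
}
index = build_index(pages)
hits = search(index, "pages indexer")
```

A real indexer would run each word through the stemmer before storing it, so that different forms of a word share one entry.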
Search Engine (cont.)


It is the most popular application and the most
important example of data mining in use,
one of the high technologies and … .
In pre-processing, search engines perform the following
tasks, which focus on linguistic
aspects of the data:

Computing the importance factor of a word in a
document


Frequency
TF-IDF (Vector Space Model)
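
A minimal sketch of TF-IDF weighting, assuming the plain variant tf = count / document length and idf = log(N / document frequency). The slides do not state which variant is meant, and the function name tf_idf and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Count, for each word, how many documents contain it.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    "the web is large",
    "the crawler reads the web",
    "the indexer stems words",
]
weights = tf_idf(docs)
```

A word such as "the" that occurs in every document gets weight zero, while a word found in only one document gets the highest idf. That is the sense in which TF-IDF measures importance rather than raw frequency.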
Search Engine (cont.)

Stemming



There are two main categories of approaches:
dictionary-based and non-dictionary-based.
Using the tag of a word in its sentence for stemming
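
The two categories can be sketched together: a lookup table stands in for the dictionary-based approach, and suffix stripping stands in for the non-dictionary-based one. The table entries, the suffix list, and the minimum stem length of three letters are all illustrative assumptions. Production stemmers use far richer rules.

```python
# Hypothetical lookup table for the dictionary-based approach.
STEM_DICTIONARY = {"ran": "run", "mice": "mouse", "indices": "index"}

# Suffixes tried in order by the non-dictionary (rule-based) approach.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Look the word up first, then fall back to suffix stripping."""
    word = word.lower()
    if word in STEM_DICTIONARY:
        return STEM_DICTIONARY[word]
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [stem(w) for w in ["Indexes", "crawling", "ranked", "mice", "web"]]
```

The dictionary handles irregular forms that no suffix rule can reach, while the rules cover words the dictionary has never seen.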
Related words (resources such as WordNet)




Synonyms: Same meaning
Hypernyms and hyponyms: General concepts and
sub-concepts.
Homonyms: Same spelling but different meaning
Acronyms: Abbreviations
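
One way a search engine can use such relations is query expansion. The sketch below uses a small hand-written table in place of a real resource such as WordNet. The RELATED table, its entries, and the function name expand_query are hypothetical.

```python
# Hypothetical hand-written table; a real system would consult a
# lexical resource such as WordNet instead.
RELATED = {
    "car": {"synonyms": {"automobile"}, "hypernyms": {"vehicle"}},
    "quick": {"synonyms": {"fast"}, "hypernyms": set()},
}

def expand_query(words, relations=("synonyms",)):
    """Return the query words plus their related words of the given kinds."""
    expanded = set(words)
    for word in words:
        entry = RELATED.get(word, {})
        for relation in relations:
            expanded |= entry.get(relation, set())
    return expanded

terms = expand_query(["car", "rental"], relations=("synonyms", "hypernyms"))
```

Homonyms are the hard case for this scheme: a word with one spelling and several meanings needs its context resolved before it can be expanded safely.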
Semantic Search Engine

In a “Semantic Search Engine” the main
differences are as follows:

Indexing is not based on words, but on “Ontology”

Ontology Extraction


Latent Semantic Indexing
Ranking is not based on “Web Links”, but on
“Similarity Between Pages”.
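
The slides do not define how similarity between pages is measured. As a sketch under that caveat, the following ranks pages by the cosine similarity of their word-count vectors. A semantic engine would replace the word counts with ontology-based or latent-semantic vectors. All names and sample pages are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(reference, pages):
    """Sort page names from most to least similar to the reference text."""
    return sorted(
        pages,
        key=lambda name: cosine_similarity(reference, pages[name]),
        reverse=True,
    )

# Hypothetical pages: name -> page text.
pages = {
    "p1": "semantic search engine ranking",
    "p2": "cooking recipes for dinner",
}
ranking = rank_by_similarity("semantic search", pages)
```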
