Text Mining - ESP Conference and Summer School



Languages and Information Technologies
Prof. Milena Stanković
University of Niš,
Faculty of Electronic Engineering
ESP 2015, 22.05.2015.
Outline
Internet as a global data resource
• Searching
• Text mining
Web 2.0 (Blogs, Wikis, RSS)
New information technologies in learning and teaching languages
• Online courses
• Social networking
• Learning through gaming
• Mobile applications
• Augmented reality
• Semantics in learning applications
Conclusions
Internet as a global network and data repository
• More than 190 countries are linked into exchanges of data, news and opinions.
• The estimated number of Internet users worldwide is 3,000,608,300, which is
nearly 40 percent of the world's population.
• The total number of websites with a unique hostname online has exceeded 1 billion.
(Internet Live Stats, December 30, 2014, http://www.internetlivestats.com)
Web search-indexing
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
Searching
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
Unstructured Data on the Internet
Unstructured data (or unstructured information) refers to
information that either does not have a pre-defined data model or
is not organized in a pre-defined manner.
Techniques such as data mining, text mining, natural language processing,
web mining, text analytics and multimedia data mining provide different
methods to find patterns in, or otherwise interpret, this information.
Text Mining?
• Text mining is the discovery by computer of new, previously
unknown information, by automatically extracting information
from different written resources.
• Text mining is different from web search.
In search, the user is typically looking for something that is already known
and has been written by someone else.
• In text mining, the goal is to discover unknown information,
something that is not directly visible.
Text mining workflow
Word level representation of the
documents
• The most common representation of text used for many
techniques.
• Word frequencies in texts follow a power-law distribution:
• …a small number of very frequent words
• …a large number of low-frequency words.
• Relations among word surface forms and their senses:
• Homonymy: same form, but different meaning
(e.g. bank: river bank, financial institution)
• Polysemy: same form, related meaning
(e.g. bank: blood bank, financial institution)
• Synonymy: different forms, same meaning
(e.g. singer, vocalist)
• Hyponymy: one word denotes a subclass of another
(e.g. breakfast, meal).
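The skewed shape of word frequencies noted above shows up even on a very small text. The following is a minimal Python sketch, assuming a simple lowercase letter-only tokenizer; the sample sentence and the function name `word_frequencies` are invented for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Toy text, invented for this illustration.
sample = (
    "the bank of the river and the bank in the city "
    "are not the same bank"
)
freqs = word_frequencies(sample)

# Print words from most to least frequent: a few words dominate,
# while most words occur only once.
for word, count in freqs.most_common():
    print(f"{word:>6} {count}")
```

On a real corpus the same pattern is far more pronounced: a handful of function words account for a large share of all tokens.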
Stop-words
Stop-words are words that, from a non-linguistic point of view, do not carry
information
• they mainly have a functional role
• we usually remove them to help the methods perform
better.
Stop words are language dependent.
• English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN,
AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ...
• Dutch: de, en, van, ik, te, dat, die, in, een, hij, het, niet, zijn, is,
was, op, aan, met, als, voor, had, er, maar, om, hem, dan, zou,
of, wat, mijn, men, dit, zo, ...
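Stop-word removal can be sketched in a few lines of Python. The list below contains only the English examples given above, so it is deliberately incomplete; the function name and the sample sentence are assumptions made for illustration.

```python
# A tiny English stop-word list built from the examples above;
# real lists are much longer and language specific.
STOP_WORDS = {
    "a", "about", "above", "across", "after", "again",
    "against", "all", "almost", "alone", "along", "already",
}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in the stop-word list."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words("All about a walk along the river")
print(filtered)
```

Note that "the" survives here only because the sample list is partial; a complete English list would remove it as well.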
Stemming
• Stemming is a process of transforming a word into its stem
(normalized form).
• Different forms of the same word are usually problematic
for text data analysis, because they have different spelling
and similar meaning (e.g. consign, consigned, consigning,
consignment,…)
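To illustrate the idea on the example words above, here is a deliberately naive suffix-stripping sketch in Python. The suffix list and the minimum stem length are assumptions chosen for this example; this is not the Porter algorithm or any other published stemmer.

```python
# Suffixes tried in order (longest first); an assumption made for
# this sketch, not a rule set from a published stemmer.
SUFFIXES = ("ment", "ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["consign", "consigned", "consigning", "consignment"]
stems = {w: naive_stem(w) for w in words}
print(stems)
```

All four forms map to the same stem, so a later analysis step would count them as one term.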
Some rules in Porter stemmer
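One rule group from the published Porter algorithm is step 1a, which handles plural endings (SSES → SS, IES → I, SS → SS, S → removed). A minimal Python sketch of that single step follows; the function name is an assumption for illustration, and the full stemmer applies several further steps with additional conditions.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer (plural endings) to one word."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> removed
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried from the longest suffix to the shortest, so "caress" is protected by the SS rule before the plain S rule can apply.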
Taxonomies/thesaurus level
• The main function of a thesaurus is to connect different surface
word forms with the same meaning into one sense
(synonyms)
• additionally, we often use the hypernym relation to relate
general-to-specific word senses
By using synonyms and the hypernym relation we compact the
representation of documents.
WordNet is the most commonly used general thesaurus; versions
exist for many other languages (e.g. EuroWordNet)
http://www.illc.uva.nl/EuroWordNet/
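The compaction described above can be sketched with a hand-made mini-thesaurus. The two dictionaries below are hypothetical stand-ins for a resource such as WordNet (they do not use its API); their entries reuse the singer/vocalist and breakfast/meal examples from earlier.

```python
# Hypothetical mini-thesaurus: surface form -> sense label.
SYNONYMS = {
    "singer": "singer",
    "vocalist": "singer",
    "breakfast": "breakfast",
    "lunch": "lunch",
}
# Hypothetical hypernym links: specific sense -> more general sense.
HYPERNYMS = {"breakfast": "meal", "lunch": "meal"}

def normalize(tokens, generalize=False):
    """Map tokens to senses; optionally replace senses by their hypernym."""
    result = []
    for token in tokens:
        sense = SYNONYMS.get(token, token)
        if generalize:
            sense = HYPERNYMS.get(sense, sense)
        result.append(sense)
    return result

print(normalize(["vocalist", "breakfast"]))
print(normalize(["vocalist", "lunch"], generalize=True))
```

With both mappings applied, documents that mention "breakfast" and "lunch" share the feature "meal", which reduces the number of distinct terms.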
WordNet relations
Phrases level
Google n-gram corpus
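An n-gram is a sequence of n consecutive tokens. The sketch below shows, in Python, how n-grams could be extracted and counted from a tokenized text; the function name and the sample sentence are assumptions for illustration and have nothing to do with how the Google corpus itself was built.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text mining finds new information".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Counting n-grams over a large corpus gives phrase-level statistics.
bigram_counts = Counter(bigrams)
print(bigrams)
```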
Part-of-speech examples
Stanford Part-Of-Speech Tagger (POS tagger)
http://nlp.stanford.edu/software/tagger.shtml
Vector Space Model
The most common way to deal with documents is first to transform
them into sparse numeric vectors and then to process them with linear
algebra operations
• by doing this, we forget everything about the linguistic structure
within the text
• this is sometimes called the “structural curse”, although forgetting
about the structure does not harm the efficiency of solving
many relevant problems
This representation is also referred to as
“Bag-Of-Words” or “Vector-Space-Model”
Typical tasks on vector-space-model are classification, clustering,
visualization, etc.
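A bag-of-words transformation can be sketched as follows in Python. The three toy documents and the function names are assumptions for illustration; dense lists are used for readability, whereas a real system would store only the non-zero entries.

```python
from collections import Counter

def build_vocabulary(docs):
    """Collect the sorted set of all words in the collection."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

docs = [
    "text mining finds patterns",
    "web mining finds links",
    "text text text",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Word order is lost: any permutation of the words of a document produces the same vector.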
Vector space document
representation – example
Example document and its vector
representation
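Term frequency–inverse document frequency (tf–idf)
tfidf(t, D) = tf(t, D) · idf(t, D)
idf(t, D) = log(N / D_t)
where N is the total number of documents and D_t is the number of
documents that contain the term t.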
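The two formulas can be computed directly. The Python sketch below assumes raw counts for the term frequency and the natural logarithm for the idf; both are common choices but are assumptions here, as are the function name and the toy documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # D_t: number of documents that contain term t.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append(
            {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        )
    return weights

docs = [d.split() for d in ["text mining", "web mining", "text search text"]]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the vector, which is the intended effect for very common words.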
Supervised Learning-Classification
Given: a set of documents labeled with content categories
The goal: build a model that automatically assigns the right
content categories to new unlabeled documents.
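No particular classifier is named here; as one possible illustration, the sketch below uses a multinomial Naive Bayes model with add-one smoothing, written in plain Python. The training documents, category names and function names are all invented for this example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-category word counts from (tokens, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, category in labeled_docs:
        word_counts[category].update(tokens)
        doc_counts[category] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return the category with the highest log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for t in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][t] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = category, score
    return best

training = [
    ("goal match team".split(), "sport"),
    ("team wins match".split(), "sport"),
    ("election vote party".split(), "politics"),
    ("party wins election".split(), "politics"),
]
model = train(training)
prediction = classify("team match".split(), model)
print(prediction)
```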
Unsupervised Learning-Clustering
• Clustering is a process of finding natural groups in the data in an
unsupervised way (no class labels are pre-assigned to documents)
• The key element is the similarity measure.
• Cosine similarity is the most widely used.
• Most popular clustering methods are:
• K-Means clustering (flat, hierarchical)
• Agglomerative hierarchical clustering
• EM (Gaussian Mixture)
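Cosine similarity and a flat K-Means loop can be sketched together in Python. This is a simplified variant that assigns each vector to the most cosine-similar centroid and uses caller-supplied starting centroids; the toy vectors, the `seeds` parameter and the function names are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def k_means(vectors, seeds, max_iter=10):
    """Flat K-Means using cosine similarity; `seeds` are initial centroids."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: pick the most similar centroid for each vector.
        new_labels = [
            max(range(len(centroids)),
                key=lambda k: cosine_similarity(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = k_means(vectors, seeds=[vectors[0], vectors[2]])
print(labels)
```

In practice the starting centroids are usually chosen at random and the algorithm is run several times, because the result depends on the initialization.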
Web 2.0
• Users are active: they are no longer
limited to consuming
information. Instead, they are
producers of information.
• Sites use tools that make
it easy to publish on the
web.
• Social Networking
• Collective intelligence
• Multimedia publishing has
exploded
Useful Web 2.0 Tools
• Weblogs
• Wikis
• Forums
• Really Simple Syndication (RSS)
• Aggregators
• Social Bookmarking and Networking
• Online Photo Galleries
• Audio/video-casting
Moodle - LMS
Moodle is a free and open-source software learning management
system distributed under the GNU General Public License.
Developed on pedagogical principles, Moodle is used for blended
learning, distance education, flipped classroom, and
other eLearning projects in schools, universities, workplaces, and
other sectors.
Language Courses on the Internet
• Duolingo - at the moment it offers Spanish, English (for Spanish
speakers), French, German, Portuguese and Italian; more languages
are in beta and on the way.
• The Omniglot intro to languages gives a good first overview of many
languages and follows it up with links to courses and other tools for
each language.
• BBC Languages has a mini-introduction to almost 40 different
languages.
• About.com has some interesting articles, courses, and word lists for
English as a second language, French, German, Italian, Japanese,
Mandarin, and Spanish.
• Internet Polyglot has courses and help for memorizing words
in many languages.
Duolingo
Essential benefits of learning a foreign
language through online courses
• Multimedia
• Repetition
• Autonomy
• Accessibility
• New Learning Methods
Skype in the classroom
About this Skype lesson
“I teach English in a primary school in Hungary. We are looking for a nice
class or group for regular meetings. Children in the partner group
should be the teachers and they may teach my students English.
We may plan any language games to play, or any basic topic to talk about, or
any grammar to practice. I think practicing the language with the same
age group may motivate my students to learn the language better.
We would be so glad to find a partner class or group.”
Skype in the Primary school Čegar in Niš
Skype in the classroom.
Vocabulary learning
• Memrise is one of the most versatile sites providing pre-made
mnemonics for vocabulary in a wide range of languages. The collection is
always expanding, since the system is open to people adding their
own public vocabulary lists and suggestions.
Native content in the language
• TuneIn - lets you listen to live-streamed radio from all over the world!
Language learning forums
Fluent in 3 months forum - the forum on this site is one of the
most active language learning forums online, with 20,000
members.
How to learn any language forum
If you are a foreign language enthusiast, a
polyglot or just want to learn a new
language on your own, you will find here:
• How to choose a new language to learn
• A detailed, hands-on guide to teaching
yourself a foreign language.
• Reviews of books about language
learning
• The questions about language
learning people ask most frequently.
Get it pronounced/corrected by a native
speaker
• Forvo is a great site if you come across a
new word and would really like to hear
how it’s pronounced by a native speaker.
It has a huge database covering many
languages that you can search and get an
instant answer.
• Rhinospike is the better choice when you want to hear how an
entire sentence or even a couple of short
paragraphs are pronounced by a native
speaker.
• Lang-8 is a site where you can write text
in a particular language, and pretty soon
have natives look over it and give you
great feedback.
Multilingual dictionaries
• WordReference is one of the sites to search for the meaning of words
in French, Spanish, Italian and Portuguese.
• Bab.la is another dictionary, covering 24 languages.
• Google Translate – it will mess things up a lot, but it is a
completely free automatic translation tool.
• ProZ term search, the Interactive Terminology for
Europe and MyMemory – specialized dictionaries for
finding technical terminology that is less likely to appear in
general dictionaries.
Social networking
In the past few years, a series of language learning social
networks have popped up, and they make learning more fun,
efficient, interactive and interesting than usual.
Through these language education social networks,
students can now study a language in an enjoyable environment
by meeting and interacting with native speakers from
around the world.
Livemocha
Interactive games
Mobile applications
• Babbel Mobile for Android
• Duolingo
• Rosetta Course
Augmented reality
Word Lens
Word Lens is an augmented reality application that recognizes printed
words using the device's camera and optical character recognition, and
instantly translates these words into the desired language.
It does not require a connection to the Internet.
Adaptation by usage of semantic rules
https://elearning4109.wikispaces.com/The+Semantic+Web+%26+Ontologies+in+e-learning
DSi framework
Conclusions
Instead of a conclusion, I would like to say that this
conference will be a good opportunity to exchange
experiences and ideas about the possibilities of using
contemporary information technologies in learning and
teaching languages.
We will also have an opportunity to discuss ideas for
new joint projects based on the use of information
technologies and languages.