Career Opportunities for Linguists in Industry - Linguistics

Career Opportunities for Linguists in
Industry
Vita Markman
Computational Linguistics Engineer
DIMG
Linguists in Industry??!!!
Overview
I. What can a linguist do in industry?
II. Some concrete examples of what linguists work on “in the
real world”
III. From school to work: what skills/knowledge one needs
IV. Useful References
V. Conclusion and Q&A
But Why?
Because:
● Having options is good
● You may want to take a year off before going to grad
school, yet work on something relevant
● It's really fun
● Research jobs are NOT confined to the academic
setting
● Not all industry jobs are like the ones in 'Office Space'
Explosion of Text Data
● Explosion of linguistic data online
● Social media (Twitter, Facebook, the blogosphere)
● Linguistic data is not easily amenable to analysis. It
requires much processing and much insight into the
nature and structure of language
● Modifying old NLP techniques and tools as well as
inventing new ones: for example, no off-the-shelf
parser can handle Twitter conversations.
A linguist in the industry?
Examples:
● Information retrieval (esp. search engines that do semantic
search)
● Voice recognition/generation – think 'Google Voice'!
● Text classification and text clustering
● Text mining – finding grains of useful info in unstructured
text
● E-discovery
● Analyzing the language of social media (topic and
sentiment extraction from short, fragmented, and noisy data)
Chat filtering: an example
● Computational linguistics application for online virtual worlds
● Ensuring safety of on-line chat environments
● Filtering chat for appropriate content
● Filtering is an example of a classification
problem: classify text as ‘appropriate’ or
‘inappropriate’
Chat filtering: an example
● The problem of determining whether lines involve
inappropriate content is similar to spam detection
● Simple word/phrase matches are not enough.
● Matching alone is both too aggressive and not aggressive enough: some
nefarious lines may still pass through
● Most inappropriate talk is made up of completely
innocent words!
Chat filtering
● People use innocent words to make inappropriate phrases
● People also find ways to say things with MixEdCAse or b,r,o.k.en
w.o.r,ds that get around the filter
● We want: a general way of saying “if you see MixED CaSE or br.ok.en
word,s it is probably a sign of something bad”
● We also want: to capture inappropriate combinations of innocent
words
● A solution: a model, much like the ones used for spam
● What is the general idea?
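A minimal sketch of the “MixED CaSE or br.ok.en word,s” cue in Python. The function names `evasion_features` and `normalize` and the two regular expressions are illustrative assumptions, not the filter that was actually deployed, and the separators are limited to the '.' and ',' that appear in the examples above.

```python
import re

def evasion_features(line):
    """Return surface cues that a chat line may be dodging a word filter."""
    return {
        # A lower-case letter directly followed by an upper-case one,
        # as in "MixEdCAse"; ordinary capitalization is not flagged.
        "mixed_case": bool(re.search(r"[a-z][A-Z]", line)),
        # A '.' or ',' squeezed between two letters, as in "b,r,o.k.en".
        "broken_word": bool(re.search(r"[A-Za-z][.,][A-Za-z]", line)),
    }

def normalize(line):
    """Undo both tricks so later stages see plain lower-case words."""
    rejoined = re.sub(r"(?<=[A-Za-z])[.,](?=[A-Za-z])", "", line)
    return rejoined.lower()

print(evasion_features("b,r,o.k.en w.o.r,ds"))
print(normalize("b,r,o.k.en w.o.r,ds"))
```

These cues over-trigger on harmless text such as "iPhone" or "ask.com", so they are better treated as extra evidence for a model than as a verdict on their own.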
The idea behind filtering
● Look at the words and other features that make up
appropriate and inappropriate chat, and ask: how likely is a
word to appear in inappropriate chat?
● Example: If a pair of words such as “stew pit” never
appears in regular, appropriate chat, it is probably an
indicator of something inappropriate.
● Having done that, we can ask of any new chat segment: is it
likely to be inappropriate?
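One common way to turn this idea into code is a Naive Bayes classifier, the same family of model often used for spam. The sketch below is an assumption about how such a filter could look, not the production system: the class `NaiveBayesFilter`, the labels "ok" and "bad", and the toy training lines are all made up for illustration. Word pairs are counted as features next to single words so that a pair such as “stew pit” can carry its own evidence; mixing both in one model is a simplification.

```python
import math
from collections import Counter

def features(line):
    """Single words plus adjacent word pairs from one chat line."""
    words = line.lower().split()
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return words + pairs

class NaiveBayesFilter:
    LABELS = ("ok", "bad")

    def __init__(self):
        self.feature_counts = {label: Counter() for label in self.LABELS}
        self.line_counts = Counter()

    def train(self, line, label):
        """Record one labeled chat line."""
        self.line_counts[label] += 1
        self.feature_counts[label].update(features(line))

    def log_score(self, line, label):
        """Log prior plus log likelihood of the line's features under a label."""
        counts = self.feature_counts[label]
        total = sum(counts.values())
        vocab = len(set().union(*self.feature_counts.values()))
        score = math.log(self.line_counts[label] / sum(self.line_counts.values()))
        for feature in features(line):
            # Add-one smoothing: unseen features get a small, non-zero probability.
            score += math.log((counts[feature] + 1) / (total + vocab))
        return score

    def classify(self, line):
        return max(self.LABELS, key=lambda label: self.log_score(line, label))

# Toy, hand-made training data (hypothetical).
chat_filter = NaiveBayesFilter()
for line in ["i like beef stew", "the pit stop was fun", "lets play a game"]:
    chat_filter.train(line, "ok")
for line in ["you are a stew pit", "what a stew pit"]:
    chat_filter.train(line, "bad")

print(chat_filter.classify("stew pit"))
print(chat_filter.classify("i like stew"))
```

Both "stew" and "pit" appear in the appropriate training lines, yet the pair "stew pit" appears only in the inappropriate ones, which is what tips the score.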
That said…
● Filtering is an example of text classification, which is used very
broadly
● Classifying documents by content/topic requires a
labeled set of data and a learning algorithm
● The difficult part is not the algorithm itself, but
fine-tuning its parameters and manipulating and
preprocessing the data
● This is where creative and analytical thinking becomes
truly crucial and the job becomes really fun!
Another example: Twitter
● Clustering super-short Twitter posts by topic
● Clustering = finding groups in unlabeled data
● This data is very noisy and fragmented:
lets see im on lates on monday so dont start till two and could get down from work
.......mail me xxx
7/16/2009 11:50:26 AM
Well i'm watching mate running at silly o'clock but i'll be free in the late afternoon
7/17/2009 8:18:12 AM
● Resistant to parsing and part-of-speech tagging
● Goal: arrive at K clusters, each representing a topic
of the posts!
● Problem: very little to go on!
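To make the goal concrete, here is a small sketch of grouping posts into K clusters with k-means over word-count vectors and cosine similarity. The slides do not say which clustering algorithm was used, so k-means, the seeding rule, the helper names (`vectorize`, `cosine`, `kmeans`), and the example posts are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def vectorize(post):
    """Bag of words: map each lower-cased word to its count in the post."""
    return Counter(post.lower().split())

def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def kmeans(posts, k, rounds=10):
    """Return one cluster label (0..k-1) per post."""
    vectors = [vectorize(p) for p in posts]
    # Deterministic seeding: start from the first post, then repeatedly add
    # the post that is least similar to every centroid chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(rounds):
        # Assignment step: each post joins its most similar centroid.
        labels = [max(range(k), key=lambda i: cosine(v, centroids[i]))
                  for v in vectors]
        # Update step: the summed counts of a cluster point in the same
        # direction as its mean, which is all cosine similarity needs.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = sum(members, Counter())
    return labels

posts = ["football match score", "match score goal",
         "pizza pasta dinner", "pasta dinner recipe"]
print(kmeans(posts, k=2))
```

Two posts with no words in common get a similarity of exactly zero here, which is the “very little to go on” problem in miniature: with posts this short, most pairs look unrelated.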
Twitter
● Possible solutions: padding – adding more relevant
words to each post
● Applying spellcheckers to reduce the noise in the data
● Removing various content-less function words, known
as stop-words.
● Since even posts that share a common topic often do
not share many words, look at the
mutual contexts in which the words in posts appear
● This technique is known as Latent Semantic Analysis
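Stop-word removal and Latent Semantic Analysis can be sketched in dependency-free Python. LSA is usually computed as a truncated singular value decomposition of a term-by-post count matrix; the sketch below obtains the same post vectors from the top eigenvectors of the post-by-post matrix using power iteration, which is adequate for toy data but not how a real system would do it. The stop-word list, the function names, and the example posts are assumptions made for illustration, and raw counts are used where a real system would likely apply a weighting such as tf-idf.

```python
import math

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "on", "the", "to"}

def tokenize(post):
    """Lower-case a post and drop stop-words."""
    return [w for w in post.lower().split() if w not in STOP_WORDS]

def gram_matrix(posts):
    """Post-by-post matrix of shared word counts (A^T A for term-post matrix A)."""
    docs = [tokenize(p) for p in posts]
    vocab = sorted({w for d in docs for w in d})
    columns = [[d.count(w) for w in vocab] for d in docs]  # one vector per post
    return [[sum(x * y for x, y in zip(a, b)) for b in columns] for a in columns]

def top_eigenpairs(matrix, k, iterations=500):
    """Largest k eigenpairs of a symmetric matrix by power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n  # deterministic unit start vector
        for _step in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        value = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((value, v))
        for i in range(n):  # deflation: subtract the component just found
            for j in range(n):
                m[i][j] -= value * v[i] * v[j]
    return pairs

def lsa_vectors(posts, k):
    """Represent each post by k latent coordinates (singular value times V entry)."""
    pairs = top_eigenpairs(gram_matrix(posts), k)
    return [[math.sqrt(max(value, 0.0)) * vec[j] for value, vec in pairs]
            for j in range(len(posts))]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

posts = ["the football match", "a goal in the match", "the goal and the score",
         "pizza for dinner", "pasta for dinner"]
vectors = lsa_vectors(posts, k=2)
print(cosine(vectors[0], vectors[2]))  # posts 0 and 2 share no words
print(cosine(vectors[0], vectors[3]))
```

The first and third posts have no word in common, but each shares a word with the second post, so their latent vectors end up pointing the same way, while the food posts stay separate.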
From school to workplace
What skills are needed for a future (relevant) career
in computational linguistics?
– Statistics: basic understanding of sampling methods and
probabilistic reasoning
– Some linear algebra, esp. as it relates to matrix and vector
manipulation
– Some calculus
– Machine learning algorithms that are commonly used in
NLP, such as Naïve Bayes, HMMs, and Expectation
Maximization
– Some programming (Python, Java, C++)
From school to workplace cont'd
How can these skills be learned?
● Formal instruction
● Self-teaching
● Taking a class on-line
Companies that employ linguists
● Google (Bay Area, Los Angeles)
● Yahoo (Bay Area, Los Angeles)
● IBM
● Microsoft (Seattle, WA)
● Smaller search engines and semantic search engines:
– Ask.com
– Autonomy (Bay Area)
– H5 (San Francisco)
– Cognition (Los Angeles)
● Companies that do Machine Translation (Systran,
Language Weaver)
● Entertainment Companies (Los Angeles)
Learning while working
● Internships: Companies hire students to work and
learn
● This can be an invaluable experience!
● It can really supplement classroom learning via the
actual application of the learned material
● Sometimes academic knowledge and practical
application do not go hand-in-hand…
Resources
● Association for Computational Linguistics (Conference in Portland,
OR in June 2011)
● KDnuggets – a website for data mining and knowledge discovery,
contains useful links to tutorials, news, and jobs
● Dice.com – job website for technical jobs only
● NLTK.org – natural language toolkit in Python. Can be used to try
'do it yourself' document classification and clustering.
● NLP Group at the Information Sciences Institute at USC, located in Marina
del Rey – great talks to get an overview of what’s going on in the field
Books and articles:
● Jurafsky and Martin 2009 – Speech and Language Processing, the bible of computational linguistics
● Manning, Raghavan, and Schütze 2008 – Introduction to Information Retrieval
● Mitchell 1997 – Machine Learning
● WEKA – free data mining software and book (Witten and Frank
2005)
Conclusion
● As a linguist, you can do lots of interesting work
in a non-academic setting
● You must supplement your knowledge of
linguistics with some math and computer science
knowledge, and you are good to go!
● Bottom line: all you really need is to be open to
learning new things, which is exactly why we all
go to school in the first place!
Thank you!