Applications of Text Mining Ewan Klein School of Informatics & NeSC

Applications of Text Mining
Ewan Klein
School of Informatics & NeSC
Text Mining
Goals
Extract useful information from large bodies of
unstructured or semi-structured documents
Looks for patterns in natural language text
Driven by application needs
Three Areas:
Adding Metadata

E.g., identify Dublin Core elements from document headers
Information Extraction

Identify nuggets of text data and marshal them into a fixed format
Assisting Curation
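The first area, adding metadata, can be illustrated with a minimal sketch. It assumes a hypothetical plain-text document whose header consists of "Field: value" lines ending at the first blank line; the field names and the mapping to Dublin Core elements (HEADER_TO_DC) are assumptions for illustration, not taken from the slides.

```python
import re

# Hypothetical mapping from header field names to Dublin Core elements.
HEADER_TO_DC = {
    "title": "dc:title",
    "author": "dc:creator",
    "date": "dc:date",
    "keywords": "dc:subject",
}

# Matches lines of the form "Field: value".
HEADER_LINE = re.compile(r"^([A-Za-z]+):\s*(.+?)\s*$")


def extract_dublin_core(document):
    """Read 'Field: value' lines up to the first blank line and map them to DC elements."""
    metadata = {}
    for line in document.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        match = HEADER_LINE.match(line)
        if match is None:
            continue
        field, value = match.group(1).lower(), match.group(2)
        element = HEADER_TO_DC.get(field)
        if element is not None:
            metadata.setdefault(element, []).append(value)
    return metadata


# Invented sample document; the last line sits in the body and is ignored.
sample = """Title: Applications of Text Mining
Author: Ewan Klein
Keywords: text mining

Body text starts here.
Author: not a header"""

print(extract_dublin_core(sample))
```

Real document headers are rarely this regular, which is where the pattern-finding techniques in these slides would replace the fixed regular expression.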
Text Mining and Curation
Example workflow:
Make an observation
Search the research literature for knowledge
Incorporate relevant information into database
Challenges:
Current Information Retrieval (IR) techniques often too imprecise

E.g., "Which enzymes act as catalysts in the glycolysis pathway?"

We want to identify a relation between two entities
Move to augmenting IR with more knowledge of text structure
Mostly supervised machine learning techniques
Still need training data for each domain
Need to integrate text mining into Grid applications
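The gap between keyword retrieval and relation identification can be sketched as follows. This is a toy contrast only: the sentences are invented, and the hand-written pattern (CATALYSIS) stands in for the supervised machine learning techniques the slides refer to. All names are hypothetical.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    enzyme: str
    relation: str
    process: str


# Toy sentences, invented for illustration.
SENTENCES = [
    "Hexokinase catalyzes the phosphorylation of glucose in glycolysis.",
    "Glycolysis and hexokinase are covered in most biochemistry textbooks.",
    "Phosphofructokinase catalyzes the phosphorylation of fructose 6-phosphate.",
]

# Hand-written pattern: a capitalized name, the verb "catalyzes",
# then everything up to the sentence-final period.
CATALYSIS = re.compile(r"\b(?P<enzyme>[A-Z][a-z]+) catalyzes (?P<process>[^.]+)\.")


def keyword_search(sentences, keyword):
    """IR-style matching: keep every sentence that mentions the keyword."""
    return [s for s in sentences if keyword.lower() in s.lower()]


def extract_relations(sentences):
    """Relation-style matching: return (enzyme, relation, process) triples."""
    relations = []
    for sentence in sentences:
        for m in CATALYSIS.finditer(sentence):
            relations.append(Relation(m.group("enzyme"), "catalyzes", m.group("process")))
    return relations


print(keyword_search(SENTENCES, "glycolysis"))
for relation in extract_relations(SENTENCES):
    print(relation)
```

The keyword search returns the second sentence even though it states no catalysis relation, and misses the third, which does. The pattern finds both relations but would need rewriting for every new phrasing, which is one motivation for learned models.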
BlueDwarf for Text Mining
BioCreative Competition
Joint entry with Stanford
Recognition of drug names, chemical names, and protein names
in MEDLINE abstracts
Java maximum entropy tagger
Used roughly 700,000 features in the early stages
Java memory size of 1950 MB
Died on available Informatics and Stanford machines
BlueDwarf
Arrived at 1,247,77 features, memory: 2560 MB
Several experiments running in parallel
Provisional results: our entry was top-scoring
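To illustrate what a maximum entropy tagger computes, here is a minimal sketch in Python. It is not the Java tagger used for the competition entry: the training sentences, the label set, and the feature templates in token_features are all invented for illustration, and training is plain stochastic gradient ascent without regularization.

```python
import math

LABELS = ["O", "DRUG", "CHEMICAL", "PROTEIN"]

# Toy training data, invented for illustration (not BioCreative data).
TRAIN = [
    ("Aspirin inhibits COX-1 .".split(), ["DRUG", "O", "PROTEIN", "O"]),
    ("Ibuprofen inhibits COX-2 .".split(), ["DRUG", "O", "PROTEIN", "O"]),
    ("Glucose binds hexokinase .".split(), ["CHEMICAL", "O", "PROTEIN", "O"]),
]


def token_features(tokens, i):
    """Hypothetical binary features for token i: identity, shape, and neighbours."""
    word = tokens[i]
    return [
        "bias",
        "word=" + word.lower(),
        "suffix3=" + word[-3:].lower(),
        "has_digit=" + str(any(c.isdigit() for c in word)),
        "is_capitalized=" + str(word[0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    ]


def label_probs(weights, feats):
    """Softmax over per-label sums of the weights of the active features."""
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(data, epochs=100, rate=0.1):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for tokens, tags in data:
            for i, gold in enumerate(tags):
                feats = token_features(tokens, i)
                probs = label_probs(weights, feats)
                for y, p in zip(LABELS, probs):
                    grad = (1.0 if y == gold else 0.0) - p
                    for f in feats:
                        weights[(f, y)] = weights.get((f, y), 0.0) + rate * grad
    return weights


def tag(weights, tokens):
    """Label each token independently with its most probable label."""
    tagged = []
    for i in range(len(tokens)):
        probs = label_probs(weights, token_features(tokens, i))
        tagged.append(LABELS[probs.index(max(probs))])
    return tagged


weights = train(TRAIN)
for tokens, _ in TRAIN:
    print(list(zip(tokens, tag(weights, tokens))))
print("weights stored:", len(weights))
```

In this sketch every distinct feature string gets one weight per label, so the weight table grows with the vocabulary and with each added feature template. That growth is a plausible reason a tagger with hundreds of thousands of features needs a large heap, though the slides do not break the memory figures down.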