Transcript John_Keane

National Centre for Text Mining
John Keane
NaCTeM Co-director
University of Manchester
Welcome To All
• JISC, BBSRC, EPSRC
• National Agencies (British Libraries, HMCE, MoD)
• Regional Agencies
• Industry (pharmas etc, software related, etc)
• Academic community (Univs, DCC, CURL etc)
• Thanks to the host institutions
• Thanks to:
Anne Trefethen
Ross King
Leona Carpenter
Funding Bodies, Community etc
Thanks to the funding bodies (JISC (JCSR),
BBSRC, EPSRC) and the UK and
international Text Mining Community
For recognition of potential impact and
significance of Text Mining on the biosector and wider academic community,
and for articulating need for a National
Centre
Invited Speakers/Panellists
• Terri Attwood, University of Manchester
• Clifford Lynch, Coalition for Digital
Information
• Rob Procter, National Centre for e-Social
Science
• Dietrich Rebholz-Schuhmann, European
Bioinformatics Institute
Self-funded Partners
• University of California, Berkeley
Ray Larson
• University of Geneva
Margaret King
• University of Tokyo
Jun-ichi Tsujii
• San Diego Supercomputer Centre
Reagan Moore
Involvement
MANCHESTER
• Bill Black; Informatics
• Julia Chruszcz; MIMAS, Manchester Computing
• Carole Goble; ESNW and Computer Science
• John McCarthy; MIB and Faculty of Life Sciences
• John McNaught; Informatics
LIVERPOOL
• Paul Watry; University Library and Dept of English
SALFORD
• Sophia Ananiadou; Computing, Science and Engineering
Wendy Johnson, now MerseyBio
Text Mining – definition
Auvil and Searsmith (Illinois) 2003
• Non-trivial extraction of implicit, previously
unknown, and potentially useful information from
(large amounts of) textual data
• Exploration and analysis of textual (natural-language)
data by automatic and semi-automatic
means to discover new knowledge and update
existing knowledge
• What is “previously unknown” information?
– Strict: Information that not even the authors knew
– Lenient: Rediscover the information that the author
encoded in the text
[Diagram labels. Users: Bio-science, Medicine, Engineering, Science, Digital Libraries, Humanities. Components: Ontologies, User Interface, Text Mining, Term & Information Extraction, Data Mining, Information Retrieval]
Text Mining – vision
• (Bio)DBs with accurate, valid, exhaustive, rapidly updated data
– only 12% of TOXLINE users find what they want
– significant error rate and gaps in manually curated data
• Drug discovery costs slashed; animal experimentation
reduced through early identification of unpromising paths
– $800M over 12 years to develop a new drug -> reduce by 2 years
• New insights gained through integration and exploitation of
experimental results, (bio)DBs, and scientific knowledge
• Product development archives and patents yield new
directions for R&D
Searching yields FACTS rather than documents
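The contrast between returning documents and returning facts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three toy sentences, the `RELATION` pattern, and the function names `retrieve` and `extract_facts` are hypothetical and do not describe NaCTeM's tools or any real findings.

```python
import re

# Toy corpus: hypothetical sentences, not real biological findings.
DOCS = {
    "doc1": "Protein A inhibits protein B in yeast cells.",
    "doc2": "We review the history of protein research.",
    "doc3": "Protein C activates protein D under stress.",
}

# Hypothetical pattern for statements of the form "<X> inhibits|activates <Y>".
RELATION = re.compile(
    r"(protein \w+) (inhibits|activates) (protein \w+)", re.IGNORECASE
)


def retrieve(keyword):
    """Document retrieval: ids of all documents that mention the keyword."""
    return sorted(
        doc_id for doc_id, text in DOCS.items() if keyword.lower() in text.lower()
    )


def extract_facts():
    """Fact extraction: (subject, relation, object, source id) tuples."""
    facts = []
    for doc_id, text in sorted(DOCS.items()):
        for subj, rel, obj in RELATION.findall(text):
            facts.append((subj.lower(), rel.lower(), obj.lower(), doc_id))
    return facts


if __name__ == "__main__":
    print(retrieve("protein"))  # a list of documents the user still has to read
    print(extract_facts())      # structured statements with their sources
```

A keyword search returns every document that mentions the term, including the review in `doc2` that states no relation. The extraction step returns only the stated relations, each with its source. A real system would need far more than a single regular expression.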
Text Mining – realism
Computerworld 2004
• Technical:
Technology is becoming mature but issues of
efficiency and scalability – need to integrate
myriad set of tools
• Person-intensive:
Skill set required to understand domain (e.g.
develop ontology) and interpret/analyse
results
NaCTeM so far …
• £1M over 3 years (review after 2 years) – co-funding
by institutions of ~£800K
• 6 core staff – joined October’04-January’05
• Requirements gathering and technical development
phases begun
• University of Geneva has received funding for a part-time
post on ‘evaluation’
• Planned move to Manchester Interdisciplinary
Biocentre in summer 2005.
Thanks to all involved, and the NaCTeM team, in
particular Richard Barker for organising