Retrospective study of a gene by mining texts: The Hepcidin use-case
Fouzia Moussouni-Marzolf
Introduction
Life science is becoming the most voluminous science.
3 major reasons:
1. Modern digital revolution: the Internet
2. Increasing incentive to publish:
• Competition pressure
• Evaluation concerns at several levels
3. Sharing of knowledge at a global scale
Introduction
Rapid expansion of the biomedical literature: the number of available papers is exploding.
The iron regulation system is still difficult to comprehend.
[Figure: MLTrends plot for Hepcidin (since Dec 2000), showing a boom of publications since 2000]
Comprehension of associated diseases by medical experts.
Increased demand for effective text mining tools to quickly find relevant information.
Introduction
These tools extract a deluge of information: very dense data.
[Figure: Text mining with Ali-Baba and a global query « Hepcidin » [1], with one graph for Hepcidin: January 2011 and one for Hepcidin: February 2011. Many common events, little news.]
For a non-expert: the information is dense and unreadable. The pertinent information is hidden, and biologists are rapidly discouraged from using these tools.
For an expert: a considerable amount of well-known data (background).
[1] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J. & Leser, U. AliBaba: PubMed as a graph. Bioinformatics. 22, 2444-2445 (2006).
Introduction
Which solutions can manage this increasing flood of extracted information?
Unfold time during the process of text mining:
• Reduce the density of information at each period of time
• Perceive a certain chronology in the sequence of events linked to a gene: this enhances comprehension
• Locate trivial information that is repeatedly published and extracted [2]
• Select the most relevant events over time = reduced density of information
[2] Jensen, L.J., Saric, J. & Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 7, 119-129 (2006).
Methods
Focus on 2 frames of study
1. Exploit the text mining engine Ali-Baba (HU-Berlin)
Information extraction tool working on Medline abstracts resulting from a PubMed query, e.g. « Hepcidin 2005 [dp] »
Ali-Baba is not a simple pattern-matching tool for counting keyword occurrences. It recognizes actual biological entities located in the abstracts using dictionaries.
Different sorts of bio-entities extracted: protein, disease, cell type, tissue, drug, species.
Methods
Ali-Baba extracts relationships between recognized bio-entities, namely bio-events.
Example: “… STAT3 inhibitors, including curcumin, AG490 and a peptide (PpYLKTK), reduced hepcidin1 …”
Biological events (Source Entity, Relationship, Target Entity):
Curcumin, reduce, hepcidin1
AG490, reduce, hepcidin1
Peptide (PpYLKTK), reduce, hepcidin1
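The three events of this example can be written down as (source, relationship, target) triples. The sketch below is only an illustration of that data structure: the names `BioEvent` and `events_from_sources` are hypothetical and do not reflect Ali-Baba's internal representation.

```python
from typing import List, NamedTuple


class BioEvent(NamedTuple):
    """One extracted bio-event (hypothetical structure, for illustration)."""
    source: str
    relationship: str
    target: str


def events_from_sources(sources: List[str], relationship: str, target: str) -> List[BioEvent]:
    """Build one event per source entity, all sharing relationship and target."""
    return [BioEvent(s, relationship, target) for s in sources]


# Entities taken from the example sentence on the slide.
events = events_from_sources(
    ["curcumin", "AG490", "peptide (PpYLKTK)"], "reduce", "hepcidin1"
)
for event in events:
    print(f"{event.source} -[{event.relationship}]-> {event.target}")
```

Because the triples are hashable tuples, they can be stored in sets, which makes it easy to compare the events extracted for different periods.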
Methods
Abstracts of « Hepcidin 2005 [dp] » -> Extraction of bio-events (natural language processing (NLP), co-occurrence) -> Graph of events
Methods
2. Focus on the Hepcidin gene
Corpus of linked biological events published from the gene's discovery until today
Retrospective study of Hepcidin over time: Dec 2000 to June 2012, period = 1 month
At each period: filter trivial bio-events, select relevant bio-entities
Methods
What is a time-relevant biological entity?
Definition
A biological entity e recognized by an IE-based text mining system is time-relevant for period t if, at time t, it achieves a maximum of relationships with other biological entities recognized by the same IE-based system.
[Figure: graph G(Nodes, Edges) of extracted bio-events. The t-relevant biological entity e is highly targeted by other bio-entities at time t.]
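A minimal sketch of this definition, assuming that "a maximum of relationships" means the largest number of distinct entities targeting e in the graph of period t (an interpretation based on the "highly targeted" label; the slides do not give the exact formula). The function name and the last toy event are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Event = Tuple[str, str, str]  # (source, relationship, target)


def t_relevant_entities(events: Iterable[Event]) -> Tuple[List[str], int]:
    """Return the most targeted entities of one period and their partner count."""
    incoming = defaultdict(set)
    for source, _relationship, target in events:
        if source != target:  # ignore self-loops
            incoming[target].add(source)
    if not incoming:
        return [], 0
    best = max(len(sources) for sources in incoming.values())
    winners = sorted(e for e, sources in incoming.items() if len(sources) == best)
    return winners, best


# Events of one period: three from the slide example, one hypothetical.
period_events = [
    ("curcumin", "reduce", "hepcidin1"),
    ("AG490", "reduce", "hepcidin1"),
    ("peptide (PpYLKTK)", "reduce", "hepcidin1"),
    ("hepcidin1", "relates_to", "entity_x"),  # placeholder, not from the source
]
print(t_relevant_entities(period_events))
```

Ties are returned together, so several entities can be t-relevant for the same period.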
Methods
T-relevance can be computed for different sorts of biological entities.
Source Entity -> Relationships -> Target Entity
Source and target entities can each be: protein, disease, cell type, tissue, drug, species.
Different valuable information for each kind of relevance.
Methods
What is a trivial biological event at time t?
A trivial event Te = an event already published before t
G0 = graph of events at time t0
G1 = graph of events at time t1 = t0+p
G2 = graph of events at time t2 = t0+2p
...
At t1 = t0+p: Te is trivial if Te ∈ G1 and Te ∈ G0
At t2 = t0+2p: Te is trivial if Te ∈ G2 and (Te ∈ G1 or Te ∈ G0)
And so on for t0+3p, ...
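The rule can be sketched with plain sets: an event of period k is trivial if it already appears in any earlier graph G0 ... G(k-1). The function name `split_trivial` and the toy events `a` to `d` are hypothetical, and events are assumed to be comparable as hashable values such as (source, relationship, target) tuples.

```python
from typing import Hashable, List, Set, Tuple


def split_trivial(graphs: List[Set[Hashable]]) -> List[Tuple[Set[Hashable], Set[Hashable]]]:
    """For each period, return (new events, trivial events).

    graphs must be ordered in time: G0, G1, G2, ...
    """
    seen: Set[Hashable] = set()  # union of all earlier graphs
    result = []
    for graph in graphs:
        trivial = graph & seen   # already published before this period
        new = graph - seen       # first occurrence
        result.append((new, trivial))
        seen |= graph
    return result


# Toy graphs for three periods.
g0 = {"a", "b"}
g1 = {"a", "c"}
g2 = {"b", "c", "d"}
for k, (new, trivial) in enumerate(split_trivial([g0, g1, g2])):
    print(f"G{k}: new={sorted(new)} trivial={sorted(trivial)}")
```

Keeping a running union avoids comparing each graph with every earlier graph separately.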
Methods
Data Processing Pipeline
For each period t in [t0, tn]:
1. Query(t) = « Gene t [dp] »
2. Ali-Baba web service for Query(t): events extracted and drawn for period t
3. GraphML export
4. Insert into the GraphML database
Processing steps:
• Data transformation
• Data stamping
• Clearing of trivial data
• Selection of t-relevant bio-entities
• Data integration: integrated time-based events of the decade
Final retrospective data analysis
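The first step of the pipeline, building one query per period, might look like the sketch below. It assumes monthly periods written as `YYYY/MM` before the `[dp]` (date of publication) tag; the slides only show « Hepcidin 2005 [dp] » and « Gene t [dp] », so the exact period format is an assumption, and `monthly_queries` is a hypothetical helper. The call to the Ali-Baba web service is not shown, since its interface is not described here.

```python
from typing import List, Tuple


def monthly_queries(gene: str, start: Tuple[int, int], end: Tuple[int, int]) -> List[Tuple[str, str]]:
    """Return (period, query) pairs for every month from start to end inclusive.

    start and end are (year, month) tuples.
    """
    year, month = start
    queries = []
    while (year, month) <= end:
        period = f"{year:04d}/{month:02d}"
        queries.append((period, f"{gene} {period} [dp]"))
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return queries


queries = monthly_queries("Hepcidin", (2000, 12), (2011, 12))
print(len(queries), queries[0], queries[-1])
```

Each period string can also serve as the time stamp attached to the events extracted for that query.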
Results
Hepcidin gene use case, from t0 = 12/2000 to tn = 12/2011
Database of more than 50,000 published biological events.
Considerable amount of trivial events: background?
[Figure: cumulative quantification of trivial events over time]
52% of the events published over the whole Hepcidin decade are trivial.
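One way such a cumulative share could be computed is sketched below. The slides do not state how the 52% figure was obtained, so this is an assumption: each event counts once per period in which it is extracted, and it counts as trivial whenever it was already extracted in an earlier period. The function name and toy data are hypothetical.

```python
from typing import Hashable, List, Set


def cumulative_trivial_share(graphs: List[Set[Hashable]]) -> List[float]:
    """Share of trivial events among all events extracted up to each period."""
    seen: Set[Hashable] = set()
    total = 0
    trivial = 0
    shares = []
    for graph in graphs:  # graphs ordered in time
        trivial += len(graph & seen)
        total += len(graph)
        seen |= graph
        shares.append(trivial / total if total else 0.0)
    return shares


# Toy data: three periods with overlapping events.
shares = cumulative_trivial_share([{"a", "b"}, {"a", "c"}, {"b", "c", "d"}])
print([round(s, 3) for s in shares])
```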
Results
Hepcidin gene use case: relevant bio-entities over time
Relevant proteins over time
Before clearing trivial events: permanent visibility of Hepcidin as relevant.
After clearing: new information emerges as highly targeted (several proteins regulate Hepcidin transcription).
Results
Hepcidin gene use case: relevant bio-entities over time
Relevant diseases over time
Before clearing trivial events: permanent visibility of hemochromatosis and iron overload.
After clearing: new diseases linked to Hepcidin and iron, such as neurological diseases, emerge as highly targeted.
Results
More annotations of the “relevant entities”
Conclusion
A new, straightforward approach for retrospective studies of genes has been proposed.
Time has been coupled to the process of information extraction to improve comprehension of the considerable number of biological events linked to the Hepcidin gene since its discovery in Dec 2000.
This work is still ongoing. Current developments …
• Toward a generalization to queries on any biological entity
• Exclude review papers and the “background” and “methods” sections from mining, to minimize trivial events and entities
• Threshold of relevance, threshold of triviality
Acknowledgments
Contributors
• Bertrand De-Cadeville, Master2 MSB
• Olivier Loréal, head of the Iron Team, INSERM UMR 991
• Ulf Leser, head of the Bioinformatics Team, HU-Berlin
• Astrid Rheinlander, Ali-Baba Team at Berlin