
Literature Mining
BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University
Announcement
• HW #3 is cancelled. The grades will be
adjusted accordingly.
Acknowledgement
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Acknowledgement
• Dr. Hongyu Peng (Brandeis Univ.)
• Dr. Hagit Shatkay (http://www.shatkay.org)
provided part of the slides.
Connecting the dots
• Story of Thalidomide (from
sedative to birth defects to anticancer drug)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Information Retrieval (IR)
• Finding the papers
• IR systems aim to identify the text segments (whether full articles, abstracts, paragraphs, or sentences) that pertain to a certain topic (e.g., the yeast cell cycle).
• E.g., PubMed, Google Scholar
• Ad hoc IR
• Text categorization (pre-defined set of categories)
• Advanced: integrate entity recognition
Ad Hoc IR
• User provides a query
• Boolean model
• Index-based (e.g., “gene AND CD”)
Boolean Queries
DB: database of documents.
Vocabulary: {t1, …, tM} (terms in DB, produced by the tokenization stage).
Index structure: maps a term → all the documents containing it.
[Index illustration: terms such as “acquired immunodeficiency”, “asthma”, “blood”, and “blood pressure” each point to the documents in the database that contain them.]
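To make the index structure concrete, here is a minimal sketch of an inverted index answering a Boolean AND query; the toy documents and terms are illustrative, not from any real corpus.

```python
# Minimal inverted-index sketch: term -> set of documents containing it.
from collections import defaultdict

docs = {
    1: "asthma and blood pressure in adults",
    2: "acquired immunodeficiency and blood markers",
    3: "blood pressure regulation",
}

# Tokenization stage: split each document into terms.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_and(*terms):
    """Return the documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("blood", "pressure"))  # -> {1, 3}
```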
Ad Hoc IR
• User provides a query
• Boolean model
• Challenges:
• Synonymy (AGP1, a.k.a. Amino Acid Permease 1)
• Polysemy: “CD” (54,745 PubMed entries) can mean cytosine deaminase, compact disk, capillary density, cortical dysplasia, Chagas' disease, Crohn's disease, ...
Ad Hoc IR
• User provides a query
• Vector-based model
• Similarity query (e.g., cosine similarity between vectors)
Semantic search
TIME (Sept 5, 2005): Search engines are good
at matching words … The next step is semantic
search – looking for meaning, not just matching
key words. … Nervana, which analyzes
language by linking word patterns contextually
to answer questions in defined subject areas,
such as medical-research literature.
The Vector Model
DB: database of documents.
Vocabulary: {t1, …, tM} (terms in DB).
Document d ∈ DB: a vector <w1d, …, wMd> of weights.
Weighting Principles
• Document frequency: Terms occurring in a few documents are
more useful than terms occurring in many.
• Local term frequency: Terms occurring frequently within a
document are likely to be significant for the document.
• Document length: A term occurring the same # of times in a long
document and in a short one has less significance in the long one.
• Relevance: Terms occurring in documents judged as relevant to a query are likely to be significant (with respect to that query).
[Sparck Jones et al. 98]
Some Weighting Schemes:
Binary: wid = 1 if ti ∈ d, 0 otherwise.
TF: wid = fid = # of times ti occurs in d.
(Considers local term frequency.)
TF × IDF (one version...): wid = fid / fi, where fi = # of docs containing ti.
(Considers local term frequency and (inverse) document frequency.)
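A small sketch of these schemes on a toy corpus follows; the `scheme` switch and the log-scaled variant are illustrative additions, not from the slide.

```python
# Sketch of the weighting schemes above on a toy corpus.
import math
from collections import Counter

docs = [
    "gene expression in yeast cell cycle".split(),
    "yeast gene regulation".split(),
    "cell cycle checkpoints".split(),
]

df = Counter()                  # fi: # of docs containing term ti
for d in docs:
    df.update(set(d))

def weights(d, scheme="slide"):
    tf = Counter(d)             # fid: # of times ti occurs in d
    if scheme == "binary":
        return {t: 1 for t in tf}
    if scheme == "slide":       # the TF x IDF version above: wid = fid / fi
        return {t: tf[t] / df[t] for t in tf}
    # log-scaled TF x IDF, more common in practice:
    return {t: tf[t] * math.log(len(docs) / df[t]) for t in tf}

print(weights(docs[0]))
```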
Vector-Based Similarity
Document d = <w1d, …, wMd> ∈ DB
Query q = <w1q, …, wMq> (q could itself be a document in DB...)
Sim(q, d) = cosine(q, d) = (q • d) / (|q| |d|)
[Salton 89; Witten et al. 99] Introductory IR.
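A direct sketch of this cosine formula over sparse term-weight vectors (the example vectors are made up):

```python
# Sim(q, d) = (q . d) / (|q| |d|) over sparse term -> weight dictionaries.
import math

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"yeast": 1.0, "cycle": 1.0}
d = {"yeast": 0.5, "gene": 0.3, "cycle": 0.8}
print(cosine(q, d))  # ~0.93
```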
Probabilistic Models
Query q ; Document d
• Goal: Find all d’s such that P(relevant | d, q) is high
Maximize the log-odds: log [ P(relevant | d, q) / P(irrelevant | d, q) ]
[Sparck Jones et al. 98, Sahami 98, Ponte & Croft 98, Hofmann 99]
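The slide states only the ranking objective; as one illustrative instantiation (not from the slide), a naive Bayes scorer with add-one smoothing could compute the log-odds like this:

```python
# Hypothetical naive Bayes instantiation of the log-odds objective above;
# the model choice and smoothing are assumptions, not from the slide.
import math
from collections import Counter

def log_odds(doc_terms, rel_counts, irr_counts, n_rel, n_irr, vocab_size):
    """log[ P(relevant | d, q) / P(irrelevant | d, q) ]; higher = more relevant."""
    score = math.log(n_rel / n_irr)                    # prior odds
    rel_total = sum(rel_counts.values()) + vocab_size  # add-one smoothing
    irr_total = sum(irr_counts.values()) + vocab_size
    for t in doc_terms:
        score += math.log((rel_counts[t] + 1) / rel_total)
        score -= math.log((irr_counts[t] + 1) / irr_total)
    return score

rel = Counter("yeast cell cycle gene".split())   # terms from judged-relevant docs
irr = Counter("compact disk storage".split())    # terms from judged-irrelevant docs
print(log_odds("yeast gene".split(), rel, irr,
               n_rel=2, n_irr=2, vocab_size=7))  # positive -> leans relevant
```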
Latent Semantics Analysis
[Dumais, Deerwester et al,1988,1990]
Motivation:
Overcoming synonymy and polysemy.
Reducing dimensionality.
Idea:
Project from “explicit term” space to a lower
dimension, “abstract concept” space.
Methodology:
Truncated SVD (closely related to PCA) applied to the document-term matrix.
The dimensions with the highest singular values are used as the features for representing documents.
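A minimal LSA sketch with numpy, on a made-up document-term matrix:

```python
# Minimal LSA sketch: truncated SVD of a toy document-term matrix.
import numpy as np

# rows = documents, columns = terms (toy counts)
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                 # keep the k largest singular values
doc_concepts = U[:, :k] * s[:k]       # documents in "abstract concept" space
print(doc_concepts.round(2))          # two clear concept clusters emerge
```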
Information Retrieval – Details (cont.)
Text Categorization (semantic)
Automatically place documents in the right categories so as to make them easy to find.
[Example categories: Cancer, Apoptosis, Elongation, ...]
Information Retrieval – Details (cont.)
Rule-Based Text Classification
A knowledge-engineering approach.
Boolean rules (DNF), based on the presence/absence of specific terms within the
document, decide its membership in the class. (e.g. the CONSTRUE system [Hayes et
al. 90,92] )
Example:
If ( (<GENE_Name> ∧ transcript) ∨
     (<GENE_Name> ∧ Western Blot) ∨
     (<GENE_Name> ∧ Northern Blot) )
Then GeneExpressionDoc
Else ¬GeneExpressionDoc
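That rule, written as a plain predicate (the gene-name check is a stub with a name list; a real system would use entity recognition):

```python
# The CONSTRUE-style DNF rule above as a predicate; the gene-name check
# is a stub for illustration (real systems use entity recognition).
def is_gene_expression_doc(text, gene_names):
    t = text.lower()
    has_gene = any(g.lower() in t for g in gene_names)      # <GENE_Name>
    return has_gene and ("transcript" in t
                         or "western blot" in t
                         or "northern blot" in t)

print(is_gene_expression_doc("CDC28 transcript levels rise in G1",
                             ["CDC28"]))                    # True
```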
Information Retrieval – Details (cont.)
Machine Learning for Text Classification (supervised)
• Take a training set of pre-classified documents
• Build a model for the classes from the training examples
• Assign each new document to the class that best fits it
(e.g. closest or most-probable class.)
Types of class assignment:
Hard: Each document belongs to exactly one class
Soft: Each document is assigned a “degree of membership” in
classes
Methods:
• Nearest neighbor (sketched below)
• Summarizing document vectors
• SVM, Bayesian, boosting
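A sketch of hard vs. soft assignment with a nearest-neighbor approach; the training vectors and labels are toy examples, and cosine is as in the earlier similarity sketch:

```python
# Sketch: hard vs. soft class assignment by similarity to labeled documents.
import math

def cosine(q, d):                     # as in the earlier similarity sketch
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

training = [                          # toy pre-classified documents
    ("Cancer",    {"tumor": 1.0, "apoptosis": 0.2}),
    ("Apoptosis", {"apoptosis": 1.0, "caspase": 0.8}),
]

def hard_assign(doc_vec):
    """Hard: the document belongs to exactly one (nearest) class."""
    return max(training, key=lambda ex: cosine(doc_vec, ex[1]))[0]

def soft_assign(doc_vec):
    """Soft: a degree of membership per class (normalized similarity)."""
    sims = {c: cosine(doc_vec, v) for c, v in training}
    total = sum(sims.values()) or 1.0
    return {c: s / total for c, s in sims.items()}

doc = {"apoptosis": 0.9, "tumor": 0.3}
print(hard_assign(doc), soft_assign(doc))
```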
Evaluating Extraction and Retrieval
To say how good a system is, we need:
1. Performance metrics (numerical measures)
2. Benchmarks, on which performance is
measured (the gold-standard).
Evaluating Extraction and Retrieval (cont.)
Performance Metrics
N items (e.g., documents, terms, or sentences) in the collection.
REL: the relevant items in the collection; these SHOULD be extracted or retrieved.
RETR: the items actually extracted or retrieved,
some correctly (A = |REL ∩ RETR|),
some incorrectly (B = |RETR − REL|);
|RETR| = A + B
Evaluating Extraction and Retrieval (cont.)
Performance Metrics (cont.)
[Venn diagram: REL and RETR as overlapping subsets of the collection]
A = |REL ∩ RETR| (relevant and retrieved)
B = |RETR − REL| (retrieved but not relevant)
C = |¬REL − RETR| (neither relevant nor retrieved)
D = |REL − RETR| (relevant but not retrieved)
|Collection| = N = A + B + C + D
Performance Metrics (cont.)
Precision: P = A/(A+B)
How many of the retrieved/extracted items are correct
Recall: R = A/(A+D)
How many of the items that should be retrieved are recovered
Accuracy: (A+C)/N (ratio of correctly classified items)
Combination Scores:
F-score: 2PR / (P+R)
Harmonic mean of P and R, in the range [0,1]
Fβ-score: (1+β²)PR / (β²·P + R)
β > 1 prefers recall, β < 1 prefers precision
E-measure: 1 − Fβ-score
Inversely proportional to performance (an error measure).
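These metrics, computed from toy retrieved/relevant ID sets:

```python
# Precision, recall, F-score, and accuracy from retrieved/relevant sets.
def prf(retrieved, relevant, n):
    a = len(retrieved & relevant)            # A = |REL ∩ RETR|
    b = len(retrieved - relevant)            # B = |RETR − REL|
    d = len(relevant - retrieved)            # D = |REL − RETR|
    c = n - a - b - d                        # C: true negatives
    p = a / (a + b) if (a + b) else 0.0      # precision
    r = a / (a + d) if (a + d) else 0.0      # recall
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f, (a + c) / n              # last value is accuracy

print(prf({1, 2, 3}, {2, 3, 4}, n=10))       # -> approx (0.667, 0.667, 0.667, 0.8)
```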
Performance Metrics (cont.)
Precision-Recall Curves
[Figure: precision-recall curve for a collection with 4 relevant documents, where 7 documents are retrieved and ranked; precision (0–100%) is plotted against recall at the 25%, 50%, 75%, and 100% levels.]
Performance Metrics (cont.)
Accounting for Ranks
For a given rank n, P@n: precision at rank n
R-Precision: P@R, where R is the number of relevant documents
Average Scores
Average Precision: Average the precision over all the ranks in which a relevant
document is retrieved.
Mean Average Precision: Mean of the Average Precision over all the queries.
Micro-Average: Average over individual items across queries
Macro-Average: Average over queries
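A sketch of average precision over a ranked list; the relevance pattern below (relevant documents at ranks 1, 3, 4, 6) is one pattern consistent with the 4-relevant / 7-retrieved example above:

```python
# Average precision: mean of P@n over the ranks of the relevant documents.
def average_precision(ranked_relevance):
    hits, total = 0, 0.0
    for n, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / n          # P@n at each relevant rank
    return total / hits if hits else 0.0

# 1 = relevant, 0 = not; relevant docs found at ranks 1, 3, 4, 6.
print(average_precision([1, 0, 1, 1, 0, 1, 0]))  # ~0.77
```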
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Entity Recognition (ER)
• Identifying the substance(s)
• Rule- and context-based approach (manual), e.g., '-ase' for enzymes
• Rule- and context-based approach (machine learning)
• Dictionary-based approach
• How the names are written: CDC28, cdc28, cdc28p, cdc-28 (see the sketch below)
• Curation of the dictionary
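A sketch of dictionary-based ER with a normalizer that collapses the spelling variants above; the one-entry dictionary is a toy stand-in for a curated one:

```python
# Dictionary-based ER sketch: normalize spelling variants, then look up.
import re

dictionary = {"cdc28": "CDC28 (S. cerevisiae)"}   # toy curated dictionary

def normalize(token):
    t = token.lower().replace("-", "")            # cdc-28 -> cdc28
    return re.sub(r"p$", "", t)                   # cdc28p -> cdc28 (protein suffix)

def find_entities(text):
    return [(tok, dictionary[normalize(tok)])
            for tok in text.split() if normalize(tok) in dictionary]

print(find_entities("Levels of Cdc28p and cdc-28 were measured"))
```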
Entity Recognition (ER)
• Major challenge: lack of standardization of names
• 'cdc2' refers to two completely unrelated genes in budding and fission yeast
• 'SDS': serine dehydratase gene vs. sodium dodecyl sulfate vs. Shwachman-Diamond syndrome
• Synonymy (AGP1, a.k.a. Amino Acid Permease 1)
• Polysemy
Entity Recognition (ER)
• A simpler version: decide whether a symbol refers to a gene or to its product
• iHOP (Information Hyperlinked Over Proteins)
http://www.pdg.cnb.uam.es/UniPub/iHOP
Vocabulary
• Many, many vocabularies exist
• SNOMED, ICD, …
• ICD (International Statistical Classification
of Diseases and Related Health
Problems)
Vocabulary
• ICD example entries:
573.3 Hepatitis, unspecified
    Toxic (noninfectious) hepatitis
    Use additional E code to identify cause
571.4 Chronic hepatitis
    Excludes: viral hepatitis (acute) (chronic) (070.0-070.9)
571.49 Other
    Chronic hepatitis: active, aggressive
    Recurrent hepatitis
070 Viral hepatitis
    Includes: viral hepatitis (acute) (chronic)
    Excludes: cytomegalic inclusion virus hepatitis (078.5)
Unified Medical Language System (UMLS)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Information Extraction (IE)
• Extract pre-defined types of facts, in particular relationships between biological entities
• Co-occurrence-based methods
• Natural language processing (NLP)-based methods
Information Extraction
Usually it requires:
• Identifying the relevant sentences
• Parsing them to extract specific information
• Assuming “well-behaved” fact sentences
By contrast, using co-occurrence relationships alone does not require parsing or well-formed fact structure.
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Text Mining (TM)
• The discovery by computer of new,
previously unknown information, by
automatically extracting information from
different written records.
Text Mining
• Based on transitivity of relationships in the co-occurrence graph.
• This idea can be used to discover new facts by co-occurrence.
• Web tool: Arrowsmith
[Swanson's classic example: the literature reports that fish oil reduces blood viscosity and platelet aggregability, and that Raynaud's syndrome co-occurs with increased blood viscosity, platelet aggregability, and vascular reactivity; linking the two suggests that fish oil can reduce Raynaud's syndrome.]
[Swanson 86, Swanson 87, Swanson 90, Swanson & Smalheiser 99, Weeber et al. 2001, Stapley & Benoit 2000, Srinivasan 2003, Srinivasan 2004]
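A sketch of this transitive, co-occurrence-based discovery; the edge list simply mirrors the fish-oil example above:

```python
# Swanson-style discovery: find intermediate B-terms linking A and C
# when A and C never co-occur directly. Edges mirror the example above.
from collections import defaultdict

edges = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregability"),
    ("raynaud's syndrome", "blood viscosity"),
    ("raynaud's syndrome", "platelet aggregability"),
    ("raynaud's syndrome", "vascular reactivity"),
]

graph = defaultdict(set)
for a, b in edges:                    # undirected co-occurrence graph
    graph[a].add(b)
    graph[b].add(a)

def hidden_links(a, c):
    """Shared neighbors of A and C, provided A and C never co-occur."""
    if c in graph[a]:
        return set()                  # already directly connected
    return graph[a] & graph[c]

print(hidden_links("fish oil", "raynaud's syndrome"))
# -> {'blood viscosity', 'platelet aggregability'}
```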
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Integration: combining text and
biological data
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768