
Development of pathway
analysis tools
Hanyang University Institute of Medical Science /
Korea Institute of Ocean Science and Technology (KIOST)
Kiejung Park
2014. 06. 12.
HiFI
(Human integrated Functional Interaction)
Yeongjun Jang, Kiejung Park
[email protected], [email protected]
Background
High-throughput genomic experiments, including association studies,
examinations of sequence mutations/copy number variants, and
expression experiments, typically generate multiple candidate genes
that are involved in cancer-causing cellular processes.
However, these data sets are noisy and contain false positives.
How can we extract true-positive candidate genes and reveal functional
relationships among these genes with enough confidence for use in further
experimental analysis?
Methods to distinguish drivers from passengers
examine the rate of synonymous versus non-synonymous
mutations
predict the functional consequence of mutations
assess the overall rate of recurrence, based on combined rates
of sequence mutation and copy number alteration
identify cancer drivers by identifying an enrichment of rare
cancer mutations within network modules
Pathway-driven approach
It marks the genes associated with the disease or
other phenotype
and separates them (drivers) from innocent
bystanders (passengers) caught in the general
instability of the malignant genome, and from other false
positive hits
It identifies and extends the biological pathways
affected by the genes
HiFI
Functional interaction network upon pathway context
Yu N, Seo J, Rho K, Jang Y, Park J, Kim WK, Lee S. (2012)
hiPathDB: a human-integrated pathway database with facile visualization.
Nucleic Acids Res. 40(Database issue), D797-802
Naive Bayes Classifier
The naive Bayes classifier selects the most likely classification $v_{NB}$ given the
attribute values $a_1, a_2, \ldots$:
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
We generally estimate $P(a_i \mid v_j)$ using m-estimates:
$P(a_i \mid v_j) = \frac{n_c + m p}{n + m}$
$n$ = the number of training examples for which $v = v_j$
$n_c$ = the number of examples for which $v = v_j$ and $a = a_i$
$p$ = a priori estimate for $P(a_i \mid v_j)$
$m$ = the equivalent sample size
Naive Bayes Classifier
Training No.  Interact?  EXP1  EXP2  EXP3  EXP4
1             Yes        Yes   No    Yes   No
2             Yes        No    Yes   No    No
3             Yes        No    Yes   No    No
4             No         Yes   Yes   No    Yes
5             No         Yes   No    Yes   No
6             No         No    Yes   Yes   No
-             ?          Yes   No    No    Yes
Calculate P(EXP1=Yes | Interact?=Yes) using p = 0.5, n = 3, n_c = 1 and m = 3
Conditional probability table, one pair of rows per attribute a_i (i = 1, 2, ...):
v = Yes:  P(a_i = Yes | v = Yes) = \pi_{i,Yes}    P(a_i = No | v = Yes) = 1 - \pi_{i,Yes}
v = No:   P(a_i = Yes | v = No) = \pi_{i,No}      P(a_i = No | v = No) = 1 - \pi_{i,No}
Calculate V using the prior probability, P(v=Yes) = P(v=No) = 0.5
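The m-estimate calculation and the final classification of the unlabeled row can be worked through in a few lines. This is a minimal sketch, not the HiFI implementation: it assumes the fourth experiment column is a distinct feature (called EXP4 here), that the unlabeled row reads EXP1=Yes, EXP2=No, EXP3=No, EXP4=Yes, and that p = 0.5 and m = 3 are used for every attribute.

```python
# Toy training table from the slide: (Interact?, EXP1, EXP2, EXP3, EXP4).
# The name EXP4 for the last column is an assumption.
TRAINING = [
    ("Yes", "Yes", "No", "Yes", "No"),
    ("Yes", "No", "Yes", "No", "No"),
    ("Yes", "No", "Yes", "No", "No"),
    ("No", "Yes", "Yes", "No", "Yes"),
    ("No", "Yes", "No", "Yes", "No"),
    ("No", "No", "Yes", "Yes", "No"),
]
QUERY = ("Yes", "No", "No", "Yes")  # EXP1..EXP4 of the unlabeled row


def m_estimate(n_c, n, p=0.5, m=3):
    """m-estimate of P(a_i | v): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


def class_scores(training, query, prior=0.5, p=0.5, m=3):
    """Return P(v) * prod_i P(a_i | v) for v in {Yes, No}."""
    scores = {}
    for label in ("Yes", "No"):
        rows = [r[1:] for r in training if r[0] == label]
        score = prior
        for i, value in enumerate(query):
            n_c = sum(1 for r in rows if r[i] == value)
            score *= m_estimate(n_c, len(rows), p, m)
        scores[label] = score
    return scores


scores = class_scores(TRAINING, QUERY)
prediction = max(scores, key=scores.get)
posterior_no = scores["No"] / (scores["Yes"] + scores["No"])

# P(EXP1=Yes | Interact?=Yes) = (1 + 3*0.5) / (3 + 3) = 2.5/6
print(m_estimate(1, 3))
print(scores, prediction, posterior_no)
```

Under these assumptions the unnormalized score for "No" is larger than for "Yes", so the toy row would be classified as non-interacting.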
Avoid violations of the strong independence assumption
One requirement for a successful NBC is that the features used in
the classifier be independent
Human PPIs and gene co-expression data sets were generated
experimentally
Many human protein interactions in interaction databases,
including IntAct, BioGrid, and HPRD, are not generated
experimentally but are human curated from the literature
Many of the GO annotations and domain interactions are predicted
based on sequence similarities among proteins in different species.
Hence, there is a potential dependency among these data types
since they all rely on the same phylogenetic trees
Experimental methods as new features for NBC
PSI-MI Ontology
MI:0045
(experimental interaction detection)
125 experiment types
Biochemical
Biophysical
Genetic interference
Imaging technique
Phenotype-based detection assay
Post transcriptional interference
Protein complementation assay
Random Naïve Bayes
[Diagram: several naïve Bayes classifiers, each outputting a label]
Generalization of Random Forest to Naïve Bayes (Naïve Bayes + Bagging)
Random Naïve Bayes
Conditional Independence Test: Mutual Information
•Implementation:
–Used a stratified approach
•24 classifiers at feature size m = 5
•Feature strata (number of features): Human PPIs (3), Human PPIs (2), Human PPIs (2), Gene co-expressions: microarray (1), Gene co-expressions: RNA-seq (2)
–Used majority voting with equal weights
•classifiers agree (> 12) → there exists a functional interaction
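One reading of the stratified approach that matches the numbers on the slide is that each classifier takes exactly one feature from each of the five strata, which gives 3 × 2 × 2 × 1 × 2 = 24 classifiers of feature size 5. The sketch below enumerates those subsets and applies the equal-weight majority vote. The feature names are hypothetical placeholders, and the one-per-stratum rule is an assumption rather than something the slides state.

```python
from itertools import product

# Only the stratum sizes (3, 2, 2, 1, 2) come from the slide;
# the feature names are hypothetical.
STRATA = [
    ["ppi_a1", "ppi_a2", "ppi_a3"],
    ["ppi_b1", "ppi_b2"],
    ["ppi_c1", "ppi_c2"],
    ["coexp_microarray"],
    ["coexp_rnaseq_1", "coexp_rnaseq_2"],
]


def stratified_feature_subsets(strata):
    """One feature per stratum: every combination is one classifier's feature set."""
    return [tuple(combo) for combo in product(*strata)]


def majority_vote(votes, threshold=12):
    """Equal-weight vote: call a functional interaction if more than `threshold` say yes."""
    return sum(bool(v) for v in votes) > threshold


subsets = stratified_feature_subsets(STRATA)
print(len(subsets), len(subsets[0]))
```

With 24 voters a 12-12 tie is possible; under the "> 12" rule a tie counts as no interaction.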
PPI Data Sources
HIPPIE: Integrating Protein Interaction Networks with Experiment Based Quality S
Schaefer MH, et al. PLoS ONE 7(2): e31826. (2012)
PSICQUIC: a community standard for programmatic access to molecular interactio
Bruno Aranda, et al. Nature Methods 8, 528–529 (2011)
http://psicquic.googlecode.com
IntAct—open source resource for molecular interaction data
Kerrien S, et al. NAR, Database issue, D561-5 (2007)
http://www.ebi.ac.uk/intact
EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in
Vilella AJ, et al. Genome Res, 327-35 (2009)
http://ensembl.org/info/docs/compara/index.html
Gene Co-expressions:
COXPRESdb v5.0
Using Microarray data downloaded from ArrayExpress
Normalize raw data using the RMA method
Calculate PCC for every pair of genes
Calculate MR(Mutual Rank) from PCC
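The last two steps can be sketched on a toy expression matrix. Mutual Rank is taken here as the geometric mean of the two directional correlation ranks, MR(A,B) = sqrt(rank(A→B) × rank(B→A)), where rank(A→B) is the position of B when all other genes are sorted by decreasing PCC with A. The gene names and expression values are made up for illustration, and RMA normalization is assumed to have been done already.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mutual_rank(expr):
    """expr maps gene -> expression vector; returns {(g, h): MR} for g != h."""
    genes = list(expr)
    pcc = {(g, h): pearson(expr[g], expr[h])
           for g in genes for h in genes if g != h}
    rank = {}
    for g in genes:
        # Rank 1 = the partner most strongly correlated with g.
        partners = sorted((h for h in genes if h != g),
                          key=lambda h: -pcc[(g, h)])
        for r, h in enumerate(partners, start=1):
            rank[(g, h)] = r
    return {(g, h): math.sqrt(rank[(g, h)] * rank[(h, g)]) for (g, h) in pcc}


# Hypothetical, already-normalized expression values (4 genes x 4 arrays).
EXPR = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.5],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, 3.0, 2.0, 4.0],
}
mr = mutual_rank(EXPR)
print(mr[("A", "B")], mr[("A", "C")])
```

A smaller MR means stronger mutual co-expression. Unlike raw PCC, it depends on the rank in both directions, so one gene ranking another highly is not enough.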
Species       Platform        # of genes  # of Microarrays (GeneChip)  Release Date
Homo sapiens  HG-U133_Plus_2  19,803      73,083                       2012.08.29
COXPRESdb: a database of comparative gene coexpression
networks of eleven species for mammals. Obayashi T, et al. (2013)
Nucleic Acids Res. 41, D1014-20
Gene Co-expressions:
RNA-seq
Hong SJ., et al. Canonical Correlation Analysis for RNA-seq Co-expression
Networks. Nucleic Acids Research. 2013.
Traditional statistical methods designed for microarray data do not use all of
the information contained in RNA-seq data, such as expression at the exon,
single-nucleotide polymorphism (SNP), and positional levels; splicing;
post-transcriptional RNA editing across the entire gene; and isoform- and
allele-specific expression.
R package of Canonical Correlation based RNA-Seq Co-expression Network
TCGA LUSC RNA-seq dataset
Interaction Data Sources
Training and validation of the functional
interaction classifier
Build training and test FI sets from
hiPathDB
Functional interaction (FI)?
Two proteins are involved in the same biochemical reaction
as an input, catalyst, activator, or inhibitor, or as two
members of the same protein complex
10-fold cross-validation
Method                  Positive/Negative Training Data Ratio  Link Threshold  Accuracy
Naïve Bayes classifier  10:1                                   0.5             92.15%
Naïve Bayes classifier  100:1                                  0.5             97.72%
Random naïve Bayes      10:1                                   0.5             96.47%
Random naïve Bayes      100:1                                  0.5             99.99%
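For readers unfamiliar with the procedure, 10-fold cross-validation can be sketched as follows: shuffle the labeled pairs, split them into ten folds, train on nine and test on the held-out fold, then pool the correct predictions into one accuracy. The `fit` and `predict` callables are hypothetical stand-ins for whichever classifier is being evaluated. The slides do not say how the folds were drawn, so the random split here is an assumption.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_validate(items, labels, fit, predict, k=10, seed=0):
    """Pooled accuracy over k folds; fit/predict are user-supplied callables."""
    correct = 0
    for train, test in k_fold_indices(len(items), k, seed):
        model = fit([items[j] for j in train], [labels[j] for j in train])
        correct += sum(predict(model, items[j]) == labels[j] for j in test)
    return correct / len(items)
```

Each item is used for testing exactly once, so the pooled count divided by the number of items is the overall accuracy.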
Results
Number of interactions (probability cut-off: 0.5)
HiFI         183716
Reactome FI  169988
Sharing rate of GO annotations
             BP     CC     MF
HiFI         0.815  0.904  0.753
Reactome FI  -      0.962  -
HiFI Web Interface
Case study
Point mutations and CNV genes
TCGA glioblastoma multiforme (GBM) data set
Extract a subnetwork around these genes in HiFI
Find modules by applying a clustering algorithm to the subnetwork
Evaluate the statistical significance of the modularity
Test whether mutated genes in modules are significantly distributed over multiple samples
Annotate modules using pathways and GO terms via over-representation analysis
Glioblastoma multiforme (GBM)
the most common and aggressive brain tumor in humans
the first cancer type to undergo comprehensive genomic characterization
by The Cancer Genome Atlas (TCGA) project
the TCGA GBM project has cataloged somatic mutations and recurrent
copy number alterations in GBM, and has identified frequent alterations in
the p53, RB, PI3-kinase (PI3K) and receptor tyrosine kinase (RTK) signaling
pathways
Identify frequently altered network modules and candidate driver mutations in GBM
GBM data set
Cerami E, et al., “Automated Network Analysis Identifies Core Pathways in
Glioblastoma”, PLoS One. 2010 Feb 12;5(2):e8918
84 GBM cases with both sequence mutation and copy number data
Each gene was considered altered if modified by a validated non-synonymous
somatic nucleotide substitution, a homozygous deletion or a multi-copy
amplification
Kept genes that were altered in two or more of the final 84 cases
517 altered genes, 259 of which had interactions in HiFI
Extract GBM-specific functional network
For each pair of the 259 genes, we found all shortest paths in HiFI (path length
threshold = 2)
To identify statistically significant linker genes, we used the
hypergeometric distribution to assess the probability that the linker gene
would connect to the altered genes
FDR correction via Benjamini-Hochberg, p-value threshold = 0.05
96 GBM altered genes + 6 linker genes
Captures the majority of proteins and interactions in a human-curated map of the
molecular pathways involved in GBM: 96% of
proteins (70 of 73) and 69% of interactions (129 of 187)
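The linker-gene test described on this slide can be sketched as three small steps: find non-altered genes that sit between two altered genes (a path of length 2), score each with a hypergeometric tail probability, and adjust the p-values with Benjamini-Hochberg. The parametrization below is an assumption, not a detail from the slides: the population is all other genes in the network, the "successes" are the altered genes, and the draws are the linker's neighbors. The toy network is invented.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are altered."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / comb(N, n)


def candidate_linkers(graph, altered):
    """Non-altered genes adjacent to at least two altered genes."""
    return {g: nbrs & altered for g, nbrs in graph.items()
            if g not in altered and len(nbrs & altered) >= 2}


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


def significant_linkers(graph, altered, alpha=0.05):
    """Return {linker: adjusted p-value} for linkers passing the FDR cut-off."""
    N = len(graph) - 1                     # every gene except the linker itself
    K = len(altered & graph.keys())
    cands = candidate_linkers(graph, altered)
    names = sorted(cands)
    pvals = [hypergeom_sf(len(cands[g]), N, K, len(graph[g])) for g in names]
    return {g: q for g, q in zip(names, benjamini_hochberg(pvals)) if q <= alpha}


# Invented toy network: 4 altered genes, a specific linker L, and a hub H.
altered = {"A1", "A2", "A3", "A4"}
background = {f"B{i}" for i in range(1, 15)}
graph = {g: set() for g in altered | background | {"L", "H"}}
for u, v in ([("L", a) for a in altered]
             + [("H", "A1"), ("H", "A2")]
             + [("H", b) for b in background]):
    graph[u].add(v)
    graph[v].add(u)

linkers = significant_linkers(graph, altered)
print(linkers)
```

In the toy network both L and H lie between altered genes. Only L passes the cut-off, because the hub H connects to nearly every gene and its two altered neighbors are unsurprising.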
Network clustering
MCL algorithm
van Dongen S: Graph Clustering by Flow Simulation. PhD thesis.
University of Utrecht; 2000.
calculate the overall modularity of the partitioned GBM network
a total of 10 modules, with an overall network modularity of 0.519
the statistical significance of the observed network modularity in relation to a
null model of random networks of the same size and same degree distribution
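The modularity of a partition can be computed directly from the edge list and the module assignment. The sketch below uses the Newman-Girvan definition, Q = Σ_c [ e_c / m − (d_c / 2m)² ], where e_c is the number of edges inside module c, d_c the total degree of its nodes, and m the number of edges. That this is the definition behind the reported 0.519 is an assumption. The MCL clustering itself is not reproduced here, and the example graph is invented.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity of an undirected, unweighted partition.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> module id.
    """
    m = len(edges)
    internal = {}   # module -> number of edges with both ends inside it
    degree = {}     # module -> summed degree of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Invented example: two triangles joined by a single edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
MODULES = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(EDGES, MODULES)
print(q)
```

To assess significance as the slide describes, the same function would be applied to the clustering of many random networks with the same size and degree distribution, and the observed Q compared with that null distribution.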
Functional annotation
p53 tumor suppressor pathway
Prevents the propagation of unstable genomes; frequently altered in
glioblastoma
RB pathway
Glioblastomas also nearly universally circumvent cell cycle inhibition through
genetic alterations to the RB pathway
Phosphatidylinositol 3-kinase-AKT (PI3K/AKT) pathway
Major downstream effects of PI3K/AKT activation include cell growth,
proliferation and survival
Summary
The HiFI network-based approach identifies many of the same candidates
as the original frequency-based approach used to assess mutational
significance.
Furthermore, this approach can automatically identify and extract
biologically relevant GBM modules, which correspond closely to previously
known GBM biology.
http://hifi.kobic.re.kr
[email protected]
PAMES (PAthway Mapping &
Editing System)
Example of a KEGG pathway
Extracting KEGG pathway information
• Entry : description of each object
– Map, enzyme, compound, gene, etc.
– Check object type, size, and position
• Reaction : substrate and product of each enzyme reaction
– Edge information on the map
• Relation : extract connection information such as map/link
<pathway name="path:aae00010" org="aae" number="00010"
title="Glycolysis / Gluconeogenesis"
image="http://www.genome.ad.jp/kegg/pathway/aae/aae00010.gif"
link="http://www.genome.ad.jp/dbget-bin/show_pathway?aae00010"><entry
id="1" name="aae:aq_186" type="product" reaction="rn:R00710"
link="http://www.genome.ad.jp/dbget-bin/www_bget?aae+aq_186"><graphics
name="aldH1" fgcolor="#000000" bgcolor="#BFFFBF" type="rectangle"
x="170" y="1018" width="45" height="17" />
</entry><entry id="2" name="aae:aq_2103 aae:aq_2104"
type="product" reaction="rn:R00235"
link="http://www.genome.ad.jp/dbget-bin/www_bget?aae+aq_2103+aq_2104"><graphics name="acs'..."
fgcolor="#000000" bgcolor="#BFFFBF" type="rectangle" x="102" y="916"
width="46" height="17" />
</entry>
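Extracting entry type, size, and position from such a file can be sketched with Python's standard XML parser. This is an illustration, not the PAMES code: the embedded string is an abridged copy of the excerpt above with a closing `</pathway>` added so that it parses, and the dictionary layout returned by `parse_entries` is a hypothetical choice.

```python
import xml.etree.ElementTree as ET

# Abridged from the slide's excerpt; the closing </pathway> tag is added here.
KGML = """<pathway name="path:aae00010" org="aae" number="00010"
    title="Glycolysis / Gluconeogenesis">
  <entry id="1" name="aae:aq_186" type="product" reaction="rn:R00710">
    <graphics name="aldH1" fgcolor="#000000" bgcolor="#BFFFBF"
        type="rectangle" x="170" y="1018" width="45" height="17" />
  </entry>
  <entry id="2" name="aae:aq_2103 aae:aq_2104" type="product"
      reaction="rn:R00235">
    <graphics name="acs'..." fgcolor="#000000" bgcolor="#BFFFBF"
        type="rectangle" x="102" y="916" width="46" height="17" />
  </entry>
</pathway>"""


def parse_entries(kgml_text):
    """Return one dict per <entry> with its genes, reaction, and box geometry."""
    root = ET.fromstring(kgml_text)
    entries = []
    for entry in root.findall("entry"):
        g = entry.find("graphics")
        entries.append({
            "id": entry.get("id"),
            "genes": entry.get("name").split(),   # one entry may list several genes
            "reaction": entry.get("reaction"),
            "label": g.get("name"),
            "shape": g.get("type"),
            "x": int(g.get("x")),
            "y": int(g.get("y")),
            "width": int(g.get("width")),
            "height": int(g.get("height")),
        })
    return entries


entries = parse_entries(KGML)
for e in entries:
    print(e["id"], e["genes"], (e["x"], e["y"], e["width"], e["height"]))
```

The `x`, `y`, `width`, and `height` values are what a layout step would use to place each object; entries that share a reaction id are candidates for being joined by an edge.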
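Extracting entry positions
Edge generation and semi-auto layout
Editing pathway
• Object : Gene, Compound, Map
– Move, Resize
– Delete
– Create : using the ToolBox
• Path : edges
– Move, Resize
– Delete
– Create : using the ToolBox
– Using pivot points
– Adding and deleting arrows
– Rotation
Image insertion and editing
• Images (jpg, jpeg) can be added
• Image editing
– Move, resize
• The images needed for each map are prepared manually
Label editing
Pathway Editing
KEGG reference map – creating PAT files
• 178 XML files – reference maps
• Extracting the required images
– Cropping the needed images out of the KEGG image files
and erasing the rest
• Manual editing after semi-auto layout
– Fixing edges where an edge collision could not be avoided
– Fixing labels
– Fixing overlapping edges
• Saving to the local pathway DB
– Saved as 178 maps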
Pathway Mapping
Mapping genes
Applying microarray profile => RNA-seq processing
Down regulated genes
Up regulated genes