Transcript Slides

The Bytes of Biological Data
Artemis G. Hatzigeorgiou
Professor of Bioinformatics
Department of Electrical and
Computer Engineering
University of Thessaly
Hellenic Institute Pasteur
“Athena” Research Center
What is Bioinformatics?
• Bioinformatics is generally defined as the
analysis, prediction, modeling and storage of
biological data with the help of computers
Next Generation Sequencing
COSTS
Analysis
Sequencing
10%
90%
The central dogma
What are microRNAs (miRNAs)?
miRNAs are about 22 nt long RNAs.
They post-transcriptionally regulate protein coding gene expression.
[Diagram: Gene B: DNA → (Transcription) → RNA → (Translation) → PROTEIN]
MicroRNAs are involved in …
Development
stem cell proliferation
Division
Differentiation
regulation of innate & adaptive immunity
apoptosis
cell signaling
metabolism
human pathologies:
cancer, viral infections, cardiovascular diseases, metabolic disorders, neurological pathologies, psychiatric disorders, renal disease, hepatological conditions, autoimmune diseases, gastroenterological conditions, obesity, reproductive disorders, musculoskeletal disorders, periodontal pathologies
Superlinear Increase of known miRNAs
and relevant Research
Active Pathway Visualization
Citation: Wang D, Yan KK, Sisu C, Cheng C, Rozowsky J, Meyerson W, et al. (2015) Loregic: A Method to Characterize the Cooperative Logic of Regulatory Factors. PLoS Comput Biol 11(4): e1004132. doi:10.1371/journal.pcbi.1004132
Location of miRNAs
[Figure: miRNA loci drawn either downstream of their own Pol2 promoter or between the exons of a host gene; the diagram labels the two cases 70% and 30%]
Why are the pri-miRNA genes not annotated ?
Fast degradation in the nucleus
Megraw, M., Baev, V., Rusinov, V., Jensen, S.T., Kalantidis, K., Hatzigeorgiou, A.G.
MicroRNA promoter element discovery in Arabidopsis (2006) RNA, 12 (9), pp.
1612-1619.
Recognition of Transcription Start Sites for pri-microRNA genes
• Weight matrices of Transcription Factors
• ChIP-Seq data of Pol II occupancy
• ChIP-Seq data of histone modifications (H3K4me3)
• Cap Analysis of Gene Expression (CAGE)
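The first item in the list above, a transcription factor weight matrix, can be scored against a DNA sequence roughly as follows. This is a minimal sketch: the matrix values, the uniform background, the function names and the example sequence are all illustrative assumptions, not data from the talk.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif, one dict per position.
# The numbers are invented for illustration only.
PFM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds_score(window):
    """Sum of log2(frequency / background) over the motif positions."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PFM, window))


def best_hit(sequence):
    """Return (offset, score) of the best-scoring window in an uppercase ACGT sequence."""
    width = len(PFM)
    scored = [
        (log_odds_score(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, offset = max(scored)
    return offset, score


offset, score = best_hit("GGCTATAGGC")
```

In this toy sequence the best window is the TATA at offset 3. A real scan would use curated matrices and both strands.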
ChIP Sequencing Visualization
H3K4me3
Pol2
Drawback: wide range of predictions
Experimental identification of miRNA TSSs
Drosha null/conditional-null
(DroshaLacZ/e4COIN) mouse
model has been generated
using the conditional by
inversion (COIN) methodology
from Aris Economides @
REGENERON Pharmaceuticals
Economides, A.N. et al. Conditionals by inversion provide a universal
method for the generation of conditional alleles. Proceedings of the
National Academy of Sciences Aug 20;110(34):E3179-88 (2013).
RNA-seq read depth is essential!
RNA-seq coverage over the Mir17hg lncRNA locus
[Figure: normalized read count over the 8,856 bp Mir17hg locus (Mir17, Mir18, Mir19a, Mir20a, Mir19b-1, Mir92-1) in three tracks: Drosha -/- mESCs with 27M reads, Drosha +/+ mESCs with 19M reads, and GSM973235 WT mESCs with 180M reads]
…but (deep RNA-seq is) not enough
RNA-seq
coverage
miRNAs
putative
TSS
Which one is correct?
ChIP-seq information can effectively reduce the number of putative TSSs
TF footprints
H3K4me3
RNA-seq coverage
miRNAs
Pol2
putative TSS
Algorithm - First step: identify candidate TSSs
• Raw RNA-seq reads
• Map reads on the reference genome (mm10). Reads tend to cluster over the expressed genomic regions
• Apply a sliding window around miRNAs
• Filter the candidate transcription start sites
[Figure tracks on mm10: coding, miRNA, putative TSS]
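The sliding-window idea in this first step could look something like the sketch below. It assumes per-base RNA-seq coverage on the plus strand is already available as a list. The window size, the coverage threshold and the function name are hypothetical choices for illustration, not microTSS parameters.

```python
def candidate_tss(coverage, mirna_start, window=5, min_mean=2.0):
    """Return upstream positions where a read cluster begins.

    coverage: per-base read depth on the plus strand (list of numbers).
    mirna_start: 0-based index of the miRNA precursor within `coverage`.
    """
    candidates = []
    previously_expressed = False
    # Slide a fixed-size window from the start of the region up to the miRNA.
    for start in range(0, mirna_start - window + 1):
        depth = coverage[start:start + window]
        expressed = sum(depth) / window >= min_mean
        # A silent-to-expressed transition marks the 5' edge of a read cluster.
        if expressed and not previously_expressed:
            first_covered = next(
                (i for i in range(start, start + window) if coverage[i] > 0),
                start,
            )
            candidates.append(first_covered)
        previously_expressed = expressed
    return candidates


# Toy locus: two read clusters upstream of a miRNA placed at index 45.
toy_coverage = [0] * 10 + [10] * 10 + [0] * 10 + [8] * 20
toy_candidates = candidate_tss(toy_coverage, mirna_start=45)
```

Each 5' edge of a read cluster upstream of the miRNA becomes a candidate, which is why several putative TSSs can survive this step and further filtering is needed.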
Algorithm - second step: Training of SVMs
An algorithm that can learn from examples: machine learning
Here we used Support Vector Machines: a supervised machine learning approach.
Training with:
• positive examples (protein coding TSS)
• negative examples (random intergenic locations, flanking positions)
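To make the training setup concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss. It is a simplified stand-in: the slides do not state which SVM implementation, kernel or features were used, so the two features (H3K4me3 and Pol2 signal), the toy values and all function names are assumptions.

```python
def train_linear_svm(samples, labels, epochs=200, eta=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss.

    samples: list of feature vectors; labels: +1 (TSS) or -1 (background).
    Returns (weights, bias).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Examples inside the margin pull the boundary towards their side.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b


def decision(w, b, x):
    """Signed score of a site: positive means TSS-like."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Hypothetical features per site: (H3K4me3 signal, Pol2 signal), scaled to [0, 1].
positives = [(0.9, 0.8), (0.8, 0.9), (1.0, 0.7)]  # stand-ins for protein coding TSSs
negatives = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3)]  # stand-ins for intergenic sites
samples = positives + negatives
labels = [1] * len(positives) + [-1] * len(negatives)
w, b = train_linear_svm(samples, labels)
```

A candidate miRNA TSS would then be kept when `decision(w, b, features)` is positive. In practice an established SVM library and many more training sites would be used.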
Algorithm overview
First step
Second step
Final step
Comparison between microTSS and available
algorithms
• No prediction filtering based on distance
• Predictions located less than 1,000 bp from the validated TSS are
considered True Positives and the rest are considered False Positives.
• Precision = TP / (TP+FP)
• Sensitivity = Correctly predicted TSSs / Total validated TSSs
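The evaluation rules above can be sketched as follows. The data layout (one validated position per miRNA, a list of predicted positions per miRNA), the function name and the toy coordinates are assumptions. Sensitivity is read here as the fraction of validated TSSs recovered by at least one prediction.

```python
def evaluate(predictions, validated, max_distance=1000):
    """Return (precision, sensitivity) at the given distance threshold.

    predictions: {mirna: [predicted TSS positions]}
    validated:   {mirna: validated TSS position}
    """
    tp = fp = 0
    recovered = set()
    for mirna, positions in predictions.items():
        truth = validated.get(mirna)
        for pos in positions:
            # Less than max_distance bp from the validated TSS: true positive.
            if truth is not None and abs(pos - truth) < max_distance:
                tp += 1
                recovered.add(mirna)
            else:
                fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = len(recovered) / len(validated) if validated else 0.0
    return precision, sensitivity


# Toy example with invented coordinates.
validated = {"mir-a": 1000, "mir-b": 50000, "mir-c": 90000}
predictions = {"mir-a": [1200, 5000], "mir-b": [50900]}
precision, sensitivity = evaluate(predictions, validated)
```

Here two of three predictions are within 1,000 bp (precision 2/3) and two of three validated TSSs are recovered (sensitivity 2/3). Because precision is counted per prediction and sensitivity per validated TSS, the two numerators can differ for a given algorithm.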
Algorithms’ Precision and Sensitivity at 1 kbp distance threshold from validated TSSs in mESCs (N=47)
[Figure: precision of Marson et al, S-Peaker, PROmiRNA and microTSS as a function of distance threshold]
Algorithm       Sensitivity      Precision
Marson et al    54% (20/37)      64.5% (20/31)
PROmiRNA        78.7% (37/47)    25.4% (95/373)
S-Peaker        76.5% (36/47)    18.8% (77/409)
microTSS        93.6% (44/47)    100% (44/44)
Software on microRNA.gr
Other projects of DIANA lab on microrna.gr
• miRNA target predictions (microT)
• miRNA validated targets (TarBase)
• miRNA genomics (miRGen)
• miRNA experimentally supported targets on protein coding genes (TarBase)
• miRNA experimentally supported targets on long non-coding genes (LncBase)
• KEGG pathways analysis (mirPath)
• miRNA targets gene enrichment analysis (mirExTra)
• miRNA to disease associations
• automatic bibliographic searches
• miRNA naming history analysis
• extended connectivity to online databases
Primary data
Meta analysis
Maragkakis M, Vergoulis T, Alexiou P, Reczko M et al. DIANA-microT Web server upgrade supports Fly and Worm miRNA target prediction and
bibliographic miRNA to disease association. Nucleic Acids Research, 2011.
Database of experimentally supported targets:
DIANA-TarBase
• Initially released in 2006
– The first database to catalog published experimentally validated miRNA:gene interactions
• With more than 500,000 entries, the largest repository of experimentally validated miRNA:gene interactions
• Latest update: DIANA-TarBase v7 http://www.microrna.gr/tarbase
S. Vlachos, M. D. Paraskevopoulou, D. Karagkouni, G. Georgakilas, T. Vergoulis, I.
Kanellos, I-L. Anastasopoulos, S. Maniou, K. Karathanou, D. Kalfakakou, A. Fevgas, T.
Dalamagas and A. G. Hatzigeorgiou.
DIANA-TarBase v7.0: indexing more than half a million experimentally supported
miRNA:mRNA interactions. Nucl. Acids Res. (2014)
Semi-Automatic Curation Pipeline
• Automatic detection of microRNA related articles
• Formation of XML-based efficient tree-like structures
• Detection of microRNA mentions
• Detection of gene mentions
• Detection of miRNA-gene-interaction triplets
• Text scoring
• Meta-data insertion and mark-up
• Score-based ranking and search capabilities
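The mention and triplet detection steps of such a pipeline might be approximated as in the sketch below. The regular expression, the small gene lexicon, the list of interaction verbs and the function name are all hypothetical simplifications and do not describe the actual TarBase pipeline.

```python
import re

# Simplified pattern for names such as hsa-miR-21-5p, miR-155, miR-19b-1 or let-7a.
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|let)-\d+[a-z]?(?:-\d+)?(?:-[35]p)?\b"
)
GENE_LEXICON = ("KRAS", "PTEN", "TP53")  # hypothetical gene dictionary
INTERACTION = re.compile(r"\b(targets|represses|regulates|inhibits)\b")


def find_triplets(text):
    """Return (miRNA, interaction, gene) triplets that co-occur in one sentence."""
    triplets = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        verb = INTERACTION.search(sentence)
        if verb is None:
            continue
        mirnas = MIRNA_PATTERN.findall(sentence)
        genes = [g for g in GENE_LEXICON if re.search(rf"\b{g}\b", sentence)]
        for mirna in mirnas:
            for gene in genes:
                triplets.append((mirna, verb.group(1), gene))
    return triplets


abstract = (
    "hsa-miR-21-5p represses PTEN in tumours. "
    "let-7a targets KRAS. "
    "miR-155 was also measured."
)
triplets = find_triplets(abstract)
```

Sentence-level co-occurrence is only a first filter. The text scoring and ranking steps listed above are what let a curator review the most promising sentences first.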
Growth of interactions per method
Evaluation in Poster # 66
http://www.microrna.gr/tarbase
Integration in ENSEMBL,
the European Browser for Genomes in EBI
Long Non Coding RNAs
LncBase http://www.microrna.gr/LncBase is the largest available repository of miRNA-lncRNA interactions
• The Experimental Module contains more than 5,000 interactions between 2,958 lncRNAs and 120 miRNAs.
• The Prediction Module contains detailed information for more than 10 million interactions, between 56,097 lncRNAs and 3,078 miRNAs.
Integration into RNAcentral ( EBI )
Paraskevopoulou, M.D., Georgakilas, G., Kostoulas, N., Reczko, M., Maragkakis, M., Dalamagas, T.M., Hatzigeorgiou, A.G. DIANA-LncBase: Experimentally verified and computationally predicted microRNA targets on long non-coding RNAs (2013) Nucleic Acids Research, 41 (D1), pp. D239-D245.
miRBase
• Also interconnects entries with external resources:
DIANA-Tools
Visit us @
www.microrna.gr!
Integration of
microT &
TarBase in
miRBase
First
release
More than 130,000 visits per year, based on
Google Analytics!
Discussion
Check the citations of databases / webservers before publishing
For example, questions like these could be added for reviewers:
Have the researchers properly cited the data used?
Are the data used for training and testing available?
Can the data be reproduced?
Availability of databases through time – diachronic data
Credibility for diachronic databases/web services
Funding: Project “TOM” that is implemented under the "ARISTEIA" Action of the "OPERATIONAL PROGRAMME
EDUCATION AND LIFELONG LEARNING" and is co-funded by the European Social Fund (ESF) and National Resources.