Bioinformatics - CS intranet - The University of Hong Kong
Download
Report
Transcript Bioinformatics - CS intranet - The University of Hong Kong
HKU CS Bioinformatics Research
Siu Ming Yiu
Department of Computer Science
The University of Hong Kong
Other faculty members:
Prof. Francis Chin
Prof. TW Lam
Dr HF Ting
1
Impact of bioinformatics
Medical research
Biological research
Huge
volume
of data
e.g. finding a cancercausing gene?
Environmental study
e.g. how to remove
harmful bacteria
e.g. can we make rice
grow faster?
e.g. human genome:
3G long; Medical study:
100 persons
e.g. human gut contains
1000+ bacteria (data:
500G) obesity
Biofuel
e.g. how bacteria
digest food to
produce energy?
2
The de novo assembly problem (single genome)
Given an unknown genome,
Genome X
NO existing technology is able to read out the DNA sequence
(ACCG…..) of it as the sequence is too long (e.g. human = 3 billions long;
even bacteria are about 10k – several millions).
What we can do?
High-throughput sequencing technology (next generation
sequencing (NGS)):
………………….
DNA
sequencing
machine
[Inside the machine, the genomes are
randomly cut into short fragments
(reads), the machine can read out the
DNA sequence of the reads.]
Multiple copies of
Genome X
GTCG ACCG
CTAG
CAAG
CTTG
AACG CTCG
GTCG
GTTG GGAG
3
Bad news
Multiple copies
of Genome X
(1) The reads are really short: 100-150 bp (c.f. genome of a
bacterium – 10K to several millions).
(2) They are mixing together (no idea where from the
genome each read is from!!).
(3) There are errors in the read. [AACCGTTC =>
AACGGTCC]
The (de novo) assembly problem:
Can we reconstruct the original genome from the reads?
4
Data volume: HUGE!!
Take human genome as an example.
The genome is of 3x109 (3 billion) long.
Recall: multiple copies are cut (fragmented). At any
position of the genome, multiple copies of reads may be
obtained.
The average number of
copies of reads from
each position of a
genome is referred as
……………….
the depth of the
sequencing.
Note that they are
mixed together, no
ordering information
For depth = 30,
# of reads:
(3x109x30)/100 ≈ 109
5
Good news
There are some clues inside the reads:
The reads are overlapping!
Unknown genome: AACCGGTTGCACGTTCCACTTGGCC………
CCGGTTGTC
TGTCACGTT
ACCGGTTGT
AACCGGTTG
GGTTGTCAC
CGGTTGTCA
TTGTCACGT
GTTGTCACG
Ideal case: every position has at least one read, no
errors in the read, then….
[But the reality…. is a lot worse]
6
Unknown genome: AACCGGTTGCACGTTCCACTTGGCC………
CCGGTTGTC
AACCGCTTG
GGTTGTCAC
TGTCACGTT
CGGTTGTCA
ACCGGTTTT
TTGTCACGT
GTTGTCACG
The reality:
(a) There are errors in the reads; not easy to locate the
next read!
(b) At some positions, we may have no reads.
7
Publications
Bioinformatics (impact factor: 5.323)
BMC genomics (impact factor: 4.4)
PloS One (impact factor: 3.73)
BMC bioinformatics (impact factor: 3.02)
Journal of Computational Biology (impact factor: 1.56)
IEEE/ACM TCBB (impact factor: 1.54)
……
Top conferences: RECOMB, ISMB, ECCB
Nature papers with our collaborators
HKU-BGI research center:
BGI (Shenzhen) is the largest genomic center in the world
Other international collaborators:
JGI, dept. of energy, US (biofuel); Sidekid hospital, Canada (diabetes);
CAS-MPG PICB, Shanghai (C4 Rice project); UC San Francisco (Optical
mapping data analysis); NUS, Singapore (RNA study); ….
8
How to solve the problem?
A few general approaches
String graph, de Bruijn graph, …
Genome
Idea: we still make use of the
…. A C G T G T A C C T C…….
overlapping parts in reads to
connect them together. We
do not need reads of every
Read
G T G T A C C T C (k = 4)
position.
-------------------------Graph:
Vertex: k-mer (k consecutive GTGT TGTA GTAC TACC ACCT CCTC
nucleotides in a read)
Edge: two k-mers appear
consecutively in a read
9
Genome: A A C G A C G T G T A C C T C A G T
AACGACGTG
ACGACGTGT
CGACGTGTA
GACGTGTAC
ACGTGTACC
CGTGTACCT
GTGTACCTC
TGTACCTCA
GTACCTCAG
TACCTCAGT
Reads
(len = 9)
AACG
ACGA
CGAC
Ideal case
- No errors
- Reads at every position
- The graph can read out
one single path, that will
be the genome!
GTGT
ACGT
GACG
CAGT
GTAC
CGTG
TGTA
CCTC
TCAG
CTCA
ACCT
TACC
10
Genome: A A C G A C G T G T A C C T C A G T
AACGACGTG
ACGACGTGT
CGACGTGTA
GACGTGTAC
ACGTGTACC
CGTGTACCT
GTGTACCTC
TGTACCTCA
GTACCTCAG
TACCTCAGT
Reads
(len = 9)
AACG
ACGA
CGAC
Note: even a few reads are
missing, we are still ok!
Can anyone see that how
many reads can be missed
depends on the value of k
(when constructing the
graph!)?
Q: to allow more missing
reads, larger or smaller k is
better?
GTGT
ACGT
GACG
CAGT
GTAC
CGTG
TGTA
CCTC
TCAG
CTCA
ACCT
TACC
11
Genome: A A C G A C G T G T A C G
CTCAGT
Reads
(len = 9)
AACGACGTG
ACGACGTGT
CGACGTGTA
GACGTGTAC
ACGTGTACC
CGTGTACCT
GTGTACCTC
TGTACC
GT C A
GTACCTCAG
TACCTCAGT
ACGT
CGTG
CGTC
Contigs: Maximal path
without branches/paths
contig
CGAC
GACG ACGT
CGACGT
Real case is more complicated:
Even no error, in a genome, some patterns may repeat!
In reality, we seldom can construct the whole genome in one
piece, but stop at junctions, resulting with a set of contigs
12
A part of the de Bruijn graph for Ecoli (~4M long); you
can imagine how complicated for human genome (3G long)
13
Conclusions
Our team:
Core Faculty members:
Prof. Francis Chin, Prof. TW Lam, me
1 Research Assistant Professor (Henry Leung)
1 Postdoc (Jianyu Shi)
about 8 PhD/master students + a team in HKU-BGI Lab
Some collaborators:
Beijing Genome Institute at Shenzhen (BGI)
- HKU-BGI Laboratory
HKU medical schools; life science departments
Sickkids hospital, Canada
JGI, DoE, US
CAS-MPG PICB, Shanghai (C4 Rice project)
UC San Francisco (Pui’s group)
GIS (Genome Institute at Singapore)
Universities: NUS, CUHK, U of Liverpool etc.
<Thank you>
14