Bioinformatics - Department of Statistics Oxford

Bioinformatics
Gil McVean
Department of Statistics
What is it to be a human?
What is it to be an individual?
Species                Diversity (percent)
Humans                 0.08 - 0.1
Chimpanzees            0.12 - 0.17
Drosophila simulans    2
E. coli                5
HIV1                   30
Photos from UN photo gallery www.un.org/av/photo
Is it your genes?
Is it your transcripts?
Is it your proteins?
Is it your protein interactions?
Is it your systems?
Bioinformatics and genome biology
• Bioinformatics is the analytical wing of genome biology
• It concerns itself with large amounts of data (more than you can look at!)
• It uses computers and efficient algorithms
• It is
– Data assembly
– Data summary
– Data modelling
– Data analysis
The raw material
The output
Classical bioinformatics I: DNA and protein sequence alignment
Classical bioinformatics II: Genome assembly
Classical bioinformatics III: Gene finding
Classical bioinformatics IV: Protein structure prediction
Bioinformatics of genetic variation
• An area of considerable current attention is human genetic variation
• The aim of current experiments is to map the genetic basis of human phenotypic variation
– Disease susceptibility
– Normal variation
• It is challenging because of
– The scale of the data
– The structure of the data
– The underlying processes that shape variation
• Bioinformatics is needed to
– Assemble, collate, check and summarise data
– Model the data
– Make inferences
What does the data look like?
• Single Nucleotide Polymorphisms (SNPs)
• Insertion-Deletion Polymorphisms (INDELs)
TGCTTGGCAGGGCAGACTGACTGT
TGCTTGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAGACTGACTGT
   ^           ^
   SNP         INDEL
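One way to see how the two kinds of variant show up in the data is to scan the alignment column by column. The sketch below is only an illustration: the function name `classify_columns` is made up for this example, and it assumes the sequences are already aligned, with `-` marking a gap.

```python
# Toy alignment copied from the slide; '-' marks a gap.
ALIGNMENT = [
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
]


def classify_columns(alignment):
    """Return (position, kind, alleles) for each variable column (1-based)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    variants = []
    for i in range(length):
        # Distinct characters observed in this column across all sequences.
        alleles = sorted({seq[i] for seq in alignment})
        if len(alleles) < 2:
            continue  # invariant column
        kind = "INDEL" if "-" in alleles else "SNP"
        variants.append((i + 1, kind, alleles))
    return variants


variants = classify_columns(ALIGNMENT)
for position, kind, alleles in variants:
    print(position, kind, "/".join(alleles))
```

On this alignment the scan reports a SNP (A/T) at position 4 and an INDEL at position 16. Real variant discovery also has to deal with sequencing error and alignment uncertainty, which this sketch ignores.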
Collections of SNPs
Figure panels: HCB, JPT, YRI, CEU (SNP)
Engineering challenges
• Identifying SNPs
• Working out which SNPs will work on a given platform
• Controlling the genotyping work-flow
• Controlling the output quality
• Performing quality-assurance exercises
• Identifying problems, gaps and inconsistencies
A Bioinformatics problem: How small is my P-value?
• The basic idea of association studies is to look for genetic differences between groups
It is easy to ask the question “Is there a significant difference in the frequency of a mutation between groups?”
Figure: Cases (D) and Controls (C) compared at the locus of interest
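The "easy" question can be written down as a test on a 2x2 table of allele counts in cases and controls. The sketch below uses a Pearson chi-square test with one degree of freedom; this is one common choice and is not taken from the slides. The function name and the counts are hypothetical.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero count")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele frequency is the same in both groups.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 60 of 200 case alleles and 40 of 200 control alleles
# carry the mutation.
chi2, p = allele_chi_square(60, 140, 40, 160)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

A single test like this says nothing about the number of mutations tested or about confounding between groups.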
The problems
• In a study of several hundred thousand mutations (or even millions) it is unlikely that we have actually typed the causal variant(s).
• In a study of several hundred thousand mutations (or even millions), even if NONE of them are causal a lot of them will show significance at the 5%, 1% or even 0.01% level
• Differences in the frequency of disease incidence between groups (for example African Americans and European Americans) will be associated with ANY genetic difference between them
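The second problem is simple arithmetic: if no mutation is causal, each test still has probability alpha of passing a significance level alpha, so the expected number of false positives is the number of tests times alpha. The sketch below uses a hypothetical study size of 500,000 SNPs. The Bonferroni threshold is shown as one standard correction and is not something the slides prescribe.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing level alpha when every null is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test level that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


N_SNPS = 500_000  # hypothetical study size, not a figure from the slides

for alpha in (0.05, 0.01, 0.0001):  # the 5%, 1% and 0.01% levels
    count = expected_false_positives(N_SNPS, alpha)
    print(f"alpha = {alpha}: about {count:.0f} false positives expected")

print(f"Bonferroni per-test threshold: {bonferroni_threshold(N_SNPS):.1e}")
```

The expected count holds whether or not the tests are independent; Bonferroni is conservative when nearby mutations are correlated.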
What we really want to ask
• “Does any of the genome show an association with disease over and above any effect I might expect from the correlation between genotype and environmental risk?”
• “If so, what is the most likely position for the causal mutation(s)?”
• Answering these questions is difficult, but a natural way to approach the problem is to model the process
Modelling genetic variation
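Figure: evolutionary parameters give rise to a population through a stochastic evolutionary process; a sample is drawn from the population through a stochastic sampling process; inference works back from the sample.
Evolutionary processes shown: selection, mutation, genetic drift, recombination, migration
Sample sequences:
ATGCATGGGCTATTGGACCT
ATGGATGGGCTATTGCACCT
ATGCATGGGCAATTGCACCT
ATGCATGGGCAATTGGACCT
ATGGATGGGCTATTGCACCT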
Genes in populations
Present day
Ancestry of current population
Present day
Ancestry of sample
Present day
The coalescent: samples in populations
Figure labels: most recent common ancestor (MRCA), coalescence, ancestral lineages; time axis ending at the present day
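The genealogy of a sample can be simulated backwards in time. The sketch below assumes the standard neutral coalescent with constant population size, which the slides do not spell out. Under that model, while k ancestral lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 in scaled time units, and the expected time to the MRCA of n samples is 2(1 - 1/n). Function names are hypothetical.

```python
import random


def simulate_tmrca(n_samples, rng):
    """One draw of the time to the MRCA, in scaled coalescent time units."""
    t = 0.0
    k = n_samples  # number of ancestral lineages still present
    while k > 1:
        rate = k * (k - 1) / 2.0  # number of lineage pairs that could merge
        t += rng.expovariate(rate)
        k -= 1  # one coalescence merges two lineages into one
    return t


def expected_tmrca(n_samples):
    """Theoretical mean time to the MRCA under the same model."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(1)  # fixed seed so repeated runs match
n = 10
draws = [simulate_tmrca(n, rng) for _ in range(20000)]
mean_tmrca = sum(draws) / len(draws)
print(f"simulated mean: {mean_tmrca:.3f}, theory: {expected_tmrca(n):.3f}")
```

The simulated mean should sit close to the theoretical value; most of the total time is spent waiting for the last two lineages to coalesce.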
How does this help us to think about mapping disease?
• Individuals are related to each other through their genealogical history
• Two nearby points on the genome will have similar genealogical histories, a result of which is that mutations at these positions will also be correlated
• Understanding how genealogical history changes along the genome (through recombination) and between populations (through historical demography) will allow us to
– Construct more powerful tests for disease association
– Localise disease-associated mutations
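The correlation between mutations at two positions is often summarised by the linkage disequilibrium statistic r². The statistic is not named in the slides and is used here as one standard measure. The sketch below computes it for the five sample sequences from the modelling slide, treating them as phased haplotypes; the function name is hypothetical.

```python
# Sample sequences copied from the "Modelling genetic variation" slide.
HAPLOTYPES = [
    "ATGCATGGGCTATTGGACCT",
    "ATGGATGGGCTATTGCACCT",
    "ATGCATGGGCAATTGCACCT",
    "ATGCATGGGCAATTGGACCT",
    "ATGGATGGGCTATTGCACCT",
]


def r_squared(haplotypes, site_a, site_b):
    """r^2 between two biallelic sites (0-based indices) in phased haplotypes."""
    col_a = [h[site_a] for h in haplotypes]
    col_b = [h[site_b] for h in haplotypes]
    alleles_a, alleles_b = sorted(set(col_a)), sorted(set(col_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic")
    # r^2 does not depend on which allele is tracked at each site.
    a, b = alleles_a[0], alleles_b[0]
    n = len(haplotypes)
    p_a = col_a.count(a) / n
    p_b = col_b.count(b) / n
    p_ab = sum(1 for x, y in zip(col_a, col_b) if x == a and y == b) / n
    d = p_ab - p_a * p_b  # deviation from independence between the sites
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# The variable sites are at 0-based indices 3, 10 and 15.
for i, j in [(3, 10), (3, 15), (10, 15)]:
    print(f"sites {i + 1} and {j + 1}: r^2 = {r_squared(HAPLOTYPES, i, j):.3f}")
```

With only five sequences the values are very noisy; the point is the calculation, not the pattern. In real data r² tends to fall with distance as recombination breaks up shared genealogy.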
The bioinformatics module
• Genomic technologies
• Annotating genomes
• Modelling gene evolution
• Mapping disease genes
• Measuring gene and protein expression
• Predicting protein structure