The Human Genome Project
• Main reference: Nature (2001) 409, 860-921
• http://www.abdn.ac.uk/~gen155/lectures/hgpcore.ppt
• http://www.nature.com/ng/web_specials/
• Whole issue also available from Nature Genome
Gateway
www.nature.com/genomics/human/
• Describes the publicly funded project; Celera’s
privately funded sequence was published in Science
Main points
• Basic genome statistics
• Genome browsers e.g. UCSC, Ensembl
• Genomic “landscape”
• Repeated DNA as a “fossil record”
• Number of genes
• Polymorphism
• Applications
The Strategy
• The genome sequence was a multinational
collaboration involving 100s of scientists,
millions of dollars, many countries
• The strategy was “top-down” using methods
developed on small genomes (e.g. yeast)
• Figure 2 in the Nature paper
Genome statistics
• Total size = 3290 Mb
• 212 Mb of heterochromatin
• Chromosomes range from 279 Mb (#1) to
45 Mb (#21) (fig 9, table 8 in paper)
• Total “raw” sequence 23,000 Mb
• Number of genes = about 31,000
• About 30% of the genome is transcribed
• About 1.5% of the genome is protein coding
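The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers on this slide; the variable names are illustrative, and the "coding sequence per gene" value is a rough average that follows from the slide's figures, not a number quoted from the paper.

```python
# Figures copied from the slide; names are illustrative only.
GENOME_MB = 3290            # total genome size
HETEROCHROMATIN_MB = 212    # heterochromatin
RAW_SEQUENCE_MB = 23_000    # total "raw" sequence generated
GENES = 31_000              # approximate gene count
CODING_FRACTION = 0.015     # about 1.5% of the genome is protein coding

# Genome remaining once heterochromatin is set aside.
euchromatin_mb = GENOME_MB - HETEROCHROMATIN_MB

# Average redundancy: how many times each base was read, on average.
coverage = RAW_SEQUENCE_MB / GENOME_MB

# Amount of protein-coding DNA, in total and averaged per gene.
coding_mb = GENOME_MB * CODING_FRACTION
coding_kb_per_gene = coding_mb * 1000 / GENES

print(f"Euchromatin: {euchromatin_mb} Mb")
print(f"Average coverage: {coverage:.1f}x")
print(f"Protein-coding DNA: {coding_mb:.1f} Mb")
print(f"Coding sequence per gene: {coding_kb_per_gene:.2f} kb")
```

So the raw sequence covers the genome about seven times over, and the protein-coding portion amounts to roughly 49 Mb.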
Repeat DNA “fossils”
• Genomes are full of repeated DNA
sequences of various kinds (table 11/12)
• Each type of repeat has a single origin and
has replicated many times within the
genome, transposing to new sites and
accumulating mutations
• By comparing copies of the repeat to see
how much they have diverged, we can estimate
how old the repeat is (fig 18)
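The dating idea can be sketched in a few lines. This is a minimal illustration, not the method used in the paper: it assumes the copy is already aligned to a consensus sequence, applies the standard Jukes-Cantor correction for multiple hits, and divides by a substitution rate. The function names and the rate used in the example are placeholders.

```python
import math


def p_distance(copy: str, consensus: str) -> float:
    """Fraction of aligned, ungapped positions at which the two sequences differ."""
    pairs = [
        (a, b)
        for a, b in zip(copy.upper(), consensus.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site from observed difference p."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor correction requires 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def estimated_age_my(copy: str, consensus: str, rate_per_site_per_my: float) -> float:
    """Age in millions of years, assuming a constant substitution rate."""
    return jukes_cantor(p_distance(copy, consensus)) / rate_per_site_per_my


# Invented 20-base example: the copy differs from the consensus at 2 sites.
CONSENSUS = "ACGTACGTACGTACGTACGT"
COPY = "ACGTACGAACGTACGTCCGT"
ASSUMED_RATE = 0.002  # placeholder: substitutions per site per million years

print(p_distance(COPY, CONSENSUS))
print(estimated_age_my(COPY, CONSENSUS, ASSUMED_RATE))
```

The more a copy has diverged from the consensus, the older the insertion; the absolute age depends entirely on the assumed rate.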
Humans versus worms and flies
• Humans have only about twice as many genes as
worms or flies (table 23)
• But human genes are subject to more alternative
splicing (60% vs 22%; average 3 different
transcripts per gene)
• So humans probably have about 5 times as many
proteins as worms or flies
• Complexity is not proportional to numbers of genes
or proteins, but to the number of interactions they
can have
Index of human genes and proteins
• 3 basic methods to predict genes from the
genomic DNA:
Comparison with ESTs, mRNAs
Homology with other known genes/proteins
Purely computational methods based on Hidden
Markov Models (HMMs)
• Started with predictions by Ensembl,
combined with other information…..
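To give a feel for the third method, here is a toy two-state HMM decoded with the Viterbi algorithm. It is only a sketch: real gene finders use far richer models (exons, introns, splice sites, codon structure), and every probability below is invented for illustration rather than taken from any published program.

```python
import math

# Toy model: all probabilities are made up for illustration.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq (bases A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("ATATATGCGCGCGCGCGCGCGCATATAT"))
```

The model labels each base as coding or noncoding by balancing base composition against the cost of switching state, which is the same principle, in miniature, that HMM-based gene predictors rely on.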
The Human Proteome
• Key database is InterPro, which combines
information on all known protein domains
• Only 94 of the 1262 InterPro types (7%) are
vertebrate-specific, so most domains are older than the
common ancestor of all animals; new ones are not
“invented” very often
• Many of these are concerned with
defence/immunity and the nervous system
• Most novelty is generated by new protein
“architectures”, combining old domains in new
ways (fig 42/45)
Genome History
• Mouse and human diverged about 100 Mya, so there is
200 My of evolution between them (100 My along each lineage)
• Chromosome translocations are involved in the formation
of new species
• By comparing locations in the genome of homologous
genes, can define regions of synteny (fig 46)
• Breakage seems to occur randomly, but tends to be in
gene-poor regions
• No convincing evidence for whole-genome duplications
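The synteny comparison can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes a list of ortholog pairs given as (human chromosome, human position, mouse chromosome), and it merges neighbouring human genes whose mouse orthologs lie on the same mouse chromosome, ignoring gene order within the mouse genome. The function name and the data are invented.

```python
def synteny_blocks(orthologs):
    """Group orthologs into runs sharing a human and a mouse chromosome.

    orthologs: iterable of (human_chr, human_pos, mouse_chr) tuples.
    Chromosome names are strings, so the sort order is lexicographic;
    that does not affect how genes on one chromosome are grouped.
    """
    blocks = []
    for human_chr, human_pos, mouse_chr in sorted(orthologs):
        last = blocks[-1] if blocks else None
        if last and last["human_chr"] == human_chr and last["mouse_chr"] == mouse_chr:
            # Same chromosome pair as the previous gene: extend the block.
            last["end"] = human_pos
            last["genes"] += 1
        else:
            # Chromosome pair changed: a breakpoint, so start a new block.
            blocks.append({
                "human_chr": human_chr,
                "mouse_chr": mouse_chr,
                "start": human_pos,
                "end": human_pos,
                "genes": 1,
            })
    return blocks


# Invented example data (positions in arbitrary units).
TOY_ORTHOLOGS = [
    ("1", 100, "4"), ("1", 250, "4"),
    ("1", 400, "3"), ("1", 900, "3"),
    ("1", 1200, "4"),
    ("2", 50, "6"),
]

for block in synteny_blocks(TOY_ORTHOLOGS):
    print(block)
```

Each change of mouse chromosome along a human chromosome marks a breakpoint between blocks; the toy data above yields four blocks.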
Polymorphism
• More than a million SNPs (single nucleotide
polymorphisms) were found
• Average 1 SNP per 1.9kb or 15 SNPs per gene
• Combinations of closely linked SNP alleles form
haplotypes
• Not all possible haplotypes are found in the population - e.g.
about 4-5 per gene (theoretically could have 2^15 = about
32,000)
• HapMap – the haplotype mapping project
• A paper (Trends in Genetics) on the subject of haplotype
blocks
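The haplotype arithmetic is easy to reproduce: 15 SNPs with two alleles each give 2^15 possible combinations. The sketch below works that out from the slide's figures and then counts distinct haplotypes in a small invented sample. The sample data and names are hypothetical, and the per-gene span is simply 15 SNPs multiplied by the average spacing.

```python
from collections import Counter

# Figures from the slide.
SNPS_PER_GENE = 15
SNP_SPACING_KB = 1.9

# Two alleles per SNP, so 2 ** n possible haplotypes for n SNPs.
possible_haplotypes = 2 ** SNPS_PER_GENE

# Stretch of sequence implied by 15 SNPs at one per 1.9 kb.
gene_span_kb = SNPS_PER_GENE * SNP_SPACING_KB


def haplotype_counts(chromosomes):
    """Count how often each haplotype (string of SNP alleles) occurs."""
    return Counter(chromosomes)


# Invented sample: six chromosomes typed at three linked SNPs.
TOY_SAMPLE = ["AGT", "AGT", "CGT", "AGT", "CAC", "CGT"]
counts = haplotype_counts(TOY_SAMPLE)
observed = len(counts)
possible_toy = 2 ** 3

print(f"Possible haplotypes for 15 SNPs: {possible_haplotypes}")
print(f"Sequence spanned: {gene_span_kb:.1f} kb")
print(f"Toy sample: {observed} of {possible_toy} possible haplotypes observed")
```

Seeing only 4-5 haplotypes out of tens of thousands of possibilities is what makes haplotype mapping practical: a few SNPs can tag each common haplotype.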
Applications in medicine
• Having the genome sequence, and databases
of genes, makes it much easier to find
disease genes by positional cloning (e.g.
BRCA2 for breast cancer)
• Sequence reveals new drug targets: e.g. a
new type of serotonin receptor, predicted
from sequence, shown to be a candidate for
treating mood disorders and schizophrenia
Latest - the Y chromosome
• Nature paper