Chapter 2 PowerPoint Slides


Genome Sequence
• Chapter 2
Position Weight Matrix
• TATA box -- believed to be used by RNA
polymerase to find the transcription start site
• Given a PWM (Table MM2.1) -- how do we decide
whether any given 15 nts form a "TATA" box?
• Ex: if the genome-wide GC average == 44%
• Background: P(A) = (1 - 0.44) * P(A | A or T) = 0.56 * 0.5 = 0.28
(expected value)
• If TATA box, P(A at pos. 1) = 61/392 ~ 0.1556
(from Table MM2.1) (observed value)
• Finally, P(A at pos. 1 | TATA)/P(A overall) == 0.1556/0.28 ~ 0.556 ==>
take the logarithm == log odds ratio
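The arithmetic on this slide can be checked in a few lines. This is a minimal sketch: the variable names are illustrative, and the log base (2) is an assumption, since the slide does not say which base is used.

```python
import math

gc_content = 0.44                       # genome-wide GC fraction from the slide
p_background_a = (1 - gc_content) / 2   # assumes A and T split the AT fraction equally
p_tata_a = 61 / 392                     # observed frequency of A at position 1 (Table MM2.1)

odds = p_tata_a / p_background_a        # observed / expected
log_odds = math.log2(odds)              # base 2 is an assumption, not stated on the slide

print(f"background P(A) = {p_background_a:.2f}")
print(f"odds            = {odds:.3f}")
print(f"log2 odds       = {log_odds:.3f}")
```

The ratio is below 1, so the log odds is negative in any base: an A at position 1 is less common in TATA boxes than in the genome overall.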
log odds
• log odds > 1 ==> nt is likely at that position in a
real TATA box
• log odds < -1 ==> nt is less likely to occur
at that position of a TATA box than overall
• since these are logs -- can sum over all positions to get
the log odds score
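Summing per-position log odds can be sketched as below. The 4-position count matrix is hypothetical (it is NOT Table MM2.1), the pseudocount is an added assumption to avoid taking the log of zero, and the function names are illustrative.

```python
import math

# Hypothetical 4-position count matrix (NOT Table MM2.1); each column sums to 100.
counts = {
    "A": [10, 80, 5, 85],
    "C": [5, 5, 5, 5],
    "G": [5, 5, 5, 5],
    "T": [80, 10, 85, 5],
}
gc = 0.44  # genome-wide GC fraction used on the previous slide
background = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "C": gc / 2, "G": gc / 2}


def log_odds_score(window, counts, background, pseudocount=1.0):
    """Sum the per-position log2 odds for a window as wide as the matrix."""
    width = len(counts["A"])
    if len(window) != width:
        raise ValueError("window length must equal matrix width")
    score = 0.0
    for pos, nt in enumerate(window):
        column_total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        p_motif = (counts[nt][pos] + pseudocount) / column_total
        score += math.log2(p_motif / background[nt])
    return score


def best_window(sequence, counts, background):
    """Slide the matrix along the sequence; return (best score, start index)."""
    width = len(counts["A"])
    scored = [
        (log_odds_score(sequence[i:i + width], counts, background), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


score, start = best_window("GGCTATAGC", counts, background)
print(f"best window starts at {start} with score {score:.2f}")
```

A real scan would use the full 15-position matrix and pick a score cutoff; both choices are outside what the slide specifies.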
Table MM2.1
Tables MM2.2 and MM2.3
Genome resources for Annotated Genes
• Entrez -- http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
• GeneCards -- http://www.genecards.org/index.shtml
BLOSUM62 substitution Matrix
• default substitution matrix for BLAST alignment
between two amino acid sequences
• provides the "score" (or "cost") of aligning one
amino acid to another
• common substitutions have higher scores
• based on observed amino acid substitution in
orthologous proteins with aligned sequences
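How a substitution matrix turns an alignment into a score can be sketched as follows. The dictionary holds only a small excerpt of BLOSUM62, entered by hand (verify against the published matrix before reuse); the function names are illustrative, and gaps are ignored, whereas BLAST also applies gap penalties.

```python
# Hand-entered excerpt of BLOSUM62 (the matrix is symmetric, so each pair is stored once).
# Verify these values against the published matrix before relying on them.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("K", "K"): 5, ("R", "R"): 5, ("W", "W"): 11,
    ("L", "I"): 2, ("K", "R"): 2,      # common, chemically similar substitutions
    ("A", "W"): -3, ("L", "K"): -2,    # rare substitutions score below zero
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up the score for aligning residue a with residue b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError if the pair is not in the excerpt


def ungapped_score(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Score two equal-length sequences aligned position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("ALK", "ALK"))  # identical residues
print(ungapped_score("ALK", "AIR"))  # conservative substitutions
print(ungapped_score("ALW", "ALA"))  # one unfavorable substitution
```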
Table MM2.4
Table MM2.5
Protein Structure
• PDB -- http://www.rcsb.org/pdb/home/home.do
• Entrez -- http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
Protein Structure Prediction
• predict 3D structure from primary amino
acid sequence
• considered computationally intractable
• however, many individuals are working on
this problem
Structure and Function Descriptions
• Gene Ontology (or "GO")
• Controlled vocabulary of hierarchical terms
that describes genes/proteins
• biological process -- overall objective
• molecular function -- biochemical activity
• cellular component -- location of protein
activity
• http://www.geneontology.org/
• Survey ends -- 12/02/2007
Caution
• 1) Note -- listings of the same information/data in
several different databases are not necessarily
independent validations (many sites cross-link/cross-reference).
The primary literature is probably the most fundamental
confirmatory resource
• 2) Do not assume that all data sources/repositories
contain the same information (example: gene
assemblies)
Mapping
• STSs -- sequence tagged sites -- short unique sequences, each
defined by a pair of primers that amplifies a distinct portion of the genome
• chromosomes were fragmented and inserted into
bacteria and/or yeast -- to maintain the DNA
• BACs (bacterial artificial chromosomes, in E. coli) carried
approximately 150 kb of sequence
• YACs (yeast artificial chromosomes) -- 150 kb to 1.5 Mb
• Using restriction maps, and the STSs, the BACs
and YACs could be assembled into longer contigs
• Mapping was considered crucial by the public
effort due to the number and sizes of large repeats
in the genome
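The idea of joining clones through shared STSs can be sketched as below. This is only an illustration: clone names and STS labels are hypothetical, two clones are assumed to overlap whenever they share at least one STS, and restriction-map evidence and clone ordering within a contig are left out.

```python
def build_contigs(clone_markers):
    """Group clones into contigs; clones sharing at least one STS are assumed to overlap."""
    remaining = dict(clone_markers)
    contigs = []
    while remaining:
        name, markers = remaining.popitem()
        contig, marker_pool = {name}, set(markers)
        grew = True
        while grew:  # keep absorbing clones that share a marker with the growing contig
            grew = False
            for other, other_markers in list(remaining.items()):
                if marker_pool & other_markers:
                    contig.add(other)
                    marker_pool |= other_markers
                    del remaining[other]
                    grew = True
        contigs.append(sorted(contig))
    return sorted(contigs)


# Hypothetical STS content for four clones.
clones = {
    "BAC1": {"sts1", "sts2"},
    "BAC2": {"sts2", "sts3"},
    "YAC1": {"sts3", "sts4"},
    "BAC3": {"sts9"},
}
print(build_contigs(clones))
```

BAC3 shares no marker with the others, so it stays in a contig of its own until more STSs are typed.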
Vector
A vector in biology has several meanings:
* An organism (biotic vector, pollinator) or medium
(abiotic vector, e.g. wind) which transports pollen to a
stigma.
* An organism that transmits disease by conveying
pathogens from one host to another (insect vector)
* A virus used to deliver genetic material into a cell
* A piece of DNA meant to carry DNA fragments into a
host cell
Figure 2.5
ESTs - Expressed Sequence Tags
• cDNA (DNA made from RNA)
• short reads of cDNAs (typically 200-800 nts) from
the 3' and 5' ends of genes (cDNA)
• Massive efforts to sequence ESTs -- which were
assembled into a database -- UniGene
• ESTs provide information on:
– existence of genes
– tissue specific expression
– alternative splicing
– cluster of ESTs -- bioinformatics problem
– (Show NEIBank)
Whole Genome Shotgun Sequencing
• TIGR (The Institute for Genomic Research) -- sequenced a bacterium
• TIGR and Celera (Craig Venter) split
– the basic idea of WGSS is to cut up the DNA
into small pieces
– sequence all the pieces
– then using software, assemble all of the
overlapping sequences
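The "assemble all of the overlapping sequences" step can be illustrated with a toy greedy merger. This is a sketch of one simple strategy, not the software TIGR or Celera used; the reads are made up, error-free, and all on one strand, and the function names are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining reads form separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Made-up reads that tile a 13-nt sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because the merger trusts any sufficiently long overlap, two copies of a repeat look identical to it, which is where a shotgun-only approach runs into trouble.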
Human Genome Project
• Publicly funded effort
– considerable effort was put into mapping and marker
identification
• for assembly
• and organization of sequences for sequencing
• markers used to choose the minimum number of slightly
overlapping fragments that completely spanned each
chromosome
• called the "golden tiling path" -- ~45,000 fragments
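Choosing the minimum number of overlapping fragments that span a chromosome is an interval-cover problem. A minimal sketch follows, assuming each clone's start and end coordinates are already known from the map; the clone names and coordinates (in kb) are hypothetical, and the greedy rule shown is one standard way to solve the problem, not necessarily what the public project used.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover: fewest clones whose union spans the region.

    clones is a list of (name, start, end) tuples.
    """
    path = []
    covered_to = region_start
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    idx = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered point, keep the one reaching farthest.
        while idx < len(ordered) and ordered[idx][1] <= covered_to:
            if best is None or ordered[idx][2] > best[2]:
                best = ordered[idx]
            idx += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones covering a 450 kb region.
clones = [("A", 0, 150), ("B", 100, 260), ("C", 140, 300), ("D", 280, 450), ("E", 120, 200)]
print(minimal_tiling_path(clones, 0, 450))
```

Clones B and E are redundant here: C starts inside A and reaches farther than either of them.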
Clones, Libraries, and Mapping
http://www.genome.gov/Pages/Education/Kit/main.cfm?pageid=92#
[Figure labels: E. coli; BACs -- 100,000 - 200,000 bp; markers or STSs]
Mapping -- an aside
• Genetic map
• physical map
– radiation hybrid map
Clones, Libraries, and Mapping
• BACs are typically cleaved into smaller
fragments -- about 2000 bases -- and stored
in E. coli vectors (plasmids)
• precise order of the larger BACs is determined first -- because
determining the order of many smaller fragments is more work
• however, shorter fragments are more amenable to the
chemistry of the sequencing reactions
Celera -- Shotgun Sequencing
• Genome -- cleaved into 27,271,853
fragments
• Celera had access to the public data
• Claimed not to have used it (not clear why)
for assembly. Any guesses why?
Whole Genome Shotgun VS Mapping
• WGS fails in large sections of highly repetitive
DNA
• Yet, the WGS "lost" only 103 genes (pretty good),
given the cost/time savings
• In hindsight, a hybrid approach appears to be
optimal
– WGS to get a majority of sequence (say 6x coverage)
– minimum tiling path to resolve repetitive regions
– estimated that 3000 BACs would be sufficient for
human (93% less than was sequenced for human)
• However, it would be impossible to know this
without the results of both approaches for
comparison
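Two of the numbers on this slide can be sanity-checked. The sketch assumes the "93% less" figure is measured against the ~45,000-fragment golden tiling path quoted earlier, and it uses the idealized Poisson coverage model (uniformly random reads, no repeats) to estimate what 6x coverage leaves unsequenced; both are assumptions, not statements from the slide.

```python
import math

tiling_path_clones = 45_000  # "golden tiling path" size quoted earlier (assumed baseline)
hybrid_bacs = 3_000          # BACs estimated to be sufficient in the hybrid approach

reduction = 1 - hybrid_bacs / tiling_path_clones
print(f"reduction in BACs: {reduction:.1%}")

coverage = 6                                # 6x shotgun coverage from the slide
expected_uncovered = math.exp(-coverage)    # Poisson model: P(base hit by zero reads)
print(f"expected uncovered fraction at {coverage}x: {expected_uncovered:.4%}")
```

Under this idealized model about a quarter of a percent of bases go unsampled at 6x; repetitive regions behave worse than the model, which is the gap the BAC tiling path is meant to close.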
Figure 2.B1 (labels: 28 locations; 690 kb; 46 kb)
Figure 2.B2
Annotating Genomes -- Summary
"Fully annotating a newly sequenced genome
requires many people with different academic
backgrounds working together in teams. As you
might guess, software development for genome
analysis is a very hot research area in computer
science, mathematics, engineering, and biology.
Few people can master more than one or two of
these areas, so collaborations are common. If you
learn both math and biology, you will have many
career opportunities…"
How many proteins? From "one" "gene"
• Nox1 -- ESTs identified the gene for
voltage-gated H+ ion channels
• 3 different mRNAs are encoded (2 long, 1
short) through alternative splicing -- these are
tissue specific
Figure 2.6
Imprinting
• Prader-Willi Syndrome
– weak muscles, short stature, obesity, intellectual
disability
• Angelman Syndrome
– balance problems, impaired motor skills, excessively happy
demeanor, severe intellectual disability
• both involve a megabase-scale deletion in 15q11-13
• disease status determined by "imprinting" -- marking
of a gene so that only the paternal or
maternal copy will be transcribed (the other copy
is not transcribed)
• Is this form of inheritance supported by Mendel's
rules?
Imprinting
• many imprinted genes are involved with the placenta and
developing embryos
• the embryo is analogous to a parasitic drain on the mother
• genetically speaking, the male would prefer to
extract maximal nutrients, to ensure propagation
of genetic line
• female would prefer to ration resources to increase
the chance of having multiple lines
• opposition is genetic "conflict"
Imprinting -- example
• IGF2 -- insulin-like growth factor 2
(paternally expressed gene)
• IGF2R -- receptor expressed only by the
maternally inherited allele
• by controlling the embryonic expression of
the receptor, the mother maintains control of
the paternally driven ligand from IGF2
Figure 2.7 Igf2 gene expression -- in 2 strains of mouse
Methylation -- a mechanism
• DNA methylation -- the addition of a methyl
group (CH3) to cytosine (C) in DNA
• methylation
– associated with lack of transcription
– loss of methylation observed in cancerous cells
• regulation of gene expression without altering the
DNA sequence is -- epigenetic regulation
• ~400,000 methylated sites in any given cell type
• ~ 100 unique cell types in humans ==> 40,000,000
methylated sites?
• Plus gender-based differences in expression
Figure 2.8
Other interesting finds…
• haplotypes and genetic "invariability"?
• microRNAs (a new kind of gene)
• pseudogenes
• more duplications and deletions than
expected (copy number variations)