BioSci 145B Lecture #6 5/11/2004
• Bruce Blumberg
– 2113E McGaugh Hall - office hours Wed 12-1 PM (or by appointment)
– phone 824-8573
– [email protected]
• TA – Curtis Daly [email protected]
– 2113 McGaugh Hall, 924-6873, 3116
– Office hours Tuesday 11-12
• lectures will be posted on web pages after lecture
– http://eee.uci.edu/04s/05705/ - link only here
– http://blumberg-serv.bio.uci.edu/bio145b-sp2004
– http://blumberg.bio.uci.edu/bio145b-sp2004
BioSci 145B lecture 6
page 1
©copyright
Bruce Blumberg 2004. All rights reserved
Useful software for molecular biology (contd)
• NCBI – www.ncbi.nlm.nih.gov
– main information and analysis resource
– indispensable resource
Useful software for molecular biology (contd)
• NCBI – BLAST – how to find similar genes
• www.ncbi.nlm.nih.gov/BLAST/
Useful software for molecular biology (contd)
• Why pay Celera?
Routes to gene identification
• Genome sequences are minimally useful without annotation
– Annotation = description, biological information
• Functional annotation – information on the function
• Structural annotation – identification of genes, sequence elements
– Much annotation is done automatically today
• Via sequence comparisons with various databases
– Gene sequences
– ESTs
• Algorithms predict promoters, splice sites, polyadenylation sites and, most importantly, ORFs
• ORFs (open reading frames) encode putative proteins
– Algorithms err in both directions (predict false ORFs, miss real ones)
– Source of much disagreement
• Field of bioinformatics has grown to encompass many types of analysis
related to gene function
– www.igb.uci.edu
How are genes identified?
• Functional cloning
– Finding a gene by using a functional assay
• Positional cloning
– Find a gene by where it is located, what it is near
• Bioinformatic analysis that relates back to functional or positional cloning
How are genes identified?
• Functional cloning (aka expression cloning) – identifying by a functional assay
– What are functional assays?
• Enzyme activity – kinases (add PO4 to proteins)
• Ligand binding – peptide hormone (e.g. glucagon) receptors
• Transport (ions, sugars, etc) – e.g., intestinal glucose transporters
• Mutant rescue – restore function to a cell or embryo
– Introduce cDNA library pools (~10,000 cDNAs)
• via transfection, microinjection, infection
• Perform functional assays
– A robust, sensitive, accurate assay is key
– Positive pools are subdivided and retested to obtain pure cDNAs
• Cycle is repeated until single clones are obtained
– Applications – many enzymes, transporters and growth factor receptors were cloned this way
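The subdivide-and-retest cycle lends itself to a quick calculation. The following is a minimal sketch, assuming each positive pool is split into equal sub-pools and exactly one sub-pool scores positive per round. The function name and the tenfold split are illustrative assumptions, not part of the lecture.

```python
import math

def rounds_to_single_clone(pool_size: int, split: int = 10) -> int:
    """Count subdivide-and-retest rounds until a pool holds a single clone.

    Assumes every positive pool is divided into `split` equal sub-pools
    and that exactly one sub-pool tests positive in each round.
    """
    rounds = 0
    while pool_size > 1:
        pool_size = math.ceil(pool_size / split)  # size of the positive sub-pool
        rounds += 1
    return rounds

# Starting pool of ~10,000 cDNAs, split tenfold each round
print(rounds_to_single_clone(10_000))
```

A smaller split factor means fewer assays per round but more rounds, which matters when the functional assay itself is slow.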
How are genes identified?
• Positional cloning – identifying by where the gene is located
– from genetic linkage map
– by walking from nearby sequence tags (ESTs, STSs, STCs, etc.)
– Gene trap techniques (week 8)
– Interspecific backcrosses (Mus musculus vs M. spretus)
– When possible – try to rescue phenotype with candidate region
• Positional cloning – ok you have identified a region where your gene of
interest may be located - now what?
– How do we figure out what genes are in this region without knowing
function?
– General problem for annotation of sequences
• Genome sequencing vis a vis positional cloning
How are genes identified? (contd)
• Ways to identify genes in regions
– Cross-species hybridization
• Probe another species with this genomic region
• coding sequences are conserved -> should see hybridization where
genes are
• What do you think are limitations to this approach?
– Species must be sufficiently different to reduce “noise” from
overall sequence conservation
» mouse vs human probably not great
» Human vs frog or fish probably good
– Must be sufficiently similar for genes to be conserved
» Human vs frog or fish probably good
» Humans vs yeast only good for common genes
– Target species region needs to be well characterized
• Computer parallels – compare sequence to be annotated with
annotated sequence from a different organism
– e.g., human with Drosophila
– Unknown bacterium with E. coli, etc.
How are genes identified? (contd)
• Ways to identify genes in regions (contd)
– Hybridization to known genes or coding materials
• What are some examples?
– Hybridize to mRNA (Northern blots)
– Hybridize to cDNA libraries (must be right tissue, cell or stage)
– Capture cDNAs or mRNAs from solution
• Computer based parallels
– Compare with expressed sequences from other species
– Compare with ESTs
How are genes identified? (contd)
• Ways to identify genes in regions (contd)
– Identify features found in typical promoters
• What are promoters?
– Regions 5’ to a gene that are required for expression
• CpG islands – CG-rich regions in eukaryotic genes that are hypomethylated (undermethylated)
– Remember that methylation of DNA typically inhibits gene
expression
– Digest with enzymes that have CG in recognition site that would
be inhibited if methylated, e.g., SacII CCGCGG, run gel to check
» If nonmethylated (expressed) enzyme will cut, region will be
hypersensitive, get chopped up.
» If methylated (not expressed) enzyme will not cut and
region will not get digested
• DNAse I hypersensitive sites
– Similar principle – transcriptionally active DNA is “open”
– If open, it is more sensitive to DNAse I than non-active DNA
– Test by digestion and gel electrophoresis
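The CpG island idea has a simple computational counterpart. The sketch below computes GC fraction and the observed/expected CpG ratio for a sequence, and locates SacII sites (CCGCGG). The function names are hypothetical. The thresholds mentioned in the comment (GC fraction above 0.5 and observed/expected above 0.6 over at least 200 bp) are the commonly cited Gardiner-Garden and Frommer criteria, which are an assumption brought in here and not part of the lecture.

```python
import re

def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CG dinucleotides cannot overlap one another
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: c * g / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def sacii_sites(seq: str) -> list[int]:
    """0-based start positions of the SacII recognition site CCGCGG."""
    return [m.start() for m in re.finditer("(?=CCGCGG)", seq.upper())]

# Commonly cited island criteria (assumption, not from the lecture):
# window >= 200 bp, GC fraction > 0.5, observed/expected CpG > 0.6
gc, ratio = cpg_stats("ATCCGCGGTACGCG")
print(gc, ratio, sacii_sites("ATCCGCGGTACGCG"))
```

Note that this only finds candidate sites in the sequence. Whether SacII actually cuts depends on the methylation state of the genomic DNA, which sequence alone does not reveal.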
How are genes identified? (contd)
• Ways to identify genes in regions (contd)
– Exon-trapping
• Insert genomic clone into “intron” between two exons
• Transfect into cells
• Assay for size of transcript
– Known size from two exons
– If genomic clone has exon – size will increase
• Extraordinarily rarely used – much too painful
How are genes identified? (contd)
• Duchenne muscular dystrophy (DMD) – one of the first genes positionally cloned
– Good example of how different things would be today
– RFLP mapping identified a chromosomal region Xp21 where gene was
• Many patients had translocations in the region
– Chromosomes spliced to other chromosomes
How are genes identified? (contd)
• Duchenne muscular dystrophy (DMD) – one of the first genes positionally cloned
– One group did genomic subtraction cloning
• Strategy enriched for regions lost in DMD patient
– Hybridized restriction enzyme-digested normal DNA with excess sheared DMD DNA
– Only hybrids with restriction site ends could be cloned
– Only hybrids with such ends would be from the region absent in DMD DNA (since DMD DNA was in excess)
• Made a library and tested clones by Southern blot to normal and DMD
DNA
– Today – for a sequenced organism – just go to the database identify
sequences in region of interest and verify by Southern or PCR as above
• Or look in large insert libraries with breakpoints
• Or do cDNA subtraction between tissues from normal individual and
DMD individual
– Presumes knowledge of source of mutation, i.e., the defect
resides in the affected tissue
– Would not detect a defect in inducing factor from other tissue
How are genes identified? (contd)
• Duchenne muscular dystrophy (contd)
– 2nd group cloned breakpoints
• Girl with translocation between X and 21
• Chromosome 21 is rich in rRNA genes, so they made a radiation hybrid panel from the patient
• Identified hybrid cell carrying the breakpoint
– made a genomic library from it
• Screened library for clones with both rRNA genes and X chromosome
specific sequences
– Long, tedious process with many more failures than successes
– Finally found 1 such genomic clone
• Mapped this genomic DNA to male patients with DMD and found
deletions in many of them
– DMD gene is largest known – 2.4 megabases
– cDNA cloning followed – protein is dystrophin
How are genes identified? (contd)
• The problem with all of these methods is that experiments are required
– What do we do when sequences are coming in at the rate of tens of
gigabases/month?
– Need large-scale, robust, computerized methods to identify genes and
annotate sequences!
• All bioinformatics depends on databases
– UCI bioinformatics groups are among the best at designing and
constructing databases
• http://www.igb.uci.edu/servers/databases.html
• Three major databases of sequences (automatically duplicated)
– GENBANK http://www.ncbi.nlm.nih.gov/Genbank/index.html
– DNA Databank of Japan http://www.ddbj.nig.ac.jp/
– European Molecular Biology Laboratory (EMBL)
http://www.ebi.ac.uk/embl/index.html
Dystrophin as an example
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1756
Genome annotation – how to identify genes?
• Gene identification/prediction is important but difficult
– Large variety of methods and algorithms to predict exons
– To identify genes must first identify open reading frames (ORFs)
• When dealing with cDNAs – look for regions that code for proteins
– Do all genes code for proteins?
– Correct reading frame for a sequence is assumed to be largest
with no stop codons (TGA, TAA, TAG)
• Lots of tricks can be employed
– Codon frequency for an organism
» Coding sequences follow codon usage
» Noncoding sequences do not, often have lots of stop codons
– Consensus sites
» Kozak translational initiation CCRCCATGG
• What is a very important consideration when searching sequences to
predict ORFs?
– Sequence must be accurate
» Incorrect base calls are troublesome
» But indels (insertions or deletions) are disastrous
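The reading-frame rule and the warning about indels can be illustrated together. This is a minimal sketch, assuming uppercase DNA and forward frames only. The function names and the toy sequence are made up for illustration. It picks the frame with the longest stop-free stretch, then shows that a single substitution leaves the frame intact while a single-base deletion breaks it.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_open_run(seq: str, frame: int) -> int:
    """Longest run of consecutive non-stop codons in one forward frame."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

def best_frame(seq: str) -> int:
    """Forward frame (0, 1 or 2) with the longest stop-free stretch."""
    return max(range(3), key=lambda f: longest_open_run(seq, f))

# Toy coding sequence: open in frame 0, with stops in frames 1 and 2
seq = "ATG" + "CTGATTAAC" * 5
substituted = seq[:20] + "G" + seq[21:]   # one wrong base call (C -> G)
deleted = seq[:20] + seq[21:]             # one missing base (indel)

for name, s in (("original", seq), ("substitution", substituted), ("deletion", deleted)):
    print(name, best_frame(s), [longest_open_run(s, f) for f in range(3)])
```

After the deletion no single frame stays open across the whole sequence, which is why an ORF predictor working on inaccurate sequence will truncate or split the gene.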
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods
– Two major methods are in use
• Homology searches
– Compare with other known sequences
• Ab initio (from the beginning) prediction
– Use algorithms to recognize common features and predict genes
» Promoters
» Splice sites
» Polyadenylation sites
» ORFs
– Generally, microbial genomes are much easier to annotate – WHY?
• Simply identify ORFs > 300 bp (100 amino acids)
– Works very well
– But can miss small coding sequences
– Must run on both strands because there are shadow genes
(overlap on two strands)
• Using a variety of programs, can predict genes in bacterial genomes
– this week -> environmental sequencing papers
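The microbial rule of thumb above (ORFs longer than 300 bp, scanned on both strands) can be sketched as follows. This is an illustrative toy, not any particular gene finder: the function names are hypothetical, ORFs are taken to run from the first ATG after a stop to the next in-frame stop, and coordinates for the "-" strand are given on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_len: int = 300) -> list[tuple[str, int, int]]:
    """Return (strand, start, end) for ATG...stop ORFs of at least min_len bp.

    Coordinates are 0-based, half-open, and include the stop codon.
    For strand "-" they refer to the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                      # first ATG since the last stop
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs

# A 306 bp toy gene with 2 bp of flanking sequence on each side
dna = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
print(find_orfs(dna))
print(find_orfs(reverse_complement(dna)))  # same gene, now on the other strand
```

Raising `min_len` reduces false positives at the cost of missing small coding sequences, which is the trade-off the slide points out.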
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
– Huge variety of programs available
• Neural networks – attempt to model learning process
– Adjust weighted connections between simple units, can use probabilistic reasoning
• Rule-based systems
– Rules often not clear
– Have trouble with exceptions
• Hidden Markov models
– Break sequences down into small units based on statistical
analysis of composition
» Hexamers appear to be optimal size to search
– Classify sequences into types or “states”
– Identify transitions between states
– Very useful for large number of purposes
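The "states" and "transitions" language above can be made concrete with a toy two-state hidden Markov model decoded by the Viterbi algorithm. Everything numeric here is invented for illustration: the state labels, the start, transition and emission probabilities, and the use of single nucleotides rather than hexamers. Real gene finders use far richer models.

```python
import math

STATES = ("N", "C")  # N = noncoding, C = coding (toy labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {  # assumed compositions: noncoding AT-rich, coding GC-rich
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(seq: str) -> str:
    """Most probable state path for seq under the toy model above."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace back from the last position
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AT" * 5 + "GC" * 5 + "AT" * 5))
```

The transition probabilities control how reluctant the model is to switch state, so a short GC-rich blip is ignored while a sustained GC-rich stretch is labeled as a separate region.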
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
– Training sets are used to “teach” programs how to solve problems
• Training set is actual data – genes with known features
– Programs use training sets to classify new data
• Neural networks use training data to adjust their connection weights
• Rule-based systems use training data to generate rules
• HMMs build a table of probabilities for states and transitions
– Pierre Baldi in IGB is UCI expert in machine learning
• How well do gene prediction programs work?
– Extremely well on bacterial genomes
– Fairly well on simple eukaryotic genomes
– Variable on complex genomes
– Rule of thumb – use a group of programs and look for areas of agreement
among them
– The best current programs combine ab initio predictions with similarity
data to define a probability model
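The rule of thumb about looking for agreement among programs can be sketched as a per-base vote. This is a hypothetical illustration: the function name, the interval format (0-based, half-open) and the example calls are all assumptions, and real pipelines weigh programs and evidence types rather than counting them equally.

```python
def consensus_regions(predictions, length, min_votes):
    """Return regions called by at least min_votes programs.

    predictions: one list per program of (start, end) half-open intervals
    on a sequence of the given length.
    """
    votes = [0] * length
    for calls in predictions:
        covered = set()  # count each program at most once per position
        for start, end in calls:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:
            votes[pos] += 1
    regions, begin = [], None
    for pos, v in enumerate(votes + [0]):  # trailing 0 closes an open region
        if v >= min_votes and begin is None:
            begin = pos
        elif v < min_votes and begin is not None:
            regions.append((begin, pos))
            begin = None
    return regions

# Made-up exon calls from three hypothetical programs
program_a = [(10, 50), (80, 120)]
program_b = [(12, 48), (85, 130)]
program_c = [(15, 55)]
print(consensus_regions([program_a, program_b, program_c], length=150, min_votes=2))
```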
Identification of gene function
• You have identified a gene – what is its function?
– Always look for similarity to known sequences
• Book suggests Swiss-Prot
• GenBank translated database is best
• BLAST is tool to use
• Amino acid searches more sensitive than nucleotide searches
– Because identical amino acid sequences might only be 67%
identical at nucleotide level
– What might you find?
• Match may predict biochemical and physiological function
– e.g., a known enzyme from another organism
• Match may predict biochemical function only
– e.g., a kinase
• Match a gene from another organism with no known function
– May match ESTs or ORFs from other organisms
• Match a known gene with partly characterized function
– Search leads to clarification of function – NifS in book
• Might not match anything at all
– Expect this will happen less and less
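The 67% figure mentioned above follows from codon degeneracy: two genes can differ at every third codon position and still encode the same protein. The sketch below builds the standard genetic code table and compares two made-up DNA sequences that use only fourfold-degenerate codons. The sequences and function names are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate an uppercase DNA string in frame 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

gene1 = "GGTGCTGTTCCTACTTCT"  # third codon positions all T
gene2 = "GGCGCCGTCCCCACCTCC"  # third codon positions all C

print(translate(gene1), translate(gene2))
print(identity(gene1, gene2), identity(translate(gene1), translate(gene2)))
```

Because the protein comparison is unaffected by such silent differences, a translated search can detect relationships that a nucleotide search would score poorly.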
Up-regulated by TTNPB (> 1.5 fold, p < 0.01, n=334)
[Pie chart: up-regulated genes by functional category]
• Hypothetical – 26% (90)
• Unidentified – 21% (73)
• Transcriptional – 15% (53)
• Housekeeping – 15% (52)
• Signal transduction – 8% (26)
• Cytoskeleton – 4% (14)
• Extracellular matrix – 3% (10)
• Energy metabolism – 3% (9)
• Neural – 2% (7)
• Miscellaneous – 1% (4)
• Retinoid metabolism – 1% (3)
• Tumor suppressor – 1% (2)
Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
– Does the sequence contain an obvious functional motif?
• Homeobox or other consensus DNA binding domain?
• Kinase domain?
• Serine protease, etc.
– InterPro database allows one to compare a protein sequence with a whole
family of structural databases
• http://www.ebi.ac.uk/interpro/
• HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMIDKRQRNRC
QYCRYQKCLAMGM
– Other sorts of similarity searches
• Identify protein secondary structure motifs
– Alpha helix, beta sheets, hydrophobicity
– Amphipathic helices
– Overall polarity of sequences
• Not used much
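As a very rough stand-in for what a motif database does, the sketch below scans the protein sequence shown above for a simple C-x(2)-C pattern (two cysteines separated by any two residues), a spacing that is typical of zinc-binding DNA binding domains. The pattern is a hand-written toy chosen for illustration, not an InterPro signature, and the function name is hypothetical.

```python
import re

# Sequence from the slide, joined across the line break
PROTEIN = ("HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMIDKRQRNRC"
           "QYCRYQKCLAMGM")

def find_motif(seq: str, pattern: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq)]

# Toy pattern (assumption): cysteine, any two residues, cysteine
for start, hit in find_motif(PROTEIN, r"C.{2}C"):
    print(start, hit)
```

Curated signatures are much more specific than this, combining several conserved positions and allowed spacings so that chance matches are rare.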
Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
– Gene ontology – highly structured vocabulary for gene classification
• Genes are classified using this vocabulary
• Relates protein function with cellular or organismal functions
– Nucleic acid binding
– Cell division
Genome annotation
• Extremely important as number of sequences increases
– Goals are to identify
• all of the sequences
• all of the features of each sequence
• All of the functions of the identified genes
– Often annotation does not agree with known function
• Human error
• New and updated information not propagated to database
• Inaccurate sequencing
• Sometimes annotation is correct but protein lacks function under
certain conditions (e.g., need cofactors)
– Gold standard for functional analysis is loss-of-function analysis
• Most accurate annotation
– Common to have “annotation jamborees” where biologists and
bioinformaticians come together to annotate new sequences
• Xenopus tropicalis jamboree will be in Spring 2005
Comparative genomics
• Study of similarities and differences between genome structure and
organization
– How many genes? Chromosomes?
– Genome duplications
– Gene loss
• Driving forces
– Understanding evolution in molecular terms
– Sequence annotation and function identification
• Sequences with important functions tend to be conserved across
evolution
• Orthology vs paralogy
– Homology – descended from a common ancestor
– Orthologs are homologous genes in different organisms that encode
proteins with the same function and which have evolved by direct
vertical descent
– Paralogs are homologous genes that encode proteins with related but
non-identical functions
• Derived by gene duplication
Midterm review
Mean – 29 +/- 5.5
Range 23-39
1. (10 points) It is 2006 and a NASA Mars scout mission has returned soil samples from
an area of Mars formerly covered by a sea. Surprisingly, the sample contains viable
microorganisms and even more remarkably, these organisms are apparently eukaryotes
(have a nucleus). One of your colleagues has figured out how to culture these
organisms, which were named Mars burroughsii in the laboratory. Unfortunately, in
the process he accidentally discovered that they are quite pathogenic to mammals.
Worse, a sample was mistakenly poured down the drain and is now contaminating
Newport Beach. Your PI is a specialist at working with weird microorganisms and she
has decided to take the lead in determining the genome sequence of M. burroughsii.
a) (3 points) A genomic library will be necessary to facilitate the mapping and
sequencing of the genome. What type of library will you make, i.e., what type of
vector will you use? Justify your choice. What sort of equipment will you require?
b) (3 points) Describe how you will make a physical map of the M. burroughsii
genome prior to sequencing.
c) (4 points) Outline a method to quickly generate a high quality, finished, genome
sequence, which will be essential in understanding the pathogenicity of M.
burroughsii.
Midterm review
a) It would be best to make a BAC or PAC library since these hold large inserts
and are relatively stable compared with lambda, cosmids and YAC libraries.
You will need standard laboratory equipment including an electroporator and
most critically pulsed field gel electrophoresis (PFGE)
b) You would want to map the large insert clones from the library made in a) by
hybridization, fingerprinting or map as you go
c) A high quality finished genome requires a long-range map (like you made in b)
and shotgun sequencing. Most sequencing centers today would choose to
combine whole genome shotgun sequencing with BAC end sequencing to
facilitate finishing
Midterm review
2. (5 points) You look around the lab to find an E. coli strain that will be suitable for
propagating the library you made above. You can find two strains that might be
suitable. Their genotypes are the following (recall that the bacteria are mutated or
deficient in the genes listed):
strain A - mcrA, Δ(mrr-hsdRMS-mcrBC), ΔlacX74, Φ80lacZΔM15, recA1, araD139,
Δ(ara-leu)7697, galU, galK, endA1, nupG
strain B - mcrA, endA1 supE44, gyrA96, relA1, recA, recD, recJ, sbcC
Is either of these strains a good choice for your library? If so, which one? Or are both
ok? Explain your reasoning (i.e., which features are good or missing).
Strain A is a good choice for PAC or BAC libraries because only restriction
deficiency is required for the maximum efficiency in library construction.
There would be no real harm for the strain to be recombination deficient but
this is only required for vectors that exist in multiple copies/cell (such as
lambda or cosmids)
The desirable features are mcrA, Δ(mrr-hsdRMS-mcrBC) and the Φ80lacZΔM15
(although this is not essential).
Strain B is not a good choice because it is not restriction deficient
Midterm review
3. (4 points) Your PI asks you to make a normalized M. burroughsii cDNA library for
EST sequencing and has suggested that you ensure the library is well normalized by
hybridizing the tester and driver to a large Cot½ value (e.g., 50). Is this the correct
approach? Why or why not? If you plan to use this cDNA library for many purposes,
would it be a better idea to subtract it? Why or why not?
To make a normalized library, one needs to hybridize to a LOW Cot½ value
(e.g. 5) in order to reduce the frequency of abundant cDNAs. Hybridization to a
high Cot½ value (e.g. 50) will deplete the library in rare sequences as well as
abundant sequences.
If the library will be used for multiple purposes, it would be better to normalize it
since subtraction removes a significant number of sequences that are in
common between the driver and tester.
Midterm review
4. (3 points) What are three important goals that one should always have when
constructing a genomic library?
Faithful representation of the genome – no chimerism or deletions, all
sequences represented, at least five fold coverage of the genome
Easy to screen
Easy to produce enough DNA for further analysis
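The fivefold coverage goal can be turned into clone numbers. The sketch below uses the standard Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is insert size divided by genome size. That formula is brought in here as background and is not from the lecture. The genome size (100 Mb) and insert size (150 kb) are purely hypothetical, since nothing is known about the M. burroughsii genome.

```python
import math

def clones_for_coverage(genome_bp: float, insert_bp: float, fold: float) -> int:
    """Clones needed so that total insert length is `fold` times the genome."""
    return math.ceil(fold * genome_bp / insert_bp)

def clones_for_probability(genome_bp: float, insert_bp: float, p: float) -> int:
    """Clarke-Carbon estimate: clones needed to include a given sequence with probability p."""
    f = insert_bp / genome_bp
    return math.ceil(math.log(1 - p) / math.log(1 - f))

def probability_at(genome_bp: float, insert_bp: float, n_clones: int) -> float:
    """Probability that a given sequence is present among n_clones random clones."""
    f = insert_bp / genome_bp
    return 1 - (1 - f) ** n_clones

GENOME = 100_000_000   # hypothetical 100 Mb genome
INSERT = 150_000       # hypothetical 150 kb BAC inserts

n = clones_for_coverage(GENOME, INSERT, 5)
print(n, probability_at(GENOME, INSERT, n))
print(clones_for_probability(GENOME, INSERT, 0.99))
```

Under these assumptions fivefold coverage leaves well under 1% chance of missing any particular sequence, provided cloning is random, which is exactly what the "faithful representation" goal is meant to protect.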
Midterm review
5. (3 points) What are three factors that determine whether a sequence can be stably
propagated in bacteria?
Toxicity
Restriction
Recombination
Midterm review
6. (4 points) An international panel of experts has suggested that an EST project be
implemented for M. burroughsii in order to speed up the identification of the
pathogenic protein discussed in question 1. Describe generally how you would go
about making a library of full-length cDNAs, including which type of vector you
would choose and why. Assume that this library will be used for EST sequencing and
functional analysis of proteins expressed from the library.
You would begin by isolating RNA and selecting poly A+ RNA to enrich for
mRNA.
For full-length library construction, use the oligo-capping or cap trapping
method to select those mRNAs that have 5’ cap structures.
Synthesize first strand cDNA with reverse transcriptase and second strand
cDNA with DNA polymerase I
Add linkers or adapters, restriction digest and ligate to the vector of interest
If the library is to be used for EST sequencing and functional analysis of
proteins expressed from the library, it will probably be best to use a plasmid
vector
Midterm
7. (5 points) This graph depicts an experiment in which
genomic DNA was hybridized with an RNA tracer.
What conclusions can be drawn about the nature of the
RNA that is transcribed from this DNA? What
implications do these conclusions have for large-scale
genome sequencing projects such as Drosophila or
human?
The graph shows that the RNA tracer only hybridizes with moderately repetitive
and nonrepetitive sequences, suggesting that RNA is only transcribed from these
regions.
This is the justification for why sequencing projects do not worry about regions
of highly repetitive DNA. These regions are not likely to be transcribed into RNA
and therefore are not cost-effective to sequence.
Midterm
8. (3 points) You have identified a cDNA that encodes a protein which is essential for the
pathogenesis of M. burroughsii. The cDNA is 7 kb long. How would you fully and
completely determine the sequence of this cDNA so that it might be used to develop a
vaccine against M. burroughsii infection? Remember, time is of the essence since
people are getting very sick as a result of M. burroughsii.
Since time is of the essence, it is not acceptable to use methods that will
take a long time to generate the finished sequence. This excludes shotgun
sequencing and primer walking. You would want to use either restriction
fragment cloning, or progressive deletions, such as with Exonuclease III
combined with dideoxy sequencing to quickly get the sequence.
Midterm
9. (3 points) In studying the dispersion of M. burroughsii in the ocean, an epidemiologist
notices that many more people in Newport Beach get sick from swimming in the ocean
than do those in San Diego, although the number of organisms in the water at both
locations is indistinguishable. Deductive logic suggests that there is something
different in the human populations in Newport Beach and San Diego that mediates this
apparent differential susceptibility to M. burroughsii. You and your colleagues have
discovered that M. burroughsii infects humans by binding to a protein expressed on the
surface of intestinal cells called CadF. Since the human genome has been sequenced,
you know the sequence of the gene encoding this protein. Describe generally how you
might identify single nucleotide polymorphisms in the CadF gene and perform a
relatively simple test to identify which people in the population might carry this
polymorphism and be resistant to infection with M. burroughsii.
The method I was looking for was to use a SNP chip to quickly identify
differences in the CadF gene among affected and unaffected people. Some
very creative answers were given and partial or full credit was given for ones
that might work.