PPT - Blumberg lab

BioSci D145 Lecture #6
• Bruce Blumberg ([email protected])
– 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)
– phone 824-8573
• TA – Riann Egusquiza ([email protected])
– 4351 Nat Sci 2– office hours M 1-3
– Phone 824-6873
• check e-mail and noteboard daily for announcements, etc.
– Please use the course noteboard for discussions of the material
• Updated lectures will be posted on web pages after lecture
– http://blumberg-lab.bio.uci.edu/biod145-w2017
BioSci D145 lecture 1
page 1
©copyright
Bruce Blumberg 2014. All rights reserved
Midterm Score Distribution
• Mean – 25.7
• Median – 27.3
• Stdev – 6.7
• Low – 13
• High – 32.5
Functional Genomics - The challenge: Many new genes of unknown function
• Where/when are they expressed?
– Known genes (e.g. from genome projects)
• Gene chips (Affymetrix)
• Microarrays (Oligo, cDNA, protein) (Iyer)
– Novel genes
• Expression profiling
– Genomic tiling microarrays (Kapranov)
– SAGE and related approaches (RIKEN)
– Massively parallel sequencing (RNA-Seq) (Bentley)
• Personal ‘omic approaches to gene discovery (week 6 papers)
– Which genes regulate what other genes?
• Epigenetic modification of gene expression (week 7 papers)
• What is the phenotype of loss-of-function? (week 8 papers)
– Genome wide CRISPRi (Liu)
– Genome wide synthetic lethal screens (Luo)
– CRISPR/Cas (Gilbert)
• Detecting protein-protein interactions (week 9 papers)
• Metabolome & microbiome (week 10 papers)
Routes to gene identification
• Genome sequences are minimally useful without annotation
– Annotation = description, biological information
• Functional annotation – information on the function
• Structural annotation – identification of genes, sequence elements
– Much annotation is done automatically today
• Via sequence comparisons with various databases
– Gene sequences
– ESTs
• Algorithms predict promoters, splicing, polyadenylation sites and,
most importantly ORFs
• ORFs – open reading frames are putative proteins
– Algorithms miss in both directions
– Source of much disagreement
• Field of bioinformatics has grown to encompass many types of analysis
related to gene function
– www.igb.uci.edu
How are genes identified?
• Random
– EST sequencing, select interesting looking gene
– Large scale expression analysis
• http://xenopus.nibb.ac.jp/
• From protein sequences
– Antibody screening
– Reverse translate and oligo screen
• Functional cloning
– Finding a gene by using a functional assay
• Positional cloning
– Find a gene by where it is located, what it is near
• By similarity to other sequences
– Gene family
– Cross-species
– Computer based equivalents
• Bioinformatic analysis that relates back to functional or positional cloning
How are genes identified?
• Functional cloning (aka expression cloning) – identifying by a functional assay
– What are functional assays?
• Enzyme activity – kinases (add PO4 to proteins)
• Ligand binding – peptide hormone (e.g. glucagon) receptors
• Transport (ions, sugars, etc) – e.g., intestinal glucose transporters
• Mutant rescue – restore function to a cell or embryo
– Introduce cDNA library pools (~10,000 cDNAs)
• via transfection, microinjection, infection
• Perform functional assays
– A robust, sensitive, accurate assay is key
– Positive pools are subdivided and retested to obtain pure cDNAs
• Cycle is repeated until single clones are obtained
– Applications – many enzymes, transporters and growth factor receptors were cloned this way
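The subdivide-and-retest cycle above can be sketched as a rough calculation of how many rounds are needed to go from a pool to a single clone. This is only an illustration: the function name and the choice of 10 subpools per round are assumptions, not values from the lecture.

```python
import math


def rounds_to_single_clone(pool_size: int, subpools_per_round: int) -> int:
    """Count subdivide-and-retest rounds until the positive pool holds one clone.

    Assumes (hypothetically) that each round splits the positive pool evenly
    into `subpools_per_round` subpools and that exactly one subpool retests
    positive.
    """
    if pool_size < 1 or subpools_per_round < 2:
        raise ValueError("need pool_size >= 1 and subpools_per_round >= 2")
    rounds = 0
    remaining = pool_size
    while remaining > 1:
        # The positive subpool carries roughly 1/subpools_per_round of the clones.
        remaining = math.ceil(remaining / subpools_per_round)
        rounds += 1
    return rounds


# A pool of ~10,000 cDNAs, split 10 ways per round (assumed split factor).
print(rounds_to_single_clone(10_000, 10))
```

Under these assumptions the number of rounds grows only logarithmically with pool size, which is why starting from large pools is practical as long as the assay is sensitive enough to detect one active clone in the pool.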
How are genes identified? - Case studies
• Duchenne muscular dystrophy (DMD) – among the first genes positionally cloned
– One group did genomic subtraction cloning
• Strategy enriched for regions lost in DMD patient
• Made a library and tested clones by Southern blot to normal and DMD
DNA
– 2nd group cloned breakpoints
• Girl with translocation between X and 21
• 21 was rich in rRNA genes so made a
radiation hybrid panel from patient
• Identified hybrid cell carrying the breakpoint
– made a genomic library from it
• Screened library for clones with both rRNA genes and X chromosome
specific sequences
• Mapped this genomic DNA to male patients with DMD and found
deletions in many of them
– DMD gene is largest known – 2.4 megabases
– cDNA cloning followed – protein is dystrophin
How are genes identified? (contd)
• Duchenne muscular dystrophy (contd)
– Today – for a sequenced organism – just go to the database identify
sequences in region of interest and verify by Southern or PCR as above
• Or look in large insert libraries with breakpoints
• Or do cDNA subtraction between tissues from normal individual and
DMD individual
– Presumes knowledge of source of mutation, i.e., the defect
resides in the affected tissue
– Would not detect a defect in inducing factor from other tissue
How are genes identified? (contd)
• Ways to identify genes in regions
– Cross-species hybridization
• Probe another species with this genomic region
• coding sequences are conserved -> should see hybridization where
genes are
• What do you think are limitations to this approach?
– Species must be sufficiently different to reduce “noise” from
overall sequence conservation
» mouse vs human probably not great
» Human vs frog or fish probably good
– Must be sufficiently similar for genes to be conserved
» Human vs frog or fish probably good
» Humans vs yeast only good for common genes
– Target species region needs to be well characterized
• Computer parallels – compare sequence to be annotated with
annotated sequence from a different organism
– e.g., human with Drosophila
– Unknown bacterium with E. coli, etc.
How are genes identified? (contd)
• Ways to identify genes in regions (contd)
– Hybridization to known genes or coding materials
• What are some examples?
– Hybridize to mRNA (Northern blots)
– Hybridize to cDNA libraries (must be right tissue, cell or stage)
– Hybridization to genomic tiling microarrays
– Capture cDNAs or mRNAs from solution
• Computer based parallels
– Compare with expressed sequences from other species
– Compare with ESTs
How are genes identified? (contd)
• Ways to identify genes in regions (contd)
– Identify features found in typical promoters
• What are promoters?
– Regions 5’ to a gene that are required for expression
• CpG islands – regions in eukaryotic genes that are hypomethylated
– Undermethylated - methylation of promoter DNA typically
inhibits gene expression
– Digest with enzymes that have CG in recognition site that would
be inhibited if methylated, e.g., SacII CCGCGG, run gel to check
» If nonmethylated (expressed) enzyme will cut, region will be
hypersensitive, get chopped up.
» If methylated (not expressed) enzyme will not cut and
region will not get digested
• DNAse I hypersensitive (or MNase or MPE)
– Similar principle – transcriptionally active DNA is “open”
– If open, it is more sensitive to DNAse I than non-active DNA
– Test by digestion and gel electrophoresis
• Accessibility to transposase – ATAC-seq
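The computational counterpart of looking for CpG islands can be sketched as follows. The observed/expected CpG ratio used here, (number of CpG × length) / (number of C × number of G), is the commonly used Gardiner-Garden and Frommer measure; it is not taken from the slides, and the function name and toy sequences are illustrative assumptions.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Observed/expected CpG = (#CpG * length) / (#C * #G).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c_count + g_count) / n
    if c_count == 0 or g_count == 0:
        return gc_fraction, 0.0
    return gc_fraction, (cpg_count * n) / (c_count * g_count)


# Toy sequences (made up for illustration).
print(cpg_stats("CGCGCGCG"))   # CpG-rich
print(cpg_stats("ATATATAT"))   # no CpG at all

# A SacII site (CCGCGG, from the slide) contains two CpG dinucleotides.
print("CCGCGG".count("CG"))
```

In practice this would be computed in a sliding window along the genomic region; windows with high GC content and a high observed/expected ratio are candidate CpG islands.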
How are genes identified? (contd)
• The problem with all of these methods is that experiments are required
– What do we do when sequences are coming in at the rate of tens of
gigabases/month?
– Need large-scale, robust, computerized methods to identify genes and
annotate sequences!
• All bioinformatics depends on databases
– UCI bioinformatics has some unique databases (e.g., fuzzy PubMed)
• http://www.igb.uci.edu/tools/databases.html
• Three major databases of sequences (automatically duplicated)
– GENBANK http://www.ncbi.nlm.nih.gov/Genbank/index.html
– DNA Databank of Japan http://www.ddbj.nig.ac.jp/
– European Molecular Biology Laboratory (EMBL)
http://www.ebi.ac.uk/embl/index.html
Dystrophin as an example
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dop
t=default&list_uids=1756
Genome annotation – how to identify genes?
• Gene identification/prediction is important but difficult
– Large variety of methods and algorithms to predict exons
– To identify genes must first identify open reading frames (ORFs)
• When dealing with cDNAs – look for regions that code for proteins
– Do all genes code for proteins? Depends on definition of “gene”
– Correct reading frame for a sequence is assumed to be largest
with no stop codons (TGA, TAA, TAG)
• Lots of tricks can be employed
– Codon frequency for an organism
» Coding sequences follow codon usage
» Noncoding sequences do not, often have lots of stop codons
– Consensus sites
» Kozak translational initiation CCRCCATGG
• What is a very important consideration when searching sequences to
predict ORFs?
– Sequence must be accurate
» Incorrect base calls are troublesome
» But indels (insertions or deletions) are disastrous
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods
– Two major methods are in use
• Homology searches
– Compare with other known sequences
• Ab initio (from the beginning) prediction
– Use algorithms to recognize common features and predict genes
» Promoters
» Splice sites
» Polyadenylation sites
» ORFs
– Generally, microbial genomes are much easier to annotate – WHY? Smaller, no or few introns
• Simply identify ORFs > 300 bp (100 amino acids)
– Works very well
– But can miss small coding sequences
– Must run on both strands because there are shadow genes
(overlap on two strands)
• Using a variety of programs, can predict genes in bacterial genomes
– Venter Sargasso sea paper
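The "simply identify ORFs > 300 bp on both strands" idea can be sketched in a few lines. This is a minimal illustration, not a real gene finder: it assumes ORFs start at the first ATG after a stop codon, reports coordinates on whichever strand was scanned, and the function names and test sequence are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300) -> list[tuple[str, int, int, int]]:
    """Find ATG...stop ORFs of at least min_len bp in all six reading frames.

    Returns (strand, frame, start, end) tuples; start/end are 0-based,
    end-exclusive positions on the scanned strand (for "-" they refer to the
    reverse-complemented sequence).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:   # length includes the stop codon
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


# Toy sequence: one 306-bp ORF (ATG + 100 Ala codons + TAA) in forward frame 2.
toy = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
print(find_orfs(toy))
```

As the slide notes, a length cutoff like this will miss small coding sequences, and it is only reasonable where introns are absent or rare.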
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
– Huge variety of programs available
• Neural networks – attempt to model learning process
– Build decision trees, use probabilistic reasoning
• Rule-based systems
– Rules often not clear
– Have trouble with exceptions
• Hidden Markov models
– Break sequences down into small units based on statistical
analysis of composition
» Hexamers appear to be optimal size to search
– Classify sequences into types or “states”
– Identify transitions between states
– Very useful for large number of purposes
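To make "states" and "transitions" concrete, here is a minimal sketch of a two-state hidden Markov model decoded with the Viterbi algorithm. Everything numeric is an assumption for illustration: the two states ("N" for AT-rich, "G" for GC-rich), the probabilities, and the use of single nucleotides rather than the hexamers mentioned above.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation string.

    Works in log space to avoid numerical underflow on long sequences.
    """
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Toy model (all probabilities are made up for illustration).
states = ("N", "G")
start_p = {"N": 0.5, "G": 0.5}
trans_p = {"N": {"N": 0.9, "G": 0.1}, "G": {"N": 0.1, "G": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "G": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
print(viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p))
```

Gene-finding HMMs follow the same pattern with many more states (exon, intron, intergenic, splice sites) and with probabilities estimated from training data rather than chosen by hand.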
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
– Training sets are used to “teach” programs how to solve problems
• Training set is actual data – genes with known features
– Programs use training sets to classify new data
• Neural networks use training data to build decision trees
• Rule-based systems use training data to generate rules
• HMM build table of probabilities for states and transitions
– Pierre Baldi in IGB is UCI expert in machine learning
• How well do gene predicting programs work?
– Extremely well on bacterial genomes
– Fairly well on simple eukaryotic genomes
– Variable on complex genomes
– Rule of thumb – use a group of programs and look for areas of agreement
among them
– The best current programs combine ab initio predictions with similarity
data to define a probability model
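The rule of thumb of running several programs and looking for areas of agreement can be sketched as an interval sweep. This is a hypothetical illustration: the input format (one list of half-open (start, end) intervals per program, non-overlapping within a program) and the coordinates are assumptions, not the output format of any particular gene predictor.

```python
def consensus_regions(predictions, min_support):
    """Return regions covered by at least `min_support` programs.

    `predictions` holds one list of (start, end) half-open intervals per
    program; intervals from the same program are assumed not to overlap.
    """
    events = []
    for intervals in predictions:
        for start, end in intervals:
            events.append((start, 1))    # a prediction opens
            events.append((end, -1))     # a prediction closes
    events.sort()                        # at equal positions, closes sort first

    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_support <= depth:
            region_start = pos                       # support just reached
        elif depth < min_support <= previous:
            if pos > region_start:
                regions.append((region_start, pos))  # support just lost
            region_start = None
    return regions


# Made-up exon predictions from three hypothetical programs.
program_a = [(100, 500), (900, 1200)]
program_b = [(150, 520)]
program_c = [(140, 480), (1000, 1100)]
print(consensus_regions([program_a, program_b, program_c], min_support=2))
```

Raising `min_support` trades sensitivity for specificity: regions called by all programs are more trustworthy, but fewer real genes will pass.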
Identification of gene function
• You have identified a gene – what is its function?
– Always look for similarity to known sequences
• Swiss-prot is fairly well annotated
• GENBANK translated database is most complete
• BLAST is tool to use
• Amino acid searches more sensitive than nucleotide searches
– Because identical amino acid sequences might only be 67%
identical at nucleotide level
– What might you find?
• Match may predict biochemical and physiological function
– e.g., a known enzyme from another organism
• Match may predict biochemical function only
– e.g., a kinase
• Match a gene from another organism with no known function
– May match ESTs or ORFs from other organisms
• Match a known gene with partly characterized function
– Search leads to clarification of function
• Might not match anything at all
– Expect this will happen less and less
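The point about amino acid searches being more sensitive than nucleotide searches can be checked with a toy example: two DNA sequences that differ at every third (wobble) codon position still encode the same peptide. The sequences and the deliberately tiny codon table below are made up for illustration and cover only the codons used.

```python
# Partial codon table: only the synonymous codons used in this example.
CODON_TABLE = {
    "CTG": "L", "CTA": "L",   # leucine
    "TCC": "S", "TCA": "S",   # serine
    "CGC": "R", "CGA": "R",   # arginine
    "GGC": "G", "GGA": "G",   # glycine
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence using the partial table above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


seq_a = "CTGTCCCGCGGC"
seq_b = "CTATCACGAGGA"   # same peptide, every third base changed

print(translate(seq_a), translate(seq_b))
print(round(percent_identity(seq_a, seq_b), 1))                        # nucleotide level
print(percent_identity(translate(seq_a), translate(seq_b)))            # protein level
```

With one synonymous change per codon, nucleotide identity falls to two thirds while the proteins remain identical, which is the 67% figure quoted above.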
Up-regulated by TTNPB (> 1.5 fold, p < 0.01, n=334)
[Pie chart: functional categories of up-regulated genes]
• Hypothetical – 26% (90)
• Unidentified – 21% (73)
• Transcriptional – 15% (53)
• Housekeeping – 15% (52)
• Signal transduction – 8% (26)
• Cytoskeleton – 4% (14)
• Extracellular matrix – 3% (10)
• Energy metabolism – 3% (9)
• Neural – 2% (7)
• Miscellaneous – 1% (4)
• Retinoid metabolism – 1% (3)
• Tumor suppressor – 1% (2)
Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
– Does the sequence contain an obvious functional motif?
• Homeobox or other consensus DNA binding domain?
• Kinase domain?
• Serine protease, etc.
– InterPro database allows one to compare a protein sequence with whole
family of structural databases
• http://www.ebi.ac.uk/interpro/
HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMIDKRQRN
RCQYCRYQKCLAMGM
• https://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20170213-223854-0435-74150021-es
• Other sorts of similarity searches
• Identify protein secondary structure motifs
– Alpha helix, beta sheets, hydrophobicity
– Amphipathic helices
– Overall polarity of sequences
– Not used much
Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
– Gene ontology (GO) – highly structured vocabulary for gene classification
• Genes are classified using this vocabulary
• Relates protein function with cellular or organismal functions
– Nucleic acid binding
– Cell division
Genome annotation
• Extremely important as number of sequences increases
– Goals are to identify
• all of the sequences
• all of the features of each sequence
• All of the functions of the identified genes
– Sometimes annotation does not agree with known function
• Human error
• New and updated information not propagated to database
• Inaccurate sequencing
• Sometimes annotation is correct but protein lacks function under
certain conditions (e.g., need cofactors)
– Gold standard for functional analysis is loss-of-function analysis
• Most accurate annotation
– Common to have “annotation jamborees” where biologists and
bioinformaticians come together to annotate new sequences
• Xenopus tropicalis jamboree was in Spring 2006
• But many genes and gene models are still unannotated
Which genes regulate what other genes?
• The biggest defect of expression microarrays or transcript profiling is that
neither can distinguish direct targets from indirect targets
– Which genes are a primary response to the treatment vs.
– Which ones have one or more intermediates?
• How can we approach and solve this important problem?
– Identify the genes to which a transcription factor binds
– Identify to which genes RNA polymerase II is recruited
• And from which it is dismissed
• All eukaryotic DNA occurs as chromatin (DNA+histone and other proteins)
– Chromatin conformation influences whether transcription can occur
Which genes regulate what other genes?
— Open chromatin is accessible to transcriptional machinery
• DNA is unmethylated
• Histones are methylated and acetylated (acetylase activates)
• Many transcriptional co-activators methylate and acetylate histones
and other chromatin associated proteins
− Opens the chromatin
Which genes regulate what other genes?
— Closed chromatin is inaccessible to transcriptional machinery
• DNA is methylated
• Some histone tails are methylated, others not
• Transcriptional co-repressors recruit histone deacetylases (HDACs)
which lead to chromatin condensation
• Chromatin condensation leads to gene silencing
• Identification of chromatin-localized proteins is diagnostic for direct target
genes of transcription factors
− Most common application
• Identification of promoters to which RNA Pol II is recruited upon some
treatment is diagnostic for genes directly upregulated by the treatment
− Trickier but very useful
• Identification of promoters from which RNA Pol II is dismissed upon some
treatment is diagnostic for genes downregulated by the treatment
Chromatin immunoprecipitation - ChIP
• ChIP is the only method for large-scale identification of direct
transcriptional targets
• General strategy
– Crosslink proteins to nearby DNA with formaldehyde
• Works for about 2 angstrom distances
• What does this say about the specificity of the interaction?
– Only specific interactions will be reflected in crosslinks
Chromatin immunoprecipitation - ChIP
• ChIP - general strategy (contd)
– Break chromatin into small chunks by sonicating
– Typically want ~500 bp fragments
– Evaluate sonication quality and extent by gel electrophoresis to ensure
that size range is obtained
• Needs MUCH optimization
Chromatin immunoprecipitation - ChIP
• ChIP - general strategy (contd)
– Precipitate chromatin with antibody against protein of interest
• Bind antibody, then capture complex with protein G/protein A beads
• Reverse crosslinks and remove proteins with proteinase K digestion
• Purify DNA away from proteins
• Evaluate enrichment of individual candidate binding sites by PCR
Chromatin immunoprecipitation - ChIP
• Flavors of ChIP commonly in use
– Standard ChIP – one antibody, few targets analyzed
• Most commonly used method
– ChIP-chip – chromatin immunoprecipitation on chip
• Recovered fragments are used to probe microarray of genomic DNA
• Allows identification of novel binding sites
• Requires good genomic microarrays
– Whole genome requires MANY chips (at least 7 for human and mouse) = EXPENSIVE
– May not be available for your target organism
– Affymetrix, Agilent, Nimblegen are sources
Chromatin immunoprecipitation - ChIP
• Flavors of ChIP commonly in use
– ChIP-sequencing (ChIP-seq) – chromatin immunoprecipitation sequencing
• Massively parallel sequencing of recovered fragments
• Unbiased method to identify transcription factor binding sites
• Price now much less than ChIP-chip
• Requires excellent, well-characterized antibody!
Computer-based methods that may help to identify binding sites
• Phylogenetic footprinting – What is it?
– Powerful method to identify regulatory elements in DNA sequences
– Central assumption is that protein coding sequences evolve much more
slowly than nonfunctional DNA sequences (i.e., nonfunctional DNA evolves faster)
• Due to selective pressure on protein function
– Sequences conserved in related organisms likely to be functional
– Species selection
• Must be sufficiently diverged that functional domains stand out
• Sufficiently conserved so that they can be identified
– A variety of algorithms exist – typical approach is to use multiple
programs and look for what is found in common.
Comparative genomics (contd)
• Phylogenetic footprinting comparison of zebrafish, mouse and Xenopus caudal
orthologs (cdx4, cdx4, Xcad3)
– A number of putative conserved elements identified including
– TTCATTTGAATGCAAATGTA
– Absolutely conserved in all 3 promoters
– Compare with database
• http://www.ncbi.nlm.nih.gov/blast/
– Also found in human cdx4 – a good validation of the result
• As more genomes are sequenced and compared, phylogenetic footprinting
becomes a very powerful filter to identify potentially conserved regulatory
sequences
– ECR browser offers precomputed comparisons of conserved elements
• http://ecrbrowser.dcode.org/
• ENCODE folks claim no conservation of elements across species – (ridiculous)
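The core idea of phylogenetic footprinting, finding elements that are identical in promoters from several species, can be sketched with a shared k-mer search. This is a simplification under stated assumptions: real tools work from alignments and tolerate mismatches, the flanking sequences below are invented, and only the 20-bp element itself comes from the slide.

```python
def shared_kmers(seqs, k):
    """Return the set of k-mers present in every sequence (forward strand only).

    Expects at least one sequence; exact matches only, no mismatches allowed.
    """
    kmer_sets = []
    for seq in seqs:
        seq = seq.upper()
        kmer_sets.append({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return set.intersection(*kmer_sets)


ELEMENT = "TTCATTTGAATGCAAATGTA"   # conserved element from the slide

# Toy "promoters": the element embedded in made-up, non-conserved flanks.
zebrafish_like = "GATC" + ELEMENT + "CGTA"
mouse_like = "CCAG" + ELEMENT + "GACC"
xenopus_like = "TTGA" + ELEMENT + "ATTG"

print(shared_kmers([zebrafish_like, mouse_like, xenopus_like], k=20))
```

The species-selection point shows up directly here: if the flanks were also conserved (species too close), many k-mers would be shared and the element would not stand out; if the element itself had diverged (species too distant), nothing would be found.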
1. (8 points) Did you know that there are carnivorous plants that survive in nutrient poor
environments by eating insects? Among these are three types of "pitcher plants", that
all trap insects by drowning them in a sweet liquid contained in a modified leaf that
looks like a pitcher. Interestingly, the Australian, Asian and American pitcher plants all
look very similar and catch insects the same way. However, they are believed to be
completely unrelated biologically. The Australian pitcher plant is thought to be related
to star fruit, the Asian pitcher plant to buckwheat and the American pitcher plant to
kiwifruit. Your group's mission is to determine 1) whether this is an example of
convergent evolution or whether the plants are similar but have been misclassified and
2) what types of adaptations allow these plants to digest insects to extract nutrients
such as phosphorus and nitrogen.
a) (4 points) What approach would you take to determine whether these pitcher plants
are closely related to each other or not? How will you place them among the
evolutionary tree of plants and confirm or refute the classification of taxonomists?
Since you want to determine just how closely related these plants are, and
specifically study their functional adaptations (in b), the best answer would be to
perform whole genome sequencing for the 3 types of pitcher plants. It is 2017, so
you will want to perform Nextgen sequencing, most likely by Illumina Solexa
sequencing. Isolate DNA, generate Illumina libraries, sequence each genome to high
depth of coverage and assemble them with standard bioinformatic tools to generate
draft genome sequences. Then compare these sequences with each other to determine
how closely related they are and then with sequences known from other plants to
accurately place these pitcher plants on the plant phylogenetic tree. Check whether
your classification matches that of taxonomists.
b) (4 points) One hypothesis is that the plants harbor specific microorganisms in their
"pitchers" that enable them to extract nutrients from the insects, not dissimilar from
gut bacteria that enable primates to digest fiber to produce short-chain fatty acids. An
alternative hypothesis holds that the plants have modified proteins that were
originally responsible for cellular defense to produce digestive enzymes that break
down insects. What approach could you take to 1) determine whether the microbial
contents differ significantly between pitcher plants in the same species and among
the 3 different types of pitcher plants? 2) How could you test the hypothesis that a
common cellular enzyme such as purple acid pyrophosphatase has specific amino
acid changes in carnivorous, vs. related non-carnivorous plants?
1) To address whether the microbiomes differ among plants, collect samples from
several (~5) individuals of each species, isolate DNA and perform environmental
shotgun sequencing, much like the Venter paper (but use Nextgen sequencing).
Compare the sequences in each species and between species to identify any
potential similarities and differences.
2) To test the hypothesis that specific changes in common enzymes are found in
carnivorous, vs. non-carnivorous plants, simply compare the sequences between
related carnivorous and non-carnivorous plants and with other plants. This was
actually done and showed that there were common substitutions in totally different
lineages that facilitated a carnivorous mode of obtaining nutrients.
2. (4 points) In an even more bizarre evolutionary development, the Asian pitcher plant
Nepenthes hemsleyana has abandoned catching insects for food and instead has
developed a mutualistic relationship with the wooly bat. N. hemsleyana doesn't produce
much fluid in its pitcher and has developed a shape perfectly complementary to that of
the bat such that the bats roost inside the plant. The bats defecate inside the plant,
providing the plant with nutrients. The closely related species, N. rafflesiana lives in the
same environment and catches insects in the usual way to obtain nutrients. Please
describe how you would identify potential gene candidates that enable N.
hemsleyana to attract bats and utilize their feces for nutrition compared with N.
rafflesiana.
Since you have two closely related species (and already sequenced one of them) it
would be relatively simple to sequence the other and compare what differences are
found between them. This will identify candidate genes that you could use for future
studies, perhaps after selecting those known to be related to nitrogen and
phosphorus metabolism and uptake. The key point is to sequence both and do a
detailed comparison.
3. (8 points) There is a genus of lizards, Geckolepsis, commonly referred to as the fish
scale geckos which are found only in Madagascar. This week, a paper was published
describing a new species, Geckolepsis megalepsis, that has gigantic scales that can
rapidly detach when the lizard is attacked by a predator. The predator is left with a
mouthful of scales while the lizard gets away and regenerates its skin and scales
perfectly (i.e., without scarring) in a few weeks. Other species in the Geckolepsis genus
have large scales (although not as large as G. megalepsis) but lack this rapid
detach/regenerate mechanism - they can lose a few scales but regenerate them
imperfectly. Your group's mission is to identify how G. megalepsis can detach and
regenerate its skin and scales while the spotted fish scale gecko, G. maculata cannot.
a) (4 points) An obvious starting point would be to sequence the genomes of G.
megalepsis and G. maculata. One of the TAs, Ron, has given you an Applied
Biosystems 377 capillary sequencer and 4 PCR machines and suggests that you use
these to do cycle sequencing of the genomes as pioneered by Craig Venter in his
Sargasso Sea paper that we read. Is Ron correct? Can this approach generate
complete genome sequences in one quarter? If he is correct, please explain
why. If he is not correct, please describe succinctly how you will produce a high
quality draft sequence in one quarter.
Ron is incorrect. A capillary sequencer is for Sanger sequencing, not Nextgen
sequencing so you will not be able to come close to even a fragment of one genome
in a quarter – it simply does not have enough capacity for rapid, whole genome
sequencing. I will isolate DNA from the two species of interest, fragment them up to
make Illumina sequencing libraries and do enough sequencing runs to generate the
entire sequence. This could be as few as a single run, depending on the instrument
available. Let the computer assemble this sequence and produce draft genomes. If
you are very industrious, your group might consider adding a different sequencing
method (such as 454 or PacBio) to help resolve gaps.
b) (4 points) The approach your group took in a) was partially successful - you
generated draft genome sequences but these are highly fragmented. The
estimated total genome size is 1.4 gigabases, about half of human. There are 24
chromosomes, but your analysis generated more than 10,000 scaffolds for each
species. Oops. Ron suggests that you quickly generate a radiation hybrid map of
the two genomes to facilitate the assembly since the large phenotypic difference
between two closely related species suggests that there may only be a small
number of actual changes. Is Ron correct? If so, please say why and what you
will need to generate a good RH map. If he is not correct, please explain why
and what method you would use to generate a high quality genome map that
will allow you to assemble the genome. In either case, what markers will you
use and how will you obtain them?
Once again, Ron is incorrect (why is he your TA anyway?) He is wrong because it is
not possible to quickly generate a radiation hybrid panel and map – this could easily
take years. I would instead generate a BAC library from each species, use these for
BAC end sequencing (using old fashioned Sanger sequencing) and then use these
STCs as markers. Generate the map by comparing the BAC end sequences with
your draft genome to see which pieces go where. This will probably take longer
than one quarter, though. Alternatively, you could generate unique markers from
your genome sequencing and perform HAPPY mapping. This would be less
accurate, but perhaps a bit quicker. Either answer is ok if you described how it could
achieve your goals and what markers you used.
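A back-of-the-envelope check on "enough sequencing runs" for question 3: the number of reads needed is roughly coverage × genome size / read length. The 1.4-gigabase genome size comes from part b; the 30x coverage target and 150-bp read length are assumptions chosen only for illustration.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Reads required so that total sequenced bases = coverage * genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


GENOME_SIZE_BP = 1_400_000_000   # estimated gecko genome size from part b
COVERAGE = 30                    # assumed target depth
READ_LENGTH_BP = 150             # assumed read length

print(reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP))
```

The same arithmetic shows why a single Sanger capillary instrument is not a realistic option for a whole genome in one quarter: the total number of bases required is set by genome size and depth, and only a massively parallel platform produces that many reads quickly.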
4. (15 points) Geckolepsis are classified with the largest subgroup of the Family Gekkonidae.
Gekkonidae are found worldwide, but are particularly diverse and species-rich in tropical
areas. Since the diversity of this group is so large, you might reasonably infer that the
rate of evolution within the Gekkonidae is unusually high.
a) (5 points) The next task is to generate a very precise phylogenetic analysis of
representative members of the Family Gekkonidae and the entire Genus Geckolepsis
(which has 5 species). Ron suggests that a microarray analysis would be the most
accurate and fastest way to generate an accurate phylogenetic tree. The other TA,
Riann says that Ron is wrong, but doesn't tell you why. Is Ron correct or not? If he is
correct, outline how you will perform the phylogenetic analysis and determine
which lizards have conserved, constrained or rapidly evolving regions of their
genomes. If Riann is correct, state why and then outline how you would perform
the same analysis.
Ron is still incorrect while Riann is correct. Although there are such things as
"phylogenetic microarrays" we did not discuss them and there is no possibility that
they will be the most accurate and fastest way to generate an accurate phylogenetic
tree for the Geckolepsis Genus together with representative members of the
Gekkonidae. The best approach would be like what was done in the Lindblad-Toh
paper. Collect the available reptile genomic sequences, then identify which species
you will choose from other groups as well as Gekkonidae groups and the 5 species
of Geckolepsis. Generate draft genomes of these, build phylogenetic trees by
computer and analyze to identify conserved, constrained and rapidly evolving
regions of the genome.
b) (5 points) Ron is getting pretty bossy (for a TA) and next decides that you should
look for copy number variations in the genomes of the 5 species of Geckolepsis.
Your group doesn't want to do any extra work and debates whether you should
listen to Ron, or instead start ignoring him and talk to Riann instead. Will the
analysis you have done in 4a be able to reveal most or all of the copy number
variations in the 5 species? If so, please explain why and what aspects of the
analysis you did in a) will provide this information so that you can move on to
the next task. If it will not, please say why not and how you would go about
identifying most or all of the copy number variations in the 5 species. Be sure
to say what materials you needed for your analysis.
It is unlikely that the genome sequences will reveal copy number variations,
although they could give some idea about whether such variations exist. You
will want to generate genome tiling microarrays like in the Redon paper and use
these to identify all of the copy number variations in the 5 species of
Geckolepsis. With such microarrays, you will also be able to identify CNVs
among individuals within a species.
c) (5 points) Unfortunately, neither the phylogenetic analysis in a), nor the CNV
analysis in b) identified why and how G. megalepsis is able to shed its scales/skin
and easily escape predators AND regenerate both skin and scales perfectly. Clearly
the next step is to ask whether the profile of RNA transcripts differs in the skin of
G. megalepsis vs. G. maculata. Once again, Riann and Ron are offering conflicting
advice - Ron wants you to use microarray analysis and Riann says that RNA-seq is
the way to go. 4 people in your group vote for microarrays and 4 for RNA-seq - you
have to break the tie. Both methods will work to some extent - your TAs are smart
people after all. Your goal is to identify all of the transcripts responsible for the
ability of G. megalepsis to shed/regenerate its skin and scales. Please describe
how you will tackle this problem and why your approach will provide the best
chance to identify the gene(s) responsible.
Since you want to identify ALL of the transcripts that could be responsible for
the ability of G. megalepsis to shed/regenerate its skin and scales, RNA-seq is
really the only choice. This is because any sort of expression microarray will
use only known expressed genes as targets on the chip. In contrast, RNA-seq can
identify any sort of transcript, irrespective of whether it is mRNA, lncRNA,
microRNA or any other sort of strange RNA. I would prepare RNA from the
skin of G. megalepsis and G. maculata before and after injury, then perform
RNA-seq and compare which genes are expressed before and after injury. An
acceptable, although probably less effective, approach would be to use whole
genome tiling arrays to analyze which transcripts are produced and perform the
same sort of comparison.
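The before/after comparison in this answer can be sketched as a log2 fold-change filter on read counts. This is a simplified illustration: the transcript names and counts are invented, a pseudocount of 1 is assumed to handle zero counts, and normalization for sequencing depth and statistical testing across replicates are omitted.

```python
import math


def log2_fold_change(before: float, after: float, pseudocount: float = 1.0) -> float:
    """log2 of (after / before), with a pseudocount to tolerate zero counts."""
    return math.log2((after + pseudocount) / (before + pseudocount))


def changed_transcripts(counts_before, counts_after, min_abs_log2fc=1.0):
    """Return {transcript: log2FC} for transcripts changing at least 2-fold by default."""
    changed = {}
    for transcript in sorted(set(counts_before) | set(counts_after)):
        lfc = log2_fold_change(counts_before.get(transcript, 0),
                               counts_after.get(transcript, 0))
        if abs(lfc) >= min_abs_log2fc:
            changed[transcript] = lfc
    return changed


# Hypothetical read counts from skin before and after injury.
before_injury = {"tx_a": 24, "tx_b": 100, "tx_c": 0}
after_injury = {"tx_a": 99, "tx_b": 110, "tx_c": 7}
print(changed_transcripts(before_injury, after_injury))
```

Running the same comparison in both species and asking which transcripts respond in one gecko but not the other narrows the list to candidates for the shed/regenerate phenotype.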