BioSci D145 Lecture #4
• Bruce Blumberg ([email protected])
– 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)
– phone 824-8573
• TA – Riann Egusquiza ([email protected])
– 4351 Nat Sci 2– office hours M 1-3
– Phone 824-6873
• Check e-mail and noteboard daily for announcements, etc.
– Please use the course noteboard for discussions of the material
• Updated lectures will be posted on web pages after lecture
– http://blumberg-lab.bio.uci.edu/biod145-w2017
• Last year’s midterm is now posted.
• Term paper outlines due Friday (2/3) by midnight.
BioSci D145 lecture 1
page 1
©copyright
Bruce Blumberg 2014. All rights reserved
Term paper outline
• Title of your proposal
• A paragraph introducing your topic and explaining why it is important; i.e.,
what impact will the knowledge gained have.
– Why should any funding agency give you money to pursue this research?
• NIH now requires a statement of human health relevance for all grant
applications
• NSF wants to know the intellectual merit of your proposed
research and its broader impacts
• Present your hypothesis
– A supposition or conjecture put forth to account for known facts; esp. in
the sciences, a provisional supposition from which to draw conclusions
that shall be in accordance with known facts, and which serves as a
starting-point for further investigation by which it may be proved or
disproved and the true theory arrived at.
• Enumerate 2-3 specific aims in the form of questions that test your
hypothesis
– At least one of these aims needs to have a strong “whole genome”
component
Modern DNA sequence analysis
• Cycle sequencing
– Virtually all commercial DNA sequencing today is done by cycle
sequencing with fluorescent ddNTPs
• ABI Big Dye chemistry
– Template preparation still tedious for small scale
• TempliPhi used in genome centers (obviated need for most
automation)
– Capillary sequencers predominant form of technology in use
• But, next generation sequencing is already coming online and will rapidly
displace old technology in genome centers.
– 454 sequencing (Roche)
– Solexa (Illumina)
– SOLiD (Applied Biosystems)
• 3rd generation sequencing (individual DNA molecule) now available
– e.g., Pacific Biosciences (sequence reads of 1,000-10K bases)
Other sequencing technologies
• Sequencing by hybridization
– Construct a high-density microchip with all possible combinations of a short oligonucleotide
• Up to 25-mers
• By photolithography
– Synthesized on chip directly
– Label and hybridize fragment to be sequenced
– Wash stringently
– Read fluorescent spots
– Reconstruct sequence by computer
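The "reconstruct sequence by computer" step can be sketched in Python. This is a toy illustration only, not the algorithm used by any commercial array: it assumes every k-mer in the target gives a clean positive spot, with no hybridization errors and no repeated (k-1)-mers. The function names `spectrum` and `reconstruct_from_kmers` are hypothetical.

```python
def spectrum(seq, k):
    """All k-mers present in seq (what an ideal chip would report as positive spots)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct_from_kmers(kmers):
    """Rebuild a sequence from its k-mer set by chaining (k-1)-base overlaps."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the only one whose prefix is not the suffix of another k-mer.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("ambiguous or circular spectrum")
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    sequence = starts[0]
    used = {starts[0]}
    while len(used) < len(kmers):
        overlap = sequence[-(k - 1):]
        candidates = [c for c in by_prefix.get(overlap, []) if c not in used]
        if len(candidates) != 1:
            raise ValueError("repeat or gap: cannot extend uniquely")
        sequence += candidates[0][-1]  # each new k-mer adds one base
        used.add(candidates[0])
    return sequence


target = "ATGGCGTGCAATCC"
print(reconstruct_from_kmers(spectrum(target, 4)) == target)
```

The `ValueError` branches are where this approach breaks down: any repeat longer than the probe makes the path ambiguous, which is one reason the method suits resequencing better than de novo work.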
Other sequencing technologies (contd)
• Sequencing by hybridization rarely used for de novo sequencing
– Extremely fast and useful when you already know the sequence and want to
identify mutations (resequencing)
– Disease causing changes
• e.g., in mitochondrial DNA
– SNP discovery
– Works best for examining sequence of <10 kb
Other sequencing technologies (contd)
• http://www.affymetrix.com/products/arrays/index.affx
• SNP discovery
– Photo shows
mitochondrial chip
– Right panel shows pairs
of normal (top) vs
disease (bottom)
(Leber’s Hereditary
Optic Neuropathy)
• Top 3 disease
mutations
• Bottom control
with no change
Other sequencing technologies – Next Generation sequencing
• 2nd generation = high throughput, short sequences
• 3rd generation = single molecule sequencing
• Small number of sequence templates (thousands) but very long reads
(~10^5 bp)
• What is the immediate implication of this technology for genome
assembly?
We should now be able to completely sequence large insert clones
directly and avoid fragmentation by repetitive elements!
• Key review is Metzker, M.L. (2010) Sequencing technologies — the next
generation, Nature Reviews Genetics 11, 31-46.
Other sequencing technologies (contd)
• Illumina (Solexa) sequencing
– https://www.illumina.com/content/dam/illuminamarketing/documents/products/illumina_sequencing_introduction.pdf
– Based on synthesis of complementary strand to a template (like Sanger)
• Detection of elongation with labeled terminators
– Steps
• Library generation - fragment genome to appropriate size (depends
on application) and add adapters to each end
• Cluster generation – capture fragments on lawn of oligos and amplify
• Sequencing – reversible terminator
• Data analysis –
– align reads to reference genome
– Analysis of reads
Other sequencing technologies (contd)
• Illumina sequencing (contd)
– Library preparation – fragment target and add adapters.
• Can multiplex to gain additional capacity
• That is, HiSeq X can generate 1.8 Tb of data per run, but most
applications don’t need this much, so use different adapters to
“bar-code” samples and pool them in one run.
• Bar coding sequence analysis
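A minimal sketch of bar-code analysis (demultiplexing) in Python. The 4-base barcodes, sample names, and reads are invented for illustration, and the barcode is assumed to sit at the start of each read with no sequencing errors; real Illumina runs typically read the index separately and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Sort pooled reads into per-sample bins by exact barcode match at the read start."""
    bins = defaultdict(list)
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])  # trim the barcode off
                break
        else:
            bins["unassigned"].append(read)  # no barcode matched
    return dict(bins)


barcodes = {"control": "ACGT", "treated": "TGCA"}  # hypothetical barcodes
reads = ["ACGTGGATCC", "TGCATTTAGG", "ACGTCCCGGA", "GGGGAAAACC"]
result = demultiplex(reads, barcodes)
print(result)
```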
Other sequencing technologies (contd)
• Deep sequencing
– What is the point?
• Can generate huge number of reads in parallel
• MiniSeq – 7.5 Gb (25 million reads/run 2 x 150 bp)
• MiSeq – 15 Gb (25 million reads/run 2 x 300 bp)
• NextSeq – 120 Gb (400 million reads/run 2 x 150 bp)
• HiSeq – 1.5 Tb (5 billion/run 2 x 150 bp)
• HiSeq X – 1.8 Tb (6 billion/run 2 x 150 bp)
• What is massively parallel sequencing good for?
– Rapid sequencing of genomes, or resequencing of known sequences
– Ancient DNA (even dinosaurs? – Svante Pääbo says ~200K years is limit)
– ChIP-sequencing (week 6)
– Sequencing ESTs or other tags
– Determining microbial diversity in field samples
– Transcriptome sequencing
– Identifying variations in
• Viral populations
• Gene sequences in mixed populations
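The instrument capacities listed on this slide follow from simple arithmetic: output = reads x 2 ends x read length. The sketch below checks two of them and estimates how many bar-coded samples fit in one run. It assumes "reads/run" on the slide means read pairs (clusters); the 3.1 Gb genome size and 30x coverage are illustrative assumptions, and the function names are hypothetical.

```python
def run_output_gb(clusters_millions, read_length_bp, paired_end=True):
    """Total bases per run in Gb, given clusters (in millions) and read length."""
    ends = 2 if paired_end else 1
    return clusters_millions * 1e6 * ends * read_length_bp / 1e9


def samples_per_run(run_gb, genome_gb, coverage):
    """How many bar-coded samples fit in one run at the requested coverage."""
    return int(run_gb // (genome_gb * coverage))


miniseq_gb = run_output_gb(25, 150)    # MiniSeq figures from the slide
hiseq_x_gb = run_output_gb(6000, 150)  # HiSeq X figures from the slide
# Assumed example: 3.1 Gb human genome sequenced to 30x coverage
human_genomes = samples_per_run(hiseq_x_gb, genome_gb=3.1, coverage=30)
print(miniseq_gb, hiseq_x_gb, human_genomes)
```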
Amplicon sequencing
• Idea is to sequence many copies of the same thing
– Gene sequence
– mRNA transcript
Amplicon sequencing (contd)
• What is amplicon sequencing good for?
– Discovery of rare somatic mutations in complex samples (e.g., cancerous
tumors - mixed with germline DNA) based on ultra-deep sequencing of
amplicons
– Sequencing collections of exons from populations of individuals to
identify diversity
– Sequencing collections of human exons from populations of individuals
for the identification of rare alleles associated with disease
– Analysis of viral quasispecies present within infected populations in the
context of epidemiological studies
– Evolutionary biology in populations
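Why "ultra-deep" sequencing is needed for rare variants can be illustrated with a binomial model. This is a simplified sketch: it assumes each read independently samples the variant allele at its true frequency and ignores sequencing error, which in practice sets a floor on detectable frequencies. The function names and the 95% power threshold are assumptions.

```python
from math import comb


def prob_at_least(k, depth, allele_fraction):
    """P(at least k variant reads) when reads ~ Binomial(depth, allele_fraction)."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


def min_depth(k, allele_fraction, power=0.95):
    """Smallest depth giving at least `power` chance of seeing k or more variant reads."""
    depth = k
    while prob_at_least(k, depth, allele_fraction) < power:
        depth += 1
    return depth


# A mutation present in 1% of molecules
print(prob_at_least(1, 100, 0.01))  # chance of seeing it at all at 100x
print(min_depth(1, 0.01))           # depth needed to see it at least once
print(min_depth(5, 0.01))           # depth needed for 5 supporting reads
```

Requiring several supporting reads rather than one is a common guard against calling sequencing errors as mutations, and it raises the required depth considerably.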
The human genome
• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human
genome sequences
– Celera -> 39114
– Ensembl -> 29691
– Consensus from all sources ~30K
• Number of genes
– C. elegans – 19,000
– Arabidopsis - 25,000
• Predictions had been from 50-140k human genes
– What’s up with that?
– Are we only slightly more complicated than a weed?
– How can we possibly get a human with fewer than 2x the number of genes
as C. elegans?
– Implications?
• UNRAVELING THE DNA MYTH: The spurious foundation of genetic
engineering, Barry Commoner, Harper’s Magazine, Feb 2002
The human genome
• The answer – Gene sets don’t overlap completely (duh)
– Floor is 42K
– 130,029 UniGene clusters in build #236 (from EST and mRNA sequencing)
– http://www.ncbi.nlm.nih.gov/unigene
– Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 in previous
years) (“final” count)
• Important questions to be
answered about what
constitutes a “gene”
– Crick genes?
DNA-RNA-protein
– How about RNAs?
– miRNAs?
– Antisense transcripts?
– lncRNAs?
= 42113
Genome sequencing (contd)
– Whole genome shotgun sequencing (Celera)
• premise is that rapid generation of draft sequence is valuable
• why bother trying to clone and sequence difficult regions?
– Basically just forget regions of repetitive DNA - not cost effective
• using this approach, genomes rarely are completely finished
– rule of thumb is that it takes at least as long to finish the last 5%
as it took to get the first 95%
• problems
– sequence may never be as complete as that of C. elegans
– much redundant sequence with many sparse regions and lots of
gaps.
– Fragment assembly for regions of highly repetitive DNA is dubious
at best
– “Finished” fly and human genomes lack more than a few already
characterized genes
Genome sequencing (contd)
• Knowing what we know now – how to approach a large new genome?
– Xenopus tropicalis 1.7 Gb (about ½ human)
– BAC end sequencing
– Whole genome shotgun
– HAPPY mapping and radiation hybrid mapping to order scaffolds
– Gaps closed with BACs
– 8.5 x coverage (but > 9000 scaffolds for 18 chromosomes)
– Finishing now in process
• But how “finished” will it be?
• 2016 update – now version 9.0
– FINALLY integrated BAC end sequences
– Integrated genetic map
– 50% of contigs > 72 kb
– Xenopus laevis – v9.1 –
• >90% of genome in chromosomal scaffolds
• 2 “subgenomes” fully characterized.
• annotation remains a big challenge.
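The 8.5x coverage figure can be put in context with the idealized Lander-Waterman model, in which the chance that a given base is missed by every read is e^(-c) at coverage c. The sketch below is hedged: it assumes reads land uniformly at random and ignores repeats, cloning bias, and minimum overlap requirements. The 700 bp read length is an assumption (typical of capillary reads), not a number from the slide.

```python
from math import exp


def fraction_uncovered(coverage):
    """Idealized fraction of bases hit by no read at the given fold coverage."""
    return exp(-coverage)


def expected_gaps(genome_bp, read_bp, coverage):
    """Idealized expected number of gaps (about the number of contigs), ignoring overlap length."""
    n_reads = coverage * genome_bp / read_bp
    return n_reads * exp(-coverage)


genome_bp = 1.7e9  # Xenopus tropicalis, from the slide
print(fraction_uncovered(8.5))
print(expected_gaps(genome_bp, read_bp=700, coverage=8.5))
```

Even under these ideal assumptions a shotgun assembly at this depth is left with thousands of gaps, and repetitive DNA only makes the real situation worse, which is why mapping data and BACs are needed to order scaffolds and close gaps.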
Functional Genomics - Analysis of gene function on a whole genome basis
• Genome projects
– DNA sequencing
– Human genome, mouse, rat, Drosophila, C. elegans “finished”
– model organisms progressing rapidly
– Lots of new genes, but many lack known function
• Functional genomics
– Identification of gene functions
• associate functions with new genes coming from genome projects
• function of genes identified from characterizing diseases or mutants
– Identification of genes by their function
• discovery of new genes
*Methods of profiling gene expression – large scale to whole genome
• What are the possibilities
– Array – micro or macro
– Sequence sampling (EST generation)
– SAGE – serial analysis of gene expression
– Massively parallel signature sequencing (RNA-seq, Illumina, 454)
• DNA microarray analysis was, until recently, the dominant method
– Two basic flavors
• Spotted (spot DNA onto support)
– cDNA microarrays
– Oligonucleotide arrays
– Moderately expensive
• Synthesized (use photolithography to synthesize oligos onto silicon or
other suitable support)
– Affymetrix Gene Chips dominate
– VERY expensive
– Both are in wide use and suitable for whole genome analysis
Spotted arrays
• Source material is prepared
– cDNAs are PCR amplified OR
– Oligonucleotides synthesized
• Spotted onto treated glass slides
• RNA prepared from 2 sources
– Test and control
• Labeled probes prepared from RNAs
– Incorporate label directly
– Or incorporate modified NTP
and label later
– Or chemically label mRNA directly
• Hybridize, wash, scan slide
• Express as ratio of one channel to
other after processing
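A minimal sketch of expressing two-channel data as a ratio. It computes log2(test/control) per spot and then centers the values on the median, a simple normalization that assumes most genes do not change between samples. The intensities are invented, and the `floor` parameter (to avoid dividing by zero or taking the log of zero) is an assumption of this sketch rather than a standard setting.

```python
from math import log2
from statistics import median


def log_ratios(test, control, floor=1.0):
    """Median-centered log2(test/control) ratio for each spot."""
    raw = [log2(max(t, floor) / max(c, floor)) for t, c in zip(test, control)]
    center = median(raw)  # shift so the typical spot has a ratio of 0
    return [r - center for r in raw]


cy5 = [200, 400, 800, 100, 1600]  # test channel (made-up intensities)
cy3 = [100, 200, 400, 50, 100]    # control channel
ratios = log_ratios(cy5, cy3)
print(ratios)
```

In this made-up example the test channel is uniformly twice as bright, which the median centering treats as a labeling or scanning difference; only the last spot stands out as changed.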
DNA microarray types
• Stanford type microarrayer
– http://cmgm.stanford.edu/pbrown/mguide/index.html
• Printing method
– Reminiscent of fountain pen
Strategy to identify RAR target genes
• Treatments: agonist (TTNPB) and antagonist (AGN193109)
• Harvest at stage 18
• Isolate poly A+ RNA from each sample
• Prepare amino-allyl labeled 1st strand cDNA from each RNA
• Label with Alexa Fluor 555 (Cy3) or Alexa Fluor 647 (Cy5)
• Probe microarrays
• Readout: upregulated vs. downregulated genes
DNA microarray
• Statistical analysis of output – VERY IMPORTANT!
• Replicates are very important
• Preprocessing of data is needed
– To remove spurious signals
DNA microarray
• Advantages
– Custom arrays possible and
affordable
– Ratio of fluorescence is robust
and reproducible
• Disadvantages
– Availability of chips
– Expense of production on your own
– Technical details in preparation
Affymetrix GeneChips
• High density arrays are synthesized directly on support
– 4 masks required per cycle -> 100 masks per chip (25-mers)
– Pentium IV requires about 30 masks
– G.P. Li in Engineering directs a UCI facility that
can make just about anything using photolithography
Affymetrix GeneChips
Streptavidin/phycoerythrin
Affymetrix GeneChips
– Each gene is represented by a series of oligonucleotide pairs
• One perfect match
• One with a single mismatch
– Only hybridization to the perfect match
but not the mismatch is considered to be real
– Gene is considered “detected” if
> ½ of oligo pairs are positive
– Number of pairs depends on
organism and how well
characterized array behavior is
• Human uses 8 pairs
• Xenopus uses 16 pairs
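The detection rule described on this slide (more than half of the oligo pairs positive) can be sketched as follows. The fixed `min_ratio` cutoff for calling a pair positive is a hypothetical simplification; the vendor's software uses a statistical test rather than a fixed threshold. The intensities are invented.

```python
def gene_detected(probe_pairs, min_ratio=1.5):
    """Call a gene detected if more than half of its (PM, MM) pairs are positive.

    probe_pairs: list of (perfect_match, mismatch) intensities.
    min_ratio:   assumed cutoff; a pair is positive when PM > min_ratio * MM.
    """
    positive = sum(1 for pm, mm in probe_pairs if pm > min_ratio * mm)
    return positive > len(probe_pairs) / 2


# Eight probe pairs, as for a human array
expressed = [(900, 200), (850, 300), (400, 390), (1200, 250),
             (700, 650), (950, 100), (300, 310), (1000, 400)]
borderline = [(900, 200), (850, 300), (400, 390), (1200, 250),
              (700, 650), (950, 100), (300, 310), (500, 480)]
print(gene_detected(expressed), gene_detected(borderline))
```

The second gene has exactly half of its pairs positive, so under the "> ½" rule it is not called detected.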
Affymetrix GeneChips
• Result is in single color
– Always need two chips – control and experimental for each condition
– Also need replicates for each condition
– For diverse biological samples (e.g., humans) 10 replicates required!
– For less diverse samples (cell lines) probably 5 replicates needed
• Advantages
– Commercially available
– Standardized
• Disadvantages
– About $700 to buy, probe and
process each chip (at UCI)!
• About $500 elsewhere
– May not be available for your
organism of interest
– No ability to compare probes
directly on the same chip
• Must rely on chip-to-chip reproducibility of the technology
DNA microarrays
• What are they good for?
– Identifying genes expressed in one condition vs. another
• One tissue vs. another (heart vs liver)
• Tissue vs. tumor
(liver vs. hepatocarcinoma)
• In response to a treatment
(e.g., RA)
• In response to disease
(e.g., after viral infection)
– Building expression profiles
• Tissues
• Cancers
• Developmental stages
• Expressed genes
– Identifying organisms in food
• Array can identify which
animals are present in a mix
• http://www.dnavision.com/files/FOODIDBrosh%20En.pdf
DNA microarrays
• What are they good for? (contd)
– Response of animal to drugs or chemicals
• Toxicogenomics
• Pharmacogenomics
– Diagnostics
• SNP analysis to identify disease loci
• Specific testing for known diseases
DNA microarrays
• What are the limitations of microarray technology? What sorts of factors
might confound the experiment?
– Signal intensity (or signal/noise)
• Improved dyes, label uniformly
– Biological variation (samples are inherently different)
• Sufficient # of replicates is key
• keep individuals separate
– Not all mRNAs will be present at sufficient levels to detect
• Amplification, but beware of bias
– Good statistical analysis is required
• Bayesian statistics are best (Pierre Baldi is local expert)
– calculating the probability of a new event on the basis of earlier
probability estimates which have been derived from empiric data
– i.e., don’t assume random distribution in datasets, calculate
probability based on real data
– Bayesian approach great for small number of replicates,
converges on t-test at high number of replicates
• http://cybert.microarray.ics.uci.edu/
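The idea of a Bayesian (regularized) t-test can be sketched as below. This is an illustration in the spirit of the approach, not the Cyber-T implementation: the variance formula, the `prior_weight` of 10, and the way `background_var` is supplied are all assumptions of this sketch. In practice the background variance would be estimated from other genes with similar expression levels.

```python
from math import sqrt
from statistics import mean, variance


def regularized_variance(values, background_var, prior_weight=10):
    """Blend a gene's own sample variance with a background variance estimate."""
    n = len(values)
    return (prior_weight * background_var + (n - 1) * variance(values)) / (
        prior_weight + n - 2
    )


def regularized_t(group_a, group_b, background_var, prior_weight=10):
    """t-like statistic computed with regularized variances."""
    var_a = regularized_variance(group_a, background_var, prior_weight)
    var_b = regularized_variance(group_b, background_var, prior_weight)
    se = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Three replicates that happen to agree almost perfectly (made-up log intensities)
treated = [8.01, 8.00, 8.02]
control = [7.51, 7.50, 7.52]
print(regularized_t(treated, control, background_var=0.25))
```

With only three replicates, a chance-low sample variance would give an ordinary t-test a huge statistic; pulling the variance toward the background tempers that. As the number of replicates grows, the gene's own variance dominates and the statistic approaches the ordinary t-test.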
Other methods of transcriptome analysis - parallel
• Microarray was once the dominant
method
– Direct RNA sequencing methods
are rapidly displacing microarrays
– SAGE (serial analysis of gene
expression)
• Nanostring is modern
implementation
• Short sequences
– RNAseq
• Directly sequence large
numbers of RNAs
• Longer sequences
• SAGE
– Relies on generating many very
short sequences and matching
these to the genome
– 10 bp = short SAGE
– 17 bp = “long” SAGE
Other methods of transcriptome analysis - parallel
• SAGE (continued)
– What is the obvious shortcoming
of this method?
– Sequences may not be unique and
could have difficulty mapping to
the genome
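A back-of-the-envelope calculation shows why short tags are often not unique. The sketch assumes a random genome with equal base frequencies and counts both strands. Real genomes are not random, and SAGE tags come from defined positions next to a restriction site, so these numbers are rough guides only; the 3 Gb genome size is an assumption.

```python
def expected_chance_matches(tag_length, genome_bp, both_strands=True):
    """Expected number of places a random tag of this length occurs by chance."""
    strands = 2 if both_strands else 1
    return strands * genome_bp / 4**tag_length


human_bp = 3e9  # assumed genome size
for tag_length in (10, 17):
    print(tag_length, expected_chance_matches(tag_length, human_bp))
```

A 10 bp tag is expected to occur thousands of times in a genome this size, whereas a 17 bp tag is expected less than once, which is the motivation for "long" SAGE.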
Other methods of transcriptome analysis - parallel
• RNA seq – Ali Mortazavi is
local expert
– Use of massively parallel
sequencing allows precise
quantitation of transcript
– Also allows discovery of
rare splice forms
– Discovery of unexpected
transcripts
– Main problem is in
mapping sequence calls to
genome
• Sequencing has 1-2%
errors which can make
mapping to genome
fail
• or induce “in silico
cross-hybridization”
– Mapping to
incorrect genomic
location
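The mapping problem can be illustrated with a toy example. The first function shows how quickly errors accumulate along a read; the second is a naive mismatch-tolerant mapper, not a real aligner. The genome, read, and 1% error rate are invented for illustration.

```python
def prob_error_free(read_length, error_rate):
    """Chance that a read contains no sequencing errors at all."""
    return (1 - error_rate) ** read_length


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, genome, max_mismatches=1):
    """All start positions where the read matches within max_mismatches."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        if hamming(read, genome[start:start + len(read)]) <= max_mismatches:
            hits.append(start)
    return hits


print(prob_error_free(150, 0.01))  # most 150 bp reads carry at least one error

genome = "ACGTACGAACGTTCGA"
read = "ACGTTC"  # true origin is position 0 (ACGTAC) with one base misread
print(map_read(read, genome, max_mismatches=0))
print(map_read(read, genome, max_mismatches=1))
```

With no mismatches allowed, the misread sequence maps perfectly to the wrong location (position 8), an instance of "in silico cross-hybridization". Allowing one mismatch recovers the true origin but leaves the read ambiguous between two sites.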
Microarray vs. RNAseq
• RNAseq
– No assumption re transcripts but need genome sequence
– Can discover novel sequences or new splice forms not yet characterized (if you have genome)
– Detection limits are not a problem – can detect small numbers of transcripts
– Getting better, expression analysis can be quantitative
• Microarray
– Assumes you know all the transcripts
– Any sequence you did not know was expressed will not be there.
• except whole genome tiling arrays – Kapranov paper
– Detection limit issues
• Signal-noise ratio
– Well validated, expression analysis can be quantitative