Department of Health Information Management


Genomics and Personalized
Care in Health Systems
Lecture 10. High Throughput Technologies
Leming Zhou, PhD
School of Health and Rehabilitation Sciences
Department of Health Information Management
Outline
• Polymerase Chain Reaction (PCR)
• Genome Sequencing
• Microarray
• Pathway Analysis
Polymerase Chain Reaction (PCR)
• A technique that allows us to generate a large number of copies of a
particular DNA sequence from an extremely small sample
• Procedure:
– Determine one particular sequence – the target sequence
– Mix the sample with primers, DNA polymerase, and nucleotides to build new DNA strands
– Apply repeated cycles of heating, cooling, and reheating to the mixture
• The number of copies of the target in the mixture grows
exponentially with the number of cycles
• Primer selection is critical. The primers should be at
least 15-20 bases long to ensure specificity.
• If you are unsure of the exact sequence, you can use a
mixture of primers (varying at the third codon position)
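The exponential growth described above can be made concrete with a small sketch. It assumes an idealized reaction in which every cycle multiplies the copy number by (1 + efficiency); the function name `pcr_copies` and the `efficiency` parameter are illustrative and do not come from the lecture.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of target copies after `cycles` PCR cycles.

    efficiency = 1.0 means perfect doubling in every cycle, which is an
    idealization; real reactions fall below this and eventually plateau.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Copy number starting from a single template molecule.
    for n in (10, 20, 30):
        print(n, "cycles:", pcr_copies(1, n))
```

With perfect doubling, 30 cycles turn one template molecule into 2^30 (about one billion) copies, which is why a trace sample is enough.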
PCR
Figure: double-stranded DNA target and primers.
Primers are complementary to opposite ends of the target sequence.
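As a rough illustration of what "complementary to opposite ends" means, the sketch below picks a primer pair for a target given as its top strand written 5' to 3'. The helper names, the default length of 18, and the example sequence are assumptions made for this example; real primer design also considers melting temperature, GC content, and self-complementarity.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def primer_pair(target, length=18):
    """Return (forward, reverse) primers for a target given 5'->3'.

    The forward primer matches the start of the top strand, so it anneals to
    the bottom strand. The reverse primer is the reverse complement of the
    end of the top strand, so it anneals to the top strand.
    """
    if len(target) < 2 * length:
        raise ValueError("target is too short for two primers of this length")
    forward = target[:length].upper()
    reverse = reverse_complement(target[-length:])
    return forward, reverse


if __name__ == "__main__":
    # Hypothetical 14-base target, short primers for readability.
    print(primer_pair("ATGCGTACGTTAGC", length=4))
```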
PCR Applications
• Making a lot of protein
– Use RT-PCR (“reverse transcriptase” PCR) to create cDNA from mRNA, i.e. DNA with
introns removed, and then insert it into bacteria to clone the gene,
e.g. to make proteins for X-ray crystallography.
• Medical diagnosis
– Detect HIV nucleic acid (viral RNA or proviral DNA) long before AIDS symptoms arise
– Rapid tuberculosis test
• Forensics
– Detect trace amounts of DNA at a crime scene
• …
Genome Sequencing
DNA Sequencing
• The process of determining the order of the nucleotide
bases along a DNA strand
• In 1977 two separate methods for sequencing DNA were
developed:
– Chain termination method (Sanger et al.)
– Chemical degradation method (Maxam and Gilbert)
– Both methods were equally popular at first but, for many
reasons, the chain termination method later became the more
commonly used one
• Chain termination method is based on the principle that
single-stranded DNA molecules that differ in length by
just a single nucleotide can be separated from one
another using polyacrylamide gel electrophoresis
Chain Termination Method
• Idea: If we know the distance of each type of base from a known
origin, then it is possible to deduce the sequence of the DNA.
• For example, if we knew that there was:
– an A at positions 2, 3, 11, 13, ...
– a G at positions 1, 12, ...
– a C at positions 6, 7, 8, 10, 15, ...
– a T at positions 4, 5, 9, 14, ...
then we could reconstruct the sequence
• Obtaining this information is conceptually simple. The idea is to
cause termination of a growing DNA chain at a known base (A, G, C
or T) and at a known location in the DNA
• In practice, chain termination is caused by including a small
amount of a single dideoxynucleotide in the mixture of all four
normal nucleotides (e.g. dATP, dTTP, dCTP, dGTP and ddATP). The small
amount of ddATP causes chain termination wherever it is
incorporated into the DNA.
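The reconstruction idea can be sketched directly with the positions from the example on this slide. The function name `reconstruct_sequence` is illustrative, and only the first 15 positions are used because the slide's lists are elided after that.

```python
def reconstruct_sequence(positions_by_base):
    """Rebuild a sequence from the 1-based positions of each base.

    Positions that no base claims are reported as 'N'.
    """
    length = max(p for positions in positions_by_base.values() for p in positions)
    sequence = ["N"] * length
    for base, positions in positions_by_base.items():
        for p in positions:
            if sequence[p - 1] != "N":
                raise ValueError(f"position {p} is claimed by two bases")
            sequence[p - 1] = base
    return "".join(sequence)


# Positions taken from the slide's example (first 15 bases only).
example = {
    "A": [2, 3, 11, 13],
    "G": [1, 12],
    "C": [6, 7, 8, 10, 15],
    "T": [4, 5, 9, 14],
}

if __name__ == "__main__":
    print(reconstruct_sequence(example))  # GAATTCCCTCAGATC
```

In the laboratory, each of the four termination reactions supplies the list of positions for one base, read from fragment lengths on the gel.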
Automatic DNA sequencing
Whole Genome Shotgun Sequencing
Metrics for Evaluating Sequencing Methods
• Throughput
– Number of high quality bases per unit time
– Difficulty of sample preparation
– Number of independent samples run in parallel - multiplexing
• Yield
– Number of useful reads per sample
– Read length
• Cost
– Per run and per base; Equipment; Reagents; Infrastructure;
Labor; Analysis
• The goal of all new sequencing technologies is to increase
throughput and yield while reducing cost
Sanger Sequencing
• Radiolabeled dideoxyNTPs
• 800 bp reads
• Low throughput (several kb/gel)
Next Generation Sequencing
• Increasing sequencing production
– Massive parallelization
– Reduction in per-base cost
– Eliminate need for huge infrastructure
– Millions of reads (>1 Gb sequence per run)
• Technologies
– 454
– SOLiD
– Illumina
– …
• Challenges
– Read length
– Quality
– Data analysis
454
• Throughput & Yield
– 1 million 400 bp reads/10 hour run
– >8 samples/run (more with barcoding)
• Cost
– Machine: $500k; reagents ~$8k/run
• Issues
– High indel rate in homopolymers
– Longer reads, but fewer of them than other systems
Other Short Read Technologies
• Illumina
– Sequencing by synthesis
– 100 million 36-75 bp reads/run
– $6500 in reagent cost/run
– 3-6 day run time
• SOLiD
– Sequencing by ligation
– ~400 million 35-50 bp reads/run
– ~$5000 in reagent cost/run
– 3-6 day run time
• Helicos
– Sequencing by synthesis
– No amplification
– 750 million reads/run
– $18k run cost
– 8 day run time
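The throughput and cost metrics can be compared across platforms with some simple arithmetic. The sketch below uses the read counts and reagent costs quoted on these slides, with two assumptions: the upper end of each quoted read-length range is used, and the 454 reagent cost is taken as roughly $8,000 per run. Helicos is left out because no read length is given for it. All names in the code are illustrative.

```python
# Figures from the slides; read_len is the upper end of the quoted range
# and the 454 reagent cost of about $8,000 per run is an assumption.
PLATFORMS = {
    "454": {"reads": 1_000_000, "read_len": 400, "reagent_cost": 8_000},
    "Illumina": {"reads": 100_000_000, "read_len": 75, "reagent_cost": 6_500},
    "SOLiD": {"reads": 400_000_000, "read_len": 50, "reagent_cost": 5_000},
}


def megabases_per_run(platform):
    """Raw yield of one run in megabases (reads x read length)."""
    return platform["reads"] * platform["read_len"] / 1e6


def cost_per_megabase(platform):
    """Reagent cost in dollars per megabase of raw sequence."""
    return platform["reagent_cost"] / megabases_per_run(platform)


if __name__ == "__main__":
    for name, platform in PLATFORMS.items():
        print(name, megabases_per_run(platform), "Mb/run,",
              round(cost_per_megabase(platform), 2), "$/Mb")
```

Under these assumptions the short-read platforms produce far more raw sequence per reagent dollar than 454, at the price of much shorter reads.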
Third-Generation Sequencing
• Extremely high-throughput sequencing at very low cost
• Pacific Biosciences
– Sequence in real time with fluorescent NTPs
– Rate limited by processivity of polymerase
– Very long reads (>10 kb)
– Not well parallelized (few reads)
• Nanopore sequencing
– Sequencing by exonuclease cleavage of native DNA
– Bases are read as they pass through a modified nanopore
• base-specific change in current
Genome Sequencing Videos
• Wash U Genome Center
– http://www.nslc.wustl.edu/elgin/genomics/gscmaterials.html
– Sanger Technology Tour Videos
– Next Generation Technology Tour Videos:
http://gep.wustl.edu/curriculum/course_materials_WU/introduction_to_genomics/nextgen_video_tour
• Other videos
– PCR: http://www.youtube.com/watch?v=eEcy9k_KsDI
– Sanger: http://www.youtube.com/watch?v=aPN8LP4YxPo
– SOLiD: http://www.youtube.com/watch?v=nlvyF8bFDwM
– Solexa: http://www.youtube.com/watch?v=77r5p8IBwJk
– Helicos: http://www.youtube.com/watch?v=TboL7wODBj4
DNA Sequence Assembly
Outline
• Basic concepts in sequence assembly
– whole-genome shotgun methods
• Sources of error in assemblies
– Repeats
– Polymorphism
– Sequencing errors
• Alignment and assembly of next-generation sequencing
data
– Tiling reads onto reference vs. de novo assemblies
Whole Genome Assembly
• Multiple copies of the genome are broken into pieces.
• Both ends of every piece are read.
• Length (and orientation) of each piece form constraints.
• Reads: 500-1000 bp
• Quality array for each position.
• Reconstruct genome from reads and constraints.
• Issues: both ends of a read are usually low quality, chimeric
reads, repetitive regions.
DNA Sequencing Data Set
• Millions of reads, some of them are low quality reads
• Millions of constraints, such as paired ends, quality
values
• After removing repeats, if two reads overlap by a sufficient length,
merge them
• A contig is an ordered and oriented list of overlapping
reads.
• A scaffold is an ordered and oriented list of contigs.
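The "merge reads that overlap enough" step can be sketched as a greedy procedure that repeatedly merges the pair of sequences with the longest suffix-prefix overlap. This is a toy illustration, not the algorithm of any particular assembler: it assumes error-free reads from a single strand, ignores quality values and paired-end constraints, and the names `overlap`, `greedy_assemble`, and `min_overlap` are made up for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge reads greedily; each returned string is one contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left that overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "GCGTAC", "GTACGG"]))
```

Greedy merging is easily misled by repeats, which is one reason repeats are handled separately before merging.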
Scaffolds
Sequence Assembly: Basic Approach
• Generate reads
• Find overlapping reads
• Assemble reads into contigs
• Join contigs into scaffolds using mate pairs
• Join scaffolds into “finished” sequence
Alignment and Assembly with Short Reads
• Map to reference genome
– Many tools
• De novo assembly
– Much harder
– Reference-guided assembly (MOSAIK)
– “True” de novo assembly (Velvet)
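Mapping short reads to a reference can be illustrated with a generic seed-and-verify sketch: index every k-mer of the reference, look up the k-mers of each read, and check the implied alignment. This is not how MOSAIK, Velvet, or any other named tool works internally; the function names, the gap-free alignment, and the `max_mismatches` parameter are assumptions made for this example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted 0-based positions where the read aligns without gaps."""
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTAGCCGATAGGCTA"  # toy reference sequence
    index = build_kmer_index(reference, 4)
    print(map_read("TAGCCGAT", reference, index, 4))
    print(map_read("TAGCCGTT", reference, index, 4))  # one mismatch
```

De novo assembly is harder because there is no reference to index, so overlaps have to be found among the reads themselves.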
Many DNA Assembly Systems
• PHRAP
• CAP
• Euler
• Celera Assembler
• Arachne
• LSA
Microarray Technique
Microarrays
• Used to study gene expression levels in cells.
• Cells can differ dramatically in the amounts of various
proteins that they synthesize; e.g. due to different cell
types or different external/internal conditions.
• In fact, in higher level organisms only a fraction of the
genes in a cell are expressed at a given time, and that
subset depends on the cell type.
• Via microarrays it is possible to study the expression
levels of tens of thousands of genes simultaneously.
Microarray Technology
• A microarray is a glass slide with spots of DNA on it; each
spot is a probe (or target). Thousands of probes can fit on
a single slide. The slides can be spotted by robots.
• The DNA is single-stranded cDNA and may consist of an
entire gene or part of one
• If the microarray is exposed to a solution containing
mRNA, then the mRNA molecules will bind to those
probes to which they are complementary
• Which genes you can study with a microarray depends on the
collection of probes on it.
• There are a number of commercial manufacturers; e.g.
Affymetrix
Microarray Probes
Single-stranded cDNA sequences
Microarray Experiments
• Start with two cell types, e.g. “healthy” and “diseased”.
• Isolate mRNA from each cell type, generate cDNA with
fluorescent dyes attached, e.g. green for healthy and red
for diseased.
• Mix the cDNA samples and incubate with the microarray.
• After incubation the cDNA in the samples has had a
chance to bind (hybridize) with the probes on the chip.
• The chip is read by a scanner that uses lasers to excite the
fluorescent tags; the intensity levels of the dyes are
recorded for each probe gene and stored in a computer.
The Colors of a Microarray
• Green: control DNA, where either DNA or cDNA derived
from normal tissue is hybridized to the target DNA
• Red: sample DNA, where either DNA or cDNA derived
from diseased tissue is hybridized to the target DNA
• Yellow: a combination of control and sample DNA,
where both hybridized equally to the target DNA
• Black: areas where neither the control nor sample DNA
hybridized to the target DNA
• The location and intensity of a color can tell us whether
the gene, or mutation, is present in the control DNA,
the sample DNA, or both
• It may also provide an estimate of the expression level of the
gene(s) in the sample and control DNA
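The color rules above can be sketched as a small classifier on the two channel intensities of a spot. The background cutoff, the two-fold threshold, and the pseudocount are hypothetical values chosen only for illustration; real scanners and analysis packages use their own background correction and thresholds.

```python
import math


def log_ratio(red, green, pseudocount=1.0):
    """log2(red / green), with a pseudocount to avoid division by zero."""
    return math.log2((red + pseudocount) / (green + pseudocount))


def spot_color(red, green, background=100.0, fold=2.0):
    """Classify a spot as 'black', 'red', 'green', or 'yellow'.

    black  : neither channel rises above the background
    red    : sample (red) signal at least `fold` times the control
    green  : control (green) signal at least `fold` times the sample
    yellow : both channels hybridized to a similar extent
    """
    if red < background and green < background:
        return "black"
    ratio = log_ratio(red, green)
    threshold = math.log2(fold)
    if ratio >= threshold:
        return "red"
    if ratio <= -threshold:
        return "green"
    return "yellow"


if __name__ == "__main__":
    for red, green in [(50, 40), (4000, 500), (500, 4000), (1000, 1100)]:
        print(red, green, spot_color(red, green))
```

The log ratio is also the usual starting point for estimating relative expression between sample and control.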
Microarray Data Representation
• Microarray data is often arranged in an n x m matrix M
with rows for n genes and columns for m biological
samples in which gene expression has been monitored.
– m_ij is the expression level of gene i in sample j.
– A row e_i is the gene expression pattern of gene i over all the
samples.
– A column s_j holds the expression levels of all genes in sample j and is
called the sample expression pattern
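A minimal sketch of this representation, using a list of rows. The gene names, sample names, and numbers are made-up toy values, and indices are 0-based in the code although the slide's notation is 1-based.

```python
# Toy expression matrix: 3 genes (rows) x 4 samples (columns).
genes = ["gene_1", "gene_2", "gene_3"]
samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
M = [
    [2.0, 2.1, 8.5, 8.9],
    [5.0, 5.2, 4.9, 5.1],
    [7.5, 7.7, 1.2, 1.0],
]


def expression_level(matrix, i, j):
    """m_ij: expression level of gene i in sample j."""
    return matrix[i][j]


def gene_pattern(matrix, i):
    """Row e_i: expression of gene i across all samples."""
    return list(matrix[i])


def sample_pattern(matrix, j):
    """Column s_j: expression of all genes in sample j."""
    return [row[j] for row in matrix]


if __name__ == "__main__":
    print(genes[0], gene_pattern(M, 0))
    print(samples[2], sample_pattern(M, 2))
```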
Microarray Data Analysis
• Gene chips allow the simultaneous monitoring of the
expression level of thousands of genes. Many statistical
and computational methods are used to analyze this data
– Statistical hypothesis tests for differential expression analysis
– Principal component analysis and other methods for visualizing
high-dimensional microarray data
– Cluster analysis for grouping together genes or samples with
similar expression patterns
• Different clustering algorithms may be used, e.g. hierarchical with
different metrics, or k-means, k-medians.
– Hidden Markov models, neural networks and other classifiers for
predictively classifying sample expression patterns as one of
several types (e.g. diseased vs. normal)
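As one concrete instance of a hypothesis test for differential expression, the sketch below computes Welch's t statistic for a single gene measured in two groups of samples. The lecture does not say which test is used, so this choice is an assumption; the function name and the example values are illustrative, and converting the statistic into a p-value is left to a statistics package.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in mean expression.

    Each group needs at least two measurements. A large absolute value
    suggests that the gene is expressed differently in the two groups.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    healthy = [1.0, 2.0, 3.0]   # toy expression values
    diseased = [4.0, 5.0, 6.0]
    print(welch_t(healthy, diseased))
```

With thousands of genes tested at once, the resulting p-values also need a multiple-testing correction.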
For What Do We Use Microarray Data
• Genes with unusual expression levels in a sample
• Genes whose expression levels vary across samples
– This can be used to compare normal and diseased tissues or
diseased tissue before and after treatment.
• Samples that have similar expression patterns
– This can also be used to compare normal and diseased tissues or
diseased tissue before and after treatment.
• Tissues that might be diseased
– We can take the gene expression pattern of a sample and compare
it to library expression patterns that indicate diseased or
non-diseased tissue.
Statistical Methods Can Help
• Data Pre-processing
– Normalization: rescaling data from different microarrays so that they
can be compared
– Center: subtracting the mean and dividing by the standard deviation.
• Data Visualization
– Principal component analysis and multidimensional scaling are two
useful techniques for reducing multidimensional data to two or three
dimensions. This allows us to visualize it.
• Cluster Analysis
– By associating genes with similar expression patterns, we might be able
to draw conclusions about their functional expression.
• Statistical Inference
– This is the formulation and statistical testing of a hypothesis and
alternative hypothesis.
• Classifiers for the Data
– We can construct classes from data, such as diseased vs. non-diseased
tissue. We can build a model that fits known data for the different classes.
This can then be used to classify previously unclassified data.
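The pre-processing step can be sketched as standardizing one expression pattern: subtract its mean and divide by its standard deviation (a z-score). Using the population standard deviation and returning zeros for a constant pattern are choices made for this example, and the function name is illustrative.

```python
import statistics


def standardize(values):
    """Center a list of values to mean 0 and scale to standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return [0.0 for _ in values]  # constant pattern: nothing to scale
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    pattern = [2, 4, 4, 4, 5, 5, 7, 9]  # toy values with mean 5 and sd 2
    print(standardize(pattern))
```

After this step, patterns measured on different scales can be compared directly, for example before clustering.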
Measuring Dissimilarity of Expression Data
• We might want to compare two or more gene or sample
expression patterns
• This might be used to differentiate between diseased and
normal cells or to find out the genetic similarity of
tissues.
• To do this we need a distance metric or a dissimilarity
measure.
Example Distance Metric
• Euclidean distance: this is the most common distance
measure.
• This should not be used if either
– Not all components of the vectors being compared have equal
weight.
– There is missing data.
• Preprocessing the data can often alleviate these
problems.
• We can also use the normalized Euclidean distance
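Both distances can be written in a few lines. The slide does not define the normalized version, so the sketch assumes the common form in which each squared difference is divided by the variance of that component; the function names and example vectors are illustrative.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two expression patterns."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalized_euclidean(x, y, variances):
    """Euclidean distance with each component scaled by its variance.

    `variances` holds one positive variance per component, for example
    estimated across all patterns in the data set (an assumption here).
    """
    if not len(x) == len(y) == len(variances):
        raise ValueError("all inputs must have the same length")
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(normalized_euclidean([0, 0], [3, 4], [9, 16]))
```

Scaling by the variance keeps a component with a large numeric range from dominating the distance, which addresses the unequal-weight problem noted above.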
Cluster Analysis of Microarray Data
• Hierarchical clustering: start with each data point in its own
singleton cluster.
– Find the two clusters that are closest together. Combine these to
form a new cluster.
– Compute the distance from all other clusters to the new cluster using some
form of averaging.
– Find the two closest clusters and repeat.
• K-means clustering: partitions the data into k clusters
and finds the mean of each cluster.
– Usually, the number of clusters k is fixed in advance. To choose k,
something must be known about the data. There might be a range
of possible k values.
– To decide which is best, optimize a quantity that measures
cluster tightness, i.e. minimize the distances between
points in a cluster
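A minimal k-means sketch, together with the within-cluster sum of squares as one possible tightness measure. Taking the first k points as initial centers is a simplification made so the example is deterministic; practical implementations use random or smarter initialization and several restarts. All names are illustrative.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic k-means; returns (assignment, centers)."""
    centers = [list(p) for p in points[:k]]  # simplistic initialization
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def within_cluster_ss(points, assignment, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))


if __name__ == "__main__":
    data = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]  # toy points
    labels, centers = kmeans(data, 2)
    print(labels, centers, within_cluster_ss(data, labels, centers))
```

The result depends on the starting centers, so comparing the within-cluster sum of squares across several starts, or across candidate values of k, is one way to decide which clustering is best.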
Challenges in Microarray Analysis
• Different platforms
– Illumina, Affymetrix, Agilent, …
• Many file types, many data formats
• Need to learn platform dependent methods and software required
• Analysis
– How to get started?
– Which methods? Which software?
• Many freely available tools. Some commercial
– How to interpret results
Public Databases
• Many sources for public data – labs, consortia, government
• Publications require that data files including raw files be made
public
• GEO
– http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress
– http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
Data Analysis
• Class discovery
• Class comparison
• Class prediction
• Biological annotation
• Pathway analysis
Hierarchical Clustering
• Eisen Cluster and Treeview
– http://rana.lbl.gov/EisenSoftware.htm
• Import data
• Filter
– To filter or not to filter: %P calls, SD, etc.
• Adjust data
– Log transform, center, normalize
• Clustering
– Cluster arrays or genes
– Computationally intensive
– Choose distance metric
• .cdt file created
– Open with Treeview
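The clustering step of this workflow can be sketched as average-linkage agglomerative clustering. This is a plain illustration of the general procedure and not the implementation used by Eisen's Cluster program; it assumes Euclidean distance, recomputes all distances at every step for clarity rather than speed, and uses made-up names.

```python
import math


def average_linkage(points):
    """Agglomerative clustering with average linkage.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between the two clusters.
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        a, b = min(pairs, key=lambda xy: cluster_distance(clusters[xy[0]],
                                                          clusters[xy[1]]))
        merges.append((clusters[a], clusters[b],
                       cluster_distance(clusters[a], clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (5, 5), (5, 6)]  # toy expression patterns
    for step in average_linkage(toy):
        print(step)
```

The merge history is the information a dendrogram viewer draws as a tree. Checking every pair at every step is what makes the method computationally intensive on large arrays.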
Cluster from Microarray Data
Experimental Design
• Sample size
– How many samples in test and control
• Replicates
– Technical vs. biological
• Biological replicates are more important for more heterogeneous
samples
• Need replicates for statistical analysis
• All experimental steps from sample acquisition to hybridization
– Microarray experiments are very expensive, so plan experiments
carefully
Video on YouTube
• DNA Microarray
– http://www.youtube.com/watch?v=VNsThMNjKhM&
Pathway Analysis
KEGG
• Kyoto Encyclopedia of Genes and Genomes
(KEGG) http://www.genome.jp/kegg/pathway.html
Biological Pathways
http://www.sabiosciences.com/
Microarray Data Analysis
Figure: workflow diagram with the labels raw data, statistical packages, gene list, literature findings, and biology
• Tools
– Ingenuity IPA
– GeneGO Metacore
– BioBase ExPlain
Microarray Processed Data
Ingenuity IPA
• Search and Explore
– Genes, proteins, diseases and chemicals
– Connect genes
– Build pathways
– Explore pathways
• Analyze dataset
– Interpret high-throughput data in the context of biological
processes, pathways and networks
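One common way to put a gene list into the context of pathways is an over-representation test: are more of the selected genes in a pathway than chance would predict? The sketch below uses the hypergeometric distribution for this. It is a generic illustration and is not presented as the method IPA uses; the function name and the numbers in the example are made up.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """P(at least `overlap` pathway genes among the selected genes).

    total_genes    : genes measured on the array
    pathway_genes  : measured genes that belong to the pathway
    selected_genes : size of the gene list (e.g. differentially expressed)
    overlap        : selected genes that belong to the pathway
    """
    upper = min(pathway_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    numerator = sum(
        comb(pathway_genes, k)
        * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Toy numbers: 10 genes on the array, 3 in the pathway,
    # 3 genes selected, all 3 of them in the pathway.
    print(enrichment_p_value(10, 3, 3, 3))
```

A small p-value suggests the pathway is over-represented in the gene list. With many pathways tested, the p-values need a multiple-testing correction.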
IPA Analysis
Interaction Network Maps
IPA Analysis
Pathway Map: p53