Genomic Signal Processing

Dr. C.Q. Chang
Dept. of EEE
Outline
• Basic Genomics
• Signal Processing for Genomic Sequences
• Signal Processing for Gene Expression
• Resources and Co-operations
• Challenges and Future Work
Basic Genomics
Genome
• Every human cell contains 6 feet of double stranded (ds) DNA
• This DNA has 3,000,000,000 base pairs representing 50,000-100,000 genes
• This DNA contains our complete genetic code or genome
• DNA regulates all cell functions including response to disease,
aging and development
• Gene expression pattern: snapshot of DNA in a cell
• Gene expression profile: DNA mutation or polymorphism over
time
• Genetic pathways: changes in genetic code accompanying
metabolic and functional changes, e.g. disease or aging.
Gene: protein-coding DNA
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
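The two steps in the diagram above can be traced with a short Python sketch. It is only an illustration: the codon table is deliberately cut down to the seven codons that occur in the slide's example (a full table has 64 entries), and the function names `transcribe` and `translate` are chosen here, not taken from the slides.

```python
# Minimal codon table: only the codons that appear in the slide's example.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read non-overlapping codons from position 0 and map each to an amino acid."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)

dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

Starting the codon split at position 1 or 2 instead of 0 would give a different reading frame, which is why reading-frame prediction matters for coding-region analysis.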
In more detail
(color ~state)
Signal Processing for Genomic Sequences
The Data Set
• Genomic information is digital: sequences of the letters A, T, C and G
• Signal processing deals with numerical sequences, so character
strings have to be mapped into one or more numerical sequences
The Problem
• Identification of protein-coding regions
• Prediction of whether or not a given DNA segment
is part of a protein-coding region
• Prediction of the proper reading frame
• Compared with traditional methods, signal processing
methods are much quicker and can be even more
accurate in some cases.
Sequence to signal mapping
a  1  j , t  1  j, c  1  j, g  1  j
y[n]  x[n]  x[n  1] / 2  x[n  2] / 4
Signal Analysis
• Spectral analysis (Fourier transform,
periodogram)
• Spectrogram
• Wavelet analysis
• HMT: wavelet-based Hidden Markov
Tree
• Spectral envelope (using optimal
string to numerical value mapping)
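As an illustration of the spectral analysis item, the sketch below evaluates a periodogram-style power at a single DFT bin. It assumes a binary indicator mapping (one 0/1 sequence per base), which is one common choice and not necessarily the mapping used for the results in these slides. Protein-coding regions are often reported to show a peak at the bin k = N/3 (period 3); the toy input here is an exactly periodic string, not real genomic data.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence: 1.0 where `base` occurs, else 0.0."""
    return [1.0 if b == base else 0.0 for b in seq]

def dft_bin(x, k):
    """Single DFT coefficient X[k] = sum_n x[n] * exp(-2*pi*j*k*n/N)."""
    n_total = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * n / n_total)
               for n, v in enumerate(x))

def power_at_bin(seq, k):
    """Sum over the four bases of |X_base[k]|^2."""
    seq = seq.upper()
    return sum(abs(dft_bin(indicator(seq, base), k)) ** 2 for base in "ACGT")

periodic = "ATG" * 30          # toy sequence with exact period 3 (N = 90)
k3 = len(periodic) // 3        # DFT bin that corresponds to period 3
peak = power_at_bin(periodic, k3)
off_peak = power_at_bin(periodic, k3 + 1)
print(peak, off_peak)
```

Sliding this computation over a window along a long sequence gives a spectrogram-like view of where the period-3 component is strong.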
Spectral envelope of the BNRF1
gene from the Epstein-Barr virus
(a) 1st section (1000bp), (b) 2nd section (1000bp),
(c) 3rd section (1000bp), (d) 4th section (954bp)
Conjecture: the 4th quarter is actually non-coding
Signal Processing for Gene Expression
Microarray Life Cycle (taken from Schena & Davis)
Biological Question → Sample Preparation → Microarray Reaction →
Microarray Detection → Data Analysis & Modeling → Biological Question
(Figure: cDNA microarray workflow)
• cDNA clones (probes): PCR product amplification, purification,
printing onto the microarray (0.1 nl/spot)
• Hybridise target (mRNA target) to microarray
• Scanning: excitation by laser 1 and laser 2, emission
• Overlay images and normalise
• Analysis
Image Segmentation
• Simple way: fixed circle method
• Advanced: fast marching level set segmentation
(Figure panels: advanced segmentation; fixed circle)
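The fixed circle method can be sketched in a few lines of Python: every spot is assumed to be a circle of the same radius around a known centre, and the spot intensity is the mean of the pixels inside that circle. The function name, the toy image and the choice of the mean as the summary statistic are assumptions made for this illustration.

```python
def fixed_circle_mean(image, center_row, center_col, radius):
    """Mean intensity of pixels whose centres lie within `radius` of the spot centre."""
    total, count = 0.0, 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2:
                total += value
                count += 1
    return total / count if count else 0.0

# Toy 5x5 image: a bright plus-shaped spot (value 10) on a dark background (value 1).
image = [
    [1, 1, 1, 1, 1],
    [1, 1, 10, 1, 1],
    [1, 10, 10, 10, 1],
    [1, 1, 10, 1, 1],
    [1, 1, 1, 1, 1],
]
tight = fixed_circle_mean(image, 2, 2, radius=1)  # circle matches the spot
loose = fixed_circle_mean(image, 2, 2, radius=2)  # circle also covers background
print(tight, loose)
```

The second call shows the weakness of the method: when the fixed radius does not match the real spot, background pixels dilute the estimate, which is the motivation for shape-adaptive segmentation.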
Clustering and filtering methods
Principal approaches:
• Hierarchical clustering (kdb trees, CART, gene shaving)
• K-means clustering
• Self organizing (Kohonen) maps
• Support vector machines
• Gene Filtering via Multiobjective Optimization
• Independent Component Analysis (ICA)
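As one concrete instance from the list above, here is a minimal K-means sketch (Lloyd's algorithm) in plain Python. The expression profiles are made-up toy values, and the initial centroids are simply the first k points, which is a simplification chosen for reproducibility; practical implementations usually use random or more careful initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20):
    """Lloyd's algorithm; initial centroids are the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy expression profiles: rows are genes, columns are conditions (hypothetical values).
profiles = [
    [0.1, 0.2, 0.1],
    [2.0, 2.1, 1.9],
    [0.0, 0.1, 0.2],
    [2.2, 1.9, 2.0],
]
labels, centroids = k_means(profiles, k=2)
print(labels)
```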
Validation approaches:
• Significance analysis of microarrays (SAM)
• Bootstrapping cluster analysis
• Leave-one-out cross-validation
• Replication (additional gene chip experiments, quantitative PCR)
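Leave-one-out cross-validation can be sketched as follows: each sample is held out in turn, a classifier is built from the remaining samples, and the held-out sample is predicted. The nearest-centroid classifier, the sample values and the class labels below are all assumptions made for this illustration and are not taken from the slides.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose class mean is closest to `query`."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return min(
        centroids,
        key=lambda lab: sum((q - c) ** 2 for q, c in zip(query, centroids[lab])),
    )

def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical two-gene expression values for six samples.
samples = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
labels = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]
accuracy = leave_one_out_accuracy(samples, labels)
print(accuracy)
```

With few samples and many genes, as is typical for microarray studies, this scheme uses almost all of the data for training in every round.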
ICA for B-cell lymphoma data
Data: 96 samples of normal and malignant lymphocytes.
Results: scatter plots of 12 independent components
Comparison: closely related to the results of hierarchical clustering
Resources and Co-operations
Resources: databases on the internet such as
• GenBank
• Protein Data Bank (PDB)
• Some small databases of microarray data
Co-operation needed:
• First-hand microarray data
• Biological experiments for validation
Challenges and Future Work
• Genomic signal processing opens a new signal
processing frontier
• Sequence analysis: the signal is symbolic or categorical, so
classical signal processing methods are not directly
applicable
• The increasingly high dimensionality of genetic data sets
and the complexity involved call for fast, high-throughput
implementations of genomic signal
processing algorithms
• Future work: spectral analysis of DNA sequences and
clustering of microarray data; modify classical
signal processing methods and develop new ones.