EE150a – Genomic Signal and Information Processing

• Seminar series
– lectures at the first 3 meetings, followed by student presentations
– statistical signal processing basics
– background reading for each meeting
• Location: Moore 080 (except today)
• List of papers with links: www.its.caltech.edu/~hvikalo/gsip.html
– minor modifications of the list are likely
• Contact: Haris Vikalo, Moore 125
– Phone: 395-4184
– E-mail: [email protected]
• Check the website occasionally for updates and a growing list of
research-related links
• Today’s handouts:
– basic course info and a list of papers
– R. Karp’s “Mathematical Challenges from Genomics and
Molecular Biology”
– sign-up sheet
• Next time: Prof. Vaidyanathan’s lecture on “Signal Processing
Problems in Genomics”
• In two weeks: lecture on DNA microarray technology and
novel techniques for estimating gene expression levels
• Today: introduction with brief overview of the topics for
presentation
Central Dogma of Molecular Biology
• Flow of information in a cell: DNA → RNA → protein
• [Due to Francis Crick. It has recently been realized that the
dogma requires modifications, but more about that later in the
course.]
• Recent development of high-throughput technologies that
study the above flow
– requires interdisciplinary effort
– dealing with a huge amount of information
DNA Structure
• Four nucleotides: adenine (A), cytosine
(C), guanine (G), and thymine (T)
• Bindings:
– A with T (weaker), C with G (stronger)
• Forms a double helix – nucleotides within each strand are
linked via sugar-phosphate bonds
(strong), the two strands are linked via hydrogen
bonds (weak)
• The genome is the complete DNA sequence; genes are the
parts of it that encode proteins:
– …AACTCGCATCGAACTCTAAGTC…
genetics.gsk.com/graphics/dna-big.gif
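The pairing rules above (A with T, C with G) mean that one strand determines the other. A minimal Python sketch of that rule follows; the function name `reverse_complement` is chosen here for illustration and is not from the slides.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction.

    The two strands of the helix run antiparallel, so the partner strand
    is obtained by complementing each base and reversing the order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


if __name__ == "__main__":
    print(reverse_complement("AACTCGCATCG"))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.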
Sidenote: Sequence Alignment
• Perhaps the most fundamental operation in bioinformatics
– used to decide if two genes or proteins are related by function,
structure, or evolutionary history
– can identify patterns of conservation and variability
• Performs pairwise matching between characters of each
sequence
• One place where it is useful: SNP (single-nucleotide
polymorphism) detection
– SNPs may indicate the development of a disease (myocardial diseases,
arthritis, etc. have been associated with SNPs)
• Sequence alignment is the first student presentation topic in
the series (HMM, dynamic programming, Bayesian methods)
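As a rough illustration of the dynamic programming approach mentioned above, the sketch below computes a global alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values from the course.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global alignment of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    # A single-base difference (as with a SNP) costs one mismatch.
    print(global_alignment_score("ACGT", "ACTT"))
```

Recovering the alignment itself, rather than only its score, would require a traceback through the table, which is omitted here for brevity.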
Details of the information flow
• Replication of DNA
– {A, C, G, T} to {A, C, G, T}
• Transcription of DNA to mRNA
– {A, C, G, T} to {A, C, G, U}
• Translation of mRNA to proteins
– {A, C, G, U} to {20 amino acids}
http://www-stat.stanford.edu/~susan/courses/s166/central.gif
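The alphabet mappings listed above can be sketched in a few lines of Python. This is a simplified illustration under two assumptions: the input is the coding strand (so transcription reduces to replacing T with U), and translation uses the standard genetic code starting at the first base and stopping at the first stop codon. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(coding_dna: str) -> str:
    """{A, C, G, T} -> {A, C, G, U}: mRNA copy of the coding strand."""
    return coding_dna.replace("T", "U")


def translate(mrna: str) -> str:
    """{A, C, G, U} -> amino acids (one-letter codes), up to the first stop."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```

The table maps 64 codons onto 20 amino acids plus a stop signal, which is why several codons encode the same amino acid.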
Genes can be turned on and off
Microarray Technology
• A medium for matching known and unknown DNA samples
based on hybridization (base-pairing)
• Two major applications
– identification of a sequence (gene or gene mutation)
– determination of expression level (abundance) of genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
– probes, orderly arranged on an array
– targets, the unknown samples to be detected
Types of Microarrays
• “Traditionally”, there are two formats:
– probe cDNA immobilized to a solid surface using robot
spotting and exposed to a set of targets, and
– an array of oligonucleotide probes synthesized on chip (via,
e.g., photolithography)
• Targets are typically fluorescently labeled cDNA molecules
obtained from mRNA samples
– hybridize to their complementary probes
– image readout
Illustration: DNA microarray
http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg
Sample Microarray Readout
Some Design Issues
• Hybridization is the binding of a target to its perfect complement
• However, when a probe differs from a target by a small number
of bases, it still may bind
• This non-specific binding (cross-hybridization) is a source of
measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a
lot of control over the landscape of the probes on the array
• The second topic for presentations considers the combinatorial design
of such arrays
• [How to deal with cross-hybridization on arrays used for
expression level measurements is the topic of the third lecture.]
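To make the distinction between specific binding and cross-hybridization concrete, the sketch below counts the positions at which a probe fails to pair with a target. It is a toy model under stated assumptions: equal-length sequences, no gaps or bulges, and a hypothetical threshold `max_mismatches` that is not from the slides. Real binding depends on thermodynamics rather than a simple mismatch count.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatches(probe: str, target: str) -> int:
    """Count probe positions that do not base-pair with the target.

    The target is read in reverse because the two strands bind antiparallel.
    """
    if len(probe) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(COMPLEMENT[p] != t for p, t in zip(probe, reversed(target)))


def binding_type(probe: str, target: str, max_mismatches: int = 2) -> str:
    """Classify a probe/target pair with an assumed mismatch threshold."""
    count = mismatches(probe, target)
    if count == 0:
        return "specific"              # perfect complement
    if count <= max_mismatches:
        return "cross-hybridization"   # differs by a small number of bases
    return "no binding"


if __name__ == "__main__":
    probe = "ACGTAC"
    for target in ("GTACGT", "GTACGA", "CATGCA"):
        print(target, binding_type(probe, target))
```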
Clustering Gene Expression Profiles
• Microarrays measure expression levels of thousands of genes
simultaneously
• For instance, we might take samples at different times during a
biological process
• Cluster data in the expression level space
– relatedness in biological function often implies similarity in
expression behavior (and vice versa)
– similar expression behavior indicates co-expression
• Clustering of expression level data is one of the topics
(traditional statistical methods, but also graph-theoretic
approaches, information-theoretic approaches, etc.)
Example of Clustering
• Rows: expression levels of
individual genes
• Columns: time progression
• So-called hierarchical
clustering
http://www.genomatix.de/gif/node43_documentation.gif
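A small sketch of hierarchical (agglomerative) clustering of expression profiles follows. The choices here are assumptions made for illustration: distance is 1 minus the Pearson correlation, clusters are merged by average linkage, and the four gene profiles are made-up numbers, not measured data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def hierarchical_clusters(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Average linkage: mean pairwise distance between cluster members.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates,
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)       # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical expression levels at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}

if __name__ == "__main__":
    print(hierarchical_clusters(profiles, 2))
```

Correlation-based distance groups genes whose levels rise and fall together regardless of absolute magnitude, which matches the notion of similar expression behavior used above. Recording the order of merges, rather than stopping at k clusters, would give the tree usually drawn next to such heat maps.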
Co-regulated genes
• Co-expressed genes may be co-regulated
– a combination of transcription factors (activating or
repressing proteins) regulates genes jointly
• Finding binding sites (control
regions) of co-regulated genes
is another topic
• HMM, probabilistic methods
(EM, Gibbs sampling)
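The probabilistic methods named above (EM, Gibbs sampling) typically alternate between estimating a motif model and locating the best-matching sites. The sketch below shows only the simpler scoring half of that loop, using a position weight matrix built from already aligned sites. It is not an implementation of EM or Gibbs sampling; the example sites, the uniform background, and the pseudocount are assumptions for illustration.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites.

    Assumes a uniform background (probability 0.25 per base).
    """
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def best_site(pwm, sequence):
    """Return (start, score) of the highest-scoring window in sequence."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, score


if __name__ == "__main__":
    # Hypothetical binding sites taken from co-regulated genes.
    sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
    pwm = build_pwm(sites)
    print(best_site(pwm, "GGCCTATAATGCGC"))
```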
Genetic Regulatory Networks
• Proteins take part in gene regulation
– feedback loop in the Central Dogma information flow
• Thus to fully understand gene regulation, we need to consider
interactions
– DNA, RNA, proteins, small molecules
• Requires network formalism
– directed graphs, Boolean networks, Bayesian networks,
differential equations etc.
• We will explore some of these models in the context of gene regulation
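Of the formalisms listed above, a Boolean network is the easiest to sketch: each gene is on or off, and its next state is a logical function of the current states. The three-gene network below is entirely hypothetical and uses synchronous updates, one of several possible update schemes.

```python
def step(state, rules):
    """Synchronous update: all genes are updated from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def trajectory(state, rules, n_steps):
    """List of states visited, starting from the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states


# Hypothetical regulatory logic with a feedback loop.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}

initial = {"A": True, "B": False, "C": False}

if __name__ == "__main__":
    for state in trajectory(initial, rules, 5):
        print("".join("1" if state[g] else "0" for g in "ABC"))
```

With these rules the network returns to its initial state after five steps, so the negative feedback from C to A produces an oscillation rather than a fixed point.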
An Illustration of a Regulatory Network
Protein Translation/Folding
• [Should time permit.]
• The sequence-structure relationship will play a very important role
in the postgenomic era
– potentially great impact on genetics, pharmaceutical
chemistry, and protein design
– diseases such as Alzheimer’s are believed to be related to
protein misfolding
• Computationally very hard
– parallel, distributed computing
Genomic data fusion
• Consider the problem of classification of a protein and assume
that we know:
– original gene sequence encoding the protein
– gene expression levels
– some of the protein-protein interactions
• Question: how to combine various types of data to classify the
protein
• The last (right now…) topic of the seminar will be data fusion
of the various genomic data listed above
– an efficient statistical learning algorithm based on convex
optimization
Summary
• Trying to understand gene regulation
• Recent technologies revolutionized research
– huge amount of data
• Multidisciplinary; identify opportunities
• Challenging problems, quite important:
– understanding information processes at the genetic level gives
insights into phenotypic effects (disease)
– some of the ultimate goals are molecular diagnostics and
creating personalized drugs