Lecture 1 - Pitt CPATH Project


Genomics and Personalized
Care in Health Systems
Lecture 1: Introduction
Leming Zhou, PhD
Department of Health Information Management
School of Health and Rehabilitation Sciences
The University of Pittsburgh
Textbooks
• Jonathan Pevsner, Bioinformatics and
Functional Genomics, Second Edition,
Wiley-Blackwell, 2009.
• Ebook: Genes and Disease, searchable and freely available
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/gnd/gnd.pdf
or http://www.ncbi.nlm.nih.gov/disease/
Course Description
• This course will focus on a general introduction to genomics,
gene structure and annotation, and gene-disease
association.
• Other topics such as RNA and protein structure, and
microarray experiments will also be briefly covered.
• Students will understand gene structure and become familiar
with various genome analysis tools by working on novel
gene annotation projects.
Course Objectives (1/2)
• Explain eukaryotic gene structure and the central dogma of
molecular biology
• Demonstrate the skills of annotating eukaryotic genes
using online tools
• Demonstrate the skills of performing sequence similarity
searches using BLAST
• Demonstrate the skills of collecting evidence from the UCSC
Genome Browser
• Describe major DNA and protein databases and the
methods of extracting data from them
Course Objectives (2/2)
• Explain major gene finding methods, their advantages and
disadvantages
• Describe different types of genetic diseases and the
relationship between genetic variations and diseases
• Demonstrate the skills of determining protein and RNA
secondary structures using online tools
• Explain basic ideas behind microarray and DNA
sequencing technologies
Method of Presentation
• Lectures
• In-Class Laboratory Sessions
• Student Projects and Presentations
• Term Paper (graduate students)
Course Outline (Tentative)
Date        Reading               Topics
Lecture 1   Chapter 1             Overview of the course and introduction to DNA, RNA, and protein
Lecture 2   Chapter 2             Molecular biology databases
Lecture 3   Chapters 3, 6, 7      Sequence alignment
Lecture 4   Chapter 4             BLAST search
Lecture 5   UCSC website          Genome browsers
Lecture 6   Chapter 13            Gene finding methods – Part I: Theory
Lecture 7   GEP websites          Gene finding methods – Part II: Practices
-           GEP websites          Gene annotation lab
Lecture 8   Chapter 20            Genomic variations and disease
Lecture 9   Chapters 10 and 11    Protein/RNA structure
Lecture 10  Chapters 13, 16       High-throughput technologies
-           -                     Course review
-           -                     Student Presentations
-           -                     Final Exam
Basic Concepts
DNA (1/3)
• DNA (deoxyribonucleic acid), a helical molecule
comprising a sequence of four nucleotides (bases)
– Adenine (A) – purine; Thymine (T) – pyrimidine
– Guanine (G) – purine; Cytosine (C) – pyrimidine
[Figure: chemical structures of adenine, guanine, cytosine, and thymine]
DNA (2/3)
• A is always paired with T, while G always with C
DNA (3/3)
• A DNA sequence can be either single-stranded or double-stranded
• DNA sequences have an orientation: from 5’ to 3’ or from 3’ to 5’
(chemical conventions)
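The pairing rule (A with T, G with C) and the 5’ to 3’ orientation together determine the opposite strand of a double-stranded sequence. A minimal Python sketch of that idea follows; the function name `reverse_complement` and the example sequence are illustrative assumptions, not part of the lecture.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in the 5' to 3' direction."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ATGGCT"))  # AGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.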
Nucleotides
RNA
• RNA (ribonucleic acid), usually a single-stranded molecule
• It comprises four nucleotides
– A, C, G, and U (Uracil)
• Produced by copying one of the two strands
of a DNA molecule in the 5’ to 3’ direction
• Different types of RNAs
– Messenger RNA (mRNA)
– Transfer RNA (tRNA)
– Ribosomal RNA (rRNA)
– …
[Figure: chemical structure of uracil]
Protein
• A molecule comprising a long chain of amino acids
connected by peptide bonds
• There are 20 standard amino acids encoded by the
universal genetic code
Molecular Biology of the Cell, Alberts et al. 2002
Cell Types
• Prokaryotes: a group of organisms that lack a
nuclear membrane, such as blue-green algae
and common bacteria (e.g., Escherichia coli). The group has
two major taxa: Archaea and Bacteria
• Eukaryotes: unicellular and multicellular
organisms, such as yeast, fruit fly, mouse, plants,
and human
Gene
• A stretch of DNA containing the information
necessary for coding a protein/polypeptide
• Promoter region
• Transcription Factor Binding Site
• Translation Start Site
• Exon: coding (informative) regions of the DNA
• Intron: noninformative regions between exons
• Untranslated region (UTR)
• Codons
Eukaryotic Gene Structure
http://www.nslij-genetics.org/pic/dna-rna-protein.jpg
Eukaryotes
• In eukaryotes, transcription is complex:
– Many genes contain alternating exons and introns
– Introns are spliced out of mRNA
– mRNA then leaves the nucleus to be translated by ribosomes
• Genomic DNA: entire gene including exons and introns
– The same genomic DNA can produce different proteins by
alternative splicing of exons
• Complementary DNA (cDNA): spliced sequence
containing only exons
– cDNA can be manufactured by capturing mRNA and performing
reverse transcription
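Splicing and alternative splicing can be pictured as joining selected exon intervals of the genomic sequence. The sketch below is a toy illustration only: the gene sequence, the exon coordinates, and the helper name `splice` are all made up for the example.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in order, dropping the introns."""
    return "".join(genomic[start:end] for start, end in exons)

# Made-up toy gene: exons in uppercase, introns in lowercase.
gene = "ATGGCC" + "gtaagt" + "GATTAC" + "gtcag" + "TGA"
exons = [(0, 6), (12, 18), (23, 26)]

full_transcript = splice(gene, exons)              # all three exons kept
skipped_exon = splice(gene, [exons[0], exons[2]])  # alternative splicing: exon 2 skipped

print(full_transcript)  # ATGGCCGATTACTGA
print(skipped_exon)     # ATGGCCTGA
```

The same genomic sequence yields two different spliced sequences, which is the sense in which one gene can produce different proteins.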
Central Dogma of Molecular Biology
• DNA → RNA → Protein
• Transcription: DNA → RNA
• Translation: RNA → protein
DNA Transcription
• RNA molecules are synthesized by RNA polymerase
• RNA polymerase binds to the promoter region on the DNA
• The promoter region contains the start site
• Transcription ends at the termination signal site
• Primary transcript: the RNA copied directly from the DNA template
• RNA splicing: introns are removed to make the mRNA
• mRNA: contains the sequence of codons that code for a
protein
• Splicing and alternative splicing
• Post-transcriptional modification
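The copying step of transcription can be sketched in a few lines of Python. This is a simplified illustration that ignores promoters, termination signals, and splicing; the function names and the example sequences are assumptions made for the sketch.

```python
# Pairing used when RNA is built on a DNA template: A->U, T->A, G->C, C->G.
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_template(template_5to3: str) -> str:
    """Build the RNA (5' to 3') from a template strand given 5' to 3'."""
    # The template is read 3' to 5', so walk it in reverse.
    return "".join(RNA_PAIR[base] for base in reversed(template_5to3.upper()))

def transcribe_coding(coding_5to3: str) -> str:
    """Shortcut: the RNA matches the coding strand with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")

print(transcribe_template("AGCCAT"))  # AUGGCU
print(transcribe_coding("ATGGCT"))    # AUGGCU
```

Both calls give the same RNA because "AGCCAT" is the reverse complement of "ATGGCT", that is, the two strands of the same double-stranded segment.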
mRNA Translation
• Ribosomes are made of protein and rRNA
• mRNA passes through the ribosome
• Initiation factors: proteins that catalyze the start of
translation
• tRNA brings the different amino acids to the ribosome
complex so that the amino acids can be attached to the
growing amino acid chain
• When a STOP codon is encountered, the ribosome releases
the mRNA and synthesis ends
• An open reading frame (ORF): a contiguous sequence of
DNA starting at a start codon and ending at a STOP codon
http://www.youtube.com/watch?v=5bLEDd-PSTQ
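The ORF definition above can be turned into a short search: find a start codon, then read codons in frame until a STOP codon. The sketch below assumes the standard genetic code, ATG as the only start codon, and a single strand; the function names and the test sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def find_orf(dna: str) -> str:
    """Return the first ORF (ATG through the in-frame STOP codon), or ''."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        for i in range(start, len(dna) - 2, 3):
            if CODON_TABLE.get(dna[i:i + 3]) == "*":
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)  # no in-frame STOP: try the next ATG
    return ""

def translate(orf: str) -> str:
    """Translate codon by codon, stopping at the first STOP codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

orf = find_orf("CCATGGCTGATTACTGAGG")
print(orf)             # ATGGCTGATTACTGA
print(translate(orf))  # MADY
```

A fuller search would also scan the reverse complement strand and report every ORF rather than only the first.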
Chromosomes
• A chromosome is a long and tightly wound DNA string (visible under
a microscope)
• Chromosomes can be linear or circular
• Prokaryotes usually have a single chromosome, often a circular
DNA molecule
• Eukaryotic chromosomes appear in pairs (diploid), one inherited
from each parent
– Homologous chromosomes carry the same genes
– Some genes are the same in both parents
– Some genes appear in different forms called alleles, e.g., the human ABO blood type gene has
three alleles: A, B, and O
• All genes are present in all cells, but a given cell type
expresses only a small portion of the genes
Chromosomal Location
Genome
• The genome is formed by one or more chromosomes
• A genome is the entire set of DNA contained in a cell
• A human cell has 46 chromosomes (23 pairs)
• The total length of the (haploid) human genome is about 3 billion base pairs
Genome Sequences
Species       Complete   Draft Assembly (Almost complete)   In process   Total
All           1153       1285                               889          3327
Eukaryotes    36         319                                294          649
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
Retrieved on 1/8/2012
Genome Sequence Sizes
DNA sequence size is measured in base pairs (bp)
• Phage phiX174: 5,386
• HIV: 9,193
• SARS coronavirus: 29,751
• Haemophilus influenzae (bacterium): 1,830,000
• Escherichia coli K12: 4,600,000
• Saccharomyces cerevisiae (yeast): 12,500,000
• Drosophila melanogaster (fruit fly): 180,000,000
• Arabidopsis thaliana (thale cress): 125,000,000
• Homo sapiens (human): 3,000,000,000
The Whole Picture
Genomics
• The definition of genomics may be different from person to
person
• Genomics involves large data sets (whole genome
sequences) and high-throughput methods (DNA
sequencing technologies)
– Genetics research focuses on one or a set of genes
• Genomics may or may not include other specific research
areas, such as proteomics, transcriptomics, variomics,
metabolomics, etc.
• In this course, genomics includes DNA sequence analysis,
genomic variations, gene expression, and proteomics.
Topics in This Course
• Molecular Biology Databases
• Sequence Alignment
• Blast Search
• Genome Browser
• Gene Finding Methods
• Genomic Variations and Disease
• Protein and RNA Secondary Structure
• High-throughput Technologies
Molecular Biology Databases
Important Databases
• Genome
– NCBI
– European Molecular Biology Laboratory (EMBL)
– DNA Data Bank of Japan (DDBJ)
– GO (Gene Ontology)
– Consortium of databases
• FlyBase, Mouse Genome Database (MGD)
• Protein
– Protein Data Bank (PDB)
– EMBL-EBI (European Bioinformatics Institute)
• UniProt, ExPASy, Swiss-Prot
• KEGG: Kyoto Encyclopedia of Genes and Genomes
NCBI (www.ncbi.nlm.nih.gov)
• NCBI – National Center for Biotechnology Information
• Established in 1988 as a national resource for molecular biology
information
• NCBI creates public databases, conducts research in computational
biology, develops software tools for analyzing genome data, and
disseminates biomedical information
• Databases
– GenBank, dbSNP, RefSeq, etc.
– PubMed, OMIM, MMDB, UniGene
– The Taxonomy Browser
• Tools
– BLAST, Cn3D, etc.
– Entrez is NCBI’s search and retrieval system that provides users with integrated
access to sequence, mapping, taxonomy, and structural data
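Entrez records can also be requested programmatically through NCBI's E-utilities web service. The sketch below only builds an `efetch` request URL and does not contact the server; the helper name `efetch_url` is an assumption, and the accession NM_000546 (a human TP53 mRNA record) is chosen purely as an example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db: str, record_id: str, rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an efetch URL asking for one record in the given format."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

# Example accession only; any valid identifier for the chosen database works.
print(efetch_url("nucleotide", "NM_000546"))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the record as FASTA text, subject to NCBI's usage guidelines.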
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide
repository of information about the 3D structures of large
biological molecules, including proteins and nucleic
acids.
• Understanding the shape of a molecule helps to
understand how it works.
• The PDB was established in 1971 at Brookhaven
National Laboratory and originally contained 7 structures
• In 1998, the Research Collaboratory for Structural
Bioinformatics (RCSB) became responsible for the
management of the PDB
• PDB provides
– Sequence, atomic coordinates, derived geometric data,
secondary structure, and annotations about protein literature
references
KEGG
• KEGG: Kyoto Encyclopedia of Genes and Genomes
• Contains pathway information and more (as of 1/10/2011):
– KEGG PATHWAY: 126,336 pathways generated from 379 reference pathways
– KEGG GENES: 6,121,933 genes in 139 eukaryotes + 1144 bacteria + 94 archaea
– KEGG GENOME: 1,508 organisms
– KEGG DISEASE: 375 diseases
– KEGG DRUG: 9,316 drugs
Sequence Alignment
Sequence Similarity
• Similarity: The extent to which nucleotide or protein
sequences are related. It is based upon identity plus
conservation.
• Identity: The extent to which two sequences are invariant.
• Conservation: Changes at a specific position of a DNA or
amino acid sequence that preserve the properties of the
original residue.
• The distance between two sequences, based on an
evolutionary model, estimates how long ago the two sequences
shared a common ancestor
Sequence Alignment
Sequence alignment is the procedure of comparing two or more DNA
or protein sequences by searching for a series of individual
characters or character patterns that are in the same order in the
sequences.
Given two sequences A and B, an alignment is a pair of sequences A’ and B’ such that:
1. A’ is obtained from A by inserting gap characters ‘-’
2. B’ is obtained from B by inserting gap characters ‘-’
3. A’ and B’ have the same length: |A’|=|B’|
4. No position has gap characters in both A’ and B’
Example:
A = ATGGCT
B = TGCTA
A’= ATGGCT-
B’= -TG-CTA
Goal: given two sequences, find the “best” alignment according to some scoring function
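The four conditions and the idea of a scoring function can be checked mechanically. The sketch below is one possible reading: the function names are illustrative, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not values given in the lecture.

```python
def is_valid_alignment(a: str, b: str, a_aln: str, b_aln: str, gap: str = "-") -> bool:
    """Check the four conditions that define an alignment of a and b."""
    return (
        a_aln.replace(gap, "") == a        # 1. A' is A plus inserted gaps
        and b_aln.replace(gap, "") == b    # 2. B' is B plus inserted gaps
        and len(a_aln) == len(b_aln)       # 3. same length
        and all(not (x == gap and y == gap) for x, y in zip(a_aln, b_aln))  # 4.
    )

def alignment_score(a_aln: str, b_aln: str, match: int = 1, mismatch: int = -1,
                    gap_penalty: int = -2, gap: str = "-") -> int:
    """Sum a simple per-column score over the alignment."""
    total = 0
    for x, y in zip(a_aln, b_aln):
        if x == gap or y == gap:
            total += gap_penalty
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

print(is_valid_alignment("ATGGCT", "TGCTA", "ATGGCT-", "-TG-CTA"))  # True
print(alignment_score("ATGGCT-", "-TG-CTA"))                        # -2
```

With these values the example alignment has four matching columns and three gap columns, giving 4 - 6 = -2. A different scoring function can rank alignments differently.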
Types of Sequence Alignment
• Pairwise Alignment – compare two sequences
• Multiple Alignment – compare three or more sequences
For each of the above we can do
• Local Alignment – compare similar parts of two
sequences
• Global Alignment – compare the whole sequence
For the different types of alignments there are different
assumptions and methods
Global Alignment vs. Local Alignment
• Local alignment: finds continuous or gapped
high-scoring regions which do not span the
entire length of the sequences being aligned
• Global alignment: finds the optimal full-length
alignment between the two sequences being
aligned
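The difference between the two modes shows up clearly in a dynamic programming table. The sketch below computes only the optimal score (no traceback), in the style of Needleman-Wunsch for the global case and Smith-Waterman for the local case; the linear gap penalty and the scoring values are assumptions made for the example.

```python
def best_score(a: str, b: str, local: bool = False,
               match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            dp[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else dp[-1][-1]

print(best_score("ATGGCT", "TGCTA"))              # -2
print(best_score("ATGGCT", "TGCTA", local=True))  # 3
```

The global score is pulled down by the unmatched ends, while the local score reflects only the shared substring GCT.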
Pairwise Alignment
• The process of lining up two sequences to achieve
maximal levels of identity/similarity for the purpose of
assessing the degree of similarity and the possibility of
homology.
• It is used to decide if two genes are structurally or
functionally related
• It is used to identify domains or motifs that are shared
between proteins
• It is used in the analysis of genomes
An Example of Pairwise Alignment
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
       .  |||  |     .   |.  .  .  |  : .||||.:|    :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 LAC

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
    : | |   |   |    ::  | .| .  ||  |:   ||        |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 LAC

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
     || ||.        |        :.||||  | .             .|
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 LAC
• Symbols between two sequences (Ssearch format):
Bar: identical; One dot: somewhat similar; Two dots: very similar
• Dots in sequences: gaps
Multiple Sequence Alignment
• A multiple sequence alignment is an alignment of three or
more sequences such that each column of the alignment
is an attempt to represent the evolutionary changes in one
sequence position, including substitutions, insertions,
and deletions.
• It is believed that over time the functional components
embedded within the sequences are conserved in order
to retain function
– One of the most important elements of sequences is the
phylogenetic information that similarities represent
– Sequence similarities give insight into the evolution of
families of protein or DNA sequences
An Example of Multiple Sequence Alignment
fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
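One simple way to read a multiple sequence alignment is column by column, asking how often the most common residue occurs. The sketch below does this for the first ten columns of the six-species example; the function name and the choice of "fraction of rows sharing the most common residue" as the conservation measure are assumptions made for illustration.

```python
from collections import Counter

# First ten alignment columns of the six-species example.
msa = {
    "fly":       "GAKKVIISAP",
    "human":     "GAKRVIISAP",
    "plant":     "GAKKVIISAP",
    "bacterium": "GAKKVVMTGP",
    "yeast":     "GAKKVVITAP",
    "archaeon":  "GADKVLISAP",
}

def column_conservation(rows):
    """For each column, return (most common residue, fraction of rows that have it)."""
    result = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result

profile = column_conservation(list(msa.values()))
consensus = "".join(residue for residue, _ in profile)
fully_conserved = [i + 1 for i, (_, frac) in enumerate(profile) if frac == 1.0]

print(consensus)        # GAKKVIISAP
print(fully_conserved)  # [1, 2, 5, 10]
```

Columns 1, 2, 5, and 10 are identical in all six species, which is the kind of signal used to argue that a position is functionally important.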
Evolutionary Basis of Sequence Comparison
• The simplest molecular mechanisms of evolution are
substitution, insertion, and deletion
• If a sequence alignment represents the evolutionary
relationship of two sequences, residues that are aligned
but do not match represent substitutions
• Residues that are aligned with a gap in the other sequence
represent insertions or deletions
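Read this way, every column of a pairwise alignment falls into one of three classes. The sketch below counts them; the function name and the example alignment are illustrative, and the gap character is a parameter because alignments are written with either "-" or ".".

```python
def classify_columns(a_aln: str, b_aln: str, gap: str = "-") -> dict:
    """Count matches, substitutions, and indels in a pairwise alignment."""
    if len(a_aln) != len(b_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for x, y in zip(a_aln, b_aln):
        if x == gap or y == gap:
            counts["indel"] += 1         # residue aligned with a gap
        elif x == y:
            counts["match"] += 1         # identical residues
        else:
            counts["substitution"] += 1  # aligned but different residues
    return counts

print(classify_columns("GATTACA", "GAT-ACG"))
# {'match': 5, 'substitution': 1, 'indel': 1}
```

The alignment alone cannot say whether an indel column was an insertion in one sequence or a deletion in the other; that requires information about the ancestor.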
Homology
• Homology: Similarity attributed to descent from a common ancestor.
• There are two types of homologs: orthologs and paralogs
• Orthologs:
– Homologous sequences in different species that arose from a common ancestral
gene during speciation;
– May or may not be responsible for a similar function.
– Members of a gene family in various organisms
• Paralogs:
– Homologous sequences within a single species that arose by gene duplication.
– Members of gene family within a species
• Genes either are homologous, or they are not. There are no degrees
of homology
BLAST Search
Similarity Search
• Find statistically significant matches to a protein or DNA
sequence of interest.
• Obtain information on inferred function of the gene
• Sequence alignment algorithms
– Dynamic Programming
• Needleman-Wunsch Global Alignment (1970)
• Smith-Waterman Local Alignment (1981)
• Guaranteed to find the best alignment under the chosen scoring scheme
• Slow, especially when searching against a large database
FASTA and BLAST
• Sequence Alignment Heuristics
– FASTA and BLAST: heuristic approximations to Smith-Waterman
• Fast, with results comparable to the Smith-Waterman algorithm
• FASTA and BLAST also calculate the statistical significance of the
alignments in the search results
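Much of the speed of these heuristics comes from first locating short shared words and only then examining the surrounding region. The sketch below shows exact word seeding only; it is not the BLAST or FASTA implementation, which add steps such as scoring similar words, extending seeds, and computing statistics. The function names and the word length `k=3` are assumptions.

```python
from collections import defaultdict

def index_words(seq: str, k: int) -> dict:
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where the same k-letter word occurs."""
    index = index_words(subject, k)
    return [(q, s)
            for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]

print(find_seeds("TGCTA", "ATGGCT"))  # [(1, 3)]
```

Only the seed positions need a closer look, so most of the database is never aligned in full.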
BLAST
• Basic Local Alignment Search Tool: A sequence
comparison algorithm optimized for speed used to search
sequence databases for optimal local alignments to a
query.
• Expected Value (E)
– The number of matches expected to occur randomly with a given score.
– The number of different alignments with scores equivalent to or better
than S that are expected to occur in a database search by chance.
– The lower the E value, the more significant the match.
– The Expect value can be any positive real number.
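The slide does not give a formula, but a commonly quoted relation from Karlin-Altschul statistics expresses E in terms of the normalized bit score S' as E = m × n × 2^(-S'), where m is the query length and n is the database length. The sketch below applies it with made-up lengths; real BLAST reports use effective lengths, so treat the numbers as illustrative only.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S'), with S' the bit score and m, n the search-space lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: a 100-residue query against a database of 10**9 residues.
for score in (30, 40, 50):
    print(score, expect_value(score, 100, 10**9))
```

Each additional bit of score halves the expected number of chance matches, and a larger database raises E for the same score.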
BLAST Search
>seq example
GAKKVIISAPSADAPMFVCGVNLDAYKPDMKVVSNASCTTNCLAPLAK
VINDNFEIVEGLMTTVHATTATQKTVDGPSGKLWRDGRGAAQNIIPAST
GAAKAVGKVIPALNGKLTGMAFRVPTPNVSVVDLTVRLGKGASYDEIKA
K
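The query above is in FASTA format: a header line starting with ">" followed by sequence lines that may wrap. A minimal parser sketch in Python is shown below; the function name `parse_fasta` is an assumption, and the whole header line is used as the record name.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

query = """>seq example
GAKKVIISAPSADAPMFVCGVNLDAYKPDMKVVSNASCTTNCLAPLAK
VINDNFEIVEGLMTTVHATTATQKTVDGPSGKLWRDGRGAAQNIIPAST
GAAKAVGKVIPALNGKLTGMAFRVPTPNVSVVDLTVRLGKGASYDEIKA
K
"""

sequences = parse_fasta(query)
print(len(sequences["seq example"]))  # 147
```

The 147 residues are the fly row of the multiple sequence alignment example with its gap characters removed.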
Genome Browser
Genome Browser
• A genome browser is a computer program that helps to display gene
maps, browse chromosomes, and align genes or gene models with
ESTs or contigs.
UCSC Genome Browser (http://genome.ucsc.edu)
NCBI Map Viewer
Gene Finding Methods
Gene Prediction Methods
• Ab initio gene prediction programs
• Programs using expressed sequences
• Programs using evolutionary conservation
Evolution
• Evolution works in two ways:
– Mutation
– Selection pressure to eliminate random mutations
• Mutations that cause frame shifts in the coding exons
of important proteins will most likely not survive.
• Mutations in introns or in non-gene regions have very little
effect on the survival of the species and are therefore more likely to be
kept in the sequence.
• When two sequences are aligned and compared, the regions
that are conserved are most likely gene regions.
Gene Annotation
http://www.pggrc.co.nz/Portals/0/Mbb%20ruminantium%20genome%20DIAGRAM.jpg
Genomic Variations and Disease
DNA Variations
• DNA Mutation
– Synonymous mutations: the encoded amino acid does not change
– Non-synonymous mutations: the encoded amino acid changes
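Whether a single-base change is synonymous depends only on the codon before and after the change. The sketch below classifies a point mutation using the standard genetic code; the function name and the example codons are illustrative, and a change to a STOP codon is simply reported as non-synonymous.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_mutation(codon: str, position: int, new_base: str) -> str:
    """Classify replacing the base at position (0, 1, or 2) of a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"

print(classify_mutation("GCT", 2, "C"))  # synonymous      (Ala -> Ala)
print(classify_mutation("GAA", 1, "T"))  # non-synonymous  (Glu -> Val)
```

Changes at the third codon position are the ones most often synonymous, because many amino acids are encoded by several codons that differ only there.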
Genome Sequences and Diseases
http://genomics.energy.gov
Single Nucleotide Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9%
identical.
• The 0.1% difference is due to genetic variations, and mainly one form
of variation called single nucleotide polymorphisms (single-base
mutations).
• Other genetic variations may arise from nucleotide insertions and
deletions (tandem repeat polymorphisms and insertion/deletion
polymorphisms)
• These polymorphisms are considered one of the key factors that
make each of us different and can have a major
impact on how we respond to diseases; environmental insults such as
bacteria, viruses, and chemicals; and drugs and other therapies.
SNPs and Mutations
• Terminology for variation at a single nucleotide position is
defined by allele frequency.
– A single base change, occurring in a population at a
frequency of >1% is termed a single nucleotide
polymorphism (SNP)
– When a single base change occurs at <1% it is
considered to be a mutation
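The 1% rule is a simple threshold on allele frequency. The sketch below applies it to allele counts; the function name and the counts are hypothetical, and because the definition covers only frequencies above or below 1%, a frequency of exactly 1% is reported as unspecified rather than assigned to either class.

```python
def classify_variant(allele_count: int, total_alleles: int, threshold: float = 0.01) -> str:
    """Label a single-base change by its frequency in the sampled population."""
    frequency = allele_count / total_alleles
    if frequency > threshold:
        return "SNP"
    if frequency < threshold:
        return "mutation"
    return "unspecified (exactly at threshold)"

# Hypothetical counts among 1,000 sampled chromosomes.
print(classify_variant(30, 1000))  # SNP
print(classify_variant(2, 1000))   # mutation
```

The label depends on the population sampled, so the same change can be a SNP in one population and a rare mutation in another.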
Protein/RNA Structure
RNA Structure
• RNA can have a complicated secondary structure
Genes VIII, Lewin, 2004
Protein Structure
• Primary structure: amino acid sequence
• Secondary structure: local structure such as
alpha helix and beta sheets
• Tertiary structure: 3D structure of a protein
monomer
• Quaternary structure: 3D structure of a fully
functional protein (protein complexes)
Protein Secondary Structure
• Protein can have secondary structure
• Alpha helix and Beta sheet
Molecular Cell Biology, Lodish et al. 2000
Protein 3D Structure
• Protein structure is closely related to its biological function/activity
• One protein may have multiple domains that interact functionally with different molecules
– Domains in one protein may interact extensively or simply be connected by the protein sequence
Human P53 core domain
MMDB ID: 69151
PDB ID: 3D0A
High-Throughput Technologies
DNA Sequencing Technologies
• Sanger method, 1977
– Used in Human Genome Project
– Slow and expensive ($300m/genome)
• Whole Genome Shotgun sequencing (1990s)
– Break the genome into short pieces
– Sequence all the pieces in parallel
– Put all the pieces back together (sequence assembly)
– Faster and cheaper (~$10m/genome)
• Next-generation sequencing technologies (2000s)
– Much faster speed & lower cost (<$5k/genome, 2010)
– May be used for personal genomics
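The "put all the pieces back together" step of shotgun sequencing can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another) and is not how production assemblers work; the reads and function names are made up.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps: the remaining contigs stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Made-up reads sampled from the toy sequence ATGGCTGATTACAGGT.
reads = ["TTACAGGT", "ATGGCTGA", "CTGATTAC"]
print(greedy_assemble(reads))  # ['ATGGCTGATTACAGGT']
```

Repeated regions longer than the reads break this greedy strategy, which is one reason genome assembly is hard in practice.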
http://beespotter.mste.uiuc.edu/topics/genome/Honey%20bee%20genome.html
Microarray
http://www.coriell.org/index.php/content/view/93/184/