Biology 162: Computational Genetics
Fall 2004
Todd Vision
Assistant Professor
Department of Biology, UNC Chapel Hill
Bioinformatics vs. computational genetics
• Bioinformatics: The application of
computing technology to molecular
biology
• Computational genetics: The
interdisciplinary intersection of genetics,
computer science and statistics
Course emphasis
• Data analysis in molecular genetics
• We will not cover
– Developments in IT hardware
– Analysis of protein structure
– Modeling of metabolic pathways, cells,
tissues, organs, etc. (i.e. systems biology)
Prerequisites
• Bio 50: Molecular Biology and Genetics
– Gene/protein structure and expression
– Principles of inheritance
• Comp Sci 14: Introduction to Programming
– Algorithms and their design
– Fundamental programming skills
• Stat 31: Introduction to Statistics
– Probability and Distributions
– Hypothesis testing and parameter estimation
Related courses at UNC
• Biology 170/Math 107, Mathematical and
Computational Models in Biology (Tim Elston
and Maria Servedio)
• Summer courses in
– Computer Science
• Graduate courses in
– Bioinformatics and Computational Biology
– Biostatistics
– School of Pharmacy
Readings
• Gibson and Muse, A Primer of Genome
Science, Sinauer Associates.
– Available in Student Bookstore
– Primarily covers genomic technologies
– Brief on computational/statistical aspects
• Supplemental papers
– Handed out in class or posted on Blackboard
– Includes
• More detail on computational/statistical aspects
• Papers which you will review for class assignments
https://blackboard.unc.edu
Computer labs / Problem sets
• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due the following Tuesday
• Purpose:
– Familiarity with genomic databases and tools
• Functional and evolutionary sequence analysis
• Gene expression analysis
• Mapping of genomes and complex traits
– Comfort with command-line tools and computing
– Exercise of scientific reasoning and biological
judgement
– No programming required (but learn Perl anyway!)
Research paper
• Critical review of the computational
challenges involved in assembly of the
human genome
• Based on opposing articles from the main
players in the drama
• Paper will be judged on
– Understanding of content
– Critical and synthetic reasoning
– Clarity of scientific writing
Late policy
• Assignments are due at beginning of
class on the due date
• Late assignments receive half-credit
• Exceptions can be made but require
more than 24 hours' notice
Group work
• You are encouraged to work together on
most assignments (some exceptions)
• What you turn in should be your own
– Show your work
– Be able to defend your answers
• Know and love the UNC Honor Code
– http://honor.unc.edu
Exams
• Two midterms
• Final exam will be cumulative
• May include material from labs/problem
sets, readings and lectures
• Most questions will be similar to those
on lab/problem sets
• You will receive a study guide in
advance
Grading
• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
– No curve, point divisions at discretion of instructor
– Different divisions for undergraduate/graduate students
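The weights above combine into a single course score as a weighted average. A minimal sketch, assuming every component is scored on a 0-100 scale; the function name `course_score` and the sample scores are hypothetical, and the late-policy half-credit rule is not modeled:

```python
# Weights taken from the grading slide; everything else here is illustrative.
WEIGHTS = {"labs": 0.50, "paper": 0.10, "midterms": 0.20, "final": 0.20}


def course_score(labs, paper, midterms, final):
    """Weighted course score on a 0-100 scale.

    labs     -- list of 10 lab/problem-set scores (5% each)
    paper    -- review paper score (10%)
    midterms -- list of 2 midterm scores (10% each)
    final    -- final exam score (20%)
    """
    if len(labs) != 10:
        raise ValueError("expected 10 lab/problem-set scores")
    if len(midterms) != 2:
        raise ValueError("expected 2 midterm scores")
    lab_avg = sum(labs) / len(labs)
    midterm_avg = sum(midterms) / len(midterms)
    return (WEIGHTS["labs"] * lab_avg
            + WEIGHTS["paper"] * paper
            + WEIGHTS["midterms"] * midterm_avg
            + WEIGHTS["final"] * final)


# Hypothetical student: 80 on every lab, 90 on the paper,
# 70 and 90 on the midterms, 85 on the final.
print(course_score([80] * 10, 90, [70, 90], 85))
```

Because there is no curve and the letter-grade divisions are at the instructor's discretion, the sketch stops at the numeric score.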
Computer lab server: Biolinux
• All necessary analysis software is
installed
• Dell PowerEdge server
– Red Hat Linux operating system
– 2 Xeon processors
– 2 GB RAM
– 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space
Connecting to Biolinux
• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
– Zip archive contains necessary connection
software
• MacOSX
– X11 for graphical sessions
– Fugu for secure ftp
• Linux/Solaris/etc.
– Should work as is
https://onyen.unc.edu
http://cilantro.bio.unc.edu/biolinux
Cretaceous Park?
• In 1994, researchers reported a remarkably
well-preserved Cretaceous dinosaur fossil.
• DNA was extracted
– Care was taken to prevent contamination
• Specific regions were amplified
– 20 different PCR primer pairs used, including 6
pairs from mitochondrial cytB
– How would you design primers for dinosaur DNA?
– All yielded products in mammals, birds and
reptiles
– Only one cytB pair yielded a product from the
fossil
Cretaceous Park?
• One cytB fragment amplified
• 9 sequences obtained from two bone
samples
– Variability was present within and between the two
samples; no two sequences were identical
• Consensus sequences used to search for
homologs
– GenBank (215,000 sequences)
– BLAST
• Measured percent identity
• Closest matches were ~70% identical
– Equidistant to mammals, birds, and reptiles
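Percent identity, the measure used above, can be sketched as follows. This is a minimal illustration, not the original authors' procedure: it assumes the two sequences are already aligned (equal length, "-" for gaps) and counts only columns where neither sequence has a gap, which is one of several conventions in use. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment (made-up sequences, not the fossil data)
seq1 = "ATGGCC-TAA"
seq2 = "ATGACCGTAT"
print(round(percent_identity(seq1, seq2), 1))
```

Note that a single number like this ignores which positions differ and how, which is part of why the reanalyses on the next slide used multiple alignment and phylogenetic methods instead.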
Cretaceous Park?
• One would expect dinosaur DNA to be most
similar to that of birds, and then crocodilians
• Other authors reanalyzed the data
– Multiple alignment
– Protein sequence scoring matrix
– Phylogenetic analysis
• All concluded that the DNA was clearly
mammalian, possibly human
• One group showed that similar sequences
could be amplified from human nuclear DNA
Cretaceous Park?
• Three possibilities
– Preparation of human nuclear DNA could have
been contaminated by dinosaur DNA
– Dinosaurs and humans might have hybridized
during the Cretaceous
– Dinosaur extracts were contaminated by human
DNA
• Study revealed an interesting aspect of
human molecular evolution, but not much
about dinosaurs
• Lesson learned: naïve computational analysis
can lead to very misguided conclusions!
Discussion question
• You are given the sequence of a new
gene and asked to determine its
function.
• How would you begin?
– What ‘wet lab’ approaches are possible?
– What ‘in silico’ approaches are possible?
– What approaches might require both wet
lab and in silico components?
Biological topics
• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits
Computational topics
• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal
Some informatics tools
• GenBank, UniProt, and other major sequence
repositories
• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD,
FlyBase, Ensembl)
• A sampling of software programs
– Chosen primarily for pedagogical utility
Genomics
• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
– Robotics
– Computers
Genome database examples
• Primary databases
– GenBank/EMBL/DDBJ
• Secondary databases
– Pfam (protein domains)
• Organism-specific
– SGD (yeast genomics)
• Specialized dBs
– OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids
Research:
http://www3.oup.co.uk/nar/database/c/
Growth of GenBank
http://www.expasy.org/cgi-bin/show_thumbnails.pl?2
First bacterial genome: 1995
• Haemophilus influenzae (TIGR)
– 1.8 x 10^6 bp shotgun assembly
– Required 9 months of computer time
• Now there are hundreds
– 160 Bacterial
– 19 Archaeal
– 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to
sequence and assemble
Tree of life
More protein families await
Other types of genomic data
• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
– Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
– Structure, including modifications
– Interactions with other molecules
• Metabolic profiling, etc., etc.
Algorithmic/statistical
innovations
• The most fundamental and heavily used
application in the field is pairwise alignment
– Smith-Waterman algorithm (1981)
• Still too slow for general database search
– BLAST (1990)
• Made database search of 10^7-10^8 sequences feasible
• Statistical ranking of each alignment
• Statistical methods in molecular evolution <25
yrs old
• Modern genetic mapping methods ~15 yrs old
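The Smith-Waterman algorithm named above finds the best-scoring local alignment by dynamic programming. A minimal sketch, assuming a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2); these scores and the function name are illustrative choices, and real tools use substitution matrices and affine gap penalties:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for one optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new alignment
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (made up): the shared core "ACG" should be recovered.
print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The table has len(a) x len(b) cells, so the cost of one comparison grows with the product of the two sequence lengths. Repeating that against every sequence in a database is what makes the exact method too slow for general search and what BLAST's heuristics avoid.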
Things to review
• Chemical differences among amino
acids
• Prokaryotic and eukaryotic gene
structure
• The central dogma
• Anatomy of a typical protein
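As a refresher on the central dogma (DNA to mRNA to protein), here is a minimal sketch of transcription and translation using the standard genetic code. The function names and the example sequence are hypothetical; the sketch assumes the input is the coding strand, reads from the first base in frame, stops at the first stop codon, and ignores introns, alternative start sites, and non-standard codes.

```python
# Standard genetic code, built from the conventional TCAG ordering:
# codon index = 16 * first base + 4 * second base + third base.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """mRNA with the same sequence as the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate in frame from position 0, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")  # the lookup table is keyed on DNA codons
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]  # KeyError on non-ACGT input
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```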
Reading for Thursday
• Gibson and Muse, Ch. 1, Genome
Projects, pp. 1-58.