Bioinformatics
Overview
Carlow IT September 2006
What, me?
• Andrew Lloyd
• [email protected]
• 087-225-9850
• 053-9255717
• Director INCBI 1993-2000
• Population genetics, evolution
• Whole genome analysis
• Immunology, chickens, FIRM
• http://ercbinfo1.ucd.ie/itc/
Definition/scope
• Storage, retrieval and analysis of biological
(sequence) information.
• Insert better definition here
• A case can be made for including microarray analysis
• NOT
– ecoinformatics (ecology)
– Image analysis
– Bar-coding hospital sheets
Subtext
Critical thinking: crap detecting
• Is it true?
• Supporting evidence?
• Assumptions made?
• Alternative explanations?
• Is statement testable? Has it been tested?
• More information necessary?
• Consequences? Predictions?
Philosophy
• “Nothing worth learning can be taught”
Oscar Wilde
• Read the handout before class
• Finish the exercises out of class
• Read science, talk science, think science
• You de punter!
• Stop me !!!!
Exams
• The most boring, stressful and hateful part
of the course (for me)
• 30% is Continuous assessment
– Easy marks, forward planning
– Practical exams
– Gene day presentations
• Exams: any extra info, original thoughts =
BONUS points
Getting bioinformation
• Type it in: A,T,C,C,G,T,C,A (1991)
• Access databases
– Literature (Medline/Pubmed)
– Medical (OMIM)
– DNA sequence (EMBL/GenBank)
– Protein sequence (UniProt, SwissProt, PIR)
– 3-D structure (PDB)
Annotation
• In any DB, half is data and half context.
– Parsing sequence (ORF, RBS, Intron, α-helix)
– Recognising similar sequences (evolution!)
– Complementary info : DB cross-referencing
• (DNA -> Protein -> 3D structure -> motifs)
Secondary databases
• Protein motifs, domains, families
• RNA structures (16S ribosomal RNA…)
• Taxonomy/classification
• Metabolic pathways
• Enzymes
• SNPs, mutations and variants
• Disease DBs
• Immuno, epitope DBs
Complete genomes
• Ensembl (complex, basically vertebrate)
– Uniform look-and-feel; cross-refs
– See also UCSC GoldenPath browser
• Plants
• Bacterial genomes
– Mitochondrial, chloroplast
– Eubacteria, archaea
– Each idiosyncratic & in its own place
– Some meta-DBs
Annotated/known genes
• What does my gene do?
• Blast (fasta) against the DB
• SRS/Entrez to access databases
– Neighboring (similar things in same DB)
• DB cross-references
– full picture of attributes
– What biochemical pathway?
The territory
[Figure: map of linked databases. Nodes: OMIM; Maps & Genomes; Full-text journals; GenBank/EMBL (DNA sequence); PubMed; UniProt (protein sequence); Prosite; Pfam; Taxonomy; PSSM; PDB (3-D structure)]
Databases
• BIG
• EMBL/GenBank: 400 GB, 60m entries, 2500
complete genomes, 200K species
• Encyclopaedia Britannica: 180m letters, 1.3 m of shelf
• EMBL printed the same way: 3 km of Britannica volumes
• Doubling every 14-18 months
• Human genome is ?
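The size comparison on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's 2006 figures. It assumes that one byte corresponds to one printed letter and that the slide's "1.3m" means 1.3 metres of shelf for one Britannica set. The function names are made up for illustration.

```python
# Figures taken from the slide (2006 values).
EMBL_BYTES = 400e9           # EMBL/GenBank: about 400 GB
BRITANNICA_LETTERS = 180e6   # Encyclopaedia Britannica: about 180 million letters
BRITANNICA_SHELF_M = 1.3     # ASSUMPTION: "1.3m" read as metres of shelf per set


def shelf_km(db_bytes):
    """Shelf length in km if the database were printed at Britannica density.

    Assumes one byte of database equals one printed letter.
    """
    sets_needed = db_bytes / BRITANNICA_LETTERS
    return sets_needed * BRITANNICA_SHELF_M / 1000


def projected_size(db_bytes, months, doubling_months):
    """Size after `months` of steady exponential growth."""
    return db_bytes * 2 ** (months / doubling_months)


if __name__ == "__main__":
    print(f"Shelf needed: {shelf_km(EMBL_BYTES):.1f} km")
    for doubling in (14, 18):
        size = projected_size(EMBL_BYTES, 60, doubling)
        print(f"In 5 years, doubling every {doubling} months: {size / 1e12:.1f} TB")
```

Under those assumptions the shelf comes out a little under 3 km, in line with the slide. The same `shelf_km` function can be fed a human genome size once you have looked one up.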
New Unknown Gene
• Blast homology searching
• Genomic location/neighboring genes
• Where is it expressed?
• How regulated (control sequences)
• Intron/exon structure
• Domain structure
• Restriction sites etc.
• Primer design
DNA/gene structure
• Four bases A T C G (U replaces T in RNA)
– 2 pyrimidine, 2 purine
– LOTS of them: how many?
• Open reading frame
• 5’ signals, 3’ signals
• Introns/exons
• Neighbours (operons)
Two sequences
• Alignment
– Local
– Global
• Dotplot
• Threading
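To make the local/global distinction and the dotplot concrete, here is a minimal sketch. It is not how Blast, fasta or any production aligner is implemented. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and the function names are illustrative. Global scoring follows the Needleman-Wunsch recurrence; local scoring follows Smith-Waterman.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False scores the two sequences end to end (Needleman-Wunsch).
    local=True scores the best matching sub-region (Smith-Waterman).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are charged.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


def dotplot(a, b):
    """One text row per residue of `a`; '*' marks identity with a residue of `b`."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    print(alignment_score("TTTACGTTT", "GGACGGG"))              # global
    print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local
    print("\n".join(dotplot("GATTACA", "GATTACA")))
```

In the dotplot, a shared region shows up as a diagonal run of `*`. The local score picks out that same region, while the global score penalises everything around it.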
One seq vs many
• Homology search vs database
• Special case of 2-seq alignment
• Blast vs fasta
• Limit by species/taxon
• Substitution matrices
• Low complexity masking
Multiple sequence alignment
• MSA
• Progressive alignment
• ClustalW or (better) T-Coffee
Phylogenetic trees
• Computationally intensive
• Distance matrix methods
– Neighbor-joining (NJ)
– UPGMA
• Minimum evolution
• Maximum parsimony
• Maximum likelihood
– Bayesian methods
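Distance matrix methods such as NJ and UPGMA start from a table of pairwise distances between aligned sequences. The sketch below covers only that first step, not the tree building. It computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). The choice of Jukes-Cantor and all function names are assumptions made for illustration.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; infinite once p reaches saturation (0.75)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(aligned):
    """Map each (name, name) pair to its corrected distance."""
    names = list(aligned)
    return {
        (m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
        for m in names
        for n in names
    }


if __name__ == "__main__":
    toy = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "ACTTACGA"}  # made-up data
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```

The correction matters because raw p-distances underestimate change once the same site has mutated more than once.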
Genefinding
• Special case of DNA analysis
• How to annotate a genome
• Bacterial
– Find open reading frames (ORFs)
– With start/stop codons
– With promoter, RBS, CAAT, TATA
• Eukaryotic
– As above PLUS
– Introns/exons
– Alternative splicing
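The bacterial recipe (find ORFs bounded by start and stop codons) can be sketched in a few lines. This is a deliberately naive scan, assuming ATG as the only start codon and the three standard stop codons. It ignores promoters, RBS and every other signal on the slide, and it cannot handle introns. Function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, start, end) tuples for ATG...stop ORFs in all six frames.

    Coordinates are 0-based and end-exclusive (the stop codon is included),
    and refer to the strand that was scanned. `min_codons` counts codons
    before the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # close it and keep scanning
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCCAAATAGCC"))
```

Real gene finders add the signal searches and statistical models that this sketch leaves out. Raising `min_codons` is the crude way to cut down on chance ORFs.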
Protein substructure
• DNA makes protein and protein (enzymes)
make everything else.
• 20 Amino acids
• Amino acid properties
• Motifs
• Domains
• Biological units
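"DNA makes protein" can be made concrete with a small translation routine. The sketch below builds the standard genetic code table from its conventional TCAG ordering and translates a coding sequence in frame 1. It assumes the standard code only, so it does not apply to mitochondrial or other variant codes. Unknown codons become "X" and stops become "*". Names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG x TCAG x TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence from its first base, codon by codon.

    Trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    print(translate("ATGGCCTGGAAATGA"))
```

The table has 64 codons but only 20 amino acids plus stop, which is why many single-base changes in the third codon position leave the protein unchanged.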
Amino acid properties
again … and again and again
Protein 3-D structure
• Relationship between sequence & structure
• Secondary structure
– Alpha helix
– Beta sheet
– Coil
– Turn
• Threading sequence to homologous structure
Gene Expression
• EST
• SAGE
• MicroArray
• Clustering of co-expressed genes
Genomics
• Complete DNA seq for a species
• Gene order
• Gene clusters/operons
– Missing operons
• Gene duplication
• Whole genome duplication (WGD)
SNPs
• Key issue in genetics is that two organisms
are both the same and different:
– Humans vs chimps vs mouse
– Parent vs offspring vs co-national vs human
• Single nucleotide polymorphisms
• Variation between individuals
• Pharmacogenetics
– Personal tailored medicine
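Finding where individuals differ comes down to comparing aligned sequences column by column. The sketch below reports every alignment column where the samples carry more than one base. It assumes the sequences are already aligned and treats "-" and "N" as missing data. A real SNP call also needs population frequencies and sequencing-quality checks, which are not modelled here. The function name and the toy data are made up.

```python
def variable_sites(aligned):
    """Return {position: {sample: base}} for columns where samples differ.

    Positions are 0-based. Gaps ('-') and unknown bases ('N') are ignored
    when deciding whether a column is variable.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        column = {name: aligned[name][pos] for name in names}
        alleles = set(column.values()) - {"-", "N"}
        if len(alleles) > 1:
            sites[pos] = column
    return sites


if __name__ == "__main__":
    toy = {
        "parent":    "ACGTTAGC",
        "offspring": "ACGTTAGC",
        "stranger":  "ACGCTAGC",
    }
    for pos, column in variable_sites(toy).items():
        print(pos, column)
```

Running the same comparison across species rather than individuals gives fixed differences rather than polymorphisms, which is the "same and different" point made on the slide.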
Summary/take home
• Course designed to give you access to
databases, software tools
• …and ways of thinking about data