Bioinformatics

Overview
School of B&I TCD May 2010
Who, me?
• Andrew Lloyd
• [email protected]
• 087-225-9850, 053-9255717, 01-896-2450
• Director INCBI 1993-2000
• Population genetics, evolution
• Whole genome analysis
• Immunology, chickens, FIRM
Definition/scope
• Storage, retrieval and analysis of biological (sequence) information.
• Insert better definition here
• Case can be made for microarray analysis
• NOT
– ecoinformatics (ecology)
– Image analysis
– Bar-coding hospital sheets
Philosophy
“Nothing worth learning can be taught”
Oscar Wilde
Getting bioinformation
• Type it in: A,T,C,C,G,T,C,A (1991)
• Access databases
– Literature (PubMed)
– Medical (OMIM)
– DNA sequence (EMBL/GenBank)
– Protein sequence (UniProt, SwissProt, PIR)
– 3-D structure (PDB)
Annotation
• In any DB, half is data and half context.
– Gene ontology (language)
– Parsing sequence (ORF, RBS, intron, α-helix)
– Recognising similar sequences (evolution!)
– Complementary info: DB cross-referencing
• (DNA -> Protein -> 3D structure -> motifs)
Secondary databases
• Protein motifs, domains, families
• RNA structures (16S ribosomal RNA…)
• Taxonomy/classification
• Metabolic pathways (KEGG)
• Enzymes (Brenda, TCD, Ireland)
• SNPs: mutations and variants
• Disease DBs (OMIM)
• Immuno, epitope DBs
Complete genomes
• Ensembl (complex, basically vertebrate)
– Uniform look-and-feel; cross-refs
• UCSC GoldenPath browser
• Plants
• Bacterial genomes
– Including mitochondrial, chloroplast
– Eubacteria vs Archaea vs Eukaryotes
Annotated/known genes
• What does my gene do?
• BLAST (or FASTA) against the DB
• SRS/Entrez to access databases
– Neighboring (similar things in same DB)
• DB cross-references
– full picture of attributes
– What biochemical pathway?
The territory
[Figure: map of cross-linked databases: OMIM; Maps & Genomes; Full-text journals; PubMed; GenBank/EMBL (DNA sequence); UniProt (protein sequence); Prosite; Pfam; PSSM; Taxonomy; PDB (3-D structure)]
Databases
• BIG
• EMBL/GenBank: 200 Gbp, 100 million entries, 2,500 complete genomes, 200,000 species
• Encyclopaedia Britannica: 180 million letters, 40 million words
• EMBL is equivalent to about 1 km of Britannica volumes
• Doubling every 14-18 months
• Human genome is X bp?
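A back-of-envelope sketch of the scale figures on this slide. It assumes one base pair counts as one letter, purely for the sake of the comparison, and the function names are illustrative rather than from any library.

```python
# Figures taken from the slide; one base pair is treated as one "letter".
EMBL_BASES = 200e9          # 200 Gbp
BRITANNICA_LETTERS = 180e6  # 180 million letters


def britannica_equivalents(bases=EMBL_BASES, letters=BRITANNICA_LETTERS):
    """How many copies of the encyclopaedia the database would fill."""
    return bases / letters


def growth_factor(months, doubling_time_months):
    """Fold increase after `months`, given a fixed doubling time."""
    return 2 ** (months / doubling_time_months)


if __name__ == "__main__":
    print(f"{britannica_equivalents():.0f} Britannica-equivalents")
    for doubling in (14, 18):
        factor = growth_factor(60, doubling)
        print(f"doubling every {doubling} months: x{factor:.1f} in 5 years")
```

The spread between the 14-month and 18-month estimates compounds quickly, which is why the doubling time matters more than the current size.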
Intrinsic vs Context
Internal
• DNA, protein sequence
– DNA: Purine/Pyrimidine
– AAs: small, hydrophobic, aromatic, polar
– Variants: SNPs, Indels, Alt Splicing
• Secondary structure
– DNA: stem/loops
– Protein: helix, sheet, turn, loop
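The intrinsic properties above can be read straight off the sequence. A minimal sketch follows; the amino acid groupings are one rough, overlapping scheme chosen for illustration (textbooks draw the boundaries differently), and the function names are hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CTU")  # U stands in for T in RNA

# Rough, overlapping property classes (an assumption for illustration;
# published schemes differ in detail).
AA_CLASSES = {
    "small": set("AGSCTPDNV"),
    "hydrophobic": set("AVLIMFWC"),
    "aromatic": set("FWYH"),
    "polar": set("STNQDEKRHY"),
}


def to_purine_pyrimidine(seq):
    """Recode a nucleotide sequence as R (purine) / Y (pyrimidine) / N (other)."""
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
        else:
            out.append("N")
    return "".join(out)


def aa_properties(residue):
    """Return the sorted names of every class a one-letter residue belongs to."""
    return sorted(name for name, members in AA_CLASSES.items()
                  if residue.upper() in members)


if __name__ == "__main__":
    print(to_purine_pyrimidine("ATCCGTCA"))
    print(aa_properties("F"))
```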
Intrinsic vs Context
External, context for your molecule
• In other species (homologs, phylogenetic trees)
• In which cell
• In which cellular location (GO)
• Molecular complex (dimers)
• Which pathway (KEGG)
• Where in genome (neighbors, synteny)
New Unknown Gene
• BLAST homology searching
• Genomic location/neighboring genes
• Where is it expressed?
• How regulated (control sequences)
• Intron/exon structure
• Domain structure
• Restriction sites etc.
• Primer design
DNA/gene structure
• Four bases: A, T, C, G (U replaces T in RNA)
– 2 pyrimidine, 2 purine
– LOTS of them: how many?
• Open reading frame
• 5’ signals, 3’ signals
• Introns/exons
• Neighbours (operons)
Two sequences
• Alignment
– Local
– Global
• Dotplot
• Threading
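The difference between local and global alignment is easiest to see in the dynamic-programming recurrence. The sketch below returns only the best score, not the alignment itself, and the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions rather than a recommended scheme.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style); both sequences
                 must be aligned end to end, so terminal gaps are penalised.
    local=True:  local alignment (Smith-Waterman style); scores are floored
                 at zero and the best cell anywhere in the table is returned.
    """
    # First row of the table: all gaps (global) or zeros (local).
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    x, y = "TTTGATTACATTT", "CCCGATTACACCC"
    print("global:", align_score(x, y))
    print("local: ", align_score(x, y, local=True))
```

The two made-up sequences share only a central core, so the local score rewards the core while the global score is dragged down by the mismatched flanks.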
One seq vs many
• Homology search vs database
• Special case of 2-seq alignment
• BLAST vs FASTA
• Limit by species/taxon
• Substitution matrices
• Low complexity masking
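Low-complexity masking can be illustrated with a sliding-window entropy filter. This is a simplified stand-in, not the algorithm used by any particular search program; the window size, threshold and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the characters in `window`."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy falls below `threshold`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACGTACGTACGT"))   # mixed composition
    print(mask_low_complexity("AAAAAAAAAAAA"))   # single-base run
```

Masking such runs before a database search stops them producing many spurious, biologically meaningless hits.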
Multiple sequence alignment
• MSA
• Progressive alignment
• ClustalW or (better) T-Coffee
Phylogenetic trees
• Computationally intensive
• Distance matrix methods
– Neighbor-joining (NJ)
– UPGMA
• Minimum evolution
• Maximum parsimony
• Maximum likelihood
– Bayesian methods
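Of the distance-matrix methods listed, UPGMA is the simplest to write down: repeatedly merge the two closest clusters and average their distances to everything else. The sketch below returns only the tree topology as nested tuples (no branch lengths), and the example distances are made up.

```python
def upgma(names, matrix):
    """UPGMA clustering.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances, same order as names
    Returns the tree topology as nested tuples.
    """
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    dist = [row[:] for row in matrix]         # working copy
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (i < j).
        i, j = min(((p, q) for p in range(n) for q in range(p + 1, n)),
                   key=lambda pq: dist[pq[0]][pq[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the rest.
        new_row = [(dist[i][k] * size_i + dist[j][k] * size_j) / (size_i + size_j)
                   for k in keep]
        # Drop rows/columns i and j, then append the merged cluster last.
        dist = [[dist[r][c] for c in keep] for r in keep]
        for row, value in zip(dist, new_row):
            row.append(value)
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0]


if __name__ == "__main__":
    taxa = ["A", "B", "C"]
    d = [[0, 2, 6],
         [2, 0, 6],
         [6, 6, 0]]
    print(upgma(taxa, d))
```

UPGMA assumes a constant rate of change along all lineages, which is one reason neighbor-joining is often preferred.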
Genefinding
• Special case of DNA analysis
• How to annotate a genome
• Bacterial
– Find open reading frames (ORFs)
– With start/stop codons
– With promoter, RBS, CAAT, TATA
• Eukaryotic
– As above PLUS
– Introns/exons
– Alternative splicing
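The bacterial case can be sketched as a scan for ATG-to-stop stretches in all six reading frames. This is a deliberately naive illustration: it ignores promoters, ribosome binding sites and alternative start codons, and it would not cope with eukaryotic introns. The `min_codons` cut-off is an arbitrary assumption.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (non-ACGT left unchanged)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, start, end) for each ATG...stop open reading frame.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned. The length cut-off counts codons including the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos                  # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (pos + 3 - start) // 3 >= min_codons:
                        yield strand, start, pos + 3
                    start = None                 # look for the next ORF


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGCC"  # made-up sequence
    print(list(find_orfs(toy, min_codons=3)))
```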
Typical mammalian gene structure
[Figure: a gene drawn 5’ to 3’ on the DNA: control region, start (ATG), exons 1 to 4 separated by introns (each intron begins gt… and ends …ag), stop. Labelled "miRNAs?"]
DNA -> RNA: introns “spliced out” and discarded
RNA -> PROTEIN: ATGCCCAGGAGATTTGGA . . . -> MetProArgArgPheGly . . .
Stop: TAG, TGA, TAA
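The translation shown on the slide can be reproduced with a lookup in the standard genetic code. The sketch below builds the codon table from the conventional TCAG ordering and assumes the input contains only A, C, G and T (anything else raises a KeyError). Function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def three_letter(protein):
    """Spell a one-letter protein string in three-letter codes."""
    return "".join(THREE_LETTER[aa] for aa in protein)


if __name__ == "__main__":
    print(three_letter(translate("ATGCCCAGGAGATTTGGA")))
```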
Protein substructure
• DNA makes protein, and proteins (enzymes) make everything else.
• 20 Amino acids
• Amino acid properties
• Motifs
• Domains
• Biological units
Amino acid properties
again … and again and again
Protein 3-D structure
• Relationship between sequence & structure
• Secondary structure
– Alpha helix
– Beta sheet
– Coil
– Turn
• Threading sequence to homologous structure
Gene Expression
• EST
• SAGE
• Microarray
• Clustering of same expressed genes
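Clustering genes by expression usually starts from a similarity measure between expression profiles. A minimal sketch using Pearson correlation follows; the gene names and values are invented for illustration, and the 0.9 cut-off is an arbitrary assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def co_expressed(profiles, gene, min_r=0.9):
    """Names of genes whose profile correlates with `gene` at or above min_r."""
    target = profiles[gene]
    return sorted(name for name, profile in profiles.items()
                  if name != gene and pearson(target, profile) >= min_r)


if __name__ == "__main__":
    # Hypothetical expression levels across four conditions.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.1, 3.9, 6.2, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(co_expressed(profiles, "geneA"))
```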
Genomics
• Complete DNA seq for a species
• Gene order
• Gene clusters/operons
– Missing operons
• Gene duplication
• Whole genome duplication (WGD)
SNPs
• A key issue in genetics is that two organisms are both the same and different:
– Humans vs chimps vs mouse
– Parent vs offspring vs co-national vs human
• Single nucleotide polymorphisms
• Variation between individuals
• Pharmacogenetics
– Personal tailored medicine
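Given two sequences that are already aligned, candidate single-base differences are simply the columns where the bases disagree. The sketch below assumes equal-length, pre-aligned input and skips gap columns; whether a difference is a true polymorphism depends on population data that this code does not consider.

```python
def find_single_base_differences(ref, alt):
    """List (position, ref_base, alt_base) for each substitution.

    Positions are 1-based. Columns containing a gap ('-') are skipped,
    since those are indels rather than single-base substitutions.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, a)
            for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
            if r != a and "-" not in (r, a)]


if __name__ == "__main__":
    # Made-up aligned sequences.
    print(find_single_base_differences("ACGTAC", "ACCTAC"))
```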
Summary/take home
• Course designed to give you access to databases and software tools
• …and ways of thinking about data