What is Bioinformatics?

• Bioinformatics: collection and storage
of biological information
• Computational biology: development
of algorithms and statistical models to
analyze biological data
Jobs for bioinformaticians
Databases make biological data
available to scientists
• As biology has increasingly turned into
a data-rich science, the need for storing
and communicating large datasets has
grown tremendously.
– Nucleotide, protein sequences
– Protein structure
– Expression data
– Gene/protein networks
Nucleotide Databases
• EMBL www.ebi.ac.uk/embl/
– The EMBL (European Molecular Biology
Laboratory) nucleotide sequence database is
maintained by the European Bioinformatics
Institute (EBI) in Hinxton, Cambridge, UK.
Nucleotide Databases cont.
• GenBank: maintained by the National
Center for Biotechnology Information
(NCBI); searchable through Entrez, which gives access to
nucleotides, proteins, annotations, etc.
www.ncbi.nlm.nih.gov/Genbank/
• UniGene: a non-redundant set of gene-oriented clusters
www.ncbi.nlm.nih.gov/UniGene/
Protein Databases
• SWISS-PROT: a protein sequence database
that aims to provide a high level of
annotation (such as the description of
the function of a protein, its domain
structure, post-translational modifications,
variants, etc.), a minimal level of
redundancy and a high level of integration
with other databases.
www.expasy.ch/sprot/
Protein Databases cont.
• PIR
http://pir.georgetown.edu/
– The Protein Information Resource (PIR) is a
division of the National Biomedical Research
Foundation (NBRF) in the US. It collaborates
with the Munich Information Center for Protein
Sequences (MIPS) and the Japan International
Protein Information Database (JIPID).
Release 67.00 (31 Dec 2000) contains
198,801 entries.
Sequence Motif Databases
• Pfam
www.sanger.ac.uk/Software/Pfam/
– Pfam is a database of protein families defined
as domains (contiguous segments of entire
protein sequences). For each domain, it
contains a multiple alignment of a set of
defining sequences (the seeds) and the other
sequences in SWISS-PROT that can be
matched to that alignment.
3D-Structure Databases
• PDB
www.rcsb.org/pdb/
– The PDB is the main primary database for 3D
structures of biological macromolecules
determined by X-ray crystallography and NMR.
Structural biologists usually deposit their
structures in the PDB on publication, and some
scientific journals require this before accepting a
paper. It also accepts the experimental data
used to determine the structures.
How to get sequences?
• Entrez Database provides nucleotide and
protein sequences in different formats.
• One of the formats is FASTA
FASTA FORMAT
• Each sequence begins with a description
line that starts with '>'
A protein in FASTA format
>HBA_ALLMI
VLSMEDKSNVKAIWGKASGHLEEYGAEALEMF
CAYPQTKIYFPHFDMSHNSAQIRAHGKKVFSA
LHEAVNHIDDLPGALCRLSELHAHSLRVDPVNF
KFLAHCVLVVFAIHHPSALSPEIHASLDKFLCAV
SAVLTSKYR
• The first line is the description line. It starts with the
character '>', which indicates that the description of a
sequence follows. The string following the '>' and
ending at the first space (' ') is the sequence id
(HBA_ALLMI).
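The rule above (the id is the text between '>' and the first space) can be illustrated with a small parser. This is only a minimal sketch: the function name parse_fasta is made up for illustration, and it assumes well-formed input where every sequence line follows a description line.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence id to its sequence, joining wrapped lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The id runs from after '>' to the first space.
            current_id = line[1:].split(" ", 1)[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines.
            records[current_id] += line
    return records


example = """>HBA_ALLMI
VLSMEDKSNVKAIWGKASGHLEEYGAEALEMF
CAYPQTKIYFPHFDMSHNSAQIRAHGKKVFSA
LHEAVNHIDDLPGALCRLSELHAHSLRVDPVNF
KFLAHCVLVVFAIHHPSALSPEIHASLDKFLCAV
SAVLTSKYR
>X sequence
ATGAATAGCACAGAGAGACCAAGAGAG
AGAGAGAGACCCAGATATATCAGATAGA
GA
"""

records = parse_fasta(example.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

Dedicated libraries handle more variants of the format; the sketch only shows how the description line and the sequence lines relate.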
A DNA sequence in FASTA format
>X sequence
ATGAATAGCACAGAGAGACCAAGAGAG
AGAGAGAGACCCAGATATATCAGATAGA
GA
Why align sequences?
• Find evolutionary relationships between
species and/or genes.
• Identify novel genes and define similar
genes in other species.
• Study genomes and how they change.
Sequence Alignment
• Homology means that two (or more)
sequences have a common ancestor.
• An example of a sequence alignment
Sequence 1
Sequence 2
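To make the idea of aligning two sequences concrete, here is a minimal sketch of a global pairwise alignment using the Needleman-Wunsch dynamic-programming method. It is not how CLUSTALW works internally (CLUSTALW builds multiple alignments progressively), and the scoring values (match +1, mismatch -1, gap -1) and the function name global_align are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def global_align(a: str, b: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1) -> Tuple[str, str, int]:
    """Return the two gapped sequences and the alignment score."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = global_align("ACGT", "AGT")
print(top)
print(bottom)
print("score:", total)
```

Real alignment tools use substitution matrices and more refined gap penalties; the sketch keeps a single fixed penalty so the recurrence stays visible.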
CLUSTALW: software for aligning
sequences
http://www.ebi.ac.uk/clustalw/
Genome Databases
• www.ensembl.org
Genome Databases: Gene
Prediction
• Define the location of genes (coding sequences,
regulatory regions)
• Gene prediction using software based on rules and
patterns: find Open Reading Frames (ORFs), applying
additional criteria for a good start sequence for a gene.
• Gene identification through alignment with known
proteins and EST sequences (Expressed Sequence
Tags, which are derived from mRNA).
• Gene prediction through similarity with proteins or ESTs
in other organisms.
• Gene prediction through comparison with other
genomes; conserved regions are probably coding or
regulatory regions.
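The rule-based ORF search mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it scans only the three forward reading frames, treats ATG as the only start codon, and uses the standard stop codons TAA, TAG and TGA. The function name find_orfs and the min_codons threshold are hypothetical; real gene finders apply many more criteria.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for each forward-strand ORF.

    start is the 0-based index of the A in ATG; end is one past the
    last base of the stop codon; min_codons counts start and stop.
    """
    seq = seq.upper()
    orfs: List[Tuple[int, int, int]] = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


dna = ("ATGAATAGCACAGAGAGACCAAGAGAG"
       "AGAGAGAGACCCAGATATATCAGATAGA"
       "GA")
for start, end, frame in find_orfs(dna):
    print(f"frame {frame}: {start}-{end} {dna[start:end]}")
```

A fuller search would also scan the reverse complement and check for the start-sequence context the slide refers to.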
Genome Databases: Annotation
• Annotation of the genes: Compare with genes/proteins of
known function in other organisms.
• Functional classification. Broad groups of functional
characterization, such as 'ribosomal proteins', 'nucleotide
metabolism', 'signal transduction'.
Genome Databases: Evolution
• Evolutionary history
• Genome duplications
• Gene loss
Transcription Databases
• Microarrays can analyze thousands of transcripts simultaneously.
– They allow analysis of genes whose expression is higher or lower in
disease than in normal tissue, for example.
• Microarray Databases contain expression data (large amounts).
– Stanford Microarray Database:
Signaling & Metabolic Pathways
• Analyze how genes/proteins interact and learn
about function of genes
– KEGG: Kyoto Encyclopedia of Genes and Genomes
– http://www.genome.ad.jp/kegg/