Archives and Information Retrieval

Reading: Arthur M. Lesk, Introduction to Bioinformatics, Fourth Edition, Chapter 4
Introduction
• Learning objectives:
• What is the general arrangement of biological data in the
public databases?
• To know the information retrieval skills that will allow you to
make effective use of the databases.
• To become familiar with basic operations.
• How does one retrieve information on a particular subject in
the literature?
Primary public domain bioinformatics servers
• National Center for Biotechnology Information (NCBI), United States: databases, analysis tools
• European Bioinformatics Institute (EBI), United Kingdom: databases, analysis tools
• GenomeNet (KEGG & DDBJ), Japan: databases, analysis tools
The Archives
• Massive biological experimental data
• These biological information databases can be
classified into two types
• The first level databases
• Built from raw data obtained directly from experiments (“simple”)
• The second level databases
• Further reorganized based on.. in order to achieve some
specific goals
The Archives
• Some examples:
• The first level databases
• Nucleic acid sequence databases: GenBank, EMBL Data Library,
DNA Database of Japan (DDBJ)
• Protein sequence databases: SWISS-PROT, PIR
• Protein structure database: PDB
• The second level databases
• GDB
• TRANSFAC
• SCOP
Nucleic acid sequence databases
• International Nucleotide Sequence Database Collaboration
• NCBI (GenBank) – USA (1982)
• EMBL (Data Library)– Europe (1982)
• DDBJ (DNA Data Bank)– Japan (1988)
NCBI
• Established in USA in 1988 as a national resource
for molecular biology information
• creates public databases
• conducts research in computational biology
• develops software tools for analyzing genome data
• disseminates biomedical information
Nucleic acid sequence databases
• GenBank
• Nucleic acid sequences and protein sequences
• Literature references
• Biological annotation
• A new release is made every two months
GenBank information retrieval system
NCBI ENTREZ
• A platform that provides access to and links to
databases with biological information
• ENTREZ links the following databases: PubMed, MedLine, GenBank, Protein databases, Genomes, PopSet, Taxonomy, OMIM
NCBI ENTREZ
• MedLine: Literature database
• OMIM: Database of human genes and genetic disorders
• GenBank: Database of all publicly available DNA sequences
• Protein databases: Database of amino acid sequences from SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
• Genomes: Database of genomes from organisms and viruses
• PopSet: Database of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
• Taxonomy: Database of names of organisms with sequences in GenBank or Prot
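One way to query the ENTREZ databases from a program is NCBI's E-utilities web interface, which the slides do not cover. The sketch below only builds request URLs and does not contact NCBI. The endpoint names (esearch.fcgi, efetch.fcgi) and parameters (db, term, retmax, retmode, id, rettype) belong to E-utilities; the helper function names and the example accession are illustrative assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL: find record IDs in `db` that match `term`.

    Hypothetical helper name; only the URL layout comes from E-utilities.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(db, ids, rettype, retmode="text"):
    """Build an EFetch URL: download the records with the given IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Literature search on a subject (PubMed).
    print(build_esearch_url("pubmed", "Mendelian inheritance", retmax=5))
    # GenBank flat-file record for one accession (example value; substitute your own).
    print(build_efetch_url("nucleotide", ["U49845"], rettype="gb"))
```

Opening the printed URLs in a browser, or passing them to any HTTP client, returns the matching IDs or records. Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.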
PubMed Central (PMC)
• the U.S. National Library of Medicine's digital
archive of life sciences journal literature
• Access to the full text of articles in PMC is free,
except where a journal requires a subscription for
access to recent articles
OMIM: Online Mendelian Inheritance in Man
• A catalog of human genes linked to diseases
• Started by Victor A. McKusick at Johns Hopkins University
• A good place to start when you want to research a certain
disease or biological molecule
• This database is cross-referenced to PubMed and other NCBI-based databases
Complete ENTREZ database divisions
How to submit sequence data to
GenBank
• BankIt, a web-based interface
• http://www.ncbi.nlm.nih.gov/BankIt
• Sequin program
• http://www.ncbi.nlm.nih.gov/Sequin
Protein databases
• The Protein Information Resource (PIR) was established in
1984 by the National Biomedical Research Foundation
(NBRF).
• The PIR Protein Sequence Database evolved from the
original NBRF Protein Sequence Database, developed over
20 years
• PIR-International is a collaboration between NBRF, the
Munich Information Center for Protein Sequences (MIPS),
and the Japan International Protein Information Database
(JIPID)
• Together, the partners collect and publish what is now the oldest and largest
database of biomolecular sequence, source, literature, and
feature information.
PIR
• PIR-International Protein Sequence Database: an annotated, nonredundant and cross-referenced database of protein sequences.
• PIR Alignment Database, PIR-ALN: contains sequence alignments of
superfamilies, families and homology domains produced from
information in the Protein Sequence Database.
• FAMBASE Family Database: a searchable database containing a single
representative sequence from each protein family.
• RESID Database of Amino Acid Modifications: based on feature
information in the Protein Sequence Database.
PIR
• http://www-nbrf.georgetown.edu/pir/
SWISS-PROT
• http://www.ebi.ac.uk/swissprot/
• a well-annotated protein sequence database established in 1986.
• It is maintained collaboratively by the Swiss Institute for Bioinformatics
(SIB) and the European Bioinformatics Institute (EBI).
• a curated protein sequence database that provides a high level of
annotation, a minimal level of redundancy and a high level of
integration with other databases.
Note: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot have been
incorporated into UniProt (the Universal Protein Resource), a one-stop
shop allowing easy access to all publicly available information about
protein sequences.
PROSITE
• http://ca.expasy.org/prosite/
• a method for determining the function of
uncharacterized proteins translated from genomic
or cDNA sequences.
• a database of biologically significant sites
• patterns are formulated so that, with appropriate
computational tools, one can rapidly and reliably identify
the known protein family (if any) to which a new sequence
belongs.
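Matching a pattern against a new sequence can be sketched as follows. This is a minimal illustration, not PROSITE's own software: it converts a common subset of the PROSITE pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for sequence ends) into a Python regular expression. The function names and the test sequence are hypothetical. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, with optional (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (common subset only) into a regex string."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):      # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):        # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        base, low, high = match.groups()
        if base == "x":
            token = "."                         # any residue
        elif base.startswith("{"):
            token = "[^" + base[1:-1] + "]"     # any residue except these
        else:
            token = base                        # single residue or [allowed set]
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return prefix + "".join(parts) + suffix


def find_sites(sequence, pattern):
    """Return (0-based start, matched text) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))         # N[^P][ST][^P]
    print(find_sites("MKNGSAANPTA", "N-{P}-[ST]-{P}."))  # [(2, 'NGSA')]
```

The lookahead in find_sites reports overlapping sites, which a plain left-to-right scan would skip.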
PDB
• http://www.rcsb.org/pdb/
• The single international repository for public data on the three-dimensional structures of biological macromolecules
• Established at Brookhaven National Laboratory in the United
States
• The contents are primarily experimental data derived from
X-ray crystallography and NMR experiments
• RasMol can display the 3D structure of a biological
macromolecule from a PDB file
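A viewer such as RasMol works from the atomic coordinates stored in a PDB file. As a rough sketch of what those data look like, the code below reads ATOM/HETATM records using the fixed-column layout of the PDB format and computes the centroid of the atoms. The function names are hypothetical, and the two sample records use made-up coordinates rather than values from any real entry.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record; return None for any other record type.

    Fields are read by column position, as the PDB format is fixed-width.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue sequence number
        "x": float(line[30:38]),          # columns 31-38: x coordinate (angstroms)
        "y": float(line[38:46]),          # columns 39-46: y coordinate
        "z": float(line[46:54]),          # columns 47-54: z coordinate
    }


def centroid(atoms):
    """Return the mean (x, y, z) position of a list of parsed atoms."""
    if not atoms:
        raise ValueError("no atoms given")
    n = len(atoms)
    return tuple(sum(atom[axis] for atom in atoms) / n for axis in ("x", "y", "z"))


# Two illustrative records with made-up coordinates.
SAMPLE = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
]

if __name__ == "__main__":
    atoms = [a for a in map(parse_atom_line, SAMPLE) if a is not None]
    for atom in atoms:
        print(atom["serial"], atom["name"], atom["res_name"], atom["chain"], atom["res_seq"])
    print(centroid(atoms))
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real PDB files can run together without a separating space.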