SINGAPORE’S R&D FRAMEWORK and the TECHNOLOGY …


Lecture 1:
Biological information database and data mining
•Biology as an information intensive science
•Typical databases
•Introduction to data mining
•Data mining in biology
Biology as an information intensive science
Organization of living systems:
Ecosystems=> Communities=> Populations => Organisms => Organ systems
=> Organs => Tissues => Cells => Molecules.
Ecosystem:
All living things in a particular area (such as an island) and all non-living,
physical components of the environment that affect living things (such as air,
soil, water, sunlight).
Community:
All living things in an ecosystem (such as all animals, plants, bacteria, fungi,
viruses etc. in a rain forest).
Population:
A group of interbreeding individuals of one species (such as all flying
squirrels in a rain forest).
Organism:
An individual living thing (such as one flying squirrel).
Organ system:
A group of related body components that perform a specific type of function
(such as the central nervous system, CNS).
Organ:
A functional unit of an organ system (such as the brain).
Biology as an information intensive science
Fundamental Theory:
Evolution:
Simple molecules => Organic molecules => RNA-based life systems => Single
cells => Multicellular organisms => Higher organisms
Molecular Basis of Life:
DNA (Genes) => RNAs => Proteins:
Structural organization
Chemical reaction, synthesis and destruction of molecules
Signal transduction
Transportation of molecules.
Regulation
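The DNA => RNA => protein flow above can be illustrated with a minimal Python sketch. The DNA sequence and the helper names (transcribe, translate) are made up for illustration, and the codon table is deliberately partial: it lists only the codons this toy example needs.

```python
# Partial standard genetic code: only the codons used in this toy example.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGAAATAA"  # hypothetical coding-strand sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # AUGGCUUGGAAAUAA
print(protein)  # MAWK
```

A real translation routine would use the full 64-codon table; the partial table here raises a KeyError for any codon it does not list.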
Biology as an information intensive science
Cell Organization and Function:
Structural organization
Chemical reaction, synthesis and destruction of molecules
Signal transduction
Transportation of molecules.
Regulation
Biology as an information intensive science
Information (Molecular Level):
DNA:
30,000 ~ 100,000 genes for human (many with unknown functions)
3 x 10^9 base pairs for human DNA (< 10% coding region)
Protein:
60,000 ~ 100,000 proteins for human.
Individual level:
sequence, 3D structure, molecular function.
Group level:
pathways, cellular location, collective function.
Classification:
Family: superfamily, family, subfamily (based on evolution and function)
Type: receptor, ion channel, enzyme, carrier, regulator, structure
Function:
Physiological function, diseases, therapeutics, toxicity, pharmacokinetics,
agriculture, plant, environmentally relevant.
Typical Databases
Category:
•General
•Sequence
•3D structure
•Protein function, proteomics, and pathways.
•Pharmainformatics
•Medical informatics and disease information
Reference:
Nucleic Acids Res. 30, 1-12 (2002).
Internet links:
http://www.cz3.nus.edu.sg/~yzchen/database.html
Typical Databases
General:
The National Center for Biotechnology Information
(NCBI). (http://www3.ncbi.nlm.nih.gov/)
Integrated ENTREZ retrieval software and databases for genetics, gene and
protein sequences, 3D structures, and on-line PubMed library. CAM
(Complementary and Alternative Medicine) on PubMed.
Pedro's BioMolecular Research Tools. A Collection of WWW Links to
Information and Services Useful to Molecular Biologists. Other mirror sites
in Germany, and Switzerland.
The CMS Molecular Biology Resource. This site is a compendium of
electronic and Internet-accessible tools and resources for
Molecular Biology, Biotechnology, Molecular Evolution, Biochemistry, and
Biomolecular Modeling. Other mirror sites in Japan, Canada, France,
Germany, Italy, and UK.
Typical Databases
Sequence:
•GenBank DataBase (GenBank). (http://www.ncbi.nih.gov/Genbank/)
The GenBank database contains and distributes publicly available DNA
sequences from more than 130,000 different organisms. It contains DNA
sequences, their derived protein sequences, and annotations describing
biological, structural, and other relevant features. It currently contains
27,213,748 loci and 33,865,022,251 bases from 27,213,748 reported sequences.
SWISS-PROT (http://us.expasy.org/sprot/)
Annotated protein sequence database. Information includes the
description of the function of a protein, its domain structure, post-translational modifications, variants, etc.
Release 42.0 of 10-Oct-2003 of Swiss-Prot contains 135850 sequence
entries, comprising 50046799 amino acids abstracted from 109694
references.
Typical Databases
Sequence-related knowledge databases:
Online Mendelian Inheritance in Man.
(http://www3.ncbi.nlm.nih.gov/omim/)
A database that catalogs human genes and genetic disorders. Located at
NCBI. It currently contains 14,831 entries.
Pfam: Protein families database of alignments and HMMs.
(http://www.sanger.ac.uk/Software/Pfam/ ). A large collection of multiple
sequence alignments and hidden Markov models covering many common
protein domains. In this way, proteins are grouped into domain-based
families. It currently covers 6190 families.
Typical Databases
Structure:
Protein Data Bank (PDB). (http://www.rcsb.org/pdb/ )
3D crystal and NMR structure of proteins, DNA, RNA and ligand-bound complexes.
Official mirror site in Singapore, and other mirror sites in China, Japan, Taiwan and
several places in the USA (Boston, North Carolina). It currently contains 22,874
structures.
Nucleic Acids Database (NDB). 3D crystal structure of DNA and RNA. Mirror sites
in UK, Japan, and other sites in USA: San Diego.
Typical Databases
Structure derived knowledge databases:
SCOP. Structural classification of proteins. Mirror sites in Singapore, China, the
U.S., and Japan.
CATH. Protein Structure Classification. A hierarchical domain classification of protein
structures in PDB.
MODBASE. A database of Comparative Protein Structure Models. Models were
generated by PSI-BLAST and MODELLER. As of Aug 2000, there are 3,379 reliable
models for domains in 2,220 proteins, and 5433 reliable fold assignments for
domains in 3,083 proteins.
Typical Databases
Function and pathways:
GeneCards. A database of human genes, their products and their involvement in diseases. It
offers concise information about the functions of all human genes that have an approved
symbol, as well as selected others [gene listing].
PROSITE. Protein families and domains. It consists of biologically significant sites, patterns
and profiles that help to reliably identify to which known protein family (if any) a new sequence
belongs. Mirror sites in Australia, Canada, China, Taiwan.
PRINTS. Protein fingerprint database. A fingerprint is a group of conserved motifs used to
characterise a protein family.
PROCAT. A database of 3D enzyme active site templates. It can be thought of as the 3D
equivalent of the 1D templates found in sequence motif databases such as PROSITE and
PRINTS.
KEGG: Kyoto Encyclopedia of Genes and Genomes. Site contains Pathway Info, Disease
Catalogs, Cell Catalogs, Molecule Catalog, and Genomic Info. It also provides Links to
Pathway and Other Databases.
SPAD: Signaling Pathway Database. An integrated database for genetic information and
signal transduction systems. Divided into four categories based on extracellular signal
molecules (Growth factor, Cytokine, and Hormone) and stress, that initiate the intracellular
signaling pathway.
Typical Databases
Pharmainformatics:
TTD: Therapeutic Target Database. A database to provide information about the
known and newly proposed therapeutic protein and nucleic acid targets, the targeted
disease, pathway information and the corresponding drugs/ligands directed at each
of these targets. Links to relevant databases also provided.
MedChem/Biobyte QSAR Database. A collection of 10,000 QSAR datasets
covering both biological and physical-organic chemistry.
The NCI Drug Information System 3D Database. A collection of 3D structures for
over 400,000 drugs, built and maintained by the Developmental
Therapeutics Program, Division of Cancer Treatment, National Cancer Institute. The
database is an extension of the NCI Drug Information System.
Drug Discovery Databases Compiled by The Biophysical Pharmacology Group
at NCI. Site has links to several therapeutics program databases and tools, and a
2D-Gel protein expression database.
Pharmaceutical Information Network. A comprehensive information database
about drugs and diseases.
U. S. Food and Drug Administration Center for Drug Evaluation and Research.
Introduction to Data Mining
Main Objective:
Pattern identification, Classification, Extraction of related data (character) set.
Tasks:
•Generation of association rules.
•Classification and clustering.
•Pre-processing and post-processing of relevant dataset.
General Procedure:
1. Understanding of application domain.
2. Data source identification and data selection.
3. Pre-processing: feature selection, discretization, data cleaning.
4. Data mining: pattern extraction and model building.
5. Post-processing: identification of interesting/useful/novel patterns/rules.
6. Incorporation of patterns in real world tasks.
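The pre-processing step of the procedure above (data cleaning and discretization) can be sketched in Python as follows. The records, the field names and the bin thresholds are all hypothetical, chosen only to show the two operations; they do not come from any database mentioned in this lecture.

```python
# Hypothetical raw records: protein name and molecular weight in kDa.
raw_records = [
    {"name": "P1", "weight_kda": 12.5},
    {"name": "P2", "weight_kda": None},  # missing value
    {"name": "P3", "weight_kda": 64.0},
    {"name": "P4", "weight_kda": 210.3},
]


def clean(records):
    """Data cleaning: drop records with a missing weight."""
    return [r for r in records if r["weight_kda"] is not None]


def discretize(weight_kda):
    """Discretization: map a numeric weight to a categorical bin.

    The thresholds (30 and 100 kDa) are arbitrary assumptions.
    """
    if weight_kda < 30:
        return "small"
    if weight_kda < 100:
        return "medium"
    return "large"


prepared = [
    {"name": r["name"], "size_class": discretize(r["weight_kda"])}
    for r in clean(raw_records)
]
print(prepared)
```

The categorical size_class values can then be treated as items by a rule-mining or classification step, which is why discretization usually precedes data mining.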
Introduction to Data Mining
Example:
Generation of association rules:
Record of customer purchases:
John: Jacket, Boots
Alfred: Milk, Cheese, Bread, Shoes
Green: Milk, Bread
Brown: Milk, Bread, Shoes, Greeting Cards, Pork
Eric: Cheese, Milk, Shoes, Beef
Bob: Jacket, Boots, Ski Pants
Form of association rules:
Item A => Item B [sup, conf]
sup = support = % of records containing both item A and B
conf = confidence = sup / (% of records containing item A)
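Support and confidence for the purchase records above can be computed with a short Python sketch. Confidence is taken here in its standard sense: the fraction of records containing item A that also contain item B, which equals sup divided by the % of records containing item A. The function name rule_stats is an illustrative choice.

```python
# Purchase records from the example above.
records = {
    "John": {"Jacket", "Boots"},
    "Alfred": {"Milk", "Cheese", "Bread", "Shoes"},
    "Green": {"Milk", "Bread"},
    "Brown": {"Milk", "Bread", "Shoes", "Greeting Cards", "Pork"},
    "Eric": {"Cheese", "Milk", "Shoes", "Beef"},
    "Bob": {"Jacket", "Boots", "Ski Pants"},
}


def rule_stats(item_a, item_b, baskets):
    """Return (support, confidence) for the rule item_a => item_b."""
    n_total = len(baskets)
    n_a = sum(1 for items in baskets.values() if item_a in items)
    n_both = sum(
        1 for items in baskets.values() if item_a in items and item_b in items
    )
    support = n_both / n_total
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


for a, b in [("Milk", "Bread"), ("Bread", "Milk"), ("Jacket", "Boots")]:
    sup, conf = rule_stats(a, b, records)
    print(f"{a} => {b} [sup={sup:.0%}, conf={conf:.0%}]")
# Milk => Bread [sup=50%, conf=75%]
# Bread => Milk [sup=50%, conf=100%]
# Jacket => Boots [sup=33%, conf=100%]
```

Milk => Bread and Bread => Milk share the same support but differ in confidence, which shows that confidence depends on the direction of the rule.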
Data Mining in Biology
Types of Tasks:
•Search for similar pattern in a subsection of each member of datasets (e.g. protein sequence motifs).
•Classification of datasets into groups (e.g. proteins into families).
•Search for a dataset matching given characteristics (e.g. alignment of a protein sequence against all entries in a protein sequence database).
•Extraction of particular information from literature (e.g. drugs that bind to a particular protein).
Proc. Natl. Acad. Sci. USA 95, 10710-10715 (1998)
Structure 7, 1099-1112 (1999)
Bioinformatics 17, 721-728 (2001)
Bioinformatics 17, 155-161 (2001); 17, 359-363 (2001)
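The first task type (searching for a shared pattern in a subsection of each sequence) can be sketched in Python with a regular expression. The pattern used is the PROSITE N-glycosylation site motif N-{P}-[ST]-{P}. The protein sequences and their names are invented for illustration, and find_motif is a hypothetical helper, not part of any database tool.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression:
# N, any residue except P, S or T, any residue except P.
# The lookahead lets overlapping sites be reported as well.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

# Invented toy sequences (one-letter amino acid codes).
sequences = {
    "seq1": "MKNGSAVLNPTQ",
    "seq2": "MALWNKTPLL",
    "seq3": "AANATCC",
}


def find_motif(sequence):
    """Return (1-based position, matched residues) for every motif hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


hits = {name: find_motif(seq) for name, seq in sequences.items()}
for name, found in hits.items():
    print(name, found)
# seq1 [(3, 'NGSA')]
# seq2 []
# seq3 [(3, 'NATC')]
```

seq2 contains N-K-T but is rejected because the fourth residue is P, which the motif excludes.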
Homework
1. Write a very short report about the database assigned to you.
2. Can you give at least two more examples for each type of task in biological data mining?
3. Read the reference about typical biological databases and get a broad picture of the current status of publicly accessible bioinformatics databases.
4. Read at least one of the references about data mining in biology and be prepared to give a brief description of the paper.