Lect 6 - BIDD
SMA5422: Special Topics in Biotechnology
Chen Yu Zong
Department of Computational Science , NUS
Office: Blk SOC1 Room 07-24
Tel.: 65-874-6877. E-mail: [email protected]
Topics in 2nd Part:
• Biological Information and Tools.
• Molecular Modeling Technology and Applications.
• Computer-aided drug design.
Schedule
• Lecture 6 (Feb 14): Biological information database and data mining.
• Lecture 7 (Feb 21): Gene and protein sequence alignment methods.
• Lecture 8 (Feb 26): Machine learning techniques in sequence analysis.
• Lecture 9 (Mar 5): Computer modeling of biomolecules: Structure, motion, and binding.
• Lecture 10 (Mar 7): Computer aided drug design: structure-based approach.
• Lecture 11 (Mar 12): Computer aided drug design: QSAR approach.
Lecture 6:
Biological information database and data mining
• Biology as an information intensive science
• Typical databases
• Introduction to data mining
• Data mining in biology
Biology as an information intensive science
Organization of living systems:
Ecosystems => Communities => Populations => Organisms => Organ systems
=> Organs => Tissues => Cells => Molecules.
Ecosystem:
All living things in a particular area (such as an island) and all non-living,
physical components of the environment that affect living things (such as air,
soil, water, sunlight).
Community:
All living things in an ecosystem (such as all animals, plants, bacteria, fungi,
viruses etc. in a rain forest).
Population:
A group of interbreeding individuals of one species (such as all flying
squirrels in a rain forest).
Organism:
An individual living thing (such as one flying squirrel).
Organ system:
A group of related body components that perform a specific type of function
(such as the central nervous system, CNS).
Organ:
A functional unit of an organ system (such as the brain).
Biology as an information intensive science
Fundamental Theory:
Evolution:
Simple molecules => Organic molecules => RNA-based life systems => Single
cells => Multicellular organisms => Higher organisms
Molecular Basis of Life:
DNA (Genes) => RNAs => Proteins:
Structural organization
Chemical reaction, synthesis and destruction of molecules
Signal transduction
Transportation of molecules.
Regulation
Biology as an information intensive science
Cell Organization and Function:
Structural organization
Chemical reaction, synthesis and destruction of molecules
Signal transduction
Transportation of molecules
Regulation
Biology as an information intensive science
Information (Molecular Level):
DNA:
30,000 ~ 100,000 genes for human (many with unknown functions)
3 x 10^9 base pairs for human DNA (< 10% coding region)
Protein:
60,000 ~ 100,000 proteins for human.
Individual level:
sequence, 3D structure, molecular function.
Group level:
pathways, cellular location, collective function.
Classification:
Family: superfamily, family, subfamily (based on evolution and function)
Type: receptor, ion channel, enzyme, carrier, regulator, structure
Function:
Physiological function, diseases, therapeutics, toxicity, pharmacokinetics,
agriculture, plant, environmentally relevant.
Typical Databases
Category:
•General
•Sequence
•3D structure
•Protein function, proteomics, and pathways.
•Pharmainformatics
•Medical informatics and disease information
Reference:
Nucleic Acids Res. 30, 1-12 (2002).
Internet links:
http://www.cz3.nus.edu.sg/~yzchen/database.html
Typical Databases
General:
The National Center for Biotechnology Information (NCBI). Integrated
ENTREZ retrieval software and databases for genetics, gene and protein
sequences, 3D structures, and on-line PubMed library. CAM (Complementary
and Alternative Medicine) on PubMed.
Pedro's BioMolecular Research Tools. A collection of WWW links to
information and services useful to molecular biologists. Mirror sites
in Germany and Switzerland.
The CMS Molecular Biology Resource. This site is a compendium of
electronic and Internet-accessible tools and resources for
Molecular Biology, Biotechnology, Molecular Evolution, Biochemistry, and
Biomolecular Modeling. Other mirror sites in Japan, Canada, France,
Germany, Italy, and UK.
Typical Databases
Sequence:
The Genome Data Base (GDB). Database for genes of human and other
species. Located at Johns Hopkins University School of Medicine. Mirror
site in Japan.
Genome Sequence DataBase. Located at the National Center for Genome
Resources (NCGR) in Santa Fe. Site has info on the Human Genome Project,
genetics and public issues, education and references.
SWISS-PROT. Annotated protein sequence database.
Online Mendelian Inheritance in Man (OMIM). A database that catalogs human
genes and genetic disorders. Located at NCBI.
Pfam: Protein families database of alignments and HMMs. A large
collection of multiple sequence alignments and hidden Markov models
covering many common protein domains.
Typical Databases
Structure:
Protein Data Bank (PDB). 3D crystal and NMR structures of proteins, DNA, RNA
and ligand-bound complexes. Official mirror site in Singapore, and others in
China, Japan, Taiwan and several places in the USA (Boston, North Carolina).
Nucleic Acid Database (NDB). 3D crystal structures of DNA and RNA. Mirror sites
in the UK and Japan, and another site in the USA (San Diego).
SCOP. Structural classification of proteins. Mirror sites in Singapore, China, the
U.S., and Japan.
CATH. Protein Structure Classification. A hierarchical domain classification of protein
structures in PDB.
MODBASE. A database of comparative protein structure models. Models were
generated by PSI-BLAST and MODELLER. As of Aug 2000, there are 3,379 reliable
models for domains in 2,220 proteins, and 5,433 reliable fold assignments for
domains in 3,083 proteins.
Typical Databases
Function and pathways:
GeneCards. A database of human genes, their products and their involvement in diseases. It
offers concise information about the functions of all human genes that have an approved
symbol, as well as selected others [gene listing].
PROSITE. Protein families and domains. It consists of biologically significant sites, patterns
and profiles that help to reliably identify to which known protein family (if any) a new sequence
belongs. Mirror sites in Australia, Canada, China, Taiwan.
PRINTS. Protein fingerprint database. A fingerprint is a group of conserved motifs used to
characterise a protein family.
PROCAT. A database of 3D enzyme active site templates. It can be thought of as the 3D
equivalent of the 1D templates found in sequence motif databases such as PROSITE and
PRINTS.
KEGG: Kyoto Encyclopedia of Genes and Genomes. Site contains Pathway Info, Disease
Catalogs, Cell Catalogs, Molecule Catalog, and Genomic Info. It also provides Links to
Pathway and Other Databases.
SPAD: Signaling Pathway Database. An integrated database for genetic information and
signal transduction systems. Divided into four categories based on extracellular signal
molecules (Growth factor, Cytokine, and Hormone) and stress, that initiate the intracellular
signaling pathway.
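Sequence patterns of the kind stored in PROSITE can be applied to a new sequence with an ordinary regular expression engine. The following is a minimal sketch, assuming a simplified subset of the PROSITE pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for the sequence ends). The helper name `prosite_to_regex`, the zinc-finger-style pattern, and the test sequence are illustrative assumptions, not entries taken from the database.

```python
import re

# One PROSITE-style element: optional "<", a residue spec, optional repeat, optional ">".
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":                  # x = any residue
            core = "."
        elif core.startswith("{"):       # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if lo:                           # (n) or (n,m) repeat counts
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

# Illustrative pattern in PROSITE-style syntax (assumed, zinc-finger-like).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)

# Hypothetical sequence constructed so that it contains the motif.
sequence = "AACPECGKAFSRKSNLTRHQRTHTG"
match = re.search(regex, sequence)
if match:
    print("motif found at", match.start(), "-", match.end(), ":", match.group())
```

A real PROSITE scan also uses profiles and further syntax details that this sketch leaves out.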
Typical Databases
Pharmainformatics:
TTD: Therapeutic Target Database. A database to provide information about the
known and newly proposed therapeutic protein and nucleic acid targets, the targeted
disease, pathway information and the corresponding drugs/ligands directed at each
of these targets. Links to relevant databases also provided.
MedChem/Biobyte QSAR Database. A collection of 10,000 QSAR datasets
that covers both biological and physical-organic chemistry.
The NCI Drug Information System 3D Database. A collection of 3D structures for
over 400,000 drugs, built and maintained by the Developmental
Therapeutics Program, Division of Cancer Treatment, National Cancer Institute. The
database is an extension of the NCI Drug Information System.
Drug Discovery Databases Compiled by The Biophysical Pharmacology Group
at NCI. Site has links to several therapeutics program databases and tools, and a
2D-Gel protein expression database.
Pharmaceutical Information Network. A comprehensive information database
about drugs and diseases.
U. S. Food and Drug Administration Center for Drug Evaluation and Research.
Introduction to Data Mining
Main Objective:
Pattern identification, Classification, Extraction of related data (character) set.
Tasks:
• Generation of association rules.
• Classification and clustering.
• Pre-processing and post-processing of relevant dataset.
General Procedure:
1. Understanding of application domain.
2. Data source identification and data selection.
3. Pre-processing: feature selection, discretization, data cleaning.
4. Data mining: pattern extraction and model building.
5. Post-processing: identification of interesting/useful/novel patterns/rules.
6. Incorporation of patterns in real world tasks.
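Steps 3 to 5 of the general procedure can be sketched on a toy dataset. The records, the length cutoff of 300, and the function names below are hypothetical and chosen only for illustration: cleaning drops incomplete and duplicate records, discretization turns a numeric feature into a category, mining counts co-occurring (category, label) pairs, and post-processing keeps only the pairs seen often enough.

```python
from collections import Counter

# Hypothetical toy records: (protein_id, sequence_length, family_label).
raw_records = [
    ("P1", 120, "kinase"),
    ("P2", None, "kinase"),     # missing value, dropped during cleaning
    ("P3", 450, "receptor"),
    ("P4", 510, "receptor"),
    ("P5", 135, "kinase"),
    ("P5", 135, "kinase"),      # exact duplicate, dropped during cleaning
    ("P6", 140, "receptor"),
]

def clean(rows):
    """Pre-processing: remove rows with missing values and exact duplicates."""
    seen, kept = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        kept.append(row)
    return kept

def discretize(length, cutoff=300):
    """Pre-processing: map a numeric length to a category (cutoff is an assumption)."""
    return "short" if length < cutoff else "long"

def mine(rows):
    """Data mining: count how often each (length class, family) pair occurs."""
    return Counter((discretize(length), family) for _, length, family in rows)

def select_patterns(counts, min_count=2):
    """Post-processing: keep only pairs that occur at least min_count times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

cleaned = clean(raw_records)
patterns = select_patterns(mine(cleaned))
print(patterns)
```

In practice each step is far more involved; the point is only the order of the stages and what each one hands to the next.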
Introduction to Data Mining
Example:
Generation of association rules:
Record of customer purchases:
John: Jacket, Boots
Alfred: Milk, Cheese, Bread, Shoes
Green: Milk, Bread
Brown: Milk, Bread, Shoes, Greeting Cards, Pork
Eric: Cheese, Milk, Shoes, Beef
Bob: Jacket, Boots, Ski Pants
Form of association rules:
Item A => Item B [sup, conf]
sup = support = % of records containing both item A and B
conf = confidence = sup / (% of records containing item A)
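The support and confidence of rules over the purchase records above can be computed directly. This is a minimal sketch, using the standard definition in which the confidence of Item A => Item B is the support of A and B together divided by the fraction of records containing A. The function names and the thresholds `min_sup` and `min_conf` are assumptions made for illustration.

```python
from itertools import permutations

# Customer purchase records from the example above.
records = {
    "John": {"Jacket", "Boots"},
    "Alfred": {"Milk", "Cheese", "Bread", "Shoes"},
    "Green": {"Milk", "Bread"},
    "Brown": {"Milk", "Bread", "Shoes", "Greeting Cards", "Pork"},
    "Eric": {"Cheese", "Milk", "Shoes", "Beef"},
    "Bob": {"Jacket", "Boots", "Ski Pants"},
}

def support(items, records):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in records.values() if items <= basket)
    return hits / len(records)

def confidence(a, b, records):
    """Confidence of the rule a => b: support(a and b) / support(a)."""
    return support({a, b}, records) / support({a}, records)

def mine_rules(records, min_sup=0.3, min_conf=0.7):
    """Return all single-item rules (a, b, sup, conf) that pass both thresholds."""
    all_items = sorted(set().union(*records.values()))
    rules = []
    for a, b in permutations(all_items, 2):
        sup = support({a, b}, records)
        if sup < min_sup:
            continue
        conf = confidence(a, b, records)
        if conf >= min_conf:
            rules.append((a, b, sup, conf))
    return rules

for a, b, sup, conf in mine_rules(records):
    print(f"{a} => {b} [sup={sup:.0%}, conf={conf:.0%}]")
```

For example, Milk and Bread appear together in 3 of the 6 records (sup = 50%), and Milk appears in 4 records, so Milk => Bread has conf = 75%, while Bread => Milk has conf = 100%.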
Data Mining in Biology
Types of Tasks:
• Search for a similar pattern in a subsection of each member of a dataset (e.g.
protein sequence motifs).
• Classification of datasets into groups (e.g. proteins into families).
• Search for a dataset matching given characteristics (e.g. alignment of a
protein sequence against all entries in a protein sequence database).
• Extraction of particular information from literature (e.g. drugs that bind to a
particular protein).
Proc. Natl. Acad. Sci. USA 95, 10710-10715 (1998)
Structure 7, 1099-1112 (1999)
Bioinformatics 17, 721-728 (2001)
Bioinformatics 17, 155-161 (2001); 17, 359-363 (2001)
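The third type of task, searching a database for entries that match a query, can be illustrated with a very simple stand-in for sequence alignment. The sketch below ranks database entries by the Jaccard similarity of their k-mer (short subsequence) sets. This is an alignment-free simplification chosen for brevity, not the alignment methods used by real sequence search tools, and the sequences and entry names are hypothetical.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search(query, database, k=3, top=3):
    """Rank database entries by k-mer similarity to the query, best first."""
    q = kmers(query, k)
    scored = [(jaccard(q, kmers(seq, k)), name) for name, seq in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top]]

# Hypothetical toy database of protein sequences.
database = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYIAKQL",
    "protC": "GGGSSSPPPL",
}

results = search("MKTAYIAKQR", database)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

Here the identical entry scores 1.0, the entry differing in one residue scores lower, and the unrelated entry scores 0.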
Homework
1. Write a very short report about a database assigned to you.
2. Can you give at least two more examples of each type of task in
biological data mining?
3. Read the reference about typical biological databases and get a broad
picture of the current status of publicly accessible bioinformatics
databases.
4. Read at least one of the references about data mining in biology and be
prepared to give a brief description of the paper.