TY_BSC lectures - Dhananjay Bhole's Virtual Home

Dhananjay Bhole
Project Student,
Bioinformatics Centre
University of Pune,
Email: [email protected]
Contact no. 9850123212
Topics to be covered
 Bioinformatics definition
 History
 Scope
 Goal
 Importance and limitations.
 Computers in biology and medicine
 What is a computer
 Minicomputers and mainframe computers.
 Application of computers in biology.
 Database concept
 Biological databases:
 Type of biological databases, DNA databases, protein databases
 What is genomics and proteomics.
 Human genome project:
 History and importance
What is Bioinformatics
 Definition:
 Definition by The NIH Biomedical Information Science and
Technology Initiative Consortium
 Bioinformatics: Research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those to
acquire, store, organize, archive, analyze, or visualize such data.
 Computational Biology: The development and application of
data-analytical and theoretical methods, mathematical modeling
and computational simulation techniques to the study of
biological, behavioral, and social systems.
History
 A succinct chronology of landmark events in the development of
bioinformatics
 First major bioinformatics project:
 In 1965, Margaret Dayhoff developed the first protein sequence
database, called the Atlas of Protein Sequence and Structure.
 In the early 1970s, the Brookhaven National Laboratory
established the Protein Data Bank for archiving three-dimensional
protein structures.
 The first sequence alignment algorithm:
 Needleman and Wunsch in 1970
 The first protein structure prediction algorithm:
 Chou and Fasman in 1974
History cont…
 In the early 1980s, GenBank was established…
 Fast database searching algorithms were developed, such as
FASTA by William Pearson and BLAST by Stephen Altschul
and coworkers.
 The start of the human genome project in the late 1980s
provided a major boost for the development of bioinformatics.
 The development and the increasingly widespread use of the
Internet in the 1990s made instant access to, and exchange and
dissemination of, biological data possible.
History cont…
 The fundamental reason that bioinformatics gained prominence as a
discipline was the advancement of genome studies, which
produced unprecedented amounts of biological data.
 The explosion of genomic sequence information generated a
sudden demand for efficient computational tools to manage and
analyze the data.
 The development of computational tools depended on
knowledge generated from a wide range of disciplines including
mathematics, statistics, computer science, information
technology, and molecular biology.
 The merger of these disciplines created an information-oriented
field in biology, which is now known as bioinformatics.
Scope
 Bioinformatics consists of two subfields:
 The development of computational tools and databases.
 The application of these tools and databases in generating
biological knowledge to better understand living systems.
 The subfields are complementary to each other.
 Tool development includes writing software for sequence,
structural, and functional analysis, and the construction and
curation of biological databases.
Scope cont…
 These tools are used in three areas of genomic and molecular
biological research:
 Molecular sequence analysis
 Molecular structural analysis
 Molecular functional analysis.
 The analyses of biological data often generate new problems and
challenges that in turn spur the development of new and better
computational tools.
Scope cont…
 The areas of sequence analysis include sequence alignment,
sequence database searching, motif and pattern discovery, gene
and promoter finding, reconstruction of evolutionary
relationships, and genome assembly and comparison.
 Structural analyses include protein and nucleic acid structure
analysis, comparison, classification, and prediction.
 The functional analyses include gene expression profiling,
protein–protein interaction prediction, protein subcellular
localization prediction, metabolic pathway reconstruction, and
simulation
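As an illustration of the first area, sequence alignment, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b.

    Dynamic programming: score[i][j] holds the best score for aligning
    the first i letters of a with the first j letters of b.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            # Pair the two letters (match or mismatch) ...
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or put a gap in one of the two sequences.
            gap_in_b = score[i - 1][j] + gap
            gap_in_a = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]
```

With these assumed values, two identical sequences score their length, and each gap or mismatch lowers the score by one.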
Goal
 The ultimate goal of bioinformatics is to better understand a
living cell and how it functions at the molecular level.
 By analyzing raw molecular sequence and structural data,
bioinformatics research can generate new insights and provide a
“global” perspective of the cell.
 The functions of a cell can be better understood by analyzing
sequence data.
 Cellular functions are mainly performed by proteins whose
capabilities are ultimately determined by their sequences.
 Thus solving functional problems using sequence and
sometimes structural approaches has proved to be a fruitful
endeavor.
Applications
 Apart from molecular biology, bioinformatics is having a major impact
on many areas of biotechnology and biomedical sciences.
 It has applications, for example, in knowledge-based drug design,
forensic DNA analysis, and agricultural biotechnology.
 Computational studies of protein–ligand interactions provide a
rational basis for the rapid identification of novel leads for synthetic
drugs.
 Knowledge of the three-dimensional structures of proteins allows
molecules to be designed that are capable of binding to the receptor
site of a target protein with great affinity and specificity.
 Such an informatics-based approach significantly reduces the time and
cost necessary to develop drugs with higher potency, fewer side effects,
and less toxicity.
Applications cont…
 In forensics, results from molecular phylogenetic analysis have
been accepted as evidence in criminal courts.
 Example: Some sophisticated Bayesian statistics and
likelihood-based methods for analysis of DNA have been
applied in the analysis of forensic identity.
 It is worth mentioning that genomics and bioinformatics are now
poised to revolutionize our healthcare system by developing
personalized and customized medicine.
 Bioinformatics tools are being used in agriculture as well. Plant
genome databases and gene expression profile analyses played an
important role in the development of new crop varieties that
have higher productivity and more resistance to disease.
Limitations
 Bioinformatics depends on experimental science to produce raw data for
analysis.
 Bioinformatics predictions are not formal proofs of any concepts. They do not
replace the traditional experimental research methods of actually testing
hypotheses.
 The quality of bioinformatics predictions depends on the quality of data and
the sophistication of the algorithms being used.
 Bioinformatics is by no means a mature field. Most algorithms lack the
capability and sophistication to truly reflect reality. They often make incorrect
predictions that make no sense when placed in a biological context. Errors in
sequence alignment, for example, can affect the outcome of structural or
phylogenetic analysis.
 The outcome of computation also depends on the computing power available.
Many accurate but exhaustive algorithms cannot be used because of the slow
rate of computation. Instead, less accurate but faster algorithms have to be
used. This is a necessary trade-off between accuracy and computational
feasibility.
Computers in biology and medicine
 What is a computer?
 A computer is an automatic electronic device used to perform arithmetic and
logical operations.
 Types of computers:
 Microcomputers, minicomputers (workstations) and mainframe computers.
 Microcomputer: A common small computer used for personal purposes, e.g. personal
desktop or laptop computers.
 Minicomputers: Larger computers or workstations used for commercial
purposes, e.g. servers in a small computer lab.
 Minicomputer operating systems and architectures arose in the 1970s and 1980s, but
minicomputers are generally not considered mainframes.
 Mainframe computers:
 Mainframes (often colloquially referred to as Big Iron) are computers used mainly by
large organizations for critical applications, typically bulk data processing such as
census, industry and consumer statistics, ERP, and financial transaction processing.
 Most large-scale computer system architectures were firmly established in the 1960s.
Application of computers in biology
 To store vast, diverse, and complex life sciences data
 To have fast and easy accessibility of biological data
 To make biological information more understandable and
useful by using various visualization tools.
 To analyze biological data for addressing theoretical and
experimental questions in biology by using mathematical and
computational approaches.
Basic database concept
 Any form of information, whether on paper or in electronic form,
may be referred to as data.
 In computing, data is any electronic file, no matter what the format:
database data, text, images, audio and video. Everything read and
written by the computer can be considered data except for
instructions in a program that are executed (software).
 The term data is the plural of "datum," which is one item of data.
 Technically, data are raw facts and figures, such as orders and payments,
which are processed into information, such as balance due and
quantity on hand.
 A common misconception is that software is also data. Software
is executed, or run, by the computer. Data are "processed." Thus,
software causes the computer to process data.
Basic database concepts cont…
 What is a database?
 A database is a computerized archive used to store and organize data in
such a way that information can be retrieved easily via a variety of
search criteria.
 The chief objective of the development of a database is to organize data
in a set of structured records to enable easy retrieval of information.
 Database management systems (DBMS) are collections of tools used to
manage databases. Four basic functions performed by all DBMS are:
 Create, modify, and delete data structures, e.g. tables
 Add, modify, and delete data
 Retrieve data selectively
 Generate reports based on data
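The four functions can be sketched with Python's built-in sqlite3 module, which talks to the SQLite DBMS. The table name, column names, accession codes and values below are hypothetical, chosen only for illustration.

```python
import sqlite3


def demo_dbms_functions():
    """Walk through the four basic DBMS functions on a tiny in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # 1. Create and modify a data structure (a table).
    cur.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT)")
    cur.execute("ALTER TABLE protein ADD COLUMN length INTEGER")

    # 2. Add, modify, and delete data (all rows are made-up examples).
    cur.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            ("X0001", "example kinase", 350),
            ("X0002", "example protease", 120),
            ("X0003", "example transporter", 510),
        ],
    )
    cur.execute("UPDATE protein SET length = 125 WHERE accession = 'X0002'")
    cur.execute("DELETE FROM protein WHERE accession = 'X0003'")

    # 3. Retrieve data selectively.
    cur.execute("SELECT name FROM protein WHERE length > 200 ORDER BY name")
    long_proteins = [row[0] for row in cur.fetchall()]

    # 4. Generate a simple report: number of records and average length.
    cur.execute("SELECT COUNT(*), AVG(length) FROM protein")
    count, avg_length = cur.fetchone()

    conn.close()
    return long_proteins, count, avg_length
```

Calling demo_dbms_functions() returns the selectively retrieved names together with the two report figures.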
Database components
 Databases are composed of related tables, while tables are composed of
fields and records.
 Field: A field is an area (within a record) reserved for a specific piece of
data.
 Examples: customer number, customer name, street address, city,
state, phone, current balance.
 Fields are defined by:
 Field name
 Data type
 Character: text, including such things as telephone numbers and
zip codes
 Numeric: numbers which can be manipulated using math operators
 Date: calendar dates which can be manipulated mathematically
Database components cont…
 Size
 Amount of space reserved for storing data
 Record
 A record is the collection of values for all the fields pertaining to
one entity: i.e. a person, product, company, transaction, etc.
 Table
 A table is a collection of related records. For example: employee,
product, customer, and orders tables.
 In a table, records are represented by rows and fields are represented as
columns.
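The components above can be mirrored in a few lines of plain Python. This is only a sketch: the Field class, the make_record helper, the field sizes and the sample customers are all hypothetical, and "size" is approximated as a maximum number of characters.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Field:
    name: str        # field name
    data_type: type  # character (str), numeric (int/float), or date
    size: int        # space reserved; assumed here to be a character limit


# Hypothetical definition of a customer table.
CUSTOMER_FIELDS = [
    Field("customer_number", int, 6),
    Field("customer_name", str, 30),
    Field("city", str, 20),
    Field("current_balance", float, 10),
    Field("joined", date, 10),
]


def make_record(fields, values):
    """Build one record: one value per field, checked against the field definition."""
    if len(values) != len(fields):
        raise ValueError("a record needs exactly one value per field")
    record = {}
    for field, value in zip(fields, values):
        if not isinstance(value, field.data_type):
            raise TypeError(f"{field.name} expects {field.data_type.__name__}")
        if len(str(value)) > field.size:
            raise ValueError(f"{field.name} exceeds reserved size {field.size}")
        record[field.name] = value
    return record


# A table is a collection of related records: each list item is a row,
# each dictionary key is a column. The customers are made up.
customer_table = [
    make_record(CUSTOMER_FIELDS, [101, "Asha", "Pune", 250.0, date(2020, 1, 15)]),
    make_record(CUSTOMER_FIELDS, [102, "Ravi", "Mumbai", 0.0, date(2021, 6, 1)]),
]
```

A real DBMS stores the field definitions in its own catalogue and enforces them itself; the helper here only imitates that check.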
Relationships
 There are three types of relationships which can exist between tables:
 One-to-One
 One-to-Many
 Many-to-Many
 The most common relationships in relational databases are One-to-Many and
Many-to-Many.
 Key Fields: In order for two tables to be related, they must share a common
field. The common field (key field) in the "one" table of a One-to-Many
relationship needs to be a primary key. The same field in the "many" table of a
One-to-Many relationship is called the foreign key.
 Primary key: A Primary key is a field or a combination of two or more fields.
The value in the primary key field for each record uniquely identifies that
record.
 Foreign key: For the "many" records of the Order table, the foreign key
identifies with which unique record in the Customer table they are
associated.
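A One-to-Many relationship between a Customer table and an Order table can be sketched with Python's built-in sqlite3 module. The table is named "orders" here only because ORDER is a reserved word in SQL; the column names and sample rows are hypothetical.

```python
import sqlite3


def orders_per_customer():
    """Relate customers (the "one" side) to orders (the "many" side)."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this setting is switched on.
    conn.execute("PRAGMA foreign_keys = ON")

    # customer_id is the primary key: it uniquely identifies each customer.
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # orders.customer_id is the foreign key pointing back at one customer.
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
        " amount REAL)"
    )

    conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0)],
    )

    # An order that refers to a customer who does not exist is rejected.
    try:
        conn.execute("INSERT INTO orders VALUES (13, 999, 1.0)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    # Join the two tables on the shared key field.
    rows = conn.execute(
        "SELECT c.name, COUNT(o.order_id)"
        " FROM customer c JOIN orders o ON o.customer_id = c.customer_id"
        " GROUP BY c.customer_id ORDER BY c.customer_id"
    ).fetchall()
    conn.close()
    return rows, rejected
```

The join works because both tables share the customer_id field, which is exactly the role of the key field described above.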
Biological databases
 Need: As the volume of genomic data grows, sophisticated
computational methodologies are required to manage the data
deluge. The very first challenge in the genomics era is to store
and handle the staggering volume of information through the
establishment and use of computer databases.
 The development of databases to handle the vast amount of
molecular biological data is therefore a fundamental task of
bioinformatics.
Types of biological databases
 Overview:
 There are over 1,000 public and commercial biological
databases. These biological databases usually contain
genomics and proteomics data.
 Databases are also used in taxonomy.
 The data are nucleotide sequences of genes or amino acid
sequences of proteins.
 The databases also contain information about function,
structure, localisation on chromosomes, clinical effects of
mutations, and similarities between biological sequences.
Biological databases cont…
 Biological databases are generally of 3 types:
 Sequence databases
 Structure databases
 Functional databases.
 Further, the databases can be classified as DNA databases
and protein databases.
Biological databases cont…
 Most important public databases for molecular biology:
 Primary sequence databases
 Meta-databases
 Genome Browsers
 Specialized databases
 Expression, regulation & pathways databases
 Protein sequence databases
 Protein structure databases
 Microarray-databases
 Protein-Protein Interactions
 Reference:
Most important public databases for molecular biology from
http://www.kokocinski.net/bioinformatics/databases.php
DNA sequence databases
 Some well known DNA sequence databases:
 NCBI
 EMBL
 DDBJ.
 NCBI: National Center for Biotechnology Information, part of
the National Library of Medicine, National Institutes of Health, USA.
 Established in 1988 as a national resource for molecular biology
information, NCBI creates public databases, conducts research
in computational biology, develops software tools for analyzing
genome data, and disseminates biomedical information - all for
the better understanding of molecular processes affecting
human health and disease.
DNA sequence databases cont…
 EMBL:
European Molecular Biology Laboratory, headquartered in
Heidelberg, Germany. Its nucleotide sequence database is
maintained by the European Bioinformatics Institute
(EMBL-EBI) in Hinxton, UK.
 It also archives up-to-date and detailed information about
biological macromolecules such as nucleotide sequences
and protein sequences.
DNA sequence databases cont…
 DDBJ: (DNA Data Bank of Japan)
 Began DNA data bank activities in earnest in 1986 at the
National Institute of Genetics (NIG) with the endorsement of
the Ministry of Education, Science, Sport and Culture.
 It also provides, worldwide, many tools for data retrieval and
analysis developed at DDBJ and elsewhere.
DNA sequence databases cont…
 Database collaboration:
 NCBI, EMBL and DDBJ collaborate internationally, exchanging
data and information over the Internet and regularly holding
two meetings: the International DNA Data Banks Advisory
Meeting and the International DNA Data Banks Collaborative
Meeting.
 The three data banks share virtually the same data at any given
time.
Protein sequence databases
 Protein sequence databases:
 Swiss-Prot, TrEMBL.
 Swiss-Prot: It is a manually curated biological database of protein sequences.
 Swiss-Prot was created in 1986 by Amos Bairoch during his PhD and
developed by the Swiss Institute of Bioinformatics and the European
Bioinformatics Institute.
 Swiss-Prot strives to provide reliable protein sequences associated with a high
level of annotation, such as the description of the function of a protein, its
domain structure, post-translational modifications, variants, etc.
 The UniProt consortium is a collaboration between the Swiss
Institute of Bioinformatics, the European Bioinformatics Institute and the
Protein Information Resource (PIR). Together these protein databases produce the
UniProt Knowledgebase, the world's most comprehensive catalogue of
information on proteins.
Protein sequence databases cont…
 TrEMBL:
 Translated EMBL: a database of translations of the nucleotide
sequences held in the EMBL nucleotide sequence database.
 The database archives the same kind of information as
Swiss-Prot.
Genomics and proteomics
 What are a gene, a genome and genomics?
 Gene: A segment of DNA on a chromosome responsible for coding
one or more functional proteins.
 Genome: The genome is the gene complement of an organism. A
genome sequence comprises the information of the entire
genetic material of an organism.
 Genomics: The science that deals with the study of the entire genome,
including gene organization such as gene order, gene arrangement,
gene ontology, etc.
 The goal of Genomics is to determine the complete DNA
sequence for all the genetic material contained in an organism's
complete genome.
Structural genomics and functional genomics
 Structural genomics: the systematic effort to gain a complete
structural description of a defined set of molecules, ultimately for an
organism’s entire proteome.
 Structural genomics projects apply X-ray crystallography and NMR
spectroscopy in a high-throughput manner.
 It also applies bioinformatics, or in silico, approaches to solve structures of
nucleic acids and proteins.
 Functional genomics: aims at determining the function of the
proteome (the protein complement encoded by an organism's entire genome).
 It expands the scope of biological investigation from studying single
genes or proteins to studying all genes or proteins at once in a
systematic fashion.
 It uses large-scale experimental methodologies combined with statistical
analysis of the results.
Comparative genomics
 Comparative genomics: the analysis and comparison of genomes from different
species to gain a better understanding of how species have
evolved and to determine the function of genes and noncoding regions of the
genome.
 Genome researchers look at many different features when comparing genomes:
sequence similarity, gene location, the length and number of coding regions (called
exons) within genes, the amount of noncoding DNA in each genome, and
highly conserved regions maintained in organisms as simple as bacteria and as
complex as humans.
 Comparative genomics involves the use of computer programs that can line up
multiple genomes and look for regions of similarity among them, e.g. BLAST,
PHYLIP, etc.
 Comparative genomics is applied to study the phylogenetic relationships and
evolution of different organisms.
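Sequence similarity, the first feature listed above, is often summarised as percent identity. The sketch below assumes the two sequences have already been lined up (gaps written as "-") and that positions containing a gap are left out of the count; both the function name and that convention are assumptions, since tools differ in how they define identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned positions at which the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        # Assumed convention: skip positions where either sequence has a gap.
        if x == "-" or y == "-":
            continue
        compared += 1
        if x.upper() == y.upper():
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

For the made-up pair "ACGT-A" and "ACCTTA", five positions are compared and four agree, giving 80 percent identity.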
Proteome and proteomics
 Proteome: The proteome is the protein complement
expressed by a genome. While the genome is static,
the proteome continually changes in response to
external and internal events.
 Proteomics: The study of how the entire set of proteins
produced by a particular organism interact.
 It encompasses the identification and quantification of
proteins, and the effect of their modifications,
interactions, activities, and function, during disease
states and treatment.
Human genome project
 What was the Human Genome Project?
 The Human Genome Project (HGP) was the international,
collaborative research program whose goal was the complete mapping
and understanding of all the genes of human beings.
 The HGP was the natural culmination of the history of genetics.
 In the HGP, researchers have deciphered the human genome in three major
ways:
1. determining the order, or "sequence," of all the bases in our genome's
DNA;
2. making maps that show the locations of genes for major sections of all
our chromosomes;
3. producing what are called linkage maps, complex versions of the
type originated in early Drosophila research, through which inherited
traits (such as those for genetic disease) can be tracked over
generations.
Human genome project cont…
 The HGP has revealed that there are probably somewhere between
30,000 and 40,000 human genes. The completed human sequence can
now identify their locations.
 This ultimate product of the HGP has given the world a resource of
detailed information about the structure, organization and function of
the complete set of human genes.
 The International Human Genome Sequencing Consortium published
the first draft of the human genome in the journal Nature in February
2001 with the sequence of the entire genome's three billion base pairs
some 90 percent complete.
 The full sequence was completed and published in April 2003.
 Another major component of the HGP, and an ongoing component of the
National Human Genome Research Institute (NHGRI), is devoted to the
analysis of the ethical, legal and social implications (ELSI) of our
newfound genetic knowledge, and the subsequent development of policy
options for public consideration.
Techniques in HGP
 The tools created through the HGP also help to characterize the entire
genomes of several other organisms used extensively in biological
research, such as mice, fruit flies and flatworms. These efforts support
each other, because most organisms have many similar, or
"homologous," genes with similar functions.
 These techniques include:
 DNA Sequencing
 The Employment of Restriction Fragment-Length Polymorphisms (RFLP)
 Yeast Artificial Chromosomes (YAC)
 Bacterial Artificial Chromosomes (BAC)
 The Polymerase Chain Reaction (PCR)
 Electrophoresis