Part 7 Collecting and Storing
Sequences in Lab
1
What is Bioinformatics?
2
NIH – definitions
What is Bioinformatics? - Research, development,
and application of computational tools and
approaches for expanding the use of biological,
medical, behavioral, and health data, including the
means to acquire, store, organize, archive, analyze,
or visualize such data.
What is Computational Biology? - The
development and application of analytical and
theoretical methods, mathematical modeling and
computational simulation techniques to the study of
biological, behavioral, and social data.
3
NSF – introduction
Large databases that can be accessed and analyzed with
sophisticated tools have become central to biological
research and education. The information content in the
genomes of organisms, in the molecular dynamics of
proteins, and in population dynamics, to name but a few
areas, is enormous. Biologists are increasingly finding that
the management of complex data sets is becoming a
bottleneck for scientific advances. Therefore,
bioinformatics is rapidly becoming a key technology in all
fields of biology.
4
NSF – mission statement
The present bottlenecks in bioinformatics include the education of
biologists in the use of advanced computing tools, the recruitment
of computer scientists into this evolving field, the limited
availability of developed databases of biological information, and
the need for more efficient and intelligent search engines for
complex databases.
5
6
Molecular Bioinformatics
Molecular Bioinformatics involves the use
of computational tools to discover new
information in complex data sets (from the
one-dimensional information of DNA through
the two-dimensional information of RNA and
the three-dimensional information of proteins,
to the four-dimensional information of
evolving living systems).
7
From DNA to Genome
[Timeline figure, axis 1955 to 2000. Milestones shown: Watson and Crick
DNA model; Sanger sequences insulin protein; Dayhoff’s Atlas; ARPANET
(early Internet); sequence alignment; PDB (Protein Data Bank); Sanger
dideoxy DNA sequencing; GenBank database; PCR (Polymerase Chain
Reaction); SWISS-PROT database; NCBI; FASTA; BLAST; Human Genome
Initiative; EBI; World Wide Web; first bacterial genome; yeast genome;
first human genome draft.]
9
Origin of bioinformatics and
biological databases:
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
Nearly a decade later, the first nucleic acid
sequence was reported, that of yeast
alanine tRNA, with 77 bases.
10
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
The Protein DataBank followed in 1972 with a
collection of ten X-ray crystallographic protein
structures. The SWISS-PROT protein sequence
database began in 1987.
11
Nucleotides
12
Complete Genomes
Date            Total   Eukaryotes  Bacteria  Archaea
1994            0
1995            1
December 2006   376     22          327       27
13
What can we do with sequences and other types of molecular information?
14
Open reading frames
Annotation
Functional sites
Structure, function
15
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
16
TF binding site
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
.............. TGAAAAACGTA
ORF = Open Reading Frame
CDS = Coding Sequence
Transcription
Start Site
promoter
Ribosome binding Site
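The idea of an open reading frame can be sketched in a few lines of Python. This is a minimal illustration, not the method used to annotate the sequence on the slide: it assumes the standard stop codons (TAA, TAG, TGA), scans only the forward strand, and treats the first ATG in each frame as the start. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons as well.
    """
    seq = "".join(dna.upper().split())  # drop spaces and line breaks
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs

# Toy sequence (invented): ATG AAA TTT GGG TAA sits in frame 2.
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

Real gene finders also scan the reverse strand and use signals such as the promoter and ribosome binding site shown above; this sketch ignores them.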
17
Comparing ORFs
Identifying orthologs
Comparative
genomics
Inferences on structure
and function
Comparing functional sites
Inferences on regulatory
networks
18
Similarity profiles
Researchers can learn a great deal about the structure and
function of human genes by examining their counterparts in
model organisms.
19
Alignment preproinsulin
Xenopus  MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos      MALWTRLRPLLALLALWPPPPARAFVNQHL
         **** : * *.*: *:..* :. *:****

Xenopus  CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos      CGSHLVEALYLVCGERGFFYTPKARREVEG
         ***************:***** ** :*::*

Xenopus  AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos      PQVG---ALELAGGPGAGGLEGPPQKRGIV
         .**.     ** *       *    *****

Xenopus  EQCCHSTCSLFQLENYCN
Bos      EQCCASVCSLYQLENYCN
         **** *.***:*******
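One simple number that can be read off a pairwise alignment like this one is percent identity. The sketch below is an illustration under stated assumptions: gaps are written as "-", columns with a gap in either sequence are skipped, and the function name `percent_identity` is hypothetical. The two strings are the last block of the preproinsulin alignment above.

```python
def percent_identity(a, b):
    """Percent of ungapped aligned columns in which both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

xenopus = "EQCCHSTCSLFQLENYCN"
bos = "EQCCASVCSLYQLENYCN"
print(f"{percent_identity(xenopus, bos):.1f}% identity")
```

Other conventions exist (for example, dividing by the full alignment length including gaps), so the value depends on the definition chosen.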
20
21
Ultraconserved Elements in the
Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart
Stephen, W. James Kent, John S. Mattick, & David Haussler
(Science 2004. 304:1321-1325)
There are 481 segments longer than 200 base pairs (bp) that
are absolutely conserved (100% identity with no insertions or
deletions) between orthologous regions of the human, rat, and
mouse genomes. Nearly all of these segments are also
conserved in the chicken and dog genomes, with an average of
95 and 99% identity, respectively. Many are also significantly
conserved in fish. These ultraconserved elements of the human
genome are most often located either overlapping exons in
genes involved in RNA processing or in introns or nearby genes
involved in the regulation of transcription and development.
There are 156 intergenic, untranscribed,
ultraconserved segments.
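The notion of "absolutely conserved" segments can be illustrated on a toy multiple alignment. This is only a sketch of the definition quoted above (100% identity, no insertions or deletions), not the whole-genome procedure used in the paper. The function name `conserved_runs`, the `min_len` parameter, and the three short sequences are invented for illustration; the paper's threshold is 200 bp.

```python
def conserved_runs(aligned, min_len=200):
    """Return half-open (start, end) column ranges where every sequence
    has the same base and none has a gap, keeping runs of at least min_len."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have the same length")
    runs, start = [], None
    for col in range(length + 1):  # one extra step closes a run at the end
        same = (
            col < length
            and aligned[0][col] != "-"
            and all(s[col] == aligned[0][col] for s in aligned)
        )
        if same and start is None:
            start = col
        elif not same and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs

# Invented toy alignment; real input would be aligned orthologous regions.
human = "ACGTACGTTT"
mouse = "ACGTACGATT"
rat   = "ACGTAC-TTT"
print(conserved_runs([human, mouse, rat], min_len=3))
```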
22
Junk:
Supporting evidence
Junk is real!
23
Genome-wide profiling of:
• mRNA levels
• Protein levels
Functional
genomics
Co-expression of genes
and/or proteins
Identifying protein-protein
interactions
Networks of interactions
24
Understanding the function of genes and other
parts of the genome
25
Structural
genomics
Assign structure to all
proteins encoded in
a genome
26
Structural Genomics
Currently: 27761 structures,
~300 unique folds in PDB
27
Structural Genomics
Estimate: 1000-3000 unique folds
in “structure space”
28
Origin of tools
Immediately after the establishment of the
first databases, tools became available to
search them: at first in a very simple
manner, looking for keyword matches and
short sequence words, and then in a more
sophisticated manner, using pattern
matching, alignment-based methods, and
machine learning techniques.
29
Despite the huge explosion in the number
and length of sequences, the tools used for
storage, retrieval, analysis, and
dissemination of data in bioinformatics are
very similar to those from 15-20 years ago.
30
Biological
databases
Database or databank?
Initially
• Databank (in UK)
• Database (in the USA)
Solution
• The abbreviation db
What is a database?
• A collection of data
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Includes also associated tools (software) necessary
for access, updating, information insertion,
information deletion….
• Data storage management: flat files, relational
databases…
Database: a « flat file » example
Flat-file database (« flat file, 3 entries »):
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002
//
• Easy to manage: all the entries are visible at the same time !
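A flat file like this one is easy to read by eye, and also easy to parse. The sketch below is one possible reader, written under assumptions that are not stated on the slide: each field is a "Name: value" line, "//" ends an entry, and field names are lowercased so that "Last Name" and "Last name" are treated alike. The function names are hypothetical.

```python
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002
//
"""

def parse_flat_file(text):
    """Split a flat file into a list of {field name: value} dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":  # end-of-entry marker
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip().lower()] = value.strip()
    if current:  # tolerate a missing final "//"
        records.append(current)
    return records

def courses(record):
    """Return the list of courses in one entry."""
    return [c.strip() for c in record["course"].split(";") if c.strip()]

records = parse_flat_file(FLAT_FILE)
for r in records:
    print(r["accession number"], r["last name"], courses(r))
```

Note that answering "who teaches Ballet 2001?" requires scanning every entry, which is one motivation for the relational layout.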
Database: a « relational » example
Relational database (« table file »):
Teacher   Accession number   Education
Amos      1                  Biochemistry
Dan       2                  Genetics
John      3                  Scientology

Course                  Year         Involved teachers
Advanced Pottery        2000; 2001   1; 2
Ballet for Fat People   2001; 2002   2; 3
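The same data can be loaded into an actual relational database to show what the tables buy you. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names are invented, and the extra linking table `teaches` is an assumption: it replaces the "1; 2" style lists in the slide's Course table with one row per teacher, course, and year, filled in from the flat-file example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # nothing is written to disk
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE teaches (course_id INTEGER REFERENCES course(id),
                      teacher_accession INTEGER REFERENCES teacher(accession),
                      year INTEGER);
""")
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Dan", "Genetics"),
    (3, "John", "Scientology"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    (1, "Advanced Pottery"),
    (2, "Ballet for Fat People"),
])
conn.executemany("INSERT INTO teaches VALUES (?, ?, ?)", [
    (1, 1, 2000), (1, 1, 2001), (1, 2, 2000), (1, 2, 2001),
    (2, 2, 2001), (2, 2, 2002), (2, 3, 2001), (2, 3, 2002),
])

def teachers_for(title, year):
    """Names of the teachers involved in a given course in a given year."""
    rows = conn.execute(
        """SELECT t.name
           FROM teacher AS t
           JOIN teaches AS x ON x.teacher_accession = t.accession
           JOIN course  AS c ON c.id = x.course_id
           WHERE c.title = ? AND x.year = ?
           ORDER BY t.name""",
        (title, year),
    ).fetchall()
    return [name for (name,) in rows]

print(teachers_for("Ballet for Fat People", 2002))
```

The accession number plays the role of the key that links the tables, which is the cross-referencing idea from the "What is a database?" slide.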
Why biological databases?
• Exponential growth in biological data.
• Data (genomic sequences, 3D structures,
2D gel analysis, MS analysis,
Microarrays….) are no longer published in
a conventional manner, but directly
submitted to databases.
• Essential tools for biological research.
Distribution of sequences
• Books, articles     1968 -> 1985
• Computer tapes      1982 -> 1992
• Floppy disks        1984 -> 1990
• CD-ROM              1989 ->
• FTP                 1989 ->
• On-line services    1982 -> 1994
• WWW                 1993 ->
• DVD                 2001 ->
Some statistics
• More than 1000 different ‘biological’ databases
• Variable size: <100Kb to >20Gb
– DNA: > 20 Gb
– Protein: 1 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually to seldom to forget
about it.
• Usually accessible through the web (some free, some not)
Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb,
BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB,
BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE,
CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE,
SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc. ... !!!!
Categories of databases for Life
Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• Expression (Microarrays,…)
• Specialized
Resources
NCBI (National Center for Biotechnology Information) is a
resource for molecular biology information. NCBI creates and
maintains public databases, conducts research in computational
biology, develops software tools for analyzing genome data, and
disseminates biomedical information. The NCBI site is constantly
being updated and some of the changes include new databases
and tools for data mining.
NCBI offers several searchable literature, molecular and genomic
databases and many bioinformatic tools. An up-to-date list of
databases and tools can be found on the NCBI Sitemap.
41
Literature Databases:
Bookshelf: A collection of searchable biomedical books linked to
PubMed.
PubMed: Supports searching by author name and journal title, and
offers a Preview/Index option. The PubMed database provides access to
over 12 million MEDLINE citations dating back to the mid-1960s. Its
History and Clipboard options can help you manage a
search session.
PubMed Central: The U.S. National Library of Medicine digital
archive of life science journal literature.
OMIM: Online Mendelian Inheritance in Man is a database of
human genes and genetic disorders (also OMIA).
42
GenBank:
http://www.ncbi.nlm.nih.gov/Genbank/
EBI:
http://www.ebi.ac.uk/
DDBJ:
http://www.ddbj.nig.ac.jp/
43
Type in a Query term
• Enter your search words in the
query box and hit the “Go” button
44
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
The Syntax …
1. Boolean operators: AND, OR, NOT must be entered in
UPPERCASE (e.g., promoters OR response elements). The default
is AND.
2. Entrez processes all Boolean operators in a left-to-right sequence.
The order in which Entrez processes a search statement can be
changed by enclosing individual concepts in parentheses. The terms
inside the parentheses are processed first, for example:
g1p3 OR (response AND element AND promoter).
3. Quotation marks: The term inside the quotation marks is read as one
phrase (e.g., “public health” is different from public health, which will
also include articles on public latrines and their effect on health
workers).
4. Asterisk: Extends the search to all terms that start with the letters
before the asterisk. For example, dia* will include such terms as
diaphragm, dial, and diameter.
45
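The syntax rules above can also be applied when a query is assembled by a program rather than typed into the query box. The sketch below only builds strings and makes no network request. It assumes that the query will be sent to NCBI's E-utilities ESearch endpoint, which is not mentioned on the slides; the helper names (`phrase`, `all_of`, `any_of`, `esearch_url`) are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def phrase(text):
    """Rule 3: quotation marks make the words one phrase."""
    return f'"{text}"'

def all_of(*terms):
    """Rules 1 and 2: uppercase AND, grouped with parentheses."""
    return "(" + " AND ".join(terms) + ")"

def any_of(*terms):
    """Rules 1 and 2: uppercase OR, grouped with parentheses."""
    return "(" + " OR ".join(terms) + ")"

def esearch_url(term, db="pubmed", retmax=20):
    """Return a URL-encoded ESearch request for the given query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

# The example from rule 2, plus a phrase and a wildcard (rule 4).
query = any_of("g1p3", all_of("response", "element", "promoter"))
print(query)
print(esearch_url(query))
print(esearch_url(all_of(phrase("public health"), "dia*")))
```

Explicit parentheses are added around every group so that the result does not depend on the left-to-right processing order.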
46
Refine the Query
• Often a search finds too many (or too few) sequences, so you
can go back and try again with more (or fewer) keywords in
your query
• The “History” feature allows you to combine any of your past
queries.
• The “Limits” feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc.
• [Many other features are designed to search for literature in
MEDLINE]
47
Related Items
You can search for a text term in sequence annotations or in
MEDLINE abstracts, and find all articles, DNA, and protein
sequences that mention that term.
Then from any article or sequence, you can move to "related
articles" or "related sequences".
• Relationships between sequences are computed with BLAST
• Relationships between articles are computed with MeSH terms
(shared keywords)
• Relationships between DNA and protein sequences rely on accession
numbers
• Relationships between sequences and MEDLINE articles rely on both
shared keywords and the mention of accession numbers in the articles.
48
49
50
51
Database Search Strategies
• General search principles - not limited
to sequence (or to biology).
• Start with broad keywords and narrow
the search using more specific terms.
• Try variants of spelling, numbers, etc.
• Search many databases.
• Be persistent!!
52
PubMed
• MEDLINE publication database
– Over 17,000 journals
– Some other citations
• Papers from 1960 and on
– Over 12,000,000 entries
• Alerting services
– http://www.pubcrawler.ie/
– http://www.biomail.org/
53
Searching PubMed
• Structureless searches
– Automatic term mapping
• Structured searches
– Tags, e.g. [au], [ta], [dp], [ti]
– Boolean operators, e.g. AND, OR, NOT, ()
• Additional features
– Subsets, limits
– Clipboard, history
54
Start working:
Search PubMed
1. cuban cigars
2. cuban OR cigars
3. “cuban cigars”
4. cuba* cigar*
5. (cuba* cigar*) NOT smok*
6. Fidel Castro
7. “fidel castro”
8. #6 NOT #7
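To see why these eight queries return different result sets, it can help to run them against a tiny collection where every answer can be checked by hand. The sketch below is a toy model, not PubMed: the five titles are invented, each query is evaluated as plain set operations, and PubMed's automatic term mapping is ignored. The helper names `term` and `phrase` are hypothetical.

```python
import re

# Invented titles, for illustration only; they are not PubMed records.
DOCS = {
    1: "Cuban cigars and the export trade",
    2: "Cigars in Cuban history",
    3: "Smoking cigars: a Cuban tradition",
    4: "Cuba after Fidel Castro",
    5: "Castro district housing, by Fidel Ramos",
}

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def term(pattern):
    """Ids of titles containing the word; a trailing * matches any word start."""
    p = pattern.lower()
    if p.endswith("*"):
        stem = p[:-1]
        return {i for i, t in DOCS.items() if any(w.startswith(stem) for w in words(t))}
    return {i for i, t in DOCS.items() if p in words(t)}

def phrase(text):
    """Ids of titles containing the words next to each other, in order."""
    target = words(text)
    n = len(target)
    hits = set()
    for i, t in DOCS.items():
        w = words(t)
        if any(w[k:k + n] == target for k in range(len(w) - n + 1)):
            hits.add(i)
    return hits

q1 = term("cuban") & term("cigars")      # 1. default AND
q2 = term("cuban") | term("cigars")      # 2. OR
q3 = phrase("cuban cigars")              # 3. quoted phrase
q4 = term("cuba*") & term("cigar*")      # 4. truncation
q5 = q4 - term("smok*")                  # 5. NOT
q6 = term("fidel") & term("castro")      # 6. two words, default AND
q7 = phrase("fidel castro")              # 7. quoted phrase
q8 = q6 - q7                             # 8. History: #6 NOT #7

for name, result in [("q3", q3), ("q5", q5), ("q8", q8)]:
    print(name, sorted(result))
```

Query 8 is the interesting one: it isolates records that mention both words but not as the phrase, which is what combining History entries is for.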
55
“Details” and “History” in
PubMed
56
“Details” and “History” in
PubMed
57
The OMIM (Online Mendelian
Inheritance in Man)
– Genes and genetic disorders
– Edited by team at Johns Hopkins
– Updated daily
58
MIM Number Prefixes
*           gene with known sequence
+           gene with known sequence and phenotype
#           phenotype description, molecular basis known
%           mendelian phenotype or locus, molecular basis unknown
no prefix   other, mainly phenotypes with suspected mendelian basis
59
Searching OMIM
• Search Fields
– Name of trait, e.g., hypertension
– Cytogenetic location, e.g., 1p31.6
– Inheritance, e.g., autosomal dominant
– Gene, e.g., coagulation factor VIII
60
OMIM search tags
All Fields          [ALL]
Allelic Variant     [AV] or [VAR]
Chromosome          [CH] or [CHR]
Clinical Synopsis   [CS] or [CLIN]
Gene Map            [GM] or [MAP]
Gene Name           [GN] or [GENE]
Reference           [RE] or [REF]
61
62
Start working:
Search OMIM
How many types of hemophilia are there?
For how many is the affected gene known?
What are the genes involved in hemophilia A?
What are the mutations in hemophilia A?
63
Online Literature databases
1. How to use the UH online Library?
2. Online glossaries
3. Google Scholar
4. Google Books
5. Web of Science
64
How to use the online UH Library?
http://info.lib.uh.edu/
65
66
Online Glossaries
Bioinformatics :
http://www.geocities.com/bioinformaticsweb/glossary.html
http://big.mcw.edu/
Genomics:
http://www.geocities.com/bioinformaticsweb/genomicglossary.html
Molecular Evolution:
http://workshop.molecularevolution.org/resources/glossary/
Biology dictionary:
http://www.biology-online.org/dictionary/satellite_cells
Other glossaries, e.g., the list of phobias:
http://www.phobialist.com/class.html
67
3. Google Scholar
http://www.scholar.google.com/
68
What is Google Scholar?
Enables you to search specifically for scholarly
literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical
reports from all broad areas of research.
69
Use Google Scholar to find articles from a
wide variety of academic publishers,
professional societies, preprint repositories
and universities, as well as scholarly articles
available across the web.
70
Google Scholar orders your search results by how relevant they
are to your query, so the most useful references should appear at
the top of the page.
This relevance ranking takes into account the full text of each
article, the article's author, the publication in which the article
appeared, and how often it has been cited in scholarly literature.
71
What other DATA can we retrieve from the record?
72
73
74
4. Google Book Search
75
76
Start working:
Search Google Books
How many times is the tail of the giraffe
mentioned in On the Origin of Species by Mr.
Darwin?
77
5. Web of Science
http://portal01.isiknowledge.com.ezproxy.lib.uh.edu/portal.cgi?DestApp=WOS&Func=Frame
78
79
80