
Bioinformatics & LIS
A brief talk for librarians,
information scientists, and
computer scientists about
resources and collaborative
opportunities with biology.
April 18, 2006
G. Benoit
Outline of the talk
• Bioinformatics defined
• Generation of data
• Tools and databases
• Activities for Librarianship, Computer and Information Science
• Examples:
– Entrez, NCBI, Visualization
• Collaborations
Bioinformatics defined
• Over 70 definitions
• Differences arise from the work
• National Center for Biotechnology Information
(NCBI):
• The development of new algorithms and statistics with
which to assess relationships among members of large
data sets;
• The analysis and interpretation of various types of data
including nucleotide and amino acid sequences, protein
domains, and protein structures; and
• The development and implementation of tools that
enable efficient access and management of different
types of information.
Without getting into the science…
• How the data started …
• Four chemical bases (purines [adenine
(A), guanine (G)] and pyrimidines
[cytosine (C), thymine (T)])
• Their precise order and linking
(attached to a sugar molecule and to a
phosphate molecule to create a
nucleotide) …
DNA
• A pairs with T; G with C to
make unique and very
long strings, called
sequences
• E.g., AATGACCAT codes
for a different gene than
GGGCCATAG would
• Transcription: RNA consists
of A, G, C, and uracil (U) and
has ribose instead of
deoxyribose
• Point is one can predict
missing data,
sometimes…
In short…
the nucleotides are linked in a certain order or sequence through the phosphate group;
their precise order and linking within the DNA determines what proteins the gene produces
and the phenotype of the organism
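The pairing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a bioinformatics tool: the function names are invented for this example, and `fill_gaps` assumes the two strands are already aligned position by position, with `?` marking an unread base.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return "".join(PAIRS[base] for base in dna.upper())


def transcribe(dna: str) -> str:
    """Return the RNA spelling of a DNA string (uracil replaces thymine)."""
    return dna.upper().replace("T", "U")


def fill_gaps(strand: str, partner: str) -> str:
    """Predict bases marked '?' in `strand` from the aligned partner strand."""
    return "".join(
        PAIRS[p] if s == "?" else s
        for s, p in zip(strand.upper(), partner.upper())
    )


print(complement("AATGACCAT"))              # TTACTGGTA
print(transcribe("AATGACCAT"))              # AAUGACCAU
print(fill_gaps("AA?GAC?AT", "TTACTGGTA"))  # AATGACCAT
```

The last call is the "predict missing data" point: because pairing is fixed, a gap on one strand can be recovered when the partner strand is known.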
Generation of Data
• Raw data from sequencing
• Expression data
• Data generated by linking other raw data in
very large, multidimensional databases (e.g.,
OMIM)
• Research literature (full-text journals)
• Data models to describe the literature for
retrieval, linking to other data, and linking to
the raw data
• New data models to support greater
flexibility in describing & manipulating
data …
Generation of Data
• To support integrated search and
retrieval
• To focus on single organisms or find
similarities across them
• Feed other technology
• Visualization of natural phenomena and
of abstract phenomena
Tools & Databases
• A host of tools for database searching…
– BLAST (basic local alignment search tool)
– FASTA (sequence strings)
– ChopUp (protein analysis)
– Integrated packages (Lasergene Sequence
Analysis Software)
– The many services offered through
NCBI and NLM
• Take a look at handout, Table 1, publicly accessible databases
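Many of these tools exchange sequences in the plain-text format that takes its name from FASTA: a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below; the sample records and their names are made up for illustration, and real files have more edge cases than this handles.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without '>') to its concatenated sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may wrap over several lines
    return records


# Hypothetical records, for illustration only.
SAMPLE = """>seq1 hypothetical example record
AATGACCAT
GGGCCATAG
>seq2 another hypothetical record
GGGCCATAG
"""

for name, sequence in parse_fasta(SAMPLE).items():
    print(name, len(sequence))
```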
Data Categories
• Monographs, Journals, Announcements (text)
• Datasets:
– Bibliographic (http://www.expasy.org/links.html)
– Taxonomic
– Nucleic acid
– Genomic (e.g., GDB, OMIM)
– Protein DB (SwissProt, TrEMBL)
– Protein families, domains, and functional sites
– Proteomics initiative
– Enzyme/metabolic pathways
– Sequence Retrieval System (SRS) and NCBI Data Model
• Take a look at handout, Table 2, publicly accessible databases defined, and then
• Entrez sample, Table 3
Entrez example
• Notice the familiar access points
(author, journal, title) as well as domain-specific ones (exon, gene, organism)
• Notice, too, the DNA …
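To show how fielded access points combine into one query, here is a small sketch that builds a search URL without sending it. It assumes NCBI's public E-utilities ESearch endpoint and the bracketed field-tag syntax (for example `[Organism]`); the helper names and the sample gene are illustrative, and the fields available differ from database to database.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(fields: dict[str, str]) -> str:
    """Join field-restricted terms with AND, e.g. 'lacZ[Gene Name]'."""
    return " AND ".join(f"{value}[{name}]" for name, value in fields.items())


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Illustrative query: one domain-specific field each for organism and gene.
term = build_term({"Organism": "Escherichia coli", "Gene Name": "lacZ"})
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

The same pattern works for the familiar bibliographic fields: swap in an author or title field and a literature database.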
NCBI Homepage
• http://www.ncbi.nih.gov/
• Notice the variety of tools (left menu)
• Site map:
http://www.ncbi.nih.gov/Sitemap/index.html
• Alpha list http://www.ncbi.nih.gov/Sitemap/AlphaList.html
Linking across resources
http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html
NCBI’s structure database is called the Molecular Modeling Database (MMDB). It is
a subset of the 3D structures in the Protein Data Bank (PDB), excluding theoretical
models. Data are obtained from X-ray crystallography and NMR spectroscopy. The goal is to make it easier to compare structures.
Searching: a variety of access points are available: author, title, text terms, a
4-character PDB code, or a numerical MMDB-id
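A search box that accepts all of these has to decide what kind of query it was given. The sketch below shows one way to do that. It assumes a PDB code is a digit followed by three letters or digits and that an MMDB-id is a plain integer; the function name and return labels are invented for this example and do not describe how Entrez itself routes queries.

```python
import re

# Assumption: a PDB code is four characters, a digit followed by three
# letters or digits (e.g. "4HHB"); an MMDB-id is a plain integer.
PDB_CODE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def classify_query(query: str) -> str:
    """Label a structure query as 'mmdb_id', 'pdb_code', or free 'text'."""
    q = query.strip()
    if q.isdigit():
        return "mmdb_id"  # all-digit input is treated as an MMDB-id here
    if PDB_CODE.fullmatch(q):
        return "pdb_code"
    return "text"


for example in ("4HHB", "12345", "hemoglobin"):
    print(example, "->", classify_query(example))
```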
MMDB Data: PDB records are parsed to extract sequences, citations, and structural
information, then converted to ASN.1.
Taxonomy: used to help end users see term relationships and databases, along
with literature references:
Example: http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Escherichia+coli&lvl=0&srchmode=1
Linking across resources
• XML - there are hundreds of XML
schemas used in biology
• Calls for mapping to ASN.1 records [see
NCBI example]
• Calls for mapping across schemas
• Calls for exporting data for different
devices…
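Mapping across schemas can be as simple as renaming elements, at least for flat records. The sketch below uses Python's standard `xml.etree.ElementTree`. The source record, its tag names, the accession value, and the target vocabulary are all hypothetical; real biological schemas are nested and need much richer mappings.

```python
import xml.etree.ElementTree as ET

# Hypothetical source record and tag mapping, for illustration only.
SOURCE = (
    "<entry><acc>X00001</acc><org>Escherichia coli</org>"
    "<seq>AATGACCAT</seq></entry>"
)
TAG_MAP = {"acc": "accession", "org": "organism", "seq": "sequence"}


def remap(xml_text: str, tag_map: dict[str, str], root_tag: str = "record") -> str:
    """Copy child elements into a new record, renaming tags via tag_map."""
    source = ET.fromstring(xml_text)
    target = ET.Element(root_tag)
    for child in source:
        new_tag = tag_map.get(child.tag)
        if new_tag is None:
            continue  # unmapped fields are dropped in this sketch
        ET.SubElement(target, new_tag).text = child.text
    return ET.tostring(target, encoding="unicode")


print(remap(SOURCE, TAG_MAP))
```

Keeping the mapping in a plain dictionary means a librarian or domain expert can maintain it without touching the code.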
Visualization
• Cn3D - uses MMDB, Entrez’s structure
database
– http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
• RasMol http://www.umass.edu/microbio/rasmol/
• Protein Explorer
http://www.umass.edu/microbio/rasmol/rotating.htm
• OpenRasMol http://www.openrasmol.org/
• MolviZ.org http://www.umass.edu/microbio/chime
• World Index of Molecular Visualization
http://molvis.sdsc.edu/visres/index.html
Recap main points
• Very large data sets “homogenized” thru ASN.1
• Goal to integrate (text-text,
visualization-text, text-vis)
• Raw data + research literature +
visualization
• Biologists provide domain
knowledge
• XML is a big player
• CS and IS provide technology
• Librarians provide maintenance
and access to resources
Collaborative Opportunities
• For LIS and CS:
– Domain analysis
– information use, communication, theories of
information;
– systems analysis and design,
– data modeling,
– classification,
– storage and retrieval,
– HCI mapped onto a generalized model of a
molecular biology experimental cycle
• [Denn & MacMullen, 2002, p. 556]
Collaborative Opportunities
• “Insertion Points” - development of new tools
and methods for managing, integrating &
visualizing data
• For local use: download selected
data sets for local needs
(Stapley & Benoit, 2000)
• XML Transformations
• XML - SVG - X3D
• Automated retrieval
• Clustering (data- and text-mining)
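As a taste of the XML-to-SVG idea, the sketch below turns a sequence record into a tiny bar chart of base composition. The input element name (`sequence`), the function name, and the chart layout are assumptions made for this example; only the SVG namespace and the `rect` element come from the SVG standard.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def composition_svg(record_xml: str, size: int = 100) -> str:
    """Render the A/C/G/T composition of a <sequence> element as SVG bars."""
    seq = ET.fromstring(record_xml).findtext("sequence", default="").upper()
    counts = Counter(seq)
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(size),
        "height": str(size),
    })
    bar = size // 4  # four bars, one per base
    for i, base in enumerate("ACGT"):
        # Bar height is the base's share of the sequence, scaled to `size`.
        height = round(size * counts[base] / len(seq)) if seq else 0
        ET.SubElement(svg, "rect", {
            "x": str(i * bar),
            "y": str(size - height),
            "width": str(bar),
            "height": str(height),
        })
    return ET.tostring(svg, encoding="unicode")


print(composition_svg("<record><sequence>AATGACCAT</sequence></record>"))
```

Saving the output to a file with an .svg extension should let a browser display it, so the same record can feed both text and visual views.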
Collaborative Opportunities
• Biologists’ needs:
– To go beyond mining of genomic data to
investigate causal entailments in inter- and
intracellular dynamics
• LIS’s response:
– To aid understanding of the scientific
processes thru visualization of literature,
metadata and graphic representations in
general and for disease-specific analysis
Back to you…
• Thanks …