ppt - Home - National University of Singapore

Use of Computers in
Molecular Biology
Meena K Sakharkar
Training Manager, BioInformatics Centre
National University of Singapore
What is BioInformatics?
• Many related terms and buzzwords
• A multiplicity of names:
– bioinformatics
– biocomputing
– biological computing
– computational biology
– computational genomics
– biological data mining
Overview of the challenges of Molecular Biology
Computing
• The huge dataset problem
– automated DNA sequencers
– the Human Genome Project
– bulk sequencing of cDNAs (ESTs)
GenBank Growth Chart
[Figure: number of bases in GenBank (vertical axis, 0 to 1,600,000,000) plotted by release date (horizontal axis labelled Year, Dec-82 to Apr-98)]
As of Oct. 1999, GenBank contains over 3.8 billion bases of DNA and protein sequence,
which requires about 18 gigabytes of computer disk storage space.
Human Genome Project
• What is the Human Genome Project?
– 15-year effort formally begun in October 1990, coordinated by the
U.S. Department of Energy and the National Institutes of Health.
– identify all the estimated 80,000 genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make
up human DNA,
– store this information in databases,
– develop tools for data analysis, and
– address the ethical, legal, and social issues (ELSI) that may arise
from the project.
• Who is head of the U.S. Human Genome
Project?
– The DOE Human Genome Program is directed by Ari Patrinos,
and Francis Collins directs the NIH Human Genome Program.
– Ari Patrinos also heads the Department of Energy Office of
Biological and Environmental Research.
Related fields
• molecular evolution
• origin of life
• genomics and proteomics
• the Human Genome Project
• theoretical biology
• complexity and information theory
• biotechnology
• lead drug discovery
• computing with biomolecules
Our (working) definition
• Bioinformatics: the body of tools, algorithms and know-how
needed to handle complex biological information (the technological
aspect)
• Computational biology: the application of bioinformatics tools to
perform biological studies (the scientific aspect); a very broad and diverse
field
• Bioinformatics is clearly a multidisciplinary field
including:
– computer systems management
– networking, database design
– computer programming
– molecular biology
Integrating bioinformatics and computational
biology:
• A biologist can use existing tools but might misinterpret results
The black-box effect - the 'software kit'
• A biologist might refrain from doing some interesting analysis if the
existing software doesn't offer it as an option
The ability to program is important
• A computer scientist or a programmer can produce interesting and/or
efficient algorithms and tools, but these might lack biological relevance.
A biological training/background is important
• Beware of the 'just a tool maker' stigma
• Best results are achieved by integrating the development of
tools with their usage in interesting biological systems
How to handle all the
information?
• Producing
• Processing
• Storing
• Sharing
• Querying
• Retrieving
• Visualising
• Annotating
• Curating
Use of Computers in Molecular Biology
• Powerful tools to organise the data itself.
– Exponential growth.
– A new release is made every two months.
• Data Analysis.
– Retrieval.
– Homology Search.
– Modelling purposes - Drug Design
• Data Integration
• Data Visualisation
Paradigmatic Shift:
• Getting new sequences is now easy.
• Having a new sequence, we can start by analysing it using the
computer, or we can start by doing experimental work.
• "A month in the lab can often save an hour in the library." Westheimer ... or searching the Internet, or doing computerised
analyses.
• From 'wet lab' to 'soft lab'.
• in vivo, in vitro, and in
silico
Information is being collected, organized,
and made available:
• GenBank is the central sequence information database in the United
States
• Data is shared between GenBank and European Molecular Biology
Laboratory (EMBL) and the DNA Database of Japan (DDBJ)
• All sequence data submitted to any of these databases is automatically
integrated into the others.
• Sequence data is also incorporated from the Genome Sequence Data
Base (GSDB) and from patent applications.
Similarity Searching in the databanks
• "Are there any sequences in the databanks similar to my
sequence?"
• Directly searching the databanks by comparing sequences
uses too much computer time
• The Biologist uses timesaving tools: FASTA and BLAST
• Relies on statistics and the informed judgement of the
Biologist.
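Much of the time saving in tools such as FASTA and BLAST comes from first looking up short exact word matches and only then examining the promising candidates in detail. The sketch below illustrates that word-lookup idea only; it is not the actual FASTA or BLAST algorithm, and the function names, the word length of 3, and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Map every k-letter word to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def rank_candidates(query, index, k=3):
    """Count the query words found in each database sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    # Highest word count first; ties broken by name for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Toy data, invented for this example.
database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTGGGGCC",
    "seqC": "GTACGTTT",
}
index = build_word_index(database)
ranking = rank_candidates("TACGTG", index)
```

Only the sequences that share words with the query (here seqA and seqC) would go on to a slower, detailed comparison; the biologist still has to judge whether the resulting matches are meaningful.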
Pairwise and Multiple Alignments
• Multiple Alignment is the basis for the study of
protein families and functional domains.
• When pairwise alignment is expanded to multiple
sequences, it becomes a computationally huge
problem.
• To reduce the enormous number of possible alignments, a
simplified heuristic (approximate) algorithm known as
progressive pairwise alignment is used
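Progressive methods assemble a multiple alignment from a series of pairwise alignments, so the pairwise step is the basic building block. As an illustration, the sketch below scores a global pairwise alignment by dynamic programming (the Needleman-Wunsch method, which the slides do not name); the match, mismatch and gap values are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

The table filled in here has one cell per pair of positions, which is manageable for two sequences; the same exhaustive approach over many sequences at once grows far too quickly, which is why a heuristic is used for multiple alignment.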
Structure-function relationships:
Sequence patterns that predict function
• One of the challenging areas of computational molecular
biology is the prediction of the function of protein
molecules from their sequence.
• Sequence determines 3-D structure, structure
determines function
• Identify conserved regions (domains or motifs)
• Domain databases can be used to scan any unknown
protein sequence
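Scanning a sequence for a conserved motif can be sketched with a regular expression. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P} as written in PROSITE notation; the helper name and the test sequence are invented for illustration, and a real domain database search uses many patterns and profiles rather than a single one.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for every match,
    including overlapping ones (hence the lookahead in the pattern)."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]

# Invented test sequence.
sites = find_motif("MKNGSAANPTQNNTSW")
```

A pattern match is only a prediction: short motifs like this one occur by chance, so hits still need biological judgement.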
Searching Literature using
PubMed at NCBI
PubMed
• Project by NIH and NLM.
• Search Tool for accessing literature
citations.
• PubMed search system: MedLine and PreMedLine
databases, and molecular biology
databases indexed under Entrez.
MedLine
• MedLine - MEDLARS OnLINE database; NCBI’s premier bibliographic database.
• Covers medicine, nursing, dentistry,
veterinary medicine, the health care
sciences and pre-clinical sciences.
• Covers over 3900 current biomedical journals
published in the US and other
countries.
MedLine
• 9 million records.
• Since 1966.
PreMedLine
• Introduced in August 1996.
• Basic Citation and abstracts before the full
records are prepared and added to Medline.
MEDLINE SAMPLE RECORD
UI  98408838
AU  Tao X, Dafu D
TI  Relationship between synonymous codon usage and
    protein structure.
MH  Codon*
MH  Protein Folding*
MH  Protein Structure, Secondary*
MH  Proteins / genetics ……
AB  The hypothesis that synonymous codon usage is related
    to protein three-dimensional structure is examined by
    …
PT  Journal article
SO  FEBS Lett 1998 Aug 28; 434(1-2): 93-6
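A record made of two-letter field tags (UI, AU, TI, MH, AB, PT, SO) and their values lends itself to simple parsing. The sketch below assumes a plain-text layout in which each field starts with its tag at the beginning of a line and continuation lines are indented; that layout and the function name are assumptions for illustration, not a description of an official export format.

```python
def parse_tagged_record(text):
    """Parse 'TAG value' lines into a dict mapping tag -> list of values.

    Repeated tags (such as MH) accumulate; indented lines continue the
    previous field.
    """
    record = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, rest = line[:2], line[2:]
        if tag.isalpha() and tag.isupper() and (not rest or rest[0].isspace()):
            last_tag = tag
            record.setdefault(tag, []).append(rest.strip())
        elif last_tag is not None:
            record[last_tag][-1] += " " + line.strip()
    return record

sample = """\
UI  98408838
AU  Tao X, Dafu D
TI  Relationship between synonymous codon usage and
    protein structure.
MH  Codon*
MH  Protein Folding*
PT  Journal article
"""
record = parse_tagged_record(sample)
```

Keeping every field as a list makes repeated tags such as MH easy to handle without special cases.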
MEDLINE Indexing
• MeSH Terms to LIMIT Retrieval
– human, animal, male, female,
– age groups, organism, etc.
• Publication Types (another way to LIMIT)
– review, clinical trial, letter, journal article, etc.
MEDLINE
Subject Headings
Advantages of MeSH Terms
• Represent a subject concept & no term
synonyms needed
• Find relevant articles on a search topic that may
not be explicitly mentioned in a title or abstract
• Focus search & be specific to eliminate irrelevant
records
• Increase search efficiency to save time … Get
reliable results
Searching MEDLINE
Subject Headings
• Disadvantages of MeSH
– Thesaurus terms may not cover all concepts,
esp. jargon
– Not every concept in abstract or article can get
thesaurus terms
MEDLINE Searching
Search terms are combined with
Boolean “OR” and “AND” .
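The effect of the two operators can be shown with sets of record identifiers: AND keeps only the records that match every term, while OR keeps records that match any of them. The index contents and function names below are invented for illustration and say nothing about how MEDLINE is implemented internally.

```python
# Toy index: search term -> set of record identifiers (all invented).
index = {
    "codon": {1, 2, 5},
    "protein folding": {2, 3, 5},
    "yeast": {5, 8},
}

def search_and(index, *terms):
    """Records that contain every term (Boolean AND narrows the result)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Records that contain at least one term (Boolean OR widens the result)."""
    return set().union(*(index.get(term, set()) for term in terms))

narrow = search_and(index, "codon", "protein folding")
wide = search_or(index, "codon", "yeast")
```

This is why adding concepts with AND reduces the number of records found, and adding synonyms with OR increases it.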
Modifying Retrieval -- NOT ENOUGH Found
• Reduce number of concepts to combine
• Add synonyms or related terms
– Use both free-text words & MeSH terms
– Truncate free-text words as appropriate
– Explode subject term, if it has narrower terms
• Do NOT use limits (e.g., major point, review)
• Consult a professional searcher … Librarian
Modifying Retrieval -- TOO Many Found
• Use MeSH terms only … Use no free-text words
• Use “MeSH Power” to Focus Your Search
– Try a more specific MeSH term
– Limit MeSH terms to MAJOR point of article
– Use a Subheading with your MeSH term
• Reduce number of synonyms, if free-text
searching
• Add additional concepts to your search
• Use Limits … English language, reviews
• Restrict to human, animal, or organism
Internet Tools and Searches
Network Utilities
What is the Internet?
• A world wide collection of networks of
computers
• A network of computer networks
• A network based on the TCP/IP protocol
Standalone Computer
PC
Printer
A typical setup at home
Speakers
LAN
A Small Local Area Network
of two computers
and one printer
in your office
Inter-Departmental Network
Campus Wide Network
Campus Network
Wide Area Network
National Network
InterCountry Network
Global Network
The INTERNET
What can you do with Internet?
INTERNET APPLICATIONS
• Electronic Mail (Email)
• Internet Talk/Chat (IRC)
• File Transfer (FTP)
• Remote Login (Telnet)
• Internet News (Usenet)
• Info retrieval (Gopher, World Wide Web)
• Audio/Video Conferencing (CU-SeeMe,
Mbone)
• Internet Phone
FTP: File Transfer Protocol
ftp ncbi.nlm.nih.gov
login: anonymous
passwd: email address
To ftp from a server where you have an account, use your
own login and passwd
FTP commands
• cd - change directory
• ls - listing
• pwd - present working directory
• bin - transfer in binary mode
• asc - transfer in ascii mode
• hash - show transfer progress by printing hash marks (#)
• lcd - local change directory
FTP commands continued
• prompt - toggle interactive prompting during multiple file transfers
• mget - get multiple files from the server;
else you can just use get (single file)
• mput - put multiple files onto the server;
put - single file transfer
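The same session can be scripted. The sketch below uses Python's standard ftplib module to mirror the interactive steps (anonymous login with an e-mail address as the password, cd, binary get); the function names are invented, and the directory and file names in the commented usage are placeholders rather than real paths.

```python
from ftplib import FTP

def anonymous_session(host, email):
    """Open an anonymous FTP session; the e-mail address serves as the password."""
    ftp = FTP(host)
    ftp.login("anonymous", email)
    return ftp

def fetch_files(ftp, remote_dir, names):
    """Rough equivalent of: cd remote_dir, bin, then get for each name."""
    ftp.cwd(remote_dir)  # cd
    contents = {}
    for name in names:
        chunks = []
        # retrbinary transfers in binary mode, like "bin" followed by "get".
        ftp.retrbinary("RETR " + name, chunks.append)
        contents[name] = b"".join(chunks)
    return contents

# Example (needs network access; directory and file names are placeholders):
#   ftp = anonymous_session("ncbi.nlm.nih.gov", "you@example.org")
#   files = fetch_files(ftp, "some/directory", ["README"])
#   ftp.quit()
```

Passing the connection in as an argument keeps the download logic separate from the login step, so the same function works for anonymous access and for a personal account.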
Telnet
• Work on another machine by remote login.
• Telnet intron.bic.nus.edu.sg
login:
passwd:
• Must have an account on the machine for
doing telnet
• Must have internet connection
• Space allocated to you on the machine
HTML- an Introduction
What is Hypertext?
• Non-Linear Text
• Links embedded in the text
• Jumps to other locations in the
document/db
[Figure: the sentence "the quick brown fox jumps over the fence", with a link leading to a separate text block titled "Fence"]
Creating a Web Page
• Terms to Know
• WWW/Web: World Wide Web
• HTML: Hyper Text Mark-up Language
• URL: Uniform Resource Locator
• I assume that you:
– know how to use Netscape or some other Web browser
– have access to a Web server (or that you want to
produce HTML documents for personal use in local-viewing mode)
Creating a Web Page
What Is an HTML Document?
• Collection of styles
• HTML documents are plain-text files
• Can be created using any text editor
• You can also use word-processing software if you
remember to save your document as "text only
with line breaks."
• HTML tag names are not case sensitive.
• TAGS are used to mark the elements of the file for
your browser.
Creating a Web Page
TAGS Explained
• Every HTML document should contain certain
standard HTML tags.
• Each document consists of head and body tags.
• The head contains the title, and the body contains
the actual text that is made up of paragraphs, lists,
and other elements.
<html>
<head>
<TITLE>A Simple HTML Example</TITLE>
</head>
<body>
<H1>HTML is Easy To Learn</H1>
<P>Welcome to the world of HTML.
This is the first paragraph. While short it is
still a paragraph!</P>
<P>And this is the second paragraph.</P>
</body>
</html>
• The required elements are the <html>, <head>, <title>, and <body> tags
(and their corresponding end tags).
• Note: Because you should include these tags in each file, you might
want to create a template file with them.
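One way to avoid retyping the required tags is to generate the skeleton. The short Python sketch below builds a document containing the html, head, title and body tags; the function name and its parameters are assumptions made for illustration.

```python
from html import escape

def html_template(title, body_html=""):
    """Return a minimal document with the html, head, title and body tags."""
    return (
        "<html>\n"
        "<head>\n"
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{body_html}\n"
        "</body>\n"
        "</html>\n"
    )

page = html_template("A Simple HTML Example", "<h1>HTML is Easy To Learn</h1>")
```

The title is escaped so that characters such as & and < in a title cannot break the markup; the body is inserted unchanged because it is expected to contain tags already.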
TAGS Explained
• HTML:
– This element tells your browser that the file contains
HTML-coded information.
– The file extension .html also indicates that this is an HTML
document and must be used.
• HEAD:
– The head element identifies the first part of your
HTML-coded document that contains the title.
TITLE
The title element contains your document title and identifies its content in a global
context.
BODY
Contains the content of your document.
HEADINGS
HTML has six levels of headings, numbered 1 through 6,
with 1 being the most prominent.
Headings are displayed in larger and/or bolder fonts than normal body text.
The syntax of the heading element is:
<Hy>Text of heading </Hy>
where y is a number between 1 and 6 specifying the level of the heading.
PARAGRAPHS
Carriage returns in HTML files aren't significant.
Word wrapping can occur at any point in your source file, and multiple spaces are
collapsed into a single space by your browser.
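The collapsing rule can be imitated in a few lines of Python. This is only an approximation of what a browser does with ordinary paragraph text, and the function name is invented for illustration.

```python
def collapse_whitespace(source_text):
    """Collapse runs of spaces, tabs and line breaks into single spaces,
    roughly as a browser does when laying out ordinary paragraph text."""
    return " ".join(source_text.split())

source = "Welcome to the world\n   of HTML.\n\n\nThis is   still one paragraph."
rendered = collapse_whitespace(source)
```

Because blank lines in the source vanish in the same way, paragraphs have to be marked explicitly with tags.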
The </P> closing tag can be omitted. This is because browsers understand that when
they encounter a <P> tag, it implies that there is an end to the previous paragraph.
Using the <P> and </P> as a paragraph container means that you can
center a paragraph by including the ALIGN=alignment attribute in
your source file.
<P ALIGN=CENTER>
This is a centered paragraph.
[See the formatted version below.]
</P>
This is a centered paragraph.
Lists
HTML supports unnumbered, numbered, and definition lists. You can nest
lists too, but use this feature sparingly because too many nested items can
get difficult to follow.
Unnumbered Lists
To make an unnumbered, bulleted list,
1. start with an opening list <UL> (for unnumbered list) tag
2. enter the <LI> (list item) tag followed by the individual item; no closing </LI> tag is needed
3. end the entire list with a closing list </UL> tag
Below is a sample three-item list:
<UL>
<LI> apples
<LI> bananas
<LI> grapefruit
</UL>
The output is:
• apples
• bananas
• grapefruit
Numbered Lists
A numbered list (also called an ordered list, from which the tag name derives) is
identical to an unnumbered list, except it uses <OL> instead of <UL>. The items are
tagged using the same <LI> tag. The following HTML code:
<OL>
<LI> oranges
<LI> peaches
<LI> grapes
</OL>
produces this formatted output:
1. oranges
2. peaches
3. grapes
Definition Lists
A definition list (coded as <DL>) usually consists of alternating definition terms (coded
as <DT>) and term definitions (coded as <DD>). Web browsers generally
format the definition on a new line.
The following is an example of a definition list:
<DL>
<DT> NCSA
<DD> NCSA, the National Center for Supercomputing
Applications, is located on the campus of the
University of Illinois at Urbana-Champaign.
<DT> Cornell Theory Center
<DD> CTC is located on the campus of Cornell
University in Ithaca, New York.
</DL>
The output looks like:
NCSA
NCSA, the National Center for Supercomputing Applications, is located on the campus of
the University of Illinois at Urbana-Champaign.
Cornell Theory Center
CTC is located on the campus of Cornell University in Ithaca, New York.
Nested Lists
Lists can be nested. You can also have a number of paragraphs, each containing a nested list, in a single
list item. Here is a sample nested list:
<UL>
<LI> A few New England states:
<UL>
<LI> Vermont
<LI> New Hampshire
<LI> Maine
</UL>
<LI> Two Midwestern states:
<UL>
<LI> Michigan
<LI> Indiana
</UL>
</UL>
The nested list is displayed as
• A few New England states:
  – Vermont
  – New Hampshire
  – Maine
• Two Midwestern states:
  – Michigan
  – Indiana
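Nested list markup is tedious to type by hand, so it is a natural candidate for generation from data. The Python sketch below writes nested <UL> markup in the same style as the sample, without closing </LI> tags; the function name and the data layout (strings, or pairs of text and sub-items) are assumptions made for illustration.

```python
def render_list(items, indent=0):
    """Render nested Python lists as nested <UL> markup.

    Each item is either a string or a (text, sub_items) pair.
    """
    pad = "  " * indent
    lines = [pad + "<UL>"]
    for item in items:
        if isinstance(item, str):
            lines.append(pad + "  <LI> " + item)
        else:
            text, sub_items = item
            lines.append(pad + "  <LI> " + text)
            # The nested list is indented two levels deeper than its parent <UL>.
            lines.append(render_list(sub_items, indent + 2))
    lines.append(pad + "</UL>")
    return "\n".join(lines)

states = [
    ("A few New England states:", ["Vermont", "New Hampshire", "Maine"]),
    ("Two Midwestern states:", ["Michigan", "Indiana"]),
]
markup = render_list(states)
```

The indentation is only for the human reader of the source; the browser ignores it and relies on the tags alone.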
Forced Line Breaks/Postal Addresses
The <BR> tag forces a line break with no extra (white) space between lines. Using <P> elements
for short lines of text such as postal addresses results in unwanted additional white space. For
example, with <BR>:
National Center for Supercomputing Applications<BR>
605 East Springfield Avenue<BR>
Champaign, Illinois 61820-5518<BR>
The output is:
National Center for Supercomputing Applications
605 East Springfield Avenue
Champaign, Illinois 61820-5518
Horizontal Rules
The <HR> tag produces a horizontal line the width of the browser window. A horizontal rule is
useful to separate sections of your document. For example, many people add a rule at the end of
their text and before the <address> information.
You can vary a rule's size (thickness) and width (the percentage of the window covered by the
rule). Experiment with the settings until you are satisfied with the presentation. For example:
<HR SIZE=4 WIDTH="50%">
displays as a thick rule spanning half the width of the browser window.
• Physical Styles
<B> bold text
<I> italic text
<TT> typewriter text, i.e. fixed-width font.
• Linking
Power - link text and/or image.
Browser highlights the identified text or image with color and/or underlines to indicate
that it is a hypertext link.
HTML's single hypertext-related tag is <A>, which stands for anchor. To include an
anchor in your document:
1. start the anchor with <A (include a space after the A)
2. specify the document you're linking to by entering the parameter HREF="filename"
followed by a closing right angle bracket (>)
3. enter the text that will serve as the hypertext link in the current document
4. enter the ending anchor tag: </A> (no space is needed before the end anchor tag)
Here is a sample hypertext reference in a file called US.html:
<A HREF="http://www.bic.nus.edu.sg">BIC HomePage</A>
This entry makes the words BIC HomePage the hyperlink to the document
http://www.bic.nus.edu.sg/index.html.
You can make it easy for a reader to send electronic mail to a specific
person or mail alias by including the mailto attribute in a hyperlink.
The format is:
<A HREF="mailto:emailinfo@host">Name</a>
For example, enter:
<A HREF="mailto:[email protected]"> Meena KS</a>
to create a link that opens a mail window already addressed
to Meena KS. (You, of course, will enter another mail address!)
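To see how a program reads anchors, the sketch below uses Python's standard html.parser module to collect the HREF value and link text of every <A> tag in a fragment. The class name and the e-mail address in the sample are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the <A> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <A HREF=...> works.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text between an opening anchor with an HREF and </A> is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = LinkCollector()
collector.feed(
    '<P>Visit the <A HREF="http://www.bic.nus.edu.sg">BIC HomePage</A> or '
    '<A HREF="mailto:someone@example.org">write to us</a>.</P>'
)
links = collector.links
```

Web and mailto links are handled identically here, which reflects the markup: both are ordinary anchors that differ only in the scheme at the start of the HREF value.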
To include an inline image, enter:
<IMG SRC=ImageName>
where ImageName is the URL of the image file.
The syntax for <IMG SRC> URLs is identical to that used in an anchor
HREF. If the image file is a GIF file, then the filename part of ImageName
must end with .gif. Filenames of X Bitmap images must end with .xbm; JPEG image
files must end with .jpg or .jpeg; and Portable Network Graphic files must
end with .png.
Image Size Attributes
<IMG SRC=SelfPortrait.gif HEIGHT=100 WIDTH=65>
Demo:
http://www.ncbi.nlm.nih.gov