Themes throughout the course
Introduction to Bioinformatics
Part 1 of 2
M.E:440.714
September 8, 2003
Jonathan Pevsner, Ph.D.
[email protected]
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by John Wiley & Sons, Inc.
These images and materials may not be used
without permission from the publisher. We welcome
instructors to use these powerpoints for educational
purposes, but please acknowledge the source.
The book has a homepage at http://www.bioinfbook.org,
including hyperlinks to the book chapters.
Teaching assistants
Hugh Cahill
Mayra Garcia
Gek Ming Sia
Who is taking this course?
• People with very diverse backgrounds in biology
• People with diverse backgrounds in computer
science and biostatistics
• Most people have a favorite gene, protein, or disease
What are the goals of the course?
• To provide an introduction to bioinformatics with
a focus on the National Center for Biotechnology
Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you
solve research problems
Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Themes throughout the course: textbooks
Several textbooks are available on reserve:
• Baxevanis and Ouellette
• David Mount
• Durbin et al.
I have written a textbook that will appear Oct. 1,
Bioinformatics and Functional Genomics.
The chapters contain content, lab exercises,
and quizzes that were developed in this course.
We will provide chapters as handouts.
Once the book becomes available, we will put
copies on reserve. The book is recommended
(not required).
Themes throughout the course: web sites
The course website is:
http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm
The textbook website is:
http://www.bioinfbook.org
This has 1000 URLs, organized by chapter
The site offers a 15% discount on book purchases
(although the book is not required)
The principal website we will explore is NCBI:
http://www.ncbi.nlm.nih.gov
Themes throughout the course: literature references
You are encouraged to read original source
articles. Although articles are not required,
they will enhance your understanding of the
material.
You can obtain articles through PubMed
and through the WelDoc service at Welch.
Some articles will be available on reserve.
Themes throughout the course: gene/protein families
We will use retinol-binding protein 4 (RBP4) as a model
gene/protein throughout the course. RBP4 is a member
of the lipocalin family. It is a small, abundant carrier
protein. We will study it in a variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
We will also use the Pol protein of HIV-1 as an example.
The HIV-1 pol gene encodes three proteins:
• Aspartyl protease (PR)
• Reverse transcriptase (RT)
• Integrase (IN)
Themes throughout the course: computer labs
There is a computer lab each Friday. This is a chance
to gain practical experience using a variety of
web resources.
You can do the lab on your own if you wish.
However, during the lab you can get help on problems,
and in some cases the computers will have
specialized software.
Grading
30% weekly quizzes (open book)
30% final exam November 13
40% discovery of a novel gene (by Oct. 9)
and phylogenetic tree (by Nov. 13)
extra credit: find a mistake in a database
What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes
using computer algorithms and
computer databases
• Genomics is the analysis of genomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.
Top ten challenges for bioinformatics
[1] Precise models of where and when transcription
will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways;
ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein
recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies:
systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney,
Chris Burge, Jim Fickett
Three perspectives on bioinformatics
• The tree of life
• The organism: time of development; body region, physiology,
pharmacology, pathology
• The cell: DNA, RNA, protein, phenotype
DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
phenotype
There are three major public DNA databases
EMBL
GenBank
DDBJ
The underlying raw DNA sequences are identical
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: housed in Japan
>100,000 species are represented in GenBank
all species: 128,941
viruses: 6,137
bacteria: 31,262
archaea: 2,100
eukaryota: 87,147
The most sequenced organisms in GenBank
Homo sapiens (6.9 million entries)
Mus musculus (5.0 million)
Zea mays (896,000)
Rattus norvegicus (819,000)
Gallus gallus (567,000)
Arabidopsis thaliana (519,000)
Danio rerio (492,000)
Drosophila melanogaster (350,000)
Oryza sativa (221,000)
National Center for Biotechnology
Information (NCBI)
www.ncbi.nlm.nih.gov
PubMed is…
• National Library of Medicine's search service
• 11 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Entrez is a search and retrieval system
that integrates NCBI databases
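Entrez searches can also be scripted. The sketch below is a minimal illustration, not part of the original slides: it assumes NCBI's E-utilities ESearch endpoint and its db, term, and retmax parameters, and the helper name build_esearch_url is hypothetical. It only builds the query URL and sends nothing over the network.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities ESearch endpoint (an assumption; not named in the slides).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (hypothetical helper)."""
    # db selects the Entrez database, term is the query, retmax caps the hits returned.
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_BASE + "?" + urlencode(params)


# The same helper can target different Entrez databases.
pubmed_url = build_esearch_url("pubmed", "lipocalin AND disease")
protein_url = build_esearch_url("protein", "retinol-binding protein 4")
print(pubmed_url)
print(protein_url)
```

Pasting either printed URL into a browser should return an XML list of matching record identifiers, which is roughly what the Entrez web page does behind its search box.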
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU
Books is…
• searchable resource of on-line books
TaxBrowser is…
• browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Four questions we can answer at NCBI (and elsewhere):
[1] How can I do a literature search using PubMed?
[2] How can WelchWeb help?
[3] How can I use Entrez to find information about a
particular gene or protein? (What is an accession number?)
[4] How can I find information about a particular disease?
Question #1: How can I use PubMed at NCBI
to find literature information?
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations
and author abstracts from over 4,000 journals
published in the United States and in 70 foreign
countries.
It has 12 million records dating back to 1966.
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used
for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles
for MEDLINE.
The MeSH controlled vocabulary imposes uniformity
and consistency on the indexing of biomedical literature.
PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use boolean queries
lipocalin AND disease
Try using “limits”
Try “LinkOut” to find external resources
Obtain articles on-line via Welch Medical Library
(and download pdf files):
http://www.welch.jhu.edu/
1 AND 2: lipocalin AND disease (35 results)
1 OR 2: lipocalin OR disease (1,300,000 results)
1 NOT 2: lipocalin NOT disease (350 results)
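The three Boolean operators behave like set operations on the lists of matching articles. Here is a minimal sketch using Python sets; the article IDs are made-up placeholders, not real PubMed identifiers.

```python
# Hypothetical article IDs matched by each single-term search (made-up values).
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease: articles matching both terms
either = lipocalin | disease      # lipocalin OR disease: articles matching at least one term
only_first = lipocalin - disease  # lipocalin NOT disease: lipocalin articles without disease

print(sorted(both))
print(sorted(either))
print(sorted(only_first))

# Every lipocalin article falls into exactly one of the AND and NOT results.
assert len(both) + len(only_first) == len(lipocalin)
```

By the same identity, the counts quoted above (35 for AND, 350 for NOT) would suggest roughly 385 citations matching lipocalin alone, assuming the figures were collected at the same time and not rounded.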
Question #2: How can I use WelchWeb
(from the Welch Medical Library) to do
literature (and other) searches?
WelchWeb is available at http://www.welch.jhu.edu
E-mail gateway
PubMed gateway
Library catalog
Remote access to Welch services
Request literature
Browse journals
Browse databases
WelchWeb URLs of interest
Basic Sciences Subject Guide
http://www.welch.jhu.edu/internet/bsci.html
RAUL (remote access)
http://proxy.hcf.jhu.edu/
Weldoc (Inter Library Loan, and electronic delivery of articles)
http://weldoc.welch.jhmi.edu/weldoc/logon.html
MyWelch (personal library portal)
https://mywelch.welch.jhmi.edu
Welch E-Learning page (online tutorials and hand-outs)
http://www.welch.jhu.edu/classes/elearning/index.html
Johns Hopkins Author Publishing Tool
http://openaccess.jhmi.edu/authors_resource.cfm
Browse Welch E-Resources by Subject
http://www.welch.jhu.edu/eresources/edatabases_subject.cfm
Liaison Librarian Program (every dept has a liaison librarian)
http://www.welch.jhu.edu/liaison/index.html
Thanks to Brian Brown ([email protected]), the
Welch Medical Library liaison to the basic sciences
Visit the Basic Sciences Subject Guide for a long
list of bioinformatics-related sites...
This lecture continues in part 2
with a discussion of more
NCBI resources
http://pevsnerlab.kennedykrieger.org/ppts/lecture_bioinf_ch2.ppt