Transcript Slide 1
Introduction to
Bioinformatics
Fall 2008
1
Administration
Adi Doron [email protected]
Nimrod Rubinstein [email protected]
Dudu Burstein [email protected]
Reception hours:
by appointment
Britania 405, 6409245
2
Course Website
http://bioinfo.tau.ac.il/~intro_bioinfo/
3
Exercises
Each student participates once every 2 weeks:
Sunday 16:00-18:00
Monday 12:00-14:00
Monday 14:00-16:00
Computer classroom Sherman 03
4
Requirements
– 80% of final grade
Assignments – 20% of final grade
(Compulsory)
Exam
Assignments include classwork and homework:
• Classwork is meant to be completed during the
exercise. It should be mailed to the TA. It will
be checked but not graded.
• Homework should be handed in at the following
exercise (2 weeks after the hand-out date). It will
be checked and graded.
5
Goals
To familiarize the students with research topics
in bioinformatics, and with bioinformatic tools
The emphasis will be on tools and their use
Prerequisites
Familiarity with topics in molecular biology
(cell biology and genetics)
Basic familiarity with computers & internet
6
BIOINFORMATIC DATABASES
7
What’s in a database?
Sequences – genes, proteins, etc.
Full genomes
Annotation – information about the gene/protein:
- function
- cellular location
- chromosomal location
- introns/exons
- protein structure
- phenotypes, diseases
Publications
8
NCBI and Entrez
One of the largest and most comprehensive
database collections, belonging to the NIH
(National Institutes of Health, USA)
Entrez is the search engine of NCBI
Search for :
genes, proteins, genomes, structures, diseases,
publications and more.
http://www.ncbi.nlm.nih.gov/
9
Search for published papers
Yang X, Kurteva S, Ren X, Lee S, Sodroski J.
“Subunit stoichiometry of human immunodeficiency virus
type 1 envelope glycoprotein trimers during virus entry
into host cells”, J Virol. 2006 May;80(9):4388-95.
10
Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags
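The fielded query above can also be assembled programmatically. The following is a minimal sketch, not an official client: the helper names `build_query` and `esearch_url` are hypothetical, and it assumes the query is sent to PubMed through NCBI's E-utilities `esearch` endpoint. It only builds the strings and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms):
    """Join (value, field_tag) pairs with AND, e.g. ("Yang", "AU") -> "Yang[AU]"."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def esearch_url(query, db="pubmed"):
    """Return a URL-encoded ESearch request for the given query (hypothetical helper)."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': query})}"


query = build_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
print(esearch_url(query))
```

Keeping each term paired with its tag makes it easy to add or drop a field without retyping the whole query.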
11
Exercise
Retrieve all publications in which the first
author is: Pe'er I and the last author is:
Shamir R
12
Using Limits
Retrieve the publications of
Friedman N, in the journals:
Bioinformatics and Journal of
Computational Biology, in the
last 5 years
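The same restriction can be written as a query string instead of using the Limits page. This is a rough sketch under stated assumptions: `limits_query` is a hypothetical helper, the journal names are passed as full titles with the [TA] tag, and "the last 5 years" is taken to mean 2003 to 2008 for a Fall 2008 course, written as a `first:last[DP]` date range.

```python
def limits_query(author, journals, first_year, last_year):
    """Build a query for one author, any of several journals, and a year range."""
    # Journals are alternatives, so they are joined with OR inside parentheses.
    journal_part = " OR ".join(f'"{name}"[TA]' for name in journals)
    return f"{author}[AU] AND ({journal_part}) AND {first_year}:{last_year}[DP]"


q = limits_query(
    "Friedman N",
    ["Bioinformatics", "Journal of Computational Biology"],
    2003,  # assumed start of "the last 5 years"
    2008,
)
print(q)
```

The parentheses matter: without them the AND and OR terms would be evaluated left to right and the year range would apply to only part of the query.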
13
Google scholar
http://scholar.google.com/
14
15
NCBI gene & protein databases:
GenBank
GenBank is an annotated collection of all
publicly available DNA sequences.
Holds 65 billion bases (Oct. 2007)
GenPept is a database of translated
coding sequences from GenBank
16
Search demonstration
Searching for CD4 human using
Entrez
17
18
Using Field Descriptions,
Qualifiers, and Boolean Operators
Cd4[GENE] AND human[ORGN]
Or
Cd4[gene name] AND human[organism]
List of field codes:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
Boolean Operators:
AND
OR
NOT
Note: do not use the field Protein name [PROT], only
GENE!
19
20
RefSeq
REFSEQ: sub-collection of NCBI databases with
only non-redundant, highly annotated entries
(genomic DNA, transcript (RNA), and protein
products)
21
22
An explanation of GenBank records
23
Accession Numbers
GenBank / EMBL:
- Two letters followed by six digits, e.g.: AY123456
- One letter followed by five digits, e.g.: U12345
GenPept (a.a. translations of GenBank):
- Three letters followed by five digits, e.g.: AAA12345
RefSeq:
- RefSeq accession numbers can be distinguished from
GenBank accessions by their distinct prefix format of
[2 characters + underscore], e.g.: NP_015325
- NM_: nucleotide, NP_: protein
SWISS-PROT (another protein database):
- All are six characters, by position:
1 [O,P,Q] 2 [0-9] 3 [A-Z,0-9] 4 [A-Z,0-9] 5 [A-Z,0-9] 6 [0-9]
- e.g.: P12345 and Q9JJS7
PDB (Protein Data Bank, structure database):
- One digit followed by three letters or digits, e.g.: 1hxw
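The accession formats listed above can be checked mechanically with regular expressions. The sketch below is an illustration only: the patterns are transcribed from this slide rather than from any official specification, the function name `classify` is hypothetical, the RefSeq pattern assumes digits after the underscore, and the PDB pattern allows digits after the first character because IDs such as 1G9M contain them. Some formats overlap (P12345 fits both the one-letter GenBank form and the SWISS-PROT form), so the function returns every matching database instead of a single guess.

```python
import re

# Patterns transcribed from the slide; not an official specification.
PATTERNS = [
    ("GenBank/EMBL", re.compile(r"^(?:[A-Z]{2}\d{6}|[A-Z]\d{5})$")),
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),  # assumption: digits follow the underscore
    ("Swiss-Prot", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    ("PDB", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),  # assumption: letters or digits
]


def classify(accession):
    """Return the names of all databases whose format matches the accession."""
    return [name for name, pattern in PATTERNS if pattern.match(accession)]


for acc in ["AY123456", "U12345", "AAA12345", "NP_015325", "Q9JJS7", "P12345", "1hxw"]:
    print(acc, classify(acc))
```

A format match only narrows down where an identifier may come from; the identifier still has to be looked up in the database to confirm that the record exists.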
24
Swissprot
A protein sequence database which strives
to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
One entry for each protein
25
26
GenBank Vs. Swiss-Prot
GenBank results
Swiss-Prot results
27
Downloading & Fasta format
Fasta format
>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save Accession Numbers for future use (makes searching quicker):
Refseq: NP_000607
Swissprot: P01730
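A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such a record and pull the accession number out of the header. It is a minimal illustration, not a full parser: `parse_fasta` is a hypothetical helper, only the first two sequence lines of the CD4 record are included to keep the example short, and the `db|accession|entry_name` header layout is assumed from the example on this slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated copy of the CD4 record shown on the slide.
example = """>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

header, sequence = parse_fasta(example)[0]
# Assumed header layout: database|accession|entry name, then a free-text description.
database, accession, entry_name = header.split()[0].split("|")
print(accession, entry_name, len(sequence))
```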
28
29
PDB: Protein Data Bank
Main database of 3D structures.
Includes ~47,000 entries (proteins,
nucleic acids, others).
Proteins organized in groups, families etc.
Is highly redundant.
http://www.rcsb.org
30
CD4 in complex with gp120
gp120
PDB ID
1G9M
CD4
31
Organism specific
Model organisms have independent databases:
HIV database
http://hiv-web.lanl.gov/content/index
32
Genecards
All-in-one database of human genes (a
project of the Weizmann Institute)
Attempts to integrate as many databases,
publications and as much available
knowledge as possible
http://www.genecards.org
33
34
Summary
General and comprehensive databases:
• NCBI, EMBL, DDBJ
Genome specific databases:
• ENSEMBL, UCSC genome browser
Highly annotated databases:
Human genes
• Genecards
Proteins:
• Swissprot, Refseq
Structures:
• PDB
35
The MOST important of all
1. Google (or any search engine)
36
And always remember:
2. RT(F)M: Read the manual!!
37
Help!
Read the Help section
Read the FAQ section
Google the question!
38