Introduction to
Bioinformatics
Fall 2008
1
Administration
 Adi Doron [email protected]
 Nimrod Rubinstein [email protected]
 Dudu Burstein [email protected]
 Reception hours: by appointment
Britania 405, 6409245
2
Course Website
http://bioinfo.tau.ac.il/~intro_bioinfo/
3
Exercises
 Each student participates once every 2 weeks:
Sunday 16:00-18:00
Monday 12:00-14:00
Monday 14:00-16:00
Computer classroom Sherman 03
4
Requirements
 Exam – 80% of final grade
 Assignments – 20% of final grade (compulsory)

Assignments include class work and homework:
• Class work is planned to be completed during the exercise session. It should be mailed to the TA. It will be checked but not graded.
• Homework should be handed in at the following exercise session (2 weeks after the hand-out date). It will be checked and graded.
5
Goals

 To familiarize the students with research topics in bioinformatics and with bioinformatic tools
 The emphasis will be on tools and their use
Prerequisites
 Familiarity with topics in molecular biology (cell biology and genetics)
 Basic familiarity with computers & internet
6
BIOINFORMATIC DATABASES
7
What’s in a database?

Sequences – genes, proteins, etc.

Full genomes

Annotation – information about the gene/protein:
- function
- cellular location
- chromosomal location
- introns/exons
- protein structure
- phenotypes, diseases

Publications
8
NCBI and Entrez

 One of the largest and most comprehensive databases, belonging to the NIH (National Institutes of Health, USA)
 Entrez is the search engine of NCBI
 Search for: genes, proteins, genomes, structures, diseases, publications and more.
 http://www.ncbi.nlm.nih.gov/
9
Search for published papers

Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
10
Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags
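The field-tagged query above can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_query` (it is not part of any NCBI tool); it only joins value/tag pairs with AND, in the same form as the query on the slide.

```python
def build_query(terms):
    """Join (value, field_tag) pairs into one Entrez-style query string.

    `terms` is a list of (value, tag) tuples, e.g. ("Yang", "AU").
    Hypothetical helper for illustration only; not an NCBI API.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


query = build_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
```

The resulting string can be pasted into the PubMed search box as-is.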
11
Exercise
 Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R
12
Using Limits
Retrieve the publications of Friedman N in the journals Bioinformatics and Journal of Computational Biology, in the last 5 years
13
Google scholar
http://scholar.google.com/
14
15
NCBI gene & protein databases:
GenBank
 GenBank is an annotated collection of all publicly available DNA sequences.
 Holds 65 billion bases (Oct. 2007)
 GenPept is a database of translated coding sequences from GenBank
16
Search demonstration
Searching for CD4 human using
Entrez
17
18
Using Field Descriptions,
Qualifiers, and Boolean Operators

Cd4[GENE] AND human[ORGN]
Or
Cd4[gene name] AND human[organism]

List of field codes:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

Boolean Operators:
AND
OR
NOT
Note: do not use the field Protein name [PROT], only
GENE!
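The same kind of query can be sent to NCBI without the web page. The sketch below is an assumption beyond the slides: it uses NCBI's E-utilities `esearch` endpoint (not mentioned in this lecture) and a hypothetical helper `esearch_url` that only builds the request URL; it performs no network access.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service (assumed here; the slides
# themselves only use the Entrez web interface).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for an Entrez database and a query string.

    Hypothetical helper for illustration; it returns the URL only.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = esearch_url("protein", "Cd4[GENE] AND human[ORGN]")
print(url)
```

`urlencode` takes care of escaping the square brackets and spaces, so the query can be written exactly as on the slide.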
19
20
RefSeq

REFSEQ: sub-collection of NCBI databases with
only non-redundant, highly annotated entries
(genomic DNA, transcript (RNA), and protein
products)
21
22
An explanation of GenBank records
23
Accession Numbers
GenBank / EMBL: two letters followed by six digits, e.g. AY123456, or one letter followed by five digits, e.g. U12345
GenPept (a.a. translations of GenBank): three letters followed by five digits, e.g. AAA12345
RefSeq: distinguished from GenBank accessions by a distinct prefix format of [2 characters + underscore], e.g. NP_015325. NM_: nucleotide, NP_: protein
SWISS-PROT (another protein database): six characters, by position: 1 [O,P,Q] 2 [0-9] 3 [A-Z,0-9] 4 [A-Z,0-9] 5 [A-Z,0-9] 6 [0-9], e.g. P12345 and Q9JJS7
PDB (Protein Data Bank, structure database): one digit followed by three letters or digits, e.g. 1hxw
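The accession formats listed above can be checked with regular expressions. This is a rough sketch under stated assumptions: the patterns cover only the formats given on this slide, the function name `candidate_databases` is hypothetical, and PDB IDs are assumed to allow digits after the first position (as in 1G9M, which appears later in this lecture). Because some formats overlap, the function returns every database whose format matches rather than a single answer.

```python
import re

# One pattern per database, transcribed from the formats on the slide.
PATTERNS = {
    "GenBank/EMBL": re.compile(r"^(?:[A-Z]{2}\d{6}|[A-Z]\d{5})$"),
    "GenPept": re.compile(r"^[A-Z]{3}\d{5}$"),
    "RefSeq": re.compile(r"^[A-Z]{2}_\d+$"),
    "Swiss-Prot": re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$"),
    "PDB": re.compile(r"^\d[A-Z0-9]{3}$"),
}


def candidate_databases(accession):
    """Return the names of all databases whose format matches the accession."""
    acc = accession.strip().upper()  # formats are case-insensitive here
    return [name for name, pattern in PATTERNS.items() if pattern.match(acc)]


for acc in ["AY123456", "U12345", "AAA12345", "NP_015325", "Q9JJS7", "1hxw", "P12345"]:
    print(acc, candidate_databases(acc))
```

Note that P12345 fits both the one-letter GenBank format and the Swiss-Prot format, so the format alone does not always identify the database.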
24
Swissprot
 A protein sequence database which strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
 One entry for each protein
25
26
GenBank Vs. Swiss-Prot
GenBank results
Swiss-Prot results
27
Downloading & Fasta format

Fasta format
>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save Accession Numbers for future use (makes searching quicker):
Refseq: NP_000607
Swissprot: P01730
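A FASTA record like the one above is a header line starting with ">" followed by one or more sequence lines. The sketch below is a minimal parser, assuming a hypothetical function name `parse_fasta`; the accession is pulled out by splitting the header on "|", which assumes the "sp|accession|entry name" layout seen in this Swiss-Prot example. The sample text uses only the first two sequence lines from the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""
header, sequence = parse_fasta(example)[0]
accession = header.split("|")[1]  # assumes the sp|accession|name layout
print(accession, len(sequence))
```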
28
29
PDB: Protein Data Bank
 Main database of 3D structures.
 Includes ~47,000 entries (proteins,
nucleic acids, others).
 Proteins organized in groups, families etc.
 Is highly redundant.
 http://www.rcsb.org
30
CD4 in complex with gp120
gp120
PDB ID
1G9M
CD4
31
Organism specific

Model organisms have independent databases:
HIV database
http://hiv-web.lanl.gov/content/index
32
Genecards
 All-in-one database of human genes (a project of the Weizmann Institute)
 Attempts to integrate as many databases and publications, and as much of the available knowledge, as possible
 http://www.genecards.org
33
34
Summary

General and comprehensive databases:
NCBI, EMBL, DDBJ

Genome specific databases:
ENSEMBL, UCSC genome browser

Highly annotated databases:

Human genes
• Genecards

Proteins:
• Swissprot, Refseq

Structures:
• PDB
35
The MOST important of all
1. Google (or any search engine)
36
And always remember:
2. RT(F)M: Read the manual!!
37
Help!
 Read the Help section
 Read the FAQ section
 Google the question!
38