Transcript Slide 1

Tools in bioinformatics
Fall 2009-10
1
Overview
Goals
• To provide students with practical knowledge of bioinformatics tools and their application in research
Prerequisites
• The course “Introduction to bioinformatics”
• Familiarity with topics in molecular biology (cell biology, biochemistry, and genetics)
• Basic familiarity with computers & internet
2
Administration
Course website
http://ibis.tau.ac.il/intro_bioinfo/tools.html
3
Administration
Classes:
A class will be given every two weeks
There are three class groups:
Sunday 16:00-18:00
Monday 12:00-14:00
Monday 14:00-16:00
Location:
Computer classroom Sherman 03
4
Administration
Teachers:
• Nimrod Rubinstein [email protected] (Sundays)
• Daiana Alaluf [email protected] (Mondays I)
• Osnat Penn [email protected] (Mondays II)
• Reception hours:
Email your instructor with any question at any time, or set up an appointment (Britania 405, 6409245)
5
Requirements

Assignments – 50% of final grade (compulsory)

Assignments include class work (CW) and homework (HW):
• Class work is planned to be completed during the lesson and handed in at the end of it. It will be checked but not graded.
• Homework should be handed in at the following lesson (two weeks after it is handed out). It will be checked and graded.

Final project – 50% of final grade
When emailing your instructor (a question, your assignment, or anything else), please state in the “Subject” field: “Tools in Bioinfo”, your IDs, and the CW/HW number (if relevant)
6
BIOINFORMATICS DATABASES
7
What’s in a database?
• Sequences – genes, proteins, etc.
• Full genomes
• Expression data
• Structures
• Annotation – information about genes/proteins:
- function
- cellular location
- chromosomal location
- introns/exons
- phenotypes, diseases
• Publications
8
NCBI and Entrez

• One of the largest and most comprehensive database collections, belonging to the NIH (National Institutes of Health, the primary federal agency for conducting and supporting medical research in the USA)
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications, and more
http://www.ncbi.nlm.nih.gov
9
PubMed: NCBI’s database of
biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
10
Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags
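A minimal sketch of how a fielded query like the one above could be assembled in a script. The helper name `build_query` is hypothetical; the field tags are the ones shown on this slide.

```python
def build_query(terms):
    """Join (value, field_tag) pairs into a single query using AND."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


query = build_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
```

The printed string can be pasted into the PubMed search box as is.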
11
Example
• Retrieve all publications in which the first author is Davidovich C and the last author is Yonath A
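One possible query for this exercise, as a sketch. It assumes PubMed's first-author tag [1AU] and last-author tag [LASTAU]; these tags are not shown on the slide, so check them against “Search Field Descriptions and Tags” in the help.

```python
first_author = "Davidovich C"
last_author = "Yonath A"

# [1AU] restricts the match to the first author, [LASTAU] to the last author
query = f"{first_author}[1AU] AND {last_author}[LASTAU]"
print(query)
```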
12
Using limits
Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A from the last 5 years
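The same limits can also be written directly into the query, as a hedged sketch. The function name `limits_query` is hypothetical, and the year range 2005:2009 is an assumption (five years ending in the course year), not something stated on the slide.

```python
def limits_query(author, journals, first_year, last_year):
    """Author AND (journal OR journal ...) AND a publication-year range."""
    journal_part = " OR ".join(f'"{name}"[TA]' for name in journals)
    return f"{author}[AU] AND ({journal_part}) AND {first_year}:{last_year}[DP]"


query = limits_query(
    "Yonath A",
    ["Nature", "Proc Natl Acad Sci U S A"],
    2005,  # assumed start of the 5-year window
    2009,  # assumed end of the 5-year window
)
print(query)
```

The parentheses matter: without them, the OR between the journals would not be grouped separately from the AND terms.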
13
Google Scholar
http://scholar.google.com/
14
15
GenBank: NCBI’s gene & protein
database
• GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations)
• Holds ~106.5 billion bases in ~108.5 million sequence records (Oct. 2009)
16
Search demonstration
Searching NCBI for the protein
human CD4
17
18
Using field descriptions, qualifiers,
and boolean operators

Cd4[GENE] AND human[ORGN]
Or
Cd4[gene name] AND human[organism]

List of field codes:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

Boolean Operators:
AND
OR
NOT
Note: do not use the Protein Name field [PROT]; use only [GENE]!
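The query above can also be sent to NCBI from a script. The sketch below only builds the request URL and does not contact the server. It assumes NCBI's E-utilities ESearch endpoint, which is not covered on these slides; `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for the given Entrez database and query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)  # urlencode escapes [ ] and spaces


url = esearch_url("protein", "CD4[GENE] AND human[ORGN]")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record IDs.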
19
This time we search the protein database directly
20
RefSeq

A subcollection of NCBI databases with only nonredundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
21
22
An explanation of GenBank records
23
FASTA format
Header line: “>” followed by the ID/accession and a description
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Sequence: all lines that follow the header line
Save accession numbers for future use (makes searching quicker):
RefSeq accession number: NP_000607.1
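A minimal sketch of reading a record in this format: the header is split into its ID fields and description, and the sequence lines are joined. The function names are hypothetical, and the header layout (gi|...|ref|...| description) is assumed to match the record on this slide. Only the first two sequence lines are used, to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'gi|..|ref|..| description' into an ID dict and a description."""
    id_part, _, description = header.partition(" ")
    fields = id_part.strip("|").split("|")
    # Fields alternate tag, value: gi, 10835167, ref, NP_000607.1
    return dict(zip(fields[0::2], fields[1::2])), description


example = """> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""  # truncated: first two sequence lines of the slide's record

header, sequence = parse_fasta(example)[0]
ids, description = split_header(header)
print(ids["ref"], description, len(sequence))
```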
24
Downloading
25
Swiss-Prot
• A protein sequence database which strives to provide a high level of annotation regarding:
* the function of a protein
* domain structure
* post-translational modifications
* variants
• One entry for each protein
http://www.expasy.ch/sprot
26
27
GenBank vs. Swiss-Prot
GenBank results
Swiss-Prot results
28
PDB: Protein Data Bank
• Main database of 3D structures of macromolecules
• Includes ~61,000 entries (proteins, nucleic acids, complex assemblies)
• Is highly redundant
http://www.rcsb.org
29
Human CD4 in complex with HIV gp120
PDB ID 1G9M
[Figure: structure of the complex, with gp120 and CD4 labeled]
30
Accession Numbers
GenBank / EMBL
Two letters followed by six digits, e.g.: AY123456
One letter followed by five digits, e.g.: U12345
RefSeq
RefSeq accession numbers can be distinguished from GenBank accessions by their prefix [2 characters + underscore], e.g.: NP_015325
NM_: mRNA transcript, NP_: protein
Swiss-Prot
Six characters: 1 [O,P,Q] 2 [0-9] 3 [A-Z,0-9] 4 [A-Z,0-9] 5 [A-Z,0-9] 6 [0-9]
e.g.: P12345 and Q9JJS7
PDB
One digit followed by three letters/digits, e.g.: 1hxw
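The rules on this slide can be turned into a rough classifier, sketched below. The patterns encode only the formats listed here (real databases accept more variants), and `classify_accession` is a hypothetical name. Note that an accession such as P12345 fits both the Swiss-Prot rule and the “one letter + five digits” GenBank rule; this sketch arbitrarily checks Swiss-Prot first.

```python
import re

# Checked in order; the first matching rule wins.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("Swiss-Prot", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    ("GenBank/EMBL", re.compile(r"^([A-Z]{2}\d{6}|[A-Z]\d{5})(\.\d+)?$")),
    ("PDB", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
]


def classify_accession(accession):
    """Return the name of the first database rule that matches, else None."""
    candidate = accession.strip()
    for name, pattern in PATTERNS:
        # The PDB pattern accepts either case; other rules expect uppercase
        target = candidate if name == "PDB" else candidate.upper()
        if pattern.match(target):
            return name
    return None


for acc in ["AY123456", "U12345", "NP_015325", "P12345", "Q9JJS7", "1hxw"]:
    print(acc, "->", classify_accession(acc))
```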
31
GeneCards
• All-in-one database of human genes (a project by the Weizmann Institute)
• Attempts to integrate as many databases, publications, and as much of the available knowledge as possible
http://www.genecards.org
32
33
Organism-specific databases

Model organisms have independent databases:
HIV database
http://hiv-web.lanl.gov/content/index
34
Summary

General and comprehensive databases:
• NCBI, EMBL
Genome-specific databases (to be discussed):
• UCSC, ENSEMBL
Highly annotated databases:
Human genes:
• GeneCards
Proteins:
• Swiss-Prot, RefSeq
Structures:
• PDB
35
Just as important:
1. Google (or any search engine)
36
And always remember:
2. RT(F)M - Read the manual!!! (/help/FAQ)
37
GO: Gene Ontology
Gene Ontology

• Strives to provide consistent descriptions of gene products obtained from different databases
• GO annotations include three hierarchical ontologies of gene products:
- cellular component(s) – the environment in which the gene product functions
- biological process(es) – the biological program/pathway in which the gene product is involved
- molecular function(s) – the elemental activities of the gene product
• E.g., cytochrome c:
- cellular components: mitochondrial matrix and mitochondrial inner membrane
- biological processes: oxidative phosphorylation and induction of cell death
- molecular functions: oxidoreductase activity
39
AmiGO: the official GO browser
40
42
43
Through NCBI
44
45
Enrichment analysis
Query set: n genes in total, k of them with function f
Reference set: N genes in total, K of them with function f
Is k/n > K/N, significantly?
46
Statistical significance testing
Problem formulation:
In a group of N genes there are K “special” ones.
If we sample n genes out of N (without replacement) and find k “special” ones, could that be considered a random outcome?
Mathematically, we use the hypergeometric distribution to compute the probability of obtaining k or more “special” ones in a sample of n:
$$p\text{-value} = 1 - \sum_{i=0}^{k-1} f_{HG}(i; N, K, n) = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$
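A minimal sketch of this test in Python, using only the standard library. The function name `hypergeom_pvalue` and the toy numbers in the usage lines are assumptions for illustration; they are not data from the course.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(at least k special genes in a sample of n), i.e. 1 - sum_{i<k} f_HG(i; N, K, n)."""
    total = comb(N, n)  # number of ways to choose the sample
    # Ways to draw fewer than k special genes (i special, n - i non-special)
    lower = sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n + 1)))
    # (total - lower) / total equals 1 - lower / total, computed on exact integers
    return (total - lower) / total


# Toy example: N = 10 genes, K = 4 special, sample n = 3, observed k = 2
N, K, n, k = 10, 4, 3, 2
p = hypergeom_pvalue(N, K, n, k)
print(f"k/n = {k / n:.2f}, K/N = {K / N:.2f}, p-value = {p:.4f}")
```

Whether a given p-value counts as significant depends on the chosen threshold and, when many functions are tested, on a correction for multiple testing.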
47
48
Materials & Methods
21,121 siRNA knockdown assays, covering essentially the entire coding-sequence part of the genome
49
Results
273 HIV-dependency factors (HDFs) were discovered
Biological processes
50
Subcellular localizations
Molecular functions
51
Observations
• Nuclear pore complex components: their loss may impede HIV nuclear access
• Mediator members (Mediator couples TFs to Pol II): requirement for activators to bind HIV LTRs
• Enzymes involved in glycosylation: HIV’s envelope protein is heavily glycosylated, which assists the virus’s entry into cells
52