Transcript Slide 1
Tools in bioinformatics
Fall 2009-10
1
Overview
Goals
To provide students with practical
knowledge of bioinformatics tools and their
application in research
Prerequisites
The course “Introduction to bioinformatics”
Familiarity with topics in molecular biology
(cell biology, biochemistry, and genetics)
Basic familiarity with computers & internet
2
Administration
Course website
http://ibis.tau.ac.il/intro_bioinfo/tools.html
3
Administration
Classes:
A class will be given every two weeks
There are three class groups:
Sunday 16:00-18:00
Monday 12:00-14:00
Monday 14:00-16:00
Location:
Computer classroom Sherman 03
4
Administration
Teachers:
Nimrod Rubinstein [email protected] (Sundays)
Daiana Alaluf [email protected] (Mondays I)
Osnat Penn [email protected] (Mondays II)
Reception hours:
Email your instructor any question at any
time or set up an appointment (Britania 405,
6409245)
5
Requirements
Assignments – 50% of final grade (compulsory)
Assignments include classwork and homework:
• Classwork is planned to be completed during the lesson and
handed in at the end of it. It will be checked but not graded.
• Homework should be handed in at the following lesson (two weeks
after it is handed out). It will be checked and graded.
Final project – 50% of final grade
When emailing your instructor (a question, your assignment, or
anything else) please state in the “Subject” field: “Tools in Bioinfo”,
IDs, CW/HW number (if relevant)
6
BIOINFORMATICS DATABASES
7
What’s in a database?
Sequences – genes, proteins, etc…
Full genomes
Expression data
Structures
Annotation – information about genes/proteins:
- function
- cellular location
- chromosomal location
- introns/exons
- phenotypes, diseases
Publications
8
NCBI and Entrez
One of the largest and most comprehensive
database collections, belonging to the NIH (National
Institutes of Health, the primary federal agency
for conducting and supporting medical research
in the USA)
Entrez is the search engine of NCBI
Search for :
genes, proteins, genomes, structures, diseases,
publications, and more
http://www.ncbi.nlm.nih.gov
9
PubMed: NCBI’s database of
biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry
of human immunodeficiency virus type 1 envelope glycoprotein trimers
during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
10
Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to help -> Search Field Descriptions and Tags
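A minimal Python sketch of how the fielded query above can be assembled from (value, tag) pairs. The helper name `fielded_query` is hypothetical and not part of any NCBI tool; only the tags [AU], [TI], [DP] and [TA] come from the slide.

```python
def fielded_query(terms, operator="AND"):
    """Join (value, tag) pairs into a PubMed-style fielded query string.

    Hypothetical helper for illustration; it only builds text and
    does not contact PubMed.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported boolean operator: {operator}")
    parts = [f"{value}[{tag}]" for value, tag in terms]
    return f" {operator} ".join(parts)


query = fielded_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
```

The printed string can be pasted into the PubMed search box as is.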
11
Example
Retrieve all publications in which the first
author is: Davidovich C and the last
author is: Yonath A
12
Using limits
Retrieve the publications of
Yonath A, in the journals:
Nature and Proc Natl Acad
Sci U S A., in the last 5 years
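One possible way to write the two retrieval exercises above as query strings, sketched in Python. The tags [1AU] (first author) and [LASTAU] (last author) and the relative date expression are assumptions not shown on the slides; check them against the Search Field Descriptions and Tags help page before relying on them.

```python
# First author / last author exercise.
# Assumption: [1AU] and [LASTAU] are the first- and last-author tags.
first_last = "Davidovich C[1AU] AND Yonath A[LASTAU]"

# Limits exercise: one author, two journals, recent years.
journals = ["Nature", "Proc Natl Acad Sci U S A"]
journal_clause = " OR ".join(f'"{name}"[TA]' for name in journals)

# Assumption: a relative date range can be written as "last 5 years"[DP].
limits = f'Yonath A[AU] AND ({journal_clause}) AND "last 5 years"[DP]'

print(first_last)
print(limits)
```

The parentheses matter: without them the OR between the two journals would not be grouped separately from the author and date conditions.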
13
Google scholar
http://scholar.google.com/
14
15
GenBank: NCBI’s gene & protein
database
GenBank is an annotated collection of all
publicly available DNA sequences (and
their amino-acid translations)
Holds ~106.5 billion bases in ~108.5
million sequence records (Oct. 2009)
16
Search demonstration
Searching NCBI for the protein
human CD4
17
18
Using field descriptions, qualifiers,
and boolean operators
Cd4[GENE] AND human[ORGN]
Or
Cd4[gene name] AND human[organism]
List of field codes:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
Boolean Operators:
AND
OR
NOT
Note: do not use the field Protein name [PROT], only GENE!
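The same fielded query can also be sent programmatically. The sketch below only builds the request URL and makes no network call; it assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned on the slides, and the helper name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities "esearch" endpoint (the slides only show the web interface).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build an esearch URL for a fielded Entrez query (hypothetical helper)."""
    params = {"db": database, "term": term, "retmax": retmax}
    # urlencode escapes the square brackets and spaces of the query
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url("protein", "Cd4[GENE] AND human[ORGN]")
print(url)
```

Opening the printed URL in a browser should return a list of matching record IDs for the protein database.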
19
This time we directly search in the protein database
20
RefSeq
Subcollection of NCBI databases with only nonredundant, highly annotated entries (genomic DNA,
transcript (RNA), and protein products)
21
22
An explanation of GenBank records
23
Fasta format
header line: “>”, then ID/accession, then description
> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
sequence (the lines following the header)
Save accession numbers for future use (makes searching quicker):
RefSeq accession number: NP_000607.1
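A minimal sketch of reading the FASTA layout described above: a header line split into ID and description, followed by sequence lines. The function names are hypothetical, and the example record is truncated to the first two sequence lines of the CD4 entry for brevity.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


def _finish(header, chunks):
    # The first whitespace-delimited token is the ID; the rest is the description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def refseq_accession(identifier):
    """Return the field that follows 'ref' in a gi|...|ref|...| style ID, if any."""
    fields = identifier.split("|")
    if "ref" in fields:
        position = fields.index("ref")
        if position + 1 < len(fields):
            return fields[position + 1]
    return None


# Truncated copy of the record shown on the slide (first two sequence lines only).
example = """> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

identifier, description, sequence = parse_fasta(example)[0]
print(refseq_accession(identifier), description, len(sequence))
```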
24
Downloading
25
Swissprot
A protein sequence database which
strives to provide a high level of annotation
regarding:
* the function of a protein
* domain structure
* post-translational modifications
* variants
One entry for each protein
http://www.expasy.ch/sprot
26
27
GenBank Vs. Swissprot
GenBank results
Swiss-Prot results
28
PDB: Protein Data Bank
Main database of 3D structures of
macromolecules
Includes ~61,000 entries (proteins, nucleic
acids, complex assemblies)
Is highly redundant
http://www.rcsb.org
29
Human CD4 in complex with HIV gp120
PDB ID 1G9M
gp120
CD4
30
Accession Numbers
GenBank / EMBL
- Two letters followed by six digits, e.g.: AY123456
- One letter followed by five digits, e.g.: U12345
RefSeq
- RefSeq accession numbers can be distinguished from GenBank accessions by their prefix [2 characters + underscore], e.g.: NP_015325
- NM_: mRNA transcript, NP_: protein
Swiss-Prot
- Six characters: 1 [O,P,Q] 2 [0-9] 3 [A-Z,0-9] 4 [A-Z,0-9] 5 [A-Z,0-9] 6 [0-9], e.g.: P12345 and Q9JJS7
PDB
- One digit followed by three letters/digits, e.g.: 1hxw
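The accession formats above can be written as regular expressions. This is a sketch that encodes only the patterns listed on the slide (plus an optional ".version" suffix for RefSeq, as in NP_000607.1); real accession formats have more variants, and the function name `classify_accession` is hypothetical.

```python
import re

# Patterns transcribed from the slide; they are a simplification of the real formats.
ACCESSION_PATTERNS = {
    "GenBank/EMBL": re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d+(\.\d+)?"),
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "PDB": re.compile(r"\d[A-Za-z0-9]{3}"),
}


def classify_accession(accession):
    """Return every database whose slide pattern matches the whole string."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for example in ["AY123456", "U12345", "NP_015325", "Q9JJS7", "1hxw", "P12345"]:
    print(example, classify_accession(example))
```

The function returns a list because the slide's patterns overlap: P12345 fits both the one-letter-plus-five-digits GenBank form and the Swiss-Prot form, so the format alone cannot always identify the source database.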
31
GeneCards
All-in-one database of human genes (a
project by the Weizmann Institute)
Attempts to integrate as many databases,
publications, and as much available
knowledge as possible
http://www.genecards.org
32
33
Organism specific databases
Model organisms have independent databases:
HIV database
http://hiv-web.lanl.gov/content/index
34
Summary
General and comprehensive databases:
NCBI, EMBL
Genome specific databases (to be discussed):
UCSC, ENSEMBL
Highly annotated databases:
Human genes
• Genecards
Proteins:
• Swissprot, RefSeq
Structures:
• PDB
35
Just as important:
1. Google (or any search engine)
36
And always remember:
2. RT(F)M - Read the manual!!! (/help/FAQ)
37
GO: Gene Ontology
Gene Ontology strives to provide consistent descriptions of gene products obtained
from different databases
GO annotations include three hierarchical ontologies of gene
products:
cellular component(s) – the environment in which the gene
product functions
biological process(es) – the biological program/pathway in which
the gene product is involved
molecular function(s) – the elemental activities of the gene
product
E.g., cytochrome c:
cellular components: mitochondrial matrix and mitochondrial
inner membrane
biological processes: oxidative phosphorylation and induction of
cell death
molecular functions: oxidoreductase activity
39
AmiGO: the official GO browser
40
42
43
Through NCBI
44
45
Enrichment analysis
Query set: total n genes, of which k have function f
Reference set: total N genes, of which K have function f
Is k/n > K/N, significantly?
46
Statistical significance testing
Problem formulation:
In a group of N genes there are K “special” ones
If we sample n genes out of N (without replacement), and
find k “special” ones, would that be considered a random
outcome?
Mathematically, we use the hypergeometric distribution to
compute the probability of obtaining k or more “special” ones in
a sample of n
\[
p\text{-value} = 1 - \sum_{i=0}^{k-1} f_{HG}(i; N, K, n)
= 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
\]
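A short sketch of the p-value computation described above, using only the Python standard library. The function name and the toy numbers (N=10, K=4, n=5, k=3) are hypothetical and chosen so the result can be checked by hand; exact fractions are used so that the subtraction from 1 does not lose precision for very small p-values.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(k, N, K, n):
    """P(X >= k) for a hypergeometric X: N genes, K special, sample of n.

    Computed as 1 minus the sum of f_HG(i; N, K, n) for i = 0 .. k-1.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("expected 0 <= K <= N, 0 <= n <= N and 0 <= k <= n")
    total = comb(N, n)  # number of ways to choose the sample
    # Number of samples containing fewer than k special genes
    lower_tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k))
    return float(1 - Fraction(lower_tail, total))


# Toy example: 10 genes, 4 special; a sample of 5 contains 3 special ones.
p = enrichment_p_value(k=3, N=10, K=4, n=5)
print(p)
```

For the toy example the upper tail can be counted directly: (4·15 + 1·6) / 252 = 66/252 = 11/42, which is the value the function returns.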
47
48
Materials & Methods
21,121 siRNA
knockdown assays,
literally covering the
entire coding-sequence
part of the genome
49
Results
273 HIV-dependency factors (HDFs) were discovered
Biological processes
50
Subcellular localizations
Molecular functions
51
Observations
Nuclear pore complex components: their loss may
impede HIV nuclear access
Mediator members (Mediator couples TFs to Pol II):
requirement for activators to bind HIV
LTRs
Enzymes involved in glycosylation: HIV’s
envelope protein is heavily glycosylated,
assisting in the virus entry to cells
52