CS177 Lecture 8
Bioinformatics Databases
(and genetic diseases)
Tom Madej 11.01.04
Lecture overview
• Very brief overview of on-line databases.
• Formulating queries in Entrez.
• Example: Molecular biology of diseases.
Bioinformatics Resources
• Reference: Chapter 3 in Sequence – Evolution –
Function, E.V. Koonin and M.Y. Galperin, Kluwer
Academic 2003.
• Available on the NCBI Bookshelf.
Sequence Databases
• GenBank, EMBL, DDBJ; archival (International Nucleotide
Sequence Database Collaboration); sequences have a common
accession
• SWISS-PROT: curated, non-redundant, entries hyperlinked e.g. to
PubMed; TrEMBL: entries not yet ready for SWISS-PROT
• Motifs: PROSITE, BLOCKS, PRINTS
• Domains: Pfam, SMART, ProDOM, COGs (NCBI)
• Motifs/domains: InterPro, CDD (NCBI)
More databases…
• Structure: PDB/RCSB, MMDB (NCBI), SCOP, CATH,
FSSP
• Organism-specific: e.g. E. coli, B. subtilis, Synechocystis
sp. (bacteria); yeast (unicellular eukaryote); Arabidopsis,
C. elegans (WormBase), fruit fly, human
• COGs: clusters of orthologous groups; KEGG:
biochemical pathways; BIND: protein-protein interactions;
ENZYME; LIGAND: enzymes and their substrates
• PubChem (NCBI): chemical substances
PubChem (new)
The (ever expanding) Entrez System
[Diagram of the Entrez database nodes: NLM Catalog, PubChem (Compounds, BioAssays, Substances), PubMed, OMIM, PubMed Central, Journals, Books, 3D Domains, Structure, Taxonomy, CDD/CDART, Protein, Genome, UniSTS, HomoloGene, SNP, UniGene, Gene, GEO/GDS, PopSet, Nucleotide]
Links Between and Within Nodes
[Diagram: computed links within nodes: PubMed abstracts (word weight), 3-D structures (VAST), nucleotide sequences (BLAST), protein sequences (BLAST). Other nodes shown: Taxonomy (phylogeny), Genomes.]
PubMed: Computation of Related Articles
The neighbors of a document are those documents in the database that
are the most similar to it. The similarity between documents is
measured by the words they have in common, with some
adjustment for document lengths.
The value of a term is dependent on Global and Local types of
information:
G - the number of different documents in the database that contain the
term;
L - the number of times the term occurs in a particular document;
Global and local weights
• The global weight of a term is greater for the less
frequent terms. The presence of a term that occurred in most of the
documents would really tell one very little about a document.
• The local weight of a term is the measure of its importance in a
particular document. Generally, the more frequent a term is
within a document, the more important it is in
representing the content of that document.
How we define similar documents
• The similarity between two documents is computed by adding up the
weights (local wt1 × local wt2 × global wt) of all of the terms the two
documents have in common. All results are ranked and the most similar
documents become Related Articles.
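The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not PubMed's actual implementation: the slides do not give the weight formulas, so the local weight (1 + log of the term count) and the global weight (log of N/G) are assumptions chosen to match the stated behaviour, and the adjustment for document length is left out. The function names and the toy documents are made up for the example.

```python
import math
from collections import Counter

def local_weight(count):
    # ASSUMED form: grows (slowly) with the number of times the term occurs
    # in the document (the "L" quantity on the slide).
    return 1.0 + math.log(count)

def global_weight(term, docs):
    # ASSUMED form: log(N / G), where G is the number of documents that
    # contain the term. Rare terms get a large weight; a term present in
    # every document gets weight 0.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def similarity(doc1, doc2, docs):
    # Sum of (local wt1 x local wt2 x global wt) over the shared terms.
    counts1, counts2 = Counter(doc1), Counter(doc2)
    shared = set(counts1) & set(counts2)
    return sum(
        local_weight(counts1[t]) * local_weight(counts2[t]) * global_weight(t, docs)
        for t in shared
    )

def related(index, docs, top=2):
    # Rank all other documents by similarity to docs[index].
    scored = [(similarity(docs[index], d, docs), j)
              for j, d in enumerate(docs) if j != index]
    scored.sort(reverse=True)
    return [j for _, j in scored[:top]]

# Toy "abstracts", already split into words (hypothetical data).
docs = [
    "hfe mutation hemochromatosis iron".split(),
    "hfe iron overload hemochromatosis patients".split(),
    "p53 mutation tumor dna".split(),
    "protein structure dna binding".split(),
]
print(related(0, docs))
```

With these toy documents, document 1 shares three fairly rare terms with document 0 and so ranks first, while document 3 shares none and scores zero.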
Entrez database queries
• The databases are indexed by different sets of terms.
• You can get to a particular DB by selecting it and then entering a
“null” query.
• The “Preview/Index” tab displays the index terms and can be used
to formulate a query (if you can’t remember the syntax for the index).
• “Limits” can be used e.g. to select publications in a specified time
range.
• “Details” shows the interpretation of the query.
Exercises!
• How many protein structures are there that include DNA and are
from bacteria?
• In PubMed, how many articles are there from the journal Science
and have “Alzheimer” in the title or abstract, and “amyloid beta”
anywhere? How many since the year 2000?
• Notice that the results are not 100% accurate!
• In 3D Domains, how many domains are there with no more than two
helices and 8 to 10 strands and are from the mouse?
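As a hedged sketch of the PubMed exercise, the query can also be assembled as a string and sent to the NCBI E-utilities `esearch` service instead of being typed into the web form. The field tags used here (`[Journal]`, `[Title/Abstract]`, `[All Fields]`, `[dp]` for publication date) are the ones I understand PubMed to accept; check them against the "Details" view, since the index names may differ from the 2004 interface. The helper names are made up, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def and_query(*clauses):
    # Entrez combines clauses with the Boolean operator AND (upper case).
    return " AND ".join(clauses)

def esearch_url(db, term):
    # rettype=count asks esearch to report only the number of hits.
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "rettype": "count"})

# Articles from Science with "Alzheimer" in the title or abstract
# and "amyloid beta" anywhere.
base = and_query(
    "Science[Journal]",
    "Alzheimer[Title/Abstract]",
    '"amyloid beta"[All Fields]',
)

# Same query restricted to publication years 2000 onward
# (3000 is used as an open-ended upper bound).
since_2000 = and_query(base, "2000:3000[dp]")

print(base)
print(esearch_url("pubmed", since_2000))
```

Opening the printed URL in a browser returns a small XML document with the hit count, which can be compared with the number shown on the PubMed results page.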
Investigating genetic diseases
• Now we will see examples of how bioinformatics
databases can be used to investigate genetic diseases.
Gene variants that can affect protein function
• Mutation to a stop codon; truncates the protein product!
• Insertion/deletion of multiple bases; changes the sequence of amino
acid residues.
• Single point change could alter folding properties of the protein.
• Single point change could affect the active site of the protein.
• Single point change could affect an interaction site with another
molecule.
Lodish et al. Molecular Cell Biology, W.H. Freeman 2000
Sickle cell anemia
• The first “molecular disease”, i.e. the first genetic disease with a
known molecular basis.
• The most common variant is caused by a Glu6Val mutation in the
Hemoglobin β-chain (HbS). However, there are hundreds of other
mutations that can cause this (OMIM lists 524 variants!).
• This mutation causes the hemoglobin to polymerize; in turn, the red
blood cells form sickle shapes and clump together under low oxygen
conditions or high hemoglobin concentrations.
• Confers some resistance to malaria, by inhibiting parasite growth.
NHLBI web site
Exercise!
• Find an appropriate Hemoglobin structure and view it in
Cn3D.
• Check the position of the Glu6Val mutation.
p53 tumor suppressor protein
• Li-Fraumeni syndrome: having only one functional copy of p53
predisposes to cancer.
• Mutations in p53 are found in most tumor types.
• p53 binds to DNA and stimulates another gene to
produce p21, which binds to another protein, cdk2. This
prevents the cell from progressing through the cell cycle.
G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21 217-228.
Exercise!
• Use Cn3D to investigate the binding of p53 to DNA.
• Formulate a query for Structure that will require the DNA
molecules to be present (there are 2 structures like this).
Important note!
• Most diseases (e.g. cancer) are complex and involve
multiple factors (not just a single malfunctioning protein!).
Investigating a genetic disease…
• The following EST comes from a hemochromatosis patient; your
task is to identify the gene and specific mutation causing the illness,
and why the protein is not functioning properly.
• The sequence:
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
TGGATCAGCCCCTCATTGTGATCTGGG
ESTs
• Expressed Sequence Tags; useful for discovering genes,
obtaining data on gene expression/regulation, and in
genome mapping.
• Short nucleotide sequences (200-500 bases or so)
derived from mRNA expressed in cells.
• The introns from the genes will already be spliced out.
• mRNA is unstable, however, and so it is “reverse
transcribed” into cDNA.
Hemochromatosis 2
• BLAST the EST vs. the Human genome (could take a
few minutes).
- Which chromosome is hit?
- What is the contig that is hit (reference assembly)?
- Is the EST identical to the genomic sequence?
- Take note of the coords of the difference.
• Click on “Genome View”.
• Select the map element at the bottom corresponding to
the contig.
Hemochromatosis 3
• What gene is hit? Zoom in on the BLAST hit a few
times.
• Display the entire gene sequence via “dl” and “Display”.
• Copy and save the genomic sequence.
• Record the coords for the start of the genomic sequence.
Hemochromatosis 4
• Click on a UniGene link Hs.233325.
• Note: Expression profile presents data for the expression
level of the gene in various tissues.
• How many mRNAs and ESTs are there for the HFE
gene?
• Take note of the mRNA accession NM_000410.
Hemochromatosis 5
• Go to “spidey”: http://www.ncbi.nlm.nih.gov/spidey/
• To determine the intron/exon structure, paste the HFE
gene sequence into the upper box, and enter the HFE
mRNA accession NM_000410 in the lower box.
• Click “Align”.
Hemochromatosis 6
• How many exons are there?
• Which exon codes the residue that is changed in the
original EST? (You have to do a little arithmetic!)
• Record some of the protein sequence around the
changed residue: EQRYTCQVEHPG
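The "little arithmetic" can be sketched as follows. Every number below is a made-up placeholder, not the real HFE coordinates: substitute the contig coordinate of the mismatch from the BLAST hit, the recorded start of the saved genomic sequence, and the exon boundaries reported by Spidey. The sketch also assumes the gene lies on the plus strand of the contig and that all coordinates are 1-based and inclusive.

```python
def locate_exon(contig_pos, gene_start_on_contig, exons):
    """Return (exon number, offset within exon) for a contig coordinate,
    or None if the position falls in an intron or outside the gene.

    exons: list of (start, end) positions in the saved gene sequence,
    1-based and inclusive, in the order an aligner would report them.
    """
    # Convert the contig coordinate to a position in the saved gene sequence.
    gene_pos = contig_pos - gene_start_on_contig + 1
    for number, (start, end) in enumerate(exons, start=1):
        if start <= gene_pos <= end:
            return number, gene_pos - start + 1
    return None

# HYPOTHETICAL values for illustration only.
gene_start = 1000                      # contig coordinate of gene base 1
exons = [(1, 200), (501, 760), (1201, 1480)]
mismatch_on_contig = 1620              # where the EST differs from the genome

print(locate_exon(mismatch_on_contig, gene_start, exons))
```

With these placeholder values the mismatch maps to gene position 621, which is base 121 of the second exon.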
Hemochromatosis 7
• From the Map Viewer page click on the HFE gene link.
• How many HFE transcripts are there? Which is the
longest isoform?
• Follow “Links” to “Protein” and then to the report for
NP_000401.
• Determine the residue number that corresponds to the
mutation.
RNA splicing and isoforms
Hemochromatosis 8
• What effect does the mutation in the original EST have
on the protein? (Look at the table for the Genetic Code.)
• Go back to the Gene Report; read the summary and take
note of the GeneRIF bibliography.
• Now go to “Links” and then to “GeneView in dbSNP” to a
list of known SNPs.
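The genetic-code lookup in the first bullet can also be done programmatically. The sketch below translates the EST from the earlier slide in all three forward reading frames using the standard genetic code, finds the frame containing the start of the reference peptide EQRYTCQVEHPG recorded in "Hemochromatosis 6", and reports where the translated EST differs from it. It assumes the EST is on the coding strand and uses the first five reference residues as a search anchor, which is a convenience of this example rather than a general method.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# EST from the "Investigating a genetic disease" slide.
EST = (
    "TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG"
    "TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA"
    "ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT"
    "GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA"
    "TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG"
    "GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC"
    "TGGATCAGCCCCTCATTGTGATCTGGG"
)

REFERENCE = "EQRYTCQVEHPG"  # normal protein sequence, from the slide

def translate(dna, frame=0):
    # Translate complete codons only, starting at offset `frame`.
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def find_window(dna, anchor, length):
    # Try the three forward frames; return (frame, peptide window) for the
    # first frame whose translation contains the anchor.
    for frame in range(3):
        protein = translate(dna, frame)
        at = protein.find(anchor)
        if at != -1:
            return frame, protein[at:at + length]
    return None

frame, window = find_window(EST, REFERENCE[:5], len(REFERENCE))
diffs = [(i, ref, obs)
         for i, (ref, obs) in enumerate(zip(REFERENCE, window)) if ref != obs]
print(window)
print(diffs)
```

The translated window reads EQRYTYQVEHPG: the sixth residue is Tyr (Y) where the reference has Cys (C), so the change in the EST is a missense substitution rather than a stop codon or a frameshift.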
Hemochromatosis 9
• In the SNP list note that the one you want is currently
shown.
• Select “view rs in gene region” and then click on “view
rs”.
• How many nonsynonymous substitutions do you see?
• Do you see the one we are particularly interested in?
Digression: SNPs
• Single Nucleotide Polymorphisms.
• A single base change that can occur in a person’s DNA.
• On average, SNPs occur about 1% of the time; most are outside of
protein coding regions.
• Some SNPs may cause a disease; some may be associated with a
disease; others may affect disposition to a disease; others may be
simple genetic variation.
• dbSNP archives SNPs and other variations such as small-scale
deletion/insertion polymorphisms (DIPs), etc.
Hemochromatosis 10
• Back to the Gene Report, click on “Links” and go to
“OMIM” (can also get there via the Map Viewer).
• In the OMIM entry you can read a bit; also click on “View
List” for Allelic Variants, where you can see the mutation
again.
Hemochromatosis 11
• From the Gene Report again follow “Links” to “Protein”
and scroll down to NP_000401.
• Click on “Domains” and then “Show Details”.
• What is the Conserved Domain in the region of interest?
• Follow the link to the CD.
• Click on “View 3D Structure”.
Hemochromatosis 12
• Look for residue position 282 in the query sequence.
• Highlight that column.
• Is the Cys282 conserved in the family?
• The C282Y mutation therefore likely has the effect of …
Aligning a sequence on a structure with Cn3D (example)
• Example: Use structure 1n3eA, align sequence for 1m5xA.
• In Sequence/Alignment Viewer window select the menu item
“Imports/Show Imports”.
• In the Import Viewer window select the menu item “Edit/Import
Sequences”.
• In the Select Chain dialogue box select 1N3E A and click OK.
• In the Select Import Source dialogue box select “Network via
GI/Accession” and click OK.
• In the Import Identifier dialogue box enter the accession 31615545
and click OK. The new sequence will appear.
• Select “Algorithms/BLAST single” and use the cursor to click
anywhere on the 1m5xA sequence to align it using BLAST.
Aligning a sequence on a structure with Cn3D (example cont.)
• Select the menu item “Alignments/Merge All” to make the new
alignment appear in the Sequence/Alignment Viewer window.
• The alignment should now appear in the Sequence/Alignment
Viewer window; aligned residues will be red.
• Close the Import Viewer window and pick another color style for the
alignment, if desired (e.g. identity).
• You can do this with multiple sequences; especially useful if there is
no CD for the structure.
PDB
PDB File: Header
HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND   5 EC: 5.99.1.2;
COMPND   6 ENGINEERED: YES;
COMPND   7 MUTATION: YES;
COMPND   8 MOL_ID: 2;
COMPND   9 MOLECULE: DNA (5'
COMPND  10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND  11 TP*TP*TP*T)-3');
COMPND  12 CHAIN: C;
COMPND  13 ENGINEERED: YES;
COMPND  14 MOL_ID: 3;
COMPND  15 MOLECULE: DNA (5'
COMPND  16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND  17 TP*TP*TP*T)-3');
COMPND  18 CHAIN: D;
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE   5 MOL_ID: 2;
SOURCE   6 SYNTHETIC: YES;
SOURCE   7 MOL_ID: 3;
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
...
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : X-PLOR 3.1
REMARK   3   AUTHORS     : BRUNGER
…
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
PDB File: Data
ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C
…
Fields, left to right: record name, Atom Number, Atom Name, Residue Name, Chain ID, Residue Number, X, Y, Z, Occupancy, Temperature Factor, element.
Example (atom 1): Atom Number 1, Atom Name N, Residue Name TRP, Chain ID A, Residue Number 203, X 30.156, Y -4.908, Z 37.767, Occupancy 1.00, Temperature Factor 50.81.
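Because PDB ATOM records use fixed columns rather than delimiters, they are usually read by slicing each line at known character positions. The sketch below parses the first ATOM record from the slide (atom 1, the backbone N of TRP A 203). The column boundaries follow the PDB file format as I understand it; the `Atom` type and `parse_atom` function are names invented for this example, and the sample line is assembled field by field so that the column widths are explicit.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int         # atom number
    name: str           # atom name
    res_name: str       # residue name
    chain_id: str       # chain ID
    res_seq: int        # residue number
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float  # temperature factor (B-factor)
    element: str

def parse_atom(line: str) -> Atom:
    # Column numbers in the comments are 1-based, as in the PDB format guide.
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),          # cols 7-11
        name=line[12:16].strip(),        # cols 13-16
        res_name=line[17:20].strip(),    # cols 18-20
        chain_id=line[21],               # col 22
        res_seq=int(line[22:26]),        # cols 23-26
        x=float(line[30:38]),            # cols 31-38
        y=float(line[38:46]),            # cols 39-46
        z=float(line[46:54]),            # cols 47-54
        occupancy=float(line[54:60]),    # cols 55-60
        temp_factor=float(line[60:66]),  # cols 61-66
        element=line[76:78].strip(),     # cols 77-78
    )

# First ATOM record of 1EJ9 as shown on the slide, built field by field.
SAMPLE = (
    "ATOM  " + "    1" + " " + " N  " + " " + "TRP" + " " + "A" + " 203" + "    "
    + "  30.156" + "  -4.908" + "  37.767" + "  1.00" + " 50.81"
    + " " * 10 + " N"
)

atom = parse_atom(SAMPLE)
print(atom)
```

Splitting on whitespace would work for this particular line but fails on real files when adjacent fields run together (for example large coordinates or four-character atom names), which is one reason column justification matters.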
Issues:
Justification
Nomenclature
From Coordinates to Models
1EJ9: Human topoisomerase I
Building the Structure Summary
[Diagram: the structure summary links to Taxonomy, PubMed, Protein, Domains, Nucleotide, 3D Domains]
Creating Sequence Records
One record per chain
Protein: 1EJ9A
Nucleotide: 1EJ9C
Nucleotide: 1EJ9D
Annotating Secondary Structure
1EJ9: Human topoisomerase I
α-Helices
β-strands
coils/loops
Creating 3D Domains
3D Domain 0: 1EJ9A0 = entire polypeptide
Creating 3D Domains
3D Domains: 1EJ9A1, 1EJ9A2, 1EJ9A3, 1EJ9A4, 1EJ9A5
< 3 Secondary Structure Elements
3D Domain Indexing
Entrez
• SDI
• MMDB-ID
• Accession
• MMDB entry date
• Organism
• Domain number
• Cumulative number
PDB
• Accession
• Release date
• Class
• Source
• Description
• Comment
Literature
• Article title
• Author
• Publication date
Counters
• Modified amino acids
• α-Helices
• β-Strands
• Residues
• Molecular weight
Find all viral four helix bundles
4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]
REMEMBER: 3D Domain 0 is the entire polypeptide chain!
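Queries of this kind are just index terms joined with AND, so they are easy to assemble programmatically. The sketch below rebuilds the viral four-helix-bundle query and drafts one for the earlier mouse exercise (no more than two helices, 8 to 10 strands). The `low:high[field]` range form is standard Entrez syntax, but whether the counter fields accept it, and whether "mouse" is the right organism term, are assumptions to verify in the "Preview/Index" tab. The helper names are made up for the example.

```python
def field(value, name):
    # A single indexed term, e.g. 4[helixcount]
    return f"{value}[{name}]"

def count_range(low, high, name):
    # Entrez range syntax low:high[field]; ASSUMED to work for counter fields.
    return f"{low}:{high}[{name}]"

def all_of(*clauses):
    return " AND ".join(clauses)

# Query from the slide: whole chains (domain 0) from viruses
# with exactly four helices and no strands.
viral_four_helix = all_of(
    field(4, "helixcount"),
    field(0, "strandcount"),
    field(0, "domainno"),
    field("viruses", "organism"),
)

# Draft for the exercise: mouse domains with at most two helices
# and 8 to 10 strands.
mouse_exercise = all_of(
    count_range(0, 2, "helixcount"),
    count_range(8, 10, "strandcount"),
    field("mouse", "organism"),
)

print(viral_four_helix)
print(mouse_exercise)
```

Note that the exercise query deliberately omits the `0[domainno]` clause, since it asks about domains rather than entire chains.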