Lecture 2 - Pitt CPATH Project


Genomics and Personalized Care
in Health Systems
Lecture 2. Databases
Leming Zhou, PhD
School of Health and Rehabilitation Sciences
Department of Health Information Management
Outline
• Nucleotide and protein sequence databases
– NCBI
• GenBank; RefSeq; dbEST; UniGene
– PDB
• Flybase
• dbSNP
• OMIM and HuGE Navigator
Molecular Biology Databases
• Categories
– Nucleotide Sequence Databases
– Protein Sequence Databases
– Structure Databases
– Metabolic and Signaling Pathways
– Human Genes and Diseases
– Microarray Data and other Expression Databases
– …
• Each database contains specific information
• These databases are interrelated
Nucleotide and Protein
Sequence Databases
NCBI
• Created as a part of the National Library of Medicine
in 1988
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Databases
– Sequence, such as GenBank, RefSeq, dbSNP
– Literature, such as PubMed, OMIM
• Tools
– Entrez, BLAST, Cn3D, etc.
NCBI Homepage
Molecular Databases
• Primary Databases
– Original submissions by experimentalists
– Database staff organize but don’t add additional
information, for instance, GenBank
• Derivative Databases
– Human curated
• Compilation and correction of data
• Example: SWISS-PROT, NCBI RefSeq
– Computationally Derived
• Example: UniGene
GenBank
• http://www.ncbi.nlm.nih.gov/Genbank/
• Nucleotide only sequence database
• GenBank Data
– Direct submissions of individual records (BankIt, Sequin)
– Batch submissions via email (EST, GSS, STS)
– ftp accounts established for sequencing centers
• Data shared nightly amongst three collaborating
databases:
– GenBank
– DNA Database of Japan (DDBJ).
– European Molecular Biology Laboratory Database (EMBL)
Growth of GenBank (1990-2011)
[Figure: GenBank growth by year, 1990-2011. Series: Sequences (million) and Base Pairs (billion).]
GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank/
• Full release every two months
• Incremental and cumulative updates daily
Release 187.0 (12/15/2011)
• Sequences: 146,413,798
• Base Pairs: 135,117,731,375
GenBank Record (Header)
LOCUS       NM_001963               5600 bp    mRNA    linear   PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1,
            mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and
            Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced
            EGFR surface signalling and redistribution into recycling
            endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
   PUBMED   20832399
GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP …
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
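Feature locations such as the CDS span 453..4076 are 1-based and inclusive, so the length of the coding region and of the translated protein follow by simple arithmetic. A minimal Python sketch, assuming only simple `start..end` locations (no joins or complements); the helper name `span_length` is made up for illustration:

```python
def span_length(location: str) -> int:
    """Length of a simple GenBank location such as '453..4076' (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1

cds_nt = span_length("453..4076")   # nucleotides in the coding sequence
codons = cds_nt // 3                # the CDS span includes the terminal stop codon
protein_aa = codons - 1             # residues in the translated protein
print(cds_nt, codons, protein_aa)
```

The same helper applied to the exon spans (for example 1..579) gives the exon lengths.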
GenBank Record (Sequence)
ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
      121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
      181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
      241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
      301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
      361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
      421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
      481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
      541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
      601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
      661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
      721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
      781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
      841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
      901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
      961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
     1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAG
TTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTG
CTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACG
ACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCAC
TGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTC
AGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGT
CCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGAT
ATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
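A FASTA record is one header line starting with '>' followed by one or more sequence lines. The sketch below shows one way to read such records in Python; it is an illustration only, the function name `parse_fasta` is made up here, and only the first two sequence lines of the TP53 record above are used as sample input.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records

example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAG
TTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTG
""".splitlines()

for header, seq in parse_fasta(example).items():
    accession = header.split("|")[3]  # fourth '|'-separated field in this header style
    print(accession, len(seq))
```

Keeping the whole header as the key preserves the description; the accession is pulled out only when needed.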
Too Many Results
Search Limits
Reduced Search Results
Gene Record
RefSeq
• Database of reference sequences
– http://www.ncbi.nlm.nih.gov/RefSeq/
• Curated
– Many experimentally validated
– Some partially validated via ESTs
– Some computationally predicted
• Non-redundant; one record for each gene, or each splice
variant, from each organism represented
• Status Codes:
– Provisional (temporary)
– Reviewed
– Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence or
other record relevant to molecular data
• RefSeq provides an expertly curated accession number
that corresponds to the most stable, agreed-upon
“reference” version of a sequence.
• RefSeq identifiers include the following formats:
– Complete chromosome: NC_######
– Genomic contig: NT_######
– mRNA (DNA format): NM_######, XM_######
– Protein: NP_######, XP_######
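The prefix of a RefSeq accession identifies the kind of molecule, which makes accessions easy to sort programmatically. A small Python sketch under that assumption; the prefix table holds only the formats listed here, and the function name `classify_refseq` and the regular expression are illustrative choices, not an NCBI tool:

```python
import re

# Prefix meanings taken from the list of RefSeq identifier formats.
REFSEQ_PREFIXES = {
    "NC": "complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA",
    "XM": "mRNA",
    "NP": "protein",
    "XP": "protein",
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def classify_refseq(accession: str):
    """Return (molecule type, version or None); raise ValueError if unrecognized."""
    match = _ACCESSION.match(accession)
    if not match or match.group(1) not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognized RefSeq accession: {accession!r}")
    version = int(match.group(3)) if match.group(3) else None
    return REFSEQ_PREFIXES[match.group(1)], version

print(classify_refseq("NM_001963.4"))
print(classify_refseq("NP_001954.2"))
```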
EST
EST
• mRNA: Genomic regions actively transcribed in cell
• cDNA (complementary DNA)
– Copy of mRNA using mRNA as a template
– Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a
transcribed cDNA sequence)
– Partial cDNA sequence
– Can be 5’ or 3’
– Typical size: 200 - 500 bp
– Represents mRNA actively transcribed in cell
– Used to identify
• Genes; Alternative splicing; etc.
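Because a cDNA strand is copied from an mRNA template, its sequence is the reverse complement of the mRNA, with T in place of U. A minimal Python sketch of that relationship; the function name `first_strand_cdna` is made up for illustration, and the input is the first three codons of the EGF coding region written as RNA:

```python
# RNA base -> complementary DNA base
_RNA_TO_CDNA = str.maketrans("AUGC", "TACG")

def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5'->3'."""
    return mrna.upper().translate(_RNA_TO_CDNA)[::-1]

print(first_strand_cdna("AUGCUGCUC"))  # GAGCAGCAT
```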
dbEST (release 120111, Dec. 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of Entries: 71,276,166
– Homo sapiens (human): 8,315,294
– Mus musculus (mouse): 4,853,562
– Arabidopsis thaliana (thale cress): 1,529,700
– Danio rerio (zebrafish): 1,488,275
– Drosophila melanogaster (fruit fly): 821,005
– Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of
GenBank, available from NCBI by anonymous ftp and
through Entrez
• The nucleotide sequences may be searched using the
BLAST server
• EST sequences are also available as a flat file in the FASTA
format by anonymous ftp in the /repository/dbEST
directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of A Protein
Protein Databases
• Proteins have structure and function
• InterPro: Protein families and domains
http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR):
http://pir.georgetown.edu/
• SWISS-PROT/TrEMBL curated protein sequences:
http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs,
domains) which may have functional significance
• Databases exist to store protein families, motifs,
and structural domains
• CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Pfam: http://www.sanger.ac.uk/Software/Pfam/
• PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available due to techniques
such as NMR and X-Ray crystallography
– PDB http://www.pdb.org/
– SCOP http://scop.mrc-lmb.cam.ac.uk/scop
– MMDB http://www.ncbi.nlm.nih.gov/Structure/
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide
depository of information about the 3D structures of large
biological molecules, including proteins and nucleic acids.
• Understanding the shape of a molecule helps to
understand how it works.
• As of January 2010, there are 62,787 searchable structures
in the PDB database
• PDB provides
– Sequence, Atomic Coordinates, Derived geometric data, Secondary
structure content, Annotations about protein literature references
PDB Statistics
[Figure: Yearly Growth of Total Protein Structures in PDB. Series: Yearly and Total.]
Source: http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are
99.9% identical.
• The 0.1% difference is due to genetic variations, and
mainly (~90%) one form of variation called Single
Nucleotide Polymorphisms (SNPs, single-base variations).
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among
different individuals
• Genetic variations determine our predisposition to
diseases and responses to drugs, therapies, and
environmental insults such as bacteria, viruses, and
chemicals
• Genetic variations reveal clues of ancestral human
migration history
Major Types of Genetic Variations
• Single nucleotide mutation
– Majority of SNPs do NOT directly contribute to any phenotypes
• Insertion or deletion of one or more nucleotides
– Tandem repeat polymorphisms (genomic regions of variable length in which a sequence
motif, usually 1-100 bases long, repeats in tandem with variable copy number)
• Used as genetic markers for DNA fingerprinting (forensic, parentage testing)
• Many cause genetic diseases
– Insertion/deletion polymorphisms (often resulting from localized
rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
– Deletions, inversions, or translocations of large DNA fragments
– Often causing serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is
defined by allele frequency
– A single base change, occurring in a population at a frequency of
>1% is termed a single nucleotide polymorphism (SNP)
– When a single base change occurs at <1% it is considered to be a
mutation
• A SNP is a polymorphic position where the point mutation
has been fixed in the population
• In practice, however, SNP databases contain multiple
types of variations, including SNPs, mutations, insertions,
deletions, tandem repeats, copy number variations, etc.
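The frequency-based terminology above can be written as a one-line rule. The Python sketch below is only an illustration: the function name is made up, and because the definition covers ">1%" and "<1%" but not exactly 1%, grouping the boundary value with SNPs is an assumption of this example.

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single-base change by its population frequency (0.0 to 1.0)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: exactly 1% is treated as a SNP; the slide leaves it undefined.
    return "SNP" if allele_frequency >= 0.01 else "mutation"

print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```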
SNPs
• SNPs can occur anywhere in a genome; they are classified
based on their locations.
– Many SNPs are in genomic, non-coding regions
– SNPs in gene regions, including promoter, coding, intronic,
exonic, and UTR regions
• Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly
affected by the location where it occurs (gene or non-gene), as well as the nature of the mutation (synonymous
or non-synonymous)
– No consequence
– Affect gene transcription quantitatively or qualitatively
– Affect gene translation quantitatively or qualitatively
– Change protein structure and functions
– Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often
caused by mutations in a single gene
– e.g. Huntington’s, Cystic fibrosis, etc.
• Many complex diseases are the result of mutations in
multiple genes, the interactions among them, and their
interactions with environmental factors
– e.g. cancers, heart disease, Alzheimer's, diabetes, asthma,
obesity, etc.
Sickle Cell Anemia
• Due to a single-base substitution (an A swapped for a T), which causes
valine to be incorporated instead of glutamic acid in hemoglobin
1 Normal red blood cells; 2 Sickled red blood cells
http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm
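The effect of this single-base change can be traced at the codon level. In the standard genetic code GAG encodes glutamic acid and GTG encodes valine, so replacing the middle A with a T changes the amino acid. A minimal Python sketch; the two-entry codon table and the helper name `substitute` are illustrative only, and the codons are standard genetic code facts rather than content from the slide:

```python
# Only the two codons needed for this example; a full table has 64 entries.
CODON_TO_AMINO_ACID = {"GAG": "Glu (glutamic acid)", "GTG": "Val (valine)"}

def substitute(codon: str, position: int, base: str) -> str:
    """Return the codon with the base at 0-based `position` replaced."""
    return codon[:position] + base + codon[position + 1:]

normal = "GAG"
sickle = substitute(normal, 1, "T")  # A -> T at the middle position
print(normal, "->", CODON_TO_AMINO_ACID[normal])
print(sickle, "->", CODON_TO_AMINO_ACID[sickle])
```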
A Few Relevant Concepts
• Allele: A specific “version” of a gene, or one of the alternative
DNA sequences at the same physical locus, which may or
may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism,
or an individual
• Genotyping: the process of identifying what genotype a
person has at a given locus (or loci)
Genetic Variations Databases
• dbSNP
– http://www.ncbi.nlm.nih.gov/SNP/
• Online Mendelian Inheritance in Man (OMIM)
– http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
– http://www.hapmap.org/
• Genome Variation Server (Seattle SNPs)
– http://gvs.gs.washington.edu/GVS/
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes:
– Single-base nucleotide substitutions (single nucleotide polymorphisms,
SNPs)
• Roughly 10 million in the human population, or on average 1 per 300 bp
• Less than half of these SNPs are identified and stored in the database
– Microsatellite repeat variations (short tandem repeats, STRs)
• In silico estimates put the number of potentially polymorphic variable number tandem
repeats (VNTRs) at over 100,000 across the human genome
– Small-scale multi-base deletions or insertions
• The short insertions/deletions are difficult to quantify, and the number is likely
to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records:
– Submitted record
• The original observations of sequence variation; submitted SNP (SS)
record IDs start with ss
– Computationally annotated record
• Generated during the dbSNP "build" cycle by computation based on the
original submitted data; Reference SNP Cluster (refSNP) IDs start with
rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic
ATAAACATGG
ATCAAGTCAT
CTTTGAGGAA
TTCCAAGTAC
TTTAAAGRAG
R
CTCAAGCAAT
GTATTAATGA
AGAAACAGAG
ACCTGAGGTC
AAATAAAAAA
TTCTCTCCAT
ACTTTTACAA
YTGTTAAAAC
CATTCAATRT
AGTGAGCACA
CCA
AACCCATATC
TAAATGTAAG
CACCTGAAAG
ATTAGCCGTA
GTATACCACC
AAAAATCTGC
AGAAATGGGA
ATAACATTAG
ACTTTTTCCC
TAGAGGAAAA
AATGAGAACA
AGAAAATGTT
ATTAATGAAG
AATAGGTTCC
GGCCAAAATT
TATAAACAAA
GCAAGAATAT
A
TAGGTTCCAG
AGTGATGAAA
GAATGCTATG
GTCTTCCTGG
GAAGAAGTAG
TACTAATGAA
ACATTCAAGC
CTTAGATTAG
AAGTAATTGT
TTCAGACTGT
GTGGGCTCCA
AGAACTAGGT
GGGTTTTGCA
AAGCATCCTG
TAATACAGAT
International Union of Pure and Applied Chemistry
(IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
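The IUPAC table maps directly onto a lookup dictionary, which is how the single ambiguity letter in a dbSNP flanking sequence relates to the `alleles` field of its header. A minimal Python sketch; the names `IUPAC`, `BASES_TO_CODE` and `alleles_to_code` are made up for this illustration:

```python
# IUPAC nucleotide code -> the bases it stands for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: a set of bases -> its one-letter code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def alleles_to_code(alleles: str) -> str:
    """Convert a dbSNP-style allele string such as 'A/G' to its IUPAC code."""
    return BASES_TO_CODE[frozenset(alleles.upper().split("/"))]

print(alleles_to_code("A/G"))  # R
print(alleles_to_code("A/C"))  # M
```

Using `frozenset` keys makes the lookup independent of allele order, so 'G/A' and 'A/G' give the same code.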
Different Ways to Search SNPs in dbSNP
• dbSNP web site
– Direct search of SS record; batch search; allow SNP record submission;
No search limit
• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), range (:), AND, OR, and NOT operators
Each example is followed by its description:
• BRC*[Gene Name]
– Search SNPs on all genes with names starting with the letters 'BRC' (i.e. BRCA1 and
BRCA2)
• 1[CHR] AND (frameshift[Function_Class])
– Search SNPs located on chromosome 1 with function class 'frame-shift'
• 1[CHR] OR 2[CHR]
– Search all SNPs on chromosome 1 or 2
• 1[CHR] OR 2[CHR] NOT unknown[METHOD]
– Search all SNPs on chromosome 1 or 2 detected by all methods except 'unknown'.
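Queries like these can also be submitted programmatically through the NCBI E-utilities ESearch service. The Python sketch below only builds the request URL and sends nothing; the function name `snp_search_url` is made up, and whether the SNP index still accepts these exact field names is an assumption to check against the current Entrez documentation.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def snp_search_url(term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch URL for the SNP database."""
    return ESEARCH + "?" + urlencode({"db": "snp", "term": term, "retmax": retmax})

term = "1[CHR] AND (frameshift[Function_Class])"
url = snp_search_url(term)
print(url)
```

`urlencode` takes care of escaping the brackets, parentheses and spaces in the query term.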
Legend in Results
Search dbSNP: Example
• Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference
SNPs for BRCA1 from dbSNP
• Starting from the Entrez SNP:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles='A/C'|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
Fasta header line starts with '>' and has fields separated by '|'. Each field is
explained below:
– gnl: Internal use.
– dbSNP: Database name.
– ss or rs number: dbSNP accession for the snp. ss# refers to a submitted snp accession; rs# refers
to the accession of a refSNP cluster of one or more submitted snps.
– allelePos: Variation allele position (1-based) on the fasta. It is always the 5' length plus 1.
– len, totalLen: Total number of bases of the fasta sequence, a sum of the lengths of the 5' flank, the 3' flank
and the variation. The variation is expressed as one IUPAC code and has a length of 1 in
the totalLen calculation.
– handle|submitted_snp_id: Only for submitted snps. The two fields after "totalLen" are the submitter
handle and the submitter snp id.
– taxid: NCBI taxonomy id.
– mol: Molecular source of the sequence. Valid values are: genomic, cDNA or
mitochondria.
– snpclass: Variation class of the snp; the most common value is 1 (single nucleotide
polymorphism).
– alleles: Lists alleles of the snp separated by '/'.
– Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as
low-complexity or repetitive elements.
– ATCG in green: Green color is used for assay sequence (observed by the submitter).
– ATCG in black: Black color is used for flank sequence (extracted from sequence databases).
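The header fields described above can be split mechanically, and the allelePos rule (5' length plus 1) gives the flank lengths. A minimal Python sketch using the rs799916 header from the SNP Location slide; the function name `parse_snp_header` and the dictionary keys `source`, `database` and `accession` are illustrative choices. Fields without '=' (such as the submitter handle and id on ss records) are skipped in this simplified version.

```python
def parse_snp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its '|'-separated fields."""
    fields = header.lstrip(">").split("|")
    record = {"source": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'")
    return record

header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
rec = parse_snp_header(header)

five_prime = int(rec["allelePos"]) - 1                # allelePos = 5' length + 1
three_prime = int(rec["totalLen"]) - five_prime - 1   # the variation counts as 1 base
print(rec["accession"], rec["alleles"], five_prime, three_prime)
```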
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease centric databases:
• OMIM: http://www.ncbi.nlm.nih.gov/omim/
• CDC HugeNavigator: http://hugenavigator.net/
• HGMD: https://portal.biobaseinternational.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies:
http://www.genome.gov/26525384
NCBI—OMIM
Online Mendelian Inheritance in Man
(OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a human genetic disorders database built and
curated using results from published studies
• Each OMIM record provides a summary of the current
state of knowledge of the genetic basis of a disorder, which
contains the following information:
– description and clinical features of a disorder or a gene involved in genetic
disorders; biochemical and other features; cytogenetics and mapping;
molecular and population genetics; diagnosis and clinical management;
animal models for the disorder; allelic variants.
• OMIM is searchable via NCBI Entrez, and its records are
cross-linked to other NCBI resources.
OMIM: Variant
• The OMIM database includes genetic disorders caused by
various mutations/variations, from SNPs to large-scale
chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number, and
can be searched in two ways
– Search for a gene or a disease, when retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
– Criteria for inclusion include: the first mutation to be discovered, high population
frequency, distinctive phenotype, historic significance, unusual mechanism of
mutation, unusual pathogenetic mechanism, and distinctive inheritance.
• Most of the variants represent disease-producing
mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a
positive statistical correlation with particular common
disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the
corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics
(OPHG) in 1997.
• OPHG aims to integrate genomics into public health research,
policy, and programs. Doing so could improve interventions
designed to prevent and control the country’s leading chronic,
infectious, environmental, and occupational diseases.
• OPHG's efforts focus on
– conducting population-based genomic research
– assessing the role of family health history in disease risk and
prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease
prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™) :
– Established to help translate genetic research findings into opportunities for
preventive medicine and public health by advancing the synthesis, interpretation,
and dissemination of population-based data on human genetic variation in health
and disease.
• HuGENet™ resources:
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case
studies, Book
• HuGE Navigator provides access to a continuously updated knowledge
base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding Gene’s Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Description of diseases and what is known about
them is stored
– OMIM: http://www.ncbi.nlm.nih.gov/Omim/
– Tumor Gene Family Databases: http://www.tumorgene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper about a genetic disease (or a
disease with a known genetic basis) that uses the research method
“Genome-Wide Association Study” (GWAS). Candidate journals
include “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease
could be obesity, type II diabetes, etc.
• Choose one or a few SNPs or genes mentioned in the paper
that may be highly relevant to the disease; search for the corresponding
records in dbSNP. Summarize the genetic variation, relevant
genes, and location of the variation.
• Search for the GenBank record of the relevant genes, extract and save
their sequences in FASTA format. Identify the corresponding protein
sequences and use Cn3D to display the structure of the protein.
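For the FASTA step, one scripted alternative to the web interface is the NCBI E-utilities EFetch service. The Python sketch below only constructs the request URL; the function name `fasta_url` is made up, the EGF accession from the lecture is a placeholder for whatever gene the chosen paper discusses, and downloading through a browser works equally well.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fasta_url(accession: str) -> str:
    """Build (but do not send) an EFetch URL returning a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

print(fasta_url("NM_001963.4"))
```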