Database Modeling in Bioinformatics

Download Report

Transcript Database Modeling in Bioinformatics

GENOME ANNOTATION
AND FUNCTIONAL
GENOMICS
The protein sequence
perspective
GENOME ANNOTATION
• Two main levels:
– STRUCTURAL ANNOTATION – Finding genes
and other biologically relevant sites thus building up a
model of genome as objects with specific locations
– FUNCTIONAL ANNOTATION – Objects are used
in database searches (and expts) aim is attributing
biologically relevant information to whole sequence
and individual objects
WHY PROTEIN RATHER THAN DNA?
•
•
•
•
•
•
•
•
Larger alphabet -more sensitive comparisons
Protein sequences lower signal to noise ratio
Less redundancy and no frameshifts
Each aa has different properties like size, charge etc
Closer to biological function
3D structure of similar proteins may be known
Evolutionary relationships more evident
Availability of good, well annotated protein sequence and
pattern databases
Large-scale genome analysis projects
• Rate-limiting step is annotation
• Whole genome availability provides context
information
• Main goal is to bridge gap between genotype and
phenotype
Definitions of Annotation
• Addition of as much reliable and up-to-date
information as possible to describe a sequence
• Identification, structural description,
characterisation of putative protein products and
other features in primary genomic sequence
• Information attached to genomic coordinates with
start and end point, can occur at different levels
• Interpreting raw sequence data into useful
biological information
ANNOTATION/FUNCTION CAN BE
MAPPED TO DIFFERENT LEVELS:
 ORGANISM -phenotypic function (morphology,
physiology, behaviour, environemntal response), context NB
 CELLULAR -metabolic pathway, signal cascades, cellular
localisation. Context dependent
 MOLECULAR -binding sites, catalytic activity, PTM, 3D
structure
 DOMAIN
 SINGLE RESIDUE
Annotation is the description of:
•
•
•
•
•
•
•
Function(s) of the protein
Post-translational modification(s)
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Disease(s) associated with deficiencie(s) in the
protein
• Sequence conflicts, variants, etc.
Additional information for proteins
•
•
•
•
•
•
•
•
•
ALTERNATIVE PRODUCTS
CATALYTIC ACTIVITY
COFACTOR
DEVELOPMENTAL STAGE
DISEASE
DOMAIN
ENZYME REGULATION
FUNCTION
INDUCTION
•
•
•
•
•
•
PATHWAY
PHARMACEUTICALS
POLYMORPHISM
PTM
SIMILARITY
SUBCELLULAR
LOCATION
• SUBUNIT
• TISSUE SPECIFICITY
Amino-acid sites are:
•
•
•
•
•
•
•
•
Post-translational modification of a residue
Covalent binding of a lipidic moiety
Disulfide bond
Thiolester bond
Thioether bond
Glycosylation site
Binding site for a metal ion
Binding site for any chemical group (co-enzyme,
prosthetic group, etc.)
Regions:
•
•
•
•
•
•
SIGNAL SEQUENCE
TRANSIT PEPTIDE
PROPEPTIDE
CHAIN
PEPTIDE
DOMAIN
•
•
•
•
•
ACTIVE SITE
DNA BIND SITE
METAL BIND SITE
MOLECULE BIND SITE
TRANSMEMBRANE
Annotation sources:
• publications that report new sequence data
• review articles to periodically update the
annotation of families or groups of proteins
• external experts
• protein sequence analysis
Approaches to functional annotation:
 Automatic annotation (sequence homology, rules, transfer info
from pdb)
 Automatic classification (pattern databases, clustering,
structure)
 Automatic characterisation (functional databases)
 Context information (comparitive genome analysis, metabolic
pathway databases)
 Experimental results (2D gels, microarrays)
 Full manual annotation (SWISS-PROT style)
PROTEIN SEQUENCE ANALYSIS
• Protein sequence can come from gene predictions,
literature or peptide sequencing
• Analysis on different levels:
– molecular
– cellular
– organism
• Simplest case- match for whole sequence in databasedetermination of structure and function
• In between- partial matches across sequence to diverse
or hypothetical proteins
• Difficult case- no match, have to derive information
from amino acid properties, pattern searches etc
From
sequence
to function
Predicting function from sequence similarity
• Orthologues- arose from speciation, same gene in
different organisms -can have <30% homology
• Paralogues- from duplication within a genome, second
copy may have new or changed function
(difficult to distinguish between otho- and paralogues unless whole
genome is available)
• Equivalog- proteins with equivalent functions
• Analog- proteins catalyzing same reaction but not
structurally related
• Some enzymes may have seq similarity simply because
common catalytic site, substrate, pathway.
TYPES OF HOMOLOGY
Superfamily
PROTEIN/DOMAIN
Duplication within species
Paralogues
may have different
functions
A
B
Speciation
Orthologues
may have different
functions, if same -
Equivalogs
B1
B2
Sequence homology in genomes
When you do a whole genome BLAST search there
is a general pattern of results:
Maverick genes shared with
some other species
Common genes
Incorrect
predictions
Maverick genes
unique function
Maverick genes tend to diverge more frequently
than core genes
Using homology information for
automatic annotation- automatic
annotation of TrEMBL as an example
Requirements for automatic annotation
• Well-annotated reference database (eg SWISS-PROT)
• Highly reliable diagnostic protein family signature
database with the means to assign proteins to groups
(eg CDD, InterPro, IProClass)
• A RuleBase to store and manage the annotation rules,
their sources and their usage
Direct Transfer
XDB
• Search target
• Transfer annotation to
target database
• Example:
Target
FASTA against sequence
database and transfer of
DE line of best hit
Multiple Sources
• Usually more than one
external database is used
• Combine the different
results
XDB
Target
Conflicts
•
•
•
•
Contradiction
Inconsistencies
Synonyms
Redundancy
Translation
• Use a translator to map
XDB language to target
language
XDB
Target
Translation Examples
• ENZYME TrEMBL
CA L-ALANINE=D-ALANINE
CC -!- CATALYTIC ACTIVITY: L-ALANINE=
CC
D-ALANINE.
• PROSITE TrEMBL
/SITE=3,heme_iron
FT METAL
IRON
• Pfam TrEMBL
FT DOMAIN
zf_C3HC4
FT ZN_FING
C3HC4-TYPE
•
•
•
•
•
•
Demands on a system for
automated data analysis and
annotation
Correctness
Scalability
Updateable
Low level of redundant information
Completeness
Standardized vocabulary
What do we have?
•
•
•
•
SWISS-PROT
RuleBase
TrEMBL
PROSITE (and Pfam, PRINTS, ProDom, SMART, Blocks
etc)
• SWISS-PROT/TrEMBL/RuleBase in Oracle
Standardized transfer of annotation
from characterized proteins in
SWISS-PROT to TrEMBL entries
• TrEMBL entry is reliably recognized by a given
method as a member of a certain group of proteins
• corresponding group of proteins in SWISS-PROT
shares certain annotation
• common annotation is transferred to the TrEMBL
entry and flagged as annotated by similarity
Automatic annotation information flow
• Get information necessary to assign proteins to groups eg
using InterPro or other biological or family informationstore in RuleBase
• Group proteins in SWISS-PROT by these conditions
• Extract common annotation shared by all these proteinsstore in RuleBase
• Group unannotated sequences by the conditions
• Transfer common annotation flagged with evidence tags
• Note: can add taxonomic constraints
Extract Reference Entries
• Use XDB to extract entries
from standard database
• Example:
Pfam
Pfam:PF00509 Hemagglutinin
SWISS-PROT
TrEMBL
HEMA_IAVI7/P03435
HEMA_IANT6/P03436
HEMA_IAAIC/P03437
HEMA_IAX31/P03438
HEMA_IAME2/P03439
HEMA_IAEN7/P03440
HEMA_IABAN/P03441
HEMA_IADU3/P03442
HEMA_IADA1/P03443
HEMA_IADMA/P03444
HEMA_IADM1/P03445
HEMA_IADA2/P03446
HEMA_IASH5/P03447
Extract Common Annotation
132
131
125
6
131
130
130
125
125
75
31
131
102
1
130
107
102
entries read
ID
HEMA_XXXXX
DE
HEMAGGLUTININ PRECURSOR.
DE
HEMAGGLUTININ.
GN
HA
CC
-!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
CC
VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
CC
-!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
CC
CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
DR
HSSP; P03437; 1HGD.
DR
HSSP; P03437; 1DLH.
KW
HEMAGGLUTININ; GLYCOPROTEIN; ENVELOPE PROTEIN
KW
SIGNAL
KW
COAT PROTEIN; POLYPROTEIN; 3D-STRUCTURE
FT
CHAIN
HA1 CHAIN.
FT
CHAIN
HA2 CHAIN.
FT
SIGNAL
Store Common Annotation
• Store the used conditions
and the extracted common
annotation in a separate
database
XDB
SWISS-PROT
TrEMBL
RuleBas
e
RULES
• Rules describe:
– the content of the annotation to be transferred
(ACTIONS),
– the CONDITIONS which the target TrEMBL entry
must fulfill in order to allow transfer of the annotation.
• Rules uniquely describe or delineate a set of SWISSPROT entries.
– The common annotation in these entries is transferred
to TrEMBL.
//
#RULE RU000482
#DATE 2001-01-11
#USER OPS$WFL
#PACK PROSITE
?PSAC PS00449
ACTIONS
?EMOT PS00449 }
CONDITIONS
!ECNO 3.6.1.34
!SPDE ATP synthase A chain
!CCFU KEY COMPONENT OF THE PROTON CHANNEL; IT MAY PLAY A DIRECT ROLE IN
THE TRANSLOCATION OF PROTONS ACROSS THE MEMBRANE (BY SIMILARITY)
!CCSU F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND CF(0)
- THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3), BETA(3),
GAMM A(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B AND C (BY
SIMILARITY)
!CCLO INTEGRAL MEMBRANE PROTEIN (By Similarity)
!CCSI TO THE ATPASE A CHAIN FAMILY
!SPKW CF(0)
!SPKW Hydrogen ion transport
!SPKW Transmembrane
//
Add Annotation to Target
• Use conditions to extract
entries from TrEMBL
• Add common annotation
to the entries
XDB
SWISS-PROT
TrEMBL
RuleBas
e
Automatic annotation using multiple dbs
• Extract conditions from XDB
ENZYME
Pfam
• Group SWISS-PROT by
INTERPRO
PROSITE
conditions
• Extract common annotation
• Group TrEMBL by conditions
TrEMBL • Add common annotation to
TrEMBL
SWISS-PROT
RuleBas
e
Using tree structure of InterPro
RU000652 with additional condition
connected by ‘AND’
//
#RULE
RU000652
#DATE
2001-01-11
#USER
OPS$WFL
#PACK
PROSITE
?IPRO
IPR002379
?PSAC
PS00605
Additional condition (parent signature)
?EMOT
PS00605
!SPDE
ATP synthase C chain (Lipid-binding protein) (Subunit C)
!ECNO
3.6.1.34
!CCSU
F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND
CF(0) - THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3),
BETA(3), GAMMA(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B
AND C (By Similarity)
!CCSI
TO THE ATPASE C CHAIN FAMILY
!SPKW
CF(0)
!SPKW
Hydrogen ion transport
!SPKW
Lipid-binding
!SPKW
Transmembrane
//
Condition types
• Signature hits:
- Prosite, Prints, Pfam, Prodom
•Taxonomy:
- Broad groups like:
Archaea
Bacteriophage
Eukaryota
Prokaryota
Eukaryotic viruses
- more specific such as species
•Organelle
•Conditions
•Negated conditions
Rule-building
•Grouping and extraction of common annotation:
- semi automated but involves manual data-mining
assisted by perl/shell scripts.
•Algorithmic data-mining:
- fully automated.
- fast.
- exhaustive exploration of condition-set/annotation
search-space .
- non-biological, validity of rules being assessed
by comparison with semi-manual approach.
Advantages of this method
• Uses reliable ref database, prevents propagation of
incorrect annotation
• Using common annotation of multiple entries, lower
over-prediction than from best hit of BLAST
• Can standardize annotation and nomenclature of target
sequences, since reference is standardized
• Can have different levels of common annotation from
different levels of family hierarchy
• Independent of multi-domain organisation
• Evidence tags allow for easy tracking and updating
Pitfalls of automatic functional analysis
• Multifunctional proteins- genome projects often assign
single function, info is lost in homology search
• Hypothetical proteins (40% oRFs unknown), and poorly or
even wrongly annotated proteins
• No coverage of position-specific annotation eg active sites
• Current methods provide only a phrase describing some
properties of the unknown protein
It is important to have evidence for all annotation added
EVIDENCE TAGS
Predicting function from non-homology
• Look at position of genes relative to others,
compare with other organisms
• Can still build up rules from annotated sequences
using information you have on other features like
fold, physical properties etc.
• Use physical properties and known attributes
Protein functions from regions
• Active sites- short, highly conserved regions
• Loops- charged residues and variable sequence
• Interior of protein- conservation of charged
amino acids
Protein functions from specific residues
• C
•
•
•
•
DE
G
H
KR
• P
• SR
• ST
disulphide-rich, metallothionein, zinc fingers
acidic proteins (unknown)
collagens
histidine-rich glycoprotein
nuclear proteins, nuclear
localisation
collagen, filaments
RNA binding motifs
mucins
• Polar (C,D,E,H,K,N,Q,R,S,T) - active
sites
• Aromatic (F,H,W,Y) - protein ligandbinding sites
• Zn+-coord (C,D,E,H,N,Q) - active site,
zinc finger
• Ca2+-coord (D,E,N,Q) - ligand-binding
site
• Mg/Mn-coord (D,E,N,S,R,T) - Mg2+ or
Mn2+ catalysis, ligand binding
• Ph-bind (H,K,R,S,T) - phosphate and
sulphate binding
Supplement annotation with Xrefs
to other databases
•
•
•
•
DDBJ/EMBL/GenBank Nucleotide Sequence Database
PDB
Genomic databases (FlyBase, MGD, SGD)
2D-Gel databases (ECO2DBASE, SWISS-2DPAGE,
Aarhus/Ghent, YEPD, Harefield), Gene expression data
• Specialized collections (OMIM, InterPro, PROSITE,
PRINTS, PFAM, ProDom, SMART, ENZYME, GPCRDB,
Transfac, HSSP)
Approaches to functional annotation:
 Automatic annotation (sequence homology, rules, transfer info
from pdb)
 Automatic classification (pattern databases, clustering,
structure)
 Automatic characterisation (functional databases)
 Context information (comparitive genome analysis, metabolic
pathway databases)
 Experimental results (2D gels, microarrays)
 Full manual annotation (SWISS-PROT style)
AUTOMATIC CLASSIFICATION
Annotation can by using Clustering methods eg
CluSTR (EBI), and pattern searches (InterPro etc)classification of proteins into different families
AUTOMATIC CHARACTERIZATIONFUNCTIONAL ANNOTATION SCHEMES
•
•
•
•
First attempt –Riley classification of E.coli
Genome sequencing projects driving force
Need standardised system and vocabulary
Functional schemes normally hierarchies of
different levels of generalisation
Databases for Functional Information
• KEGG -Kyoto encyclopedia of genes and genomes
– (http://www.genome.ad.jp/kegg/)
– Links genome information (GENES database) to high order functional
information stored in PATHWAY database.
– Also has LIGAND database for chemical compounds, molecules and reactions.
• PEDANT -Protein Extraction, Description and Analysis Tool
– (http://pedant.gsf.de/)
– Annotation for complete and incomplete genomes eg. List of ORFs, EC numbers,
functional categories, list seqs with homologs, gene clusters, domain hits, TM,
structure links, search facility for sequences etc
• WIT –What is there
– ( http://www.cme.msu.edu/WIT)
– Database of metabolic pathways, can text search for ORFs, pathways, enzymes
Databases for Functional Information (2)
• COG -Clusters of Orthologous Groups
–
–
–
–
(http://www.ncbi.nlm.nih.gov/COG)
Phylogenetic classification of proteins encoded in complete genomes.
Contains 2791 COGs including 30 genomes.
COGs thought to contain orthologous proteins, classified into broad functional
categories (transciption, replication, cell division).
– COGNITOR assigns proteins to COGs based on best-hit, divides multi-domain
proteins
– Can compare results with complete genomes, look for missing functions
• GO –Gene Ontology
– (http://www.geneontology.org)
– Standard vocabulary first used for mouse, fly and yeast
– Three ontologies: molecular function, biological process and cellular component
Databases for Functional Information (3)
• MIPS:MYGD FunCat –Functional catalogue (yeast)
http://www.mips.biochem.mpg.de/proj/yeast
• EcoCyc -Encyclopedia of E. coli Genes and Metabolism
http://ecocyc.doubletwist.com/ecocyc/ecocyc.html
• Enzyme database
http://wwwexpasy.ch/sprot/enzyme.html
• TIGR –Gene identification list
http://www.tigr.org/tdb/mdb/mdb.html
 All schemes have different depths, breadths and resolutions
 Schemes need to be applicable to all organisms, standardized
for comparisons and permit multiple assignments
Assignment of function
• Use a combination of databases, especially those
with standardised functional information
• Search function databases with sequences to find
matches -assign function eg PENDANT, PIR
superfamilies, COGs, GO (via InterPro)
FUNCTIONAL CLASSIFICATION
USING INTERPRO
• InterPro classification with 3-4 letter codes
• Mapping of InterPro entries to GO
• GO- Gene Ontology (SGD, FB & MGD)
universal ontology for
– molecular function
– biological process
– cellular component
Classification of IPRs
CGD Cell cycle/growth/death
-CGDc cell cycle/division
-CGDg cell growth/development
-CGDd cell death
CYS Cytoskeletal/structural
-CYSc cytoskeletal
-CYSs structural
-CYSv virus coat/capsid protein
DPT
Defense/pathogenesis/toxin
DRG
DNA/RNA-binding/regulation
DRM DNA/RNA metabolism
-DRMr DNA repair/recombination
-DRMp DNA replication
-DRMm DNA/RNA modification
-DRMt transcription/translation
-DRMb ribosomal protein
MET Metabolism
-METs substrate metabolism
-METe electron transfer
-METa amino acid metabolism
-METn nucleic acid metabolism
-METm metal binding proteins
OTH Other functions
-OTHm cell motility
-OTHt transposition
-OTHa cell adhesion
-OTHg miscellaneous functions
-OTHh hormones
-OTHi immune-response proteins
-OTHf multifunctional proteins
-OTHo multifunctional domains
PFD Protein folding & degradation
-PFDc chaperone
-PFDp protease/endopeptidase
-PFDi protease inhibitor
PRG Protein-binding/other regulation
-PRGg GPCRs
-PRGr other receptors
-PRGo other regulation
STD Signal transduction
-STDk sig transduction
-STDp sig transduction
-STDr sig transduction
-STDs sig transduction
-STDc cell signalling
kinases
phosphatases
response reg
sensors
TRS Transport and secretion
-TRSt transport (subtrates)
-TRSi transport (ions)
-TRSs secretion
-TRSr carrier proteins
UNK
Unknown function
Pie charts of whole proteome analysis of 4 organisms
Unknown
Transport
Signal transduction
Protein folding/degradation
Miscellaneous
Structural
Defense/Pathogenesis
Cell cycle
DNA/RNA metabolism
Regulation
Metabolism
Distribution of protein functions
25
20
15
M. tuberculosis
10
E. coli
B. subtilis
S. cerevisae
5
0
S. cerevisae
B. subtilis
E. coli
M. tuberculosis
GENOME ANNOTATION TOOLS
• Oakridge Genome Annotation Channel
(http://compbio.ornl.gov/channel/)
• ENSEMBL (http://ensembl.ebi.ac.uk)
• Artemis (http://www.sanger.ac.uk/Software/Artemis)
Sequence viewer and annotation tool
• GeneQuiz (http//www.sander.ebi.ac.uk/genequiz/)
System for automated annotation of sequences, web
access required
• Genome Annotation Assessment Project (GASP1)
(http://www.fruitfly.org/GASP1)
PEDANT SYSTEM
Layer 1 bioinformatics tools
PSI-BLAST
IMPALA
PREDATOR
CLUSTALW
TMAP
SIGNALP
SEG
PROSEARCH
COILS
HMMER
Databases for searching
MIPS
PROSITE
BLOCKS
PIR
COGS
Layer 2 database to store information -MySQL
Layer 3 user interface to display results
parser of
results
Manual
annotation tool
Programs written in Perl5 and some in C++ -portable. Processing
of one sequence takes about 3 minutes
Summary of protein sequence annotation
•
•
•
•
•
•
Mask compositionally-biased and coiled-coil regions
Identify transmembrane regions, signal peptides, GPI anchors
Predict secondary structure
Look for known domains from protein pattern databases
Search sequence database for similar sequences
If no or few results search with subsequences, do iterative
searches
• Functional annotation: consider function of each domain
present, annotation from database homologs, function from
hits with 3D structure
SUMMARY OF ANNOTATION PIPELINE
NEW SEQUENCES FROM
SEQUENCING PROJECT
SEARCH FOR
PATTERNS &
FUNCTION
DBs
BLAST/
FASTA
NO SIGNIFICANT
HITS
PSI-BLAST
SIGNIFICANT
HITS
IF EQUIVALOG, INFER
FUNCTION
Search SCOP
NB look out for multidomain proteins, put
into genome context
NO SIGNIFICANT
HITS
HIT TO 3D PROTEINSTRUCTURE &
FUNCTION
PHYSICAL PROPERTIES,
LOCALISATION ETC
SIGNIFICANT
HITS
ASSIGN PROTEIN
FAMILY OR DOMAIN,
CF OTHER PROTEINS
IN FAMILY, INFER
FUNCTION
Supplement with
manual curation and
use evidence tags
LIMITS OF PROTEIN SEQUENCE ANALYSIS
• Predicting function from sequence requires another
sequence to be mapped to a function –many hypothetical
proteins in db and UPFs
• If sequence homologues are found, may not be functional
homologues -qualitative rather than quantitative process
- orthologues may have different functions
-enzyme homologues may be inactive
-equivalent functions may use different genes, not orthologue
• Analogy can infer molecular function, but not necessarily
cellular function
LIMITS OF PROTEIN SEQUENCE ANALYSIS (2)
• Databases are biased in sequence and aa composition
and search is dependent on size
• If no homology found- limited amount of information
can be inferred
• Incorrect annotation can be propagated when
similarity is over part on sequence not used in
annotation
• No answers to tissue-specificity, binding of ligands,
relationship between genotype and phenotype
LIMITS OF PROTEIN SEQUENCE ANALYSIS (3)
• Need additional information from experiments, eg
can predict glycosylation sites, but not kind of
sugar attached
• Problem with multidomain proteins (assign
orthology on basis of domains or domain
composition of whole protein?) -check also
known domain architectures and their taxonomic
limitations
Using different approaches to
functional annotation: Status for SPTR
• Automatic annotation (RuleBase): 20% of all protein
sequences/20% of all new sequences
• Automatic classification (InterPro, CluSTr, Structure): 60%
of all protein sequences/60% of all new sequences
• Automatic characterisation (GO): 40% of all protein
sequences/40% of all new sequences
• Full annotation (SWISS-PROT style): 20% of all protein
sequences/5% of all new sequences
Using different approaches to functional
annotation: Future for SPTR
• Automatic annotation (RuleBase): 50% of all protein
sequences in 2004
• Automatic classification (InterPro, CluSTr, Structure): 90%
of all protein sequences in 2004
• Automatic characterisation (GO): 70% of all protein
sequences in 2004
• Full annotation (SWISS-PROT style): 10% of all protein
sequences in 2004
IMPORTANT TO NOTE:
• DON’T COMPLETELY TRUST COMPUTER
RESULTS
• CHECK LITERATURE
• CONFIRM WITH WETLAB WORK- mutational
analysis gives valuable info about function
• COMPROMISE BETWEEN OVER AND UNDERPREDICTIONS -overpredictions can be checked by
curators, easier to delete than find missing info.