Module 1: Database, security and user interface

Download Report

Transcript Module 1: Database, security and user interface

BBP Objective
“This new platform will be a one-stop
information warehouse that hosts
phenotypic and genomic data for barley
breeders, barley geneticists, students,
extension agronomists and barley
producers in Alberta and across western
Canada.”
BBP Functions
• Module 1: Database, data security
• Login, data query, data entry/update
• Module 2: Statistical genomics
• QTL mapping, association genetics
• Genome-wide prediction
• Molecular breeding strategies
• Module 3: Bioinformatics
• Gene/regulatory region annotation,
• Comparative genomics
ER diagram
Phenotype group
Genotype group
Knowledge base group
Many-to-many
One-to-many (required)
One-to-many (optional)
Module 1: Database, security and
user interface
• Database design
• ER diagrams
• Variable groups
• Database implementation
• Table structure
• Table-to-table relationship
• Database security: user permission management
• Verification of user ID
• Permission assignment (permission table with permission levels for individual users)
• Permission implementation
• User interface with Excel
•
•
•
•
•
login
Data query
Data submission
Report
Data quality control
BBP module 1 progress
Table
Design
Implementation
Structure
ER
GUI
Query
Entry
Pedigree
Datatype
Trial
Phenotype
Marker
Genotype
QTL
Gene
Annotation
Contact
Completed
In progress
Not available
Data
loading
Variable groups
• Primary key
• Identity keys
• Checking duplicates in input data
• Checking existing records
• Required fields
• Observed values for phenotype or genotype
• Log info
• User info to track that last data operation
• Optional fields
DB Implementation: pedigree
• Role:
• Support table for “phenotype”
and “genotype”
• Providing Pedigree information
• Structure implementation
• Refers to the left
• Data loading
• 5402 records from FCDC
• 1 record for demonstration
DB Implementation: datatype
• Role:
• Support table for “phenotype”
• Convert datatype name to
datatype_id
• Structure implementation
• Refers to the left
• Data loading
• 34 records from FCDC
DB Implementation: trial
• Role:
• Support table for “phenotype”
• Define trial identity with year,
sourfile and site
• Structure implementation
• Refers to the left
• Data loading
• 2080 records from FCDC
DB Implementation: phenotype
• Role:
• Phenotyping data
• Structure implementation
• Refers to the left
• Data loading
• 886,337 records from FCDC
• 34 traits
• 2080 trials
DB Implementation: marker
• Role:
• Support table for “genotype”
• Structure implementation
• Refers to the left
• Data loading
• 223 records for demonstration
DB Implementation: genotype
• Role:
• Genotyping data
• Structure implementation
• Refers to the left
• Data loading
• 32,117 records for
demonstration
DB Implementation: QTL
• Role:
• Support table
• Public QTL information
• Structure implementation
• Refers to the left
• Data loading
• none
DB Implementation: gene
• Role:
• Support table
• Gene names and locations from
Ensemble
• Structure implementation
• Refers to the left
• Data loading
• none
DB Implementation: annotation
• Role:
• Support table
• Gene function annotations from
Ensemble and our research
project
• Structure implementation
• Refers to the left
• Data loading
• none
ER: ‘phenotype’ & ‘genotype’
• “selecode” is used to
link the two tables
• many-to-many
relationship
• selccode is required
field for both tables
ER: ‘phenotype’ and support tables
• trial to phenotype
• One-to-many
• Required
• datatype to phenotype
• One-to-many
• Required
• pedigree to phenotype
• One-to-many
• Optional
ER: ‘genotype’ and support tables
• marker to phenotype
• One-to-many
• Required
• pedigree to genotype
• One-to-many
• Optional in the
phenotype table
ER: ‘gene’, ‘marker’, ‘QTL’ and annotation
ER: ‘gene’, ‘marker’, ‘QTL’ and ‘annotation’
gene
marker
marker
QTL
Annotation
One-to-many
One-to-many
One-to-one
Many-to-many
QTL
All ERs in the above table are optional.
One-to-many
Database security:
Permission assignment
Administrator
Breeder
Molecular
biologist
Bioinformatics
support
Non-user
Pedigree
RW
RW
R
R
NA
Datatype
RW
RW
R
R
NA
Trial
RW
RW
R
R
NA
Phenotype
RW
RW
R
R
NA
Marker
RW
R
RW
R
NA
Genotype
RW
R
RW
R
NA
QTL
RW
R
RW
RW
NA
Gene
RW
R
RW
RW
NA
Annotation
RW
R
R
RW
NA
Contact
RW
R
R
R
R
Table
NOTES: R: read-only; RW: read and write; NA: not accessible
Database security : Permission
implementation
• barleyDB SQL server:
• Each user was explicitly allocated to a user group
defined by the permission assignment table.
• The user only be able to perform tasks that were
assigned to the corresponding group.
• If no group was assigned, no privilege will be associated
with the user.
Database security : Permission
implementation
• BBP client
• To login with valid user ID and password is mandatory
for data query, upload, and update in BBP.
• User ID and password are not cached on the local
computer.
• Only the menus and the tools for the authorized
operations associated with the current user are enabled
in the BBP user interface.
• No file is used to pass the user identity among modules
in BBP.
Steps of data query in BBP
1.
Submit user identity and query statement to R console from
VBA/Excel
2.
Using the RMySQL package in R environment to
a) Validating user authority
b) submit query statement to barleyDB server
3.
Save query result as temp file in R
4.
pass the file name back to
VBA/excel
5.
Load the file into worksheet
and delete it afterwards
1
2
4
3
5
Steps of data upload in BBP
1.
Check duplicates of data in the selected worksheet using
identity keys in VBA/EXCEL
2.
Format the data and write it to temporary files with VBA/Excel
3.
Submit user identity, SQL statement and the file name to R
console
4.
R console (RMySQL package):
a) Load the data into R and checking
permission for the user
b) submit data to barleyDB
1
c) Return updating summary
5.
Display updating and delete
the temp file
5
2
4a
3
4c
4b
Barley Breeding Platform in EXCEL (BBP)
• Launch EXCEL
• Click on ‘ADD-INS’ tab on the
ribbon to get into Barley
Breeding Platform (BBP)
menu
• Check on different function
buttons
Functions in BBP
• Login
• Data management
• Search (data query)
• Data submit
• Report
• Advance query
and data report
• Statistical genomic
• Bioinformatics
Login interface
• Enter User name & password
• Check level of permission
assigned to the login user
• Get user_id for tracking all the
actions by this user.
• E.g., to track who entered or
modified the data in table.
Query interface
Data submitting interface - Pedigree
• Identity keys
• pedcode
• Additional
•
•
•
•
•
•
•
Maternal
Paternal
Year
Crossnum
Rowtype
Hull
memo
Data submitting interface - Trial
• Identity keys
• Year
• Sourfile
• Site
• Additional
• Soil
• Soilzone
• Lat
• Long
• Area
• user_area1- 4
• conv_factor
• Design
• memo
Data submitting interface - datatype
• ID (record)
• Mandatory info
• trait
• Additional
•
•
•
•
•
•
•
•
category
Code
memo
Units
Max
Min
UseConvFact
label
Data submitting interface - phenotype
• Identity keys
•
•
•
•
•
Trial ID
Datatype ID
Replicate
Plotnum
Testnum
• Observations
• Value
• Additional
•
•
•
•
Quality
Codient
pedigree_id
Memo
Formats for phenotype data supported by BBP
1: Separating columns for ‘datatype’ and ‘value’
2: one column contains both datatype and value
Dynamic query in data submitting
1: row data
2: After loading “phenotype submit” window
3: After select column for “phenotype values”
Data submitting interface - genotype
• Identity keys
• marker_id
• selccode
Marker list
• Observations
• Value
• Additional
• Memo
• pedigree_id
Genotype
converting table
Pedcode list
Data submitting interface – maker info
• ID (record)
• Identity keys
• Name
• Additional
• Type
• seq
• chrom
• Physical_start
• Physical_end
• Genetic_position
Data quality assessment for redundancy
& conflicts
• Duplicated data in worksheet
• Checking by identity keys
• Existing records
• Checking by identity keys
• User decide to ignore or update the existing records
• Observation (phenotype and genotype)
• Records without valid observed values will NOT be
submit to database
• The strategy is to mimic the sparse matrix storage
approach
Individual value validating
• Checking by the type of variable
• Character
• Numeric (valid numbers)
• Year (integer between 1800 ~ current year)
• Missing value handling
• Missing values are invalid for mandatory variables
• Phenotyping data
• Must be numeric
• Range was defined in the DataType table
• Genotyping data
• Based on user-defined genotype converting table
Data validating: Genotype
• Selected cells are checked
based on the user-defined
“Genotype converting
table“, which is in the Data
submitting interface for
genotype.
• Checking results were
indicated by colors.
• Comments were added to
the cell as error detected.
Genotype converting table
Column with
invalid values
Missing value
Valid value
Row with invalid values
Invalid value
Row without invalid values
Unused data
Report module
Report module provides convenient tables and
figures for breeders and researchers based on
advanced query and preliminary analysis. In the
current version, the module includes the following
two functions:
• Breeder’s report
• Family tree plot
Breeder’s report
• Query phenotypic data
based on trials
• Multiple-trial
supported
• Data report in a
breeder friendly format
A general report format used by
breeders
A “breeder’s report” by BBP
• Least square means
(LSM) were listed if only
one trail was involved
• Multiple trails
• Approach 1: to show
LSMs for all trails,
separately
• Approach 2: output the
averages of the LSM
from the trials
Pedigree plot
• Algorithm for identifying family tree
• The first six letters in line/plant ID were used as the key
in the pedigree database/BarleyDB.
• Recursive query was called until found all foundations
that contributed to the current line.
• Full plant IDs of parents were truncated to pedccode as
querying the previous generations.
• The function was automated to extract data from
barleyDB database directly, and no data preparation is
required from the user side.
Pedigree plot (User interface)
The “family tree“ function is accessible
at: “BBP”  “Report”  “Family Tree”
Network visualization:
• The sizes of output figure
are dynamic calculated
based on the depth of the
actual family tree.
• The selccodes in figure
has been auto-formatted
for visualization.
Module 2: Statistical and
computational genomics
• Advanced data query
• Linking phenotype with genotype for given SELCODE
• QTL mapping
• Interval mapping
• Composite interval mapping
• Permutation test
• Trait-marker association analysis
• Multiple regression
• Bayesian methods (BayesA/B)
• Result visualization (e.g., Manhattan plot, marker effect profiles)
• Future considerations
• Virtual crosses and parental selection
• Genomic selection
Advanced data query
BBP provided an automatic tool helping users to
prepare genotypic & phenotypic data for statistical
genomic analysis.
• Guided inquiry for
matched genotypic and
phenotypic data based
on trials.
• Users are able to decide
whether an individual
to be exclude/include in
the follow-up analysis.
Formats for phenotype & genotype data
Phenotype worksheet
• First column: selccode (plant ID)
• The “B” column: the trait to be
used in the follow-up analysis.
Genotype worksheet
• The first 7 columns: marker and
map information
• Column 8 and after: genotyping
data. Each column for an
individual.
• Users may add/delete individuals or markers from the prepared data, however, the
individual IDs in the two worksheets have to be matched.
• The current data preparing tool in BBP extract only one trait at each time, but the
users are allowed to add more columns by manual to the “BBP_phenotype”
worksheet to be used in subsequent analysis.
Algorithms behind the advanced query
• Full list of matched phenotype & genotype data
• Query full lists of selccodes for genotype and genotype
• Generate a intersecting selccode list for genotypic and
phenotypic data
• Generate a list for involved trials
• Data selection
• Guided inquiry / direct selecting of trial with valid data
• Querying valid selccodes that involved in the trial
• Users may decide the individual to be included in the
subsequent analysis
BBP steps for QTL mapping and association analysis
The qtl/R package was used to implement interval mapping (IM) and
composite interval mapping (CIM) in BBP, while the association analysis
were implemented by the glmnet/R package.
1.
Data preparing by using the “Advanced data query”
in the statistical genomics module.
2.
Write the worksheets from the above step
to temporary files and submit analysis
requests to R.
3.
Load data to R, run requested analysis.
4.
Prepare resulting tables and figures in R.
5.
Load results from the R outputs in VBA
and cleanup temporary files.
1
1
4
2
3
3
QTL mapping dialogs
1) Trait select for subsequent analysis
1
2
2) Interval size (or maximum step) in QTL
mapping, in cM.
3) Filters for lines and markers.
3
3
Shortcuts: Ctrl+A: select all;
Esc: deselect all
4) Define cross type for the population
a) Unknown: the crosstype to be
inferred by markers
b) F2, Recombinant inbreed lines (RIL),
Backcross (BC), doubled haploid (DH)
c) BCsFt: Backcrossing (BCs) followed by
selfing (Ft)
35
6
4
7
QTL mapping dialogs
5) Select a strategy to declare the significant
QTLs.
1
2
6) Threshold for significant LOD scores.
When multiple values provided, only the minimum is
used in searching QTLs.
3
3
7) Permutation settings.
Large number of permutations is helpful to obtain
more accurate empirical LOD threshold, but it
requires more computing loading.
8) CIM specific settings
• N. Cov: number of marker covariant in CIM
• Windows (cM). The size of the region to protect
the current locus. No marker in the range will be
used as covariant.
• When N. Cov=0, or an extremely large value was
set to the “Windows”, equivalent results to
interval mapping will be obtained.
5
4
6
7
8
QTL mapping outputs in BBP
1.
QTL table. Significant QTLs that with valid drop-one LOD support
intervals based on LOD cutoff (or permutation)
2.
LOD profile for all chromosomes
3.
Significant QTLs and its’ support intervals
4.
Optional. Histogram of the maximum LOD scores obtained from the
permuted samples.
1
2
3
4
Association analysis algorithms in BBP
• Multiple regression
Implemented by the (glmnet/R package)
• LASSO
• Ridge regression
• ElasticNet (0<=alpha<=1)
• Bayesian analysis
Implemented by the (BGLR/R package)
• Bayes A
• Bayes B
Multiple regression
• The interfaces for the three multiple regression
method in BBP are similar
1) Data filters for selccodes and markers
2) The ElasticNet mixing parameter, 0 ≤ α ≤ 1.
Only available for the ElasticNet regularization
algorithm.
α = 1, the lasso penalty
α = 0, the ridge penalty
1
1
1
1
2
1
1
Multiple regression outputs in BBP
1.
Marker effect estimates and p-values
2.
Marker effect profile
3.
Manhattan plot
4.
Genetic map and significant effects
1
2
3
4
Bayesian Analysis
1) Data filters for selccodes and markers
3
2) Bayesian models (A & B)
2
3) The π value in Bayes B, ignored in Bayes A
1
4) MCMC settings
1)
N. Iters: number of total MCMC samples to be genearte
2)
BurnIn: number of premature samples to be discarded
in the posterior sample analysis
3)
Thin: interval between independent MCMC samples
5. Random seed
•
The seed to be used by the pseudo random number
generator in R to run MCMC analysis .
•
By default, zero, Bayes analysis generate different results
at each time.
•
However, identical results can be obtained by using the
same random seed.
1
4
5
Outputs of the Bayes analysis in BBP
1.
2.
3.
4.
5.
Marker effect estimates and p-values (2-tailed t-tests)
Manhattan plot
Effect profile
Genetic map and significant effects
Convergence of residual variance in MCMC
2
1
3
4
5
Module 3: Bioinformatics
• Advanced data query
• Linking phenotype with QTL/genes in genomic context
(chromosome position, gene functions) for given SELCODE
• Comparative genomic analysis
• Function/pathway-driven gene annotation and analysis
• Inferring about unknown genes in barley from known genes in
Arabidopsis
• Future considerations
• Array-based confirmation for gene annotation
• Gene network and system biology
Gene Annotation
• The module allow users to extract gene annotation
data from the Ensemble Database by providing
standard Gene Ontology terminology (or GO
terms).
• The GO terms data are download automatically by the
BBP software for every week from the Gene Ontology
Consortium website.
• Fuzzy search is implemented to allow users searching
standard GO terms at a easy way.
• Users are allowed to selected the target species for the
gene annotation data.
Gene Annotation window
1. Add or delete items for
GO term list
2. Add or delete items for
the species list
3. Select variables to be
extract from the
Ensemble Database
1
2
3
Keyboard shortcuts:
• Esc: deselect all items
• Ctrl+A: select all items
• Ctrl+C: copy the selected items
“Select GOTerms” dialog
• The interface allows user to
search the standard GO
terms that defined by the
gene Ontology Consortium
• The interface is invoke by the
“Add new” command in the
“Gene Annotation dialog”
1. Fuzzy searching is supported
2. The program remember all
selected items in multiple
searching.
3. Detail information of the
current item
4. Click "close" to back to the
“Gene Annotation” window
4
1
2
3
Select species dialog
• The dialog is invoked by the
“Gene Annotation” window.
• Gene annotation is currently
allow users to selected from
39 species.
• Multiple selection supported
• Click "close" return to the
“Gene Annotation” window.