GenSel4 Features & Tutorial by Dr. Dorian Garrick

Download Report

Transcript GenSel4 Features & Tutorial by Dr. Dorian Garrick

GenSel4
August 2011
Command Line
GenSel can be run from commandline
For example gensel4 (provided path set appropriately)
GenSel can be run from the BIGS web interface
GenSel jobs can be submitted to the queue on the HPC using
the bigscli command from the unix interface of BIGS
Usage
gensel input_file_name
nohup gensel input_file_name –s status_file_name
Genotypes & Phenotypes
Required for all analyses
trainPhenotypeFileName
markerFileName
Read by GenSel4
Used for analysis
Genotype File Structure
Space delimited unix file (dos2unix to convert)
header row plus one row for each animal
column for ID then a column for each genotype
One header row
Alphanumeric labels for each genotype/locus
One row for each animal
Alphanumeric ID followed by all the genotypes
-10, 0 or 10 for AA, AB or BB (no support for missing genotypes)
Ordered by genomic location if no map file
Read in binary format (end in .newbin)
Text files are converted to binary in the first analysis
Must be same number of columns in every row
Example Genotype File
ISU_ID.bt
isu_1
nadc_1
isu_2
isu_3
isu_4
ISU_Angus_1
-10
0
10
10
0
AAA00001
-10
10
-10
10
0
Tag_number_a
b
-10
-10
0
10
0
Casanova_bull
-10
10
6.5
-10
10
Disk requirements for 5,000 bovine 50k genotypes in text form are about 1Gb
(and the same file in binary format is typically half the size)
Species are designated by the first letters of Genus and species
bt = Bos Taurus; hs=hom sapiens; oa=ovis ariesl ss=sus scrofa etc
This will later provide functionality for species specific genome browsing
Phenotype File Structure
Space delimited unix file
Separate phenotype file for each trait
Header row plus one row for each animal with phenotype
Alphanumeric animal ID must be in column 1
Trait value must be in column 2 (label in header)
Remainder of file is arbitrary but defines model for trait
Recommend to at least involve a column of 1’s for the mean
Columns headed by alphanumerics – all rows have same no of columns
Columns headed by name ending in $ are class variables
Columns headed by other names are covariates
Columns ending in # are ignored
Column headed by rinverse specifies a weighted analysis
Example Phenotype File
Animal
IQ
mean
dob
Sex$
Family#
rinverse
A_1
100
1
100
male
1
1.0
B_2
95
1
105
female
1
0.9
C.12345
103
1
97
spey
2
.95
Spot
110
1
90
male
2
1.1
rinverse is only proportional (scalar variance factored out)
covariates must be numbers!
categorical traits must be numbered from 1 upwards
trait in column 2 (not required for prediction)
sensible to at least have the mean
model does not need to be full rank
GenSel matches IDs
Only records with the same alphanumeric ID in the genotype
and phenotype file are available for subsequent analysis
Start of analysis reports the number of animals in the
genotype file, phenotype file and matching records
Genotypes & Map Files
GenSel now supports the use of a map file
A map file provides chromosome and basepair position
information for at least one build
Can support any number of builds
A map file may provide multiple aliases for marker names
Every marker name from the genotype file must exist
somewhere in the map file
Additional marker names can be in the map file.
Map File Structure
Rs_num
Ss_num
ISU_ID
UMD_chr
UMD_pos
BTA_chr
BTA_pos
Rs_001
101
isu_1
1
100000
1
95123
1234
102
isu_2
2
1234567
2
1500000
5678
103
isu_9
2
987654321
2
10000000
910a
104
isu_5
X
0
PAR
2543
newS
newS
nadc_1
unk
0
unk
N/A
Space delimited unix text file
Map File Options
The minimum requirements are
mapFileName
linkageMap (options depend upon your mapfile)
eg UMD or BTA for my example on last page
This will result in columns of the genotype file being sorted into
genomic order to facilitate formation of contiguous marker
windows – automatically formed in 1Mb sizes
Options include
addMapInfoToMarkers yes
Results in chromosome and base pair position added to output
outputMarkerHeaderName (options are aliases in your map file)
Filtering Genotypes
4 methods to filter columns of the genotype file for analysis
Two approaches are always available
includeFileName or excludeFileName
These files contain a list of marker names as in the genotype file
header that are to be included or excluded
Include takes precedence over exclude
Two other approaches are available if a map file is used
windowIncFileName or windowExclFileName
List of chromosome_names to include/exclude entire chromosome
List of chromosome_name start_bp end_bp
Map files & SNP names
Sometimes the genotype file uses one marker name (eg
database numeric ID), but the marker output file would
benefit from having a different name (eg rs number)
Given a map file, Predict can cross reference the different
marker names so you can exchange marker results (.mrkRes)
files with other users
Output File Name Conventions
Suppose GenSel is run using gensel4 demo.inp
The root for all output files will be “demo”
All options will produce output to demo.out# where # is the
next available integer not already used
The first run produces demo.out1, the next demo.out2 etc
Most other options produce additional files that will have the
same root name and the same suffix number as the .out file
demo.LD1, demo.mrkRes1, demo.ghat1, demo.winVar1 etc
Analysis Options
Many calculations are time consuming
Computing window Variance
Validating predictive accuracy in test data
Computing PEV and R2
These are only done in some iterations according to the
outputFrequency option
Default is 100 so these calculations occur for 1% iterations
Markov Chains use many random numbers
The seed option (default 1234) can be used to alter sampling
Print
analysisType Print
This can be used to get a printout of the X matrix, ordered by
map position if a map file is used, for just those animals in the
genotype and phenotype file
The output contains the covariates on a 0, 1, 2 scale, before
centering, not on the -10, 0 , 10 scale used in the marker
genotype file
LD
analysisType LD
This computes the pairwise squared correlation between every
pair of markers in the filtered genotype file
Also computes the minor allele frequencies (MAF)
The output file will be very large if you don’t filter it
Only squared correlations exceeding minLDoutput are stored
minLDoutput (default 0.1)
StepWise
analysisType StepWise
Computes (unweighted) forward and reverse submodels after
first fitting all the fixed effects
R2 is defined as the proportion of sums of squares after the
fixed effects
Three options control the model
inputMaxRsquared (default 0.8) will stop the analysis
inputMaxMarkers (default 100) will stop the analysis
alphaValue (default 0.05) controls significance
Bayes
analysisType Bayes
bayesType BayesB
Metropolis-Hastings
Gibbs Sampling
bayesType BayesA (Actually just BayesB with pi=0)
bayesType BayesC
bayesType BayesCPi (Actually BayesC but with pi estimation)
bayesType RBR (Robust Bayesian Regression)
Really Bayes B but with pi, Scale and df (genetic) estimation
FindScale options (no, yes) or for BayesCPi (thruPi)
Bayes Priors
Priors and associated degrees of freedom are required for the
genetic and residual variance
genVariance (default 1)
degreesFreedomEffectVar (default 4)
resVariance (default 1)
nuRes (default 10)
Better estimates of genVariance and resVariance should be used
From knowledge of heritability and phenotypic sd
Bayes Options
All analysisType Bayes jobs have extra options
burnIn is the number of iterations in the chain to discard
Probably doesn’t need to be very many (eg 1,000)
chainLength is the number of iterations in the chain
Typically use 41,000 or more (this includes burnIn)
Mixture models (BayesB, BayesC, BayesCPi, RBR) assume a
fraction (1-pi) of markers have an effect and pi have 0 effect
Option is for example probFixed 0.95
Bayes Options
BayesB (and therefore BayesA = B0) used to use a
Metropolis-Hastings rather than a Gibbs sampler
MHG did 100 MH iterations
Our fast version used a different proposal distribution and
required no more than 10 MH iterations
You can specify numMHIter
Long developed an alternative sampler that does not use MH
You select this option using numMHIter 0
It is faster – the same speed as BayesC
Bayes Options
The 1 Mb windows formed using a map file can be used to
compute the variance of the window
This is turned on using windowBV yes
Note the number of markers in each window varies with SNP
density along the genome (many markers for chrom unk)
This provides posterior distributions of windows so that the
previous Permute and Bootstrap options are no longer needed
or supported
In the absence of a map file, the columns in the genotype file are
assumed to be consecutive, and the number of markers in a
window are defined by the windowWidth option
The default is 5
Automatically get graphs of posteriors and table of variances
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
5
10
15
20
Percentage of Genetic Variance by 1 mb windows
25
30
Note window Variances typically don’t sum to 100 due to nonzero covariances
120
110
100
90
80
70
60
50
40
30
20
0
200
400
600
800
Cumulative Genetic Variance by largest windows
1000
1200
0.06
0.05
0.04
0.03
0.02
0.01
0
10
15
20
25
30
35
Window contains 20 SNPs from Gga_rs14490890 to Gga_rs14491074
40
0.045
0.04
0.035
0.03
0.025
0.02
0.015
0.01
0.005
0
1
2
3
4
5
6
Window contains 27 SNPs from Gga_rs14693113 to Gga_rs13758442
7
8
Predict
analysisType Predict
markerSolFilename defines the name of a .mrkRes file from
a previous training analysis
windowWidth defines the number of markers in a
consecutive window from which the overlapping window
variances are computed
windowBV yes will result in a file full of ghats with a row for
each animal and a column for each overlapping window
GenerateData
Randomly chooses 1-probFixed proportion of loci to be QTL
Samples QTL effects and residual effects according to
normal distributions with mean 0 and variance determined
by varGenotypic and varResidual
Outputs the simulated genotypes and phenotypes
Phenotypes will be categorical if isCategorical yes with as
many categories as specified by numCat (default 2)
Categories will be equal sizes unless specified by the option
PortOfCat (eg 0.70:0.20) if numCat 3
Validation
There are two options for validation
Validation can be done jointly with the training analysis
trainPhenotypeFileName
testPhenotypeFileName
If no testPhenotypeFileName, training data is used
This will produce ghat, PEV and R2 for validation animals
Validation can be done in a later session from training
This will produce ghat but no PEV or R2
All columns of phenotype file are copied into the ghat file to
facilitate downstream analysis
Graphing Posteriors
Various posterior distributions will be output if desired using
the key word plotPosteriors yes
Samples used in the graphs are in .mcmcSamples which can
be produced without graphing if mcmcSamples yes
Requires that gnuplot is installed on the machine in a
location accessible using the defined path
Categorical Options
All analysisType Bayes will do categorical analyses if the
option isCategorical yes is used
Categories must start from 1, and be ordered without missing
categories
Required Libraries
Many routines use matvec libraries
Most matrix and vector computations use Eigen3
GSL is no longer used
Boost is used (only for format statements)
Limited use of STL
Graphics options require gnuplot
Environment must include paths to gnuplot (/opt/local/bin)
R version
We are developing an R version that will allow you to run
any or all of the options from R
Also allow you access to variables created during the analysis
Hope to allow you to replace existing procedures with your
own for prototyping new methods or features
Planned Developments
Addition of partial least squares (PLS), Bayesian Lasso
Addition of further random factors beyond the genotypes
Using pedigree, genomic or identity variance-covariance
matrices
Extension to multiple trait analysis
Implementation using CUDA graphics processors