Exporting and importing Stata
genotype data to and from PHASE
and HaploView
UK Stata Users Group Meeting 2009
September 10-11, 2009
John Charles “Chuck” Huber Jr, PhD
Assistant Professor of Biostatistics
Department of Epidemiology and Biostatistics
School of Rural Public Health
Texas A&M Health Science Center
Motivation
Many rapidly growing areas of research utilize
multiple specialty “boutique” computer
programs to conduct highly specialized
analyses.
The Stata user is faced with two choices:
1. Write new Stata commands that do the same analyses
2. Write Stata commands that efficiently export and import
data for these “boutique” programs
Stata for Genetic Data Analysis
Outline
1. Genetic Data Analysis using Stata
2. Genetics Background
3. The “file” commands in Stata
4. The phasein and phaseout commands
5. The haploviewout command
6. Summary
Stata for Genetic Data Analysis
2007 UK Stata Users Group meeting:
http://www.stata.com/meeting/13uk/
A brief introduction to genetic epidemiology using Stata
Neil Shephard, University of Sheffield
An overview of using Stata to perform candidate gene association analysis will be
presented. Areas covered will include data manipulation, Hardy–Weinberg
equilibrium, calculating and plotting linkage disequilibrium, estimating haplotypes,
and interfacing with external programs.
User Written Genetics Commands
Programs written by David Clayton
• ginsheet - Read genotype data from text files.
• gloci - Make a list of loci.
• greshape - Reshape a file containing genotypes to a file of alleles.
• gtab - Tabulate allele frequencies within genotypes and generate indicators (performs Hardy-Weinberg Equilibrium testing).
• gtype - Create a single genotype variable from two allele variables.
• htype - Create a haplotype variable from allele variables.
• mltdt - Multiple locus TDT for haplotype tagging SNPs (htSNPs).
• origin - Analysis of parental origin effect in TDT trios.
• pseudocc - Create a pseudo-case-control study from case-parent trios.
• pscc - Experimental version of pseudocc in which there may be several groups of linked loci.
• pwld - Pairwise linkage disequilibrium measures.
• rclogit - Conditional logistic regression with robust standard errors.
• snp2hap - Infer haplotypes of 2-locus SNP markers.
• tdt - Classical TDT test.
• trios - Tabulate genotypes of parent-offspring trios.
User Written Genetics Commands
Programs written by Adrian Mander
• gipf - Graphical representation of log-linear models.
• hapipf - Haplotype frequency estimation using an EM algorithm and log-linear modelling.
• pedread - Reads a pedigree data file (in pre-Makeped LINKAGE format), similar to ginsheet.
• pedsumm - Summarises a pre-Makeped LINKAGE file that is currently in Stata's memory.
• pedraw - Draws one pedigree in the graphics window.
• plotmatrix - Produces LD heatmaps that display the strength of LD between markers graphically.
• profhap - Calculates profile likelihood confidence intervals for results from hapipf.
• swblock - A step-wise hapipf routine to identify the parsimonious model that describes the haplotype block pattern.
• qhapipf - Analysis of quantitative traits using regression and log-linear modelling when phase is unknown.
• hapblock - Attempts to find the edge of areas containing high LD within a set of loci.
User Written Genetics Commands
Programs written by Mario Cleves
• gencc - Genetic case-control tests.
• genhw - Hardy-Weinberg Equilibrium tests.
• qtlsnp - A program for testing associations between SNPs and a quantitative trait.
Programs written by Catherine Saunders
• co_power - Power calculations for Case-only study designs.
• gei_matching, geipower - Power calculations for Gene-Environment interactions.
• ggipower - Power calculations for Gene-Gene interactions.
• tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
• tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.
Programs written by Neil Shephard
• genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata-formatted data set for browsing.
The Structure of DNA
Watson et al. (2004) pg 23, Figure 2.5
The Structure of DNA
Hartl & Jones (1998) pg 9, Figure 1.5
What is a SNP?
• A SNP is a single nucleotide polymorphism
(the individual nucleotides are called alleles)
Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
SNP1: position 8 (g or c)   SNP2: position 35 (c or a)
Allelic Association
• Simple 2x2 table
• One table per SNP
• Compute a simple chi-squared statistic or odds
ratio for each SNP
           SNP1 Allele
           g      c
Case       250    750
Control    650    350
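The allelic test can be run directly from the counts with Stata's immediate commands. This is a minimal sketch, assuming the slide's table reads Case: g = 250, c = 750 and Control: g = 650, c = 350 (each row then totals 1000 alleles); the choice of g as the "exposed" allele for the odds ratio is an assumption made for illustration.

```stata
* Allelic association for SNP1 from the table counts
* Rows: case, control. Columns: allele g, allele c.
tabi 250 750 \ 650 350, chi2

* Odds ratio for allele g (cases versus controls)
* Argument order: cases exposed, cases unexposed,
*                 controls exposed, controls unexposed
cci 250 750 650 350
```

With one table per SNP, the same two lines would be repeated for each SNP's counts.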
Genotypic Association
• Compute chi-squared tests
• Allows testing of various disease models
(dominant, recessive, additive)
           SNP1 Genotype
           gg     gc     cc
Case       100    250    150
Control    300    150    50
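Testing a disease model amounts to collapsing the genotype columns before computing the chi-squared statistic. The sketch below assumes the table reads Case: gg = 100, gc = 250, cc = 150 and Control: gg = 300, gc = 150, cc = 50, and it treats g as the allele of interest, which is an assumption for illustration only.

```stata
* General genotypic test (2 degrees of freedom)
* Rows: case, control. Columns: gg, gc, cc.
tabi 100 250 150 \ 300 150 50, chi2

* Dominant model for allele g: carriers (gg + gc) versus cc
tabi 350 150 \ 450 50, chi2

* Recessive model for allele g: gg versus (gc + cc)
tabi 100 400 \ 300 200, chi2
```

An additive model would need a trend test rather than a collapsed 2x2 table, so it is not shown here.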
What is a Haplotype?
• A haplotype is the combination of one or
more alleles found on the same chromosome
– Person 1 has a “gc” haplotype and a “ca” haplotype
– Person 2 has a “cc” haplotype and a “ga” haplotype
Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
SNP1: position 8 (g or c)   SNP2: position 35 (c or a)
Haplotypic Association
• Compute chi-squared tests
• Two SNPs with alleles a/g and c/t respectively
           SNP1:SNP2 Haplotype
           a:c    a:t    g:c    g:t
Case       100    250    75     75
Control    300    100    50     50
Why are haplotypes important?
2009 Oxford and Cambridge Boat Race
http://www.theboatrace.org/gallery/2009?page=7#
Why are haplotypes important?
[Figure: two rows, Chromosome R and Chromosome D, each with five positions, SNP1 to SNP5, filled by President, VP, State, Defense, and Treasury]
Rearranging the members of each “chromosome” could have a profound effect!
Why are haplotypes important?
Hartl & Jones (1998) pg 18, Figure 1.13
Why are haplotypes important?
Watson et al. (2004) pg 29, Box 2-2
The PHASE Program
• Unfortunately, haplotypes are not observed directly
using modern, high-throughput lab techniques
• We observe genotypes and must infer the haplotype
structure using algorithms
• PHASE is a very popular program for inferring
haplotypes from many SNPs simultaneously
(Stephens, Smith & Donnelly, 2001)
The phaseout Command
Raw Genotype Data in Stata
The phaseout Command
Input file format for PHASE
The phaseout Command
I need to get my data from here (the raw genotype data in Stata) to here (the input file format for PHASE).
The “file” commands in Stata
Using “file open”, “file write” and “file close”
file open Example1 using "ExampleFile.txt", write replace
file write Example1 "Hello World" _newline(1)
file write Example1 "Why so blue?" _newline(1)
file close Example1
The “file” commands in Stata
Using “file open”, “file read” and “file close”
. file open Example2 using "ExampleFile.txt", read
. file read Example2 Line1
. file read Example2 Line2
. file close Example2
. disp "Line1: `Line1'"
Line1: Hello World
. disp "Line2: `Line2'"
Line2: Why so blue?
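When the number of lines is not known in advance, the usual pattern is to keep calling "file read" until r(eof) becomes nonzero. The following is a minimal sketch of that loop; it rewrites ExampleFile.txt first so that it runs on its own, and the temporary handle name and the macro names n and line are illustrative choices.

```stata
* Write a two-line file so the sketch is self-contained
tempname fh
file open `fh' using "ExampleFile.txt", write text replace
file write `fh' "Hello World" _n "Why so blue?" _n
file close `fh'

* Read the file back one line at a time until end of file
file open `fh' using "ExampleFile.txt", read text
local n = 0
file read `fh' line
while r(eof) == 0 {
    local ++n
    display "Line`n': `line'"
    file read `fh' line
}
file close `fh'
```

Using tempname for the handle avoids clashes with any file handle that is already open under a fixed name.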
The phaseout Command
Syntax for phaseout
phaseout SNPlist , idvariable(string) filename(string)
[missing(string) separator(string) positions(string)]
Example
local SNPList "rs1413711 rs3024987 rs3024989"
local PositionsList "674 836 1955"
phaseout `SNPList', idvariable("id") filename("VEGF.inp") ///
    missing("X/X 9/9") positions(`PositionsList') ///
    separator("/")
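To show how the "file" commands could produce a PHASE input file, here is a hypothetical sketch; it is not the phaseout source code. It assumes one string variable per SNP with alleles separated by "/", "X/X" as the missing code, and the PHASE layout as I understand it: number of individuals, number of loci, a "P" line of positions, a line of locus types (S for SNP), then an ID line and two allele lines per individual, with "?" for a missing SNP allele. The toy data and the file name phase_sketch.inp are invented for the example.

```stata
* Hypothetical sketch only: NOT the phaseout source code
clear
input str3 id str3 rs1413711 str3 rs3024987 str3 rs3024989
"P01" "a/g" "c/t" "g/g"
"P02" "g/g" "c/c" "X/X"
end

local SNPList "rs1413711 rs3024987 rs3024989"
local PositionsList "674 836 1955"
local nloci : word count `SNPList'
local nind = _N

* Build the locus-type line: one "S" (SNP) per locus
local types ""
foreach snp of local SNPList {
    local types "`types'S"
}

tempname out
file open `out' using "phase_sketch.inp", write text replace
* Header: individuals, loci, positions, locus types
file write `out' "`nind'" _n "`nloci'" _n
file write `out' "P `PositionsList'" _n
file write `out' "`types'" _n
* One ID line and two allele lines (one per chromosome) per individual
forvalues i = 1/`nind' {
    local thisid = id[`i']
    file write `out' "`thisid'" _n
    * Character 1 is the first allele, character 3 the second ("a/g")
    foreach pos in 1 3 {
        local row ""
        foreach snp of local SNPList {
            local allele = substr(`snp'[`i'], `pos', 1)
            if "`allele'" == "X" local allele "?"
            local row = trim("`row' `allele'")
        }
        file write `out' "`row'" _n
    }
}
file close `out'
type "phase_sketch.inp"
```

Looping over observations with explicit subscripts is slow for large data sets, but it keeps the correspondence between Stata rows and PHASE lines easy to follow.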
The phasein Command
Output file format from PHASE
The phasein Command
Syntax for phasein
phasein PhaseOutputFile [, markers(string)
positions(string)]
Example
phasein VEGF.out, markers("MarkerList.txt") ///
    positions("PositionList.txt")
The HaploView Program
• Once we have inferred our haplotypes, we can
conduct further association analyses using the full
complement of Stata commands.
• We might also want to explore our data in the
popular program HaploView (Barrett et al., 2005)
The haploviewout Command
Syntax for haploviewout
haploviewout SNPlist , idvariable(string) filename(string)
[positions(string)] [familyid(string)] [poslabel]
Example
local MarkerList "rs1413711 rs3024987 rs3024989"
haploviewout `MarkerList', idvariable(id) filename("VEGF") ///
    poslabel
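HaploView reads marker names and positions from a separate marker information file with one marker per line. The sketch below shows how such a file could be written with the "file" commands; it is an illustration and not the haploviewout source code. The positions are borrowed from the phaseout example, and the file name haploview_sketch.info is hypothetical.

```stata
* Hypothetical sketch only: NOT the haploviewout source code
local MarkerList "rs1413711 rs3024987 rs3024989"
local PositionsList "674 836 1955"
local nmark : word count `MarkerList'

tempname info
file open `info' using "haploview_sketch.info", write text replace
* One line per marker: name, tab, position
forvalues j = 1/`nmark' {
    local m : word `j' of `MarkerList'
    local p : word `j' of `PositionsList'
    file write `info' "`m'" _tab "`p'" _n
}
file close `info'
type "haploview_sketch.info"
```

Pairing the two lists by word position only works if both lists have the same length and order, so a real command would want to check that before writing.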
Summary
Compared to recreating “boutique” programs
in Stata, it is relatively easy to create programs
for exporting and importing data.
Acknowledgements
• Grant 1-R01DK073618-02 from the National Institute
of Diabetes and Digestive and Kidney Diseases
• Grant 2006-35205-16715 from the United States
Department of Agriculture.
• Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M College of Veterinary
Medicine
• Roger Newson of Imperial College London
References
• Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and
visualization of LD and haplotype maps. Bioinformatics, 21, 263-265.
• Hartl, D. L., & Jones, E. W. (1998). Genetics: Principles and Analysis, 4th Ed.
Jones & Bartlett Publishers.
• Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods
for Haplotype Reconstruction from Population Genotype Data. American
Journal of Human Genetics, 73, 1162–1169.
• Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method
for Haplotype Reconstruction from Population Data. American Journal of
Human Genetics, 68, 978–989.
• Watson, J. D., Baker, T. A., Bell, S. P., Gann, A., Levine, M., & Losick, R. (2004).
Molecular Biology of the Gene, 5th Ed. Benjamin Cummings.