Bioinformatics techniques COD+EK Nov07
Bioinformatics tools and techniques
Into the heart of darkness
Elaine Kenny
Colm O’Dushlaine
15/11/07
1/18
Summary
Simple overviews of some of the tools and methods used by EK and CO’D
TK notebook
get_hapmap_snps.pl: retrieve HapMap (HM) genotype information for a list of SNPs
GeneViewer.pl & cross_ref.pl: visualise e.g. SNPs in the context of other genomic landmarks. Score SNPs depending on how many of these landmarks they overlap
ld_expander.pl: find SNPs in LD with SNPs of interest, based on a user-specified r² and “LD window” (distance between SNPs)
STATA
VIM: command line text editor
Lab website
2/18
TK notebook
Application for saving notes, to-do lists, daily logs, and any other kind of textual information in a place where you can find it all again, and where related information is easily found
Easy to edit and rapidly searchable
DEMO – editing
DEMO – search
3/18
get_hapmap_snps.pl
Simple script to read in a 1-column list of SNPs and retrieve HapMap genotypes
Can select population and strand
DEMO
Retrieved data can be loaded into HaploView
DEMO
4/18
cross_ref_scored.pl
Score SNPs based on how many putatively functional regions
they overlap with:
On a per gene / chromosome basis
Gene basis:
Type: perl cross_ref_scored.pl file_A file_B file_C ...
where
file_A - 2-column file of SNPs (format = id, location)
file_B - 3-column file of EXONS (format = id/name, start, stop)
file_C ... - any further region files (format = id/name, start, stop),
e.g. other regions like CpGs, TFBS, clusters. Any order.
…
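The scoring idea can be sketched in a few lines of Python. This is not the code of cross_ref_scored.pl, only a minimal illustration under stated assumptions: coordinates are treated as inclusive, each region file contributes at most one point per SNP, and the function name `score_snps` and all IDs are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (id/name, start, stop), as in file_B / file_C


def score_snps(
    snps: List[Tuple[str, int]],
    region_sets: Dict[str, List[Region]],
) -> Dict[str, int]:
    """Count, for each SNP, how many region sets contain its position."""
    scores = {}
    for snp_id, pos in snps:
        hits = 0
        for regions in region_sets.values():
            # Assumption: a region set adds one point at most, however
            # many of its regions overlap the SNP.
            if any(start <= pos <= stop for _, start, stop in regions):
                hits += 1
        scores[snp_id] = hits
    return scores


if __name__ == "__main__":
    # Hypothetical SNPs (id, location) and region sets.
    snps = [("rs1", 150), ("rs2", 900), ("rs3", 5000)]
    region_sets = {
        "exons": [("exon1", 100, 200), ("exon2", 850, 950)],
        "cpg": [("cpg1", 120, 180)],
    }
    print(score_snps(snps, region_sets))
```

A linear scan like this is fine for a single gene; for whole chromosomes, sorting the regions and using a binary search or interval tree would be the usual choice.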
5/18
cross_ref_scored.pl example output:
The output can then be merged with HapMap / Perlegen data to retrieve MAF data for SNPs
6/18
Merge cross_ref_scored data with HapMap/Perlegen data using merge_per_hap.pl
Type:
perl merge_per_hap.pl perlegen.txt hapmap.txt overlapped_region_scored.txt
Where:
perlegen.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq)
hapmap.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq)
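As a rough sketch of what such a merge involves (not the code of merge_per_hap.pl), the frequency files can be read into dictionaries keyed by rsid and attached to each scored SNP. Assumptions: the files are tab-separated, the scored SNPs are available as (rsid, score) pairs, and the helper names `read_freqs`, `merge_scores` and `maf` are hypothetical. For a biallelic SNP, the MAF is the smaller of the reference-allele frequency and one minus that frequency.

```python
import csv
import io


def read_freqs(handle):
    """Read rsid, ref_allele, ref_allele_freq rows into a dict keyed by rsid."""
    return {
        row[0]: (row[1], float(row[2]))
        for row in csv.reader(handle, delimiter="\t")  # assumed tab-separated
        if row
    }


def maf(ref_allele_freq):
    """Minor allele frequency of a biallelic SNP from its reference-allele frequency."""
    return min(ref_allele_freq, 1.0 - ref_allele_freq)


def merge_scores(scored, perlegen, hapmap):
    """Attach Perlegen and HapMap (ref_allele, freq) pairs to scored SNPs."""
    merged = []
    for rsid, score in scored:
        merged.append({
            "rsid": rsid,
            "score": score,
            "perlegen": perlegen.get(rsid),  # None when the SNP is absent
            "hapmap": hapmap.get(rsid),
        })
    return merged


if __name__ == "__main__":
    # Hypothetical in-memory stand-ins for perlegen.txt and hapmap.txt.
    perlegen = read_freqs(io.StringIO("rs1\tA\t0.25\nrs2\tG\t0.9\n"))
    hapmap = read_freqs(io.StringIO("rs1\tA\t0.3\n"))
    for row in merge_scores([("rs1", 2), ("rs3", 0)], perlegen, hapmap):
        print(row)
```

SNPs missing from a source are kept with `None` rather than dropped, so the scored list stays complete.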
7/18
cross_ref.pl applied to WGA data
cross_ref.pl: Scoring SNPs throughout genome
Data analysed on coding/non-coding basis
(coding)
perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.coding.txt 22 \
WTCCC_T2D_chr22_without_inferred.forCrossRef \
WGA_databases/coding_non_synon_SNPs_UCSC.clean=3 \
WGA_databases/coding_synon_SNPs_UCSC.clean=2 \
WGA_databases/RefSeq_Genes_UCSC.byExon.uniqid=1 \
WGA_databases/Triplexes_may2006.bed=2 \
WGA_databases/splice_site_SNPs_UCSC.clean=2 \
> Overlapped_regions_scored.WTCCC.chr22.coding.log &
(input-dependent, coding/non-coding dependent, arbitrary)
(noncoding)
perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.NONcoding.txt 22 \
WTCCC_T2D_chr22_without_inferred.forCrossRef \
WGA_databases/TFBS.chr22=1 \
WGA_databases/CpG_islands_UCSC.uniqid=1 \
WGA_databases/Most_conserved_phastConsElements17way_UCSC.clean=1 \
WGA_databases/promoters_knowngene_hg18.txt=1 \
WGA_databases/sno_or_miRNA_UCSC.uniqid=1 \
> Overlapped_regions_scored.WTCCC.chr22.NONcoding.log &
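The `file=N` arguments in the commands above look like per-database weights. Reading them that way is an assumption, since the slides do not spell out how cross_ref.pl uses the number. Under that assumption, a weighted score could be sketched as follows (the names `parse_weighted_arg` and `weighted_score` are hypothetical):

```python
def parse_weighted_arg(arg):
    """Split a 'path=weight' argument into (path, weight).

    Splits on the last '=' so that paths containing '=' still parse.
    """
    path, sep, weight = arg.rpartition("=")
    if not sep or not path:
        raise ValueError(f"expected path=weight, got {arg!r}")
    return path, int(weight)


def weighted_score(pos, weighted_sets):
    """Sum the weights of every region set that contains position pos.

    weighted_sets is a list of (regions, weight) pairs, where regions is
    a list of inclusive (start, stop) tuples.
    """
    total = 0
    for regions, weight in weighted_sets:
        if any(start <= pos <= stop for start, stop in regions):
            total += weight  # assumed: each set counts once, at its weight
    return total


if __name__ == "__main__":
    print(parse_weighted_arg("WGA_databases/TFBS.chr22=1"))
    # Hypothetical region sets with weights 3, 2 and 1.
    sets = [([(100, 200)], 3), ([(120, 130)], 2), ([(140, 160)], 1)]
    print(weighted_score(150, sets))
```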
8/18
cross_ref.pl
cross_ref.pl output:
Load into STATA. If the SNPs have e.g. association p-values, calculate an adjusted p-value score (R. Anney) as
-log10(P) + cross_ref_score
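The adjustment is simple arithmetic: the association p-value is put on a -log10 scale and the cross_ref score is added to it. A minimal Python sketch, where the function name `adjusted_score` is hypothetical and the range check is an added safeguard rather than part of the slide:

```python
import math


def adjusted_score(p_value, cross_ref_score):
    """Return -log10(P) + cross_ref_score for one SNP."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -math.log10(p_value) + cross_ref_score


if __name__ == "__main__":
    # A SNP with P = 0.001 (3 on the -log10 scale) and a cross_ref score of 3.
    print(adjusted_score(0.001, 3))
```

A SNP with P = 0.001 and a cross_ref score of 3 therefore scores about 6, the same as an unannotated SNP with P = 0.000001.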
9/18
GeneViewer.pl
GeneViewer.pl: Visualise overlapping features (e.g. exons, SNPs) along a region such as your gene of interest (HTML output)
10/18
ld_expander.pl
Find proxies (SNPs in LD) for a list of SNPs
User specifies the r² and “LD window”
Currently configured to obtain proxies from HapMap (HM) CEU
Result is a list of additional proxy SNPs that have been obtained by LD expansion
DEMO
Note: don’t LD expand >150000 SNPs, or HapMap will ban you! CO’D has an alternative version that uses local pre-computed pairwise LD SNP files
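LD expansion from local pairwise LD data can be sketched as a filter on r² and distance. This is not the code of ld_expander.pl or of CO’D’s alternative version. The record layout (snp_a, pos_a, snp_b, pos_b, r2), the function name `find_proxies`, and the default thresholds are all assumptions chosen for illustration.

```python
def find_proxies(targets, ld_pairs, min_r2=0.8, max_distance=250_000):
    """Return SNPs in LD with any target SNP.

    ld_pairs yields (snp_a, pos_a, snp_b, pos_b, r2) records. A pair
    qualifies when r2 >= min_r2 and the SNPs lie within max_distance
    base pairs of each other (the "LD window").
    """
    targets = set(targets)
    proxies = set()
    for snp_a, pos_a, snp_b, pos_b, r2 in ld_pairs:
        if r2 < min_r2 or abs(pos_a - pos_b) > max_distance:
            continue
        # Either member of the pair may be the SNP of interest.
        if snp_a in targets and snp_b not in targets:
            proxies.add(snp_b)
        if snp_b in targets and snp_a not in targets:
            proxies.add(snp_a)
    return sorted(proxies)


if __name__ == "__main__":
    # Hypothetical pairwise LD records.
    pairs = [
        ("rs1", 1000, "rs2", 2000, 0.95),
        ("rs1", 1000, "rs3", 900000, 0.99),  # outside the LD window
        ("rs4", 5000, "rs1", 1000, 0.50),    # r2 too low
        ("rs5", 1500, "rs1", 1000, 0.85),
    ]
    print(find_proxies(["rs1"], pairs))
```

Because the records are streamed one at a time, the same function works on a large pre-computed LD file without loading it all into memory.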
11/18
STATA
Extremely powerful and flexible
>65k rows handled – shock horror!
Can write scripts to automate tasks, e.g. read in a file, do the analysis, save the results
When you use the GUI to run commands, the commands are shown in the command window, so you can save them in a do file
CO’D, EK and R. Anney strongly advocate this as a
platform for both file manipulation and statistical
analysis
12/18
STATA example using WTCCC data
Bipolar Disorder,
Coronary Artery Disease,
Crohn's Disease,
Hypertension,
Rheumatoid Arthritis,
Type 1 Diabetes,
Type 2 Diabetes
http://www.wtccc.org.uk/
13/18
DATA FORMAT
3 folders:
Basic: each case collection against the pooled control groups 58C and UKBS
Combined cases: combining phenotypically relevant case collections (e.g. RA/T1D, autoimmune)
Combined controls: combining other case collections as controls
Data are split by chromosome
14/18
Questions
How do I get all of the chromosome data for my gene of interest into one file?
How do I easily search all of the SNP information for my gene(s) of interest?
Create a “.do” file for all manipulations that you want to carry out on the data
DEMO
Good starting resource:
http://www.ats.ucla.edu/stat/stata/
15/18
VIM
“Vi Improved”: a cross-platform text editor, mainly used on UNIX but also available for Windows
Full list of commands is outside the scope of this demonstration
Very fast and efficient, esp. with search and replace functions on large datasets
Regular expression pattern matching
DEMO
Integrates with Cygwin (www.cygwin.com), a very useful UNIX emulator for Windows
16/18
Group website
Some useful stuff up there!
Please send information about current projects etc. Good for our image as a group, and minimal effort is required on your part
DEMO
17/18
Conclusions
Small summary of some things you can do
Slides and video demonstrations will be online at:
http://www.medicine.tcd.ie/psychiatry/research/neuropsychiatry/Protocols/
CO’D & EK available for advice (Fridays 9-9.02am)
These things will help you in your work!!
18/18