Media:CAREcenterInformaticsCall102006


CARE Center Informatics
Subcommittee
Background and Progress Report for
10/20/06 Conference Call
Marcia Nizzari
Background Information
On CARE informatics, GAP
(Genetic Analysis Platform) and
Informatics Team
CaRE Center Informatics
• Builds on existing Genetic Analysis Platform
– Operational for 2+ years
– Genotyping and Resequencing
– Code base successfully reused
• CaRE Center enhancements:
– Data sharing strategy
– Phenotype/Trait thesaurus, meta thesaurus
– Customizable analytic pipelines
GAP by the Numbers…
• 510,000 lines of working source code
• Very large databases; one of the largest
tables has 640 M rows
• Standard industry metrics (SLOCCount)
estimate that this code base required
– $18M to develop (actual: ~ $4.5M)
– Staff of 40 for 3.5 years (actual: average 12/yr)
• Pretty good deal!
– Truly a World Class informatics team
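The SLOCCount figures above can be roughly reproduced with a short calculation. This is only a sketch: it assumes the tool's default Basic COCOMO "organic" constants (effort 2.4 x KSLOC^1.05, schedule 2.5 x effort^0.38, a $56,286 annual salary and a 2.4 overhead multiplier) applied to the whole 510,000-line code base at once. SLOCCount normally estimates each component separately, so the real report may differ somewhat. The function name and the returned keys are illustrative, not part of any tool.

```python
# Sketch of a Basic COCOMO (organic mode) estimate for the GAP code base.
# Constants are assumed to be SLOCCount's defaults; the actual run may
# have used different settings or per-component estimates.
EFFORT_FACTOR, EFFORT_EXPONENT = 2.4, 1.05      # person-months from KSLOC
SCHEDULE_FACTOR, SCHEDULE_EXPONENT = 2.5, 0.38  # calendar months from effort
SALARY_PER_YEAR = 56_286.0                      # assumed average salary (USD)
OVERHEAD = 2.4                                  # assumed overhead multiplier


def cocomo_estimate(ksloc):
    """Return effort, schedule, staffing and cost for `ksloc` thousand lines."""
    effort_pm = EFFORT_FACTOR * ksloc ** EFFORT_EXPONENT
    schedule_months = SCHEDULE_FACTOR * effort_pm ** SCHEDULE_EXPONENT
    return {
        "effort_person_years": effort_pm / 12.0,
        "schedule_years": schedule_months / 12.0,
        "average_staff": effort_pm / schedule_months,
        "cost_usd": (effort_pm / 12.0) * SALARY_PER_YEAR * OVERHEAD,
    }


estimate = cocomo_estimate(510.0)  # 510,000 lines of source code
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the estimate lands close to the slide's numbers: a schedule of about 3.5 years, an average staff of about 40, and a cost in the neighborhood of $18M to $19M.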
GAP by the #’s (cont’d.)
• User statistics:
– 222 logins for the system (internal & external)
– Training by Nina (since 12/05):
• 47 people trained individually
• 25 small group sessions held
• Jira statistics:
– 1,722 issues logged since Jira went into production
– 1,330 resolved
– 392 open/in progress
– Daniel Mirel is the champion Jira user
GAP by the #’s (cont’d)
• Informatics staff statistics:
– Total software dev experience: ~200 years
– Degrees held:
• 1 PhD (neuroscience)
• 6 master’s degrees
– 4 comp sci, 1 molbio, 1 manufacturing
• 4 engineering bachelor’s degrees
– 2 EE, 1 ChemEng, 1 SoftEng
• 3 biochem/molbio/biology bachelor’s degrees
• 4 comp sci bachelor’s degrees
• 1 physics bachelor’s degree
User Workflow in Software
(Workflow diagram; labels recovered from the slide)
• Data domains: Samples & Clinical Information; Genome Sequence & Genetic Variation
• Components: Biological Sample Platform; Project Management; Genotyping Pipeline (ESP); Resequencing Pipeline; Genetic Analysis
• External sources: NCBI; HapMap; dbSNP; DCC (Celera); Purchased SNPs
• Steps:
– Plan Experiment in Project Management
– Execute Genotyping Experiment
– Execute Resequencing Experiment
– Perform analysis and loop back to next round of experiment planning
CARE Association Study Workflow
(Workflow diagram; labels recovered from the slide)
• Production: Sample Mgt, Project Mgt, Genotyping
– Upload Samples, Peds, Individuals, Phenotypes
– Create Experiments (Samples x Features)
– Design and Execute Experiments
– QC/Curate Results
– Databases: Sample DB; Project DB; Feature DB; LIMS DBs
• Data Compile Web Services; Data Vault
• Analysis: Gene Pattern + custom analysis tools
– Summarize/Filter
– PLINK
– Association & Statistics Viewers
– Custom Algorithms, Viewers
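The "Create Experiments (Samples x Features)" step can be read as pairing every selected sample with every selected feature (for example, a SNP). The sketch below illustrates that reading only. The function name and the sample and feature identifiers are hypothetical, and the real Project Management component may build experiments differently.

```python
from itertools import product


def create_experiment(sample_ids, feature_ids):
    """Return one (sample, feature) pair for each cell of the experiment grid.

    Assumption: an experiment is the full cross product of the chosen
    samples and features; this is an interpretation of the slide label.
    """
    return list(product(sample_ids, feature_ids))


# Hypothetical identifiers, for illustration only.
samples = ["SAMPLE-001", "SAMPLE-002"]
features = ["FEATURE-A", "FEATURE-B", "FEATURE-C"]

experiment = create_experiment(samples, features)
print(len(experiment))  # number of sample/feature cells to genotype
```

With two samples and three features the grid holds six cells, one per sample and feature combination.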
Phenotype Component
Conceptual Architecture
(Architecture diagram; labels recovered from the slide)
• Thesauri, Meta Thesaurus for CARE
• Controlled Vocabulary Constraints
• Base Component
• Phenotype Inquiry
• Phenotype Capture and Validation
• One ontology: either Group or Project specified
CARE Progress Report
• PhenoMall functionality
– Rapid enhancement of capture function
• Meta data
• Mapping of all CARE phenotypes looks good
– Major enhancements for pheno inquiry
– Informatics goals of pilot
• Figure out how far up the controlled
vocab/thesaurus stack we need to go
• What curation tools are needed?
• Requirements gathering beyond pilot
• Awaiting the decision on data sharing…
Deliverables
• NIH Application/System Security Plan
– Two major revisions, July 17th and Oct 16th
– Security officer at NHLBI is Cindy Walczak
• When data sharing model decided:
– Research technologies and approaches; make
recommendation to subcommittee
– Spec/design and review by subcommittee
• Working pilot
– Need to discuss when to demo: Feb meeting
in Bethesda?
Security Considerations
Security Layers - General
• There are at least three levels:
– MIT firewalls
• Penetration testing, Tripwire, packet monitoring, etc.
– Broad
• New Cisco firewalls
• Route to host servers
– Explicit Allows only
• Wireless access goes out to MIT firewall
• Open jack goes to Broad firewall
– CARE Center application itself
(Network diagram; labels recovered from the slide)
• Zones: The World; MIT; The Broad Institute
• Devices: Internet “Cloud”; Firewalls (Cisco ASA 5540, shown twice); Core Router
• Radius DB: used for authentication for VPN access
• Hosts: Host A; Host B; Host on server
• Access Rules for Subnets: explicit allows, e.g., allow host on LIMS to talk to host on server
• Allow Rules: explicit allows; must be in the list to permit access
– http = 80 -> host
– ssh = 22 -> host
– https = 443 (SSL)
• Other labels: On LIMS; Open jack; Wireless; Unregistered 10.10 domain
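The "explicit allows only" policy amounts to a default-deny allow list: traffic passes only if a rule names its port and destination host. Here is a minimal sketch of that idea, not the Cisco ASA configuration itself. The AllowRule type, the is_allowed function and the host name "host-a" are all hypothetical; only the three ports come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    protocol: str   # label only, e.g. "http"
    port: int       # destination port
    dest_host: str  # destination host the rule applies to


# Ports are taken from the slide; "host-a" is a made-up host name.
ALLOW_RULES = frozenset({
    AllowRule("http", 80, "host-a"),
    AllowRule("ssh", 22, "host-a"),
    AllowRule("https", 443, "host-a"),
})


def is_allowed(port, dest_host, rules=ALLOW_RULES):
    """Default deny: permit traffic only if an explicit rule matches."""
    return any(r.port == port and r.dest_host == dest_host for r in rules)


print(is_allowed(443, "host-a"))   # listed port and host
print(is_allowed(3306, "host-a"))  # port not in the list
print(is_allowed(80, "host-b"))    # host not in the list
```

Anything not in the list is rejected without needing a matching deny rule, which is the property the slide emphasizes.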
Security Layers - Application
• Genetic Analysis Platform application
security:
– Role-based security
– Passwords that expire
– Audit trails track user activity
• Detailed information available in NIH
Application/System Security Plan for
CARE Center
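The three application-level controls listed above can be combined in a single authorization check. The sketch below is illustrative only and is not the GAP implementation: the role names, permissions, 90-day password lifetime, and the User, AuditTrail and authorize names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

PASSWORD_MAX_AGE = timedelta(days=90)  # hypothetical expiry policy

# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS = {
    "lab_technician": {"upload_samples"},
    "scientist": {"run_analysis", "view_results"},
}


@dataclass
class User:
    name: str
    roles: set
    password_set_at: datetime


@dataclass
class AuditTrail:
    entries: list = field(default_factory=list)

    def record(self, when, user_name, action, outcome):
        self.entries.append((when, user_name, action, outcome))


def authorize(user, action, now, audit):
    """Check password age, then roles; log every attempt to the audit trail."""
    if now - user.password_set_at > PASSWORD_MAX_AGE:
        outcome = "denied: password expired"
    elif any(action in ROLE_PERMISSIONS.get(role, set()) for role in user.roles):
        outcome = "allowed"
    else:
        outcome = "denied: role lacks permission"
    audit.record(now, user.name, action, outcome)
    return outcome == "allowed"
```

Every attempt is recorded whether it succeeds or fails, so the audit trail reflects denied access as well as normal activity.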
Summary: Issues/Questions
• Scope of phenotype-related enhancements
• Group/Project structure for CaRE Center
• CaRE user visibility into Process
Dashboard/LIMS
• Data release model decision
– Data Enclave scenarios and security
• User training and documentation
– Analysis methodology
– System and security training
Security for Production & Analysis
(Security diagram; labels recovered from the slide)
• Users in JAAS domain: BSP Lab Technician; CaRE Cohort Technician; Broad Lab Technician, Coordinator; CaRE Scientist
• Project Management: Groups, Projects, Grants, Panels, Feature Sets, Sample Sets
– Proj Mgt Security Context (Project)
• Process/LIMS
– Lab Security Context (X-Project)
• Biological Samples Platform
– BSP Security Context (Sample Collection)
• Analysis Pipelines
– CaRE Analysis Security Context (scope based on rules of Data Enclave; could cover multiple Projects)
• Shareable Objects: Peds, Individuals, Phenotypes, Samples, Features
• Other labels: LSIDs; PIPS DB; Feature DB
Postlude
How Users Can Help
• Specify! We need things nailed down…
• The classic specification:
– Genesis 6:14 - 16 (NKJV) 14 "Make yourself an ark
of gopherwood; make rooms in the ark, and cover it
inside and outside with pitch. 15 "And this is how you
shall make it: The length of the ark shall be three
hundred cubits, its width fifty cubits, and its height
thirty cubits. 16 "You shall make a window for the ark,
and you shall finish it to a cubit from above; and set
the door of the ark in its side. You shall make it with
lower, second, and third decks."
• We live in the world of 0’s and 1’s!
Informatics Development Team
Jason Carey
Kristian Cibulskis
Michael Dinsmore
Tim Fennell
George Grant
Bob Handsaker
Nina Lapchyk
Pei Lin
James Nemesh (CH)
Huy Nguyen
Howard Rafal
Greg Rushton
Dennis Ryan
David Tefft
Alex Thomson
Ellen Winchester
Alec Wysoker
Names in bold have significant time allocated to CARE center activity.