Building the Efficient Site
Optimized Data Migration within a System of Linked Medical Research Databases
By Jared Christopherson, U. of Connecticut
Problem to Address
Medical research requires connections to multiple hospitals, institutions, or online databases
Data must be compiled manually
Process is time-consuming
General Project Goal
Project Goals:
Present data as though from a single source
Give the researcher flexibility with viewing the data
Optimize data flow with site caching
Real World Issue: Data Formatting
TABLE NAME: hiv_resistance_1
[gene] [drug_class] [compound] [aa_mutation] [codon_mutation] [cite]
TABLE NAME: db_data
[gene_name] [drug] [aa_info] [codon_info] [source]
TABLE NAME: ResearchDataHIV
[geneName] [drugClass] [compoundInfo] [aaMutation] [codonMutation]
Solution: Master Template
Gene:
  hiv_resistance_1: gene
  db_data: gene_name
  ResearchDataHIV: geneName
Source:
  hiv_resistance_1: cite
  db_data: source
Codon Mutation:
  hiv_resistance_1: codon_mutation
  db_data: codon_info
  ResearchDataHIV: codonMutation
Master Template: [Gene] [Source] [Codon Mutation]
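The mapping above can be sketched as a lookup table that renames each source field to its Master Template field. This is a minimal illustration, not the project's actual code: the names FIELD_MAPS, MASTER_TEMPLATE, and to_master are hypothetical, and the sample row values are placeholders.

```python
# Hypothetical field maps, copied from the three example tables above.
FIELD_MAPS = {
    "hiv_resistance_1": {"gene": "Gene", "cite": "Source",
                         "codon_mutation": "Codon Mutation"},
    "db_data": {"gene_name": "Gene", "source": "Source",
                "codon_info": "Codon Mutation"},
    # ResearchDataHIV has no source/citation column in the example.
    "ResearchDataHIV": {"geneName": "Gene", "codonMutation": "Codon Mutation"},
}
MASTER_TEMPLATE = ["Gene", "Source", "Codon Mutation"]


def to_master(table_name, row):
    """Rename one source row's fields to Master Template fields.

    Fields the Master Template does not cover are dropped; template
    fields the source table lacks are filled with None.
    """
    field_map = FIELD_MAPS[table_name]
    normalized = {name: None for name in MASTER_TEMPLATE}
    for source_field, value in row.items():
        master_field = field_map.get(source_field)
        if master_field is not None:
            normalized[master_field] = value
    return normalized


# Placeholder values, for illustration only.
row = {"gene_name": "gene-A", "drug": "drug-X",
       "codon_info": "codon-1", "source": "ref-1"}
print(to_master("db_data", row))
```

Keeping the mapping as data rather than code means a new database can be linked by adding one entry to the table.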
Master Templates and Display Templates
Database Store
ID: 1, Stanford Cancer Database, 64.434.343.99, Admin/password
ID: 2, UConn Database, 44.254.292.34, Admin/password
ID: 3, HIV Research Database, 23.32.232.19, Admin/password
Master Template - Cancer
[Age] [Cancer Type] [Drug Class]
ID -> Table Name -> Field Name
Display Template 1
[Age] [Cancer Type] [Drug Class]
Display Template 2
[Cancer Type] [Drug Class]
Master Template - HIV Resistance
[Compound] [Codon] [Result]
ID -> Table Name -> Field Name
Display Template 1
[Compound] [Result]
Display Template 2
[Compound] [Codon] [Result]
Basic Functionality
Provides a simple search for users
Researchers have the option of selecting a pre-set Display Template to display only the data relevant to their needs
Queries each database individually according to the Master Template
Returns results and (optionally) compiles them into a single list
Use AJAX to return results for each database
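The search flow above can be sketched as follows. This is a simplified server-side illustration under stated assumptions: the in-memory DATABASES dictionary stands in for the remote databases, query_database is a hypothetical placeholder for a real remote query, and a thread pool plays the role the slides give to AJAX (each database answers independently). Row values are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for remote databases. Rows are assumed to be
# already normalized to the "HIV Resistance" Master Template.
DATABASES = {
    2: [{"Compound": "compound-A", "Codon": "codon-1", "Result": "result-x"}],
    3: [{"Compound": "compound-B", "Codon": "codon-2", "Result": "result-y"}],
}
DISPLAY_TEMPLATES = {
    "Display Template 1": ["Compound", "Result"],
    "Display Template 2": ["Compound", "Codon", "Result"],
}


def query_database(db_id, term):
    """Placeholder for a real remote query: substring match on Compound."""
    return [row for row in DATABASES[db_id]
            if term.lower() in row["Compound"].lower()]


def search(term, template_name, compile_results=True):
    """Query every database individually, then project each row onto
    the chosen Display Template."""
    fields = DISPLAY_TEMPLATES[template_name]
    per_db = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(query_database, db_id, term): db_id
                   for db_id in DATABASES}
        for future in as_completed(futures):  # each database returns on its own
            db_id = futures[future]
            per_db[db_id] = [{f: row[f] for f in fields}
                             for row in future.result()]
    if not compile_results:
        return per_db  # results kept separate, one list per database
    # Optionally compile everything into a single list, tagged by origin.
    return [dict(row, database_id=db_id)
            for db_id in sorted(per_db) for row in per_db[db_id]]


print(search("compound", "Display Template 1"))
```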
Caching and Optimization
Goal: researchers should have the fastest possible access to the information they seek
X-ray or MRI images can be 2-5 MB each
What if researchers in the US consistently need access to data on a server in Asia?
Local access would be fastest
Caching Goal (diagram: Japan, USA)
Caching Process (diagram: USA)
Caching and Optimization: Possible Solutions
Move everything to a central server
Move records around as they are accessed
Cache everything
Cache databases based on usage
Query and Result Set – No Caching (diagram: Japan, USA, Spain)
Caching Process (diagram: Japan, USA, Spain)
Caching Complete (diagram: Japan, USA, Spain)
Region Caching
Database Caching Queue – What to cache?
For each region, determine the most heavily used external servers, ranked by their percentage of queries
Database Caching Queue
Need a method to determine the most heavily requested external databases for each region
Track statistics:
Convert IP address -> region whenever a user performs a search
Increment result count for the record that keeps track of the region ID and database ID
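A minimal sketch of the statistics tracking described above, assuming an in-memory counter keyed by (region ID, database ID). The REGION_NETWORKS table is hypothetical and uses documentation-only address ranges; a real deployment would presumably rely on a geolocation lookup instead.

```python
import ipaddress
from collections import Counter

# Hypothetical IP-to-region table (documentation-only address ranges).
REGION_NETWORKS = {
    "USA": ipaddress.ip_network("192.0.2.0/24"),
    "JAPAN": ipaddress.ip_network("198.51.100.0/24"),
    "SPAIN": ipaddress.ip_network("203.0.113.0/24"),
}

# One counter entry per (region ID, database ID) record.
usage = Counter()


def ip_to_region(ip):
    """Return the region whose network contains the address, or None."""
    address = ipaddress.ip_address(ip)
    for region, network in REGION_NETWORKS.items():
        if address in network:
            return region
    return None


def record_search(user_ip, results_per_db):
    """Add each database's result count to the searcher's region record."""
    region = ip_to_region(user_ip)
    if region is None:
        return  # unknown origin: nothing to attribute
    for db_id, result_count in results_per_db.items():
        usage[(region, db_id)] += result_count
```

Calling record_search("192.0.2.10", {7: 3, 1: 2}) would add 3 results to the (USA, 7) record and 2 to the (USA, 1) record.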
DB Queue to Cache
USA, results per database:
DB1 (Japan) – 345 results
DB7 (Spain) – 793 results
DB16 (USA) – 539 results
DB3 (Japan) – 491 results
USA, queue to cache:
DB7 (Spain) – 49%
DB3 (Japan) – 30%
DB1 (Japan) – 21%
JAPAN, results per database:
DB1 (Japan) – 212 results
DB7 (Spain) – 343 results
DB16 (USA) – 112 results
DB3 (Japan) – 312 results
JAPAN, queue to cache:
DB7 (Spain) – 75%
DB16 (USA) – 25%
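The slides do not state the formula, but the percentages above are consistent with each external database's share of the results coming from outside the region, with local databases excluded. A short sketch under that assumption (build_cache_queue and DB_REGION are hypothetical names):

```python
def build_cache_queue(result_counts, db_region, region):
    """Rank external databases by their share of external results.

    Databases hosted in the region itself are skipped, since they are
    already local. Returns (database, percent) pairs, largest first.
    """
    external = {db: count for db, count in result_counts.items()
                if db_region[db] != region}
    total = sum(external.values())
    if total == 0:
        return []
    ranked = sorted(external.items(), key=lambda item: item[1], reverse=True)
    return [(db, round(100 * count / total)) for db, count in ranked]


# Figures taken from the slide above.
DB_REGION = {"DB1": "JAPAN", "DB3": "JAPAN", "DB7": "SPAIN", "DB16": "USA"}
usa = {"DB1": 345, "DB7": 793, "DB16": 539, "DB3": 491}
japan = {"DB1": 212, "DB7": 343, "DB16": 112, "DB3": 312}

print(build_cache_queue(usa, DB_REGION, "USA"))
# [('DB7', 49), ('DB3', 30), ('DB1', 21)]
print(build_cache_queue(japan, DB_REGION, "JAPAN"))
# [('DB7', 75), ('DB16', 25)]
```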
Caching and Optimization: Where to cache data
Real-world constraints
allow_cache
supersite
bandwidth
cache_size
Flowchart, first pass (supersites):
Iterate through regions
Get a list of servers by region
Filter by allow_cache
Filter by supersite
Sort by bandwidth
Are there more DBs in the queue? NO: Cache complete. YES: Iterate through server list
Are there more servers? NO: Drop DB from the queue, then iterate to next DB in the queue. YES: Compare cache size needed by DB in queue to space available on server
Is space available? YES: Cache DB. NO: Try the next server
Flowchart, second pass (non-supersites):
Filter by nonsupersite
Sort by bandwidth
Do any servers in the list have space? NO: Cache complete
Are there more DBs in the queue? NO: Cache complete. YES: Iterate through server list
Are there more servers? NO: Drop DB from the queue. YES: Compare cache size needed by DB in queue to space available on server
Is space available? YES: Cache DB. NO: Try the next server
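One way to read the flowchart is as a two-pass placement loop for a single region: supersites first, then non-supersites, each sorted by bandwidth. The sketch below is an interpretation, not the author's script. The Server class, place_databases, and db_sizes are hypothetical, sizes are in arbitrary units, and it assumes a database that fits on no supersite is retried on non-supersites before being dropped for good.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    region: str
    allow_cache: bool   # server owner permits caching
    supersite: bool
    bandwidth: float
    cache_size: int     # free cache space, arbitrary units
    cached: list = field(default_factory=list)


def place_databases(queue, servers, db_sizes, region):
    """Place queued databases on one region's servers.

    queue is a list of database IDs, highest priority first.
    Returns the databases that could not be placed anywhere.
    """
    remaining = list(queue)
    for want_supersite in (True, False):      # pass 1, then pass 2
        candidates = sorted(
            (s for s in servers
             if s.region == region and s.allow_cache
             and s.supersite == want_supersite),
            key=lambda s: s.bandwidth, reverse=True)
        unplaced = []
        for db in remaining:
            # First server (highest bandwidth first) with enough space.
            target = next((s for s in candidates
                           if s.cache_size >= db_sizes[db]), None)
            if target is None:
                unplaced.append(db)           # dropped from this pass
            else:
                target.cached.append(db)
                target.cache_size -= db_sizes[db]
        remaining = unplaced
    return remaining
```

A scheduler would call place_databases once per region, passing that region's cache queue.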
Caching and Optimization: Script Process
Runs at a frequency set by the admin
The process repeats for each region, assigning data to servers with progressively lower bandwidth and cache_size scores until all the server space in that region is exhausted
At the end, each region should have as many local copies of the most frequently requested databases as possible
Cached copies are read-only
Further Work and Improvements
Allow different types of databases (DAL)
Remove overlapping data
Script to determine when individual caches need to be updated