Using large data sets to
study factors associated
with the incidence of
multiple sclerosis.
Tamah Fridman
David Glick
John Kidd
Multiple Sclerosis (MS)
• A complex autoimmune disease with both
acute and chronic phases.
• Confounding factors include:
o genetic background
o viral infections including EBV and HSV
o nutritional factors
o environmental factors such as latitude and
smoking
Multiple Sclerosis (MS)
• In a more general way, this module could
be used to explore the difference between
correlation and causation.
• For use in a course, the instructor will
supply appropriate background information
on the immune response as applied to
MS.
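The difference between correlation and causation can be illustrated with a small simulation. The sketch below uses purely synthetic numbers, not MS data: a hypothetical hidden factor drives two otherwise unrelated variables, which then appear strongly correlated until the hidden factor is removed. All variable names and parameters are assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden factor (the confounder), simulated as standard normal.
confounder = [rng.gauss(0, 1) for _ in range(n)]

# Two variables that each depend on the confounder plus independent noise.
# Neither variable has any direct effect on the other.
var_a = [z + 0.5 * rng.gauss(0, 1) for z in confounder]
var_b = [z + 0.5 * rng.gauss(0, 1) for z in confounder]

r_raw = pearson_r(var_a, var_b)

# Subtract the confounder's contribution. This is only possible here because
# the data are simulated and the confounder is known exactly.
resid_a = [a - z for a, z in zip(var_a, confounder)]
resid_b = [b - z for b, z in zip(var_b, confounder)]
r_adjusted = pearson_r(resid_a, resid_b)

print(f"correlation before adjusting for the confounder: {r_raw:.2f}")
print(f"correlation after adjusting for the confounder:  {r_adjusted:.2f}")
```

In real data the confounder is usually unmeasured or only partly measured, which is why a strong correlation alone cannot establish a cause.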
Multiple Sclerosis (MS)
• There is a vast literature examining the
effects of
o geography
o migration
o infectious diseases
o sunlight related to vitamin D levels
o cigarette smoking
o diet
o hormones
Multiple Sclerosis (MS)
• Over time a number of data sets have been
published that explore relationships between
environmental factors and MS.
• Many of these are single studies that were
later included in one or more “meta-analysis”
articles.
• In addition, there are incidence statistics
available from a variety of sources such as
CDC, World Life Expectancy.com, WHO, and
others.
Multiple Sclerosis (MS)
• In order to demonstrate the module’s
potential, we have constructed several
examples of analysis using a variety of
techniques linking MS incidence to rainfall
and viral diseases via:
o A GIS plot
o A scatter plot
o 3-D Principal Component Analysis (PCA)
• These are based on the same data to
demonstrate that large data sets can be
visualized and analyzed in a variety of ways.
Multiple Sclerosis (MS)
• Link to interactive ArcGIS plot:
• http://arcgis.com/explorer/?open=2e7723700ef942b7a5aa2f8cbd96a5fc&extent=37882315.9514645,2989772.13723539,44144037.3085845,6061929.17807238
Multiple Sclerosis (MS)
• The Excel function “CORREL” was used to compute
correlation coefficients between MS rates and the
rates of a series of viral diseases and a
“lifestyle” disease.
o Hepatitis C: -0.0152
o Cervical cancer: -0.34991
o Liver cancer: -0.25501
o HIV: -0.1451
o Lung cancer: 0.547928
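For readers working outside Excel, the following is a minimal sketch of the same calculation: CORREL returns the Pearson correlation coefficient, which is reproduced here in plain Python. The input is only the six sample rows shown on the sample slide of this deck, so the result will not match the coefficients above, which were computed over the complete 192-country spreadsheet. The function and variable names are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


# Six sample rows only (Afghanistan through Antigua/Bar.), NOT the full data set.
ms_rate = [0.4, 2.8, 0.1, 0.4, 0.2, 0.0]
lung_ca_rate = [7.2, 31.0, 10.6, 21.6, 2.3, 8.3]

r_sample = pearson_r(ms_rate, lung_ca_rate)
print(f"sample r (MS vs lung cancer, 6 countries): {r_sample:.3f}")
```

With so few rows a single country can dominate the coefficient, which is one reason to run the calculation on the complete spreadsheet.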
Multiple Sclerosis (MS)
This slide is a sample—the complete spreadsheet contains 192 countries.
Country        MS rate   Hep C rate   Cervical cancer rate   Liver cancer rate   HIV rate   Lung cancer rate
Afghanistan    0.4       3.8          2.6                    3.8                 0          7.2
Albania        2.8       0.1          1.5                    6.7                 0.2        31
Algeria        0.1       0.1          3.4                    1.3                 2          10.6
Andorra        0.4       0.6          0.8                    4.9                 0          21.6
Angola         0.2       1            12.5                   9.6                 79.2       2.3
Antigua/Bar.   0         0            5.4                    5.2                 19.7       8.3
Multiple Sclerosis (MS)
• The above spreadsheet data were also
used to construct scatter plots of MS versus
Hepatitis C (a viral disease) and MS versus
Lung Cancer (an environmental/lifestyle
disease). These plots follow.
Multiple Sclerosis (MS)
Scatter plot: ms rate (Y) versus Hep C rate (X)
Linear trendline: y = -0.0482x + 0.311, R² = 0.0111
Multiple Sclerosis (MS)
Scatter plot: ms rate (Y) versus lung cancer rate (X)
Linear trendline: y = 0.0173x + 0.0283, R² = 0.3002
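A linear trendline of the kind Excel draws on a scatter plot is an ordinary least-squares fit, and for a single predictor its R² equals the square of the correlation coefficient. The sketch below shows the calculation in plain Python on the six sample rows from the sample slide only, so its slope and R² will differ from the values on the plots, which use the complete spreadsheet. The function name is an assumption made for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("fit is undefined when either column is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / syy
    return slope, intercept, r_squared


# Six sample rows only (Afghanistan through Antigua/Bar.), NOT the full data set.
lung_ca_rate = [7.2, 31.0, 10.6, 21.6, 2.3, 8.3]  # X
ms_rate = [0.4, 2.8, 0.1, 0.4, 0.2, 0.0]          # Y

slope, intercept, r_squared = linear_fit(lung_ca_rate, ms_rate)
print(f"sample fit: y = {slope:.4f}x + {intercept:.4f}, R^2 = {r_squared:.4f}")

# Consistency check on the reported full-data values for lung cancer:
# squaring the reported correlation should give the plot's R squared.
reported_r_lung = 0.547928
print(f"reported r squared (lung cancer): {reported_r_lung ** 2:.4f}")
```

The same check applied to the reported Hepatitis C coefficient (-0.0152) gives roughly 0.0002 rather than the 0.0111 shown on the plot. The source does not explain the difference, so that pair of values is worth rechecking against the full spreadsheet.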
Multiple Sclerosis (MS)
• The complete Excel spreadsheet was also
used in Principal Component Analysis (PCA).
• The data were saved in a tab-delimited
format and then imported into the NIA Array
Analysis Tool for Principal Component
Analysis.
• The results are password protected on this
site:
http://lgsun.grc.nia.nih.gov/ANOVA/index.html
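Since the results above are password protected, the general idea can be sketched locally. The example below is not the NIA Array Analysis Tool's method: it swaps in a plain NumPy singular value decomposition, reads a tab-delimited text of the six sample rows from the sample slide, and standardizes each column first. The short column names, the choice to standardize, and the function names are all assumptions made for illustration.

```python
import csv
import io

import numpy as np

# Six sample rows only, written as tab-delimited text (hypothetical column names).
TSV = (
    "Country\tms\thep_c\tcervical_ca\tliver_ca\thiv\tlung_ca\n"
    "Afghanistan\t0.4\t3.8\t2.6\t3.8\t0\t7.2\n"
    "Albania\t2.8\t0.1\t1.5\t6.7\t0.2\t31\n"
    "Algeria\t0.1\t0.1\t3.4\t1.3\t2\t10.6\n"
    "Andorra\t0.4\t0.6\t0.8\t4.9\t0\t21.6\n"
    "Angola\t0.2\t1\t12.5\t9.6\t79.2\t2.3\n"
    "Antigua/Bar.\t0\t0\t5.4\t5.2\t19.7\t8.3\n"
)


def load_rates(text):
    """Parse tab-delimited text into column names, country names, and a matrix."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    countries, rows = [], []
    for row in reader:
        countries.append(row[0])
        rows.append([float(value) for value in row[1:]])
    return header[1:], countries, np.array(rows)


def pca(data, n_components=3):
    """PCA via SVD on standardized columns (assumes no column is constant)."""
    centered = data - data.mean(axis=0)
    scaled = centered / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)            # variance fraction per component
    scores = u[:, :n_components] * s[:n_components]  # country coordinates
    return scores, vt[:n_components], explained[:n_components]


columns, countries, data = load_rates(TSV)
scores, components, explained = pca(data, n_components=3)

for name, point in zip(countries, scores):
    print(name, np.round(point, 2))
print("variance fraction explained by each of the 3 components:", np.round(explained, 3))
```

The three score columns are the coordinates that a 3-D PCA plot would display, one point per country. With only six rows the components are illustrative at best, so the complete spreadsheet should be used for any real analysis.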
Multiple Sclerosis (MS)
• As something completely different,
meta-analysis data were extracted into Excel
and converted for plotting with PGPLOT, and a
Fortran program was written to analyze and
display these data.
• A great deal of difficulty was encountered
fitting disparate data points into congruent
categories, so the following graphs are shown
with some reservation.
• However, students “inventing” their own
analysis can be expected to encounter similar
problems.
Multiple Sclerosis (MS)
• We are deeply indebted to:
• Ileana Betancourt and Colleen McLinn for
help with GIS
• Jeff Lutgen and Bruce Wiggins for help
with Excel.