Mining and Visualization of Flow Cytometry Data
ANGELA CHIN
UNIVERSITY OF HOUSTON RESEARCH EXPERIENCE FOR UNDERGRADUATES
JULY 3, 2013
Contents
1. Introduction to Flow Cytometry
2. The Problem
3. Current Approaches & Results
4. Future Work
Flow Cytometry
MEDICAL TECHNIQUE USED FOR CELL COUNTING AND CELL SORTING
How it Works
Picture from: Abcam
http://www.abcam.com/index.html?pageconfig=resource&rid=11446
Flow Cytometry Application
Determine whether a person has B-cell lymphoma
Based on the number of clusters that result from flow cytometry
• Two clusters: cancer patient
• Three clusters: healthy individual
Example: Flow Cytometry Results
Healthy Patient
Cancer Patient
Problems with Current Methods
The process for determining if there are two or three clusters is manual
Doctors’ time could be better spent on other tasks
The Problem
CREATING AN AUTOMATED METHOD TO DETERMINE THE NUMBER OF CLUSTERS
Past Approaches
Many clustering methods exist
• Most need to know the number of clusters ahead of time
The most popular is k-means, but it has some problems
• The number of clusters must be given to the algorithm beforehand
• It has difficulty when clusters are close together, differ greatly in size, etc.
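A minimal sketch of the k-means limitation described above, using only the Python standard library. The function name `kmeans_1d`, the 1D setting, the 100:1 size ratio, and the 3-standard-deviation separation are illustrative assumptions, not details from the slides. Note that `k` has to be passed in; the algorithm does not discover it.

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Plain Lloyd's k-means on 1D data; the caller must supply k."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # start from k random data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        new_centers = [sum(g) / len(g) if g else centers[j]
                       for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

# Hypothetical data: a large cluster and a small one, 100:1, 3 SDs apart.
rng = random.Random(1)
big = [rng.gauss(0.0, 1.0) for _ in range(5000)]
small = [rng.gauss(3.0, 1.0) for _ in range(50)]
print(kmeans_1d(big + small, k=2))
```

With clusters this unbalanced and this close, k-means tends to reduce its error more by splitting the large cluster than by isolating the small one, so the returned centers need not match the true cluster means.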
Further Defining the Problem
We want to be able to determine the number of clusters when:
• The distance between clusters is very small
• The ratio of cluster sizes is large (100:1 to 1000:1)
We decided to constrain the problem further, to distinguishing:
• 1 cluster vs. 2 clusters when the size ratio is up to 1000:1
Current Approaches & Results
Two Approaches
Approach #1: Transformation
• Find the center of the data
• For each point, find its angle from the horizontal line through the center (new x-value) and its distance from the center (new y-value)
• Use the transformed data to determine the number of clusters
Approach #2: Testing Normal Fit
• Project the 2D data onto a line to create 1D data
• Apply a normal distribution fit
• Compare the Bayesian Information Criterion (BIC) of the fit to a cut-off limit
• If the BIC is above the limit, there are two clusters; otherwise, there is one
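The steps of Approach #2 can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the slides do not say which line the data is projected onto or what the cut-off is, so `direction` and `cutoff` are caller-supplied placeholders, and the function names are hypothetical. The BIC is taken as k·ln(n) − 2·ln(L) with k = 2 fitted parameters (mean and variance).

```python
import math
import random

def project(points, direction):
    """Project 2D points onto the line through the origin along `direction`."""
    dx, dy = direction
    norm = math.hypot(dx, dy)
    return [(x * dx + y * dy) / norm for x, y in points]

def normal_fit_bic(values):
    """BIC of a maximum-likelihood normal fit: k*ln(n) - 2*ln(L), k = 2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Maximized log-likelihood of a normal distribution.
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * math.log(n) - 2 * log_lik

def count_clusters(points, direction, cutoff):
    """Return 2 if the BIC exceeds the (assumed) cut-off, otherwise 1."""
    return 2 if normal_fit_bic(project(points, direction)) > cutoff else 1

# Hypothetical usage on a single synthetic 2D cluster.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
print(normal_fit_bic(project(points, (1.0, 1.0))))
```

Because the maximized normal log-likelihood depends only on the sample size and the sample variance, a fixed cut-off is only comparable across data sets with the same number of points and the same scale.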
Approach #1: Transformation
Approach #1: Transformation
Figure: transformation plot (axis ticks at π/2, π, 3π/2, 2π)
Approach #1: Transformation Process
Figure: transformation process (axis ticks at π/2, π, 3π/2, 2π)
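A minimal sketch of the transformation in Approach #1, assuming the "center of the data" is the centroid (the mean of the x and y coordinates); the slides do not specify how the center is found. The function name `polar_transform` is hypothetical. Each point maps to (angle, distance), with the angle measured from the horizontal line through the center and wrapped into the range 0 to 2π.

```python
import math

def polar_transform(points):
    """Map each (x, y) to (angle from horizontal, distance from centroid)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    transformed = []
    for x, y in points:
        # atan2 returns values in [-pi, pi]; wrap them into [0, 2*pi).
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        distance = math.hypot(x - cx, y - cy)
        transformed.append((angle, distance))
    return transformed

# Four points placed symmetrically around the origin.
for angle, distance in polar_transform([(1, 0), (0, 1), (-1, 0), (0, -1)]):
    print(round(angle, 4), round(distance, 4))
```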
Approach #1: Transformation
Figure: transformation plot (axis ticks at π/2, π, 3π/2, 2π)
Approach #2: Testing Normal Fit
Approach #2: Testing Normal Fit
3 standard deviations apart, ratio 1:99
ONE CLUSTER BEST FITS
TWO CLUSTER BEST FITS
Approach #2: Testing Normal Fit
Comparing the BIC of one cluster versus two clusters
All data was generated using 100,000 points and the same standard deviations
The ratio between cluster sizes and the distance between the two clusters (if applicable) were varied
• Ratios: 199:1 to 63:1
• Distance: 1.5 to 5 standard deviations apart
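The experiment setup above could be reproduced roughly as follows. This is a sketch under assumptions: the data is generated directly in 1D, the large cluster sits at 0, a separation of 0 stands in for the one-cluster case, and the BIC is that of a single normal fit with two parameters. The function names and the specific ratio and distance values sampled from the stated ranges are illustrative choices.

```python
import math
import random

def make_mixture(n, ratio, separation, sd=1.0, seed=0):
    """n points: a large cluster at 0 and a small one `separation` SDs away.

    `ratio` is large:small, so ratio=199 means 199:1.
    """
    rng = random.Random(seed)
    n_small = round(n / (ratio + 1))
    big = [rng.gauss(0.0, sd) for _ in range(n - n_small)]
    small = [rng.gauss(separation * sd, sd) for _ in range(n_small)]
    return big + small

def normal_fit_bic(values):
    """BIC of a maximum-likelihood normal fit with 2 parameters."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * math.log(n) - 2 * log_lik

n = 100_000  # matches the number of points stated on the slide
baseline = normal_fit_bic(make_mixture(n, ratio=199, separation=0.0))
print(f"one cluster: BIC = {baseline:.1f}")
for ratio in (199, 63):
    for separation in (1.5, 5.0):
        bic = normal_fit_bic(make_mixture(n, ratio, separation))
        print(f"ratio {ratio}:1, {separation} SD apart: BIC = {bic:.1f}")
```

Comparing the printed values against the one-cluster baseline gives a feel for how the gap between the two cases varies with size ratio and separation, which is what a cut-off would have to exploit.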
Future Work
Future Work
Approach #1:
• Determine if there is a way to detect the second cluster in the transformation
Approach #2:
• Use real data to see if a cut-off can be determined
Overall:
• After figuring out how to distinguish one and two clusters, extend the method to two versus three clusters
Limitations
Assumes the data will have a Gaussian distribution
Number of clusters limited to two or three
Acknowledgements
I would like to thank my research advisor, Dr. Stephen Huang, and
Mitch Shih for their guidance on this project. I would also like to
thank the University of Houston Computer Science Department and
the National Science Foundation for providing me with the
opportunity to participate in the REU.