Mining and Visualization of Flow Cytometry Data

Mining and Visualization
of Flow Cytometry Data
ANGELA CHIN
UNIVERSITY OF HOUSTON RESEARCH EXPERIENCE FOR UNDERGRADUATES
JULY 3, 2013
1
Contents
1. Introduction to Flow Cytometry
2. The Problem
3. Current Approaches & Results
4. Future Work
2
Flow Cytometry
MEDICAL TECHNIQUE USED FOR CELL COUNTING AND CELL
SORTING
3
How it Works
Picture from: Abcam
http://www.abcam.com/index.html?pageconfig=resource&rid=11446
4
Flow Cytometry Application
Determine whether a person has B-cell lymphoma
Based on the number of clusters that result from flow cytometry
• Two clusters : cancer patient
• Three clusters : healthy individual
5
Example: Flow Cytometry Results
Healthy Patient
Cancer Patient
6
Problems with Current Methods
The process for determining if there are two or three clusters is
manual
Doctors’ time could be better spent on other tasks
7
The Problem
CREATING AN AUTOMATED METHOD TO DETERMINE THE
NUMBER OF CLUSTERS
8
Past Approaches
Many clustering methods exist
• Most need to know the number of clusters ahead of time
Most popular is k-means, but it has some problems
• Need to give the algorithm the number of clusters beforehand
• Has difficulty when clusters are close together, differ greatly in size, etc.
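
To make the first limitation concrete, here is a minimal sketch of one-dimensional k-means (Lloyd's algorithm) in plain Python. It is an illustration only, not the code used in this project; the function name `kmeans_1d` and the choice of starting centers are assumptions. Note that `k` is an argument: the algorithm returns exactly as many centers as it is asked for and never infers the count itself.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension; the caller must choose k."""
    data = sorted(values)
    # Hypothetical initialization: spread the k starting centers over the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        updated = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers
```

Calling `kmeans_1d(values, 2)` and `kmeans_1d(values, 3)` on the same data both succeed, which is why k-means on its own cannot say whether a sample has two or three clusters.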
9
Further Defining the Problem
We want to be able to determine the number of clusters when:
• The distance between clusters is very small
• The ratio of cluster sizes is large (100:1 to 1000:1)
We decided to further constrain the problem so that we could determine:
• 1 cluster vs 2 clusters when the size ratio is up to 1000:1
10
Current Approaches &
Results
11
Two Approaches
Approach #1: Transformation
• Find the center of the data
• For each point, find its angle from the horizontal line through the center (new x-value) and its distance from the center (new y-value)
• Use the transformed data to determine the number of clusters
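
The transformation steps for Approach #1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the "center" is assumed to be the mean of the points, angles are wrapped to the range 0 to 2π, and the name `polar_transform` is hypothetical.

```python
import math


def polar_transform(points):
    """Map each (x, y) point to (angle, distance) measured from the data's center."""
    n = len(points)
    # Assumption: the center of the data is the mean of all points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    transformed = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Angle from the horizontal line through the center, wrapped to [0, 2*pi).
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Distance from the center.
        distance = math.hypot(dx, dy)
        transformed.append((angle, distance))  # (new x-value, new y-value)
    return transformed
```

For example, four points placed at unit distance to the right of, above, to the left of, and below the center map to angles 0, π/2, π, and 3π/2, each with distance 1.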
Approach #2: Testing Normal Fit
• Project the 2D data onto a line to create 1D data
• Fit a normal distribution to the 1D data
• Compare the Bayesian Information Criterion (BIC) of the fit to a cut-off limit
• If the BIC is above the limit, there are two clusters; otherwise, there is one
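
The steps for Approach #2 can be sketched in a few lines. This is a hedged illustration, not the project's code: the slides do not say which line the data is projected onto or what the cut-off is, so `direction` and `cutoff` are hypothetical parameters supplied by the caller. The BIC uses the common convention BIC = k ln(n) - 2 ln(L), with k = 2 parameters (mean and standard deviation), so a larger value indicates a poorer normal fit.

```python
import math


def project_onto_line(points, direction):
    """Project 2D points onto the line through the origin along `direction`."""
    dx, dy = direction
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length  # unit vector along the line
    return [x * ux + y * uy for x, y in points]


def normal_fit_bic(values):
    """BIC of a single normal distribution fitted by maximum likelihood."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Maximized Gaussian log-likelihood: -n/2 * (ln(2*pi*var) + 1).
    log_likelihood = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * math.log(n) - 2 * log_likelihood


def count_clusters(points, direction, cutoff):
    """Return 2 if the normal-fit BIC exceeds the (hypothetical) cut-off, else 1."""
    values = project_onto_line(points, direction)
    return 2 if normal_fit_bic(values) > cutoff else 1
```

A suitable `cutoff` would have to be calibrated on data; the sketch only shows how the comparison is wired together.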
12
Approach #1:
Transformation
13
Approach #1: Transformation
[Figure: plot with axis marks at π/2, π, 3π/2, and 2π]
14
Approach #1: Transformation Process
[Figures: transformation process, with axis marks at π/2, π, 3π/2, and 2π]
15
Approach #1: Transformation
[Figure: plot with axis marks at π/2, π, 3π/2, and 2π]
16
Approach #2: Testing
Normal Fit
17
Approach #2: Testing Normal Fit
3 standard deviations apart, ratio 1:99
ONE CLUSTER BEST FITS
TWO CLUSTER BEST FITS
18
Approach #2: Testing Normal Fit
Comparing the BIC of the one-cluster fit versus the two-cluster fit
All data was generated using 100,000 points and the same standard deviations
The ratio between cluster sizes and the distance between the two clusters (if applicable) were varied
• Ratios: 199:1 to 63:1
• Distance: 1.5 to 5 standard deviations apart
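
The comparison described on this slide can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the two-cluster fit here is a two-component Gaussian mixture fitted with EM, the starting values and the standard-deviation floor are hypothetical choices, and `make_sample` only mimics the setup above (two unit-variance clusters with a given size ratio and separation). BIC is computed as k ln(n) - 2 ln(L), so the fit with the lower BIC is preferred.

```python
import math
import random


def make_sample(n, ratio, distance, seed=0):
    """n points: a large N(0, 1) cluster and a small N(distance, 1) cluster.

    ratio is large:small (99 means 99:1); distance = 0 gives a single cluster.
    """
    rng = random.Random(seed)
    n_small = round(n / (ratio + 1))
    big = [rng.gauss(0.0, 1.0) for _ in range(n - n_small)]
    small = [rng.gauss(distance, 1.0) for _ in range(n_small)]
    return big + small


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def bic(log_likelihood, n_params, n_points):
    # Lower is better with this sign convention.
    return n_params * math.log(n_points) - 2 * log_likelihood


def bic_one_cluster(data):
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return bic(sum(normal_logpdf(x, mean, sd) for x in data), 2, n)


def _mixture_logs(x, weights, means, sds):
    """Log of the first component's weighted density and of the mixture density."""
    a = math.log(weights[0]) + normal_logpdf(x, means[0], sds[0])
    b = math.log(weights[1]) + normal_logpdf(x, means[1], sds[1])
    top = max(a, b)
    return a, top + math.log(math.exp(a - top) + math.exp(b - top))


def bic_two_clusters(data, iterations=100):
    """Fit a two-component 1D Gaussian mixture with EM and return its BIC."""
    n = len(data)
    ordered = sorted(data)
    mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    floor = 0.05 * overall_sd  # assumed lower bound so no component collapses
    # Hypothetical start: one component near each tail of the data.
    means = [ordered[n // 20], ordered[n - 1 - n // 20]]
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to the first component.
        resp = []
        for x in data:
            a, total = _mixture_logs(x, weights, means, sds)
            resp.append(math.exp(a - total))
        # M-step: re-estimate weight, mean, and standard deviation per component.
        for j in (0, 1):
            r = resp if j == 0 else [1.0 - p for p in resp]
            mass = max(sum(r), 1e-9)
            means[j] = sum(p * x for p, x in zip(r, data)) / mass
            var = sum(p * (x - means[j]) ** 2 for p, x in zip(r, data)) / mass
            sds[j] = max(math.sqrt(var), floor)
            weights[j] = min(max(mass / n, 1e-9), 1.0 - 1e-9)
    log_likelihood = sum(_mixture_logs(x, weights, means, sds)[1] for x in data)
    return bic(log_likelihood, 5, n)  # 2 means + 2 sds + 1 free weight
```

For instance, `make_sample(100000, 199, 1.5)` mimics the hardest setting listed above; in pure Python a sample that large takes a while to fit, so a smaller `n` is more practical for quick experiments.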
19
Future Work
21
Future Work
Approach #1:
Determine if there is a way to detect the second cluster in the
transformation
Approach #2:
Use real data to see if a cut-off can be determined
Overall:
After figuring out how to distinguish one and two clusters, extend the
method to two versus three clusters
22
Limitations
Assumes the data follow a Gaussian distribution
Number of clusters limited to two or three
23
Acknowledgements
I would like to thank my research advisor, Dr. Stephen Huang, and
Mitch Shih for their guidance on this project. I would also like to
thank the University of Houston Computer Science Department and
the National Science Foundation for providing me with the
opportunity to participate in the REU.
24