Database Systems Group
Research Overview 2010
1
OLAP Statistical Tests
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Goal: Isolate factors that cause
significant changes in a
measured value
– Ex: Increase in age causes increase
in risk for heart disease
• Combined OLAP with Means
Comparison Parametric Test
– Used to pair similar groups and
determine if they are significantly
different
– Want to reject hypothesis that the
two groups have the same mean
• Developed a GUI that allows for
easy user interaction
2
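A minimal sketch of the means comparison idea, assuming a two-sample test with a normal approximation for the p-value (the slides do not say which parametric test or which data were used; the function name and the sample values below are hypothetical):

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z_test(group_a, group_b):
    """Return (z, p) for the null hypothesis that both groups share one mean."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    z = (mean(group_a) - mean(group_b)) / se
    # Two-sided p-value under the normal approximation (an assumption here).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical measured values for two groups paired by an OLAP grouping.
older = [62, 70, 58, 66, 74, 69, 61, 72]
younger = [41, 38, 45, 50, 36, 44, 39, 47]

z, p = two_sample_z_test(older, younger)
reject_same_mean = p < 0.05  # reject "same mean" at the 5% level
```

A small p-value means the hypothesis that the two groups have the same mean can be rejected, which is the outcome the test looks for.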
OLAP Statistical Tests
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Association Rules – technique used to detect patterns within
items of dataset
– High Age, High Cholesterol => Heart Disease
• Compare results from both techniques
• OLAP Statistical Test discovered more rules than Association
Rules
– p-value is more reliable than confidence (it considers the probability density function)
– OLAP affected less by distribution than AR
• AR better when performance is priority and data is skewed
• OLAP Statistical Test better when data is distributed
3
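For comparison with the p-value, the support and confidence of a rule such as the one above can be sketched as follows. The records are hypothetical, and the item names are shortened only for illustration:

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record) / len(records)


def confidence(records, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)


# Hypothetical patient records, each a set of items.
records = [
    {"HighAge", "HighChol", "HeartDisease"},
    {"HighAge", "HighChol", "HeartDisease"},
    {"HighAge", "HighChol"},
    {"HighAge"},
    {"HeartDisease"},
]

rule_support = support(records, {"HighAge", "HighChol", "HeartDisease"})
rule_confidence = confidence(records, {"HighAge", "HighChol"}, {"HeartDisease"})
```

Confidence is a ratio of counts only, so it does not take the distribution of a measured value into account the way a p-value does.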
OLAP Statistical Test versus
Association Rules
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• Blue and red lines represent
location of the averages of
the two groups
– Averages are fairly different
from one another
• Confidence says that the
two groups are similar
– Many blue points above 50
– Many red points above 50
– confidence is low
4
OLAP Exploration with UDF
Zhibo Chen
Advisor: Dr. Carlos Ordonez
• On-Line Analytical Processing (OLAP)
– Set of techniques allowing users to explore
various aggregations of a dataset
– Ex: dataset with day, month, year, sales
• What were average sales for Sundays?
• Solve by grouping on day and then extracting
Sunday
• Normally done outside the database or
with OLAP servers
– We want to study how to perform the same
techniques inside the DBMS (SQL or UDF)
– Found that users can efficiently perform
OLAP exploration using UDFs
5
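The Sunday example can be sketched inside a DBMS with plain SQL. This uses an in-memory SQLite database as a stand-in; the table name, column names, and rows are assumptions, not the data used in the study:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (day TEXT, month INTEGER, year INTEGER, amount REAL)"
)
# Hypothetical rows: day, month, year, sales amount.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Sunday", 1, 2010, 100.0),
        ("Sunday", 2, 2010, 140.0),
        ("Monday", 1, 2010, 80.0),
        ("Monday", 2, 2010, 60.0),
    ],
)

# Group on day, then extract the Sunday aggregate.
avg_by_day = dict(
    conn.execute("SELECT day, AVG(amount) FROM sales GROUP BY day")
)
sunday_avg = avg_by_day["Sunday"]
```

Grouping on other columns (month, year, or combinations of them) gives the other aggregations a user would explore.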
Digital Libraries in a DBMS
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
• Information retrieval techniques have traditionally been
exploited outside relational database systems due to storage
overhead, the complexity of fitting them into a relational model, and
slower performance in SQL implementations.
• Searching and querying documents under information retrieval
models in relational database systems can be performed with
optimized SQL.
• We explore three phases:
– Document preprocessing.
– Document storage.
– Document retrieval (VSM, OPM, DPLM).
6
Keyword Search Across Document and Databases
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
• Sometimes the meaning and structure of a database are
unknown.
• There are external semi-structured sources that can help to
describe it.
• We found that we can link these two worlds to identify
relationships between the structured data and the semi-structured data.
• We believe the right approach is to do this inside the
database.
• We implemented a prototype entirely in SQL.
7
Bayesian Statistics
Carlos Garcia-Alvarado
Advisor: Dr. Carlos Ordonez
• Latest trend in advanced statistics; very demanding in CPU time
and on large data sets.
• Applied to microarray data in the DBMS. The problem involves
high-dimensional data with few samples.
• Variable selection is the first issue that we have been trying to
solve. Searching for the best model is computationally expensive
(2^d candidate models), where d is the number of dimensions.
• By applying SQL optimizations and data layout modifications, we
obtain selections over more than 1 M dimensions in less than
3 seconds, but this is still not enough.
• Current work: Gibbs Sampler Variable Selection.
8
PCA
Mario Navas
Advisor: Dr. Carlos Ordonez
• Black-box
• Rotation of the input space
• Make the representative components evident
• No covariance between attributes
• Variance represented by the eigenvalues
• Deal with high dimensionality
9
DB Implementation
• Summary matrices n, L, Q
• Correlation matrix
• Eigenvalue decomposition problem
10
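A minimal sketch of the first two steps, assuming n is the row count, L the vector of per-attribute sums, and Q the matrix of sums of cross-products (these definitions are assumptions, since the slide only names the matrices; the function names and data are hypothetical). The eigenvalue decomposition step is not shown:

```python
import math


def summary_matrices(rows):
    """Compute n, L, Q in a single pass over the data set."""
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def correlation_matrix(n, L, Q):
    """Derive the correlation matrix from n, L, Q without rescanning the data."""
    d = len(L)
    rho = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            numerator = n * Q[a][b] - L[a] * L[b]
            denominator = math.sqrt(n * Q[a][a] - L[a] ** 2) * math.sqrt(
                n * Q[b][b] - L[b] ** 2
            )
            rho[a][b] = numerator / denominator
    return rho


# Hypothetical data: column 1 doubles column 0, column 2 decreases with it.
rows = [(1.0, 2.0, 4.0), (2.0, 4.0, 3.0), (3.0, 6.0, 2.0), (4.0, 8.0, 1.0)]
n, L, Q = summary_matrices(rows)
rho = correlation_matrix(n, L, Q)
```

Because the correlation matrix depends only on n, L, and Q, the data set needs to be scanned once, and the remaining work is on small d x d matrices.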
Outlier detection in
microarray data
• Deal with high dimensionality
• Redundancy minimized
• Find distance-based outliers in a reduced space
[Figure: PCA-based Outliers [2D] compared with Distance-based Outliers [7D], and PCA-based Outliers [2D] compared with Distance-based Outliers [126]; matching top 10]
11
Bayesian Classification Based On
Decomposition via Clustering
Sasi Kumar Pitchaimalai
Advisor: Dr. Carlos Ordonez
• An Extension Of Naïve
Bayes.
• Class Decomposition of the
Gaussians Using Clustering
• Using K-Means and E-M
• Scalability - Query
Optimizations for
Computationally and
Memory Intensive
Computations
• Incremental Learning of the
Classifier
12
Computing Distance & Sufficient
Statistics Using SQL & UDFs
Sasi Kumar Pitchaimalai
Advisor: Dr. Carlos Ordonez
• Five different SQL optimizations and one
User Defined Function (UDF) to compute
Euclidean distance in K-Means
• Sufficient Statistics – Count, Linear Sum
and Quadratic Sum for multiple clusters
and multiple classes computed in a single
data set scan using SQL or a UDF.
13
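A sketch of the sufficient statistics computed in one scan with a single GROUP BY query, run here against in-memory SQLite. The table layout (class g, cluster j, two dimensions x1 and x2), the data, and the use of per-dimension (diagonal) quadratic sums are all assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (g TEXT, j INTEGER, x1 REAL, x2 REAL)")
# Hypothetical points, already labeled with class g and nearest cluster j.
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?, ?)",
    [
        ("A", 1, 1.0, 2.0),
        ("A", 1, 3.0, 4.0),
        ("A", 2, 10.0, 10.0),
        ("B", 1, 0.0, 1.0),
    ],
)

# One scan: count (N), linear sums (L) and quadratic sums (Q)
# for every class and cluster pair.
query = """
    SELECT g, j,
           COUNT(*),
           SUM(x1), SUM(x2),
           SUM(x1 * x1), SUM(x2 * x2)
    FROM points
    GROUP BY g, j
"""
stats = {(g, j): (n, l1, l2, q1, q2)
         for g, j, n, l1, l2, q1, q2 in conn.execute(query)}
```

The means and variances of each cluster can then be derived from N, L, and Q alone, without reading the data set again.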
Fast Bayesian Classifier Based on
FREM
• The Algorithm
– Initialization : Randomly initialize k clusters per class
from the data set.
– E-step : Compute Mahalanobis distance, find nearest
cluster and then compute sufficient statistics.
– M-step : Recompute the mean and variances and
weight of the clusters per class. Mixture parameters
updated in this step.
– SplitClusters : Splitting Heavy Clusters to reach
higher quality solutions and reseeding low weight
clusters.
– The E-step and M-step are iterated until the model
converges.
14
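The E-step and M-step loop above can be sketched for the clusters of one class as follows. This is a simplified illustration, not the FREM implementation: it assumes diagonal covariances, uses fixed starting means instead of random initialization so the run is reproducible, and omits the SplitClusters step. All names and data are hypothetical:

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance, assuming a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def em_step(points, means, variances, min_var=1e-6):
    k, d = len(means), len(means[0])
    # E-step: find the nearest cluster and accumulate N, L, Q per cluster.
    N = [0] * k
    L = [[0.0] * d for _ in range(k)]
    Q = [[0.0] * d for _ in range(k)]
    for x in points:
        j = min(range(k),
                key=lambda c: mahalanobis_sq(x, means[c], variances[c]))
        N[j] += 1
        for a in range(d):
            L[j][a] += x[a]
            Q[j][a] += x[a] * x[a]
    # M-step: recompute means, variances and weights from N, L, Q.
    new_means, new_vars, weights = [], [], []
    for j in range(k):
        if N[j] == 0:  # empty cluster: keep its previous parameters
            new_means.append(list(means[j]))
            new_vars.append(list(variances[j]))
            weights.append(0.0)
            continue
        m = [L[j][a] / N[j] for a in range(d)]
        v = [max(Q[j][a] / N[j] - m[a] ** 2, min_var) for a in range(d)]
        new_means.append(m)
        new_vars.append(v)
        weights.append(N[j] / len(points))
    return new_means, new_vars, weights


def fit(points, means, variances, max_iter=50, tol=1e-9):
    """Iterate E-step and M-step until the means stop moving."""
    weights = [0.0] * len(means)
    for _ in range(max_iter):
        new_means, variances, weights = em_step(points, means, variances)
        shift = max(abs(a - b)
                    for m_new, m_old in zip(new_means, means)
                    for a, b in zip(m_new, m_old))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights


# Hypothetical points of one class forming two groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
means, variances, weights = fit(
    points,
    means=[[0.0, 0.0], [6.0, 6.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```

For a classifier, the same loop would be run once per class, giving k clusters per class as described in the initialization step.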
Constrained Association Rules in SQL
Kai Zhao
Advisor: Dr. Carlos Ordonez
• Association rules are a data mining technique used to discover frequent
patterns in a data set. Real-world applications of this technique are broad
and include fields such as medicine and commerce. We can
automatically generate efficient SQL queries for discovering
association rules.
15
Comparison between CAR and DT
Kai Zhao
Advisor: Dr. Carlos Ordonez
• CAR perform an exhaustive combinatorial
search whereas DT recursively partition
the input attribute space.
• CAR aim to find all rules above the given
thresholds whereas DT find regions in
space where most records belong to the
same class.
• CAR analyze item combinations whereas
DT select only one input attribute at a
time.
16
Frequent Subgraph Mining
Kai Zhao
Advisor: Dr. Carlos Ordonez
• Frequent subgraph
– A (sub)graph is frequent if its support (occurrence frequency)
in a given dataset is no less than a minimum support
threshold
[Figure: example graphs (A), (B), (C) and their frequent patterns (min support is 2)]
17