Example: Data Mining for the NBA - The University of Texas at Dallas
Data and Applications Security
Developments and Directions
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Data Warehousing, Data Mining and Security
September 2014
Outline
Background on Data Warehousing
Security Issues for Data Warehousing
Data Mining and Security
What is a Data Warehouse?
A Data Warehouse is a:
- Subject-oriented
- Integrated
- Nonvolatile
- Time variant
- Collection of data in support of management’s decisions
(From: Building the Data Warehouse by W. H. Inmon, John Wiley and Sons)
Integration of heterogeneous data sources into a repository
Summary reports, aggregate functions, etc.
Example Data Warehouse
Figure: Users query the warehouse. The data warehouse holds data correlating employees with medical benefits and projects. It draws on three sources: an Oracle DBMS for employees, a Sybase DBMS for projects, and an Informix DBMS for medical data. The warehouse itself could be any DBMS, usually based on the relational data model.
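The correlation shown in the example above can be sketched in a few lines. This is a minimal illustration only: a single in-memory SQLite database stands in for the three separate source DBMSs, and every table name, column name, and row is invented for the sketch.

```python
import sqlite3

# Hypothetical stand-ins for the three source systems on the slide
# (employees, projects, medical); all names and rows are invented.
employees = [(1, "Alice"), (2, "Bob")]
projects = [(1, "Apollo"), (2, "Zephyr")]   # (emp_id, project)
medical = [(1, "Plan A"), (2, "Plan B")]    # (emp_id, benefit plan)


def build_warehouse():
    """Load the three sources into one relational store and correlate them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE projects (emp_id INTEGER, project TEXT)")
    conn.execute("CREATE TABLE medical (emp_id INTEGER, plan TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", employees)
    conn.executemany("INSERT INTO projects VALUES (?, ?)", projects)
    conn.executemany("INSERT INTO medical VALUES (?, ?)", medical)
    # The warehouse table correlates employees with benefits and projects.
    conn.execute(
        """
        CREATE TABLE warehouse AS
        SELECT e.emp_id, e.name, m.plan, p.project
        FROM employees e
        JOIN medical m ON m.emp_id = e.emp_id
        JOIN projects p ON p.emp_id = e.emp_id
        """
    )
    return conn


conn = build_warehouse()
rows = conn.execute(
    "SELECT name, plan, project FROM warehouse ORDER BY emp_id"
).fetchall()
print(rows)
```

In a real deployment the sources would be separate systems with their own schemas, so the join would be preceded by extraction and schema mapping steps that this sketch leaves out.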
Some Data Warehousing Technologies
Heterogeneous Database Integration
Statistical Databases
Data Modeling
Metadata
Access Methods and Indexing
Language Interface
Database Administration
Parallel Database Management
Data Warehouse Design
Appropriate Data Model is key to designing the Warehouse
Higher Level Model in stages
- Stage 1: Corporate data model
- Stage 2: Enterprise data model
- Stage 3: Warehouse data model
Middle-level data model
- Possibly one model for each subject area in the higher-level model
Physical data model
- Include features such as keys in the middle-level model
Need to determine appropriate levels of granularity of data in order
to build a good data warehouse
Distributing the Data Warehouse
Issues similar to distributed database systems
Figure: Non-distributed Warehouse. Branch A, Branch B, and the Central Bank share a single Central Warehouse.
Figure: Distributed Warehouse. Branch A and Branch B each have their own warehouse (Branch A Warehouse, Branch B Warehouse), and the Central Bank has the Central Warehouse.
Multidimensional Data Model
Figure: Multidimensional model of project data
- Project Name
- Project Leader
- Project Sponsor
- Project Cost (Dollars, Pounds, Yen)
- Project Duration (Years, Months, Weeks)
Indexing for Data Warehousing
Bit-Maps
Multi-level indexing
Storing parts or all of the index files in main memory
Dynamic indexing
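As an illustration of the bit-map technique listed above, the following minimal sketch builds one bitmap per distinct column value and answers a two-condition query with a single bitwise AND. The column contents and the function names `build_bitmap_index` and `rows_from_bitmap` are assumptions made for this example, not part of the original slides.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical columns of a small fact table (row i across both lists).
currency = ["Dollars", "Yen", "Dollars", "Pounds", "Dollars"]
unit = ["Months", "Months", "Weeks", "Months", "Months"]

currency_index = build_bitmap_index(currency)
unit_index = build_bitmap_index(unit)

# A conjunctive predicate becomes a single bitwise AND of two bitmaps.
matches = rows_from_bitmap(currency_index["Dollars"] & unit_index["Months"])
print(matches)
```

Bitmaps suit warehouse columns with few distinct values, because each value needs only one bit per row and conditions combine with cheap bitwise operations.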
Metadata Mappings
Figure: Metadata for the Warehouse is connected to the metadata for each of Data Sources A, B, and C through Metadata for Mappings and Transformations (one set per data source).
Data Warehousing and Security
Security for integrating the heterogeneous data sources into
the repository
- e.g., Heterogeneous Database System Security, Statistical Database Security
Security for maintaining the warehouse
- Query, Updates, Auditing, Administration, Metadata
Multilevel Security
- Multilevel Data Models, Trusted Components
Example Secure Data Warehouse
Figure: The User interacts with the Secure Data Warehouse Manager, which manages the Secure Warehouse and draws on Secure DBMS A, Secure DBMS B, and Secure DBMS C, each with its own Secure Database.
Secure Data Warehouse Technologies
Secure Data Warehousing Technologies:
Secure data modeling
Secure heterogeneous database integration
Database security
Secure access methods and indexing
Secure query languages
Secure database administration
Secure high performance computing technologies
Secure metadata management
Security for Integrating Heterogeneous Data Sources
Integrating multiple security policies into a single policy for
the warehouse
- Apply techniques for federated database security?
Need to transform the access control rules
Security impact on schema integration and metadata
- Maintaining transformations and mappings
Statistical database security
- Inference and aggregation
e.g., Average salary in the warehouse could be
unclassified while the individual salaries in the databases
could be classified
Administration and auditing
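The inference and aggregation concern above can be made concrete with a small worked example. This is a sketch under stated assumptions: the salary figures are invented, and the minimum query-set size check in `guarded_average` is one commonly discussed control, not a method taken from the slides and not a complete defence on its own.

```python
# Hypothetical classified source data: individual salaries.
salaries = {"alice": 90_000, "bob": 70_000, "carol": 80_000}


def average_salary(names):
    """Unclassified aggregate query the warehouse is willing to answer."""
    return sum(salaries[n] for n in names) / len(names)


def guarded_average(names, min_group_size=3):
    """Same query, but refuse groups small enough to expose an individual."""
    if len(names) < min_group_size:
        raise PermissionError("query set too small")
    return average_salary(names)


# Differencing: two permitted averages reveal one classified salary.
everyone = ["alice", "bob", "carol"]
all_but_carol = ["alice", "bob"]
inferred = (
    average_salary(everyone) * len(everyone)
    - average_salary(all_but_carol) * len(all_but_carol)
)
print(inferred)  # equals Carol's individual salary
```

With the guard in place the second query is refused, which blocks this particular pair of queries. Other query combinations may still leak information, which is why inference control remains an open topic.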
Security Policy for the Warehouse
Figure: Layered security policies. Each component (A, B, C) has a Component Policy and a corresponding Generic Policy. Export Policies (one each for Components A and C, two for Component B) feed the Federated Policies for Federations F1 and F2. Security Policy Integration and Transformation connects the layers.
Federated policies become warehouse policies?
Security Policy for the Warehouse - II
Figure: The Policy for the Warehouse is connected to the policies for Data Sources A, B, and C through a Policy for Mappings and Transformations (one per data source).
Secure Data Warehouse Model
Figure: Multidimensional model with a security label on each element (U = Unclassified, S = Secret)
- Project Name: U
- Project Leader: U
- Project Sponsor: S
- Project Cost: S (Dollars: S, Pounds: S, Yen: S)
- Project Duration: U (Year: U, Months: U, Weeks: U)
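One way to read the labeled model above is as a filter applied per attribute when a user queries the warehouse. The sketch below uses the U and S labels from the slide; the record contents and the function name `visible_view` are hypothetical, and a real multilevel system would enforce this inside trusted components rather than in application code.

```python
LEVELS = {"U": 0, "S": 1}  # Unclassified is lower than Secret

# Attribute labels as given on the slide.
LABELS = {
    "Project Name": "U",
    "Project Leader": "U",
    "Project Sponsor": "S",
    "Project Cost": "S",
    "Project Duration": "U",
}


def visible_view(record, clearance):
    """Keep only attributes whose label does not exceed the user's clearance."""
    return {
        attr: value
        for attr, value in record.items()
        if LEVELS[LABELS[attr]] <= LEVELS[clearance]
    }


# Hypothetical warehouse record.
record = {
    "Project Name": "Apollo",
    "Project Leader": "Alice",
    "Project Sponsor": "Agency X",
    "Project Cost": "2,000,000 Dollars",
    "Project Duration": "18 Months",
}

unclassified_view = visible_view(record, "U")
secret_view = visible_view(record, "S")
print(sorted(unclassified_view))
```

An Unclassified user sees the project name, leader, and duration, while a Secret user sees the full record.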
Methodology for Developing a Secure Data Warehouse
Figure: Secure data sources -> Integrate secure data sources -> Clean/modify data sources and integrate policies -> Build secure data model, schemas, access methods, and index strategies for the secure warehouse
Multi-Tier Architecture
Tier N: Secure Data Warehouse (builds on Tier N-1)
*
*
Tier 2: Builds on Tier 1
Tier 1: Secure Data Sources
Each layer builds on the previous layer
Schemas/Metadata/Policies
Administration
Roles of Database Administrators, Warehouse
Administrators, Database System Security officers, and
Warehouse System Security Officers?
When databases are updated, can a trigger mechanism be used to automatically update the warehouse?
- i.e., Will the individual database administrators permit such a mechanism?
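The trigger question above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the source table and the warehouse table live in the same SQLite database. In practice they are separate systems, which is exactly why administrator permission becomes an issue. All table and trigger names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_projects (name TEXT PRIMARY KEY, cost INTEGER);
    CREATE TABLE warehouse_projects (name TEXT PRIMARY KEY, cost INTEGER);

    -- Propagate new source rows to the warehouse.
    CREATE TRIGGER propagate_insert AFTER INSERT ON source_projects
    BEGIN
        INSERT INTO warehouse_projects (name, cost) VALUES (NEW.name, NEW.cost);
    END;

    -- Propagate cost changes to the warehouse.
    CREATE TRIGGER propagate_update AFTER UPDATE OF cost ON source_projects
    BEGIN
        UPDATE warehouse_projects SET cost = NEW.cost WHERE name = NEW.name;
    END;
    """
)

# Updates are made only to the source table; the triggers do the rest.
conn.execute("INSERT INTO source_projects VALUES ('Apollo', 100)")
conn.execute("UPDATE source_projects SET cost = 150 WHERE name = 'Apollo'")

warehouse_rows = conn.execute(
    "SELECT name, cost FROM warehouse_projects"
).fetchall()
print(warehouse_rows)
```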
Auditing
Should the Warehouse be audited?
- Advantages
  - Keep up-to-date information on access to the warehouse
- Disadvantages
  - May need to keep unnecessary data in the warehouse
  - May need a lower level granularity of data
  - May cause changes to the timing of data entry to the warehouse as well as backup and recovery restrictions
Need to determine the relationships between auditing the warehouse and auditing the databases
Multilevel Security
Multilevel data models
- Extensions to the data warehouse model to support
classification levels
Trusted Components
- How much of the warehouse should be trusted?
- Should the transformations be trusted?
Covert channels, inference problem
Inference Controller
Figure: The User's requests pass through the Inference Controller to the Secure Data Warehouse Manager, which manages the Secure Warehouse and draws on Secure DBMS A, Secure DBMS B, and Secure DBMS C, each with its own Secure Database.
Status and Directions
Commercial data warehouse vendors are incorporating role-based security (e.g., Oracle)
Many topics need further investigation
- Building a secure data warehouse
- Policy integration
- Secure data model
- Inference control
Data Mining for Counter-terrorism
Figure: Data Mining for Counterterrorism divides into two branches
- Data mining for non-real-time threats: gather data, build terrorist profiles, mine data, prune results
- Data mining for real-time threats: gather data in real time, build real-time models, mine data, report results
Data Mining Needs for Counterterrorism:
Non-real-time Data Mining
Gather data from multiple sources
- Information on terrorist attacks: who, what, where, when, how
- Personal and business data: place of birth, ethnic origin,
religion, education, work history, finances, criminal record,
relatives, friends and associates, travel history, . . .
- Unstructured data: newspaper articles, video clips, speeches,
emails, phone records, . . .
Integrate the data, build warehouses and federations
Develop profiles of terrorists, activities/threats
Mine the data to extract patterns of potential terrorists and predict
future activities and targets
Find the “needle in the haystack” - suspicious needles?
Data integrity is important
Techniques have to SCALE
Data Mining for Non Real-time Threats
Figure: Data sources with information about terrorists and terrorist activities -> Integrate data sources -> Clean/modify data sources -> Build profiles of terrorists and activities -> Mine the data -> Examine results/prune results -> Report final results
Data Mining Needs for Counterterrorism:
Real-time Data Mining
Nature of data
- Data arriving from sensors and other devices
Continuous data streams
- Breaking news, video releases, satellite images
- Some critical data may also reside in caches
Rapidly sift through the data, setting aside data that is not needed immediately for later use and analysis (non-real-time data mining)
Data mining techniques need to meet timing constraints
Quality of service (QoS) tradeoffs among timeliness, precision and
accuracy
Presentation of results, visualization, real-time alerts and triggers
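The sifting and alerting requirements above can be sketched as a single pass over a stream with a small sliding window. This is a minimal illustration, not a method from the slides: the window size, the three-standard-deviation threshold, the sample readings, and the function name `sift_stream` are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


def sift_stream(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent window's mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical sensor readings with one abrupt spike.
stream = [10, 11, 9, 10, 12, 10, 50, 11, 10, 9]
alerts = list(sift_stream(stream))
print(alerts)
```

Each reading is examined once against a fixed-size window, so the work per item stays bounded, which is one simple way to respect a timing constraint. A smaller window lowers latency at the cost of a noisier baseline, a small instance of the timeliness versus accuracy tradeoff.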
Data Mining for Real-time Threats
Figure: Data sources with information about terrorists and terrorist activities -> Integrate data sources in real time -> Rapidly sift through data and discard irrelevant data -> Build real-time models -> Mine the data -> Examine results in real time -> Report final results
Data Mining Outcomes and Techniques for Counter-terrorism
Classification: build profiles of terrorists and classify terrorists
Association: John and James are often seen together after an attack
Link Analysis: follow the chain from A to B to C to D
Clustering: divide the population, e.g., people from country X of a certain religion; people from country Y interested in airplanes
Anomaly Detection: John registers at flight school but does not care about takeoff or landing
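Of the techniques above, link analysis maps most directly onto a short example: following a chain from A to D is a path search over a graph of known associations. The sketch below uses breadth-first search, one common choice for this; the graph contents and the function name `find_chain` are hypothetical.

```python
from collections import deque


def find_chain(links, start, goal):
    """Breadth-first search for a shortest chain of links from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the recorded predecessors to rebuild the chain.
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical directed association graph.
links = {"A": ["B"], "B": ["C", "E"], "C": ["D"], "E": []}
chain = find_chain(links, "A", "D")
print(chain)
```

When no chain exists, the function returns `None`, so callers can distinguish "no known link" from a direct link.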
Example Success Story - COPLINK
COPLINK developed at University of Arizona
- Research transferred to an operational system currently
in use by Law Enforcement Agencies
What does COPLINK do?
- Provides an integrated system for law enforcement, integrating law enforcement databases
- If a crime occurs in one state, this information is linked to similar cases in other states
- It has been stated that the sniper shooting case may have been solved earlier if COPLINK had been operational at that time
Where are we now?
We have some tools for
- building data warehouses from structured data
- integrating structured heterogeneous databases
- mining structured data
- forming some links and associations
- information retrieval tools
- image processing and analysis
- pattern recognition
- video information processing
- visualizing data
- managing metadata
What are our challenges?
Do the tools scale for large heterogeneous databases and petabyte-sized databases?
Building models in real-time; need training data
Extracting metadata from unstructured data
Mining unstructured data
Extracting useful patterns from knowledge-directed data mining
Rapidly forming links and associations; get the big picture for real-time data mining
Detecting/preventing cyber attacks
Mining the web
Evaluating data mining algorithms
Conducting risk analysis / economic impact
Building testbeds
IN SUMMARY:
Data mining is very useful for solving security problems
- Data mining tools could be used to examine audit data and flag abnormal behavior
- Much recent work in intrusion detection, e.g., neural networks to detect abnormal patterns
- Tools are being examined to determine abnormal patterns for national security: classification techniques, link analysis
- Fraud detection: credit cards, calling cards, identity theft, etc.
BUT CONCERNS FOR PRIVACY