
MURI — Info. Management Group


Group Co-Leaders:

Jiawei Han (UIUC)

Chris Clifton (Purdue)

Hillol Kargupta (UMBC)

Collaborators:

Murat Kantarcioglu (UT-Dallas)

Shouhuai Xu (UT-San Antonio)

Ninghui Li (Purdue)

Core Contributors:

Latifur Khan (UT-Dallas)

Chengxiang Zhai (UIUC)

Liaisons:

Ravi Sandhu (UT-San Antonio)

Anupam Joshi (UMBC)

July 17, 2015
Core Contributors & Current Ph.D. Students


Jiawei Han (UIUC): Lu An Tang, Zhijun Yin

Chengxiang Zhai (UIUC): Yuanhua Lv, Hyun Duk Kim

Latifur Khan (UTD): Mehedy Masud

Chris Clifton (Purdue): Mummoorthy Murugesan

Hillol Kargupta (UMBC): Kamalika Das
General Project Goals

Provide information management and analysis
support for the project

Major research themes

Knowledge Discovery

Data integration and fusion

Measuring and maintaining information quality

Provenance tracking

Confidentiality in Information Management and
Analysis
Posters Reported in the Kick-Off Meeting

Plausibly Deniable Search

Mummoorthy Murugesan and Chris Clifton

Conforming to Truth with Multiple Conflicting Information Providers on the Web

Jiawei Han, Xiaoxin Yin, and Philip S. Yu

Privacy Preserving Distributed Data Mining: A Game-Theoretic Approach

Kamalika Das and Hillol Kargupta

User-Centered Adaptive Information Retrieval

Xuehua Shen, Bin Tan, and ChengXiang Zhai

Privacy-preserving Data Mining within Anonymous Credential Systems

Shouhuai Xu

Novel Class Detection in Concept-Drifting Data Streams in a Shared Environment

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham
On-Going Research Projects






Novel Class Detection in Concept-Drifting Data Streams in a Shared
Environment

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and
Bhavani Thuraisingham (UTD/UIUC)
Confidentiality Preserving Data Cubes

Jiawei Han, Lu An Tang and Bolin Ding (UIUC)
Scalable Distributed Privacy-Preserving Local Algorithms for Large
Peer-to-Peer Data Mining: A Game Theoretic Approach

Hillol Kargupta and Kamalika Das (UMBC)
Confidential peer to peer extension to personalized search

Chengxiang Zhai, Chris Clifton, and Mummoorthy Murugesan
(UIUC/Purdue)
Information quality: Understanding and identifying provenance

ChengXiang Zhai and Jiawei Han (UIUC)
SPDU: A Secure Provenance Management Framework

Shouhuai Xu and Ravi Sandhu (UTSA)
Discovery in Data Streams for Security
Protection

Novel Class Detection in Concept-Drifting Data Streams in a Shared
Environment

Novelty/anomaly detection: A major issue in many applications,
especially in a streaming environment

Goal: Detect new classes in data streams

Approach: Efficiently handle the novel class detection task in the
presence of concept-drift and multiple classes

The approach is non-parametric: it does not assume any underlying
distribution of the data

Comparisons with state-of-the-art stream classification techniques
show the superiority of our approach

The technique can be extended to a distributed environment with
multiple sources
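
To make the idea of non-parametric novel class detection concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes each known class is summarized by a centroid and a radius, treats points outside every class boundary as outliers, and reports a novel class only when enough outliers sit closer to each other than to any known class. The function names and the `min_points` threshold are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def build_model(labeled):
    """Summarize each known class as (centroid, radius); no distribution is assumed."""
    model = {}
    for label, pts in labeled.items():
        c = centroid(pts)
        model[label] = (c, max(math.dist(c, p) for p in pts))
    return model

def is_outlier(model, x):
    """A point is an outlier if it falls outside every known class boundary."""
    return all(math.dist(c, x) > r for c, r in model.values())

def detect_novel_class(model, chunk, min_points=3):
    """Return (novel_class_found, outliers) for one chunk of the stream.

    A few scattered outliers are treated as noise or drift; a novel class is
    reported only if at least min_points outliers are, on average, closer to
    their own centroid than that centroid is to the nearest known class.
    """
    outliers = [x for x in chunk if is_outlier(model, x)]
    if len(outliers) < min_points:
        return False, outliers
    c = centroid(outliers)
    spread = sum(math.dist(c, x) for x in outliers) / len(outliers)
    separation = min(math.dist(c, known_c) for known_c, _ in model.values())
    return spread < separation, outliers

model = build_model({
    "a": [(0, 0), (1, 0), (0, 1)],
    "b": [(10, 10), (11, 10), (10, 11)],
})
chunk = [(0.5, 0.5), (5, 5), (5.2, 5.1), (4.9, 5.3), (10.2, 10.2)]
found, outliers = detect_novel_class(model, chunk)
```

Handling concept drift would additionally require refreshing the class summaries as new labeled chunks arrive, which this sketch leaves out.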
Confidentiality-Preserving Data Cubes

Confidentiality-/privacy-/sensitivity-preserving data cubes

Researchers have been studying confidentiality-preserving database
systems (for query processing) and
confidentiality-preserving data mining systems

We propose to investigate confidentiality-preserving data
cubes for multidimensional analysis of data warehouses

Goal: Work out mechanisms to ensure one can access
maximal information in data cubes for information
understanding but lose minimal privacy information, even
with different combinations of OLAP queries

Extensions: How knowledge discovery can help
preserve confidentiality
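
The concern about combinations of OLAP queries can be illustrated with a small sketch. It assumes a count cube over hypothetical dimensions and applies minimum-count cell suppression, a standard disclosure-control technique swapped in here for illustration; it is not the mechanism proposed in this project. The names `build_cube`, `suppress_small_cells`, and the threshold `k` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cube(records, dimensions):
    """Count records for every subset of dimensions (every cuboid of the cube)."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cube[dims] = Counter(tuple(rec[d] for d in dims) for rec in records)
    return cube

def suppress_small_cells(cube, k):
    """Hide (set to None) every cell whose count is below k."""
    return {
        dims: {cell: (n if n >= k else None) for cell, n in cells.items()}
        for dims, cells in cube.items()
    }

records = [
    {"dept": "A", "city": "X"},
    {"dept": "A", "city": "X"},
    {"dept": "A", "city": "Y"},
    {"dept": "B", "city": "Y"},
]
cube = build_cube(records, ("dept", "city"))
safe = suppress_small_cells(cube, k=2)

# The cell for dept "B" is hidden, but it can still be inferred by combining
# two permitted answers: the grand total minus the count for dept "A".
inferred_b = safe[()][()] - safe[("dept",)][("A",)]
```

In this toy cube the suppressed count for dept "B" is recovered by subtraction, which is why per-cell protection alone is not enough once queries at different aggregation levels can be combined.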
Data and Information Integration for
Security Protection

Data fusion: Merge/integrate the same objects with
different names or identities

Data distinction: Distinguish different objects with
identical names

Information integration by information network analysis

Veracity analysis to determine the truth from conflicting
information provided by multiple websites or other
information providers

Correlation analysis to reduce redundancy and control
information disclosure

E.g. medical records, patients, medical treatments
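
As an illustration of veracity analysis, the sketch below iterates between source trustworthiness and claim confidence. It is a deliberately simplified scheme, not the authors' algorithm: a claim's confidence is the probability that at least one of its supporting sources is right, and a source's trust is the average confidence of its claims. The function name, the initial trust of 0.8, and the sample data are assumptions.

```python
from collections import defaultdict

def resolve_conflicts(claims, rounds=10, initial_trust=0.8):
    """claims: iterable of (source, object, value) triples.

    Returns (best value per object, trust score per source).
    """
    supporters = defaultdict(set)
    for source, obj, value in claims:
        supporters[(obj, value)].add(source)
    trust = {source: initial_trust for source, _, _ in claims}
    confidence = {}
    for _ in range(rounds):
        # Confidence of a claim: 1 minus the chance that all supporters are wrong.
        for fact, sources in supporters.items():
            disbelief = 1.0
            for s in sources:
                disbelief *= 1.0 - trust[s]
            confidence[fact] = 1.0 - disbelief
        # Trust of a source: average confidence of the claims it makes.
        asserted = defaultdict(list)
        for fact, sources in supporters.items():
            for s in sources:
                asserted[s].append(confidence[fact])
        trust = {s: sum(c) / len(c) for s, c in asserted.items()}
    best = {}
    for (obj, value), conf in confidence.items():
        if obj not in best or conf > best[obj][1]:
            best[obj] = (value, conf)
    return {obj: value for obj, (value, _) in best.items()}, trust

claims = [
    ("site_a", "book1_author", "X"),
    ("site_b", "book1_author", "X"),
    ("site_c", "book1_author", "Y"),
    ("site_a", "book2_author", "P"),
    ("site_c", "book2_author", "Q"),
    ("site_d", "book2_author", "P"),
]
truth, trust = resolve_conflicts(claims)
```

Here the values backed by several mutually consistent sources win, and the source that disagrees on both objects ends up with the lowest trust. A fuller method would also account for similarity between values and for sources that copy each other.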
Data and Information Access and
Management for Security Protection

Data separation vs. data integration and their role in
sensitive information disclosure and correlation discovery

Privacy-aware indexing to support fast, efficient data
access

Sensitivity-aware query processing and data publishing

Any other data/information management and analysis
issues needed from other groups in the project
Scalable Distributed Local Algorithms for
Peer-to-Peer Knowledge Discovery from
Sensitive Data
Hillol Kargupta
University of Maryland, Baltimore County
www.cs.umbc.edu/~hillol
www.agnik.com
Acknowledgement:
Chengxiang Zhai, Kamalika Das, Kanishka
Bhaduri, Kun Liu
Scalable Privacy-Preserving Information
Assurance

Challenges in Scalable Knowledge Discovery
• Scaling in large asynchronous distributed environments
• Confidentiality/Privacy Preserving Data Analysis
• Heterogeneous Policies and Strategies

Applications
• Distributed collaboration
• Distributed search and information retrieval
Motivation: Secure Multi-Party Sum
Computation
• Each party has a number (v1, v2, v3)
• Compute the sum without divulging the numbers
• R is uniformly distributed in [0, N-1]
z1 = (R + v1) mod N
z2 = (z1 + v2) mod N
z3 = (z2 + v3) mod N
• Consider a sequence of secure sum operations.
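
The ring protocol on this slide can be simulated in a few lines. This is a single-process sketch under stated assumptions: the first party draws the random mask R and removes it at the end (the usual last step of the protocol, which the slide does not spell out), and the true sum is assumed to be smaller than N. The function name `secure_sum` is hypothetical.

```python
import secrets

def secure_sum(values, modulus):
    """Simulate the ring-based secure sum for parties holding `values`.

    Returns (total, messages), where messages[i] is the masked partial sum
    z_(i+1) that party i+1 passes along the ring.
    """
    mask = secrets.randbelow(modulus)       # R, uniform in [0, N-1]
    running = mask
    messages = []
    for v in values:
        running = (running + v) % modulus   # z_k = (z_(k-1) + v_k) mod N
        messages.append(running)
    # Only the initiating party knows R, so only it can unmask the result.
    total = (running - mask) % modulus
    return total, messages

total, messages = secure_sum([5, 17, 20], modulus=1000)
```

Because R is uniform, each message a party sees is itself uniformly distributed and reveals nothing about the preceding values on its own. The protocol does not protect against neighbors on the ring colluding, which is one reason to study sequences of secure sum operations.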
Locality Sensitive Distributed Algorithms

Global algorithms: Communicate with the entire network
– Every node needs to maintain information about the entire network
– Maintaining this information is resource intensive for large networks

Local algorithms: Communicate only with the local neighborhood

Bounded communication local algorithms
Distributed Sum Computation: A Local
Approach

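Each node i has a number x_i[0]; the goal is to compute the sum over all nodes

Update x_i[t] using the following rule:

$$x_i[t] = x_i[t-1] + \epsilon \sum_{j \in \Gamma_i} \left( x_j[t-1] - x_i[t-1] \right)$$

Here $\Gamma_i$ denotes the immediate neighbors of node i and $\epsilon$ is a small positive step size.

Asymptotically, every x_i[t] converges to the global average, which gives the global sum when multiplied by the number of nodes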
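
A minimal simulation of this local update rule follows. It assumes a connected network with symmetric neighbor lists and a step size below one over the maximum node degree; under those assumptions each node's value approaches the global average, and the sum follows by multiplying by the number of nodes. The names `local_update` and `run_local_algorithm` and the 4-node ring are illustrative choices.

```python
def local_update(values, neighbors, step):
    """One synchronous round: each node moves toward its neighbors' values."""
    return [
        x_i + step * sum(values[j] - x_i for j in neighbors[i])
        for i, x_i in enumerate(values)
    ]

def run_local_algorithm(values, neighbors, step, rounds):
    """Repeat the local update; nodes only ever read their direct neighbors."""
    for _ in range(rounds):
        values = local_update(values, neighbors, step)
    return values

# A ring of four nodes: node i talks only to the nodes listed in neighbors[i].
neighbors = [[1, 3], [0, 2], [1, 3], [2, 0]]
start = [1.0, 2.0, 3.0, 6.0]
final = run_local_algorithm(start, neighbors, step=0.25, rounds=50)
estimated_sum = len(start) * final[0]
```

Because the neighbor relation is symmetric, every round leaves the total of all values unchanged, so the common limit must be the average of the starting values (3.0 here, for a sum of 12.0).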
Optimization, Games, and Privacy-Preserving Knowledge Discovery

Multi-Party Privacy Preservation as an
optimization problem

Multi-party, multi-objective optimization

Blending game theory and mechanism design

Asynchronous algorithms for achieving
equilibrium states
Privacy/Confidentiality Preservation: An
Optimization Perspective

Multi-objective Optimization Perspective
• Policies
• Strategies
• Performance

Distributed games for optimizing utility functions
Summary of the Approach

Local Asynchronous Distributed Knowledge
Discovery Algorithms that preserve
Privacy/Confidentiality

Distributed Search and Information Retrieval
Algorithms

Multi-party Optimization Perspective of
Privacy/Confidentiality Preservation and Design of
Distributed Game Theoretic Mechanisms
Example: Cross-Domain Network
Threat Detection
Correlating threats
from different network
domains
Copyright, Agnik
Motivation : P2P Search Engine
What is the most visited news-page in the network today?
Has anybody found a cheap store to buy a digital camera?
What is the best search-key to search for “Child Care”?
Useful Browser Data

Web-browser history
Browser cache
Click-stream data stored at browser (browsing pattern)
Search queries typed in the search engine
User profile
Bookmarks

Challenges

Indexing, clustering, data analysis in a decentralized
asynchronous manner
Scalability
Privacy
User-Centered Adaptive Information
Retrieval
[Figure: a personalized search agent takes the user's query (“java”) and sits between the user and multiple search engines, drawing on viewed Web pages, desktop files, email, and query history]
User-Centered Adaptive IR
• A novel retrieval strategy emphasizing
– user modeling (“user-centered”)
– search context modeling (“adaptive”)
– interactive retrieval
• Implemented as a personalized search agent that
– sits on the client-side (owned by the user)
– integrates information around a user (1 user vs. N sources as opposed to 1 source vs. N users)
– collaborates with other agents
– goes beyond search toward task support
Reranking of Search Results with UCAIR
Toolbar
Research Agenda

Develop a scalable methodology for
Knowledge Discovery from Multi-Party Data

Design local asynchronous algorithms with
bounded communication

Multi-objective Distributed Optimization,
Mechanism Design, and Local Algorithms

Designing the Next Generation of Privacy-Preserving
Distributed Knowledge Discovery Algorithms
Research Agenda

• Privacy-preserving user modeling
– How can we model a user’s information need while preserving privacy?
– How can we aggregate user models and information needs to control privacy?

• P2P information recommendation
– P2P architecture: flexible information sharing
– What’s the right protocol for information recommendation?
– How can we extend collaborative filtering algorithms to protect user privacy?

• Collaborative Search
– How can we match information needs with information content at different levels of representation?
From Collaborative Query/Filtering to
Information Push




Chengxiang Zhai and Chris Clifton (UIUC/Purdue)
Personalized search → profile of information needs
– Profile based on prior search, without requiring explicit definition of profile
– Assist information sources in identifying need to share
Challenge: profile / search may be sensitive
– May not be able to reveal to information source (unless they have needed information?)
Research thrusts:
– Turning personalized search into profiles
– Matching information to profiles without disclosing either
SPDU: A Secure Provenance
Management Framework




Shouhuai Xu and Ravi Sandhu (UTSA)
Security of provenance management is critical to
many applications including assured information
sharing
Little is currently known about the
security aspects of provenance management.
We propose investigating a comprehensive
framework for secure provenance management as
well as supporting architectures and mechanisms
for realizing the framework
SPDU
Shouhuai Xu and Ravi Sandhu
• A comprehensive framework for securing
provenance and the corresponding
information
– We cannot talk about provenance without
touching what the provenance is for (i.e., both
data and their provenance are the goals for
protection)
• Supporting architectures and mechanisms
for realizing the framework
SPDU framework
• The above challenges call for a novel framework for secure
provenance management.
• We propose a SPDU framework for this purpose.
– S stands for Source trustworthiness management
– P stands for Processing trustworthiness management
(S and P together form information trustworthiness management)
– D stands for Dissemination management
– U stands for Usage management
• SPDU is application-neutral: it allows plug-and-play application-specific modules (e.g., semantic similarity between two documents)
• SPDU covers the whole lifecycle of information sharing
[Figure: the information sharing lifecycle: Source, Processing (recursive), Dissemination, Usage]
Eight facets of SPDU
[Figure: secure provenance management at the center, surrounded by eight facets: source accountability, processing accountability, dissemination accountability, usage accountability, source privacy, processing privacy, dissemination privacy, usage privacy]
Information Quality: Understanding
and Identifying Provenance





ChengXiang Zhai and Jiawei Han (UIUC)
Credibility of information, particularly information presumed to
be from multiple sources, is a challenging issue
Are multiple reports independent confirmation of the same
event? Based on a common report? Reports of different
events?
Propose to use data mining techniques to identify
similarities/differences in information that is apparently from
different sources, in order to estimate the likelihood that the data comes
from a single source or from independent sources, and whether it concerns
the same event or multiple events
Propose to develop novel text mining algorithms to analyze
"information genealogy" in large amounts of text data from
multiple sources and summarize contradictory opinions on a
topic
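
One simple way to estimate whether two reports derive from a common source is to compare overlapping word sequences. The sketch below uses word shingles and Jaccard similarity, a standard near-duplicate technique chosen here for illustration rather than the text mining algorithms proposed above. The function names, the shingle size, and the 0.5 threshold are assumptions.

```python
def shingles(text, size=3):
    """Set of consecutive word sequences (shingles) of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def likely_shared_source(report_a, report_b, threshold=0.5):
    """Heavy verbatim overlap suggests a common origin rather than independent confirmation."""
    return jaccard(shingles(report_a), shingles(report_b)) >= threshold

report_1 = "the bridge was closed on monday after heavy flooding in the valley"
report_2 = "officials said the bridge was closed on monday after heavy flooding in the valley"
report_3 = "a local team won the regional chess tournament this weekend"

same_origin = likely_shared_source(report_1, report_2)
unrelated = likely_shared_source(report_1, report_3)
```

Surface overlap only catches copied wording. Reports that paraphrase a common source, or that describe the same event independently, would need the semantic integration and entity/relation extraction discussed on the following slide.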
Summarizing Contradictory Information

Given a set of text articles from different sources with
contradictory information, how can we help analysts
to digest the information?

Problem 1: Semantic integration of information from
multiple sources

Problem 2: Detection of contradictory information

Problem 3: Summarization of contradictory
information

Techniques to explore:

text mining with probabilistic models

information extraction (e.g., entity/relation extraction)
Questions for YOU!



Other data analysis / global statistical model needs?
• Data quality?
• Lifecycle?
What sort of global statistical models would be of
interest to Intelligence Analysts?
• Models that transcend data silos
Scenarios for testing
• Sample/surrogate data to support scenarios
Thanks and Questions