Storing RDF Data in Hadoop And Retrieval


Assured Cloud Computing
for Assured Information
Sharing
Dr. Bhavani Thuraisingham
The University of Texas at Dallas (UTD)
February 2014
Outline
• Objectives
• Assured Information Sharing
• Layered Framework for a Secure Cloud
• Cloud-based Assured Information Sharing
• Cloud-based Secure Social Networking
• Other Topics
– Secure Hybrid Cloud
– Cloud Monitoring
– Cloud for Malware Detection
– Cloud for Secure Big Data
• Education
• Directions
• Related Books
Team Members
• Sponsor: Air Force Office of Scientific Research
• The University of Texas at Dallas
– Dr. Murat Kantarcioglu; Dr. Latifur Khan; Dr. Kevin Hamlen; Dr.
Zhiqiang Lin, Dr. Kamil Sarac
• Sub-contractors
– Prof. Elisa Bertino (Purdue)
– Ms. Anita Miller, Late Dr. Bob Johnson (North Texas Fusion
Center)
• Collaborators
– Late Dr. Steve Barker, Dr. Maribel Fernandez, King's College, U of
London (EOARD)
– Dr. Barbara Carminati; Dr. Elena Ferrari, U of Insubria (EOARD)
Objectives
• Cloud computing is an example of computing in which dynamically scalable
and often virtualized resources are provided as a service over the Internet.
Users need not have knowledge of, expertise in, or control over the
technology infrastructure in the "cloud" that supports them.
• Our research on Cloud Computing is based on Hadoop, MapReduce, and Xen
• Apache Hadoop is a Java software framework that supports data intensive
distributed applications under a free license. It enables applications to work
with thousands of nodes and petabytes of data. Hadoop was inspired by
Google's MapReduce and Google File System (GFS) papers.
• XEN is a Virtual Machine Monitor developed at the University of Cambridge,
England
• Our goal is to build a secure cloud infrastructure for assured information
sharing and related applications
Information Operations Across Infospheres:
Assured Information Sharing
Objectives
• Develop a Framework for Secure and Timely Data Sharing
across Infospheres
• Investigate Access Control and Usage Control policies for
Secure Data Sharing
• Develop innovative techniques for extracting information
from trustworthy, semi-trustworthy and untrustworthy
partners
[Figure: components holding Data/Policy for Agency A, Agency B, and
Agency C each publish data/policy to the shared Data/Policy for Coalition]
Scientific/Technical Approach
• Conduct experiments as to how much information is lost
as a result of enforcing security policies in the case of
trustworthy partners
• Develop more sophisticated policies based on role-based
and usage control based access control models
• Develop techniques based on game theoretical strategies
to handle partners who are semi-trustworthy
• Develop data mining techniques to carry out defensive
and offensive information operations
Accomplishments
• Developed an experimental system for determining
information loss due to security policy enforcement
• Developed a strategy for applying game theory to
semi-trustworthy partners; simulation results
• Developed data mining techniques for conducting
defensive operations for untrustworthy partners
Challenges
• Handling dynamically changing trust levels; Scalability
Our Approach
• Policy-based Information Sharing
• Integrate the Medicaid claims data and mine the data;
• Enforce policies and determine how much information has
been lost (Trustworthy partners);
• Application of Semantic web technologies
• Apply game theory and probing to extract information from
semi-trustworthy partners
• Conduct Active Defence and determine the actions of an
untrustworthy partner
– Defend ourselves from our partners using data analytics
techniques
– Conduct active defence: find out what our partners are
doing by monitoring them so that we can defend
ourselves in dynamic situations
Policy Enforcement Prototype
Coalition
Layered Framework for Assured Cloud
Computing
[Figure: stacked layers Applications; HIVE/SPARQL/Query;
Hadoop/MapReduce/Storage; XEN/Linux/VMM. Side components: Policies
(XACML, RDF); QoS and Resource Allocation; Risks/Costs; Cloud Monitors;
Secure Virtual Network Monitor]
Figure 2. Layered Framework for Assured Cloud
Secure Query Processing with
Hadoop/MapReduce
• We have studied clouds based on Hadoop
• Query rewriting and optimization techniques designed and
implemented for two types of data
• (i) Relational data: Secure query processing with HIVE
• (ii) RDF data: Secure query processing with SPARQL
• Demonstrated with XACML policies
• Joint demonstration with King's College and University of Insubria
– First demo (2011): Each party submits their data and policies
– Our cloud will manage the data and policies
– Second demo (2012): Multiple clouds
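The query-rewriting idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the policy table `POLICIES`, the role and table names, and the function `rewrite_query` are all hypothetical, and a real deployment would take its decisions from XACML policies rather than a Python dictionary.

```python
# Hypothetical policy store: role -> table -> (permitted columns, row filter).
# In the system described above these decisions would come from XACML policies.
POLICIES = {
    "analyst": {
        "claims": (["claim_id", "amount", "state"], "state = 'TX'"),
    },
}

def rewrite_query(role, table, columns):
    """Return a SQL string restricted to what `role` may see in `table`.

    Raises PermissionError if the role has no policy for the table or
    requests a column outside the permitted set.
    """
    try:
        permitted, row_filter = POLICIES[role][table]
    except KeyError:
        raise PermissionError(f"{role} has no access to {table}") from None
    denied = [c for c in columns if c not in permitted]
    if denied:
        raise PermissionError(f"columns not permitted: {denied}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if row_filter:
        # The policy's row-level condition is appended to the user's query.
        sql += f" WHERE {row_filter}"
    return sql

rewritten = rewrite_query("analyst", "claims", ["claim_id", "amount"])
print(rewritten)  # SELECT claim_id, amount FROM claims WHERE state = 'TX'
```

The user never sees the filter; the rewritten query is what would be handed to the query engine.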
Fine-grained Access Control with Hive
System Architecture
• Table/View definition and loading
• Users can create tables as well as load data into tables. Further,
they can also upload XACML policies for the table they are creating.
Users can also create XACML policies for tables/views.
• Users can define views only if they have permissions for all tables
specified in the query used to create the view. They can also either
specify or create XACML policies for the views they are defining.
• CollaborateCom 2010
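The view-definition rule above amounts to a subset check. The sketch below assumes a hypothetical in-memory grant table `GRANTS` and helper `can_create_view`; how the actual system stores permissions is not stated on the slide.

```python
# Hypothetical grants: user -> set of tables the user holds permissions on.
GRANTS = {
    "alice": {"patients", "claims"},
    "bob": {"claims"},
}

def can_create_view(user, tables_in_query):
    """A view may be defined only if the user holds permissions on
    every table referenced by the query that defines the view."""
    return set(tables_in_query) <= GRANTS.get(user, set())

alice_ok = can_create_view("alice", ["patients", "claims"])
bob_ok = can_create_view("bob", ["patients", "claims"])
```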
SPARQL Query Optimizer for Secure
RDF Data Processing
[Figure: Web Interface (New Data, Query, Answer); Data Preprocessor
(N-Triples Converter, Prefix Generator, Predicate Based Splitter,
Predicate Object Based Splitter); MapReduce Framework (Parser, Query
Validator & Rewriter, XACML PDP, Query Rewriter By Policy, Plan
Generator, Plan Executor); Server Backend]
• To build an efficient storage mechanism using Hadoop for large
amounts of data (e.g., a billion triples); build an efficient query
mechanism for data stored in Hadoop; integrate with Jena
• Developed a query optimizer and query rewriting techniques for RDF
data with XACML policies and implemented on top of Jena
• IEEE Transactions on Knowledge and Data Engineering, 2011
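To make the "Predicate Based Splitter" step concrete, here is a minimal sketch. It assumes that splitting means grouping triples by predicate so that each group could be written to its own file in HDFS; the slide names the component but does not spell out its behavior. The parser handles only simple whitespace-separated N-Triples lines, and the example URIs and the function names are hypothetical.

```python
from collections import defaultdict

def parse_ntriple(line):
    """Split one simple N-Triples line into (subject, predicate, object).

    Assumes whitespace-separated terms and a trailing '.'; the object may
    contain spaces (e.g. a quoted literal), so split at most twice.
    """
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj

def split_by_predicate(lines):
    """Group (subject, object) pairs under their predicate. Each bucket
    stands in for one per-predicate file; the predicate itself need not
    be repeated inside the bucket."""
    buckets = defaultdict(list)
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        s, p, o = parse_ntriple(line)
        buckets[p].append((s, o))
    return dict(buckets)

triples = [
    '<http://ex.org/alice> <http://ex.org/knows> <http://ex.org/bob> .',
    '<http://ex.org/alice> <http://ex.org/name> "Alice Smith" .',
    '<http://ex.org/bob> <http://ex.org/name> "Bob" .',
]
files = split_by_predicate(triples)
```

A query that touches only one predicate then needs to read only that predicate's bucket rather than the whole data set.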
Demonstration: Concept of Operation
[Figure: Agency 1, Agency 2, …, Agency n connect through a User
Interface Layer to Relational Data (Fine-grained Access Control with
Hive) and RDF Data (SPARQL Query Optimizer for Secure RDF Data
Processing)]
RDF-Based Policy Engine
[Figure: technology by UT Dallas. Interface to the Semantic Web;
Inference Engine/Rules Processor (e.g., Pellet); Policies, Ontologies,
and Rules in RDF; Jena RDF Engine; RDF Documents]
RDF-based Policy Engine on the Cloud
Query
Result
• Determine how access is granted to a resource as
well as how a document is shared
• Users specify policies: e.g., Access Control, Redaction,
Released Policy
• Parse a high-level policy to a low-level
representation
• Support graph operations and visualization. Policies
are executed as graph operations
• Execute policies as SPARQL queries over large
RDF graphs on Hadoop
• Support for policies over traditional data and its
provenance
• IFIP Data and Applications Security, 2010, ACM
SACMAT 2011
[Figure labels: User Interface Layer (High Level Specification, Policy
Translator); Policy Parser Layer (Access Control/Redaction Policy
(Traditional Mechanism), Policy/Graph Transformation Rules, Regular
Expression-Query Translator); Policy Transformation Layer (Provenance
Controller, Data Controller); stores: XML, DB, ..., RDF]
A testbed for evaluating different policy sets over
different data representations. It also supports
provenance as a directed graph and viewing policy
outcomes graphically
Integration with
Assured Information Sharing:
[Figure: Agency 1, Agency 2, …, Agency n submit SPARQL queries, RDF
data, and policies through the User Interface Layer; beneath it sit the
Policy Translation and Transformation Layer, the RDF Data Preprocessor,
the MapReduce Framework for Query Processing, and Hadoop HDFS; a Result
is returned]
Architecture
[Figure: Agency 1, Agency 2, …, Agency n send a Policy Request through
the User Interface Layer to the Policy Engine, which holds Policy n-2,
Policy n-1, and Policy n (Access Control, Redaction, Provenance, and
Combined policies) and works on an RDF graph (RDF Graph: Model; RDF
Query: SPARQL). A Connection Interface (Connection: DB, Connection:
Cloud, Connection: Text) links to the RDBMS, Cloud-based Store, and
Local stores]
Policy Reciprocity
Agency 1 wishes to share its resources if
Agency 2 also shares its resources with it
• Use our Combined policies
• Allow agents to define policies based on reciprocity and mutual
interest among cooperating agencies
SPARQL query:
SELECT B
FROM NAMED uri1 FROM NAMED uri2
WHERE P
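The reciprocity condition can be illustrated independently of the schematic SPARQL query above. The sketch below is an assumption-laden simplification: sharing relationships are kept in a hypothetical set of (sharer, beneficiary) pairs rather than in named RDF graphs, and `may_share` is an illustrative helper, not part of the policy engine.

```python
# Hypothetical record of current sharing relationships: (sharer, beneficiary).
shares = {("agency2", "agency1")}  # Agency 2 already shares with Agency 1

def may_share(owner, requester, shares):
    """Reciprocity: the owner releases a resource to the requester only if
    the requester already shares its own resources with the owner."""
    return (requester, owner) in shares

to_agency2 = may_share("agency1", "agency2", shares)
to_agency3 = may_share("agency1", "agency3", shares)
```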
Develop and Scale Policies
Agency 1 wishes to extend its existing policies
with support for constructing policies at a finer
granularity.
• The Policy engine
– Policy interface that should be implemented by all
policies
– Add newer types of policies as needed
Justification of Resources
Agency 1 asks Agency 2 for a justification of
resource R2
• Policy engine
– Allows agents to define policies over provenance
– Agency 2 can provide the provenance to Agency 1
• But protect it by using access control or redaction policies
Other Example Policies
Agency 1 shares a resource with Agency 2 provided
Agency 2 does not share with Agency 3
Agency 1 shares a resource with Agency 2 depending
on the content of the resource or until a certain time
Agency 1 shares a resource R with agency 2 provided
Agency 2 does not infer sensitive data S from R
(inference problem)
Agency 1 shares a resource with Agency 2 provided
Agency 2 shares the resource only with those in its
organizational (or social) network
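Two of the example policies above (sharing until a certain time, and sharing on condition that the recipient does not pass the resource on to a named third party) can be sketched as a small data structure. The class `SharingPolicy`, its fields, and the two check functions are hypothetical names for illustration; content-based and inference-based conditions are not covered here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SharingPolicy:
    owner: str
    recipient: str
    expires: Optional[datetime] = None        # share "until a certain time"
    no_onward_to: frozenset = frozenset()     # parties the recipient must not re-share with

def may_access(policy, requester, now):
    """Access is allowed only for the named recipient and only before expiry."""
    if requester != policy.recipient:
        return False
    if policy.expires is not None and now >= policy.expires:
        return False
    return True

def may_reshare(policy, target):
    """The recipient may pass the resource on unless the target is excluded."""
    return target not in policy.no_onward_to

policy = SharingPolicy(
    owner="agency1",
    recipient="agency2",
    expires=datetime(2014, 3, 1),
    no_onward_to=frozenset({"agency3"}),
)
```

For example, `may_access(policy, "agency2", datetime(2014, 2, 1))` holds, while `may_reshare(policy, "agency3")` does not.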
Analyzing and Securing
Social Networks in the Cloud
Analytics
Location Mining from Online Social Networks
Predicting Threats from Social Network Data,
Sentiment Analysis
Cloud Platform for implementation
Security and Privacy
Preventing the Inference of Private Attributes
(liberal or conservative; gay or straight)
Access Control in Social Networks
Cloud Platform for implementation
Security Policies for On-Line Social
Networks (OSN)
• Security policies are expressed in SWRL
(Semantic Web Rule Language); examples
Security Policy Enforcement
• A reference monitor evaluates the requests.
• Admin request for access control could be
evaluated by rule rewriting
– Example: Assume Bob submits the following admin
request
– Rewrite as the following rule
Framework Architecture
[Figure: Social Network Application; Reference Monitor; Semantic Web
Reasoning Engine; Policy Store; SN Knowledge Base. Labeled flows: Access
request, Modified Access request, Policy Retrieval, Knowledge Base
Queries, Reasoning Result, Access Decision]
Secure Social Networking in the Cloud
with Twitter-Storm
[Figure: Social Network 1, Social Network 2, …, Social Network N connect
through a User Interface Layer to Relational Data (Fine-grained Access
Control with Hive) and RDF Data (SPARQL Query Optimizer for Secure RDF
Data Processing)]
Secure Storage and Query Processing in a
Hybrid Cloud
• The use of hybrid clouds is an emerging trend in cloud computing
– Ability to exploit public resources for high throughput
– Yet, better able to control costs and data privacy
• Several key challenges
– Data Design: how to store data in a hybrid cloud?
• Solution must account for data representation used
(unencrypted/encrypted), public cloud monetary costs and
query workload characteristics
– Query Processing: how to execute a query over a hybrid cloud?
• Solution must provide query rewrite rules that ensure the
correctness of a generated query plan over the hybrid cloud
Hypervisor integrity and forensics
in the Cloud
[Figure: Applications on guest systems (Linux, Solaris, XP, MacOS) with
OS integrity and forensics; Virtualization Layer (Xen, vSphere) with the
Hypervisor and Cloud integrity & forensics; Hardware Layer]
• Secure control flow of hypervisor code
• Integrity via in-lined reference monitor
• Forensics data extraction in the cloud
• Multiple VMs
• De-mapping (isolate) each VM memory from physical memory
Cloud-based Malware Detection
[Figure: a Stream of known malware or benign executables enters a
Buffer; Feature extraction and selection using Cloud; Training & Model
update; Ensemble of Classification models. An Unknown executable goes
through Feature extraction and Classify, yielding Class: Malware (Remove)
or Benign (Keep)]
Cloud-based Malware Detection
• Binary feature extraction involves
– Enumerating binary n-grams from the binaries and selecting the best n-grams
based on information gain
– For training data with 3,500 executables, the number of distinct
6-grams can exceed 200 million
– On a single machine, this may take hours, depending on available
computing resources; this is not acceptable for training from a stream
of binaries
– We use the cloud to overcome this bottleneck
• A cloud MapReduce framework is used
– to extract and select features from each chunk
– A 10-node cloud cluster is 10 times faster than a single node
– Very effective in a dynamic framework, where malware characteristics
change rapidly
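The feature-selection step described above (enumerate binary n-grams, then rank them by information gain) can be sketched on a single machine as follows. This is an illustration under assumptions, not the authors' MapReduce job: the function names and the four toy byte strings are hypothetical, each n-gram is treated as a binary present/absent feature, and the per-class counting loop only mimics what map and reduce tasks would do across a cluster.

```python
import math
from collections import Counter

def ngrams(data: bytes, n: int):
    """Set of distinct byte n-grams occurring in one executable."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos/neg class counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(samples, n):
    """samples: list of (bytes, label), label True for malware.
    Returns {ngram: information gain of the feature 'ngram is present'}."""
    total = len(samples)
    n_pos = sum(1 for _, label in samples if label)
    n_neg = total - n_pos
    base = entropy(n_pos, n_neg)
    # "Map": each sample emits its distinct n-grams; "reduce": count per class.
    present_pos, present_neg = Counter(), Counter()
    for data, label in samples:
        (present_pos if label else present_neg).update(ngrams(data, n))
    gains = {}
    for gram in set(present_pos) | set(present_neg):
        with_pos, with_neg = present_pos[gram], present_neg[gram]
        n_with = with_pos + with_neg
        conditional = (
            (n_with / total) * entropy(with_pos, with_neg)
            + ((total - n_with) / total) * entropy(n_pos - with_pos, n_neg - with_neg)
        )
        gains[gram] = base - conditional
    return gains

# Toy data (hypothetical): two "malware" and two "benign" byte strings.
samples = [
    (b"\x90\x90\xeb\xfe", True),
    (b"\x90\x90\xcc\xcc", True),
    (b"\x00\x01\x02\x03", False),
    (b"\x00\x01\xcc\xcc", False),
]
gains = information_gain(samples, 2)
```

In this toy set the 2-gram `\x90\x90` appears only in the malware samples and so separates the classes perfectly, while `\xcc\xcc` appears once in each class and carries no information. With hundreds of millions of distinct 6-grams, the counting step is the part worth distributing.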
Identity Management
Considerations in a Cloud
• Trust model that handles
– (i) Various trust relationships, (ii) access control policies based on roles and
attributes, (iii) real-time provisioning, (iv) authorization, and (v) auditing and
accountability.
• Several technologies are being examined to develop the trust
model
– Service-oriented technologies; standards such as SAML and XACML; and
identity management technologies such as OpenID.
• Does one size fit all?
– Can we develop a trust model that will be applicable to all types of clouds
such as private clouds, public clouds and hybrid clouds?
– Identity architecture has to be integrated into the cloud architecture.
Big Data and the Cloud
• Big Data describes large and complex data that cannot be managed by
traditional data management tools
• From petabytes to exabytes to zettabytes of data
• Need tools for capture, storage, search, sharing, analysis, and
visualization of big data
• Examples include
– Web logs, RFID and surveillance data, sensor networks, social network data
(graphs), text and multimedia, data pertaining to astronomy, atmospheric
science, genomics, biogeochemical, biological fields, video archives
• Big Data Technologies
– Hadoop/MapReduce Platform, HIVE Platform, Twitter Storm Platform, Google
App Engine, Amazon EC2 Cloud, offerings from Oracle and IBM for Big Data
Management; others: Cassandra, Mahout, Pig Latin, ...
• Cloud computing is emerging as a critical tool for Big Data management
• Critical to maintain security and privacy for Big Data
Security and Privacy for Big Data
• Secure Storage and Infrastructure
– How can technologies such as Hadoop and MapReduce be
secured?
• Secure Data Management
– Techniques for secure query processing
– Examples: securing HIVE, Cassandra
• Big Data for Security
– Analysis of security data (e.g., malware analysis)
• Regulations, Compliance, Governance
– What are the regulations for storing, retaining, managing,
transferring and analyzing Big Data?
– Are corporations compliant with the regulations?
• Privacy of individuals has to be maintained not just for raw
data but also for data integration and analytics
• Roles and responsibilities must be clearly defined
Security and Privacy for Big Data
• Regulations Stifling Innovation?
– A major concern is that too many regulations will stifle
innovation
– Corporations must take advantage of Big Data
technologies to improve business
– But this could infringe on individual privacy
– Regulations may also interfere with privacy (for example,
retaining the data)
– Challenge: how can one carry out analytics and still
maintain privacy?
• National Science Foundation workshop planned for Spring 2014 at
the University of Texas at Dallas
Education on Secure Cloud Computing
and Related Technologies
• Secure Cloud Computing
– NSF Capacity Building Grant on Assured Cloud Computing
• Introduce cloud computing into several cyber security courses
• Completed courses
– Data and Applications Security, Data Storage, Digital Forensics,
Secure Web Services
– Computer and Information Security
• Capstone Course
– One course that covers all aspects of assured cloud computing
– Week-long course to be given at Texas Southern University
• Analyzing and Securing Social Networks
• Big Data Analytics and Security
Directions
• Secure VMM and VNM
– Designing Secure XEN VMM
– Developing automated techniques for VMM
introspection
– Determine a secure network infrastructure for the
cloud
• Integrate Secure Storage Algorithms into Hadoop
• Identity Management in the Cloud
• Secure cloud-based Big Data Management/Social
Networking
Related Books
• Developing and Securing the Cloud, CRC Press (Taylor and
Francis), November 2013 (Thuraisingham)
• Secure Data Provenance and Inference Control with
Semantic Web, CRC Press 2014, In Print (Cadenhead,
Kantarcioglu, Khadilkar, Thuraisingham)
• Analyzing and Securing Social Media, CRC Press, 2014,
In preparation (Abrol, Heatherly, Khan, Kantarcioglu,
Khadilkar, Thuraisingham)