Harris T. Lin - Department of Computer Science

Download Report

Transcript Harris T. Lin - Department of Computer Science

Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Machine Learning
Learning Classifiers from Chains of
Multiple Interlinked RDF Data Stores
Relational, Distributed
?
Harris T. Lin and Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
[email protected]
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Resource Description Framework (RDF) Primer
• RDF triple = subject-predicate-object triple
• RDF graph = set of RDF triples
• Directed labeled graph whose nodes are URIs
yearOfBirth
Movie
hasActor
Actor
xsd:integer
gender
Gender
RDF Schema
RDF Data (Triple representation)
RDF Data (Graph representation)
1987
yearOfBirth
Inception
hasActor
Ellen Page
gender
hasActor
F
1974
yearOfBirth
Titanic
hasActor
Leonardo
gender
DiCaprio
M
(Inception, hasActor, Ellen Page)
(Inception, hasActor, Leonardo DiCaprio)
(Titanic, hasActor, Leonardo DiCaprio)
(Ellen Page, yearOfBirth, 1987)
(Ellen Page, gender, F)
(Leonardo DiCaprio, yearOfBirth, 1974)
(Leonardo DiCaprio, gender, M)
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Introduction
• Motivating scenario: Facebook + New York Times
– Facebook users share posts about news items published in
New York Times
– Goal: predict the interest of a user in joining a group
• Challenges for Machine Learning
– Multiple interlinked data stores
– Physically distributed data stores
– Autonomously maintained data stores
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Introduction
• Linked Open Data cloud
– 300+ interlinked datasets
– 30+ trillion triples
Linked Open Data cloud diagram,
by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/
• Multiple interlinked, physically distributed, autonomously
maintained data stores
• Prohibits downloading all data together
–
–
–
–
Bandwidth limits
Access limits
Storage and Memory limits
Privacy and confidentiality constraints
• We need
– Learning from multiple interlinked RDF stores that support
only indirect access to data (e.g. SPARQL query interface)
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Summary of Contribution
• Learning Classifiers from Chains of Multiple Interlinked RDF
Data Stores
• Contributions
1.
2.
3.
4.
5.
Statistical query-based formulations of several representative
algorithms for learning classifiers from RDF data
Distributed learning framework for RDF stores that form a chain
[Not covered in this talk]
Identify 3 special cases of RDF data fragmentation
[Not covered in this talk]
Novel application of matrix reconstruction for approximating
statistics, which dramatically reduce communication
Experimental results demonstrating feasibility
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Problem Formulation
• Last.fm dataset:
User1
Y
User2
N
User3
N
Dataset
(Conceptual)
((B11, …, B1K), c1)
((B21, …, B2K), c2)
…
((Bn1, …, BnK), cn)
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Learning with Indirect Access to Data
RDF data stores
Statistics via
SPARQL queries
((B11, …, B1K), c1)
((B21, …, B2K), c2)
…
((Bn1, …, BnK), cn)
New instance
Learner
Classifier
Predicted class
• Single RDF data store
– Lin et al. [10]
• Multiple Interlinked RDF data stores
– This work
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Learning Algorithms
1. Aggregation
– Simple aggregation (max, min, avg, etc.)
– Vector distance aggregation (Perlich and Provost [12])
2. Generative Models
– Naïve Bayes (with 4 different distributions)
•
•
•
•
Bernoulli
Multinomial
Dirichlet
Polya (Dirichlet-Multinomial)
• Key sufficient statistic:
count for each value, for each instance
(= histogram for each instance)
• How to obtain this efficiently?
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Obtaining Statistics for Learning
Schema:
Users
Track
Artist
Tag
Data Graph:
Tag
User
Tag
Artist
Artist
Track
Track
User
Matrix
Representation:
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Iowa State University
Approximating Statistics
Tag
User
Tag
Tag
User
Tag
Artist
Track
Artist
Tag
User
Artist
Artist
Track
User
Track
User
Tag
Artist
User
Track
Column
Projection:
Row
Projection:
Artist
Track
Track
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Approximating Statistics
User
Tag
• Could we approximate this matrix
?
from the two projections?
• CT scans for the rescue!
• CT: Reconstruct 3D object from its projected slices
• We want: Reconstruct 2D matrix from its projections
Source: http://health-fts.blogspot.com/2012/01/brain-ct-mri.html
Source: https://www.medicalradiation.com/types-of-medicalimaging/imaging-using-x-rays/computed-tomography-ct/
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Approximating Statistics
Tag
User
• We adapted one of the simplest
reconstruction method:
Algebraic Reconstruction Technique
• Proposed scheme
?
1. Use SPARQL queries to accumulate and pass along column
and row vectors, ultimately send back to the learner
2. Learner use a CT method to reconstruct matrix from
projections
3. Use the approximated matrix to compute necessary
statistics for learning
• Dramatically reduce communication!
• How accurate are the learned classifiers?
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Experimental Results
• Two subsets of Last.fm dataset
• 2 aggregation and 4 naïve Bayes models
• Compares against centralized counterpart
– Uses exact matrix for learning
• Accuracy Results (10-fold cross validation)
– ART approximation has different effects depend on models
– NB(Pol) is competitive, even in the ART approximated case
– NB(Mul) is competitive too, despite using less information than NB(Pol)
• NB (Bernoulli) and NB (Multinomial) only need projections for learning, hence
their results are identical (*)
•
Sensitivity of ART on different models [Not covered in this talk]
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Communication Complexity
• Size of query results transferred
v.s.
Size of the dataset (# users)
• Size of projections are several orders of magnitude
smaller
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Conclusion
• Challenges
– Multiple interlinked, physically distributed, autonomously maintained RDF data
stores
– Learner may be prohibited to download all data due to limitations in bandwidth,
access, storage and memory, privacy and confidentiality constraints
• We need
– Learning from multiple interlinked RDF stores that support only indirect access to
data (e.g. SPARQL query interface)
• Contributions
– Statistical query-based formulations of several representative algorithms for
learning classifiers from RDF data
– Distributed learning framework for RDF stores that form a chain
[Not covered in this talk]
– Identify 3 special cases of RDF data fragmentation
[Not covered in this talk]
– Novel application of matrix reconstruction from Computerized Tomography for
approximating statistics, which dramatically reduce communication
– Experimental results demonstrating feasibility
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Related Work and Future Work
• Related Work
– Most existing work on learning from RDF data assume
direct access
– Lin et al. [10] learns relational Bayesian classifiers from a
single remote RDF store via SPARQL queries
– Extends the remote access framework [20] to multiple
RDF stores
• Future Work
–
–
–
–
Consider more recent and complex CT methods
Explore other ways of taking projections
Consider more complex RDF data fragmentations
Consider richer classes of learning models
Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.