Data Mining - UAntwerpen

Download Report

Transcript Data Mining - UAntwerpen

Research Internships
Advanced Research and Modeling
Research Group
ADREM – What?
• Research group that deals with computational
aspects of data
– databases
– data mining
– Information retrieval
ADREM – Who?
DB/DM/IR
• Floris Geerts
• Bart Goethals
• Martin Theobald
Bioinf
• Kris Laukens
• Tim Van den Bulcke
+ Phd students and postdoctoral researchers
http://adrem.ua.ac.be/adrem
Internships – What?
• 2 research internships (15 credits each)
• Msc thesis (30 credits).
Goal: internships are an initiation to research and is in collaboration with
researchers in ADReM
• 15 credits is a lot = internship is time consuming!
• 1 credit = 15 hour work…
• Balance your course load and internship well.
• Internships are not necessarily related to your Msc thesis (but it can)
• In a Msc thesis your ability to independently do research plays an
important role.
Internships – Who?
• Everyone who follows the research option in the
database Msc program
Research
In an internship you need to:
1.Understand a specific problem
2.Implement an (existing) method for solving the
problem
3.Test and evaluate
4.Write a report
(Msc thesis: you have to solve the problem as well
by designing new methods…)
Internships in a company
• It is allowed to do a internship in a company
but you have to ask permission
• Also, you have to find the company yourself
and convince us that there is research
involved
• You can’t receive any money from the
company during your internship
Databases, data mining, information
retrieval
• These are not separate research domains
• The topics for internships that each of us will
present next are usually on the intersection of
these areas.
• Let’s see some example topics….
Bart Goethals
Recommender Systems
•
•
•
•
•
Implement state of the art recommenders
Pattern mining for better recommendations
Interactive Recommendation
Explaining recommendations
Test recommenders for real data
Visual Instant Interactive
Pattern Mining
• Study Visualizations enabling Interactive
Pattern Mining
• Implement and Experiment with novel instant
mining methods
Pattern based Clustering
• Implement and evaluate different techniques
for clustering based pattern mining, and
pattern based clustering
Data Mining for Cleaning
• Study and experiment with data mining
methods for data cleaning.
Martin Theobald
Information Extraction (I): Wikipedia Infoboxes
Information Extraction (I): Infoboxes
YAGO/DBpedia et al.
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2
(mostly from Wikipedia infoboxes)
Information Extraction (II): Wikipedia Categories
Information Extraction (II): Wikipedia Categories
RDF Knowledge Bases
3 Mio. entities, 120 Mio. facts
100 relations, 200k classes
Entity
subclass
subclass
Organization
accuracy
 95%
subclass
Person
Location
subclass
subclass
Scientist
subclass
subclass
Biologist
subclass
Politician
instanceOf
subclass
subclass
State
instanceOf
Physicist
Country
instanceOf
City
instanceOf
Germany
instanceOf
Oct 23, 1944
instanceOf
Max_Planck
Society
instanceOf
Erwin_Planck
diedOn
Nobel Prize
Kiel
hasWon
FatherOf
locatedIn
locatedIn
bornIn
SchleswigHolstein
citizenOf
Oct 4, 1947
Apr 23, 1858
diedOn
Max_Planck
bornOn
means
means
Angela Merkel
“Max
Planck”
http://www.mpi-inf.mpg.de/yago-naga/
means
“Max Karl Ernst
Ludwig Planck”
means
“Angela
Merkel”
means
“Angela
Dorothea
Merkel”
Linked Open Data
As of Sept. 2011:
> 200 sources
> 30 billion RDF triples
> 400 million links
http://linkeddata.org/
As of Sept. 2011:
> 5 million owl:sameAs links
between DBpedia/YAGO/Freebase
IBM Watson: Deep Question Answering
William Wilkinson's "An Account of the
Principalities of Wallachia and Moldavia"
inspired this author's most famous novel
This town is known as "Sin City" & its
downtown is "Glitter Gulch"
As of 2010, this is the only
former Yugoslav republic in the EU
99 cents got me a 4-pack of Ytterlig coasters
from this Swedish chain
question
classification &
decomposition
knowledge
back-ends
D. Ferrucci et al.: Building Watson: An Overview of the
DeepQA Project. AI Magazine, Fall 2010.
YAGO
www.ibm.com/innovation/us/watson/index.htm
Jeopardy!
A big US city with two airports, one named after a World
War II hero, and one named after a World War II battle field?
Structured Knowledge Queries
A big US city with two airports, one named after a World
War II hero, and one named after a World War II battle field?
Select Distinct ?c Where {
?c type City . ?c locatedIn USA .
?a1 type Airport . ?a2 type Airport .
?a1 locatedIn ?c . ?a2 locatedIn ?c .
?a1 namedAfter ?p . ?p type WarHero .
?a2 namedAfter ?b . ?b type BattleField . }
• Use manually created templates for mapping sentence
patterns to structured queries.
• Works for factoid and list questions.
Mining Rules from RDF Knowledge Bases
Goal: Inductively learn (soft) rules: livesIn(x,y) :- bornIn(x,y)
R
Ground truth for livesIn (only partially known)
Knowledge base for livesIn (known positive examples)
Facts produced by the rule (only partially correct)
KB
G
| Head  Body|
accuracy( S )  P( Head | Body) 
| Body|
• A-priori-style pre-filtering of low-support join patterns
• Dynamic programming ILP algorithm
• Learning with constants and type constraints
Rule-based Reasoning
(Soft) Deduction Rules vs.
(Hard) Consistency Constraints
• People may live in more than one place
livesIn(x,y)  marriedTo(x,z)  livesIn(z,y)
[0.8]
livesIn(x,y)  hasChild(x,z)  livesIn(z,y)
[0.5]
• People are not born in different places/on different dates
bornIn(x,y)  bornIn(x,z)  y=z
• People are not married to more than one person
(at the same time, in most countries?)
marriedTo(x,y,t1)  marriedTo(x,z,t2)  y≠z
 disjoint(t1,t2)
Probabilistic RDF Database
Query
graduatedFrom(Surajit, y)
0.7x(1-0.888)=0.078
graduatedFrom
(Surajit,
Q1 Princeton)
(1-0.7)x0.888=0.266
graduatedFrom
(Surajit,
Q2 Stanford)
A(B (CD))
 A(B (CD))
1-(1-0.72)x(1-0.6)
=0.888

\/
0.8x0.9
=0.72
A graduatedFrom
(Surajit,
Princeton)[0.7]
C
hasAdvisor
(Surajit,Jeff)[0.8]
B graduatedFrom
/\
D
(Surajit,
Stanford)[0.6]
worksAt
(Jeff,Stanford)[0.9]
Rules
hasAdvisor(x,y) 
worksAt(y,z)
 graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) 
graduatedFrom(x,z)
 y=z
Base Facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Temporal Knowledge
Probabilistic-Temporal Consistency Reasoning
Derived
Facts
 t3 teamMates(Beckham,
Ronaldo,
Ronaldo, Tt3)
State
Relation
0.08
‘03
0.4
Base
Facts
‘04

0.16
playsFor(Beckham, Real, T1)
 playsFor(Ronaldo, Real, T2)
 overlaps(T1,T2)
0.12
‘05
‘07
0.6
‘05
‘07
‘03
playsFor(Beckham, Real, T1)
0.1
0.2
0.4
0.2
‘00 ‘02
‘07
‘04 ‘05
playsFor(Ronaldo, Real, T2)
Topics for Internships & Master Theses
Research Internships
• Preparation & Integration of Linked Data Sources for
Scientific Experiments (SQL/Java/Python)
• Mining Association Rules from Linked Data (Java/C++)
• Visualization Frontend for Linked Data
(ActionScript & Adobe Flash)
Master Theses
• Implementation of a distributed rule-based query engine
for RDF data (C++ & Message Passing Interface)
• Implementation of a distributed factor graph model for
correlated RDF facts (C++ & Message Passing Interface)
• Faceted Search and Interactive Browsing for Linked Data
Floris Geerts
RDBMS-based recommendation
systems
EDI
NY
 Find top-3 flights from Edi to NYC with at most one stop
 Items: flights
 Selection criteria: relational queries
 Utility function: in terms of price and duration (for ranking)

Top-k item selection
Selection criteria
Utility function
top-k items
…
items
32
Books, music, news, Web sites, research papers,…..
Query relaxation
Query for 5-day holiday
Q(f#, name,type,ticket, time) = ∃DT, AT, AD, xTo
( flight ( f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr ) ∧
POI ( name, xTo, type, ticket, time) ∧xTo= NYC )
Relaxation: cities
There is no direct flight
within 15 miles of EDI
from EDI to NYC
or NYC are acceptable
E = { EDI,NYC,4/1/2012 }, X = { xTo }
Q1(f#, name, type, ticket, time) =∃DT, AT, AD, uTo, wEdi, wNYC,wDD
( flight ( f#, wEdi, xTo, DT ,wDD, AT,A D, Pr ) ∧ xTo= wNYC ∧
POI( name, uTo, type, ticket, time) ∧ wDD=5/19/2012 ∧
dist(wNYC,NYC)≤15 ∧ dist(wEdi,EDI) ≤15 ∧ xTo=uTo)
Further relaxation: departure dates
within 3 days of 5/19/2012
arequery relaxation
dist(wDD,5/10/2012 ) ≤ 3
valid
33
acceptable
Topics
 Top-k query answering algorithm on top of RDBMS
 Query relaxation approaches and query completion
Data quality
• Detecting and correcting inconsistencies
• Finding duplicates
• Finding most up-to-date information
Semantic errors
Yahoo! Finance
Day’s Range: 93.80-95.71
Nasdaq
52wk Range: 25.38-95.71
Day’s Range: 93.80-95.71
52 Wk: 25.38-93.72
Instance ambiguity
Out-of-Date Data
4:05 pm
3:57 pm
Unit errors
76.82B
76,821,000
Topics




Fast inconsistency detection
Duplication elimination algorithms
Automated repairing algorithms
Mining of “data quality rules”