Data Mining - UAntwerpen
Download
Report
Transcript Data Mining - UAntwerpen
Research Internships
Advanced Research and Modeling
Research Group
ADREM – What?
• Research group that deals with computational
aspects of data
– databases
– data mining
– Information retrieval
ADREM – Who?
DB/DM/IR
• Floris Geerts
• Bart Goethals
• Martin Theobald
Bioinf
• Kris Laukens
• Tim Van den Bulcke
+ Phd students and postdoctoral researchers
http://adrem.ua.ac.be/adrem
Internships – What?
• 2 research internships (15 credits each)
• Msc thesis (30 credits).
Goal: internships are an initiation to research and is in collaboration with
researchers in ADReM
• 15 credits is a lot = internship is time consuming!
• 1 credit = 15 hour work…
• Balance your course load and internship well.
• Internships are not necessarily related to your Msc thesis (but it can)
• In a Msc thesis your ability to independently do research plays an
important role.
Internships – Who?
• Everyone who follows the research option in the
database Msc program
Research
In an internship you need to:
1.Understand a specific problem
2.Implement an (existing) method for solving the
problem
3.Test and evaluate
4.Write a report
(Msc thesis: you have to solve the problem as well
by designing new methods…)
Internships in a company
• It is allowed to do a internship in a company
but you have to ask permission
• Also, you have to find the company yourself
and convince us that there is research
involved
• You can’t receive any money from the
company during your internship
Databases, data mining, information
retrieval
• These are not separate research domains
• The topics for internships that each of us will
present next are usually on the intersection of
these areas.
• Let’s see some example topics….
Bart Goethals
Recommender Systems
•
•
•
•
•
Implement state of the art recommenders
Pattern mining for better recommendations
Interactive Recommendation
Explaining recommendations
Test recommenders for real data
Visual Instant Interactive
Pattern Mining
• Study Visualizations enabling Interactive
Pattern Mining
• Implement and Experiment with novel instant
mining methods
Pattern based Clustering
• Implement and evaluate different techniques
for clustering based pattern mining, and
pattern based clustering
Data Mining for Cleaning
• Study and experiment with data mining
methods for data cleaning.
Martin Theobald
Information Extraction (I): Wikipedia Infoboxes
Information Extraction (I): Infoboxes
YAGO/DBpedia et al.
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2
(mostly from Wikipedia infoboxes)
Information Extraction (II): Wikipedia Categories
Information Extraction (II): Wikipedia Categories
RDF Knowledge Bases
3 Mio. entities, 120 Mio. facts
100 relations, 200k classes
Entity
subclass
subclass
Organization
accuracy
95%
subclass
Person
Location
subclass
subclass
Scientist
subclass
subclass
Biologist
subclass
Politician
instanceOf
subclass
subclass
State
instanceOf
Physicist
Country
instanceOf
City
instanceOf
Germany
instanceOf
Oct 23, 1944
instanceOf
Max_Planck
Society
instanceOf
Erwin_Planck
diedOn
Nobel Prize
Kiel
hasWon
FatherOf
locatedIn
locatedIn
bornIn
SchleswigHolstein
citizenOf
Oct 4, 1947
Apr 23, 1858
diedOn
Max_Planck
bornOn
means
means
Angela Merkel
“Max
Planck”
http://www.mpi-inf.mpg.de/yago-naga/
means
“Max Karl Ernst
Ludwig Planck”
means
“Angela
Merkel”
means
“Angela
Dorothea
Merkel”
Linked Open Data
As of Sept. 2011:
> 200 sources
> 30 billion RDF triples
> 400 million links
http://linkeddata.org/
As of Sept. 2011:
> 5 million owl:sameAs links
between DBpedia/YAGO/Freebase
IBM Watson: Deep Question Answering
William Wilkinson's "An Account of the
Principalities of Wallachia and Moldavia"
inspired this author's most famous novel
This town is known as "Sin City" & its
downtown is "Glitter Gulch"
As of 2010, this is the only
former Yugoslav republic in the EU
99 cents got me a 4-pack of Ytterlig coasters
from this Swedish chain
question
classification &
decomposition
knowledge
back-ends
D. Ferrucci et al.: Building Watson: An Overview of the
DeepQA Project. AI Magazine, Fall 2010.
YAGO
www.ibm.com/innovation/us/watson/index.htm
Jeopardy!
A big US city with two airports, one named after a World
War II hero, and one named after a World War II battle field?
Structured Knowledge Queries
A big US city with two airports, one named after a World
War II hero, and one named after a World War II battle field?
Select Distinct ?c Where {
?c type City . ?c locatedIn USA .
?a1 type Airport . ?a2 type Airport .
?a1 locatedIn ?c . ?a2 locatedIn ?c .
?a1 namedAfter ?p . ?p type WarHero .
?a2 namedAfter ?b . ?b type BattleField . }
• Use manually created templates for mapping sentence
patterns to structured queries.
• Works for factoid and list questions.
Mining Rules from RDF Knowledge Bases
Goal: Inductively learn (soft) rules: livesIn(x,y) :- bornIn(x,y)
R
Ground truth for livesIn (only partially known)
Knowledge base for livesIn (known positive examples)
Facts produced by the rule (only partially correct)
KB
G
| Head Body|
accuracy( S ) P( Head | Body)
| Body|
• A-priori-style pre-filtering of low-support join patterns
• Dynamic programming ILP algorithm
• Learning with constants and type constraints
Rule-based Reasoning
(Soft) Deduction Rules vs.
(Hard) Consistency Constraints
• People may live in more than one place
livesIn(x,y) marriedTo(x,z) livesIn(z,y)
[0.8]
livesIn(x,y) hasChild(x,z) livesIn(z,y)
[0.5]
• People are not born in different places/on different dates
bornIn(x,y) bornIn(x,z) y=z
• People are not married to more than one person
(at the same time, in most countries?)
marriedTo(x,y,t1) marriedTo(x,z,t2) y≠z
disjoint(t1,t2)
Probabilistic RDF Database
Query
graduatedFrom(Surajit, y)
0.7x(1-0.888)=0.078
graduatedFrom
(Surajit,
Q1 Princeton)
(1-0.7)x0.888=0.266
graduatedFrom
(Surajit,
Q2 Stanford)
A(B (CD))
A(B (CD))
1-(1-0.72)x(1-0.6)
=0.888
\/
0.8x0.9
=0.72
A graduatedFrom
(Surajit,
Princeton)[0.7]
C
hasAdvisor
(Surajit,Jeff)[0.8]
B graduatedFrom
/\
D
(Surajit,
Stanford)[0.6]
worksAt
(Jeff,Stanford)[0.9]
Rules
hasAdvisor(x,y)
worksAt(y,z)
graduatedFrom(x,z) [0.4]
graduatedFrom(x,y)
graduatedFrom(x,z)
y=z
Base Facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Temporal Knowledge
Probabilistic-Temporal Consistency Reasoning
Derived
Facts
t3 teamMates(Beckham,
Ronaldo,
Ronaldo, Tt3)
State
Relation
0.08
‘03
0.4
Base
Facts
‘04
0.16
playsFor(Beckham, Real, T1)
playsFor(Ronaldo, Real, T2)
overlaps(T1,T2)
0.12
‘05
‘07
0.6
‘05
‘07
‘03
playsFor(Beckham, Real, T1)
0.1
0.2
0.4
0.2
‘00 ‘02
‘07
‘04 ‘05
playsFor(Ronaldo, Real, T2)
Topics for Internships & Master Theses
Research Internships
• Preparation & Integration of Linked Data Sources for
Scientific Experiments (SQL/Java/Python)
• Mining Association Rules from Linked Data (Java/C++)
• Visualization Frontend for Linked Data
(ActionScript & Adobe Flash)
Master Theses
• Implementation of a distributed rule-based query engine
for RDF data (C++ & Message Passing Interface)
• Implementation of a distributed factor graph model for
correlated RDF facts (C++ & Message Passing Interface)
• Faceted Search and Interactive Browsing for Linked Data
Floris Geerts
RDBMS-based recommendation
systems
EDI
NY
Find top-3 flights from Edi to NYC with at most one stop
Items: flights
Selection criteria: relational queries
Utility function: in terms of price and duration (for ranking)
Top-k item selection
Selection criteria
Utility function
top-k items
…
items
32
Books, music, news, Web sites, research papers,…..
Query relaxation
Query for 5-day holiday
Q(f#, name,type,ticket, time) = ∃DT, AT, AD, xTo
( flight ( f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr ) ∧
POI ( name, xTo, type, ticket, time) ∧xTo= NYC )
Relaxation: cities
There is no direct flight
within 15 miles of EDI
from EDI to NYC
or NYC are acceptable
E = { EDI,NYC,4/1/2012 }, X = { xTo }
Q1(f#, name, type, ticket, time) =∃DT, AT, AD, uTo, wEdi, wNYC,wDD
( flight ( f#, wEdi, xTo, DT ,wDD, AT,A D, Pr ) ∧ xTo= wNYC ∧
POI( name, uTo, type, ticket, time) ∧ wDD=5/19/2012 ∧
dist(wNYC,NYC)≤15 ∧ dist(wEdi,EDI) ≤15 ∧ xTo=uTo)
Further relaxation: departure dates
within 3 days of 5/19/2012
arequery relaxation
dist(wDD,5/10/2012 ) ≤ 3
valid
33
acceptable
Topics
Top-k query answering algorithm on top of RDBMS
Query relaxation approaches and query completion
Data quality
• Detecting and correcting inconsistencies
• Finding duplicates
• Finding most up-to-date information
Semantic errors
Yahoo! Finance
Day’s Range: 93.80-95.71
Nasdaq
52wk Range: 25.38-95.71
Day’s Range: 93.80-95.71
52 Wk: 25.38-93.72
Instance ambiguity
Out-of-Date Data
4:05 pm
3:57 pm
Unit errors
76.82B
76,821,000
Topics
Fast inconsistency detection
Duplication elimination algorithms
Automated repairing algorithms
Mining of “data quality rules”