Knowledge and the Social Web - clic

INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
Massimo Poesio
LECTURE 10: Knowledge and The Social
Web
‘CYC convinced the AI community that
creating a commonsense knowledge
base by hand is impossible’
(Massimo, Lecture 1)
That may depend on how many
people you put on to it!
THE SOCIAL WEB
• Increasingly, the Web is becoming not just a
way to facilitate information exchange or
commercial transactions, but also a tool to
facilitate socialization (Facebook, LinkedIn,
etc)
• It is also a place where information can be
collectively created
SOCIAL CREATION OF KNOWLEDGE
WIKIPEDIA
The free encyclopedia that anyone can edit
•Wikipedia is a free, multilingual
encyclopedia project supported by
the non-profit Wikimedia Foundation.
•Wikipedia's articles have been
written collaboratively by volunteers
around the world.
•Almost all of its articles can be edited
by anyone who can access the
Wikipedia website.
Source: http://en.wikipedia.org/wiki/Wikipedia
WIKIPEDIA
• Wikipedia is:
1. domain independent
– it has a large coverage
2. up-to-date
– to process current information
3. multilingual
– to process information in many languages
•Title
•Abstract
•Infoboxes
•Geo-coordinates
•Categories
•Images
•Links
– Other languages
– Other wiki pages
– To the web
•Redirects
•Disambiguates
Encyclopedic knowledge in
coreference resolution
[The FCC] took [three specific actions]
regarding [AT&T]. By a 4-0 vote, it allowed
AT&T to continue offering special discount
packages to big customers, called Tariff
12, rejecting appeals by AT&T competitors
that the discounts were illegal. …..
[The agency] said that because MCI's
offer had expired AT&T couldn't continue
to offer its discount plan.
Why Wikipedia may help address the
encyclopedic knowledge problem
http://en.wikipedia.org/wiki/FCC:
The Federal Communications
Commission (FCC) is an independent
United States government agency,
created, directed, and empowered by
Congressional statute (see 47
U.S.C. § 151 and 47 U.S.C. § 154).
Another interesting scenario
A fresh mandate for [Mr Ahmadinejad] would, say his
critics, consecrate the “revolution within a revolution” he
has been trying to effect since his surprise electoral
triumph in 2005. Best known to outsiders for his
bellicose grandstanding, [the incumbent] is more
familiar to Iranians as a radical and hyperactive populist
who has used the tacit backing of his fellow
conservative, Mr Khamenei, greatly to expand the
powers of the presidency.
Source: It could make a big difference, The Economist, Mar 19th 2009
Why Wikipedia may help address the
encyclopedic knowledge problem
Wikipedia as Ontology
• Unlike other standard ontologies, such as
WordNet and MeSH, Wikipedia itself is not a
structured thesaurus.
• However, it is more…
– Comprehensive: it contains 12 million articles (2.8
million in the English Wikipedia)
– Accurate : A study by Giles (2005) found Wikipedia can
compete with Encyclopædia Britannica in accuracy*.
– Up to date: Current and emerging concepts are
absorbed quickly.
* Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.
Wikipedia as Ontology
• Moreover, Wikipedia has a well-formed
structure
– Each article only describes a single concept.
– The title of the article is a short and well-formed
phrase like a term in a traditional thesaurus.
Wikipedia Article that describes the Concept Artificial intelligence
– Equivalent concepts are grouped together by
redirected links.
AI is redirected to its equivalent concept Artificial Intelligence
– It contains a hierarchical categorization system,
in which each article belongs to at least one
category.
The concept Artificial Intelligence belongs to four categories:
Artificial intelligence, Cybernetics, Formal sciences & Technology in society
– Polysemous concepts are disambiguated by
Disambiguation Pages.
The different meanings that Artificial intelligence may refer to
are listed in its disambiguation page.
SEMANTIC NETWORK KNOWLEDGE
IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from
Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from
Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semistructured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848 ...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...
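Because every infobox field sits on its own line as a key and a value, attribute-value pairs can be pulled out with very little machinery. The following is a minimal Python sketch, assuming one `| key = value` field per line; the name `parse_infobox` is illustrative, and this is not the extraction code used by any real project.

```python
import re

# One infobox field per line: "| key = value". Pipes inside the value
# (for example in nested templates) are left untouched.
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")

def parse_infobox(wikitext):
    """Return a dict of field name -> raw value for a simple infobox."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

info = parse_infobox(sample)
print(info["name"])  # Edgar Allan Poe
```

The values are kept as raw wikitext; turning `{{birth date|...}}` or `[[...]]` links into typed data would need further rules.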
DBPEDIA
DBpedia.org is an effort to:
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts, including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED
INFORMATION
The DBpedia.org project uses the Resource Description Framework
(RDF) as a flexible data model for representing extracted information
and for publishing it on the Web. It uses the SPARQL query language
to query this data. The Developers Guide to Semantic Web Toolkits lists
development toolkits for processing DBpedia data in your preferred
programming language.
Extracting Infobox Data (RDF Representation):
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name “Calgary”;
dbpedia:altitude “1048”;
dbpedia:population_city “988193”;
dbpedia:population_metro “1079310”;
mayor_name dbpedia:Dave_Bronconnier;
governing_body dbpedia:Calgary_City_Council;
...
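The RDF view of the Calgary infobox is a set of (subject, predicate, object) triples. As a rough illustration only, they can be held in a plain Python list and matched by pattern. The shorthand subject `dbpedia:Calgary` (standing for http://dbpedia.org/resource/Calgary) and the helper `match` are assumptions made for this sketch, not part of DBpedia.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# Property names follow the slide; "dbpedia:Calgary" is an assumed shorthand.
triples = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "mayor_name", "dbpedia:Dave_Bronconnier"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

altitude = match(triples, subject="dbpedia:Calgary", predicate="dbpedia:altitude")
print(altitude[0][2])  # 1048
```

Leaving an argument as `None` makes it a wildcard, which is the same idea as a variable in a SPARQL triple pattern.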
SPARQL :
• SPARQL is a query language for RDF.
•RDF is a directed, labeled graph data format for representing
information in the Web.
•This specification defines the syntax and semantics of the
SPARQL query language for RDF.
• SPARQL can be used to express queries across diverse data
sources, whether the data is stored natively as RDF or viewed
as RDF via middleware.
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow?
– All films by Quentin Tarantino?
– All German musicians that were born in Berlin in the 19th century?
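To make the "all films by Quentin Tarantino" question concrete, here is a hedged Python sketch that only builds the request URL and sends nothing. The property `dbo:director`, the resource name `dbr:Quentin_Tarantino`, and the `format` parameter are assumptions about the current dataset and server, not taken from the slides, so check them against the live endpoint before relying on them.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# ASSUMPTION: dbo:director and dbr:Quentin_Tarantino are the right property
# and resource names; the slides do not give them.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL using the 'query' parameter."""
    # ASSUMPTION: the server accepts a 'format' parameter for JSON results.
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(QUERY)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the bindings for `?film`, provided the assumed names exist in the dataset.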
WEB COLLABORATION FOR
KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many
Web surfers may be willing to participate in
collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and
Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting
facts)
– Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense
– Singh
• Learner / Learner2 / 1001 Paraphrases
– Chklovski
• FACTory
– Cycorp
• ESP / Phetch / Verbosity / Peekaboom
– von Ahn
• Crater mapping (results)
– Kanefsky
• Hot or Not
– 8 Days
• Galaxy Zoo
– Oxford University
OPEN MIND COMMONSENSE
• A project started in 2000 by Push Singh to take
advantage of people’s collaboration to collect
commonsense knowledge
WHAT’S IN OPEN MIND
COMMONSENSE: CAR
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
– IsA: (IsA "apple" "fruit")
– PartOf: (PartOf "CPU" "computer")
– PropertyOf: (PropertyOf "coffee" "wet")
– MadeOf: (MadeOf "bread" "flour")
– DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
– PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
– SubeventOf: (SubeventOf "play sport" "score goal")
– FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
– LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
– CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
– LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
– EffectOf: (EffectOf "view video" "entertainment")
– DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
– DesireOf: (DesireOf "person" "not be depressed")
– MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
– UsedFor: (UsedFor "fireplace" "burn wood")
– CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
– SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
– ThematicKLine: (ThematicKLine "wedding dress" "veil")
– ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
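Assertions of this form are simply labelled edges between two concepts. A small Python sketch, assuming they are stored as (relation, concept1, concept2) tuples; the index `by_concept` and the helper `related` are illustrative names, not the ConceptNet API.

```python
from collections import defaultdict

# A few assertions from the list above, stored as (relation, concept1, concept2).
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
]

# Index the assertions by their first concept so lookups stay cheap.
by_concept = defaultdict(list)
for relation, first, second in assertions:
    by_concept[first].append((relation, second))

def related(concept, relation=None):
    """Return (relation, concept2) pairs for a concept, optionally filtered."""
    return [
        (rel, other)
        for rel, other in by_concept.get(concept, [])
        if relation is None or rel == relation
    ]

print(related("bread"))  # [('MadeOf', 'flour')]
```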
OPEN MIND COMMONSENSE:
ADDING KNOWLEDGE
OMCS ADDING KNOWLEDGE, 2
OPEN MIND COMMONSENSE:
CHECKING KNOWLEDGE
FROM OPENMIND COMMONSENSE
TO CONCEPT NET
• ConceptNet (Havasi et al., 2009) is a semantic
network extracted from Open Mind
Commonsense assertions using simple
heuristics
CONCEPT NET
FROM OPENMIND COMMONSENSE
FACTS TO CONCEPTNET
A lime is a very sour fruit
isa(lime,fruit)
property_of(lime,very_sour)
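The slides do not spell out the heuristics, so the following Python sketch is only a guess at what one such rule could look like. It handles sentences of the shape "A X is a [modifiers] Y" with a single regular expression; the pattern and the function name `extract` are assumptions made for illustration.

```python
import re

# Matches "A <noun> is a [modifiers] <noun>", e.g. "A lime is a very sour fruit".
PATTERN = re.compile(
    r"^(?:an?|the)\s+(\w+)\s+is\s+an?\s+((?:\w+\s+)*)(\w+)$", re.IGNORECASE
)

def extract(sentence):
    """Turn one simple sentence into isa / property_of facts (or nothing)."""
    found = PATTERN.match(sentence.strip().rstrip("."))
    if not found:
        return []
    subject = found.group(1).lower()
    modifiers = [word.lower() for word in found.group(2).split()]
    head = found.group(3).lower()
    facts = [("isa", subject, head)]
    if modifiers:
        # Everything between the article and the head noun becomes one property.
        facts.append(("property_of", subject, "_".join(modifiers)))
    return facts

print(extract("A lime is a very sour fruit"))
# [('isa', 'lime', 'fruit'), ('property_of', 'lime', 'very_sour')]
```

A rule this shallow will misread many sentences (multi-word subjects, for instance), which is why it should be read as a sketch of the idea rather than a usable extractor.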
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to
resource creation on the Web: GAMES WITH A
PURPOSE, or GWAP, in which people, as a side
effect of playing, perform tasks ‘computers are
unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs
MECHANICAL TURK
• GWAP do not rely on altruism or financial
incentives to entice people to perform certain
actions
• The key property of games is that PEOPLE
WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and
his group (2003 / 2004)
• The problem: obtain accurate descriptions of
images to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on
the Web
ESP: THE GAME
• Two partners are picked at random from the large
number of players online
• They are not told who their partner is, and can’t
communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe
the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches
the string typed by the other player, they score
points
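The matching rule can be sketched in a few lines of Python. This assumes the game sees a time-ordered stream of (player, label) events and that labels are compared case-insensitively; both are assumptions of the sketch, as is the name `first_agreement`.

```python
def first_agreement(events):
    """Return the first label typed by both players, or None.

    `events` is a time-ordered list of (player, label) pairs,
    where player is "A" or "B".
    """
    typed = {"A": set(), "B": set()}
    for player, label in events:
        label = label.strip().lower()
        other = "B" if player == "A" else "A"
        if label in typed[other]:
            return label  # the partner already typed this string
        typed[player].add(label)
    return None

events = [("A", "dog"), ("B", "grass"), ("A", "puppy"), ("B", "Dog")]
print(first_agreement(events))  # dog
```

Note that the players never need to type the label at the same moment: a match fires as soon as one player types something the other has already entered.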
THE TASK
SCORING BY MATCHING
THE CHALLENGE: SCORES
• One of the motivating factors is to try to score
as many points as possible
• Hourly, daily, weekly, and monthly scores are
shown
SCORES
THE CHALLENGE: TIMING
• Partners try to agree on as many images as
they can during 2 ½ minutes
• The thermometer on the side indicates how
many images they have agreed on
• If they agree on 15 images they score bonus
points
TABOO WORDS
• To ensure the production of a large number of
specific labels, some words are declared
TABOO and not allowed
• Taboo words are obtained from the game
itself: any word that has been agreed upon by
players who were shown a picture earlier
becomes a taboo word for that image
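A rough Python sketch of the taboo mechanism, assuming each image keeps its own set of taboo words and that an agreed label is added to that set straight away. Typing order is ignored for brevity, and `play_round` is a hypothetical helper.

```python
def play_round(image_taboo, guesses_a, guesses_b):
    """Return the agreed label for one round, or None if there is no match.

    Taboo guesses are ignored; an agreed label becomes taboo for the image.
    """
    allowed_a = [guess for guess in guesses_a if guess not in image_taboo]
    allowed_b = set(guesses_b) - image_taboo
    for guess in allowed_a:
        if guess in allowed_b:
            image_taboo.add(guess)  # later pairs must find a new label
            return guess
    return None

taboo = set()
print(play_round(taboo, ["dog", "grass"], ["dog", "ball"]))   # dog
print(play_round(taboo, ["dog", "puppy"], ["puppy", "dog"]))  # puppy
```

The second pair cannot score with "dog" any more, so the image ends up with a second, more specific label.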
TABOO WORDS
PASSING
GOOD LABELS, COMPLETING AN
IMAGE
• A label is considered “good” when more than
N players produce it (with N a parameter of
the game)
• An image is “done” when its list of taboo
words is so extensive that most players pass
on it
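Both criteria are simple thresholds. In the Python sketch below, `produced_labels` holds one entry per player who produced a label for an image, and "most players pass" is read as a simple majority of recent rounds; that reading, and the names used, are assumptions rather than details given in the slides.

```python
from collections import Counter

def good_labels(produced_labels, n):
    """Labels produced by more than n players for one image."""
    counts = Counter(produced_labels)
    return {label for label, count in counts.items() if count > n}

def is_done(outcomes, pass_share=0.5):
    """True when more than `pass_share` of the recorded rounds were passes.

    ASSUMPTION: "most players pass" means a simple majority.
    """
    if not outcomes:
        return False
    passes = sum(1 for outcome in outcomes if outcome == "pass")
    return passes / len(outcomes) > pass_share

print(good_labels(["dog", "dog", "dog", "ball"], 2))  # {'dog'}
print(is_done(["pass", "pass", "match"]))             # True
```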
IMPLEMENTATION
• Pre-recorded game play
– Especially at the beginning, and at quiet times, there
won’t always be players to pair with
– In these cases a player is paired against a recorded
‘hand’ of a previous game with the same picture
• Cheating
– Players could cheat in a number of ways, including
agreeing on labels / playing against themselves
– A number of mechanisms are in place against those
cases
• Selecting images
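The fallback to pre-recorded play can be pictured as follows. This is a simplified Python sketch: a real replay would also need the timing of the recorded guesses, and `choose_partner` and the shape of `recordings` are invented for illustration.

```python
import random

def choose_partner(waiting_players, recordings, image_id, rng=random):
    """Pair with a live player if one is waiting, else replay a recording.

    `recordings` maps image_id -> list of guess lists from earlier games.
    Returns ("live", player), ("recorded", guesses) or (None, None).
    """
    if waiting_players:
        return "live", waiting_players.pop(0)
    past_hands = recordings.get(image_id, [])
    if past_hands:
        return "recorded", rng.choice(past_hands)
    return None, None

recordings = {"img1": [["dog", "grass"]]}
print(choose_partner(["bob"], recordings, "img1"))  # ('live', 'bob')
print(choose_partner([], recordings, "img1"))       # ('recorded', ['dog', 'grass'])
```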
SOME STATISTICS
• In the 4 months between August 9th 2003 and
December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
• By 2008:
– 200,000 players
– 50 million labels
ANALYSIS
• The numbers indicate that the game is fun to
play
• Exciting factors:
– Playing with a partner
– Playing against time
QUALITY OF THE LABELS
• For IMAGE SEARCH:
– choose 10 labels among those produced and look at which
images are returned
• Compare labels produced by players with labels
produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more
than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these
labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful
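The 83% figure is an overlap measure: the share of game labels that experiment participants also produced. A small Python sketch of that computation, using made-up labels rather than data from the study:

```python
def share_also_produced(game_labels, participant_labels):
    """Fraction of game labels that participants also produced."""
    if not game_labels:
        return 0.0
    participant_set = {label.lower() for label in participant_labels}
    hits = sum(1 for label in game_labels if label.lower() in participant_set)
    return hits / len(game_labels)

# Illustrative labels only, not taken from the evaluation.
game = ["dog", "grass", "ball", "sky"]
participants = ["Dog", "ball", "grass", "tree"]
print(share_also_produced(game, participants))  # 0.75
```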
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
VERBOSITY
• … or, the game approach to collecting
commonsense knowledge
• Motivation: slow progress both on CYC (5
million facts collected) and on Open Mind
Commonsense (around 700,000 facts)
THE GAME
• Based on an existing game, TABOO:
– Players have to guess a word
– One of the players gives hints concerning the word
• In Verbosity, you have two players, the
DESCRIBER and the GUESSER, and a SECRET
WORD
THE GAME
TEMPLATES IN VERBOSITY
• As in Open Mind Commonsense, templates
are used to ensure that the relations /
properties of interest are collected
• The Describer produces hints by filling in a
template
GUESSING ATTRIBUTES
PRODUCING A DESCRIPTION
TEMPLATES
• _ is a kind of _
• _ is used for _
• _ is typically near/in/on _
• _ is the opposite of _ / _ is related to _
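Each template pairs a relation with a sentence frame: the Guesser sees the frame with the secret word blanked out, while the stored fact has the word filled in. A minimal Python sketch covering some of the templates; the relation keys and function names are hypothetical.

```python
# Sentence frames from the slide, keyed by a hypothetical relation name.
TEMPLATES = {
    "kind_of": "{word} is a kind of {filler}",
    "used_for": "{word} is used for {filler}",
    "opposite_of": "{word} is the opposite of {filler}",
    "related_to": "{word} is related to {filler}",
}

def hint_for_guesser(relation, filler):
    """What the Guesser sees: the secret word stays blanked out."""
    return TEMPLATES[relation].format(word="_", filler=filler)

def fact_from_hint(relation, secret_word, filler):
    """What can be stored: the relation with the secret word filled in."""
    return (relation, secret_word, filler)

print(hint_for_guesser("used_for", "drinking"))        # _ is used for drinking
print(fact_from_hint("used_for", "cup", "drinking"))   # ('used_for', 'cup', 'drinking')
```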
EMULATION
• As in the ESP game, pre-recorded games are used
when a player cannot be paired with another
player
• The asymmetry of the game causes a problem
not encountered in the ESP game
– Describer: can just repeat behavior of previous
describer
– Guesser: not so easy
RESULTS
• The only published results I’m aware of predate
the actual release of the game, so I don’t know
about the QUANTITY
• Quality:
– Ask six raters whether 200 facts collected using
Verbosity are ‘true’
– Around 85% success
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks :
– Find The Culprit (Annotation)
User must identify the closest
antecedent of a markable if it is
anaphoric
– Detectives Conference (Validation)
User must agree/disagree with a
coreference relation entered by
another user
NAME THE CULPRIT
READINGS
• V. Nastase & M. Strube, Transforming Wikipedia into a large
scale multilingual concept network, Artificial Intelligence,
2012
• C. Havasi, J. Pustejovsky, R. Speer and H. Lieberman, Digital
Intuition: Applying Common Sense Using Dimensionality
Reduction, IEEE Intelligent Systems, 2009
• L. von Ahn and L. Dabbish (2008). Designing games with a
purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo, & Ducceschi, 2013.
Phrase Detectives: Utilizing Collective Intelligence for
Internet-Scale Language Resource Creation. ACM Transactions on
Interactive Intelligent Systems