KnowItNow: Fast, Scalable Information Extraction from the Web

Download Report

Transcript KnowItNow: Fast, Scalable Information Extraction from the Web

KnowItNow: Fast, Scalable Information
Extraction from the Web
Michael J. Cafarella, Doug
Downey, Stephen Soderland,
Oren Etzioni
The Problem

Numerous NLP applications rely on searchengine queries to:
–
–

Extract information from the web.
Compute statistics over the Web corpus.
Search engines are extremely helpful for several
linguistic tasks such as:
–
–
Computing usage statistics.
Finding a subset of web documents to analyze in
depth.
Problem With Search Engines

Search engines were not designed as building blocks for NLP
applications. As a result:
–
An NLP application is forced to issue literally millions of
queries to search engines; increasing processing time and
limiting scalability.
–
Fetching web documents is also time-consuming.
–
Search engines are limiting the use of programmatic
queries to their engines


Google has placed hard quotas on the number of daily queries a
program can issue.
Other engines force applications to introduce “courtesy waits” between
queries.
Example of the Problem “KnowItAll”

KnowItAll works in a generate-and-test
architecture extracting Information in 2 stages:
–
–
First, it Utilizes a small set of domain independent
extraction patterns to generate candidate facts.
Second, it automatically tests the plausibility of
the candidate facts it extracts using pointwise
mutual information (PMI) statistics computed from
search-engine hit counts.
1st Stage in KnowItAll


Take the generic pattern “NP1 such as
NPList2”.
This indicates that the head of each simple
noun phrase (NP) in NPList2 is a member of
the class named in NP1.
–
–
Take as example the pattern for class City, and
the sentence “We provide tours to cities such as
Paris, London, and Berlin.”
KNOWITALL extracts three candidate cities from
the sentence: Paris, London, Berlin.
2nd Stage in KnowItAll




KnowItAll needs to assess the likelihood of
the information it found.
Verify that Paris is actually a city.
It does that by computing the PMI between
Paris and a set of k discriminator phrases
that tend to have high mutual information
with city names. (Paris is a city)
This requires at least k search-engine
queries for every candidate extraction!
The Solution


A novel architecture for Information
Extraction which does not depend on Web
search-engine queries; KnowItNow.
Works over 2 stages like KnowItAll:
–
–
Uses a specialized search engine called the Binding Engine
(or BE) which efficiently returns bindings in response to
variabilized queries.
Uses URNS, a combinatorial model, which estimates the
probability that each extraction is correct without using any
additional search engine queries
The Binding Engine vs.
The Traditional Engine
The Traditional Engine



Take the search query (“Cities such as
<NounPhrase>”).
Perform a traditional search engine query.
For each such URL:
–
–
–
–
obtain the document contents.
find the searched-for terms in the document text.
Run the noun phrase recognizer to determine if
text found satisfies the linguistic type requirement
If it does, return the string.
Problems With Traditional Engine

The search itself doesn’t take a long time. Even if
there are multiple search queries

The second stage fetches a large number of
documents, each fetch likely resulting in a random
disk seek; this stage executes slowly.

this disk access is slow regardless of whether it
happens on a locally-cached copy or on a remote
document server.
The Binding Engine
Why
not use a table to store a list of terms and
documents containing them?!

The Binding Engine supports these queries:
–
–
–
Typed variables (such as NounPhrase)
String-processing functions (such as “head(X)” or
“ProperNoun(X)”).
Standard query terms.
It processes a variable by returning every possible string in the
corpus that has a matching type, and that can be substituted for the
variable and still satisfy the user's query.
How the Binding Engine Works?


It uses a novel approach called the
“neighborhood index”
The neighborhood index is an augmented
inverted index structure.
–
–
For each term in the corpus, the index keeps a list
of documents in which the term appears and a list
of positions where the term occurs.
The index also keeps a list of left-hand and righthand neighbors at each position. (Adjacent text
strings that satisfy a recognizer, e.g. NounPhrase)
How is The Binding Engine Better?
K
is the number of concrete terms in the query.
B is the number of variable bindings found in the corpus.
N is the number of documents in the corpus.
Expensive processing such as part-of-speech tagging or shallow
syntactic parsing is performed only once, while building the index,
and is not needed at query time.
How is The Binding Engine Better?




Average time to return the relevant bindings
in response to a set of queries.
0.06 CPU minutes for BE.
8.16 CPU minutes for Nutch (Private search engine)
Disadvantages of The Binding Engine

It consumes a large amount of disk space, as
parts of the corpus text are folded into the
index several times.

The neighborhood index increased disk
space four times that of a standard inverted
index
The URNS Model

We need a way to test that the extractions
from the Binding Engine are correct

KnowItAll issues queries to search engines
and uses the PMI model to verify extractions.

PMI is very efficient but it is also very slow.
How URNS works?

URNS is a probabilistic model
–
It takes the form of a classic “balls-andurns” model from combinatorics.

Each extraction is modeled as a
labeled ball in an urn.

A label represents either an instance of
the target class or relation, or
represents an error
How URNS works?




C - the set of unique target labels; |C| is the number
of unique target labels in the urn.
E - the set of unique error labels; |E| is the number of
unique error labels in the urn.
num(b) - the function giving the number of balls
labeled by b where b is a subset of C U E.
num(B) is the multi-set giving the number of balls for
each label b, where b is a subset of B.
How URNS works?

The goal of an IE system is to discern which of
the labels it extracts are in fact elements of C.
–
Given that a particular label x was extracted k times
in a set of n draws from the urn, what is the
probability that x is a subset of C?
Alternative to URNS

Items that were extracted more often are
more likely to be true.
–
i.e. Extractions with higher frequencies are true.
Experiments



Recall: how many distinct extractions does
each system return at high precision?
Time: how long did each system take to
produce and rank its extractions?
Extraction Rate: how many distinct high
quality extractions does the system return
per minute? The extraction rate is simply
recall divided by time.
KnowItNow vs. KnowItAll
Tested on relation “Country”
KnowItNow vs. KnowItAll
Tested on relation “CapitalOf”
KnowItNow vs. KnowItAll
Tested on relation “Corp”
KnowItNow vs. KnowItAll
Tested on relation “CeoOf”
KnowItNow vs. KnowItAll
Contributions



A novel architecture for Information
Extraction which does not depend on Web
search-engine queries.
Extract tens of thousands of facts from the
Web in minutes instead of days.
KnowItNow's extraction rate is two to three
orders of magnitude greater than KnowItAll's.