A Portal for Access to Complex Distributed Information about Energy

Jose Luis Ambite, Yigal Arens, Eduard H. Hovy, Andrew Philpot
DGRC, Information Sciences Institute, University of Southern California

Walter Bourne, Peter T. Davis, Steven Feiner, Judith L. Klavans, Samuel Popper, Ken Ross, Ju-Ling Shih, Peter Sommer, Surabhan Temiyabutr, Laura Zadoff
DGRC, Columbia University
The Vision: Ask the Government...

"We're thinking of moving to Denver... What are the schools like there?"
"How have property values in the area changed over the past decade?"
"Is there an orchestra? An art gallery? How far are the nightclubs?"
"How many people had breast cancer in the area over the past 30 years?"

[Figure: citizens' questions directed at government data sources such as Census and Labor Stats]
The Energy Data Collection project
• EDC research team
  – Information Sciences Institute, USC
  – Dept. of CS, Columbia University
• Government partners
  – Energy Information Admin. (EIA)
  – Bureau of Labor Statistics (BLS)
  – Census Bureau
• Research challenge
  – Make the contents of thousands of data sets, represented in many different formats (webpages, PDF, MS Access, Excel, text…), accessible in a standardized way
[Architecture diagram: heterogeneous data sources (EPA, Census, Labor, EIA, Trade) and terminology sources feed Data Access and Query Processing (data integration), Metadata and Terminology Management (concept ontology), the Information Access User Interface, and Interface Design and Task-based Evaluation (user evaluation)]
Data access using SIMS (Ambite et al., ISI)
• 'Hide' the details of the data sources from the user (sketched below):
  1. 'Wrap' each source in software that handles access to its data
  2. Record the types of information in each source in a 'Source Model'
  3. Arrange all source models together in the same space: the Domain Model
• The SIMS data access planner transforms the user's request into individual access queries
• SIMS extracts the right data from the appropriate sources
• Current databases and models:
  – Databases: 58,000+ series (EIA OGIRS and others)
  – Webpages: 60+ (BLS, CEC tables)
  – SENSUS ontology: 90,000 nodes (from ISI's NLP technology)
  – Domain model: 500 nodes (manual; for the database access planner)
  – LKB: 6,000 nodes (NL term/info extraction from glossaries)
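As a concrete illustration of the wrapper / source-model / mediator pattern above, here is a minimal Python sketch. The class names, the toy concepts, and the routing logic are illustrative assumptions, not the actual SIMS implementation.

```python
# Minimal sketch of a SIMS-style wrapper/source-model/mediator pattern.
# Names and routing are illustrative, not the real SIMS API.

from dataclasses import dataclass


@dataclass
class SourceModel:
    """Records which types of information a wrapped source can supply."""
    name: str
    concepts: set


class Wrapper:
    """Hides source-specific access details behind a uniform query() call."""
    def __init__(self, model, fetch):
        self.model = model
        self._fetch = fetch  # the source-specific access routine

    def query(self, concepts):
        return self._fetch(concepts)


class Mediator:
    """Plans a user request as individual queries against matching sources."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def answer(self, requested):
        results = {}
        for w in self.wrappers:
            hits = requested & w.model.concepts  # concepts this source covers
            if hits:
                results[w.model.name] = w.query(hits)
        return results


# Toy usage: one wrapped 'EIA' source that knows about fuel prices.
eia = Wrapper(SourceModel("EIA", {"fuel_price", "year"}),
              lambda cs: {c: "..." for c in cs})
print(Mediator([eia]).answer({"fuel_price", "population"}))
```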
Data access using in-memory query processing (Ross et al., Columbia)
• How can you provide fast access to millions of data values?
  – Cache data that doesn't change much in a data warehouse
  – Create rich multidimensional index structures; keep them in memory
  – Adapt the index depending on the user's patterns of use
• Technical details (see the index sketch below):
  – Same engine for many data sets
  – Client/server parallelism
  – Branch misprediction
  – SIMD
  – Dynamic queries
  – Asynchronous work
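To make the in-memory indexing idea concrete, here is a toy Python sketch of a multidimensional index that answers slices by set intersection. The data layout and API are assumptions for illustration, not the Columbia engine itself.

```python
# Toy in-memory multidimensional index: one posting list per
# (dimension, value) pair, so slices reduce to fast set intersection.

from collections import defaultdict


class CubeIndex:
    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.rows = []
        self.postings = {d: defaultdict(set) for d in dimensions}

    def insert(self, row):
        """Store the row and index it under every dimension value."""
        rid = len(self.rows)
        self.rows.append(row)
        for d in self.dimensions:
            self.postings[d][row[d]].add(rid)

    def slice(self, **criteria):
        """Intersect posting lists, e.g. slice(region='CO', year=1999)."""
        ids = None
        for d, v in criteria.items():
            hits = self.postings[d].get(v, set())
            ids = hits if ids is None else ids & hits
        return [self.rows[i] for i in sorted(ids or [])]


# Cache slow-changing warehouse data once, then 'fly' over it interactively.
idx = CubeIndex(["region", "year"])
idx.insert({"region": "CO", "year": 1999, "price": 1.23})
idx.insert({"region": "CA", "year": 1999, "price": 1.45})
print(idx.slice(region="CO", year=1999))
```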
• Use: real-time interactive data exploration; 'fly' over the data

[Figure: the Graphical User Interface issues a Data Request to the Dynamic Query Engine and Mediator, which return Unified Results drawn from data files (e.g., PUMS), the Web, ...]
The Heart of EDC (Hovy et al., ISI)

[Figure: concepts extracted from glossaries (by GlossIT) are linked to the large SENSUS ontology via semi-automated linguistic mapping; logical mapping connects SENSUS to domain-specific ontologies (SIMS models) and their data sources]
SENSUS and DINO browser
http://edc.isi.edu:8011/dino
• Taxonomy, multiple superclass links (sketched below)
• Approx. 90,000 concepts
• Top level: Penman Upper Model (ISI)
• Body: WordNet 1.6 (Princeton), rearranged
• New information added by text mining
• Used at ISI for machine translation, text summarization, database access
(Knight et al., ISI)
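A brief Python sketch of what "multiple superclass links" means in practice: a concept may sit under several parents at once, and queries walk every parent link. The concept names below are invented examples, not actual SENSUS nodes.

```python
# Toy taxonomy permitting multiple superclass links, as in SENSUS.


class Concept:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # multiple superclasses allowed

    def ancestors(self):
        """All superclasses, walking every parent link."""
        seen = set()
        stack = list(self.parents)
        while stack:
            p = stack.pop()
            if p.name not in seen:
                seen.add(p.name)
                stack.extend(p.parents)
        return seen


thing = Concept("thing")
substance = Concept("substance", [thing])
fuel = Concept("fuel", [substance])
commodity = Concept("commodity", [thing])
# 'gasoline' sits under both the substance view and the commodity view.
gasoline = Concept("gasoline", [fuel, commodity])
print(sorted(gasoline.ancestors()))  # ['commodity', 'fuel', 'substance', 'thing']
```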
Extracting term info from online sources (Klavans et al., Columbia)
• GetGloss: given a URL, find all the glossary files
• ParseGloss: given a set of NL glossary definitions, extract and format the important information
• GetGloss:
  – Glossary identification rules consider format tags, etc. (a toy version is sketched below)
  – F-score: 0.68 (2nd after SVM at 0.92)
• ParseGloss:
  – Identify term, definition, head noun, etc.
  – Evaluation underway
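To illustrate what format-tag rules for glossary identification can look like, here is a toy Python sketch. The specific cues, weights, and threshold are invented for illustration; they are not the GetGloss rules.

```python
# Toy format-tag heuristics for spotting glossary pages, in the spirit
# of the GetGloss rules above. Cues, weights, and threshold are invented.

import re

GLOSSARY_CUES = [
    (re.compile(r"<dt>", re.I), 2.0),          # definition-list term tags
    (re.compile(r"<dd>", re.I), 2.0),          # definition-list body tags
    (re.compile(r"\bglossary\b", re.I), 3.0),  # explicit 'glossary' mention
    (re.compile(r"<b>[^<]{1,40}</b>\s*[-:]", re.I), 1.0),  # 'Term: def' layout
]


def glossary_score(html: str) -> float:
    """Sum weighted counts of formatting cues that suggest a glossary."""
    return sum(w * len(p.findall(html)) for p, w in GLOSSARY_CUES)


def looks_like_glossary(html: str, threshold: float = 5.0) -> bool:
    return glossary_score(html) >= threshold


page = "<h1>Glossary</h1><dl><dt>Btu</dt><dd>British thermal unit...</dd></dl>"
print(looks_like_glossary(page))  # True for this toy page
```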
Term-to-ontology alignment (Hovy et al., ISI)
How do we link new concepts into the ontology (or Domain Model) in the right places?
• The manual approach is expensive: N×M steps
• Approach: automatically propose links, then hand-check only the best proposals (see the sketch below)
  – Created and tested various match heuristics (NAME, DEF, TAXONOMY, DISPERSAL)
  – Tried various clustering methods: CLINK, SLINK, Ward's method, ..., and a new version of k-means (Euclidean and spherical distance measures)
  – Tested numerous parameter combinations (stemming, etc.) in the EDC and NHANES domains; see http://edc.isi.edu/alignment/
→ Results not great
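As a hedged illustration of the propose-then-hand-check idea, here is a Python sketch that scores candidate links with simple NAME and DEF overlap heuristics. The similarity measure, weights, and ontology entries are assumptions, not the actual EDC heuristics.

```python
# Sketch of NAME/DEF match heuristics for proposing alignment candidates.
# Scoring details are illustrative assumptions, not the EDC implementation.


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def propose_links(term, definition, ontology, k=3):
    """Score each concept by name and definition overlap; return top k
    proposals for a human to check."""
    scored = []
    for concept_name, concept_gloss in ontology.items():
        name_sim = jaccard(tokens(term), tokens(concept_name))        # NAME
        def_sim = jaccard(tokens(definition), tokens(concept_gloss))  # DEF
        scored.append((0.6 * name_sim + 0.4 * def_sim, concept_name))
    return sorted(scored, reverse=True)[:k]


ontology = {
    "crude oil": "unrefined petroleum extracted from the ground",
    "natural gas": "gaseous fossil fuel used for heating",
}
print(propose_links("crude petroleum",
                    "unrefined oil taken from the ground", ontology))
```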
User interface testbed (Feiner et al., Columbia)
• Menu presented as a grid of alternating rows and columns
• Ontology entry shown in a beam for the selected item
  – Located as near as possible
  – Color coding shows parental and semantic relationships
• Fisheye magnification of the region of interest (sketched below)
  – Magnified group laid out to avoid internal overlap
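For readers unfamiliar with fisheye magnification, here is a toy Python sketch of the classic focus+context distortion function. This is the standard Sarkar-Brown graphical fisheye transform, used here as an assumption about the general technique, not Columbia's actual layout code.

```python
# Toy focus+context distortion (Sarkar-Brown graphical fisheye):
# coordinates near the focus spread apart, the periphery compresses.


def fisheye(x, focus, distortion=3.0):
    """Map normalized x in [0, 1]; detail near 'focus' is magnified."""
    d = x - focus
    if d == 0:
        return x
    bound = (1.0 - focus) if d > 0 else focus  # room toward that edge
    r = abs(d) / bound                         # position within that room
    g = (distortion + 1) * r / (distortion * r + 1)
    return focus + (1 if d > 0 else -1) * g * bound


# Items near the focus (0.5) spread apart; distant items bunch together.
for x in (0.1, 0.4, 0.5, 0.6, 0.9):
    print(f"{x:.1f} -> {fisheye(x, 0.5):.2f}")
```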
AskCal: User requests in English (Philpot et al., ISI)
• ATN: 341 nodes, 14 question types (a toy fragment is sketched below)
• Automated paraphrase to confirm
• Dialogue continues via menus for detailed selection
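To show the flavor of transition-network question classification, here is a tiny Python sketch of an ATN-style walk over a question. The grammar fragment and question types below are invented; the real AskCal network has 341 nodes and 14 question types.

```python
# Tiny invented ATN fragment for classifying a question, loosely in the
# spirit of AskCal. Nothing like the full 341-node network.

ATN = {
    # state: list of (test, next_state, action)
    "START": [
        (lambda w: w in {"what", "how"}, "WH", None),
    ],
    "WH": [
        (lambda w: w == "much", "QUANTITY", "type=quantity"),
        (lambda w: w in {"is", "are", "was", "were"}, "DEFINITION",
         "type=definition"),
    ],
}

FINAL = {"QUANTITY", "DEFINITION"}


def classify(question):
    """Walk the network one word at a time; return the question type."""
    state, info = "START", {}
    for word in question.lower().rstrip("?").split():
        for test, nxt, action in ATN.get(state, []):
            if test(word):
                state = nxt
                if action:
                    key, val = action.split("=")
                    info[key] = val
                break
        if state in FINAL:
            return info.get("type")
    return None


print(classify("How much did gasoline cost in 1999?"))  # quantity
print(classify("What is a Btu?"))                       # definition
```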
Interface/usage evaluation (Sommer et al., Columbia)
Evaluation study, started late 2001
What to evaluate?
• Variables
  – Category display
  – Magnifying columns
  – Fisheye proximity & magnification
  – Searchlight
  – Synonyms
• Methods
  – Observe cognitive styles
  – Examples in other domains
• Research on content
  – Energy vs. Census domains
Task evaluation
• Process
  – Task scenario
  – Interview
  – Observation
• Goal
  – User behaviors
  – Intuitiveness of the design for different groups of users
  – Strengths and weaknesses of the design
• Participants
  – Content experts
  – Government agency workers
  – Faculty and students
Thank you!
Please come see our demos
this afternoon!