2003 Paula Matuszek
CSC 9010: AeroText, Ontologies, AeroDAML
Dr. Paula Matuszek
(610) 270-6851
©2003 Paula Matuszek
AeroText
Information Extraction tool marketed by
Lockheed Martin
Capabilities similar to GATE
Much better developed IDE
Less open to extensions of the system itself.
Equally steep learning curve for effective use!
Lockheed AeroText General Overview
Lockheed AeroText White Paper
AeroText Demo
Ontologies
Information Extraction requires modeling extensive
domain knowledge
Other applications of text mining, such as
document categorization, can also use domain
information
In modeling such knowledge we often create an
ontology: An explicit formal specification of how to
represent the objects, concepts, and other entities
that are assumed to exist in some area of interest
and the relationships that hold among them.
A Simple Ontology: Birthdates
Objects, concepts, entities:
– Months, days, years
– dates
– first names
– last names
– persons
– birthdates
Relationships between them
– a date has exactly one month, day, year
– a birthdate is a date
– a person has at least 1 first name and exactly 1 last name
– a person has a birthdate
– a birthdate has a person
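The birthdate ontology above can be written down as a few Python classes, which makes the cardinality constraints concrete. This is only a minimal sketch: the class names, field names, and the helper persons_born_on are assumptions made for illustration, not part of any ontology language or tool mentioned in these slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Date:
    """A date has exactly one month, day, and year."""
    month: int
    day: int
    year: int

    def __post_init__(self) -> None:
        # Coarse range checks only; this sketch does not validate
        # month lengths or leap years.
        if not 1 <= self.month <= 12:
            raise ValueError("month must be between 1 and 12")
        if not 1 <= self.day <= 31:
            raise ValueError("day must be between 1 and 31")


class Birthdate(Date):
    """A birthdate is a date (the is-a relationship, via subclassing)."""


@dataclass
class Person:
    """A person has at least one first name, exactly one last name,
    and a birthdate."""
    first_names: List[str]
    last_name: str
    birthdate: Birthdate

    def __post_init__(self) -> None:
        if len(self.first_names) < 1:
            raise ValueError("a person needs at least one first name")


def persons_born_on(people: List[Person], birthdate: Birthdate) -> List[Person]:
    """Hypothetical helper for the inverse relationship:
    a birthdate has a person (possibly several)."""
    return [p for p in people if p.birthdate == birthdate]
```

A usage example: Person(["Ada"], "Lovelace", Birthdate(12, 10, 1815)) is accepted, while Person([], "Lovelace", Birthdate(12, 10, 1815)) raises ValueError because the "at least 1 first name" constraint is violated.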
Who and Why?
Many groups are developing ontologies:
– standardize terms and vocabulary
– facilitate the semantic web
– improve information integration
– interested in the domain itself
Some ontologies under development
– Cyc
– GO (Gene ontology)
– UMLS (Unified Medical Language System)
– CIA World Factbook
DAML
DARPA Agent Markup Language
A language for describing ontologies
Example: an ontology for dates
Extensive information available at
www.daml.org.
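To give a feel for what DAML markup looks like, the sketch below uses Python's standard xml.etree.ElementTree to emit a tiny RDF/XML fragment declaring a Date class, a Birthdate subclass, and three properties. It is not the linked dates example. The namespace URI is assumed to be the March 2001 DAML+OIL release and should be checked against www.daml.org; the class and property names are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
# Assumption: the March 2001 DAML+OIL namespace.
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Ask ElementTree to serialize with readable prefixes.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("daml", DAML)):
    ET.register_namespace(prefix, uri)


def build_date_ontology() -> str:
    """Return a small DAML-style ontology as an RDF/XML string."""
    root = ET.Element(f"{{{RDF}}}RDF")

    # Two classes: Date, and Birthdate as a subclass of Date.
    ET.SubElement(root, f"{{{DAML}}}Class", {f"{{{RDF}}}ID": "Date"})
    birthdate = ET.SubElement(
        root, f"{{{DAML}}}Class", {f"{{{RDF}}}ID": "Birthdate"}
    )
    ET.SubElement(
        birthdate, f"{{{RDFS}}}subClassOf", {f"{{{RDF}}}resource": "#Date"}
    )

    # Illustrative properties for the parts of a date.
    for name in ("month", "day", "year"):
        ET.SubElement(
            root, f"{{{DAML}}}DatatypeProperty", {f"{{{RDF}}}ID": name}
        )

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_date_ontology())
```

A real ontology would also state domains, ranges, and cardinality restrictions for each property; those are omitted here to keep the fragment short.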
UBOT
UML Based Ontology Toolkit
Part of a DARPA project to automatically
mark up web pages to make them
The purpose of DAML is to annotate
information on the web to make it
machine-readable so that software
agents can interpret it and reason with it:
the semantic web
http://ubot.lockheedmartin.com/ubot/intro/index.html
AeroDAML
AeroDAML is a web service that takes a
web page as an input and generates
DAML markup.
Uses AeroText as the underlying
extraction tool.
Works with various ontologies.
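The general idea of text in, ontology-based markup out can be illustrated with a toy example. The sketch below is not how AeroText or AeroDAML works internally, which these slides do not describe. It assumes a single hand-written regular expression, a made-up hasBirthdate relation, and an XML-like output that only imitates the shape of semantic markup.

```python
import re
from typing import List, Tuple

# (subject, relation, object); the relation name is hypothetical.
Triple = Tuple[str, str, str]

# Toy pattern: "<First> <Last> was born on <Month> <day>, <year>"
BIRTH_PATTERN = re.compile(
    r"(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+) was born on "
    r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})"
)


def extract_birthdates(text: str) -> List[Triple]:
    """Find person/birthdate mentions and return them as triples."""
    triples = []
    for m in BIRTH_PATTERN.finditer(text):
        person = f"{m['first']}_{m['last']}"
        date = f"{m['month']} {m['day']}, {m['year']}"
        triples.append((person, "hasBirthdate", date))
    return triples


def to_markup(triples: List[Triple]) -> str:
    """Render triples as XML-like annotations (illustrative only,
    not real DAML)."""
    return "\n".join(
        f'<Person ID="{s}"><{p}>{o}</{p}></Person>' for s, p, o in triples
    )


if __name__ == "__main__":
    sample = "Grace Hopper was born on December 9, 1906."
    print(to_markup(extract_birthdates(sample)))
```

A real extraction system replaces the single pattern with large rule sets and domain knowledge, which is where the ontology comes in: it determines which entity types and relations the rules are allowed to emit.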
Paper describing system
Lab: try out AeroDAML
AeroDAML page
• Choose a news page (www.phillynews.com, Google
News, ...) and tag it with the Cyc and CIA ontologies.
• How well did each ontology do at picking up content? Did
they miss things they should have found? Was anything
tagged incorrectly?
• Repeat for one of your domain-specific documents, or
a web page in a specific area. Try a different ontology if
you think one of the others might be more interesting.
• How was the annotation different?
• Are we enabling the semantic web?
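One way to make the "missed things" and "tagged incorrectly" questions concrete is to compare the tags a tool produced against a list you mark up by hand. The sketch below assumes each annotation is a (text, tag) pair and scores the comparison with precision and recall. The function name and the sample tags are invented for illustration and do not come from AeroDAML output.

```python
from typing import Set, Tuple

Annotation = Tuple[str, str]  # (text span, tag); format is an assumption


def score_annotations(found: Set[Annotation], expected: Set[Annotation]):
    """Compare tool output against a hand-made list of expected tags.

    Returns (precision, recall, missed, wrong), where
    missed = expected annotations the tool did not produce and
    wrong  = annotations the tool produced that were not expected.
    """
    correct = found & expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall, expected - found, found - expected


if __name__ == "__main__":
    # Invented sample data for a sports news page.
    found = {("Philadelphia", "City"), ("Eagles", "Bird")}
    expected = {
        ("Philadelphia", "City"),
        ("Eagles", "SportsTeam"),
        ("Andy Reid", "Person"),
    }
    precision, recall, missed, wrong = score_annotations(found, expected)
    print(f"precision={precision:.2f} recall={recall:.2f}")
    print("missed:", sorted(missed))
    print("wrong:", sorted(wrong))
```

Running the same comparison for each ontology on the same page gives a rough basis for saying how the annotations differ.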