SEASR.longer-intro - University of Illinois Urbana


SEASR – Software Environment for the
Advancement of Scholarly Research
Overview
University of Illinois
June 2007
Michael Welge, Loretta Auvil, John Unsworth
Data Intensive Technologies and Applications/IGB
Automated Learning Group, and GSLIS
University of Illinois, Urbana-Champaign
Structured Data “Rush”
D2K- Framework for Data Analysis
• Provides scalable environment from the Desktop to Web Services to Grid Services
• Employs a visual programming system for data/work flow paradigm
• Provides capability to build custom applications
• Provides capability to access data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Integrated environment for models and visualization
• Supports an extensible interface for creating one’s own algorithms
• Provides access to distributed computing capabilities
D2K Components
• D2K Infrastructure
– Itinerary Execution engine
• D2K-Driven Applications
– Applications that make use of the D2K Infrastructure
– Toolkit is a D2K-Driven app
• D2K Server
– Special kind of D2K-Driven app
– Wraps the infrastructure to provide remote itinerary and module execution
– Used by the Toolkit to distribute module execution
• D2K Web Service
– Provides a generic programmatic interface for executing itineraries
– Communicates with D2K Servers over socket connections using D2K-specific protocols
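The components above revolve around itineraries: data flows of modules run by an execution engine. As a rough illustration only, here is a minimal Python sketch of that idea. D2K is its own framework and nothing below is its API; the `Itinerary` class, `add_module`, `execute`, and the demo modules are all hypothetical names chosen for this sketch. Modules are plain functions, the itinerary records which module feeds which, and the engine runs each module after its upstream modules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Itinerary:
    """Hypothetical stand-in for an itinerary: named modules plus their wiring."""

    def __init__(self):
        self.modules = {}  # module name -> callable
        self.inputs = {}   # module name -> list of upstream module names

    def add_module(self, name, func, inputs=()):
        self.modules[name] = func
        self.inputs[name] = list(inputs)

    def execute(self):
        """Run every module once, after all of its upstream modules have run."""
        order = TopologicalSorter(self.inputs).static_order()
        results = {}
        for name in order:
            args = [results[upstream] for upstream in self.inputs[name]]
            results[name] = self.modules[name](*args)
        return results


def build_demo_itinerary():
    """Three illustrative modules: load -> scale -> summarize."""
    itinerary = Itinerary()
    itinerary.add_module("load", lambda: [3.0, 1.0, 2.0])
    itinerary.add_module(
        "scale", lambda xs: [x / max(xs) for x in xs], inputs=["load"]
    )
    itinerary.add_module(
        "summarize", lambda xs: {"n": len(xs), "max": max(xs)}, inputs=["scale"]
    )
    return itinerary


if __name__ == "__main__":
    print(build_demo_itinerary().execute()["summarize"])
```

A server or web service layer, as described above, would wrap a call like `execute()` behind a remote interface; that part is left out of the sketch.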
Creating Customer Value
Prediction
Industrial Manufacturer
Computed customer buying propensities
Achieved 25% conquest customer sales lift by executing directed
cross/upsell resulting in $65 million in incremental revenue
Discovery
Automotive manufacturer
Identified patterns of inappropriate warranty work in dealer channel
Targeted $200M+ of potentially unnecessary annual expense
Monitoring
Department store retailer
Watched POS transaction flow for unusual variations
Deterred inappropriate behavior and fraudulent transactions
Resulted in savings of over $125 million
Applications Examples
Comparative Genomics
Harris A. Lewin explains that Evolution Highway allows one to look “... at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA.”
Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005
Music Analysis
J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004
Astronomy
Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497–509, 2006
Research, Development, &
Technology Transfer Model
SEASR: The Data Problem
Structured vs. Unstructured
Structured Data: 20%
Unstructured Data: 80%
• Today, 80% of business is conducted on unstructured information – Gartner Group
• 80% of the information needed is in the Open Source – NIA
• Workers spend 80% of the time gathering information – STIC, EMF
[Figure: growth of information over time, measured in GIGABYTES]
Cave paintings, Bone tools 40,000 BCE
Writing 3500 BCE
0 C.E.
Paper 105
Printing 1450
Electricity, Telephone 1870
Transistor 1947
Computing 1950
Internet (DARPA) Late 1960s
The Web 1993
1999
Source: www.fastsearch.com
Unstructured Data “Rush”
[Figure: shift from structured to unstructured information, from -15 Years to Today, with The Internet]
Database
Semi-structured Information: Doc Mgt / XML
Unstructured Information: email / Word / HTML / PDF / etc
• Today, 80% of business is conducted on unstructured information – Gartner Group
• 80% of the information needed is in the Open Source – NIA
• Workers spend 80% of the time gathering information – STIC, EMF
The issue is getting worse...
Other forms of Unstructured Information: Multiple Devices + Voice + Video
Now Affecting every Industry Sector
Hail SEASR!
Software Environment for the Advancement of Scholarly Research (SEASR)
– addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world.
– aims to make collections more useful by integrating two well-known research and development frameworks, NCSA’s Data-To-Knowledge (D2K) and IBM’s Unstructured Information Management Architecture (UIMA), into an easily usable environment that researchers in any discipline can readily learn and adapt for their own unstructured data analysis.
UIMA Lineage
• Developed over 5 years
– Funded by DARPA (GALE)
– Companies: BBN, MITRE, SAIC
– Universities: Carnegie Mellon, Columbia, UMass/Amherst
– 100 Developers from IBM World Wide
• UIMA Enables ….
– Part of Speech Detectors
– Document Structure Detectors
– Tokenizers, Parsers, Translators
– Named-Entity Detectors
– Sentiment Detectors
– Relationship Detectors
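The detectors listed above can be pictured as annotators that each add typed spans to a shared document. The Python sketch below illustrates only that general pattern; it is not UIMA's API, and `Document`, `Annotation`, `tokenizer`, `toy_entity_detector`, and `run_pipeline` are hypothetical names. The "named-entity detector" is a deliberately crude heuristic (runs of capitalized words), standing in for a real model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "token" or "entity"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, kind):
        """Return the text covered by every annotation of the given kind."""
        return [self.text[a.start:a.end] for a in self.annotations if a.kind == kind]


def tokenizer(doc):
    """Annotate each run of word characters as a token."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end()))


def toy_entity_detector(doc):
    """Toy heuristic: treat each run of capitalized words as one entity."""
    for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", doc.text):
        doc.annotations.append(Annotation("entity", m.start(), m.end()))


def run_pipeline(text, annotators):
    """Apply each annotator in turn to one shared document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


if __name__ == "__main__":
    doc = run_pipeline(
        "Lincoln wrote to Horace Greeley in 1862.",
        [tokenizer, toy_entity_detector],
    )
    print(doc.covered("entity"))
```

Keeping annotations as offsets into the original text, rather than copies of it, lets later annotators (sentiment, relationships) build on earlier ones without altering the document.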
SEASR: Architecture
SEASR’s advanced informatics tools will expand the technical capabilities of what is now available in the field by:
• connecting data sources that are currently incompatible, whether due to different formats or protocols
• offering all project components as open source, to enable users to modify and add to tools
• allowing users to write analytic engines in their programming language of choice
• installing on all hardware footprints, so that the tools can be brought to data sets where they are housed
• creating a repository for components that will support sharing and publishing among users
• enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters
SEASR: Research, Development, &
Technology Transfer Model
Research Areas
• Focused Data Retrieval and Data Integration: Given a target topic (Iran’s nuclear program) or an entity (University of Tehran), how do we locate, retrieve, and integrate all relevant data, both structured (databases) and observational (sensory data, textual data, image data)?
• Semantic Data Enrichment: How to handle the overwhelming array of different data formats, how to understand the layout of data and infer metadata for a variety of text sources and images, and how to infer semantic markup and construct/augment knowledge bases.
• Entity and Relationship Discovery: How can we match ambiguous mentions of entities across both structured data and text? How do we discover relationships among entities? How do we relate newly collected data to existing knowledge bases?
IACAT, Dan Roth and Jiawei Han, CS, UIUC
Research Areas
• Knowledge Discovery and Hypotheses Generation: How to exploit the rich semantic structure generated by identifying entities and relationships among them to promote knowledge discovery and to generate hypotheses that emerge from “surprising” correlations or structural events?
• Intelligent Human-Computer Interactions for Information Access: How to devise effective interaction models and interfaces for accessing multimodal data, interactive annotation and discovery models, and support hypotheses suggestion and verification?
• Mathematical and Computational Foundations: The research described above builds on our team members’ work on key mathematical and algorithmic questions underlying progress in the Data Sciences, and serves to motivate further theoretical questions.
IACAT, Dan Roth and Jiawei Han, CS, UIUC
Getting the “Band” Together
• June 2007 – Band formation
– Project start date
– More use ideas and framework discussions
• December – First “gig”
– Framework and data app demonstration
• Vocals - Research Technology
– John Unsworth, Stephen Downie, Tim Wentling
– Dan Roth, Jiawei Han, Kevin Chang, Cheng Xiang Zhai
• Percussion & Bass - SEASR Development
– Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students
• Lead – Designers/Developer/Applications Areas
– Humanities – M2K, Nora/Monk and Others (we heard about yesterday/today)
• Need Groupies! (Advisors, Researchers, Developers, and Application
Drivers) – Loretta Auvil
SEASR: How can I participate?
• Collaborate on application
development or ontology creation
• Contribute to component
development for analytics or data
access
• Participate in visualization and UI
design
• Serve as an advisor
Contact Loretta Auvil ([email protected])
SEASR
Engineering Knowledge for the Humanities
Thank You
Lincoln Papers Project
• A Model for Digital Humanities Scholarship
• Collaboration between I-CHASS and the Lincoln Presidential Museum and Library in Springfield, Illinois
– UIUC permanent home of digital archive of all Lincoln materials held by Lincoln Library
• Opportunities for Discovery from the Lincoln Papers
– Provides ability to explore many technologies of interest to humanities scholars, including:
  • Digitization and OCR
  • Information Extraction and Analysis of text- and image-based information
  • Social Networking Tools
  • Geo-spatial Analysis
– Solutions can be transferred to other digital collections, such as Founding Fathers’ Papers
Vernon Burton, History Department, UIUC
Research Project and Consequence
of Digital Analysis
• Scholar interested in development of Lincoln’s concept of “Liberty”
– Data extraction tools identify all instances of “liberty” and related concepts
– Social networking tools trace with whom, when, and how frequently Lincoln corresponded on the subject
– Geo-spatial analysis can reveal regional differences in support for emancipation
• Scholar is able to
– Easily identify and retrieve all key materials from a collection numbering hundreds of thousands--or even millions--of documents
– Gain insight into development and strength of Lincoln’s commitment to emancipation
– Identify key correspondents--some of whom might have previously been overlooked--who helped shape Lincoln’s public policy
Vernon Burton, History Department, UIUC
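The first two steps of this scenario (finding mentions of "liberty" and related concepts, then tracing with whom Lincoln corresponded on the subject) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's tooling: the term list, the record layout, and the function names are hypothetical, and the sample letters are made-up placeholders rather than text from the Lincoln Papers.

```python
import re
from collections import Counter

# Assumed list of related concepts; a real study would curate this carefully.
LIBERTY_TERMS = ("liberty", "freedom", "emancipation")
PATTERN = re.compile(r"\b(" + "|".join(LIBERTY_TERMS) + r")\b", re.IGNORECASE)


def mentions(text):
    """Return every matched term in the text, lowercased, in order."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]


def correspondents_by_mentions(letters):
    """Count concept mentions per recipient across a list of letter records."""
    counts = Counter()
    for letter in letters:
        n = len(mentions(letter["text"]))
        if n:
            counts[letter["recipient"]] += n
    return counts


# Placeholder records for illustration only (not real archive data).
SAMPLE_LETTERS = [
    {"recipient": "Correspondent A", "year": 1858,
     "text": "Liberty and union; liberty above all."},
    {"recipient": "Correspondent B", "year": 1862,
     "text": "On emancipation and the war."},
    {"recipient": "Correspondent A", "year": 1863,
     "text": "A note about supplies."},
]

if __name__ == "__main__":
    for name, n in correspondents_by_mentions(SAMPLE_LETTERS).most_common():
        print(name, n)
```

Grouping the same counts by `year` or by a place field would give the "when" and the geo-spatial views mentioned above.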
Other Example Research
• Voice mining (DH 2006 Poster)
– Scholar is interested in development of models that can analyze characters’ utterances in plays.
– Scholar is able to construct analytical models that can successfully identify the socio-economic class or status of the character who uttered a given line of play text.
• Criticism mining (DH 2006)
– Scholar is interested in development of tools that can automatically analyze critical reviews on humanities objects.
– Scholar is able to easily construct text categorization models that predict positive and negative reviews; predict the genre of the work being reviewed; and differentiate fiction and non-fiction book reviews.
• Differentiating Editorial and Customer Critiques of Cultural Objects Using Text Mining (DH 2007)
– Scholar is interested in development of tools that can automatically differentiate critiques written by scholars and professional editors versus ordinary readers.
– Scholar is able to use text mining tools to differentiate these two kinds of critiques as well as to see what features make them different.
J. Stephen Downie, GSLIS, UIUC
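The text categorization models mentioned for criticism mining could take many forms; the slides do not say which were used. As one common baseline, here is a minimal multinomial Naive Bayes classifier with add-one smoothing in plain Python. The class name, the whitespace tokenizer, and the four training reviews are assumptions made up for this sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training reviews for illustration only.
TRAIN_TEXTS = [
    "a moving and brilliant novel",
    "brilliant prose and a gripping plot",
    "dull plot and flat characters",
    "a tedious and dull novel",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(model.predict("a brilliant and gripping novel"))
```

The same class works unchanged for the other tasks named above (genre, fiction versus non-fiction, editorial versus customer critiques) by swapping the labels; inspecting per-class word counts gives a first look at which features differ.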
Conceptual Analytical Architecture
SEASR Architecture
Structured Data for Analysis
• Low Volume Data
– Wire services
– Call Detail Records
– Phone directories
– Badge access tracking
– Customer lists
– Account histories
– Supplier network data
– Biometric access data
• High Volume Data
– Stock transactions
– Web pages
– News Wire feeds
– Audits
– CRM databases
– Web access logs
– Net logs
– Mutual Fund validation
– Credit/Debit transactions
– RFID tracking logs
Unstructured Data for Analysis
• Low Volume Data
– Email, Chat and IM
– Internal documents
– Call Center data logs
– Pager data
– External reports and data
– Publicly accessible records
– Calendars
– RF monitoring
– Print stream monitoring
• High Volume Data
– VOIP phone calls
– Broadcast media
– Web cam data
– Deep web crawl data
– Surveillance cameras
– Videoconferences
– Voice mail
– Satellite data
Current SEASR Team
• PI: Michael Welge, NCSA
• Co-PI: John Unsworth, GSLIS; Loretta Auvil, NCSA
• Technical Lead: Duane Searsmith, NCSA
• Use Cases and Communities Involvement: Loretta Auvil, NCSA
• Usability Evaluator: Tara Bazler, Indiana University
• Software and Application Developers
– Bernie Acs, NCSA
– Vered Goren, NCSA
– Amit Kumar, NCSA
– Xavier Llora, NCSA
– Mary Pietrowicz, NCSA
– Andrew Shirk, NCSA
– David Tcheng, NCSA
• Humanities Domain and Communications Consultant
– Kelly Searsmith, NCSA
• Community Advisors
– Tim Cole, Mathematics Librarian, UIUC