SIMS 296a-3: Current Topics in Information Access
Marti Hearst
Fall ’98
Today




Introductions
Goals and Course Requirements
Administrivia
Topics
 What is Information Access
 Current Topics (an outline)
 Intro to IA
Goals


Become expert on the state of the art in timely topics related to information access.
Begin getting research results.
Course Requirements

To get S/U credit for the class
 Lead two discussions
 Do the readings
 Attend the meetings
Course Requirements

To get a grade in the class


Do the above
Do one of the following (optionally with the help of a faculty member and/or another student):
 Write a publishable survey paper on an emerging area of information access.
 Do research that should lead to a publishable research paper on a new idea, method, analysis, or vision statement for an emerging area of information access.
 Implement and/or evaluate code to further an information access research project.
Administrivia



Sign up sheet
Readings
Other questions?
Outline



What is Information Access?
 Goals, Tasks, Types of data
Standard Information Retrieval
 Assumptions, Techniques, Evaluation
Current Topics
 Candidate topics
What is Information Access?

Information Access:
 The process by which users use
information technology to seek, organize,
and understand information.
 Focus: information expressed as text.
Information Retrieval

Task Statement
Build a system that retrieves documents that
users are likely to find relevant to their queries.

This set of assumptions underlies the field of
Information Retrieval.
Information Retrieval
Assumptions

The system has available only preexisting, “canned” text passages.

Its response is limited to selecting from
these passages and presenting them to
the user.

It must select, say, 10 or 20 passages out
of millions or billions!
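
The selection step described in these assumptions can be sketched in Python. This is only a toy illustration: the scoring function (a count of shared query terms) and the names `tokenize`, `score`, and `top_k` are assumptions made for the example, not part of any particular system.

```python
import heapq

PUNCT = '.,!?;:"()'

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def score(query, passage):
    """Toy relevance score: number of distinct query terms found in the passage."""
    return len(set(tokenize(query)) & set(tokenize(passage)))

def top_k(query, passages, k=10):
    """Return up to k passages with the highest scores, best first.

    Passages that share no term with the query are never returned.
    """
    scored = [(score(query, p), p) for p in passages]
    matching = [(s, p) for s, p in scored if s > 0]
    best = heapq.nlargest(k, matching, key=lambda pair: pair[0])
    return [p for _, p in best]

# The system only selects from canned passages; it never writes new text.
passages = [
    "Tide tables for Maui and the other Hawaiian islands.",
    "A recipe for potato latkes.",
    "Research into cures for osteoporosis.",
    "Potato farming in Idaho.",
]
results = top_k("potato latkes recipe", passages, k=2)
```

Here `results` holds the latkes recipe first and the potato farming passage second; the other two passages share no words with the query and are dropped.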
Top 10 Research Issues for IR
What do people want from IR?




By Bruce Croft, DLIB Magazine, Nov 95
Based on observations from work on public-domain systems, including:
 THOMAS
 American Memory Project (Library of
Congress)
The order of importance does not correspond
to many IR researchers’ priorities.
The same can be said for AI researchers.
Top 10 Research Issues for IR

Bruce Croft, DLIB Magazine, Nov 95. In descending order of importance:
 1. Integrated Solutions
 2. Distributed IR
 3. Efficient, Flexible Indexing and Retrieval
 4. “Magic” (Effective Vocabulary Expansion)
 5. Interfaces and Browsing
 6. Routing and Filtering
 7. Effective Retrieval
 8. Multimedia Retrieval
 9. Information Extraction
 10. Relevance Feedback

Other Issues

Mundane issues are important:
 Spelling Correction
 Fast display of initial results

Less important, but more interesting from many researchers’ points of view (Bruce Croft, DLIB Magazine, Nov 95):
 Multilingual IR
 Data Mining (in text databases)
 Text Categorization

Matching Tasks, Collections, and Search Systems

Typical WWW search is not the whole picture.
Different information needs require:
 different collections
 different search systems and strategies
Compare:
 general WWW
 newswire and magazines
 medical journal articles
Match Task and Search Type

WWW Tasks: (from www.cnet.com/Content/Reviews/Compare/Seach/ss1a.html)






Find how-to pages for Doom.
Purchase plane tickets and hotel for a trip to Java.
Find the top five all-time scoring leaders in the National Hockey League.
Find a recipe for potato latkes.
Find the tide tables for Maui.
Characteristics:
 Timely
 Specific
 Before the WWW, found with help from human agents and in well-known resources
Match Task and Search Type

Newswire & Magazine Tasks: (from the TREC collection)
 Find articles on research into cures for osteoporosis.
 Find articles on the effects of recycling of tires on the environment.
 Find information on jail and prison overcrowding and how inmates are forced to cope with those conditions.
 Find discussion of an existing or proposed insurance plan (governmental, commercial or individual) and the coverage it provides for long term care confinements in an institution.

Characteristics:
Complex combinations of topics.
 Research-oriented
 Either timely or retrospective

Match Task and Search Type

MEDLINE Tasks: (From OHSUMED, medir.ohsu.edu/pub/ohsumed)





 Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy?
 Pathophysiology and treatment of disseminated intravascular coagulation.
 Reviews on subdurals in the elderly.
 Effectiveness of etidronate in treating hypercalcemia of malignancy.
Characteristics
Research-oriented
 Technical
 Cause and Effect, Implications

The Problem of Information Access



Main problem:
 Computers can’t understand natural
language.
Therefore:
 Information access systems must guide
users to information of interest by
approximate methods.
General common methods:
 word match
 topic directories
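
Both approximate methods can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the inverted index with Boolean AND matching is one common way to implement word match, and the nested dictionary stands in for a hand-built topic directory. The names `build_index`, `word_match`, and `directory`, and the sample documents, are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip('.,!?;:')].add(doc_id)
    return index

def word_match(index, query):
    """Return ids of documents containing every query word (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "Tide tables for Maui.",
    2: "How-to pages for Doom.",
    3: "Tide pools of the Pacific.",
}
index = build_index(docs)

# A topic directory is browsed rather than searched: a person files each
# document under a category ahead of time. Here it is a nested dict of ids.
directory = {"Recreation": {"Games": [2], "Outdoors": [1, 3]}}

tide_docs = word_match(index, "tide")           # documents 1 and 3
tide_tables = word_match(index, "tide tables")  # document 1 only
ocean_docs = word_match(index, "ocean")         # nothing: no document uses the word
```

The last query shows why the method is only approximate: document 3 is about the ocean, but word match cannot find it because the word itself never appears.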
Why Text is Tough

Abstract concepts are difficult to represent (AI-Complete)
“Countless” combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
 space ship, flying saucer, UFO, figment of imagination
Concepts are difficult to visualize
High dimensionality
 Tens or hundreds of thousands of features
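
A small Python sketch makes two of these points concrete. It assumes a bag-of-words representation with cosine similarity, a common choice that is used here purely for illustration; the helper names `bag_of_words` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts; each distinct word is one feature."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

phrases = ["space ship", "flying saucer", "UFO", "figment of imagination"]
vectors = [bag_of_words(p) for p in phrases]

# One dimension per distinct word across the whole collection.
vocabulary = sorted(set().union(*vectors))

# Related phrases that share no words look completely unrelated.
similarity = cosine(vectors[0], vectors[1])
```

Four short phrases already need 8 dimensions, and `similarity` is 0.0 even though a person would see "space ship" and "flying saucer" as closely related. A real collection has one dimension for every distinct word it contains.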
Why Text is Tough
 I saw Pathfinder on Mars with a telescope.
 Pathfinder photographed Mars.
 The Pathfinder photograph mars our perception of a lifeless planet.
 The Pathfinder photograph from Ford has arrived.
 The Pathfinder forded the river without marring its paint job.
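
To see how little simple word matching captures here, the five sentences can be run through a toy matcher. This is a sketch only: the case-insensitive matching and the two-word query are assumptions chosen for the example.

```python
sentences = [
    "I saw Pathfinder on Mars with a telescope.",
    "Pathfinder photographed Mars.",
    "The Pathfinder photograph mars our perception of a lifeless planet.",
    "The Pathfinder photograph from Ford has arrived.",
    "The Pathfinder forded the river without marring its paint job.",
]

def words(text):
    """Lowercased set of words with trailing punctuation stripped."""
    return {w.strip('.,').lower() for w in text.split()}

# Hypothetical query from a user interested in the Mars Pathfinder mission.
query = {"pathfinder", "mars"}

# Keep every sentence that contains all of the query words.
matches = [s for s in sentences if query <= words(s)]
```

Three sentences match. The third is returned even though "mars" is a verb there, and the matcher has no way to tell that the last two sentences are about a car rather than a spacecraft.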

Outline



What is Information Access?
 Goals, Tasks, Types of data
Standard Information Retrieval
 Assumptions, Techniques, Evaluation
Current Topics
 Candidate topics
  User Interfaces
  Quality Assessment
  Text Data Mining
  Student suggestions

Tools for Information Access
User Interfaces (information visualization)
Information Access (information retrieval)
Language and Content Analysis
Task Analysis
Current Topics

User Interfaces
 Incorporating “personal” information
 Automated “Agents” vs. User Initiated Steps
 Support for the dynamic process of information access
 How to organize large search results
  Categories, clusters, combinations of these
 Question Answering
 Others?

Current Topics

Quality Assessment
 Issues:
  How to define quality
  Rating methods
  Different fields (medicine, business)
 Techniques:
  Visitation patterns and times
  “Social” techniques
  Link structure (co-citation patterns)
  Link structure + content
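
As one illustration of the link-structure technique, co-citation can be counted directly from a link graph: two pages are co-cited each time a third page links to both. The sketch below assumes a toy graph; the function name `cocitation_counts` and the page names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links):
    """Count how often each pair of pages is linked to by the same source page.

    links maps a source page to the set of pages it links to.
    """
    counts = Counter()
    for targets in links.values():
        # Sort so each unordered pair is always stored in the same order.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return counts

links = {
    "p1": {"A", "B"},
    "p2": {"A", "B", "C"},
    "p3": {"B", "C"},
}
counts = cocitation_counts(links)
```

Pairs with high counts, such as A and B here (co-cited by p1 and p2), are candidates for being related or of comparable standing. How to turn such counts into a quality rating is the open question this topic raises.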

Current Topics

Text Data Mining
 Visualizing the contents of large text collections
 Automatically discovering associations within text collections
  Discovering useful patterns
  Spotting anomalies
  *Finding chains of associated information

*I have a proposal for this
Current Topics


Cognitive modeling/AI techniques
Your idea goes here:
For Next Time



Do background reading
Think about which topics to pursue
I will present more background
information