How To Read Research Papers

Download Report

Transcript How To Read Research Papers

Documents and Indexing
•Readings Overview
•Topic Discussions Schedule Set
•Projects and Papers Ideas
Indexing and Searching
• Queries models work against the index
- Find words, word counts, phrases
- Sequential search, indexed search
•
•
•
•
•
•
Inverted Files & Other Indices
Boolean Queries
Sequential Searching
Pattern Matching
Structural Queries
Data structures
- The infrastructure of search
- Varied per data set and query contexts
Inverted Files
- Vocabulary with counts for occurrences
- Positions where each word (or char) is stored
• Character positions – exact – full inverted
indexes
• Words referenced – local – block addressing –
then words are found
- Pointers can be made efficient, compressed
- Fixed size or content based
- Dependent on text location and access
Searching through Inverted Files
1. Vocabulary search
-
Query is parsed and words/phrases are determined
2. Retrieval of occurrences
-
List of words and locations
3. Manipulation of occurrences
•
•
•
•
•
•
Occurrences are processed for IR operation
Block search if needed
Vocabulary is small(er) and starting point
Sequence queries use more processing
Fast in general
Easy to update
Other Indices for Text
- Suffix Trees
• Helps with phrases, patterns (ACGT)
• File, text is like a huge string
• Blocks of text seen as suffixes (from X to EOT)
• Each virtual block is different
• Great at finding word beginnings
- Suffix Arrays
• Pointers to text (matches)
• Comparison to each pointer
- Look, stop, look to EOF, EOT
- Sorting and Merging can occur
Signature Files
• A different method (or is it?)
• Vocabulary indexed hashes (a signature)
- Cut blocks of text of b words each to bit masks
- Bit masks are just bins of bits
- Signature is sequence of bit masks for all the
blocks and a pointer to each block
• Search by converting (hashing) the query to
bit mask and then comparing
• Finds sets of words easier (phrases)
• Block tweaking can go on forever
• Easy to update… add new bit masks
Boolean Queries
• A set of operations to manipulate the indices
and produce results
• Algorithm based
- Find document qualifiers
- Make relevance judgment (value) of document
- Retrieve query matches for review
•
•
•
•
Mulit-phase, levels of processing (merging)
Narrows scope of the query
Parsing leaves from the syntax tree (AND/OR)
Optimization is in the details & situations
Sequential Searching
• Search through a free-form data set (no
indexed specific structure)
• Find where the query pattern occurs
- Query shorter than text
- Number of potential matches for query in text
• Simple case:
- Start at first character
- Proceed linearly through the text
• Huge variety of methods
String Search Methods
• Brute Force – search over all position in the
text
- Match?
• No, Next
• Yes, Note location
- Next
- Sequential
• Windows (framed) text scanning
- Knuth-Morris-Pratt: slides frame over text the size
of the search string, starting on the next character
from each confirmed miss
- Boyer-Moore: slides frame over text, moving ahead
the length of the whole search string
More String Search
• Shift-Or
- Uses binary representation, bit masks for
searching with a frame
- Strengths and weaknesses of hashes
• Phrase search
- Look for longest word first using Window scan
- Look for less frequent term first
• Which method is best?
- It depends on the document and IR model for
queries
Putting it all together…
1. Query
2. Index
3. Search Method
-
•
Vocabulary is matched
Results of matches are merged into a list of
documents and text locations in documents
You can see how the IR model, the query and
the index all combine to enable the IR
system
Structural Queries
• Working with a document that has explicit or
implicit structure
- Explicit – HTML, tags, spaces, numbers…
- Implicit – Genre, Language, Formats…
• Tags count as vocabulary words (mostly)
• Index represents structure
• IR models allow acting on structure in index
• Do we ever not have some kind of structure?
- Should structure be ignored or removed?
- Do structures scale and have consistency?
Compression
• Search is possible directly in compressed text
• This is a huge win for disk access (and CPU)
• Language compression:
- Heap - vocabulary
- Zipf – document scaling
• Indexes compress well, sequence and
differences can be small overall sizes
Evaluating Indices & Search
• Fixed indices are best for static or predictable
text
• One, single search method may be tuned for
static text
• Hybrid solutions are best for varied
documents
• Hybrid solutions are best as you want to be
prepared for document and query type
diversity
Users and Searching
• Document sets and therefore indices are
growing rapidly
• Users have increasing expectations for both
accuracy and speed
• Compression methods help please users
• Structured information will transform search
selection and IR model preferences
• Computation power solves some problems
(mostly)
Keeping Up with the Changing Web
• Building Indices is difficult enough in theory
• What about a continuously changing huge volume of
information?
• Is old information good?
• What does up-to-date mean anymore?
• Is Knowledge a depreciating commodity?
- Correctness + Value over time
• Different information changes at different rates
- Really it’s new information
• How do you update an index with constantly
changing information?
Changing Web Properties
• Known distributions for information change
• Sites and pages may have easily identifiable
patterns of update
- 4% change on every observation
- Some don’t ever change (links too)
• If you check and a page hasn’t changed, what
is the probability it will ever change?
• Rate of change is related to rate of attention
- Machines vs. Users
- Measures can be compared along with information
Dynamic Maint. of Indexes w/Landmarks
• Web Crawlers do the work in gathering pages
• Incremental crawling means incremented
indices
- Rebuild the whole index more frequently
- Devise a scheme for updates (and deletions)
- Use supplementary indices (i.e. date)
• New documents
• Changed documents
• 404 documents
Landmarks for Indexing
• Difference-based method
• Documents that don’t change are landmarks
- Relative addressing
- Clarke: block-based
- Glimpse: chunking
• Only update pointers to pages
• Tags and document properties are
landmarked
• Broader pointers mean less updates
• Faster indexing – Faster access?
Yahoo! Cataloging the Web
• How do information professionals build an “index” of
the Web?
• Cataloging applies to the Web
• Indexing with synonyms
• Browsing indexes vs searching them
• Comprehensive index not the goal
- Quality
- Information Density
• Yahoo’s own ontology – points to site for full info
• Subject Trees with aliases (@) to other locations
• “More like this” comparisons as checksums
Yahoo uses tools for indexing
Investigation of Documents from the WWW
• What properties do Web documents have?
• What structure and formats do Web
documents use?
• What properties do Web documents have?
-
Size – 4K avg.
Tags – ratio and popular tags
MIME types (file extensions)
URL properties and formats
Links – internal and external
Graphics
Readability
WWW Documents Investigation
• How do you collect data like this?
- Web Crawler
• URL identifier, link follower
- Index-like processing
• Markup parser, keyword identifier
• Domain name translation (and caching)
• How do these facts help with indexing?
• Have general characteristics changed?
• (This would be a great project to update.)
Properties of Highly-Rated Web Sites
• What about whole Web sites?
• What is a Web site?
- Sub-sites?
- Specific contextual, subject-based parts of a Web
site?
- Links from other Web pages: on the site and off
- Web site navigation effects
• Will experts (like Yahoo catalogers) like a
site?
Properties
•
•
•
•
•
•
Links & formatting
Graphics – one, but not too many
Text formatting – 9 pt. with normal style
Page (layout) formatting – min. colors
Page performance (size and acess)
Site architecture (pages, nav elements)
- More links within and external
- Interactive (search boxes, menus)
• Consistency within a site is key
• How would a user or index builder make use
of these?
Extra Discussion
• Little Words, Big Difference
- The difference that makes a difference
- Singular and plural noun identification can change
indices and retrieval results
- Language use differences
• Decay and Failures
- Dead links
- Types of errors
- Huge amount of dead links (PageRank effective)
• 28% in 1995-1999 Computer & CACM
• 41% in 2002 articles
• Better than the average Web page?
Topic Discussions Set
• Leading WIRED Topic Discussions
- About 20 minutes reviewing issues from the week’s
readings
• Key ideas from the readings
• Questions you have about the readings
• Concepts from readings to expand on
- PowerPoint slides
- Handouts
- Extra readings (at least a few days before class) –
send to wired listserv
Web IR Evaluation
- 5 page written evaluation of a Web IR System
- technology overview (how it works)
• Not an eval of a standard search engine
• Only main determinable diff is content
- a brief overview of the development of this type of
system (why it works better)
- intended uses for the system (who, when, why)
- (your) examples or case studies of the system in
use and its overall effectiveness
Projects and/or Papers Overview
• How can (Web) IR be better?
- Better IR models
- Better User Interfaces
• More to find vs. easier to find
• Scriptable applications
Project Ideas
• Searchable Personal Digital Library
• Browser hacks for searching
• Mozilla keeps all the pages you surf so you
can search through them later
- Mozilla hack
- Local search engin
Paper Ideas
• New datasets for IR
- Search on the Desktop – issues, previous research
and ideas
- Collaborative searching – advantages and
potential, but what about privacy?