Indexing for Searching - UNC School of Information and Library


Major Issues



Information is mostly online
Information is increasingly available in
full-text (full-content)
There is an explosion in the amount of
information being produced.
Manual Indexing



Humans read and index content.
Fairly good, although not consistent
(interobserver, or even intraobserver).
Certain fields support costly manual
indexing (primary example is Medline).
Major Issues


Most fields cannot afford manual
indexing. Even in biomedicine, which
can, so much literature is being
produced that we cannot keep track of
the knowledge in it, or utilize it.
Research Example: Swanson’s
undiscovered literature
What this means

Need ways to index without requiring
paid experts
– Automatic indexing, classification, keyword
extraction, and even relationship and fact
extraction.
– Need to take advantage of experts who are
reading the materials to comment on them and
provide rankings, summarizations,
keywords, and “factoids” (like Amazon).
Why Automatic Classification?


Classification is time consuming and
expensive
Knowledge structuring
– Too much information

Status of automatic classification
– Approaching the level of human indexing
(e.g. NLM’s MetaMap).
What is Automatic Classification?

Automatic manipulation of a document’s
contents to support logical grouping with
other similar documents for organization
and/or retrieval activities. Can include
the assignment of, or manipulation of,
classification notation.
Approaches and Methods

Initial approach
– Create an inverted file
– On-the-fly (natural language processing)

Methods
– All words, remove stop words
– Word frequencies (Wilson’s objective
method of determining aboutness)
– More sophisticated IR methods
• Semantic/linguistic analysis,
• co-occurrence/similarity measures, etc.
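
As a rough illustration of the first two methods (drop stop words, then count word frequencies), here is a minimal Python sketch. The stop list, the tokenizing rule, and the name `index_terms` are assumptions made for this example and do not come from the slides.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "with"}

def index_terms(text, top_n=5):
    """Return the top_n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())       # crude tokenizer
    kept = [w for w in words if w not in STOP_WORDS]  # remove stop words
    return Counter(kept).most_common(top_n)           # rank by frequency

doc = "The gene is expressed in the root. The gene regulates root growth."
print(index_terms(doc, top_n=2))
```

Frequency alone is a blunt signal of aboutness, which is why the slide goes on to list more sophisticated methods.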
Simple automatic indexes

Inverted file: contains all the index
terms automatically drawn from the
document records according to the
indexing technique used.
– Position of term
• Record number
• Field number
• Number of occurrences
• Position in the field (digits 45-57)
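
A small Python sketch may make the inverted file concrete. It assumes whitespace tokenizing, uses field names in place of field numbers, and records word offsets rather than character positions. The function name and sample records are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(records):
    """Map each term to postings of (record number, field name, word position)."""
    inverted = defaultdict(list)
    for record_no, fields in records.items():
        for field_name, text in fields.items():
            for position, word in enumerate(text.lower().split()):
                inverted[word].append((record_no, field_name, position))
    return inverted

# Hypothetical document records, each with two fields.
records = {
    1: {"title": "gene expression", "abstract": "expression of the gene in roots"},
    2: {"title": "root growth", "abstract": "gene regulation and growth"},
}
inverted = build_inverted_file(records)
print(inverted["gene"])       # every place the term occurs
print(len(inverted["gene"]))  # number of occurrences
```

A lookup is then a single dictionary access, with no scan of the documents themselves.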
Pros and Cons of Automatic
Indexing

Pros
– Consistency
– Cost reduction
– Time reduction

Cons / limitations
– Human intellect
– Term relationships
– Misleading in retrieval
– Good algorithms, but generally domain-specific
How to gauge effectiveness?
Recall
Number of relevant documents retrieved out of all the
possible relevant documents in the system.
[quantity—did you get it all?]
Precision
Percentage of documents retrieved that were relevant
[quality of what you found]
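
The two measures can be written out directly. This is a minimal sketch that assumes documents are identified by simple IDs held in sets. The sample IDs are made up.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

relevant = {1, 2, 3, 4}   # hypothetical: every relevant document in the system
retrieved = {2, 3, 5}     # hypothetical: what the search returned
print(recall(retrieved, relevant))     # 2 of 4 relevant documents found
print(precision(retrieved, relevant))  # 2 of 3 returned documents relevant
```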
Tradeoff between Recall and Precision
We can easily recall everything that matches a
particular text string or pattern; however, we cannot
search through all the matching results (too many).
We can do an OK job limiting results to the most relevant,
but as we “tune” the results to be more relevant, we leave
out more and more matching results.
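
One way to see the tradeoff is to walk down a ranked result list and compute both measures at each cutoff. The sketch below assumes a ranked list of document IDs and a non-empty set of relevant IDs. The data are invented for illustration.

```python
def precision_recall_at_cutoffs(ranked, relevant):
    """Return (k, precision, recall) for the top-k results at every cutoff k."""
    rows = []
    found = 0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            found += 1
        rows.append((k, found / k, found / len(relevant)))
    return rows

ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]  # hypothetical ranking, best first
relevant = {"d1", "d2", "d5"}                  # hypothetical relevance judgments
for k, p, r in precision_recall_at_cutoffs(ranked, relevant):
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

In this toy ranking, a short list is fully precise but misses a relevant document, while the full list recalls everything at half the precision.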
Future Search



Full-text searching of content, of
associated annotations on content, and of
metadata (including reader rankings,
tags, etc.), as in Connotea, NeoNote,
etc.
Facet-based searching (Endeca; e.g.
Home Depot, NCSU library).
Cluster-based searching (Clusty).
Study on gene name searching



Looks at full text searching
Tradeoff between precision and recall
(Hemminger 2007).
Article Discovery Study
                                       Schizophrenia +
                                       Schizophrenia Gene   Schizophrenia Gene   Arabidopsis Gene
Genes Found in Metadata Only           172 (8.58%)          3541 (20.63%)        2712 (8.83%)
Genes Found in Full-text Only          1671 (83.38%)        10125 (58.99%)       5705 (18.57%)
Genes Found in Metadata and Full-text  161 (8.03%)          3498 (20.38%)        22305 (72.60%)
Totals for Found Genes                 2004                 17164                30722
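
The percentages in this table are each count's share of its column total. The short Python check below recomputes them from the counts on the slide. Labeling the unlabeled middle row of counts as "full-text only" is an assumption made by elimination, and the dictionary layout is only for illustration.

```python
# Counts as given on the slide. Treating the unlabeled row (1671, 10125, 5705)
# as "full-text only" is an assumption.
counts = {
    "Schizophrenia + Schizophrenia Gene": {"metadata only": 172, "full-text only": 1671, "both": 161},
    "Schizophrenia Gene": {"metadata only": 3541, "full-text only": 10125, "both": 3498},
    "Arabidopsis Gene": {"metadata only": 2712, "full-text only": 5705, "both": 22305},
}

def shares(column):
    """Return the column total and each row's percentage of that total."""
    total = sum(column.values())
    return total, {row: round(100 * n / total, 2) for row, n in column.items()}

for name, column in counts.items():
    total, pct = shares(column)
    print(name, total, pct)
```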
Article Review Study

Two literature cohorts:
– Schizophrenia (Pat Sullivan)
– Arabidopsis (Todd Vision)


Each cohort had three readers
Readers were asked to “review the article
and judge its relevance to them as
someone new to the gene in this
biological setting, trying to build an
understanding of the state of knowledge
in that research area.”
Metadata Articles More Valuable

In both cases and for all observers, the
mean quality ratings were lower
(more useful) for the metadata
discovered articles. The differences
between the mean quality rating for the
metadata discovered articles and the
full-text discovered articles were
statistically significant at the p < 0.05
level for both the Arabidopsis and
Schizophrenia sets.
Precision and Recall
                            Schizophrenia               Arabidopsis
                            Recall          Precision   Recall          Precision
Metadata discovered         15.7% (16.6%)   94.7%       84.1% (84.1%)   100%
Full-text only discovered   100%            63.7%       100%            69%
Article Features that correlate with
Value: Number of Hits

The number of hits or matches of the search
term within the returned document is a
commonly used feature to rank returned
articles. To test the value of this feature, the
number of hits was correlated with the mean
quality ranking for each article (averaged
across all observers). The results clearly
show a relationship where articles with many
matches of the search term tend to be much
more highly valued.
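
The slides do not say which correlation statistic was used. As one plausible reading, the sketch below computes a Pearson correlation between hit counts and mean ratings. The data are invented, and the rating scale is an assumption apart from the stated convention that lower ratings mean more useful.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical articles: hits per article and mean rating across observers
# (lower rating = more useful, as in the study).
hits = [1, 3, 8, 20, 35]
mean_rating = [4.5, 4.0, 3.0, 2.0, 1.5]
r = pearson(hits, mean_rating)
print(f"r = {r:.2f}")
```

Because lower ratings mean more useful, a negative coefficient here corresponds to articles with more hits being valued more highly.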
Improving Relevance for Metadata
Searching

Repeating the calculations on the
schizophrenia and Arabidopsis article review
sets, but limited to only matches with high hit
counts (Schizophrenia ≥ 20 hits and
Arabidopsis ≥ 15 hits), shows that precision
for the full text is now the same (100% in
Arabidopsis) or slightly better than that of the
metadata retrieved articles (95% versus
94.4% in schizophrenia). However, the
number of additional cases discovered by
full-text searching is now only slightly higher,
finding 5% more cases in schizophrenia and
28% more in Arabidopsis.
Conclusions

This suggests that rather than accepting
metadata searching as a surrogate for
full-text searching, it may be time to make
the transition to direct full-text searching as
the standard. This could be accomplished by
using certain features of the full-text article,
such as the number of hits of the search string
or whether the search string is found in the
metadata (i.e. our current metadata search),
as filters that allow us to increase the
precision of our results and put the user in
control of the filtering.
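
A user-controlled filter of this kind might look like the following Python sketch. The `Result` record, its field names, and the sample results are hypothetical. Only the two filter features (hit count and metadata match) come from the slides.

```python
from dataclasses import dataclass

@dataclass
class Result:
    article_id: str
    hit_count: int     # occurrences of the search string in the full text
    in_metadata: bool  # True if the search string also matches the metadata

def filter_results(results, min_hits=1, require_metadata=False):
    """Keep full-text results that pass the filters the user has chosen."""
    return [
        r for r in results
        if r.hit_count >= min_hits and (r.in_metadata or not require_metadata)
    ]

# Hypothetical full-text search results.
results = [
    Result("a1", 42, True),
    Result("a2", 3, False),
    Result("a3", 25, False),
    Result("a4", 1, True),
]
print([r.article_id for r in filter_results(results, min_hits=20)])
print([r.article_id for r in filter_results(results, require_metadata=True)])
```

With `require_metadata=True` the filter returns only full-text matches that also match the metadata, and raising `min_hits` trades recall for precision in the way the hit-count experiment describes.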