Distributed search - 10th International Bielefeld Conference


Technology for integrated access and discovery
Presented by: Marc Krellenstein
Title: VP, Search and Discovery, Advanced Technology Group
Date: February 5, 2004
Basic search is pretty good
• Modern search engines are fast and scalable
  – Having the data (usually lots of it) is still key
• Can interpret keyword, Boolean and pseudo-natural language queries
  – Ex: “how to make an international call with my Blackberry”
• Spell checking, thesauri and stemming to improve recall
• Users are more experienced
  – More multi-term searches
• Gets lots of hits, but that’s usually OK if good ones are on top
Basic search is pretty good
• Best practice relevancy ranking is good (a minimal scoring sketch follows below):
  – Term frequency (TF): more hits count more
  – Inverse document frequency (IDF): hits of rarer search terms count more
    • Ex: diabetes diagnosis and treatment
  – Hits of search terms near each other count more
    • Ex: penicillin allergy vs. “penicillin allergy”
  – Hits on metadata (title, subject, etc.) count more
  – Items with more links/references to them count more
    • Use anchor text – referring text – as metadata
    • Authoritative links/referrers count yet more
  – Many other factors: length, date, etc.
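A minimal sketch of how the TF and IDF factors above can combine into a document score. The function, corpus and metadata boost weight here are invented for illustration, not taken from any particular engine:

```python
import math
from collections import Counter

def score(query_terms, doc_terms, corpus, metadata_terms=()):
    """Score one document: term frequency (TF) weighted by rarity (IDF),
    plus a simple boost for hits on metadata fields like title/subject."""
    n_docs = len(corpus)
    tf = Counter(doc_terms)
    total = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        if df == 0:
            continue
        idf = math.log(n_docs / df)                    # rarer terms count more
        total += tf[term] * idf                        # more hits count more
        if term in metadata_terms:                     # metadata hits count more
            total += 2.0 * idf                         # (boost weight is arbitrary)
    return total

# 'diabetes' is rarer in this toy corpus than 'treatment', so its hits count more.
corpus = [{"diabetes", "diagnosis", "treatment"},
          {"diagnosis", "and", "treatment"},
          {"cancer", "treatment", "trends"}]
print(score(["diabetes", "treatment"],
            ["diabetes", "treatment", "diabetes"], corpus,
            metadata_terms={"diabetes"}))
```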
Basic search is pretty good
• Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics
• But challenges remain…
Current challenges
• Integrated search: content still exists in separate silos
  – Silos are getting bigger, but there are still too many
  – Library patrons have dozens of choices
  – Putting even more into Google is probably not sufficient to solve the problem
• Finding the best/novel documents
  – Hard to perform complicated searches (e.g., research similar to one’s own)
    • Historians can’t define a profile…
• Discovery
  – Hard to do more than search: summarize, uncover novelty and relationships, analyze
The integration challenge
• Two approaches:
  – Build even bigger databases (well, yes…)
    • Not easy, but sometimes the easiest approach
    • Can be difficult to manage and secure appropriate rights
  – Distributed search: search separately managed (or owned) large databases as if they are one
    • Technically more challenging, but a scalable and maintainable architecture
Distributed search
• Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search
  – Use a common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database
  – Search engine provides parallel search, integrated ranking and integrated results (see the sketch below)
  – The separate databases can be maintained and updated separately
  – Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture
    • Has contributed specifications to the public domain
  – Such services can also be offered externally
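A sketch of the distributed-search pattern named above: one query fanned out in parallel to separately maintained indexes, with the partial results merged into a single ranking. The SearchShard class and its toy scoring are hypothetical stand-ins, not a real engine’s API:

```python
from concurrent.futures import ThreadPoolExecutor

class SearchShard:
    """Stands in for one separately maintained database/index (hypothetical API)."""
    def __init__(self, name, docs):
        self.name, self.docs = name, docs

    def search(self, query):
        # Toy scoring: number of query terms a document contains. A real
        # engine would return globally comparable relevancy scores.
        terms = query.lower().split()
        return [(sum(t in d.lower() for t in terms), self.name, d)
                for d in self.docs if any(t in d.lower() for t in terms)]

def distributed_search(query, shards, top_k=10):
    # Parallel search across shards, then one integrated ranking.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: s.search(query), shards)
    merged = [hit for hits in partials for hit in hits]
    return sorted(merged, reverse=True)[:top_k]

shards = [SearchShard("journals", ["Distributed search and ranking",
                                   "A review of gene p53"]),
          SearchShard("abstracts", ["Search with Dublin Core metadata"])]
print(distributed_search("distributed search", shards))
```

Because each shard can be updated on its own schedule, the databases stay separately maintained while users see one result list.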
Distributed search
• Simplifies some business issues, but still requires a common technology platform
• Where a common platform is not possible, add federated search (i.e., metasearch) – see the sketch below
  – Translate queries
  – Access and perform parallel search of multiple search engines (vs. multiple databases)
  – Integrate results as best as possible
  – Use standards to approximate distributed search
    • Uniform access, one query language (Z39.50, updated)
    • Add standards for relevancy ranking and results return?
    • NISO and its members are working on standards
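A sketch of the metasearch steps listed above: translate one query into each engine’s syntax, search, and integrate results as best as possible. The translation functions and engine interfaces are hypothetical; with no shared relevancy standard, merging here approximates a common score from each engine’s own result order (engines are queried sequentially for brevity):

```python
def to_boolean(terms):          # hypothetical engine A: AND-joined Boolean syntax
    return " AND ".join(terms)

def to_url_query(terms):        # hypothetical engine B: URL-style query syntax
    return "q=" + "+".join(terms)

def federated_search(terms, engines):
    merged = []
    for engine in engines:
        query = engine["translate"](terms)        # per-engine query translation
        for rank, doc in enumerate(engine["search"](query)):
            # Rank-based normalization: earlier results get higher scores.
            merged.append((1.0 / (rank + 1), engine["name"], doc))
    return sorted(merged, reverse=True)

engines = [
    {"name": "A", "translate": to_boolean,
     "search": lambda q: [f"A-result for ({q}) #{i}" for i in range(2)]},
    {"name": "B", "translate": to_url_query,
     "search": lambda q: [f"B-result for ({q}) #{i}" for i in range(2)]},
]
print(federated_search(["distributed", "search"], engines))
```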
Finding the best: Navigation
• More data can also make finding the best or novel documents harder
  – For searches for rare items, more data is a win
  – For all other searches, it’s more likely your answer is in there…but it’s also more likely there’s lots of other stuff close but not as good
• Why? Relevancy is good, but…
• Relevancy has its limits…there may be many ‘good’ documents referring to different aspects of the search…which is the best?
• Underlying problems:
  – User’s needs may not be that specific
  – Even long searches are under-specified
One solution: clustering documents
• Group results around common themes: same subject, author, web site, journal, …
• Show the largest/most interesting categories
  – Depression → psychology, economics, meteorology, antiques…
  – Psychology → treatment of depression, depression symptoms, seasonal affective…
  – Psychology → Kocsis, J. (10), Berg, R. (8), …
• Themes could come from static metadata or dynamically by analysis of results text (see the sketch below)
  – Static: fixed, clear categories and assignments
  – Dynamic: doesn’t require metadata (or a controlled vocabulary to draw from)
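A sketch of the dynamic approach: cluster result text with no metadata at all, here using scikit-learn’s TF-IDF vectors and k-means, and label each cluster by its highest-weighted terms. The result snippets are invented, echoing the “depression” example above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [
    "treatment of depression with new antidepressants",
    "depression symptoms in adolescents",
    "the great depression and economic recovery",
    "tropical depression strengthens into a storm",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(results)          # dynamic: built from results text only

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for c in range(km.n_clusters):
    # Label the cluster by its top-weighted terms (the 'theme').
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print([terms[i] for i in top], "->",
          [r for r, lab in zip(results, km.labels_) if lab == c])
```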
Clustering benefits
• Disambiguates and refines search results to get to documents of interest quickly
• Can navigate long result lists hierarchically
  – Would never offer thousands of choices to choose from as input…
  – Access to the bottom of the list…maybe just less common
• Discovery – new aspects or sources
• Can narrow results *after* search
  – Start with the broadest area search – don’t narrow by subject or other categories first
  – Easier, plus can’t guess wrong, miss useful categories, or pick unneeded ones…results-driven
    • Knee surgery → cartilage replacement, plastics, …
Finding the best: Complex search
• Main problem is still short searches/under-specification…which the keyword-based ‘enter a query’ paradigm encourages
• One solution: relevance feedback – marking good and bad results
• A long-standing and proven search refinement technique (see the sketch below)
  – More information is better than less (longer queries are better)
  – Pseudo-relevance feedback is a research standard
• Commercial forms – find-similar, etc. – not widely used (or well executed)…
• …but successful in PubMed (different users)
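A sketch of classic Rocchio-style relevance feedback, the standard formulation of the technique named above: move the query vector toward documents marked good and away from those marked bad. The alpha/beta/gamma weights are conventional textbook defaults, not from the talk, and the term space is invented:

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Refine a query vector from user-marked good and bad results."""
    q = alpha * query_vec
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)      # pull toward good results
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)  # push away from bad ones
    return np.clip(q, 0, None)                        # keep term weights non-negative

# Toy term space: [depression, treatment, economics]
query = np.array([1.0, 0.0, 0.0])
good = np.array([[0.9, 0.8, 0.0]])                    # marked relevant
bad = np.array([[0.7, 0.0, 0.9]])                     # marked not relevant
print(rocchio(query, good, bad))  # 'treatment' rises, 'economics' is suppressed
```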
Relevance feedback
• One catch: must first find a good document to be similar to
• Solution: let the user provide the ideal document – or a long query or problem statement – as input in the first place
  – Can enter free text or specific documents describing the interest, e.g., an article, grant proposal, experiment description, etc.
  – Should provide the best possible matches
Discovery challenge: Beyond search
• How do you summarize a corpus?
  – May want to report on what’s present, numbers of occurrences, trends, etc.
  – Ex: What diseases are studied the most?
  – Must know all diseases and look one by one
• How do you find a relationship if you don’t know what relationships exist?
  – Ex: Does gene p53 relate to any disease?
  – Must check for each possible relationship
• Ad hoc analysis
  – How do all genes relate to this one disease? Over time? What organisms have the gene been studied in? Show me the document evidence…
One solution: entity extraction
• Identify entities (things) in a text corpus
  – Examples: authors, universities…diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
  – Use lexicons, patterns, NLP for finding any or all instances of the entity
• Identify relationships (see the sketch below):
  – Through co-occurrence
    • Relationship presumed from proximity
    • Example: author-university affiliation
  – Through limited natural language processing
    • Semantic relations – causes, is-part-of, etc.
    • Examples: drug-causes-disease…drug-is-treatment-for-disease…A is suing B…
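A sketch of the simpler, co-occurrence form of this: lexicon-based extraction of entities, with a relationship presumed when two entities appear in the same abstract. The lexicons and abstracts are invented, and real systems would use patterns and NLP rather than bare substring matching:

```python
from itertools import product
from collections import Counter

GENES = {"p53", "brca1"}                   # toy lexicons
DISEASES = {"leukemia", "alzheimer's", "breast cancer"}

def extract(text, lexicon):
    """Crude lexicon lookup: which known entities appear in this text?"""
    lowered = text.lower()
    return {term for term in lexicon if term in lowered}

abstracts = [
    "Mutations of p53 are frequently observed in leukemia.",
    "brca1 expression and breast cancer risk.",
    "p53 pathways in breast cancer cell lines.",
]
pairs = Counter()
for a in abstracts:
    genes, diseases = extract(a, GENES), extract(a, DISEASES)
    pairs.update(product(sorted(genes), sorted(diseases)))   # co-occurrence

print(pairs.most_common())   # gene-disease pairs ranked by document evidence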
ClearForest pilot, Fall 2002
• Goal: demonstrate real value to a working expert in 90 days
• Chose the biomedical domain
• Hired an expert to help define entities and relationships
• Used 25,000 abstracts from 23 Elsevier journals
• Worked with ClearForest to define and revise extraction of entities and relationships
• Have a related partnership with Stanford for text mining
Pilot scenarios
• Answered real questions using real data – not a demo or mock-up
• The user:
  – Anyone involved in genomic academic research: a primary researcher, graduate student or post-doc
• Scenario 1: Research about gene p53
  – What journals should I publish in?
  – Who’s an expert I can ask for advice?
  – What connections have been made to my gene?
  – What organisms have my gene?
(Demo screens: What journals should I publish in? Who’s an expert? Connections to p53? To organisms?)
Pilot scenarios
• Scenario 2: Disease research
  – What diseases are most researched?
  – What’s the time trend in HIV research?
  – What are the centers of HIV research?
  – Who are the author teams in HIV?
  – What gene-disease relationships are there? What were they to start, in 1996? Through 1997?
  – (Note: cannot answer the above with search alone)
(Demo screens: most-researched diseases; time trend in HIV research; centers of HIV research; author teams in HIV research; gene-disease relationships, to start in 1996 and through 1997)
Pilot scenarios
• Scenario 3: Connections between leukemia and Alzheimer’s
  – Are there direct connections between leukemia and Alzheimer’s?
  – What enzymatic activity is associated with leukemia?
  – Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?
(Demo screens: direct connections between leukemia and Alzheimer’s; enzymes associated with leukemia; indirect links from leukemia to Alzheimer’s via enzymes)
The power of indirect links
• Almost impossible to determine manually
• Can provide completely unexpected relationships between source and target (see the sketch below)
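A sketch of two-step link discovery in the spirit of the leukemia/Alzheimer’s scenario (and of Swanson-style literature-based discovery): given direct co-occurrence edges extracted from the literature, find intermediates that connect a source and target with no direct edge between them. The edge data is invented:

```python
from collections import defaultdict

# Direct co-occurrence edges extracted from the literature (toy data).
edges = [("leukemia", "telomerase"), ("leukemia", "caspase-3"),
         ("caspase-3", "alzheimer's"), ("telomerase", "aging")]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

def indirect_links(source, target):
    """Entities linked to both source and target: the mediating concepts."""
    return sorted(neighbors[source] & neighbors[target])

print(indirect_links("leukemia", "alzheimer's"))   # -> ['caspase-3']
```

Enumerating such intersections by hand across thousands of abstracts is what makes the manual version nearly impossible.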
The value of analytics
• Goes beyond search – summarizes, shows relationships, answers complex questions
• A significant value-added service
  – Value of one new drug discovery?
Summary
• Need to search more broadly, more easily
  – Larger databases
  – Distributed search
• Need to locate the best/novel documents in even larger (distributed) databases
  – Clustering to find documents of real interest
  – Find-similar, descriptive search
• Need to go beyond search for overviews, relationships and discovery
  – Text-based data mining and entity extraction