Introduction to Information Retrieval


Introduction
Information Retrieval
Jian-Yun Nie
(Based on van Rijsbergen's introduction)
Plan

- Definition
- History
- Experimental tradition
- Methods
  - Query
  - Indexing
  - Matching
  - Results
  - Evaluation
- Current situation
- Research
Definition
Important concepts

- Document: an entity that contains some description of information; it may be in the form of text, image, graphics, video, speech, etc.
- Document collection: a set of documents (may be static or dynamic)
- User
- Information need: the user's requirement for information
- Query (request): a description of the information need, usually in natural language
- Relevance (relevant document): a document that contains the required information (French: pertinence)
- Correspondence, degree of relevance, relevance score, …: the degree of belief (by the system) that a document is relevant
- Judge = user/system
- Indexing: a process that transforms a document into a form of internal representation
- Retrieval: an operation that determines the documents to be retrieved
- Response (answer): the documents returned by the system (usually a ranked list)
What is IR?

- (Salton, 1968) Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
- (Lancaster, 1968) An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.
- (Needham, 1977) ….. the complexity arises from the impossibility of describing the content of a document, or the intent of a request, precisely or unambiguously …
Data retrieval vs. IR (van Rijsbergen, 1979)

                        DR                 IR
Matching                Exact match        Partial match, best match
Inference               Deduction          Induction
Model                   Deterministic      Probabilistic
Classification          Monothetic         Polythetic
Query language          Artificial         Natural
Query specification     Complete           Incomplete
Items wanted            Matching           Relevant
Error response          Sensitive          Insensitive
Logic                   Classic            Non-classic
Representation          A priori           A posteriori
Language model          Logical            Statistical
History
Important events in IR (1)

- 1952 Mooers coins the term IR
- 1958 International Conference on Scientific Information
- 1960 Cranfield I
- 1960 Maron and Kuhns paper
- 1961 (-1965) SMART built
- 1964 Washington conference on Association Methods
- 1966 Cranfield II
- 1968 Salton's first book
- 197- Cranfield conferences
- 1975 van Rijsbergen's (CvR) book
- 1975 Ideal test collection
- 1976 Sparck Jones & Robertson (KSJ/SER) JASIS paper
Important events in IR (2)

- 1978 1st SIGIR
- 1979 1st BCS IRSG
- 1980 1st joint ACM/BCS conference on IR
- 1981 KSJ's book on IR experiments
- 1982 Belkin et al. ASK hypothesis
- 1983- Okapi started
- 1985 RIAO-1
- 1986 CvR logic model
- 1990 Deerwester et al. LSI paper
- 1991- INQUERY started
- 1992 TREC-1
- 1998 Ponte & Croft paper on language models
Best known researchers (Salton Award)

- Gerard Salton (1983): "About the future of automatic information retrieval"
  Gerard Salton 1927-1995: see the SIGIR Forum memorial issue.
- Karen Sparck Jones (1988): "A look back and a look forward"
  http://www.acm.org/pubs/articles/proceedings/ir/62437/p13-jones/p13-jones.pdf
- Cyril Cleverdon (1991): "The significance of the Cranfield tests on index languages"
  Cyril Cleverdon 1914-1997: see Journal of Documentation 54(3), June 1998, 265-280.
Best known researchers (Salton Award)

- William Cooper (1994): "The formalism of probability theory in IR: a foundation or an encumbrance?"
  http://www.acm.org/pubs/articles/proceedings/ir/188490/p242-cooper/p242-cooper.pdf
- Tefko Saracevic (1997): "Users lost (summary): reflections on the past, future, and limits of information science"
  SIGIR Forum, 31 (2), 16-27 (Fall 1997).
- Stephen Robertson (2000): "On theoretical argument in information retrieval"
  SIGIR Forum, 34 (1), 1-10 (April 2000)
  For… "Thirty years of significant, sustained and continuing contributions to research in information retrieval. Of special importance are the theoretical and empirical contributions to the development, refinement, and evaluation of probabilistic models of information retrieval."
Best known researchers (Salton Award)

- W. Bruce Croft (2003): "Information Retrieval and Computer Science: An Evolving Relationship"
  For… "More than twenty years of significant, sustained and continuing contributions to research in information retrieval. His contributions to the theoretical development and practical use of Bayesian inference networks and language modelling for retrieval, and to their evaluation through extensive experiment and application, are particularly important. The Center for Intelligent Information Retrieval which he founded illustrates the strong synergies between fundamental research and its application to a wide range of practical information management problems."
- C.J. (Keith) van Rijsbergen
Relevant journals and conferences

Journals
- ACM Transactions on Information Systems (TOIS)
- Information Processing and Management (IPM)
- J. of the American Society for Information Science and Technology (JASIST)
- Information Retrieval
- …

Conferences
- ACM SIGIR
- CIKM
- TREC
- ECIR
- ACL
- RIAO
- CORIA
- …
Experimental tradition
Tradition of Experiments

- Strong experimental tradition (from Cranfield)
  - To show that an IR technique or IR system is better, its effectiveness should be measured on test data
  - This tradition has had a strong influence on other areas (computational linguistics, AI, machine translation, etc.)
- Pros
  - Develops practically effective approaches
  - Experimental evidence to support a technique
  - Avoids nice but useless theories
- Cons
  - Neglects theoretical development
  - Difficult to develop new theories and new techniques to compete against established methods
  - Wide use of heuristics, intuitions, manual tuning, …, or tricks
Experimental Methodology

- Cleverdon: Cranfield
  - Developed the "Cranfield Experiments" (funded by the National Science Foundation) and introduced the concepts of recall and precision to study the performance of information retrieval systems.
- Lancaster: Medlars
  - Report on the evaluation of its operating efficiency, American Documentation, 20(2): 119-142, April 1969. Lancaster refined a technique of failure analysis for this evaluation, seeking to investigate reasons why relevant documents were not retrieved.
- Blair & Maron: Stairs
  - Law documents, result analysis
- Sparck Jones: Ideal Test Collection
  - Big document collection, large set of various queries, exhaustive relevance judgments
- Salton: Smart
  - "Salton's Magical Automatic Retriever of Text"
  - Vector space model, relevance feedback, tf*idf, …
- Saracevic: CWRU
  - Theories and experiments related to: human information behavior; human-computer interaction from the human viewpoint; and modeling interaction processes in information retrieval. Notion of relevance in relation to information and information systems. Theoretical and pragmatic study of the value of information and library services. Nature of information science as a field.
- Harman: TREC
  - Annual experimental contest
  - Large document collections, more realistic queries, partial relevance judgments
  - Not an ideal test collection, but more realistic
Some References on the Web

- Cyril W. Cleverdon, The significance of the Cranfield tests on index languages, ACM SIGIR, 1991, pp. 3-12. (http://portal.acm.org/citation.cfm?id=122861)
- David C. Blair, M. E. Maron, An evaluation of retrieval effectiveness for a full-text document-retrieval system, Communications of the ACM, v.28 n.3, p.289-299, March 1985. (http://portal.acm.org/citation.cfm?id=3197)
- G. Salton, M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Journal of the ACM (JACM), v.15 n.1, p.8-36, Jan. 1968. (http://portal.acm.org/citation.cfm?id=321441)
- TREC: http://trec.nist.gov
Evaluation

[Diagram: a query is evaluated against the document collection; the system's answers are compared with the desired answers to produce the evaluation.]
Test collection

- Document collection: a large set of documents
- Query set: a set of queries (usually 50 or more)
- Relevance judgments: for each query, determine manually the relevant documents in the document collection
  - In TREC: the judgments are not known prior to the experiments
Methods
Indexing-based IR

[Diagram: the document is indexed into a representation (keywords); the query is indexed (query analysis) into a representation (keywords); query evaluation matches the two representations.]
Query
Query Language

- Artificial / Natural (web)
- Multilingual / cross-lingual
- Images
- None at all!
Query Definition

- Complete / Incomplete
- Independence / Dependence
- Weighted / Unweighted (tf × idf)
- Query expansion / one shot (feedback, web)
- Sense disambiguation
- Cross-lingual
Indexing
Maron's theory of indexing

- ….. in the case where the query consists of a single term, call it B, the probability that a given document will be judged relevant by a patron submitting B is simply the ratio of the number of patrons who submit B as their query and judge that document as relevant, to the number of patrons who submit B as their search query:

  P(D|B) = P(D,B) / P(B)
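A toy illustration of this estimate; the patron counts below are invented for the example:

```python
# Maron's estimate: P(D|B) is the fraction of patrons submitting query B
# who judge document D relevant. All counts here are invented.
patrons_submitting_B = 200        # patrons who submit B as their search query
patrons_relevant_D = 30           # of those, patrons who judge D relevant

p_D_given_B = patrons_relevant_D / patrons_submitting_B   # = P(D,B) / P(B)
print(f"P(D|B) = {p_D_given_B:.2f}")                       # 0.15
```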
Representation of Information

- Discrimination without Representation (specificity)
- Representation with Discrimination (exhaustivity)
- TF*IDF (see the sketch below):
  - TF: importance of a term for a document
  - IDF: specificity of the term across the collection (inverse document frequency)
- … defining a concept of 'information', …. [that] once this notion is properly explicated a document can be represented by the 'information' it contains (CvR, 1979)
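A minimal tf*idf sketch over a toy collection, assuming the plain log(N/df) form of idf (real systems use many variants):

```python
import math
from collections import Counter

# Toy collection: each document is a list of terms.
docs = [
    ["information", "retrieval", "system"],
    ["database", "retrieval", "system"],
    ["information", "need", "query"],
]

N = len(docs)
df = Counter(t for d in docs for t in set(d))   # document frequency of each term

def tf_idf(term, doc):
    tf = doc.count(term)              # importance of the term within the document
    idf = math.log(N / df[term])      # specificity of the term in the collection
    return tf * idf

# "retrieval" occurs in 2 of 3 documents, so it scores lower than a rarer term.
print(tf_idf("retrieval", docs[0]), tf_idf("need", docs[2]))
```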
Matching (query evaluation)

- Exact / partial match, e.g. SQL / Dice
- Boolean matching (Fairthorne, 50)
- Co-ordination level matching (Cleverdon, 60)
- Cosine correlation (Salton, 70) [VS] (see the sketch below)
- Probabilistic (ranking principle) (SER, 80) [PRP]
- Logical uncertainty principle (CvR, 90) [LUP]
- Bayesian inference (Croft, 90) [NET]
- Language modeling (Ponte & Croft, 98)
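For instance, cosine correlation is just the normalised dot product of two weighted term vectors, giving a graded (partial, best-match) score rather than an exact match; the weights below are made up:

```python
import math

doc   = {"information": 0.4, "retrieval": 0.8, "system": 0.2}   # e.g. tf*idf weights
query = {"information": 1.0, "retrieval": 1.0}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v))

print(f"{cosine(doc, query):.3f}")   # ~0.93: a degree of match, not a yes/no answer
```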
Inference

- Deduction / Induction: from A and A→B, infer B
- Cluster Hypothesis
- Association Hypothesis
- P(term1|term2)
Logic

- It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of Boole (1847) is the appropriate formalism for retrieval design ….. The 'logic' of Brouwer, as invoked by Fairthorne, is one such weakening of the postulate system, …… (Mooers, 1961)
- Another one: Logical Uncertainty Principle (CvR, 1986)
Logic

- If Mark were to lose his job, he would work less
- If Mark were to work less, he would be less tense
- If Mark were to lose his job, he would be less tense
Cluster Hypothesis

- If document X is closely associated with Y, then over the population of potential queries the probability of relevance for X will be approximately the same as the probability of relevance for Y, or in symbols:
  P(relevance|X) ~ P(relevance|Y)
- Document clustering (see the sketch below)
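A rough sketch of the hypothesis at work: group documents whose term overlap (the Dice coefficient here, one choice among many) is high, so that neighbours of a retrieved document become candidates too. The documents and the threshold are invented:

```python
# Toy documents as term sets.
docs = {
    "d1": {"nuclear", "power", "plant"},
    "d2": {"nuclear", "power", "reactor"},
    "d3": {"library", "catalogue", "index"},
}

def dice(a, b):
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

threshold = 0.5
clusters = [(x, y) for x in docs for y in docs
            if x < y and dice(docs[x], docs[y]) >= threshold]
print(clusters)   # [('d1', 'd2')]: closely associated documents group together
```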
Association Hypothesis

- If one index term X is good at discriminating relevant from non-relevant documents, then any closely associated index term Y is also likely to be good at this.
  P(relevance|X) ~ P(relevance|Y)
- Query expansion (see the sketch below)
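A rough query-expansion sketch built on the co-occurrence estimate P(term1|term2) from the Inference slide; the toy collection and the 0.5 threshold are assumptions:

```python
from collections import Counter
from itertools import combinations

docs = [
    {"nuclear", "power", "plant"},
    {"nuclear", "energy", "reactor"},
    {"power", "energy", "grid"},
]

term_df = Counter(t for d in docs for t in d)            # docs containing each term
pair_df = Counter(frozenset(p) for d in docs
                  for p in combinations(sorted(d), 2))   # docs containing both terms

def p_cond(t1, t2):
    """P(t1|t2) = #docs containing both / #docs containing t2."""
    return pair_df[frozenset((t1, t2))] / term_df[t2]

query = ["nuclear"]
vocabulary = {t for d in docs for t in d} - set(query)
expanded = query + sorted(t for t in vocabulary if p_cond(t, "nuclear") >= 0.5)
print(expanded)   # strongly associated terms are added to the query
```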
Models

- Boolean
- Vector Space (metrics) - a mixture of things
- Probabilistic (3 models)
- Logical (implication) - what kind of logic?
- Language models (see the sketch below)
- Cognitive (users)
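As a concrete instance of the language-model family, a minimal query-likelihood scorer in the spirit of Ponte & Croft (1998), with Jelinek-Mercer smoothing against the collection model; the documents and the lambda value are assumptions:

```python
# Score = product over query terms of a smoothed document language model.
def query_likelihood(query, doc, collection, lam=0.5):
    score = 1.0
    for t in query:
        p_doc = doc.count(t) / len(doc)                  # document model
        p_coll = collection.count(t) / len(collection)   # collection (background) model
        score *= lam * p_doc + (1 - lam) * p_coll        # Jelinek-Mercer smoothing
    return score

d1 = ["information", "retrieval", "retrieval", "model"]
d2 = ["database", "query", "model", "system"]
collection = d1 + d2
q = ["retrieval", "model"]
# d1 mentions "retrieval" and so generates the query with higher probability.
print(query_likelihood(q, d1, collection), query_likelihood(q, d2, collection))
```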
Retrieval Result
Items Wanted

- Matching / Relevant or Correct / Useful
- The function of a document retrieval system cannot be to retrieve all and only the relevant documents …. but to guide the patron in his search for information (Maron)
- Topical / tasks
- Meaning / content
Some difficulties with 'relevance'

- Goffman, 1969:
  '… that the relevance of the information from one document depends upon what is already known about the subject, and in turn affects the relevance of other documents subsequently examined.'
- Maron:
  'Just because a document is about the subject sought by a patron, that fact does not imply that he would judge it relevant.'
Relevance (Borlund, 2000)

- 'That is the relevance or irrelevance of a given retrieved document may affect the user's current state of knowledge resulting in a change of the user's information need, which may lead to a change of the user's perception/interpretation of the subsequent retrieved documents ….'
Evaluation
Error Response

- Precision: error where an irrelevant document is retrieved
  P = #(relevant docs retrieved) / #(retrieved)
- Recall: error where a relevant document is not retrieved
  R = #(relevant docs retrieved) / #(relevant)
- Trade-off between P and R
- F-measure: F = 2*P*R / (P+R)  (see the sketch below)
- How to cope with lack of recall:
  Cranfield → Ideal test collection → TREC
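The three measures are easy to compute from sets of document IDs; the IDs below are invented:

```python
retrieved = {1, 2, 3, 4, 5}       # documents returned by the system
relevant  = {2, 4, 6, 8}          # documents judged relevant

hits = len(retrieved & relevant)  # relevant documents retrieved
P = hits / len(retrieved)         # precision: 2/5 = 0.40
R = hits / len(relevant)          # recall:    2/4 = 0.50
F = 2 * P * R / (P + R)           # harmonic mean, ~0.44
print(P, R, F)
```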
What is a relevant document?

- Relevance is:
  - The correspondence between a document and a query; a measure of informativeness to the query
  - A degree of relation (overlap, relatedness, …) between document and query
  - A measure of utility of the document to the user
  - …
- Judged by user / system
  - User relevance / system relevance
How should relevance be judged?

- Relevance is dependent on
  - Document contents
  - Information need (query)
  - Time constraint
  - Purpose of retrieval
  - Retrieval environment
- About the domain of application
  - Domain of application (newspaper articles, law, patents, medicine, …)
  - User's knowledge
- About the system
  - Computer/connection speed
  - User interface
  - …
How is relevance judged? (TREC)

- Candidate answers are obtained by merging (pooling) the answers from the different participating systems (see the sketch below)
- Several human assessors judge the same query
  - Agreement / disagreement
- Binary value (rel. / irrel.) / multi-valued
- Workable strategy, but with potential problems:
  - Some relevant documents may not be found by any system
  - Subjective judgments
  - Disagreement between assessors and participants (but participants usually respect the judgments of assessors)
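A minimal sketch of this pooling step; the runs and the pool depth are invented:

```python
# Each system submits a ranked list of document IDs for a query.
system_runs = {
    "sysA": [10, 3, 7, 42, 5],
    "sysB": [3, 11, 10, 9, 6],
    "sysC": [42, 3, 8, 10, 1],
}

pool_depth = 3                       # judge only the top-k from each run
pool = set()
for run in system_runs.values():
    pool.update(run[:pool_depth])    # merge the candidate answers

print(sorted(pool))                  # assessors judge only this pool; a relevant
                                     # document outside every top-k is never judged
```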
Practice vs. Experiments

- Practice:
  - Web
  - Electronic publishing
  - Task-oriented IR
  - Data mining
  - Knowledge discovery
  - Distance learning
  - Video/film asset management
- Experiments: TREC
  - HCI
  - Visualisation
  - Work in context, cognitive approaches
  - Cross-lingual
  - Cross-media
  - Corpus-based IR (incl. WordNet, etc.)
  - Digital libraries
  - CBIR (Content-Based Image Retrieval)
  - TDT (Topic Detection and Tracking)
Research themes

- Discrimination/Representation
- Data fusion
- Authority/importance models (e.g. PageRank)
- Logic + uncertainty models
- Filtering/Routing
- Language models
- Summarisation
- IR + DBMS (incl. XML, etc.)
- Clustering the web
- Visualising the web
- Living with single-term queries
- Living with no queries
- Scale-free networks
- Trading media (text helps images!)
- Temporal dimensions (topics, events)
- Evaluation (time to dump 'P and R'?)
- NLP in IR
Current situation
Where are we now in IR?

- Landmarks
- Hypotheses/Principles
- Postulates of Impotence
- Long-term challenges
- Areas of research
Landmarks

- Luhn's tf weighting
- Architecture
- Relevance feedback
- Stemming
- Poisson model → BM25
- Statistical weighting tf*idf
- Various models
Hypotheses/Principles

- Items may be associated without apparent meaning, but exploiting their association may help retrieval
- P & R trade-off (ABNO/OBNA)
- Exhaustivity/Specificity
- Cluster Hypothesis
- Association Hypothesis
- Probability Ranking Principle
- Logical Uncertainty Principle
- ASK
- Polyrepresentation
Postulates of Impotence (according to Swanson, 1988)

- An information need cannot be expressed independently of context
- It is impossible to instruct a machine to translate a request into adequate search terms
- A document's relevance depends on other documents already seen
- It is never possible to verify whether all relevant documents have been found
- Machines cannot recognize meaning → they cannot beat human indexing, etc.
…. more postulates

- Word-occurrence statistics can neither represent meaning nor substitute for it
- The ability of an IR system to support an iterative process cannot be evaluated in terms of single-iteration human relevance judgments
- You can have either subtle relevance judgments or highly effective mechanised procedures, but not both
- Thus, consistently effective fully automatic indexing and retrieval is not possible
Long-term Challenges (workshop, UMass, 9/2002)

- Global information access: satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language.
- Contextual retrieval: combine search technologies and knowledge about query and user context into a single framework in order to provide the most "appropriate" answer for a user's information need.
Areas of Research

- How does the brain do it? (neuroscience)
- How do we see to retrieve? (computer vision)
- How do we reduce dimensionality in a dynamic fashion? (statistics)
- What is a good logic for IR? (mathematical logic)
- What is a good theory of uncertainty? (frequency/geometry)
- How do we model context? (HCI)
- How do we formally capture interaction?
- How do we capture implicit/tacit information?
- Is there a theory of information for IR?
Images, not text: how might that make a difference?

- No visual keywords (yet)
  - tf/idf issue
- Aboutness revisable (e.g. Maron)
- Relevance revisable (e.g. Goffman)
- Feedback requires salience
- Aboutness → relevance → aboutness
Text vs. image

Text               Image
Keywords           Visual features
Frequency          ?
Meaning            Object/form/color
Grammar            Geometry
Salience           Salience
Relevance          Path dependent
Query expansion

[Example image omitted from the transcript.]
References

- Van Rijsbergen's talks:
  - http://ir.dcs.gla.ac.uk/oldseminars/Keith2.ppt
  - http://wwwclips.imag.fr/mrim/essir03/PDF/4.Rijsbergen.pdf
- S. E. Robertson, Computer Retrieval (http://www.soi.city.ac.uk/~ser/papers/j_doc_history/npap.html)