The IRF Organisation
Text mining and Indexing:
Assessing the Results of
Deeper Indexing for
Patent Search
John Tait
Chief Scientific Officer
IRF
Acknowledgements
• Mihai Lupu of the IRF
• Jian-Han Zhu of University College London
• Jimmy Huang of York University of Canada
• Giovanna Roda and colleagues in the CLEF IP
team at Matrixware and the IRF
• Royal Society of Chemistry for generously
making their Scientific Journal Collections
available to us
IRF Member Services
An apology
• The content of this talk was planned on the
basis that we could discuss the detail of TREC
CHEM. Over the weekend it became clear that
NIST policy is that the results should not be
made public until the TREC conference in
the US in November
• Therefore much detail had to be removed
Outline
• Introduction to the IRF
• TREC CHEM
– 2009
– Plans for the future
• Summary and Conclusions
The IRF
The Information Retrieval Facility
An international not-for-profit institution,
founded in 2006, based in Vienna, to
promote and facilitate research in large
scale information retrieval
The IRF Mission
To bridge the gap between the needs of the
industry and the academic know-how.
To maintain a facility that enables large
scale information retrieval and in-depth
processing of data for research
To bring the latest information retrieval
technology to the community of patent
professionals and other professional
searchers.
IRF – Founding Members
The Information Retrieval Facility
A platform initiated by Matrixware
which:
improves the transfer of knowledge
between professionals in Intellectual
Property and Information Retrieval
and
promotes collaboration between
experts on the development of new
research methodologies for
international patent data.
Distinctive Patent Search
Characteristics
High Recall: a single missed document can
invalidate a patent
Session based: a single search may
involve days of cycles of results review
and query reformulation
Defendable: Process and results may need
to be defended in court
CLEF-IP
The goal of the CLEF-IP track is to investigate
multilingual IR techniques in the Intellectual
Property domain.
• Target data: > 1 million granted EPO patent documents in
three languages: English, German, French
• Tasks: prior art search, invalidity search
• Test collection constructed using the available EPO
prior art reports
IRF Member Services
• Scientific Members
– Access to data and resources
– Project links to industry
• Industrial Members
– Consultancy and research in IR and IP search
– Training and support: systems evaluation,
semantic computing
– Links to academia
TREC Chemistry
Information Retrieval
Track 2009
John Tait
Chief Scientific Officer
IRF
TREC
• Organised by the US National Institute of
Standards and Technology (NIST)
• Has run annually since 1992
– Originally focused on ad hoc text retrieval
with long queries
– Regularly extended
• Video
• Web
• Genomics
• Legal
Origins of TREC CHEM
• IRF approached NIST about using our patent
data and computing facilities as a means to
promote scientific co-operation
• Jian-Han Zhu, then of the UK Open University,
approached NIST about Chemistry
• NIST were interested in domain-specific
retrieval to follow up the Genomics track etc.
and helped us get going
Data
• 1.2 million patent files (IRF)
• 59,000 scientific articles (RSC)
Tasks
• Technology Survey
– Search for all potentially relevant
documents, in both collections.
– 18 manually defined and evaluated topics
• Prior Art
– Search for patents that may invalidate a
given patent
– 1000 automatically created and evaluated
topics (1000 patent files)
Participants
• 15 institutions registered to get the data
– 6 submitted 31 runs for the TS task:
• University of Applied Science Geneva, Information
Retrieval Laboratory of Dalian University of
Technology, Fraunhofer SCAI, Milwaukee School of
Engineering, Purdue University, York University
– 8 submitted 59 runs for the PA topics:
• University of Applied Science Geneva, Carnegie
Mellon University, Information Retrieval
Laboratory of Dalian University of
Technology, University of Iowa, Fraunhofer SCAI,
Milwaukee School of Engineering, Purdue
University, York University
Methods
• Basic vector space model
– Different sections, weights on each section
– BM25
• Additional filtering/weighting based on IPC
codes
• Linguistic processing
– Emphasis on Noun Phrases
• Concept based search
• Query expansion
– Using OSCAR3, MeSH
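The first two bullets (per-section weights, BM25) can be illustrated with a minimal sketch. It is not any participant's actual system: the toy documents, the section names, the weights and the choice to combine per-section BM25 scores linearly are all assumptions made for illustration (a production system might use BM25F or a search library such as Lucene instead).

```python
import math
from collections import Counter

# Hypothetical toy collection: each patent is split into sections.
DOCS = {
    "P1": {"title": "catalyst for olefin polymerization",
           "abstract": "a zirconium catalyst for polymerization of olefin monomers"},
    "P2": {"title": "battery electrode",
           "abstract": "lithium electrode material with a polymer binder"},
    "P3": {"title": "polymerization reactor",
           "abstract": "reactor design for continuous olefin polymerization"},
}
# Assumed section weights; real systems would tune these.
SECTION_WEIGHTS = {"title": 2.0, "abstract": 1.0}


def tokenize(text):
    """Lower-case whitespace tokenizer (no stemming)."""
    return text.lower().split()


def bm25_section(query, section, k1=1.2, b=0.75):
    """Score every document on one section with Okapi BM25."""
    toks = {d: tokenize(secs[section]) for d, secs in DOCS.items()}
    n = len(toks)
    avgdl = sum(len(t) for t in toks.values()) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in toks.values() for term in set(t))
    scores = {}
    for d, t in toks.items():
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[d] = score
    return scores


def search(query):
    """Weighted sum of per-section BM25 scores, best document first."""
    total = {d: 0.0 for d in DOCS}
    for section, weight in SECTION_WEIGHTS.items():
        for d, s in bm25_section(query, section).items():
            total[d] += weight * s
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)


ranking = search("olefin polymerization")
```

Because the tokenizer does no stemming, "polymer" in P2 does not match "polymerization", so P2 scores zero and ranks last; this is the kind of gap that linguistic processing and query expansion are meant to close.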
Evaluations
• Technology Survey tasks
– 8 chemistry grad students
– 5 experts
– Each topic evaluated by 2 students and 1
expert
• Prior Art tasks
– Automatically evaluated based on citations
within patents and family members
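One way to picture the automatic Prior Art evaluation is the sketch below. The slide says only that judgements are based on citations and family members; the exact procedure is not given, so the expansion rule, the data structures and every identifier here are hypothetical.

```python
# Hypothetical data: all patent identifiers are made up for illustration.
CITATIONS = {"EP-T1": {"EP-A", "US-B"}}  # topic patent -> patents it cites
FAMILIES = [{"EP-A", "US-A2"}, {"US-B"}, {"EP-T1", "WO-T1"}]


def family_of(doc_id):
    """Return the patent family containing doc_id (itself if unknown)."""
    for family in FAMILIES:
        if doc_id in family:
            return family
    return {doc_id}


def build_qrels(topic_id):
    """Assumed rule: a document is relevant if it is cited by the topic
    patent (or one of its family members), or belongs to the family of
    such a cited document."""
    relevant = set()
    for member in family_of(topic_id):
        for cited in CITATIONS.get(member, set()):
            relevant |= family_of(cited)
    # The topic's own family is not counted as prior art.
    return relevant - family_of(topic_id)


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


qrels = build_qrels("EP-T1")
score = recall_at_k(["US-A2", "EP-X", "US-B"], qrels, k=2)
```

Recall is used here because high recall is the characteristic of patent search stressed earlier in the talk; the track itself may report other measures.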
Initial Results
• Manual evaluations have some conflicting
results
– No more than for other manually evaluated
topics
• Using entity recognition and synonyms
proves successful
– Some groups manually extended the queries
• “Simple methods” also seem to perform
well (e.g. Lucene-based, BM25)
– E.g. for Inferred Average Precision they
reach 97% of highest score
• Disclaimer: results analysis is still ongoing
TREC CHEM 2010 onwards
• Subject to discussion at TREC in November
– Increased numbers of patents
– Include images
– Task extensions/refinements
• Searching for numerical ranges (independent of
unit)
• Searching for specific roles of specific chemical
components
• The use of Markush structures
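The proposed numerical range task can be sketched as follows: convert every range to a common base unit, then test for interval overlap. This only illustrates the idea and is not a description of the planned task; the unit table, function names and example values are assumptions, and extracting ranges from patent text is not covered.

```python
# Assumed conversion factors to grams; a real system would need a far
# larger unit table and handling for offset units such as temperature.
TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}


def normalize(low, high, unit):
    """Convert a (low, high) range in the given unit to grams."""
    factor = TO_GRAMS[unit]
    return (low * factor, high * factor)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def matching_docs(query_range, doc_ranges):
    """Return ids of documents whose range overlaps the query range."""
    q = normalize(*query_range)
    return [doc_id for doc_id, rng in doc_ranges.items()
            if overlaps(q, normalize(*rng))]


# Hypothetical documents, each stating a mass range in its own unit.
doc_ranges = {
    "D1": (700, 1500, "mg"),
    "D2": (3, 5, "kg"),
    "D3": (1, 4, "g"),
}
hits = matching_docs((0.5, 2, "g"), doc_ranges)
```

Normalizing at index time rather than at query time would avoid repeated conversion, at the cost of fixing the base unit in advance.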
Summary and Conclusions
• The IRF is promoting collaboration between
information retrieval and intellectual
property professionals through promoting
evaluations and joint technology
development projects
• TREC CHEM has provided an objective and
independent means of assessing the
effectiveness of technologies on two sorts
of retrieval tasks
IRF Newsletter Chemistry Issue:
http://www.ir-facility.org/the_irf/newsletter
Thank you for your attention
Any questions?
www.ir-facility.org
www.matrixware.com
www.matrixware.net