A Risk Minimization Framework for Information Retrieval


Research Problems & Topics
(Literature Domain)
(CS598-CXZ Advanced Topics in IR Presentation)
Feb 1, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Research Area Mining
• A single department covers many research branches; Computer Science, for
example, includes Artificial Intelligence, Machine Learning, Data Mining, and
Computer Vision. What are the relationships between these areas? For example,
Machine Learning is strongly related to Data Mining and Computer Vision, and
Data Mining is often correlated with Information Retrieval. Could we find
these relations from the Web? Could we find or anticipate newly emerging or
interdisciplinary areas?
• Users: students, faculty
• Data: Faculty homepages are good sources, since faculty members usually
state their interests and publications there. If one professor has more than
one interest, those areas are probably related. If two professors collaborate
on a paper, their interests are probably related. Such an application may help
faculty and students find new interests.
• Functions: Research area relation mining.
• Challenges: How to recognize faculty members’ interests? How to mine the
relations?
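
The co-occurrence heuristic described on this slide (two areas listed by the
same professor are probably related) can be sketched as a simple pair count.
This is only a minimal illustration: the function name `area_cooccurrence` and
the sample data are hypothetical, and it assumes the interests have already
been extracted from the homepages, which is the hard part named under
Challenges.

```python
from collections import Counter
from itertools import combinations

def area_cooccurrence(faculty_interests):
    """Count how often two research areas are listed by the same person."""
    counts = Counter()
    for interests in faculty_interests.values():
        # Sort so that each unordered pair of areas has one canonical key.
        for a, b in combinations(sorted(set(interests)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical, hand-made data; a real system would extract it from homepages.
faculty = {
    "prof_a": ["machine learning", "data mining"],
    "prof_b": ["data mining", "information retrieval"],
    "prof_c": ["machine learning", "data mining", "computer vision"],
}

related = area_cooccurrence(faculty)
for pair, n in related.most_common(3):
    print(pair, n)
```

Pairs with higher counts are candidates for "related areas"; co-authorship
between professors could be added as a second source of pairs in the same way.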
Paper Classification & Organization
• The problem is to classify published papers in the CS domain into different
sub-areas and organize them in chronological order.
• Currently, a researcher who wants to know what has or has not been done in a
field has to search the web in an ad hoc way, and it is easy to miss important
publications this way.
• If this task is done, researchers doing a literature survey in a specific
area will benefit a lot. For example, a data mining researcher who wants to
know what has been done on frequent pattern mining can input “frequent pattern
mining” and get all the relevant papers in chronological order, which makes
the literature survey much easier.
• The major challenge is how to summarize and classify the papers correctly.
If a paper is interdisciplinary, it should be assigned to every related field.
Automatic Survey Generation
• A useful tool for researchers may be a program that can automatically
generate surveys on topics given by the user, and/or recommend the
state-of-the-art technique that works best for the user's topic or problem.
• The users of such a program are either researchers who want to find out the
current state of the art in research areas that are new to them (or simply
new), or researchers and engineers who want to use existing methods/tools as
building blocks for their research/products at a higher level.
• The problem is not trivial because
– (1) for new research topics, there is probably no survey paper published
yet;
– (2) even for old research topics, the existing survey papers may already be
outdated.
• Therefore, the data involved in this challenge include not only existing
survey papers on the given topic, but also the most recent research papers
that address the problem.
• The problem is challenging in several respects.
– (1) Some topics (especially new topics) may not be well defined. When
searching for relevant papers, the system needs to consider different ways of
describing the problem in addition to what the user provides (similar to query
expansion).
– (2) Summarizing the methods proposed in different papers and comparing their
pros and cons may be difficult. This may involve text summarization and
information extraction. For example, can the system identify a benchmark for
the given problem and compare the performance of different methods on that
benchmark?
– (3) If the user provides constraints/requirements, can the system recommend
the method that best fits the user's needs? This may involve more
sophisticated techniques.
Note Taking System
• When a student studies a new subject, it is
sometimes hard to distinguish what is more
important in the subject from what is less
important. By collecting many textbooks,
class handouts, and lecture notes, we might
automatically generate notes for the
subject. This would help students learn a
new subject quickly.
Integrated information system for
bioinformatics sources
• Functional analysis, which studies how a
biological entity is functionally related to other
biological entities, is a major research issue in
modern biology. To perform successful
functional analysis, biologists must integrate
data from multiple sources, which is usually
done largely by hand. Hence, developing
automatic techniques to integrate genomic data
has become critical to successful
functional analysis.
– Users: biologists, bioinformaticists, etc.
– Data involved: biomedical literature, biological entities, etc.
– Functions to be developed: text search, relational query, etc.
Evolutive Text Mining
• In literature collections, there are hundreds of papers in each area every
year. Concepts, problems, and technologies not only evolve over time within
each field, but are also involved in interdisciplinary interactions. Taking
concepts as an example: as time goes by, some concepts die out, some emerge,
some are borrowed from other fields, some merge, and some split. Concepts in
different fields (collections, communities) may have different names but share
analogous content and a similar evolution path.
• If we can model the evolution of concepts/problems/technologies in one
field, we can understand the evolution of that field well, and sometimes even
predict how it will change. For an even more ambitious scenario, suppose A, B,
C, ... are techniques in field 1, and A', B', C' are their analogous
techniques in field 2. Suppose we discover two evolution paths:
Field 1: A -> B -> (+D) -> C -> (+E) -> F;
Field 2: A' -> B' -> (+D') -> C'.
C and C' share a similar evolution process in fields 1 and 2. Does this
indicate that introducing a technique E' (analogous to E in field 1) might
bring about the next development of C' in field 2?
• This would be very useful for scientific researchers. Using Comparative Text
Mining (CTM), we are able to find analogous concepts in different fields, and
if we can model the evolution of concepts well, this task becomes possible.
• Users: Scientists, researchers. Data: Scientific literature, for example,
Honeybee data and Flybase data.
• Functions: Finding analogous concepts across collections; modeling the
evolution paths in each collection; comparing the paths in different
collections and making predictions from them.
• Challenges: How to find a good model of concept evolution. How to use CTM to
define analogous concepts.
Topic Evolution Discovery
• Challenge: To find and discover how a topic has
evolved through time
• Users: Researchers in different fields, managers
who want to streamline the company's process by
looking for inefficiencies, etc.
• Data: Scientific literature, company documents
• Method: In the simplest sense, it may be
interesting to find the function parent_of(A,B)
where A and B are documents and much of the
content of B comes from (or is influenced by) A. With
this function and a timestamp for each document,
it should be possible to create a timeline that
shows the lineage of a concept.
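
As a rough illustration of the parent_of(A,B) idea, the sketch below treats A
as a parent of B when A is older and covers a large share of B's vocabulary,
then lists parent/child pairs in time order. The word-overlap test, the 0.5
threshold, the `lineage` helper, and the sample documents are all assumptions
made for this example; the slide does not specify how influence is measured.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())

def parent_of(a, b, threshold=0.5):
    """True if a is older than b and covers >= threshold of b's vocabulary."""
    if a["year"] >= b["year"]:
        return False
    tb = tokens(b["text"])
    if not tb:
        return False
    overlap = len(tokens(a["text"]) & tb) / len(tb)
    return overlap >= threshold

def lineage(docs, threshold=0.5):
    """List (parent_id, child_id) pairs, ordered by the child's timestamp."""
    ordered = sorted(docs, key=lambda d: d["year"])
    return [(a["id"], b["id"])
            for b in ordered for a in ordered
            if parent_of(a, b, threshold)]

# Hypothetical documents with a year as the timestamp.
docs = [
    {"id": "d1", "year": 1998, "text": "language models for retrieval"},
    {"id": "d2", "year": 2001, "text": "smoothing language models for retrieval"},
    {"id": "d3", "year": 2004, "text": "image segmentation with graph cuts"},
]
print(lineage(docs))
```

A realistic version would likely replace raw word overlap with citations or a
weighted similarity measure, but the timeline construction stays the same.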
Paper Writing Support
(Finding Related Work)
• Help with research paper writing: when I am writing a paper, one tedious
task is composing the related work section. Usually, I only have a few
competing or reference papers, but the related work needs a more thorough
survey, so as to avoid unnecessary objections from reviewers. It would be good
to have a system to which I can give some articles, or some paragraphs from
the paper in progress, and which returns typical related works together with a
rough organization of them by research topic. For example, given this note,
the system may return some papers about searching in local cached pages, some
about email categorization, and some about paper retrieval and summarization.
• To retrieve related papers, the system may need content-based information
retrieval together with a link-based approach on the collection obtained by
expanding the citations in the given articles. To give each paper a short
summary, we may apply some summarization technique to each paper, or just
extract the sentences in other papers that mention the target paper when
citing it.
• Organizing the resulting papers can be achieved by classification using a
well-defined research topic hierarchy, or by clustering if no such topic
information is available in advance.
• User: Research paper writers
• Data: Research papers
• Function: Given some papers, return typical related works together with a
summary of each paper, as a reference for composing the article, and some
topical information about each paper that helps organize the related papers.
Automatic Identification of
Related Literature
• IR for literature may be the most important since the
information is authoritative. Google and Citeseer can index
papers in PS and PDF form and Citeseer appears to
automatically extract the special fields from the document
(e.g. title, author, bibliography).
• Perhaps an interesting next step would be to make
browsing through the documents more tractable by
automatically identifying related literature.
• Possible ways to find related literature would be word-level
similarity (common keywords), bibliographic similarity, and
shared publication context (same conference, same workshop,
same author, etc.). To keep the suggestions from overwhelming
the user, some user feedback would seem necessary. If
suggested literature from the same workshop is not
relevant, the system might suggest documents using a
different heuristic.
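
The heuristics listed on this slide, together with the feedback-driven
fallback, might be sketched as follows. Everything here is an assumption for
illustration: the Jaccard measure, the heuristic names, the paper fields
(`keywords`, `refs`, `venue`), and the rule that a rejected heuristic is
simply skipped.

```python
def jaccard(x, y):
    """Size of the intersection divided by size of the union of two sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0

# Heuristics are tried in this order (dicts preserve insertion order).
HEURISTICS = {
    "keywords": lambda p, q: jaccard(p["keywords"], q["keywords"]),
    "bibliography": lambda p, q: jaccard(p["refs"], q["refs"]),
    "venue": lambda p, q: 1.0 if p["venue"] == q["venue"] else 0.0,
}

def suggest(target, papers, rejected=()):
    """Rank other papers with the first heuristic the user has not rejected."""
    for name, score in HEURISTICS.items():
        if name in rejected:
            continue
        candidates = [q for q in papers
                      if q["id"] != target["id"] and score(target, q) > 0]
        ranked = sorted(candidates, key=lambda q: score(target, q), reverse=True)
        return name, [q["id"] for q in ranked]
    return None, []

# Hypothetical paper records.
papers = [
    {"id": "p1", "keywords": ["retrieval", "ranking"], "refs": ["r1", "r2"], "venue": "SIGIR"},
    {"id": "p2", "keywords": ["retrieval", "feedback"], "refs": ["r2", "r3"], "venue": "CIKM"},
    {"id": "p3", "keywords": ["vision"], "refs": ["r1", "r2"], "venue": "SIGIR"},
]

print(suggest(papers[0], papers))
print(suggest(papers[0], papers, rejected={"keywords"}))
```

When the user marks the keyword-based suggestions as unhelpful, the second
call falls back to bibliographic similarity, which ranks p3 ahead of p2
because p3 shares both references with p1.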
Limited Domain Question
Answering
• Help Windows programmers find solutions to a technical problem.
• Users: Windows programmers
• Data: MSDN library and knowledge base, and maybe external sources like
articles on codeproject.com
• Description: Develop a system that quickly helps programmers find Win32
APIs or sample code that can solve a particular technical problem. If the
terms the programmer uses to describe the problem differ from those used in
the documentation, or if the solution is not explicitly stated in one place
but scattered across several documents, it may be difficult to find the
solution. For example, if one wants to find out how to convert a DOS 8.3
filename to a Windows long filename, searching MSDN for “convert file name 8.3
long” returns GetFullPathName, and one has to read its documentation carefully
to discover that the API that actually does the job is GetLongPathName. This
happens because “8.3” is never mentioned in the documentation of
GetLongPathName, only in the documentation of GetFullPathName as a side note.
It would be nice if the system could collect all this information and give the
programmer a direct answer. This may be a challenge because it may need
sophisticated NLP analysis like that used in question answering.
Personal Literature Management
• Researchers store many papers on the local disk. Sometimes, it is
hard to find the downloaded literature. So it is important to
organize these papers and provide some functionality such as
search to the user.
• Every researcher will benefit from this tool.
• The personal literature data collection is the data this tool
manages.
• The functionality will include search (find relevant papers) and
classification. The user provides a hierarchy, then the system will
associate each paper with several tags automatically. Of course,
some papers will be tagged by the user so that some training data
is provided.
• The challenge of this project will be how to design and
implement such a system and choose the best search and
classification algorithms suitable for a personal literature collection.
Topic-Specific Paper Rank
• Users always prefer good papers. Good papers can be
divided into two types: good survey papers, which cover all the
major topics of an area, and good technical papers, which set a
new direction or address specific problems thoroughly.
However, whether a paper is good is area-dependent. For example,
one user would like a good paper on Information Retrieval, while
another would like a good paper on Data Mining. The question
here is how to rank papers within their areas. Such an
application may also reveal the need for a new survey paper
when no good survey paper can currently be found.
• Users: researchers, scientists, graduate students.
• Data: literature materials
• Functions: Paper search and topic-specific ranking
• Challenges: How to identify a paper as a good survey paper, how
to identify a paper as a good technical paper, and how to classify a
paper into a specific domain? How to use author information in
the paper ranking?
Starting Point for Research
Name: Starting Point for Research in Any Area
User: Any faculty member or student who is looking at entering a new area of
research.
Data Involved: All the papers and/or indexed summaries available on the web.
Function: Whenever a researcher wants to enter a new area, he/she faces a big
question: how, or from where, should I start? Finding an answer to this
question can be a difficult, or at least time-consuming, task. It would be
great to have a system that can gather and summarize all the information about
all the papers (including classic, highly referenced, cutting-edge, etc.) and
also all the people who work in the related areas (including summaries and
information about their publications, projects, affiliations, etc.). This
intelligent system could generate a route through which the user can get all
the information he/she needs in order to start getting into the desired area.
Automatically Discover
Cause-Effect Relationship
In the literature, facts expressed as cause-effect relationships are common,
especially in medical, legal, and historical literature. To discover them, a
person needs to read all the related documents, remember most of the facts,
and reason well. However, with the huge amount of literature in each field
today, no one can do that thoroughly. Most successful attempts involve some
luck, namely reaching the right documents at the right time.
If this task were done automatically, much useful and perhaps surprising
knowledge would not be missed. Based on this, we could build a new kind of
expert system that works directly with knowledge in the form of literature.
Users: Researchers, lawyers, historians.
Data: Existing literature, especially in medicine, law, history, and
chemistry.
Challenges: Recognizing causes and effects and connecting them is extremely
hard.
Literature Network
• One topic is how to automatically build a "literature
network" for a topic. In a literature network, every node is a
paper related to the topic, and every edge between
nodes is annotated with the relation between the two
papers (i.e., why one paper cites the other and how the two
papers are related). Such a literature network will give
users a whole picture of the area, which makes a literature
survey easier. Note that there are two major types of
citation between papers: one is to known
techniques (not necessarily related to the topic) and the
other is to previous and related work. The first
type of citation should not be included in the network.
• The users are researchers.
• The data are conference papers, journal papers, and books.
• The major challenge is how to identify the relations
between two papers, which involves techniques from
information extraction, information summarization, and text
categorization.
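
A literature network as described on this slide could be stored as a set of
annotated edges, as in the minimal sketch below. The class name, the two
citation kinds (`"related_work"` and `"technique"`), and the sample papers are
hypothetical, and the sketch assumes the citation type and the relation note
are already known; identifying them automatically is the challenge named on
the slide.

```python
from dataclasses import dataclass, field

@dataclass
class LiteratureNetwork:
    # Maps (citing_paper, cited_paper) to a note describing their relation.
    edges: dict = field(default_factory=dict)

    def add_citation(self, citing, cited, kind, note=""):
        # Only citations of previous/related work enter the network;
        # citations of known techniques are left out.
        if kind == "related_work":
            self.edges[(citing, cited)] = note

    def neighbors(self, paper):
        """Papers connected to `paper` by an edge in either direction."""
        return sorted({p for pair in self.edges if paper in pair
                       for p in pair if p != paper})

net = LiteratureNetwork()
net.add_citation("paper_A", "paper_B", "related_work", "extends the model of B")
net.add_citation("paper_C", "paper_A", "related_work", "compares against A")
net.add_citation("paper_A", "em_algorithm", "technique", "uses EM for fitting")

print(net.neighbors("paper_A"))
```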
Statistical models for Peptide
Tandem Mass Spectrometry Data
Analysis
• Description: Molecular biology has been revolutionized by the advent of
high-throughput experimental methods that can investigate thousands of genes
or proteins in parallel. With the great success of microarray analysis
techniques for genomics, mass spectrometry based proteomics has become the
next hot topic in the literature. However, unlike the reliable microarray
based analysis methods for genes, interpreting high-throughput peptide tandem
mass spectrometry data is still an open problem. The large volume of data
generated by peptide tandem mass spectrometry experiments is full of noise,
and the underlying biochemical principles are not fully known. How to use
these data to extract useful information and knowledge remains a problem.
• In this project, our long-term research goal is twofold:
– 1. determine which proteins are present in the tissue samples.
– 2. determine the quantitative ratios of different proteins.
• Under the guidance of these two directions, many sub-goals can be derived,
such as how to design efficient and effective scoring functions for sequence
database searching, how to design probabilistic models to simulate
interactions between different proteins, and how to derive useful features
from pure peptide sequences and spectrum data.
• The rough outline for this project is:
– 1. do a literature survey and write a review report summarizing what other
researchers are currently doing.
– 2. identify one or two promising topics from the literature survey.
– 3. conduct the research work and get some initial results.
– 4. finish the course project paper.
• Group: this project could be a one-person project, since it needs some
biology background in peptide tandem mass spectrometry experiments as well as
machine learning knowledge. However, if other students are really interested
in it, it may be expanded to a two-member team.
Possible Topics
• Literature Access
– Personal literature management
– Summarization
• Generating/Identifying Survey papers
• “Starting points”
• Literature Mining
– Literature networks/Find related work
– Research area mining
– Topic evolution mining
– Biology functional analysis/Question answering
Assignment 2 (for Literature
Team)
• Search on the web (starting with digital library
conferences, JCDL, and summarization work)
• Everyone identifies the one or two most interesting
papers that you would like to present
• Send me your choices by this Saturday (Feb. 5)
• Need one volunteer for presenting a literature
paper on Feb. 10
• Possible choices:
– Building domain-specific web collections for scientific
digital libraries: a meta-search enhanced focused crawling
method
– Panorama: extending digital libraries with topical crawlers