Research Problems & Topics
(Web Domain)
(CS598-CXZ Advanced Topics in IR Presentation)
Jan. 25, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Faculty Homepage
Classification/Finding
• The problem is to classify faculty homepages
from different universities according to their
research field. If a student doing data mining
wants to apply to graduate programs at U.S.
universities, he can input “data mining, U.S.,
university”. The search result is the data mining
faculty homepages from different universities in
the U.S. This would help a lot, since people currently
have to navigate to different university websites,
go to the “faculty” list, and click on every faculty
name to find out whether that person’s interest is data
mining or not.
• The challenge is how to summarize the
homepages and classify them correctly.
Search by Relations
• Search by relations instead of words and phrases
• A typical search query today is based on keyword
matching. However, semantically rich, structured data are usually
hidden implicitly on the Web, so we need some
“mining” technologies to find these relations. E.g., if we
want to find Mike’s address, we take “Mike” and
“address” as keywords and search on Google. Typically all
the sources containing these two keywords are shown as
search results, most of which actually have nothing to
do with Mike’s address. We then have to dig deeply into
these results to look for the information we want.
– Users: everyone who has an internet connection
– Data involved: the whole web
– Functions to be developed: search by relations
Google Dictionary
• As a non-native English speaker, I often want to find out the correct usage
of a word, and more often, the correct usage of a phrase. Online
dictionaries usually only show a few examples. For certain phrases, online
dictionaries may not even have entries for them. If I try to search for the
word or phrase on Google, it usually finds web pages where the word or
the phrase only appears in the title, which is not very helpful.
• It would be very useful to have a Google dictionary so that if you type in a
word or a phrase, it shows how the word or phrase is popularly used, with
summarization and examples.
• The users of such a tool will be people to whom English is a second
language, or kids who are still learning new words and phrases. It would
also be useful for finding out the meanings of buzzwords. Since it is
supposed to be an English dictionary, the system should filter out web
pages that may contain improper usage of English.
• The data involved should most likely be news articles, online books,
essays, and other well-written English articles. The dictionary should
ideally summarize the usage of the word or phrase into several categories,
give examples for each of them, and maybe differentiate between formal
usage and informal usage. And just like other online dictionaries, this
dictionary should be able to correct the user’s spelling, or find the best
match if the user enters a phrase that does not exist.
Personalized Conceptual
Search Engine
• Task 1: An identical query may have different latent meanings. In real-world searching, people
usually prefer a certain aspect of a query. For example,
“apple” may mean “computer” and also “fruit”. “Java” may mean “island”, “coffee”, and
“programming language”. In general search, it is hard to tell which aspect of a concept a user
wants, but in personalized search, one user is likely to prefer one
aspect. Task 2: Sometimes it is hard to generate a good query from a certain need. For
example, a user wants to know “what did the American government say about …”. Most
articles may mention “Bush said …”, “Bill Clinton said …”, “Powell said …”. A query like
“American government …” may not get satisfactory results. This is because the American
government is a concept, which indicates a group of terms. Again, in general search,
modeling a concept is hard, because each concept may have a different meaning. In
personalized search, people may have stationary components for each concept.
• As a better scenario, when a user wants to know “the state-of-the-art of NLP”, an ideal
personalized system should first figure out that NLP means Natural Language Processing for this
user, and then figure out that it refers to POS taggers, parsers, etc. These are hard for a
general search engine but doable for personalized search. Interestingly, a person’s
name can be a very good example of a concept. The training data could be any kind of
text with personalized properties (not necessarily query history). For example,
word-usage statistics from the user’s articles, chat records, and other collections can
be very useful. All of this can be done on the client side, which avoids the privacy
problem.
– User: common users of search engines.
– Data: query history, personal collections of texts, articles, and chat records.
– Functions: concept clustering, summarization from texts; personal preference learning; query
modification by concept selection and splitting.
– Challenge: How to cluster terms into concepts from personalized texts. How to represent a
concept. How to do query expansion with the information of concepts.
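The query-expansion step in the challenge above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `expand_query` function and the `user_concepts` table are hypothetical, the table is written by hand rather than learned from the user's texts, and concepts are matched by simple case-insensitive substring search.

```python
def expand_query(query, user_concepts):
    """Append the user's own terms for every concept phrase found in the query.

    Assumption: concepts are matched by case-insensitive substring search,
    which is crude (e.g. "Java" would also match inside "JavaScript").
    """
    lowered = query.lower()
    extra_terms = []
    for concept, terms in user_concepts.items():
        if concept.lower() in lowered:
            extra_terms.extend(t for t in terms if t not in extra_terms)
    return query if not extra_terms else query + " " + " ".join(extra_terms)


# Hypothetical per-user concept table; a real system would have to learn it
# from the user's articles, chat records, and query history.
user_concepts = {
    "NLP": ["natural language processing", "POS tagger", "parser"],
    "Java": ["programming language"],
}

expanded = expand_query("state-of-the-art of NLP", user_concepts)
print(expanded)
```

Because the concept table is per user, a different user could map "NLP" or "Java" to entirely different terms without any change to the expansion code.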
Find in-depth knowledge about an
Entity
• To find in-depth knowledge for a particular entity on the
web.
• Example: I type in "Microsoft", and I want to find the
earnings, revenue, and locations for the Microsoft
corporation.
• Users: Researchers, stock analysts, people who work in
human resources, and anyone who wants to do research on a
person, company, or particular topic.
• Data: The complete web
• Method: The semantic web may be the solution to this
problem. However, it may still be useful to disambiguate
among entities with common names and cluster the pages,
then do summarization on them.
Web Search for Information Seen
Before
• I think most people have had the experience of remembering that they saw
something useful or interesting before, but being unable to figure out how to access it
again. For example, when planning dinner, one may happen to remember that s/he
saw an interesting recipe on a cookbook site and try to find it. However, s/he
has forgotten the name of the site and cannot find it after trying several different
queries on search engines. I believe this kind of information need often emerges
in our daily life.
• In such a situation, we usually still have a rough idea of the information. Sometimes
we can recall the original search context (when, or in what situation) and then figure out
how to query for it and access it again. But sometimes we have to give up after
several unsuccessful attempts.
• This kind of search is different from general Web search. The user is each
individual who surfs and searches the Web. The search content is
the pages that s/he has accessed. The challenge is how to help the user clarify
his or her memory so as to figure out the context needed to access the target information.
• There are many possible approaches to this problem. For example, the Internet
front end (e.g., IE) can log the queries that the user has submitted and interactively
help the user refine the query. Another possibility is that a personal agent indexes
all the pages accessed before and searches these cached pages. Although the
search space of such cached pages is much smaller than the whole
Web, the space available on the local computer limits the indexing capability. It may not be
possible to index all cached pages, so how to index and search would be critical.
• To summarize, User: each individual user. Data: cached Web pages + Web. Function:
efficient indexing and search on the local computer.
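The personal-agent idea above (index the pages already visited, then search only those) can be illustrated with a minimal in-memory inverted index. This is a sketch, not a proposed design: the `CachedPageIndex` class, its method names, and the sample URLs and page texts are hypothetical, and it ignores the storage limits discussed above.

```python
import re
from collections import defaultdict


class CachedPageIndex:
    """Toy inverted index over pages the user has already visited (hypothetical)."""

    def __init__(self):
        # term -> set of URLs whose cached text contains that term
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase, then keep runs of letters and digits as terms.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add_page(self, url, text):
        for term in self.tokenize(text):
            self.postings[term].add(url)

    def search(self, query):
        """Return the sorted URLs that contain every query term (boolean AND)."""
        terms = self.tokenize(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


# Made-up cached pages for illustration.
index = CachedPageIndex()
index.add_page("http://cookbook.example/pasta", "Easy pasta recipe with garlic and basil")
index.add_page("http://news.example/sports", "Local team wins the basketball game")
print(index.search("garlic recipe"))
```

A real agent would also need to persist the index and decide which pages to drop when local space runs out, which is exactly the open question raised in the slide.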
Infer User Preferences over
Websites
• Users: Search engine users
• Data: Search results
• Description: People have preferred websites for
different searches. For example, if I am searching
for a paper to download, I would probably only
look at CiteSeer and ACM because they usually
have a paper download link. When searching for
news, I trust the NY Times over other news sources. It
would be convenient to give more weight to
results from these sites. The preferences can be
inferred from implicit or explicit feedback, but the
challenge is that the preferred websites change with
the search topic.
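One simple way to picture topic-dependent site preferences is to keep click counts per (topic, site) pair and add a bonus to results from sites the user has clicked for that topic. The sketch below assumes the topic of each query is already known; the `SitePreferences` class, the `weight` parameter, the scoring formula, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SitePreferences:
    """Per-topic click counts for each site (hypothetical model)."""

    def __init__(self):
        # topic -> site (host name) -> number of clicks observed
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, topic, url):
        # Implicit feedback: the user clicked this result for this topic.
        self.clicks[topic][urlparse(url).netloc] += 1

    def rerank(self, topic, results, weight=1.0):
        """Re-rank (url, score) pairs, boosting sites preferred for this topic."""
        site_clicks = self.clicks.get(topic, {})
        total = sum(site_clicks.values())

        def adjusted(item):
            url, score = item
            # Share of this topic's clicks that went to the result's site.
            share = site_clicks.get(urlparse(url).netloc, 0) / total if total else 0.0
            return score + weight * share

        return sorted(results, key=adjusted, reverse=True)


prefs = SitePreferences()
prefs.record_click("papers", "http://citeseer.example/paper1")
prefs.record_click("papers", "http://citeseer.example/paper2")
prefs.record_click("news", "http://nytimes.example/story")

results = [("http://other.example/p", 0.6), ("http://citeseer.example/p3", 0.5)]
reranked_papers = prefs.rerank("papers", results)
reranked_news = prefs.rerank("news", results)
print(reranked_papers[0][0])
print(reranked_news[0][0])
```

The same result list is ordered differently for the "papers" and "news" topics, which is the behavior the description asks for; deciding the topic of a query is left open here.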
Structural Search
• If we submit a long sentence or a paragraph to
Google, most of the time Google is not able to handle
it. Structural search differs from traditional
keyword search in that it also takes into account the
structural information in the query sentences.
• The users could be any general users.
• The data involved are the text
web pages on the internet.
• The key functions in the problem are
document indexing and document searching.
Price Extraction and
Comparison
• Companies such as Walmart and Best Buy usually price
their merchandise based only on their purchase cost and the
amount of goods in stock. However, if competing
companies offer a lower price at the same time, most
customers will spend their money there. The
consequence is that the goods cannot be sold and the company
will incur more cost.
• If we can build an information retrieval system that collects
all the retail prices from competing companies,
then we can price the merchandise more competitively.
• The challenges in this system include how to find all prices
of a given item on the web, and how to account for
factors that make the price not directly visible, such as
sale or coupon information.
Domain-Specific Search
Currently, search engines only deal with general search. That is, for any
query, they search the whole web. However, in many situations, people
know the answer is in a specific domain. For example, a student who would
like to find some references about a particular problem, say, “max flow of
the network”, knows the answer should be on a Web page that
belongs to the .edu domain. But using Google, it is hard to specify such a
constraint. One solution is to index only the Web pages in this domain.
Then a search engine can be built on such a domain. This is the topic of
domain-specific search. Another use of domain-specific search is to help
employers find suitable employees. For example, some organizations want
to recruit new staff from among current graduate students. They can rely on
a .edu domain search engine for this purpose.
• Users: Each domain will have a particular group of users. For example,
.edu has faculty, students, and potential employers.
• Data: the Web pages that belong to a domain
• Functions: keyword search; course search for the .edu domain.
• Challenges: What are the characteristics of the domain? People will agree that
.gov and .edu have different characteristics. How can we recognize these
characteristics to help search? What specific functions should be defined
for a domain? For example, .edu may support a “professor” or “course”
search function.
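The "index only the pages in this domain" step can be sketched as a filter applied to crawled URLs before indexing. This is a minimal illustration, assuming the domain is identified purely by the host-name suffix; the `in_domain` and `filter_by_domain` helpers and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def in_domain(url, suffix):
    """True if the URL's host equals the suffix or ends with '.' + suffix."""
    host = (urlparse(url).hostname or "").lower()
    suffix = suffix.lower().lstrip(".")
    return host == suffix or host.endswith("." + suffix)


def filter_by_domain(urls, suffix):
    # Keep only the URLs that belong to the given domain.
    return [u for u in urls if in_domain(u, suffix)]


# Made-up crawl results for illustration.
crawled = [
    "http://www.cs.example.edu/courses",
    "http://www.example.com/edu",
    "http://www.agency.example.gov/news",
    "http://notedu.example.org/",
]
edu_pages = filter_by_domain(crawled, ".edu")
print(edu_pages)
```

Matching on the parsed host name rather than on the raw URL string avoids false hits such as a path that merely contains "edu".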
Research Area Relation Mining
• There are all kinds of research branches within one department, for example,
Artificial Intelligence, Machine Learning, Data Mining, and Computer
Vision for Computer Science. What is the relationship between these
areas? For example, Machine Learning has a strong relation with Data
Mining and Computer Vision. Data Mining is often correlated with
Information Retrieval. Could we find these relations from the Web? Could we
find or anticipate newly emerging areas or interdisciplinary areas?
• Users: students, faculty
• Data: I think faculty homepages are good sources. Faculty members
usually state their interests and their publications on their homepages. If one
professor has more than one interest, those areas are probably related. If
two professors collaborate on a paper, the two professors’ interests are
probably related. Such an application may help faculty and students find
new interests.
• Functions: Research area relation mining.
• Challenges: How to recognize faculty members’ interests? How to mine the
relations?
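The heuristic "if one professor has more than one interest, those areas are probably related" amounts to counting how often two areas co-occur on the same homepage. The sketch below assumes the interests have already been extracted (the hard part named in the challenges); the `area_relations` function and the sample faculty data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def area_relations(faculty_interests):
    """Count how often two research areas are listed by the same professor."""
    pair_counts = Counter()
    for interests in faculty_interests.values():
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(interests)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Made-up extracted interests; a real system would mine them from homepages.
faculty_interests = {
    "Prof. A": ["Data Mining", "Machine Learning"],
    "Prof. B": ["Machine Learning", "Computer Vision"],
    "Prof. C": ["Data Mining", "Information Retrieval", "Machine Learning"],
}
relations = area_relations(faculty_interests)
for pair, count in relations.most_common(3):
    print(pair, count)
```

Co-authorship could be folded in the same way, by treating the union of two collaborators' interests as one list; raw counts would then probably need normalizing by how common each area is.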
Fuzzy Matching for Web Search
• Topic: A more intelligent search engine
• Description: The current search engine "Google", even though
powerful, is not "smart" enough: it can only conduct exact search
with "keyword" matching. This works only under the
assumption that the user can specify the "best" keywords. If the user
has only some vague ideas, Google may not be good
enough.
• Therefore, a functional component for "vague searching" could be
added. An example: when a user wants to buy
a cheap computer, he may input a batch of keywords: "computer",
"PC", "cheap", "personal computer". If we use Google directly,
it may return results that contain "computer + PC + cheap +
personal computer". However, the desired result may be the sales
information page for DELL computers.
• The desired techniques would be natural language processing and
the semantic web.
More Expressive Query
Languages
• One interesting research topic is how to allow users to query the web
using a more sophisticated query language instead of just keyword queries.
Keyword queries have the advantage of simplicity, but they do not allow
users to specify their information needs precisely. For example, suppose you are a
new DAIS Ph.D. student preparing for the qualifying exam, so you
want to search for all the courses related to the database and information
systems area. You can send a query such as "related course database
information system" to Google. However, you will be very disappointed
with the results returned by Google, which only supports keyword
queries.
• For this topic, the user can be any web user, and the data can be any
indexed web pages.
• To solve this problem, many techniques will play an important role, such as
text summarization, text categorization, and information extraction. One
of the most challenging problems concerns the query language itself. Unlike
a traditional database, web data has no schema, which creates a huge
challenge for defining a query language. It is also worth discussing
whether we should have one universal query language or several specialized
query languages for different domains.
Automatic News and Information
Extractor, Classifier, and
Comparator
User: An ordinary Internet user who is looking for a way to read
news in a more classified and organized way from all the
sources he/she wants, spending the minimum amount of time.
Data Involved: The web pages specified by the user, and input data
entered by the user.
Function: This problem was motivated by a problem that I
have almost every day. There are two domains in which I
check news: first, social and political news, and second, sports
news. For each of these categories, I check several online
sources to get the desired breadth and depth of information.
However, this process takes a lot of my time every day. What I
am looking for is a piece of software (or a web page) that takes the links
to the online news/information sources that I use, entered once, and performs
the following actions either on a regular basis or on
demand:
Regular (or daily) function: Extract the mutually related news from
all the sources and provide me with titles, sources, summaries,
and links to the full content/articles. The
software is expected to do this separately for every interesting, hot, or
commonly discussed issue that appears in the media.
Personalized News Alerts
Right now, many people routinely read news online instead of
buying a newspaper. Generally, the user not only reads general
news such as politics and business, but also has possibly
multiple distinct interests, such as news in his
professional community and news about his hobby. At the
same time, the user has to visit multiple web sites to browse
what is happening daily, weekly, or monthly. If we
provide personalized News Alerts, a piece of desktop software that
can automatically retrieve, rank, and present the news of
personal interest to the user, the user can save a lot of time and
not miss important news.
Everyone can benefit from this software. News articles (maybe
in RSS format) and other web pages on the World Wide Web will be
crawled and filtered. For each person, we need only crawl a
subset of web sites. The personalized News Alerts will have
filtering functionality (filtering out uninteresting news),
organization (by topic, date, etc.), search (by
topic, date, etc.), and mining (the user can do comparative news
reading to get different opinions about the same event).
There are some challenges. The first is how to model the user's
interests. There are some clues about the user's personal
Possible Web Topic Areas
• Improving search engines
– Specialized Search Engines
• Special collections (domain specific, information seen
before)
• Specialized users/Personalized (infer user interests)
– Advanced query capabilities
• More powerful query languages (structured, relational,
semantic)
• Comprehensive/complete news service
• Web Information Extraction
– Price extraction
– English usage extraction
– In-depth knowledge about an entity
A possible group project
• Better CS website search
– More powerful query language
• customized to CS domain/ontology (e.g., courses,
publications, projects, etc)
– Better browsing support
• Adding structures to the collection
• Automatic annotation (virtual links)
– Academic ads
– Automatic report of “what’s new”
– CS domain news service?
Two Papers to Consider Presenting
1. Adaptive Web Search Based on User Profile Constructed
without Any Effort from Users, WWW 2004.
(http://www.www2004.org/proceedings/docs/1p675.pdf)
2. Query-Free News Search, WWW 2003.
(http://www2003.org/cdrom/papers/refereed/p707/p707-henzinger.html)
Assignment 2 (for Web Team)
• Read past WWW conference proceedings
(e.g., www2002-www2005)
• Everyone identifies the one or two most
interesting papers that you would like to
present
• Send me your choices by this Sunday (Jan.
30)
• Need one volunteer to present a web
paper on Feb. 3