Presentation_Aditya_Mantri
Data-Specific Web Search
By:
Aditya Mantri
Data on the web
The web is no longer only a means to share data
and information.
It comprises social networking sites, wikis, and blogs that
facilitate creativity, collaboration, and sharing among
users.
Steady rise in the amount and significance of nontraditional data.
E.g., a simple blog is plain textual commentary,
but it is now not uncommon to see photoblogs,
sketchblogs, vlogs, MP3 blogs, and podcasts.
Data on the web
Besides textual data, the web now contains, and can
be searched for:
Multimedia (images, audio, video, animations, etc)
Blogs
News
Scientific/Research papers
Source Code
Jobs
Travel
Health
Classifieds
Multimedia (MM)
Fundamental IR and searching techniques
cannot be applied to MM:
Multiple modalities – text, audio, still images &
video.
Size of MM data
A query in the form of words cannot be matched
directly to the raw multimedia file.
Methods of storage and indexing need to be efficient.
Therefore structural issues such as storage and
networking, as well as intelligent content analysis,
need to be addressed.
Multimedia (MM)
Basic challenge – understanding the user’s query.
Important to process raw multimedia and convert
into high level semantics.
Multimedia (MM)
Text based search – ‘query by word’
Metadata (filename, captions, tags)
Cues from text and HTML source code
Cues from image content (color, image size, file type, etc)
Content based search – when text annotations are
nonexistent/incomplete. ‘query by similarity’ or ‘query by
example’
Image (shape, color, texture)
Video (object motion, spatio-temporal relations)
Audio (humming for music, sampling rate, pitch, brightness,
bandwidth)
Relevance feedback – the query is entered using either of the above
methods, and the returned results are used to refine the user's query.
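As an illustration of 'query by example' for images, here is a minimal sketch that ranks a small collection by color similarity to an example image. The quantized color histogram, the histogram-intersection measure, and all function names and toy pixel data are assumptions chosen for illustration; the slides do not prescribe a particular feature or distance.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize (r, g, b) pixels (0-255 each) into a normalized color histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the overlapping mass of two normalized histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def query_by_example(example_pixels, collection):
    """Rank images in `collection` (name -> pixel list) by similarity to the example."""
    query_hist = color_histogram(example_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)]

# Toy "images": short pixel lists standing in for decoded image data (hypothetical).
collection = {
    "sunset": [(250, 120, 30)] * 8 + [(200, 60, 20)] * 2,
    "forest": [(20, 140, 40)] * 9 + [(90, 60, 20)] * 1,
    "beach": [(240, 200, 120)] * 5 + [(250, 120, 30)] * 5,
}
example = [(245, 110, 25)] * 10  # a mostly orange query image
ranking = query_by_example(example, collection)
print(ranking)
```

No text annotation is consulted here: the ranking depends only on pixel content, which is what distinguishes this from 'query by word'.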
Multimedia (Research)
Closely related to the research in the MM IR field.
Paradigm shift in Content Analysis – using very specific,
domain-level knowledge to bridge the semantic gap between
features and semantic concepts.
Content Mining and knowledge discovery
Automated content–based image and video Annotation
Better indexing techniques
Blogs
“A website or page that is the product of (generally)
an individual or of non-commercial origin that uses a
date-limited or diary format, and which is updated
either daily or at least regularly with new information
about a subject, range of subjects, or personal
details.”
Key differences to note –
Temporal information
Connected community based collections – blogosphere
Personal nature (adds to the subjectivity)
Varying Structure
Blogs
Search strategy
Use basic keyword type search.
Then, use clustering to reduce the number of results
returned.
Optionally, use interconnections between blogs to follow
a conversation and gauge the importance of a topic in the
blog.
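The first two steps of this strategy (keyword search, then clustering to condense the result list) could be sketched as below. The Jaccard similarity, the greedy single-pass clustering, the 0.3 threshold, and the sample posts are all illustrative assumptions rather than the method of any particular blog search engine.

```python
import re

def tokenize(text):
    """Lowercase a post and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(posts, query):
    """Return ids of posts that contain every query term."""
    terms = tokenize(query)
    return [pid for pid, text in posts.items() if terms <= tokenize(text)]

def jaccard(a, b):
    """Overlap of two token sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_results(posts, hits, threshold=0.3):
    """Greedy single-pass clustering: a hit joins the first cluster whose
    seed post is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_tokens, [post ids])
    for pid in hits:
        tokens = tokenize(posts[pid])
        for seed, members in clusters:
            if jaccard(seed, tokens) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((tokens, [pid]))
    return [members for _, members in clusters]

# Hypothetical blog posts keyed by id.
posts = {
    "a": "New camera review: the lens is sharp and the camera body is light",
    "b": "Camera review, part two: the lens is sharp but the body is heavy",
    "c": "My camera broke on holiday in Spain",
    "d": "Holiday recipes for the whole family",
}
hits = keyword_search(posts, "camera")
clusters = cluster_results(posts, hits)
print(clusters)
```

The user would then see one entry per cluster instead of every matching post; following links between blogs, the optional third step, is not modeled here.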
Research
Temporal Mining
Extraction of opinions from blogs
Domain-specific weblogs, using machine learning
techniques and probabilistic models.
News
News search engines, or news aggregators, compile
syndicated web content such as news
articles from various reliable sources.
Each search engine differs in the way it …
… crawls or indexes news articles (some aggregators just
scrape headlines)
… calculates the relevance of a story based on credibility
… involves any human intervention
News
Research:
Address the sheer volume of content an aggregator
returns to the user for a keyword. Some form of
clustering mechanism has been proposed.
Select credible articles by automatically filtering out
wrong articles.
A good metric is ‘commonality’
Other metrics such as ‘bias’, ‘objectivity’, etc have been
proposed.
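One way to read 'commonality' is as the degree to which an article's content agrees with what other sources report on the same story. The sketch below scores each article by its mean cosine similarity to the articles from the other sources; this interpretation, the bag-of-words representation, and the sample headlines are assumptions for illustration, not the exact formulation used in the cited work.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def commonality_scores(articles):
    """Score each article by its mean similarity to articles from other sources."""
    vectors = {src: term_vector(text) for src, text in articles.items()}
    scores = {}
    for src, vec in vectors.items():
        others = [cosine(vec, v) for s, v in vectors.items() if s != src]
        scores[src] = sum(others) / len(others) if others else 0.0
    return scores

# Hypothetical coverage of one story by three sources.
articles = {
    "source_a": "Storm closes airport, flights cancelled until Monday",
    "source_b": "Airport closed by storm; flights cancelled through Monday",
    "source_c": "Aliens close airport, officials deny everything",
}
scores = commonality_scores(articles)
least_common = min(scores, key=scores.get)
print(least_common)
```

An article with a low score is an outlier relative to the other coverage, which makes it a candidate for filtering; a low score alone does not prove the article is wrong.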
Common Issues and trends
Query formulation
Methods such as using domain semantics, represented as
ontologies, to specify/formulate queries
Improving relevance feedback
Improvement to the basic search infrastructure:
crawling and indexing.
Improving and utilizing general web search trends
such as personalization, improvement in UI design,
etc.
Improvement to link analysis.
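For the relevance feedback item in this list, a classic baseline that any improvement would be measured against is the Rocchio update, which moves the query vector toward documents the user marked relevant and away from those marked non-relevant. The slides do not name this algorithm; it is used here as an assumed example, with commonly cited default weights and made-up term vectors.

```python
from collections import Counter

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated term-weight vector for the query.

    query: dict of term -> weight
    relevant / non_relevant: lists of term -> weight dicts judged by the user
    """
    updated = Counter()
    for term, w in query.items():
        updated[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)
    for doc in non_relevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(non_relevant)
    # Terms whose weight is not positive are dropped from the new query.
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical feedback round for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1, "cat": 1, "wildlife": 1}]
non_relevant = [{"jaguar": 1, "car": 1}]
new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

After one round the query gains the terms "cat" and "wildlife" and never picks up "car", so the next search leans toward the sense the user meant.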
Unique, data-specific challenges
Multimedia
Bridging the semantic gap - allow the user to make queries
in their own terminology.
Creating multimodal analysis and retrieval algorithms
exploiting the synergy between the various media including
text and context information.
Effective browsing and summarization techniques need to be
addressed.
Creation of High Performance Indexes.
Unique, data-specific challenges
Blogs
Need to understand relationships between the title, body,
and comments to create better clustering algorithms for
selective blog search.
Need to understand the structure of a blog
Blogs can be written using abbreviations and slang,
in one or multiple paragraphs, formally or informally, etc.
Community structure and time stamping of blogs need to
be studied to extract cohesive discussions.
Unique, data-specific challenges
News
News search engine technology is still considered a black
art.
Indexing is a major challenge
It is difficult to crawl and extract snippets from sources that have
varying structures and patterns.
The clustering techniques used to perform similarity matches
need to be enhanced to avoid presenting a flat list of search
results to the user.
It is difficult to devise methods to determine the relevance of a news
article from non-credible sources.
Conclusion
There is a growing need to investigate and research various
issues related to data-specific web search.
Data-specific web search is quite different from traditional web
search
Multimedia web search primarily differs due to its multimodal
nature and size.
Blog and news searches differ mainly due to their amorphous
structure and temporal nature.
‘Search for meaning’ is significant across all of web
search: rather than searching only by entering keywords,
allowing users to make queries in their own terminology is
becoming important.
References
Articles from Wikipedia
Wall, Aaron, comp. "Search Engine History." 18 Feb. 2008 <http://www.searchenginehistory.com/>.
Wikipedia. "Web search engine" < http://en.wikipedia.org/wiki/Web_search>
Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information
Retrieval. Cambridge UP., 2008. 18 Feb. 2008. <http://www-csli.stanford.edu/~hinrich/informationretrieval-book.html>.
Emre Sokullu, "Search 2.0 - What's Next?". December 13, 2006.
John B. Horrigan, "For many home broadband users, the internet is a primary news source". 22
March 2006. Pew Internet & American Life
Project. <http://www.pewinternet.org/pdfs/PIP_News.and.Broadband.pdf>
"Multimedia Information Retrieval – Challenges", ACM SIGMM.
<http://sigmm.utdallas.edu/Members/nicu/mir/challenges/>
"Web search engine multimedia functionality" Tjondronegoro D., Spink A. Information Processing and
Management: an International Journal 44(1): 340-357, 2008.
Alan Hanjalic, Nicu Sebe, and Edward Chang "Multimedia Content Analysis, Management and
Retrieval: Trends and Challenges". Multimedia Content Analysis, Management, and Retrieval 2006.
Wikipedia. “Blog” <http://en.wikipedia.org/wiki/Weblog>
Beibei Li, Shuting Xu. "Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments".
ACM Southeast Regional Conference. Proceedings of the 45th annual southeast regional conference.
Phil Bradley. "Search Engines: Weblog search engines".
<http://www.ariadne.ac.uk/issue36/search-engines>
Yun Chen, Flora S. Tsai, Kap Luk Chan. "Blog search and mining in the business domain". Year of
Publication: 2007.
Lycos Retriever. "Google News" <http://www.lycos.com/info/google-news.html>
Yan, Wang, Guo, Yao, Lv, Wang, "The Optimization in News Search Engine Using Formal Concept
Analysis".
Ryosuke Nagura, Yohei Seki, Noriko Kando, Masaki Aono. "A method of rating the credibility of news
documents on the web".Year of Publication: 2006.
Thanks!