CS523 INFORMATION RETRIEVAL
COURSE INTRODUCTION
• YÜCEL SAYGIN
• SABANCI UNIVERSITY
Contact Info
[email protected]
http://people.sabanciuniv.edu/~ysaygin
Tel : 9576
No specific office hours. You can drop by anytime you like.
Email or call me first to make sure I am in the office.
Course Info
Reference Book: Introduction to Information Retrieval
Authors: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Publisher: Cambridge University Press, 2008.
Course Info
Grading:
• Homework: 10%
• Project: 40%
• Paper presentation: 20%
• Term paper: 20%
• Attendance during paper presentations: 10%
Topics that will be covered
Document Retrieval Techniques
Information Retrieval on the Web
Data Mining for Information Retrieval
Aim of the course
Knowledge:
• To introduce information retrieval techniques
Skills:
• Paper reading and presentation
• Research and/or project work
A Rough Schedule
October, November: Lectures on various information retrieval techniques
Remaining weeks: Paper and research project presentations
What I will do
Give the basics on information retrieval
Project supervision
Give directions and advice on the projects
Coordination of the presentations
What I expect you to do
• Understand the basic concepts of Information Retrieval
• Choose a specific area and two related papers on the same topic for presentation in class
• Attend the paper presentations: attendance is required, and you will lose 2% of your overall grade for each presentation you miss
• Write a term paper on the two papers presented
• Do a project and write a final report describing what you learned or achieved in the scope of the project
Sources
TREC Conference http://trec.nist.gov/
SIGIR Conference http://www.sigir.org/
WWW Conference http://www2004.org/
ACM TOIS Journal
SIGMOD, VLDB, ICDE Conferences (database perspective)
SIGKDD, ICDM Conferences (data mining perspective)
Tools
SMART IR (Cornell Univ.): http://www.cs.cornell.edu/Info/Projects/NLP/
Glimpse (Univ. of Arizona): http://webglimpse.net/
Google
Altavista
Yahoo
Information Retrieval
Refers to the retrieval of any type of information, such as:
• Structured data (e.g. relational database)
• Text (we will focus on this)
• Video
• Image, sound
• DNA
Document Retrieval
[Diagram: User Query → Static Document Collection → Ranked Result]
• The document collection is indexed in advance
• The user query is ad hoc
• Results are ranked with respect to their similarity to the user query
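The three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not name a similarity measure, so the TF-IDF weighted term overlap used here is a stand-in, and the tokenizer, the function names and the three sample documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Index the static collection once: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Score documents against an ad hoc query and return them best first.

    The score is a sum of tf * idf over the query terms, one simple choice
    of similarity among many (an assumption, not taken from the slides).
    """
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score; break ties by doc_id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


# Hypothetical sample collection.
DOCS = {
    "d1": "information retrieval on the web",
    "d2": "data mining for information retrieval",
    "d3": "web search engines rank web pages",
}

INDEX = build_index(DOCS)          # done in advance
RESULTS = rank("web search", INDEX, len(DOCS))  # done per query
print([doc_id for doc_id, _ in RESULTS])
```

Here d3 ranks above d1 because it contains both query terms, and d2 is not returned because it shares no term with the query.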
Document Routing
User profiles are set in advance
Incoming documents are directed to relevant users
Useful for redirecting corporate emails to relevant
departments (sales, marketing, support etc)
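A minimal sketch of routing, assuming each user profile is just a set of keywords fixed in advance. The profile contents, the `route` function and the sample messages are hypothetical; a real router would likely use a weighted similarity rather than plain keyword overlap.

```python
def route(document, profiles):
    """Return the names of all profiles sharing at least one keyword
    with the incoming document, in alphabetical order."""
    words = set(document.lower().split())
    return sorted(name for name, keywords in profiles.items() if words & keywords)


# Hypothetical profiles, set up before any document arrives.
PROFILES = {
    "sales": {"quote", "price", "order"},
    "marketing": {"campaign", "brochure"},
    "support": {"error", "crash", "help"},
}

print(route("Please send a price quote for 20 licenses", PROFILES))
print(route("The app shows an error after the last order", PROFILES))
```

Note the contrast with document retrieval: here the profiles play the role of standing queries, and the documents are what arrives ad hoc.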
Performance Metrics for IR
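Precision
Recall
It is usually not practical to achieve both high precision and high recall at the same time
[Diagram: within the whole document space, the set of retrieved documents overlaps the set of relevant documents; the overlap is the relevant and retrieved documents]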
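Using the standard definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below computes both from two sets of document ids; the ids and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant and retrieved documents
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document space of 10 documents, 5 of them relevant.
ALL_DOCS = {f"d{i}" for i in range(1, 11)}
RELEVANT = {"d2", "d4", "d5", "d6", "d7"}

# A selective system: 4 documents retrieved, 2 of them relevant.
print(precision_recall({"d1", "d2", "d3", "d4"}, RELEVANT))  # (0.5, 0.4)

# Retrieving everything gives perfect recall but no gain in precision.
print(precision_recall(ALL_DOCS, RELEVANT))  # (0.5, 1.0)
```

The second call shows the tension between the two measures: recall can always be pushed to 1.0 by returning the whole document space, at the cost of burying the relevant documents among irrelevant ones.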
First Reading for Tomorrow
The Anatomy of a Large-Scale Hypertextual Web Search Engine (WWW Conference 1998)
Paper by Sergey Brin and Lawrence Page
www-db.stanford.edu/~backrub/google.html
Web Information Retrieval
Two possible ways:
• Use the web structure, starting from a location like Yahoo where things are categorized
• Use search engines
Web Information Retrieval
Challenges
Scale:
• Hundreds of millions of queries per day
• The Web grows, so continuous crawling is needed
• Obstacles due to the OS and disk seek time
Google handles large data sets by indexing and compression
Search quality is important:
• Completeness of the index is important
• But ranking is also of utmost importance due to the size of the Web
Web Information Retrieval
Ranking (of Google)
• The idea is to give importance to pages that have a lot of back links
• Similar to the notion of citations in academia
• A link graph of the web was formed and maintained (518 million links in 1998 for the prototype)
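The back link idea can be sketched on a toy link graph. The first function simply counts back links. The second is a simplified PageRank-style iteration in which a back link counts for more when the linking page is itself important. Both are illustrations only: the four-page graph and the function names are hypothetical, the normalized form of the update and the uniform handling of pages without out-links are assumptions, and the damping value 0.85 is the one suggested in the Brin and Page paper.

```python
def backlink_counts(links):
    """Count incoming links per page. `links` maps page -> pages it links to."""
    counts = {page: 0 for page in links}
    for outs in links.values():
        for target in outs:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

COUNTS = backlink_counts(LINKS)
RANKS = pagerank(LINKS)
print(COUNTS)
print(sorted(RANKS, key=RANKS.get, reverse=True))
```

Page c has the most back links and also ends up with the highest rank. Pages a and b have one back link each, yet a ranks above b because its single back link comes from the important page c, which is the citation analogy from the slide.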
Web Mining
(focused) Crawling and Indexing
Topic Directories
Clustering and Classification
Hyperlink Analysis
Personalization (profiles, preferences)