Eslam Al Maghayreh: Searching through the Internet

Searching through the
Internet
Dr. Eslam Al Maghayreh
Computer Science Department
Yarmouk University
1
Outline





Introduction
Information Retrieval
Indexing
Smarter Internet Searching
Examples
2
Introduction


The Internet holds an enormous quantity of information:
- billions of web pages
- thousands of newsgroups
Two questions face any information seeker:
- (1) How can I find what I want?
- (2) How can I know that what I find is any good?
3
Information Retrieval

Goal = find documents relevant to an information
need from a large document set
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list.]
4
Example
Google
Web
5
Search Engine

Consists of:
- the interface you use to type in a query
- an index of Web sites that the query is matched against
- a software program (called a spider or bot) that goes out on the Web and gets new sites for the index
6
IR problem

First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>



External attributes and internal attribute (content)
Search by external attributes = Search in DB
IR: search by content
7
Possible approaches
1. String matching (linear search in
documents)
- Slow
2. Indexing
- Fast
- Flexible to further improvement
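A minimal sketch contrasting the two approaches, assuming a toy in-memory collection (the document texts and the function names are made up for illustration). It uses an inverted index, one common way to implement indexing; the slides do not prescribe a specific structure.

```python
from collections import defaultdict

# Hypothetical toy collection; the texts are made up for illustration.
docs = {
    "d1": "computer architecture and computer networks",
    "d2": "network protocols for the internet",
    "d3": "flower arrangement for beginners",
}

def linear_search(term, docs):
    """Approach 1: scan every document for the term on every query."""
    return sorted(d for d, text in docs.items() if term in text.split())

def build_index(docs):
    """Approach 2: map each word to the set of documents containing it (built once)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def indexed_search(term, index):
    """Look the term up directly instead of rescanning the documents."""
    return sorted(index.get(term, set()))

index = build_index(docs)
print(linear_search("computer", docs))    # ['d1']
print(indexed_search("internet", index))  # ['d2']
```

The linear scan touches every document per query, while the index pays that cost once at build time and answers each query with a single lookup.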
8
[Figure: the documents and the query are each indexed, giving a document representation (the index) and a query representation; a comparison function matches the two and produces the results.]
9
Main problems in IR

Query evaluation (or retrieval process)
- To what extent does a document correspond to a query?
System evaluation
- How good is a system?
- Are the retrieved documents relevant? (precision)
- Are all the relevant documents retrieved? (recall)
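Precision and recall can be computed from two sets, as in the sketch below. It assumes the set of relevant documents is known in advance (as in a test collection); the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p)  # 0.5  (2 of the 4 retrieved documents are relevant)
print(r)  # 2 of the 3 relevant documents were retrieved
```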
10
Document indexing


Goal = find the important meanings and create an internal representation
Factors to consider:
- Accuracy in representing meanings (semantics)
- Exhaustiveness (covering all the contents)
- Ease of manipulation by computer
What is the best representation of contents?



- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise
[Figure: word, phrase, and concept plotted by coverage (recall) against accuracy (precision).]
11
Keyword selection and weighting

How to select important keywords?


Simple method: use middle-frequency words
Search engines usually disregard minor words such as "the", "and", and "to".
[Figure: word frequency and informativity (from min. to max.) plotted against word rank (1, 2, 3, …).]
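A minimal sketch of the simple method: drop minor words, then keep words whose collection frequency falls in a middle band. The stop-word list and the `low`/`high` thresholds are assumptions chosen for this toy example, not values from the slides.

```python
from collections import Counter

# Assumed, deliberately tiny stop-word list.
STOPWORDS = {"the", "and", "to", "of", "a", "in", "is"}

def select_keywords(texts, low=2, high=3):
    """Keep non-stop-words that occur between `low` and `high` times overall."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return sorted(word for word, count in counts.items() if low <= count <= high)

# Hypothetical collection of four short documents.
texts = [
    "the web search engine",
    "search the web and the index",
    "index of the web",
    "web crawler",
]
print(select_keywords(texts))  # ['index', 'search']
```

Here "web" is dropped as too frequent (4 occurrences) and "engine" and "crawler" as too rare (1 each).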
12
Result of indexing

Each document is represented by a set of weighted
keywords (terms):
D1 → {(t1, w1), (t2, w2), …}
e.g. D1 → {(comput, 0.2), (architect, 0.3), …}
D2 → {(comput, 0.1), (network, 0.5), …}
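A sketch of how one document could be turned into weighted keywords. The weighting used here (term count divided by document length) is an assumption; the slides do not say how the weights are computed, and the stemming implied by terms such as "comput" is left out.

```python
from collections import Counter

def index_document(text, stopwords=frozenset({"the", "and", "of"})):
    """Return {term: weight}, with weight = term count / number of kept words."""
    words = [w for w in text.lower().split() if w not in stopwords]
    counts = Counter(words)
    total = len(words)
    return {term: count / total for term, count in counts.items()}

# Hypothetical document text.
d1 = index_document("computer architecture and computer design")
print(d1)  # {'computer': 0.5, 'architecture': 0.25, 'design': 0.25}
```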
13
Retrieval

The problems underlying retrieval

Retrieval model


How is a document represented with the
selected keywords?
How are document and query representations
compared to calculate a score?
14
Vector space model




Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
Document
D = < a1, a2, a3, …, an>
ai = weight of ti in D
Query
Q = < b1, b2, b3, …, bn>
bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
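To make the model concrete, the sketch below places a document and a query in the same vector space. The weights are illustrative, and ordering the vocabulary alphabetically is an arbitrary choice made here so that both vectors use the same positions.

```python
def to_vector(weights, vocabulary):
    """Build <a1, ..., an>: the weight of each vocabulary term, 0.0 if absent."""
    return [weights.get(term, 0.0) for term in vocabulary]

# Illustrative weighted keywords for one document and one query.
d = {"comput": 0.2, "architect": 0.3}
q = {"comput": 1.0, "network": 1.0}

# The vector space is the set of all keywords encountered.
vocabulary = sorted(set(d) | set(q))
D = to_vector(d, vocabulary)
Q = to_vector(q, vocabulary)

print(vocabulary)  # ['architect', 'comput', 'network']
print(D)           # [0.3, 0.2, 0.0]
print(Q)           # [0.0, 1.0, 1.0]
```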
15
Matrix representation
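Rows are documents (document space); columns are terms (term vector space). The query Q is one more row over the same terms.

$$
\begin{array}{c|ccccc}
       & t_1    & t_2    & t_3    & \cdots & t_n    \\ \hline
D_1    & a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
D_2    & a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
D_3    & a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & \vdots &        & \vdots \\
D_m    & a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \\ \hline
Q      & b_1    & b_2    & b_3    & \cdots & b_n
\end{array}
$$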
16
Some formulas for Sim
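Dot product:
$$\mathrm{Sim}(D, Q) = \sum_i (a_i * b_i)$$

Cosine:
$$\mathrm{Sim}(D, Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$$

Dice:
$$\mathrm{Sim}(D, Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$$

Jaccard:
$$\mathrm{Sim}(D, Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$$

[Figure: vectors D and Q drawn in a term space with axes t1 and t2.]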
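The four measures named on this slide (dot product, cosine, Dice, Jaccard) can be sketched in their standard vector forms as below. Returning 0.0 for an all-zero vector is a convention assumed here, and the example vectors are made up.

```python
import math

def dot(a, b):
    """Dot product: sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product normalized by the lengths of both vectors."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    """Twice the dot product over the sum of squared weights."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Dot product over the sum of squared weights minus the dot product."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

# Hypothetical document and query vectors over three terms.
D = [1.0, 1.0, 0.0]
Q = [1.0, 0.0, 0.0]
print(dot(D, Q))      # 1.0
print(cosine(D, Q))   # 1 / sqrt(2)
print(dice(D, Q))     # 2 / 3
print(jaccard(D, Q))  # 0.5
```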
17
(Classic) Presentation of results


Query evaluation result is a list of documents,
sorted by their similarity to the query.
E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
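Producing such a list amounts to scoring every document against the query and sorting by score. The sketch below uses cosine similarity as the scoring function, which is one possible choice; the document vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    num = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / denom if denom else 0.0

def rank(doc_vectors, query):
    """Return (doc_id, score) pairs, most similar document first."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical two-term vectors.
docs = {"doc1": [1.0, 0.0], "doc2": [1.0, 1.0], "doc3": [0.0, 1.0]}
query = [1.0, 0.0]

for doc_id, score in rank(docs, query):
    print(doc_id, round(score, 2))
```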
18
IR on the Web






No stable document collection (spider,
crawler)
Duplication
Huge number of documents
Multimedia documents
Multilingual problem
…
19
Tips for smarter Internet
searching


Use unique, specific terms
Use the minus operator (-) to narrow the search



yarmouk -university
Use quotation marks to match the consecutive words of a phrase, such as "flower arrangement".
Enter a short question, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", or "what is the distance between the sun and earth".
20
Smarter Internet Searching



inurl:test results
- only test must be found in the web address (URL)
allinurl:test results
- Both test AND results must be found in the web address.
define:
- will provide definitions of the words, gathered from various online sources.
- Example: define: search engine
21
Smarter Internet Searching

allintext:
- Sometimes you get pages that do not have your search term or phrase in them.
- Why? Because Google also matches the text of links that point to a page, not only the page's own text.
- Use allintext: to get only those pages that have your search terms in their text.
22
Smarter Internet Searching


allinanchor:
- Returns only pages for which your search terms appear in the text of links pointing to them, whether or not the terms appear in the pages themselves.
- This is the opposite of allintext.
site:
- Limits your search to a specific web site.
- Example:
  students site:yu.edu.jo
  students site:yu.edu.jo filetype:pdf
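Operators like these are just text inside the query string, so they can be combined programmatically. The sketch below assembles a query and URL-encodes it; the helper names are hypothetical, and the base address `https://www.google.com/search` with a `q` parameter is an assumption about the search page, not something stated in the slides.

```python
from urllib.parse import urlencode

def build_query(terms, site=None, filetype=None, exclude=()):
    """Combine plain terms with the -, site: and filetype: operators."""
    parts = [terms]
    parts += [f"-{word}" for word in exclude]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query; `base` is an assumed search endpoint."""
    return f"{base}?{urlencode({'q': query})}"

q = build_query("students", site="yu.edu.jo", filetype="pdf")
print(q)              # students site:yu.edu.jo filetype:pdf
print(search_url(q))
print(build_query("yarmouk", exclude=["university"]))  # yarmouk -university
```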
23
Smarter Internet Searching



Don't use common words and punctuation
- Common words and punctuation marks should be used only when searching for a specific phrase inside quotes
Most search engines do not distinguish between uppercase and lowercase
Make full use of AutoComplete
24
Smarter Internet Searching

The wildcard operator (*): Google calls it the fill in
the blank operator. For example, amusement * will
return pages with amusement and any other term(s)
the Google search engine deems relevant.

Using a wildcard (*) for a character does not work in
Google. cat* returns the same results as cat.
25
Smarter Internet Searching

Related sites: for example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site.
Specific file type: for example, information retrieval filetype:ppt
26
Examples

Searching for papers
- YU library
- Google Scholar
Searching for instructor resources
- Morgan Kaufmann
- Pearson
27
Examples



Searching for books to buy
- Amazon.com
- Ebay.com
Searching for items to buy
- Electronics: bestbuy.com
Searching for hotels
- Expedia.com
- Priceline.com
- Booking.com
28
Examples

Regional search
- Google jo
Searching for images
- Google images
Searching for a job
- Jobsinacademia.net
- Academickeys.com
29
The End.
30