Searching through the Internet
Dr. Eslam Al Maghayreh
Computer Science Department
Yarmouk University
Outline
Introduction
Information Retrieval
Indexing
Smarter Internet Searching
Examples
Introduction
The Internet holds an enormous quantity of information:
billions of web pages
thousands of newsgroups
Two questions face any information seeker:
(1) How can I find what I want?
(2) How can I know that what I find is any good?
Information Retrieval
Goal = find documents relevant to an information need from a large document set
[Diagram: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]
Example: Google as the IR system, with the Web as its document collection
Search Engine
Consists of:
the interface you use to type in a query
an index of Web sites that the query is matched against
a software program (called a spider or bot) that goes out on the Web and gets new sites for the index
IR problem
First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
A record has external attributes (ISBN, author, title, date) and an internal attribute (the content)
Search by external attributes = search in a database (DB)
IR = search by content
Possible approaches
1. String matching (linear search in documents)
- Slow
2. Indexing
- Fast
- Flexible to further improvement
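The two approaches can be compared with a minimal sketch. The toy collection, the function names, and the whitespace tokenization are assumptions made for illustration; they are not from the slides.

```python
from collections import defaultdict

# Hypothetical toy collection, invented for this example.
docs = {
    "d1": "computer architecture and computer networks",
    "d2": "network protocols for the web",
    "d3": "information retrieval on the web",
}

def linear_search(term, docs):
    """String matching: scan every document for the term on each query."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def build_index(docs):
    """Indexing: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = build_index(docs)
print(linear_search("web", docs))          # reads all documents
print(sorted(index.get("web", set())))     # one dictionary lookup
```

The index is built once and then answers each query with a lookup, which is why the slide labels it fast; it can also be extended later, for example by storing weights next to each document.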
[Diagram: the documents and the query each go through indexing, producing a document representation (the index) and a query representation; a comparison function matches the two and produces the results]
Main problems in IR
Query evaluation (or retrieval process)
To what extent does a document correspond to a query?
System evaluation
How good is a system?
Are the retrieved documents relevant? (precision)
Are all the relevant documents retrieved? (recall)
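Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses their usual definitions; the function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).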
Document indexing
Goal = find the important meanings and create an internal representation
Factors to consider:
Accuracy in representing meanings (semantics)
Exhaustiveness (covering all the contents)
Ease of manipulation by computer
What is the best representation of contents?
Word: good coverage, not precise
Phrase: poor coverage, more precise
Concept: poor coverage, precise
[Figure: Word, Phrase, and Concept placed on two axes, Coverage (Recall) and Accuracy (Precision)]
Keyword selection and weighting
How to select important keywords?
Simple method: use middle-frequency words
Search engines usually disregard minor words such as "the", "and", "to", etc.
Frequency/Informativity
[Figure: frequency and informativity curves plotted against word rank (1, 2, 3, …), on a vertical axis from Min. to Max.]
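A minimal sketch of the simple method: drop the minor words, count what remains, and keep only words whose count falls in a middle band. The stop-word list, the thresholds, and the sample text are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed stop-word list; real systems use longer ones.
STOP_WORDS = {"the", "and", "to", "a", "of", "in", "is"}

def select_keywords(text, min_count=2, max_count=3):
    """Keep words that are neither too rare nor too frequent."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if min_count <= c <= max_count}

# Hypothetical text: "search" is too frequent, "spider" too rare.
text = ("the web and the index and the spider "
        "index the web to search the web "
        "search search search search")
print(select_keywords(text))
```

With these thresholds only "web" and "index" survive: "spider" occurs once and "search" five times, so both fall outside the middle band.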
Result of indexing
Each document is represented by a set of weighted keywords (terms):
D1 {(t1, w1), (t2, w2), …}
e.g. D1 {(comput, 0.2), (architect, 0.3), …}
D2 {(comput, 0.1), (network, 0.5), …}
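One way to produce such a representation is sketched below. It weights each term by its share of the kept words in the document. This weighting scheme, the stop-word list, and the function name are assumptions; the slides do not say how the weights are computed, and the stemming suggested by terms such as "comput" is not implemented here.

```python
from collections import Counter

def index_document(text, stop_words=frozenset({"the", "and", "to", "of"})):
    """Return {term: weight}, with weight = term count / number of kept terms."""
    terms = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(terms)
    total = len(terms)
    return {t: c / total for t, c in counts.items()} if total else {}

# Hypothetical document text.
d1 = index_document("the architecture of the computer and the computer network")
print(d1)
```

Four words remain after stop-word removal, so "computer" (two occurrences) gets weight 0.5 and the other two terms get 0.25 each.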
Retrieval
The problems underlying retrieval
Retrieval model
How is a document represented with the selected keywords?
How are document and query representations compared to calculate a score?
Vector space model
Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
Document
D = < a1, a2, a3, …, an>
ai = weight of ti in D
Query
Q = < b1, b2, b3, …, bn>
bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
Matrix representation
Rows = document space (plus the query Q); columns = term vector space

        t1     t2     t3     …    tn
D1      a11    a12    a13    …    a1n
D2      a21    a22    a23    …    a2n
D3      a31    a32    a33    …    a3n
…
Dm      am1    am2    am3    …    amn
Q       b1     b2     b3     …    bn
Some formulas for Sim
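Dot product: $\mathrm{Sim}(D, Q) = \sum_i a_i b_i$
Cosine: $\mathrm{Sim}(D, Q) = \dfrac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$
Dice: $\mathrm{Sim}(D, Q) = \dfrac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard: $\mathrm{Sim}(D, Q) = \dfrac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$
[Figure: vectors D and Q drawn in the plane of terms t1 and t2]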
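The four similarity measures named on this slide (dot product, cosine, Dice, Jaccard) can be sketched in their usual vector forms, with a_i the weights of D and b_i the weights of Q. The function names and the example vectors are assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product: sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    """Twice the dot product over the sum of squared weights."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Dot product over (sum of squared weights minus the dot product)."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

# Hypothetical vectors over three terms <t1, t2, t3>.
D = [1.0, 1.0, 0.0]
Q = [1.0, 0.0, 0.0]
for name, sim in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(name, sim(D, Q))
```

Unlike the raw dot product, the other three measures are normalized, so a long document with large weights does not automatically outrank a short one.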
(Classic) Presentation of results
The query evaluation result is a list of documents, sorted by their similarity to the query.
E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
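Producing such a list amounts to sorting documents by score in descending order. The sketch below reuses the three scores from the example; the function name and the dictionary layout are assumptions.

```python
def rank(scores):
    """Return (document, score) pairs, highest similarity first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Scores from the example above, deliberately given out of order.
scores = {"doc3": 0.54, "doc1": 0.67, "doc2": 0.65}
for doc_id, score in rank(scores):
    print(doc_id, score)
```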
IR on the Web
No stable document collection (pages must be gathered by a spider or crawler)
Duplication
Huge number of documents
Multimedia documents
Multilingual problem
…
Tips for smarter Internet searching
Use unique, specific terms
Use the minus operator (-) to narrow the search by excluding a term
yarmouk -university
Use quotation marks to match the consecutive words of a phrase, such as "flower arrangement".
Enter a short question, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", "what is the distance between the sun and earth"
Smarter Internet Searching
inurl:test results
Only test must be found in the web address (URL); results may appear anywhere.
allinurl:test results
Both test AND results must be found in the web address.
define:
Provides definitions of the words, gathered from various online sources.
define: search engine
Smarter Internet Searching
allintext:
Sometimes you get pages that do not have your search term/phrase in them.
Why? Because Google also matches the text of links that point to a page.
Use allintext: to get only those pages that have your search terms in their own text.
Smarter Internet Searching
allinanchor:
Returns only pages for which your search terms appear in the text of links pointing to them, whether or not the terms appear in the pages themselves.
In this sense it is the opposite of allintext:.
site:
Limits your search to a specific web site.
Example:
students site:yu.edu.jo
students site:yu.edu.jo filetype:pdf
Smarter Internet Searching
Don't use common words and punctuation in ordinary queries
Use common words and punctuation marks only when searching for a specific phrase inside quotes
Most search engines do not distinguish between uppercase and lowercase
Make full use of AutoComplete suggestions
Smarter Internet Searching
The wildcard operator (*): Google calls it the fill-in-the-blank operator. For example, amusement * will return pages with amusement and any other term(s) the Google search engine deems relevant.
Using a wildcard (*) for a character does not work in Google: cat* returns the same results as cat.
Smarter Internet Searching
Related sites: For example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site.
Specific file type: For example, information retrieval filetype:ppt
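Queries that combine several of these operators can be assembled programmatically. The sketch below is a hypothetical helper, not part of any Google API: build_query and its parameters are invented for illustration, and the search URL form is an assumption about how the query would be passed to the search page.

```python
from urllib.parse import quote_plus

def build_query(terms, phrase=None, exclude=(), site=None, filetype=None):
    """Join plain terms with the operators covered in these slides."""
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')                   # exact phrase in quotes
    parts.extend(f"-{word}" for word in exclude)      # minus operator
    if site:
        parts.append(f"site:{site}")                  # restrict to one web site
    if filetype:
        parts.append(f"filetype:{filetype}")          # restrict to one file type
    return " ".join(parts)

query = build_query(["students"], site="yu.edu.jo", filetype="pdf")
print(query)
# Assumed URL form; quote_plus encodes spaces as "+" and ":" as "%3A".
print("https://www.google.com/search?q=" + quote_plus(query))
```

The first call reproduces the earlier example query, students site:yu.edu.jo filetype:pdf; calling build_query(["yarmouk"], exclude=["university"]) gives yarmouk -university.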
Examples
Searching for papers
YU library
Google scholar
Searching for instructor resources
Morgan Kaufmann
Pearson
Examples
Searching for books to buy
Amazon.com
Ebay.com
Searching for items to buy
Electronics: bestbuy.com
Searching for hotels
Expedia.com
Priceline.com
Booking.com
Examples
Regional search
Google jo
Searching for images
Google images
Searching for a job
Jobsinacademia.net
Academickeys.com
The End.