Lecture 32 WWW Search - BYU Computer Science Students

Lecture #32
WWW Search
Review: Data Organization
• Kinds of things to organize
– Menu items
– Text
– Images
– Sound
– Videos
– Records (i.e., a person’s name, address, & phone number, or a car’s year, make, & model)
Review: Data Organization
• Three ways to find things:
– Lists (in-order search, binary search)
– Trees (balance number of branches with time to
decide which is correct branch)
– Search
WWW Search
Search issues
• How do we say what we want?
– I want a story about pigs
– I want a picture of a rooster
– How many televisions were sold in Vietnam
during 2000?
– Find a movie like this one
• How does the computer find what we said?
Things to search for
• Records
• Text
• Images
• Audio
• Video
Records
• Car
– Price
– Miles
– Year
– Make
– Doors
• Queries
– Price < 6000 & Miles < 100000
– Make == Toyota & Year > 1993
Queries
• Make == Toyota & Year > 1993
#  Make    Year  Miles   Price
0  Toyota  1994  20000   $6,000
1  Honda   1992  100000  $2,000
2  Ford    1997  5000    $1,000
3  Toyota  1992  150000  $3,000
4  Chevy   1996  30000   $2,000
5  BMW     1994  120000  $100,000
Queries
• Year > 1993 or Price < $3,000
#  Make    Year  Miles   Price
0  Toyota  1994  20000   $6,000
1  Honda   1992  100000  $2,000
2  Ford    1997  5000    $1,000
3  Toyota  1992  150000  $3,000
4  Chevy   1996  30000   $2,000
5  BMW     1994  120000  $100,000
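A minimal sketch of how such queries could be evaluated, assuming each car is stored as a dictionary. The field names and the `query` helper are illustrative, not part of the lecture. Each query becomes a test that is run against every record, and the matching row numbers are returned.

```python
# Car records copied from the table above; the field names are illustrative.
cars = [
    {"make": "Toyota", "year": 1994, "miles": 20000, "price": 6000},
    {"make": "Honda", "year": 1992, "miles": 100000, "price": 2000},
    {"make": "Ford", "year": 1997, "miles": 5000, "price": 1000},
    {"make": "Toyota", "year": 1992, "miles": 150000, "price": 3000},
    {"make": "Chevy", "year": 1996, "miles": 30000, "price": 2000},
    {"make": "BMW", "year": 1994, "miles": 120000, "price": 100000},
]

def query(records, predicate):
    """Return the row numbers of every record that satisfies the predicate."""
    return [i for i, rec in enumerate(records) if predicate(rec)]

# Make == Toyota & Year > 1993
and_rows = query(cars, lambda c: c["make"] == "Toyota" and c["year"] > 1993)
# Year > 1993 or Price < $3,000
or_rows = query(cars, lambda c: c["year"] > 1993 or c["price"] < 3000)
# Price < 6000 & Miles < 100000
cheap_rows = query(cars, lambda c: c["price"] < 6000 and c["miles"] < 100000)

print(and_rows)    # [0]
print(or_rows)     # [0, 1, 2, 4, 5]
print(cheap_rows)  # [2, 4]
```

Row 3 fails the "or" query because 1992 is not after 1993 and $3,000 is not below $3,000.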
Databases
• Large collections of records
• Accessed by queries
Things to search for
• Records
• Text
• Images
• Audio
• Video
Text searching
• How do I say what I want?
– Type some phrase
• I want a story about pigs
• How will the computer match this?
– What is text?
• An array of characters
– What can a computer do with text?
• Match characters
Text searching
• People think in words not characters
• How do I convert an array of characters into
an array of words?
– Collect together sequences of letters
– How do I know if character C is a letter?
• (C>=“a” & C<=“z”) | (C>=“A” & C<=“Z”)
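The letter test and the "collect together sequences of letters" step can be sketched as follows. The function names `is_letter` and `to_words` are assumptions made for this example, and only unaccented English letters are treated as letters, as on the slide.

```python
def is_letter(c):
    # The slide's test: C is between "a" and "z" or between "A" and "Z".
    return ("a" <= c <= "z") or ("A" <= c <= "Z")

def to_words(text):
    """Turn an array of characters into an array of words."""
    words = []
    current = ""
    for c in text:
        if is_letter(c):
            current += c           # still inside a word
        elif current:
            words.append(current)  # a non-letter ends the current word
            current = ""
    if current:                    # do not lose the last word
        words.append(current)
    return words

print(to_words("The lazy brown dog"))  # ['The', 'lazy', 'brown', 'dog']
```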
Convert to words
• Because people think in words
Characters:
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
 T  h  e     l  a  z  y     b  r  o  w  n     d  o  g
Words:
 0    1     2      3
 The  lazy  brown  dog
Every document is an array of words
• I want a story about pigs
• How will I find the right documents?
– Find all documents that have the word “pigs”
Searching text
• How will I find pigs fast?
– Create an index of all words
• With each word store the name or address of each
document that contains that word
– Search the index for “pigs”
• Return the list of documents
• Use a binary search on the word list (50,000 words)
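A small sketch of such an index, assuming documents are already plain lowercase text. The document names and the helper names `build_index` and `search` are made up for illustration. The word list is kept sorted so that the lookup can use binary search (`bisect_left` from Python's standard library).

```python
from bisect import bisect_left

def build_index(documents):
    """Return a sorted word list and, for each word, the documents containing it."""
    postings = {}
    for name, text in documents.items():
        for word in text.split():
            postings.setdefault(word, set()).add(name)
    words = sorted(postings)
    doc_lists = [sorted(postings[w]) for w in words]
    return words, doc_lists

def search(words, doc_lists, word):
    """Binary-search the sorted word list and return the matching documents."""
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return doc_lists[i]
    return []

# Hypothetical documents, used only to exercise the index.
docs = {
    "three_pigs.txt": "three little pigs built houses",
    "farm.txt": "the farm has pigs and a rooster",
    "cars.txt": "a used toyota for sale",
}
words, doc_lists = build_index(docs)
print(search(words, doc_lists, "pigs"))  # ['farm.txt', 'three_pigs.txt']
```

With 50,000 words, a binary search needs about 16 comparisons, since 2^16 = 65,536.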
Problems
• What if a document has the word “Pig” but
not “pigs”?
• Normalize
– Case - make all words lower case
• Pig -> pig
– Stemming - remove all suffixes and prefixes
before putting a word into the index
• pigs -> pig
• piggy -> pig
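A toy sketch of normalization that handles only the three examples above. The `normalize` function and its suffix list are assumptions for illustration. Real stemmers use far more rules, and this one ignores prefixes entirely.

```python
def normalize(word):
    """Lowercase a word and strip a simple suffix (toy rules, not a real stemmer)."""
    word = word.lower()
    for suffix in ("s", "y"):
        # Only strip from words longer than three letters, so "pig" is left alone.
        if len(word) > 3 and word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final letter left behind, e.g. "pigg" -> "pig".
    if len(word) > 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word

for w in ("Pig", "pigs", "piggy"):
    print(w, "->", normalize(w))
```

The same function has to be applied both when building the index and when reading the query, so that "Pig", "pigs", and "piggy" all land on the same index entry.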
Problems
• I want a story about pigs?
– How does the computer know to search for
pigs?
• It doesn’t
– How does the computer know what a story is?
• It doesn’t
Searching
• I want a story about pigs
• Pick out the important words and search for them
– Which words are important?
– D = number of times a word appears in a document
– A = average number of times a word appears in all
documents
– Importance = D/A
• Why?
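The Importance = D/A ratio can be tried out on a few short documents. This is a sketch under the slide's definitions. The sample documents and the `importance` function name are made up. A word such as "a" appears about equally often everywhere, so its ratio stays near 1, while a word concentrated in one document scores well above 1 there.

```python
from collections import Counter

def importance(word, doc_index, documents):
    """Importance = D / A for one word in one document."""
    counts = [Counter(doc) for doc in documents]
    d = counts[doc_index][word]                  # D: occurrences in this document
    total = sum(c[word] for c in counts)
    if total == 0:
        return 0.0                               # the word never appears anywhere
    a = total / len(documents)                   # A: average occurrences per document
    return d / a

# Hypothetical documents, already split into words.
documents = [
    "i want a story about pigs and more pigs".split(),
    "i want a picture of a rooster".split(),
    "a story about a car".split(),
]

print(importance("pigs", 0, documents))  # well above 1
print(importance("a", 0, documents))     # below 1
```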
How do we create an index of all documents on the Web?
• Try = a list of URLs
• Seen = all URLs you have seen
While (Try is not empty)
{
    Page = take a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
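The loop above can be sketched in Python. To keep the example self-contained, a small dictionary named `WEB` stands in for the real Web; it is an assumption of this sketch, as are the example URLs. Every word on a page is also treated as "important". A real crawler would download and parse each page instead.

```python
from collections import deque

# A tiny in-memory stand-in for the Web: URL -> (page text, outgoing links).
WEB = {
    "a.example": ("pigs on the farm", ["b.example", "c.example"]),
    "b.example": ("a rooster story", ["a.example"]),
    "c.example": ("pigs and roosters", ["b.example", "d.example"]),
    "d.example": ("used cars", []),
}

def crawl(start_urls, web):
    to_try = deque(start_urls)   # Try: URLs waiting to be visited
    seen = set(start_urls)       # Seen: every URL encountered so far
    index = {}                   # word -> set of URLs containing it
    while to_try:
        url = to_try.popleft()
        text, links = web.get(url, ("", []))
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:  # Seen prevents visiting a page twice
                seen.add(link)
                to_try.append(link)
    return index

index = crawl(["a.example"], WEB)
print(sorted(index["pigs"]))  # ['a.example', 'c.example']
```

Without Seen, the links between a.example and b.example would make the loop run forever.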
Other ways to find important words and important documents
• A Document is important if many other
documents point to it
• A word is important in document D if that
word occurs frequently in documents that
link to document D.
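The first idea can be sketched as a plain count of inbound links. This is the simplest possible version, not the ranking method any particular search engine uses. The link data and the `inbound_counts` name are made up for illustration.

```python
from collections import Counter

# Hypothetical link structure: page -> pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

def inbound_counts(links):
    """Count how many other pages point to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore links from a page to itself
                counts[target] += 1
    return counts

# Most linked-to pages first; ties are broken alphabetically.
ranked = sorted(inbound_counts(links).items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[0])  # ('c.example', 3)
```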
Images
• What will I say when searching for an
image?
– I want a rooster picture
– Draw a picture of a rooster?
Search by picture?
Is this possible? If so, how?
What’s in a picture?
• Computers don’t understand the contents of
images
• To a computer an image is a bunch of
colored pixels
I want a picture of a rooster
• Label all of the pictures
• How does Google Images do it?
– File name of the picture “rooster-crossingSt.jpg”
– Words around the picture in the HTML
• Use “Safe Search” and set filters appropriately
(http://www.youtube.com/watch?v=maWx-ApkBCs)
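As an illustration of the file-name idea, the sketch below pulls candidate label words out of a picture's file name. The `labels_from_filename` helper is hypothetical and says nothing about how any real image search engine is implemented. Words from the surrounding HTML could be added to the same label list.

```python
import re

def labels_from_filename(filename):
    """Split a file name such as 'rooster-crossingSt.jpg' into lowercase words."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    parts = re.split(r"[^A-Za-z]+", stem)      # split on hyphens, digits, etc.
    words = []
    for part in parts:
        # Split camelCase: 'crossingSt' -> 'crossing', 'St'
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part)
    return [w.lower() for w in words]

print(labels_from_filename("rooster-crossingSt.jpg"))  # ['rooster', 'crossing', 'st']
```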
Audio
• Talking
– Use speech recognition to convert audio to text
– With each recognized word keep track of where
in the audio it was recognized.
• Build an index using the recognized text
– Normalize based on how words sound rather
than are spelled.
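A sketch of the index described above, assuming a speech recognizer has already produced (word, time in seconds) pairs. The pairs below are made up, no real recognizer is called, and the only normalization applied is lowercasing rather than the sound-based normalization mentioned on the slide.

```python
def build_audio_index(recognized):
    """Map each recognized word to the times (in seconds) where it was heard."""
    index = {}
    for word, seconds in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index

# Hypothetical recognizer output: (word, offset into the recording in seconds).
recognized = [
    ("The", 0.0), ("pigs", 0.4), ("ran", 0.9),
    ("and", 5.2), ("the", 5.5), ("pigs", 5.8), ("hid", 6.3),
]

audio_index = build_audio_index(recognized)
print(audio_index["pigs"])  # [0.4, 5.8]
```

A search for "pigs" can then jump straight to those positions in the recording.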
Video
• Where in “Casablanca” does Bogart say “Play it again Sam”?
– He never does; he just says “Play it”
• How can the computer find that?
– Transcribe the audio
– Speech recognition on the audio
Video
• Does Woody ever kiss Bo Peep?
• Exactly what color is a kiss?
Video
• Does Woody ever kiss Bo Peep?
• Annotate every frame with who is in the
frame and search for frames with both
Woody and Bo Peep.
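The frame-annotation search can be sketched as a set test. The frame numbers and annotations below are invented for illustration, and `frames_with_all` is a hypothetical helper name.

```python
# Hypothetical annotations: frame number -> characters visible in that frame.
frames = {
    0: {"Woody"},
    1: {"Woody", "Buzz"},
    2: {"Woody", "Bo Peep"},
    3: {"Bo Peep"},
    4: {"Woody", "Bo Peep", "Buzz"},
}

def frames_with_all(annotations, wanted):
    """Return the frames in which every wanted character appears."""
    wanted = set(wanted)
    return sorted(n for n, who in annotations.items() if wanted <= who)

print(frames_with_all(frames, ["Woody", "Bo Peep"]))  # [2, 4]
```

This only finds frames where both characters are present; it cannot tell whether they kiss.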
So what’s with this?
Or this?
Is Woody cheating?
Search
• Records
– Queries
• < > = And Or
• Text
– Normalized words (case, stemming, thesaurus)
• Images
– Add words
• Audio
– Transcribe or recognize as words
• Video
– Transcribe
– Annotate
“Re-Search” Directions in Image Recognition, Search and Retrieval
Face Detection
In Commercial Digital Cameras
Train on
- 1000’s of faces
- Millions of non-faces
Face Detection – Viola & Jones
From R. Szeliski, Computer Vision: Algorithms and Applications, Course Notes CSE 576, U. Washington
Face Recognition
(Eigenfaces [Turk and Pentland 1991])
Project image into higher-dimensional space
“Recognize” by grouping unknown image with closest training example
[Figure: an N x N image, shown as rows of pixel values such as 0 71 250 68 and 210 44 128 53, viewed as a vector of N^2 values]
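The "closest training example" step can be sketched as a nearest-neighbor search. This sketch compares raw pixel vectors directly and leaves out the projection step that Eigenfaces performs first. The tiny 2 x 2 "images" and the labels are made up for illustration.

```python
import math

def flatten(image):
    """Turn an N x N image (list of rows) into one vector of N*N pixel values."""
    return [p for row in image for p in row]

def distance(a, b):
    """Euclidean distance between two pixel vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(unknown, training):
    """Return the label of the training image closest to the unknown image."""
    vec = flatten(unknown)
    best = min(training, key=lambda item: distance(vec, flatten(item[1])))
    return best[0]

# Hypothetical training set: (label, 2 x 2 grayscale image).
training = [
    ("person_a", [[0, 71], [250, 68]]),
    ("person_b", [[210, 44], [128, 53]]),
]

print(recognize([[5, 70], [240, 60]], training))  # person_a
```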
Face Recognition
(Picasa - Google)
• Image search/organization
• Automatically finds, crops and groups images of
the same person from a collection of photos
• Allows user feedback (trainable) - user can
indicate if it found the wrong person.
Face/Object Recognition/Search:
Feature-Based Technology
Object -> Extract Features -> Bag of “words”*
Create visual “words” from image features.
*Li Fei-Fei (Princeton)
Face/Object Recognition/Search:
Feature-Based Technology
Do this for multiple objects
Face/Object Recognition/Search:
Bag of Words
How to get matching images/documents?
Use “word” frequencies $n_{id} / n_d$, where
$n_{id}$ = # times word $i$ occurs in document $d$
$n_d$ = total # words in document $d$
Then combine word frequency with inverse document frequency weighting to downweight words that occur frequently:
$\frac{n_{id}}{n_d} \log \frac{D}{A}$
(D = # of occurrences; A = average # of occurrences)
From R. Szeliski, Computer Vision Algorithms and Applications, p. 605
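For comparison, here is a sketch of the common textbook form of this weighting, in which the logarithm is taken of (number of documents) / (number of documents containing the word). That choice is an assumption of this sketch and may not match the slide's D and A exactly. The sample documents are made up, and the same code applies whether the "words" are text words or visual words.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return, for each document, a dict of word -> tf-idf weight."""
    n_docs = len(documents)
    doc_freq = Counter()                 # number of documents containing each word
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)            # n_id for every word i in this document
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

# Hypothetical documents, already split into normalized words.
documents = [
    ["pig", "pig", "farm", "the"],
    ["rooster", "farm", "the"],
    ["car", "the"],
]
weights = tf_idf(documents)
print(weights[0])
```

A word that occurs in every document, such as "the", gets weight 0 because log(1) = 0, so it contributes nothing to a match.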
Face/Object Recognition/Search:
Feature-Based Technology
Drop word features through a “vocabulary tree” to classify