How does a web search engine work
how does a web search engine work?
google
(started 1998 … now worth $365 billion)
bing
amazon
search
web, images, news, maps, books, shopping, apps, videos, music – and much more!
sometimes people tell you what they want
type your query into google
sometimes you have to guess and offer stuff to them
amazon gives you shopping ideas
our objective
search engines help people find things
saves them time and effort
as a search engine, your job is to make sure they can find things quickly and easily!
overview
1. collect the things that you want to search
... take a snapshot of the whole internet
2. figure out what those things are “about”
... words in text documents, speech in videos, notes in sound
3. allow people to find what they want from your collection
... decide which of the things you have are relevant
... like finding a needle in a haystack!
3 core parts
crawling
the web ‘spider’ crawls across all the pages on the internet
indexing
like a librarian categorising books
retrieving
letting people find the stuff they want
but it gets complicated…
the internet is huge (trillions of webpages)
useless information (old, poorly written, advertising, duplicates)
inappropriate stuff
different languages
spam
lots of computers to do your work for you!
but you need to tell them what work to do – programming.
they all have to work together
what if some break?
search engines are expensive
24/7/365
mobile phones, iPads, computers etc.
in every corner of the world
lots of fast internet connections
servers
cooling
let’s take a little look at how each part works…
1. crawling
start with a bunch of websites you know about and just follow the links…
Imagine if you kept clicking links forever
How long would it take to get back to the page you started on, if you were clicking on a different link each time?
Could you cover all the pages on the internet?
Is it equally likely that you will reach every page? What about more popular pages, for example bbc.com, facebook.com etc.?
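The questions above can be tried out on a toy web. This is only a sketch: the page names and links in `LINKS` are made up, and a real crawler would download pages over the network instead of reading a dictionary. `crawl` follows links outwards from the pages you already know about, and `random_surfer` keeps clicking a random link and counts where it lands.

```python
import random
from collections import Counter, deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "shop"],
    "news": ["home", "sport"],
    "shop": ["home"],
    "sport": ["home", "news"],
    "island": ["home"],  # links out, but no page links to it
}

def crawl(seeds):
    """Follow links breadth-first from the seed pages; return pages in discovery order."""
    seen = list(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

def random_surfer(start, clicks, seed=0):
    """Click a random link `clicks` times and count how often each page is visited."""
    rng = random.Random(seed)
    visits = Counter()
    page = start
    for _ in range(clicks):
        page = rng.choice(LINKS[page])
        visits[page] += 1
    return visits

print(crawl(["home"]))  # "island" is missing: nothing links to it
print(random_surfer("home", 10_000).most_common())
```

In this toy web the crawl never finds "island", and the random surfer lands on "home" most often because the most pages link to it. That is the same reason heavily linked pages are reached more easily than obscure ones.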
2. indexing
activity...
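One common way to index text is an inverted index, a lookup from each word to the pages that contain it, much like the index at the back of a book. The slides do not name a technique, so treat this as an assumed, minimal sketch; the three pages in `PAGES` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical mini collection: page name -> page text.
PAGES = {
    "zoo": "the tiger is a big cat",
    "pets": "a cat and a dog",
    "poem": "tyger tyger burning bright",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

index = build_index(PAGES)
print(sorted(index["cat"]))    # ['pets', 'zoo']
print(sorted(index["tiger"]))  # ['zoo']
```

Looking up a word is now a single dictionary access, instead of re-reading every page for every search.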
3. retrieval
activity...
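A very simple way to decide which pages are relevant is to count how often the query words appear on each page and list the best matches first. Real search engines use far richer signals; the scoring rule and the pages in `PAGES` here are assumptions made up for this sketch.

```python
from collections import Counter

# Hypothetical mini collection: page name -> page text.
PAGES = {
    "zoo": "the tiger is a big cat",
    "pets": "a cat and a dog",
    "poem": "tyger tyger burning bright",
}

def search(query, pages):
    """Rank pages by how many times they contain the query words."""
    query_words = query.lower().split()
    scores = Counter()
    for name, text in pages.items():
        words = text.lower().split()
        score = sum(words.count(q) for q in query_words)
        if score:  # leave out pages that match nothing
            scores[name] = score
    return [name for name, _ in scores.most_common()]

print(search("big cat", PAGES))  # ['zoo', 'pets']
print(search("tiger", PAGES))    # ['zoo'] (the "tyger" poem is missed)
```

The second search shows the weakness of exact word matching: a page that spells the word differently is never found.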
challenges
words like ‘the’, ‘a’, ‘and’, ‘what’ – useless!
tiger/tigers, bengali, tyger, big cat – plurals, spelling mistakes, synonyms
what people search for often doesn’t exactly match what people write, even though it means the same!
how easy to read, or “helpful”, is a web page?
how much “about” the search topic is the page, really?
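Some of these word-level problems can be softened by cleaning up words before they are indexed or searched. The sketch below is deliberately crude and rests on assumptions: `STOP_WORDS` holds just the examples from the slide, the plural rule only chops a trailing "s", and `ALIASES` is a hypothetical hand-made table. Multi-word synonyms such as "big cat" are not handled.

```python
STOP_WORDS = {"the", "a", "and", "what"}  # the example words from the slide
ALIASES = {"tyger": "tiger"}              # hypothetical hand-made spelling table

def normalise(text):
    """Lowercase the text, drop stop words, strip simple plurals, map known aliases."""
    terms = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # crude: also mangles words like "glass"
        terms.append(ALIASES.get(word, word))
    return terms

print(normalise("what the tigers"))  # ['tiger']
print(normalise("Tyger and a cat"))  # ['tiger', 'cat']
```

If both the pages and the queries go through the same clean-up, "tigers", "tiger" and "tyger" all end up as the same term and can match each other.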