Google: The Search Engine and Its Impact


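Have You Googled Today?
Google: The Search Engine and Its Impact
1042. Exploring the Digital World
Week 4, 03/30
張家銘, Department of Computer Science
http://www.cs.nccu.edu.tw/~jmchang/course/1042/digital/
These slides are for teaching purposes only. Original sources are credited for images wherever possible; if anything infringes on your rights, please let us know and it will be corrected immediately.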
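This Week's News
• Last December, 28-year-old Syed Farook and his wife attacked a social services agency in San Bernardino, California, killing 14 people; both were shot dead by police in a gun battle. After the attack, police found an iPhone 5c in their car. Because Apple uses its own encryption in the iPhone, even the FBI could not break it. A Los Angeles district court therefore ruled last month that Apple must provide appropriate technical assistance to help investigators unlock Farook's iPhone. Apple CEO Tim Cook's answer: it can't be done.
UFED (Universal Forensic Extraction Device)
http://technews.tw/2016/03/29/cellebrite-fbi-iphone/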
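This Week's News
• Yahoo, once the dominant force on the internet, has been reduced to selling its core internet business along with its growing Asian holdings. Reports now say Yahoo wants buyers to submit bids by April 11.
http://technews.tw/2016/03/29/selling-of-yahoo-deadline-is-411-microsoft-might-get-involved/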
Information Explosion
Number of Hosts Advertised in the DNS
• 2013/1: 996,230,757 hosts
https://ftp.isc.org/www/survey/reports/2013/07/
WWW Servers in Taiwan
• 2016/03: 160,025 servers
[Chart: number of WWW servers in Taiwan, bimonthly from 2011.04 to 2016.02; y-axis from 100,000 to 160,000; series: .com.tw, .edu.tw, .gov.tw, .net.tw, .org.tw]
http://www.twnic.net.tw/survy.xls
Some Numbers to Reflect On
• Web pages
– 8,058,044,651 (Google, 2005/7)
• Images
– 1,305,093,600 (Google, 2005/7)
Credit: 吳錦範, National Taitung University Library
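Some Numbers to Reflect On
• Google will sign agreements with Harvard, Stanford, and the University of Michigan in the US, Oxford University in the UK, and the New York Public Library, and expects to digitize 15 million volumes from their collections within 6 years (2004/12)
• Yahoo digital library project: together with the Internet Archive, the University of California, and others, digitize books that are in the public domain or licensed (2005/10)
• Microsoft (MSN) book full-text search: plans to offer 150,000 books in the first year
• Universal Digital Library Million Book Project: digitize 1.5 million out-of-copyright books from around the world
Credit: 吳錦範, National Taitung University Library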
We Are Always Overwhelmed by a Flood of Information
Information is Nothing without Retrieval
Search Engine as A Daily Tool
History of the Evolution of Search Engines
• 1990 - Archie (or “Archive” without the “v”)
– First search engine
– FTP site hosted an index of downloaded directory listings
– Due to limited space, only the listings were available and not the contents
for each site
• 1992 – VLib
– Tim Berners-Lee set up a Virtual Library
– CERN webserver hosted a list of webservers in the early age of the
Internet
History of the Evolution of Search Engines
• February 1993 – Excite
– Created by six Stanford undergrads
• June 1993 – World Wide Web Wanderer
– A robot that counted active web servers to measure the growth of the Internet; it was soon upgraded to capture actual URLs.
– Its database was called the Wandex
– The robot accessed the same page hundreds of times a day and caused lag.
• October 1993 – ALIWEB
– Crawled meta info and allowed users to submit the pages they wanted indexed along with a description
– No robot
– Did not use excessive bandwidth
• But people didn’t know how to submit their sites
Web Robots
• Web Robots
– A computer program that browses the World Wide Web in
a methodical, automated manner or in an orderly fashion.
• Robots Exclusion Standard/Web Robots
– Created standards for how search engines
should/shouldn’t index content
– Webmasters can block robots from their entire site or just
specific pages
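As a minimal sketch of how a well-behaved robot might honor these rules, Python's standard `urllib.robotparser` module can evaluate a robots.txt file. The rules below, the robot names `ExampleBot` and `BadBot`, and the example.com URLs are all hypothetical and only illustrate blocking an entire site versus specific pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one robot is blocked from the entire site,
# every other robot is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the rules above."""
    return parser.can_fetch(agent, url)


print(allowed("ExampleBot", "http://example.com/index.html"))
print(allowed("ExampleBot", "http://example.com/private/a.html"))
print(allowed("BadBot", "http://example.com/index.html"))
```

Note that the standard is advisory: the check happens inside the robot, so it only restrains robots that choose to run it.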
History of the Evolution of Search Engines
• December 1993 – Primitive Web Search
– JumpStation: Info about page’s title and header using
simple linear search
– World Wide Web Crawler: Indexed titles and URLs
• These two listed results in the order they were found without
ranking
– RBSE Spider
• unless exact title was a match it was extremely hard to find
anything
Web Crawler (Web Spider)
• Web crawler (Web spider)
– In general, it starts with a list of URLs to visit, called the seeds.
– As the crawler visits these URLs, it identifies all the hyperlinks in the page and
adds them to the list of URLs to visit, called the crawl frontier.
– URLs from the frontier are recursively visited according to a set of policies.
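The seed and frontier idea can be sketched without touching the network. In the sketch below the `WEB` dictionary stands in for real pages (a hypothetical four-page web), and the only policy is breadth-first order with a page limit; a real crawler would also need politeness delays, robots.txt checks, and URL normalization.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the hyperlinks on that page.
WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}


def crawl(seeds, max_pages=10):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in WEB.get(url, []):   # "fetch" the page and extract its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl(["a.example/"]))
```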
History of the Evolution of Search Engines
• January 1994 – Infoseek
– Webmasters could submit a page in realtime
– December 95: Netscape began using them as their default
search engine
• December 1995 – AltaVista
– First to allow natural language queries
– Advanced searching techniques (e.g., AND, OR, NOT)
– Add or delete your own URL within 24 hours
History of the Evolution of Search Engines
• April 1994 – WebCrawler
– First crawler that indexed entire pages
– Too popular to be used during daytime
– June 1995: AOL buys WebCrawler
• April 1994 – Yahoo! Directory
– Created by David Filo and Jerry Yang
– Began as a collection of favorite web pages
– Its increasing size led them to turn it into a searchable directory
– A human-written description accompanied each URL
– Informational sites added for free, but they expanded to include commercial sites
– Long wait time to be included
Motivation for Link Analysis
• Early search engines mainly compared the content similarity of the query and the indexed pages, i.e.,
– They used information retrieval methods: cosine similarity, TF-IDF, ...
• From the mid-1990s, it became clear that content similarity alone was no longer sufficient.
– The number of pages grew rapidly in the mid-to-late 1990s.
• Try the query “classification methods”: Google estimates millions of relevant pages.
• How to choose only 30-40 pages and rank them suitably to present to the user?
– Content similarity is easily spammed.
• A page owner can repeat some words and add many related words to boost the rankings of his pages and/or to make the pages relevant to a large number of queries.
Credit: Dr. C. Lee Giles, The Pennsylvania State University
Vector Space Model
• Salton’s Vector Space Model
– Represent each document by a high-dimensional
vector in the space of words
Documents
Journal of Artificial Intelligence Research
JAIR is a refereed journal, covering all areas
of Artificial Intelligence, which is distributed
free of charge over the internet. Each
volume of the journal is also published by AI
Access Foundation …
Vectors
0 learning
2 Journal
3 Intelligence
0 text
0 agent
1 internet
0 webwatcher
0 perlS
…
1 volume
Gerard Salton
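A rough sketch of how such a vector could be built: count how often each vocabulary word occurs in the document. The vocabulary list and the tokenizer are assumptions made for illustration, and the counts are computed from the truncated excerpt only, so they need not match the numbers in the slide's figure.

```python
import re
from collections import Counter

# The document excerpt shown on the slide (truncated in the original).
DOC = (
    "Journal of Artificial Intelligence Research JAIR is a refereed journal, "
    "covering all areas of Artificial Intelligence, which is distributed free "
    "of charge over the internet. Each volume of the journal is also "
    "published by AI Access Foundation"
)

# Hypothetical vocabulary: one vector dimension per word.
VOCAB = ["learning", "journal", "intelligence", "text",
         "agent", "internet", "webwatcher", "volume"]


def to_vector(text, vocab):
    """Map a document to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]


vector = to_vector(DOC, VOCAB)
print(dict(zip(VOCAB, vector)))
```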
Term-Document Matrix
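• The term-document matrix is an m x n matrix, where m is the number of terms and n is the number of documents
$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
$$
Rows correspond to the terms t_1, t_2, …, t_m; columns correspond to the documents d_1, d_2, …, d_n.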
Term Weighting by TFIDF
• The term frequency (tf) in a given document d measures the importance of the term t_i within that particular document:
$$\mathrm{tf}(t_i, d) = \frac{n_i}{\sum_k n_k}$$
where n_i is the number of occurrences of the considered term in d, and the denominator is the number of occurrences of all terms in d.
• The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term t_i, and taking the logarithm:
$$\mathrm{idf}(t_i) = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
|D|: total number of documents in the corpus
|\{d : t_i \in d\}|: number of documents in which the term t_i appears
• tfidf = tf × idf
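A small worked sketch of these definitions, assuming the natural logarithm (the slide does not fix the base) and a hypothetical three-document corpus. `idf` here assumes the term appears in at least one document; otherwise the division would fail.

```python
import math
from collections import Counter

# Hypothetical corpus: document id -> list of terms.
DOCS = {
    "d1": "google search engine".split(),
    "d2": "search engine ranking search".split(),
    "d3": "yahoo directory".split(),
}


def tf(term, doc):
    """Occurrences of `term` divided by the total number of terms in the document."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log(|D| / number of documents containing `term`)."""
    containing = sum(1 for doc in docs.values() if term in doc)
    return math.log(len(docs) / containing)


def tfidf(term, doc_id, docs=DOCS):
    return tf(term, docs[doc_id]) * idf(term, docs)


for term, doc_id in [("search", "d2"), ("google", "d1"), ("search", "d3")]:
    print(term, doc_id, round(tfidf(term, doc_id), 4))
```

In this corpus "google" outweighs "search" even though "search" occurs more often, because "google" appears in fewer documents.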
1st Generation: Content Similarity
• 1st Generation (ca 1994):
– AltaVista, Excite, Infoseek…
– Ranking based on Content:
• Pure Information Retrieval
• Content Similarity Ranking: The more rare words two documents share, the more similar they are
• Documents are treated as “bags of words” (no effort to “understand” the contents)
• Similarity is measured by vector angles
• Query results are ranked by sorting the angles between query and documents
• How To Spam?
[Figure: document vectors d1 and d2 in a term space with axes t1, t2, t3, separated by angle θ]
Credit: cs.wellesley.edu/~cs315/.../CS315-L14-Evolution-of-Search-Engines.ppt
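Ranking by vector angle can be sketched as follows. A smaller angle means a larger cosine, so sorting by descending cosine is the same as sorting by ascending angle. This sketch uses raw word counts rather than weights that favor rare words, and the three documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, docs: dict) -> list:
    """Return document names ordered from most to least similar to the query."""
    q = Counter(query.lower().split())
    scores = {name: cosine(q, Counter(text.lower().split()))
              for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical documents.
DOCS = {
    "d1": "cheap flights to taipei",
    "d2": "taipei night market food guide",
    "d3": "history of search engines",
}

print(rank("taipei food", DOCS))
```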
1st Generation: How to Spam
• “Keyword stuffing”: Add keywords, text, to increase content similarity
[Figure: page stuffed with casino-related keywords]
Credit: cs.wellesley.edu/~cs315/.../CS315-L14-Evolution-of-Search-Engines.ppt
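To see why this works against pure content ranking, the sketch below scores the same hypothetical page before and after the query words are repeated on it. It assumes a simple word-count cosine similarity as the ranking function; real engines of the period used more elaborate weighting.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score(query: str, page: str) -> float:
    return cosine(Counter(query.split()), Counter(page.split()))


QUERY = "casino poker"
HONEST = "our hotel has a pool a spa and a casino"   # hypothetical page text
STUFFED = HONEST + " casino poker" * 10              # same page, keywords repeated

print(round(score(QUERY, HONEST), 3), round(score(QUERY, STUFFED), 3))
```

The stuffed page scores far higher although it offers the reader nothing new.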
History of the Evolution of Search Engines
• July 1994 – Lycos
– Went public with catalog of 54,000 documents
– Ranked Relevance retrieval
– Prefix matching and word proximity
• August 1994: identified 394,000 documents
• January 1995 : 1.5 million documents
• November 1996: 60 million documents (more than any other search engine at
that time)
– October 2004: Lycos was sold to Daum Communications, the second
largest Internet portal in Korea
Term Explanations
• Ranked Relevance Retrieval
– Relevance most commonly refers to topical relevance or aboutness, i.e. to
what extent the topic of a result matches the topic of the query or
information need.
– Mathematical model
• Standard Boolean model, Vector space model, Probabilistic relevance model, Language
models, …
• Prefix matching
– Ex: redo, review : re is a prefix meaning again
• Word proximity
– Distance among words
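A short sketch of the last two terms, under two assumptions: prefix matching is taken to mean finding indexed words that start with the query string, and word proximity is taken to be the smallest gap, in word positions, between two terms. The function names and sample data are hypothetical.

```python
def prefix_match(prefix, words):
    """Return the words that start with `prefix` (case-insensitive)."""
    p = prefix.lower()
    return [w for w in words if w.lower().startswith(p)]


def min_distance(tokens, term_a, term_b):
    """Smallest gap in word positions between the two terms, or None if one is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


tokens = "the search engine ranks pages and the engine indexes pages".split()
print(prefix_match("re", ["redo", "review", "search", "Retrieval"]))
print(min_distance(tokens, "search", "pages"))
```

A ranking function could then favor documents in which the query terms occur closer together.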
History of the Evolution of Search Engines
• 1995 – LookSmart
– Competed with Yahoo by increasing inclusion rates back
and forth
– 2002: Began depending on MSN by syndicating links
through their portal
– 2003: Felt the sting of rejection when it was dumped by
Microsoft and lost more than 65% of its annual revenue
2nd Generation: Add Popularity
• 2nd Generation (ca 1996):
– Lycos
– Ranking based on Content + Structure
• Site Popularity
• A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B
• Rank similar documents according to popularity
• How To Spam?
[Figure: sites with their popularity votes: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
Credit: cs.wellesley.edu/~cs315/.../CS315-L14-Evolution-of-Search-Engines.ppt
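Counting popularity votes can be sketched as below. Two details are assumptions of this sketch and are not stated on the slide: links within the same site are ignored, and a site casts at most one vote for any other site no matter how many of its pages link there. The link list is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical hyperlinks as (source page, target page) pairs.
LINKS = [
    ("http://www.aa.com/x", "http://www.bb.com/"),
    ("http://www.aa.com/y", "http://www.bb.com/p"),   # second link from the same site
    ("http://www.cc.com/", "http://www.bb.com/"),
    ("http://www.bb.com/", "http://www.aa.com/"),
    ("http://www.bb.com/", "http://www.bb.com/p"),    # internal link, not a vote
]


def site_popularity(links):
    """Count, for each site, how many other sites link to it."""
    voters = defaultdict(set)
    for src, dst in links:
        site_a, site_b = urlparse(src).netloc, urlparse(dst).netloc
        if site_a != site_b:
            voters[site_b].add(site_a)   # one vote per voting site
    return {site: len(v) for site, v in voters.items()}


print(site_popularity(LINKS))
```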
2nd Generation: How to Spam
• Create “Link Farms”: heavily interconnected owned sites spam popularity
[Figure: interconnected sites owned by vespro.com promote the main site]
Credit: cs.wellesley.edu/~cs315/.../CS315-L14-Evolution-of-Search-Engines.ppt
History of the Evolution of Search Engines
• January 1996 – Google (googol = 10^100)
– Larry Page and Sergey Brin began working on BackRub, a search engine that used backlinks for search
– It ranked pages using citation notation, meaning any mention of a website on another site counted as a vote toward the mentioned site
– A website’s “authority” or reliability came from how many people linked to that site, and how trustworthy the linking sites were
– 1998: Google launches
• No one wanted to purchase the PageRank technology at that time
– 1999: Google got funding from Sequoia Capital as well as from a few other investors
– 2000: Yahoo selects Google as a search partner
History of the Evolution of Search Engines
• April 1997 – Ask.com/Ask Jeeves
– Human editors tried to match search queries
– Powered by DirectHit, which aimed to rank links by
popularity. Easy to spam.
– Uses clustering analysis to organize sites by subject-specific popularity (local web communities)
– March 2005: IAC buys Ask Jeeves for $1.85 billion, changes
name to Ask.com
Clustering Analysis
• Clustering
– The task of assigning a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar (in some sense or another) to
each other than to those in other clusters.
– Clustering is a main task of explorative data mining, and a common technique
for statistical data analysis used in many fields, including machine learning,
pattern recognition, image analysis, information retrieval, and
bioinformatics.
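As one concrete illustration, the sketch below groups 2-D points with k-means. The slide does not name a particular algorithm, so k-means is chosen here only as a familiar example; the points are hypothetical, and seeding the centroids with the first k points is a simplification.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means for 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters


POINTS = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
print(kmeans(POINTS, k=2))
```

Points near (1, 1) end up in one cluster and points near (9, 9) in the other, which matches the definition: members of a cluster are closer to each other than to members of other clusters.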
History of the Evolution of Search Engines
• 1998 – MSN
– MSN Search launches
– Relied on Overture, Looksmart and Inktomi until Google provided their
PageRank model
– Launched preview of new engine in July 2004
• 1998 – Open Directory Project
– Largest internet directory run by volunteer editors
– Unlike Yahoo, not a long wait time
– Netscape bought it in November 1998
– AOL buys Netscape same year for $4.5 billion
3rd Generation: Add Reputation…
• 3rd Generation (ca 1998):
– Google, Teoma, Yahoo
– Ranking based on Content + Structure + Value
• Page Reputation
• PageRank
– A link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of web pages with the purpose of "measuring" its relative importance within the set.
– PageRank Calculator: link
Credit: cs.wellesley.edu/~cs315/.../CS315-L14-Evolution-of-Search-Engines.ppt
Initial PageRank Idea
• Just measuring in-degree (citation count) doesn’t account for the
authority of the source of a link.
• Initial page rank equation for page p:
$$R(p) = c \sum_{q:\, q \to p} \frac{R(q)}{N_q}$$
– Nq is the total number of out-links from page q.
– A page, q, “gives” an equal fraction of its authority to all the pages it points to (e.g.
p).
– c is a normalizing constant set so that the rank of all pages always sums to 1.
Credit: Dr. C. Lee Giles, The Pennsylvania State University
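The idea can be sketched as a simple iteration: every page q repeatedly hands R(q)/N_q to each page it points to, and the results are rescaled so that all ranks sum to 1 (the role of the constant c). This is only the initial idea described above, with no damping factor, and it assumes every page has at least one out-link and appears as a key in the graph. The three-page graph is hypothetical.

```python
def initial_pagerank(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}       # start from a uniform rank
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                new[p] += rank[q] / len(outs)         # q gives R(q)/N_q to each page it cites
        total = sum(new.values())
        rank = {p: r / total for p, r in new.items()}  # normalize so ranks sum to 1
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print({p: round(r, 3) for p, r in initial_pagerank(LINKS).items()})
```

C receives rank from both A and B, so it ends up ranked as highly as A even though both have a single in-link from a high-rank page in A's case; B, cited only by half of A's authority, ranks lowest.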
Initial PageRank Idea (cont.)
• Can view it as a process of PageRank “flowing” from pages to the
pages they cite.
[Figure: link graph with PageRank values between .03 and .1 flowing along the links from citing pages to cited pages]
Credit: Dr. C. Lee Giles, The Pennsylvania State University
History of the Evolution of Search Engines
• October 2005 – Snap
– Overture owner Bill Gross launches Snap search engine
– Shows search volumes, revenues, and advertisers
– Proved too complicated for the average web surfer
• September 2006 – Live Search
– Microsoft announces launch of Live Search Product
History of the Evolution of Search Engines
• June 2008 – Cuil
– Managed and developed by former Google employees
– A search engine that organized web pages by content and displayed relatively long entries along with thumbnail pictures for many results
– Sep. 2010: shut down
• June 2009 – Bing
– Rebranding of MSN/Live Search
– Inline search suggestions for related searches directly in result set
Advanced searches
• Vertical Search
– Offers several potential benefits over general search engines:
• Greater precision due to limited scope
• Leverage domain knowledge including taxonomies and ontologies
• Support specific unique user tasks
– Ex: Object Search, Product Search
• Domain-specific search
– Focus on one area of knowledge, create customized search
experiences, and provide extremely relevant results for searchers
Advanced searches
• Personalized search
– Refers to search experiences that are tailored specifically to an individual's interests by incorporating information about the individual beyond the specific query provided
– There are several publicly available systems for personalizing Web
search results
• Ex: Google Personalized Search and Bing's search result personalization
– However, the technical details and evaluations of these commercial
systems are proprietary.
New emerging search engines
• Search + Ontology
– Ontology
• Ontology is the philosophical study of the nature of being, existence or reality as such,
as well as the basic categories of being and their relations.
• Deals with questions concerning what entities exist or can be said to exist, and how such
entities can be grouped, related within a hierarchy, and subdivided according to
similarities and differences.
New emerging search engines
• MrTaggy
– MrTaggy is an experiment in web search and exploration built on top of an algorithm called TagSearch.
– Unlike most search engines, MrTaggy doesn’t index the text on a web page. Instead, it leverages the knowledge contained in the tags that people add to web pages when using social bookmarking services.
– Video demo: http://www.youtube.com/watch?v=gwYbonHI5ss
New emerging search engines
• Search + Social Network and Personal Web Services
– Social Network
• Facebook, Twitter …
– Personal Web Services
• Dropbox, Gmail, Google Calendar, Google Document …
– How to integrate the information into a search system
New emerging search engines
• Greplin
– Greplin is a personal search engine that allows you to search all your online
data in one easy place.
– Greplin indexes the information you create on different websites (like Gmail,
Twitter and Facebook) and provides lightning fast search of all your
information.
– Greplin makes the search bar for your life.
– Video demo: http://vimeo.com/14579806
How Google Search Works
• Crawling and indexing
• Algorithms
• Fighting spam
http://www.google.com/insidesearch/howsearchworks/thestory/index.html
How Search Works: From algorithms to answers
• How Search Works
– https://www.youtube.com/watch?v=BNHR6IQJGZs
• How Google makes improvements to its search
algorithm
– https://www.youtube.com/watch?v=J5RZOU6vK4Q
How to use Google?
I’m Feeling Lucky
• When a user clicks on I’m Feeling Lucky in
Google search, he/she will be redirected to
the first search result provided by Google,
without having to browse through the search
results page.
Search Tips
• Time & Verbatim
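Verbatim Search
• Google no longer meddles by automatically correcting your spelling
• Personalization is turned off: sites you have visited many times are not moved up based on your search history
• No synonym matching: e.g., searching for “automotive” does not return results for “car”
• No related-term matching: e.g., searching for “flower shop” does not return “flower delivery”
• No stemming or tense variants: e.g., searching for “running” does not return results for “run”
• No keywords are selectively ignored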
http://www.bnext.com.tw/article/view/id/20877
Google Search Examples
• Barcelona FCB
• Knowledge Graph: Taiwanese singers
https://www.youtube.com/watch?v=mg91_trV4hY
Search Tips
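• +, -, OR
– 探索數位世界 張家銘 [course title and instructor name]
– 探索數位世界 + 張家銘
– 政大 (日本料理 OR 炒飯) [NCCU (Japanese food OR fried rice)]
• “” (exact phrase)
– “探索數位世界”+張家銘
– “statistical programming”
• * (wildcard): “Come * right now * me”
• filetype: 網際網路 filetype:ppt [Internet filetype:ppt]
• site: 張家銘 site:.tw
• Quick answers: iPhone SE, 399 USD=?TWD
• Number ranges: “數位相機” $10000..$25000 [“digital camera” $10000..$25000]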
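As a rough sketch, the operators above can be combined into a query string programmatically and turned into a search URL. The helper names `build_query` and `google_url` are hypothetical, and the sketch assumes the familiar `https://www.google.com/search?q=...` URL form; it only builds strings and does not contact Google.

```python
from urllib.parse import urlencode


def build_query(terms="", phrase=None, any_of=(), exclude=(), site=None, filetype=None):
    """Assemble a query string from plain terms and common search operators."""
    parts = [terms] if terms else []
    if phrase:
        parts.append(f'"{phrase}"')                      # exact phrase
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")    # any one of these terms
    parts += [f"-{word}" for word in exclude]            # words that must not appear
    if site:
        parts.append(f"site:{site}")                     # restrict to a site or domain
    if filetype:
        parts.append(f"filetype:{filetype}")             # restrict to a file type
    return " ".join(parts)


def google_url(query: str) -> str:
    """URL-encode the query into a search URL (assumed URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_query("政大", any_of=["日本料理", "炒飯"]))
print(build_query(phrase="statistical programming", site=".tw", filetype="ppt"))
print(google_url(build_query(phrase="statistical programming")))
```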
A year in search 2015
• https://www.google.com/trends/topcharts#date=2015&geo=TW
• https://www.google.com/trends/story/2015_GLOBAL
• https://www.google.com/trends/story/GB_cu_G7YDn1MBAABLqM_en
Any questions?