FINDING INFORMATION ON THE
WEB
Srinivasan Seshadri
CTO Kosmix
EARLY INTERNET (1992 – 1994)
Mozilla Browser
People linked to others' home pages and other
interesting pages
People really browsed
INTERNET (1995 – 2002)
Search - Altavista, Lycos
Google
Used Hyperlink Graph Structure to Rank Results
INTERNET NOW
Kosmix bringing back joys of browsing and
exploring
360 degree view of any topic
Topic Home page (why not a topic )
Top Informational Sites for a topic and a preview
(snippets) are the results!
INFORMATION TYPES
Factual Information (Wiki etc.)
Videos
Images
Forum Discussions
Question and Answers
News
Blogs
Structured Information
FUTURE OF SEARCH
First step towards providing multiple pivot
points for a topic or search
Need to make this conversational, stateful – like
talking to an expert on the topic..
TRANSIENT INTENT AND
PERSISTENT INTENT
TRANSIENT INTENT
Searching for a needle in the haystack
Exploring the haystack for a topic
PERSISTENT INTENT
Interested in the topic for a long time
Carnatic Music, Indian Cricket, Internet Industry, Venture
Capital
INFORMATION
Deliver information to the consumer
what they want
when they want
how they want
where they want
PERSONALIZED NEWSPAPER
My World is Changing
Cannot keep track of it
Can my world come to me?
MEDIA INDUSTRY AND INTERNET
Huge pressure on newspapers
More and more content online
Ad spending moving online
Reputed journalists have their own blogs
Content Production, Aggregation and Distribution are
becoming disaggregated
Vanilla online newspaper does not exploit what the
internet enables
Ability to personalize to nano interests
Publish a personalized newspaper for everyone any time
KEY TECHNOLOGY INGREDIENTS
Cloud Computing
Categorization
Relevance
CLOUD COMPUTING AT KOSMIX
Storage:
Biggest Productivity boost at Kosmix in the first year
Getting machines to be remotely rebooted!
KFS (Kosmix File System) further lowered the time
to make data accessible after machine failures
Computation:
Long Running Computations need to be broken into
small restartable/replayable components
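One way to read "small restartable/replayable components" is as chunked work with a checkpoint after every chunk. The sketch below is a minimal illustration of that idea and not a description of the Kosmix implementation; the function name `run_restartable`, the JSON checkpoint file, and the chunk size are all assumptions made for the example.

```python
import json
import os


def run_restartable(items, process, checkpoint_path, chunk_size=2):
    """Process items in small chunks, saving progress after each chunk.

    If the job dies, calling this function again resumes from the last
    completed chunk instead of starting over (hypothetical helper).
    """
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: pick up where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["next_index"] < len(items):
        start = state["next_index"]
        chunk = items[start:start + chunk_size]
        # Finish the whole chunk before touching the saved state, so a
        # failure mid-chunk means the chunk is simply replayed.
        chunk_results = [process(x) for x in chunk]
        state["results"].extend(chunk_results)
        state["next_index"] = start + len(chunk)

        # Write to a temporary file and rename it, so a crash never
        # leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["results"]
```

Replaying a chunk is only safe if `process` is idempotent or free of side effects, which is the usual price of this design.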
CLOUD COMPUTING AT KOSMIX
Computation Templates:
Most of the computation could be expressed as some
variant of a single table scan and some aggregate
operation (group by) -- called MapReduce by Google
MapReduce not friendly enough to non-programmers
SQL not powerful enough in many situations
Need a nice scripting language ..
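The "single table scan plus group by" template can be sketched in a few lines. This is an in-memory toy, not a distributed MapReduce; the `pages` table and the helper names `map_reduce`, `emit_host_words` and `total` are made up for illustration.

```python
from collections import defaultdict


def map_reduce(rows, map_fn, reduce_fn):
    """Scan rows once, group the emitted values by key, then aggregate."""
    groups = defaultdict(list)
    for row in rows:  # the single table scan
        for key, value in map_fn(row):
            groups[key].append(value)  # the "group by"
    # The aggregate operation, applied once per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical table of crawled pages.
pages = [
    {"host": "a.example", "words": 120},
    {"host": "b.example", "words": 300},
    {"host": "a.example", "words": 80},
]


def emit_host_words(row):
    # Map step: emit one (key, value) pair per row.
    yield row["host"], row["words"]


def total(key, values):
    # Reduce step: aggregate all values that share a key.
    return sum(values)


words_per_host = map_reduce(pages, emit_host_words, total)
```

In SQL terms this is roughly `SELECT host, SUM(words) FROM pages GROUP BY host`; the map and reduce functions are where arbitrary code can go when SQL is not expressive enough.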
OPPORTUNITY?
Many, many companies trying to provide
interesting web services
A gold mine of information in the web that can be
used by companies
Impractical for each of the companies to build a huge
web scale support system (crawling, indexing, KFS,
MapReduce etc.)
Further, most companies want slivers of the web
(typically category based slivers – health forums,
travel news sites etc.)
The Web and all the derived information is perhaps the
biggest database -- can someone make this
accessible and easy to use (using some pay-as-you-go
model), or is there perhaps some non-profit (academia?)
angle here?
CATEGORIZATION
Concept Space: space in which all connections
are made within Kosmix
Documents, Queries, External Modules,
Advertisements, People are all mapped to points
in this space and matched..
Internet Industry, Venture Capital documents need
to be mapped to these categories even if they don’t
contain the original words
CATEGORIZATION AT KOSMIX
Leverage human curated sources
Huge Automatically Curated Taxonomy
Wiki corpus is a major source of knowledge
6 million concepts
Building a Concept Graph with relationship
labels where possible
Use a web index to match short pieces of text
with concepts and use taxonomy to refine the
matches
RELEVANCE
Need to combine multiple signals into one
number to enable ranking
Say Query Relevance Score and Page Relevance
Score (text score and page rank)
Signals need to be made comparable
Normalization alone (making ranges the same) is not
enough
Need to reconcile different distributions
Deviations from the mean
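One common way to make signals comparable through "deviations from the mean" is to convert each one to a z-score before combining. The sketch below assumes that reading; the slides do not say which formula Kosmix used, and the equal weights and the function names `z_scores` and `combine` are placeholders.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine(text_scores, page_ranks, w_text=0.5, w_rank=0.5):
    """Combine two signals into one ranking number per document.

    Each signal is standardized on its own distribution first, so a
    wide-ranging or skewed signal does not dominate the weighted sum.
    The weights are illustrative assumptions.
    """
    zt = z_scores(text_scores)
    zr = z_scores(page_ranks)
    return [w_text * t + w_rank * r for t, r in zip(zt, zr)]
```

Rescaling both signals to the range 0 to 1 would leave their shapes untouched: a signal whose values cluster near zero with one outlier would still behave very differently from one that is evenly spread. Standardizing addresses spread as well as range, though heavily skewed signals may need a further transform such as a log or a rank.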
RELEVANCE
More data always beats smarter algorithms
Adding positions information in the index greatly
increases quality
Adding stemming saw a CTR rise of 10%
Adding anchors (and page rank) distinguished
Google
Adding origin of anchors (hosts) is a much better
measure of independent votes
Using demand-side popularity (Alexa, Quantcast)
complements web popularity
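The point about anchor origins can be illustrated by counting distinct linking hosts rather than raw links, so that many links from one site count as a single vote. This is a sketch under assumptions: the `(source_url, target_url)` input format, the name `host_votes`, and the choice to ignore same-host links are not from the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_votes(links):
    """Count, for each target URL, how many distinct hosts link to it.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    hosts_by_target = defaultdict(set)
    for source, target in links:
        source_host = urlsplit(source).hostname
        if source_host is None:
            continue  # skip malformed or relative source URLs
        if source_host == urlsplit(target).hostname:
            continue  # assumption: a site linking to itself is not a vote
        hosts_by_target[target].add(source_host)
    return {target: len(hosts) for target, hosts in hosts_by_target.items()}
```

A page with a thousand inbound links from one host then scores the same as a page with one link from that host, which is closer to the idea of independent votes.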
RELEVANCE
What is a news story?
Cluster news articles..
Use size of cluster as a measure of popularity
How does one do this efficiently?
Needs to be online since interests/queries are ad hoc
Need to combine some offline preclustering and online
methods
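A minimal sketch of the online half of this idea: assign each incoming headline to the most similar existing cluster, or start a new one, then rank stories by cluster size. Greedy single-pass clustering with Jaccard similarity is swapped in here purely for illustration; the slides do not name the method Kosmix used, and the threshold and tokenization are assumptions.

```python
import re


def tokens(text):
    """Lowercased words longer than three letters, as a crude term set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def jaccard(a, b):
    """Overlap of two term sets: size of intersection over size of union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_articles(headlines, threshold=0.3):
    """Greedy single-pass clustering; returns clusters, largest first."""
    clusters = []  # each cluster: {"terms": set, "members": list}
    for headline in headlines:
        terms = tokens(headline)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # Nothing similar enough: this headline starts a new story.
            clusters.append({"terms": set(terms), "members": [headline]})
        else:
            best["members"].append(headline)
            best["terms"] |= terms
    # Cluster size stands in for the popularity of the story.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)
```

Each new article is compared against every existing cluster, so the cost grows with the number of clusters. An offline preclustering pass could shrink that candidate set before the online step runs.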
SUMMARY
Consumer:
Internet has come a long way in terms of getting
information to people
Utopian goal of a smart, chatty expert still far away –
kosmix.com is a great first step
Need good tools to keep on top of the information
explosion – personalized newspaper (meehive.com) is
our first stab at this..
Technology:
Need to deal with large volume of data
Efficient Data Analysis and Annotation (e.g.,
Categorization)
Humming Next Gen Database System that grows
incrementally, is immune to failures, and is expressive for
non-programmers