
C20.0046: Database Management Systems
Lecture #23
M.P. Johnson
Stern School of Business, NYU
Spring, 2008
Agenda

- Ads
- Data mining
- RAID?
- Regexes?
- Set Theory?
- Indices?
- Etc.?
Fixing PR bugs

Apart from inherited quality, each page also has inherent quality E:

  PR(P) = E + SUM( PR(Li)/C(Li) )

More precisely, take a weighted sum of the two terms:

  PR(P) = .15*E + .85*SUM( PR(Li)/C(Li) )

sales% java PageRank3

Leads to a modified random surfer model (a rough sketch of the computation follows)
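For concreteness, here is a minimal Python sketch of the weighted-sum computation above. This is not the course's PageRank3 program; the uniform E and the toy link graph are assumptions for illustration.

# Minimal PageRank sketch with an inherent-quality term E (assumed uniform here).
# links[p] = list of pages that p links to; C(p) = len(links[p]).
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # initial ranks
    e = {p: 1.0 / n for p in pages}           # inherent quality E(p), assumed uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            # sum PR(L)/C(L) over all pages L that link to p
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) * e[p] + d * inbound
        pr = new
    return pr

if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy))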
Anchor text

Motivation: pages may not give the best descriptions of themselves
  - most search engines' pages don't contain the phrase "search engine"
  - Brin & Page's claim: only 1 of 4 "top search engines" could find itself on the query "search engine"

Anchor text offers more descriptions of the page:
  - many pages link to google.com
  - many of them likely say "search engine" in/near the link
  => Treat anchor text words as part of the page

Search for "US West" or for "g++"
Next: Bidding for ads
Google had two big ideas:
  1. PageRank
  2. AdWords/AdSense

Fundamental difficulty with mass-advertising:
  - Most of the audience isn't interested
  - Most people don't want what you're selling
  - Think of car commercials on TV
  - But some of them do!
Bidding for ads

If you're selling widgets, how do you know who wants them?
  - Hard question, so invert the question

If someone is searching for widgets, what should you try to sell them?
  - Easy – widgets!

Whatever the user searches for, display ads relevant to that query
  - Or: whatever's on the page he's viewing
Bidding for ads

Q: How to divvy up these correspondences?
A: Create a market, and let the divvying take care of itself
  - Each company places the bid it's willing to pay for an ad responding to a particular query
  - Ad auction "takes place" at query time
  - Relevant ads displayed in descending bid order (e.g.)
  - Company pays only if the user clicks

Huge, huge, huge business

How to assign ads?

On each query, must assign a bidder to the top place
  - Matching problem
  - Online problem
  - Many-to-one

Also: each bidder has a daily budget

One idea: choose the highest bid (in expectation)
  - Can't be better than a ½-approx (why?)

Another: choose the highest remaining budget
  - Definitely suboptimal (why?)

"Trade-off" algorithm: 1-1/e-approx (~.6321) (sketched below)
  - Best possible guarantee (see Vazirani et al.)
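The slide doesn't spell out the trade-off rule, but the 1-1/e bound points to the Mehta–Saberi–Vazirani–Vazirani style of scaling each bid by how much of the bidder's budget remains. A hypothetical Python sketch under that assumption (the bidder names and numbers are made up):

import math

# Hypothetical budget-aware "trade-off" rule: scale each bid by
# psi(f) = 1 - e^(-(1 - f)), where f is the fraction of budget already spent,
# then give the query to the bidder with the best scaled bid.
def assign_query(bids, budgets, spent):
    """bids/budgets/spent: dicts keyed by bidder, for the current query."""
    def score(b):
        if spent[b] + bids[b] > budgets[b]:       # can't afford another click
            return -1.0
        frac_spent = spent[b] / budgets[b]
        return bids[b] * (1 - math.exp(-(1 - frac_spent)))
    winner = max(bids, key=score)
    if score(winner) < 0:
        return None                                # everyone is out of budget
    spent[winner] += bids[winner]                  # in reality, charged only on a click
    return winner

bids = {"acme": 0.50, "widgetco": 0.40}
budgets = {"acme": 100.0, "widgetco": 100.0}
spent = {"acme": 0.0, "widgetco": 0.0}
print(assign_query(bids, budgets, spent))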
Click Fraud

Medium-sized challenge: users who click on ad links to cost their competitors money
  - Or pay housewives in India $.25/click

http://online.wsj.com/public/article/0,,SB111275037030799121k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article
http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info

See sources drawn upon here:
  - Prof. Davis's (NYU/CS) search engines course
    http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
  - Original research papers by Page & Brin:
    - The PageRank Citation Ranking: Bringing Order to the Web
    - The Anatomy of a Large-Scale Hypertextual Web Search Engine
    - Links on class page
    - Interesting and very accessible

Why search the web when you could download it?
  - Webaroo
New topic: Data Mining

Situation: data rich but knowledge poor
  - Terabytes and terabytes of data
  - Searching for needles in haystacks

Can collect lots of data:
  - credit card transactions
  - bar codes
  - club cards
  - webserver logs
  - call centers
  - cameras
  - 311 calls in NYC
    http://blogs.law.harvard.edu/jim/discuss/msgReader$351
    http://www.nytimes.com/2003/12/01/nyregion/01PHON.html?ex=1385701200&en=0d74f2f58599fe35&ei=5007&partner=USERLAND
Lots of data

Can store this data
  - DBMSs, data warehousing

Can query it
  - SQL
  - DW/OLAP queries (CUBE, ROLLUP, etc.)
  - "find the things such that…"
  - "find the totals for each of…"

Can get answers to specific questions, but what does it all mean?
But how to learn from this data?

- What kinds of people will buy my widgets?
- Whom should I send my direct-mail literature to?
- How can I segment my market?
  http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
- Whom should we approve for a gold card?
- Who gets approved for car insurance?
- And what should we charge?
ML/DM: two basic kinds
1. Supervised learning
   - Classification
   - Regression

2. Unsupervised learning
   - Clustering
   - Find something interesting, frequent sets…
Supervised learning

Situation: you have many particular instances
  - Transactions
  - Customers
  - Credit applications

Have various fields describing them, but one other property you're interested in

Goal: infer this dependent property from the other data
Supervised learning

Supervised learning starts with training data
  - Many instances including the dependent property
  - And ideally, the answers are "correct"

Use the training data to build a model
  - Many different algorithms
  - GAs, NNs, decision trees, etc.

Given the model, you can now determine the dependent property for new instances
  - Categorical property → "classification"
  - Numerical property → "regression"
Classification applications

Construe as inference of one property from others:
  - Spam filtering
  - Network intrusion detection
  - Trading strategies
  - Fraud detection
    - See also: Zipf's & Benford's laws
  - Political marketing?
    - Mail-order tulip bulbs → Conservative party voters
k-Nearest Neighbor

Very simple algorithm
  - Sometimes works well

Map training data points into a space
  - Given any two points, can compute the distance between them
  - e.g., Euclidean distance

Given an unlabeled point, label it this way:
  - Find the k nearest points
  - Let them vote on the label (see the sketch below)
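A minimal Python sketch of the procedure above; the toy points and labels are made up for illustration, and Euclidean distance is used as on the slide:

from collections import Counter
import math

# Toy k-NN: each training example is (point, label); points are numeric tuples.
def knn_label(train, query, k=3):
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # find the k nearest labeled points, then let them vote
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "spam"), ((1, 2), "spam"), ((8, 9), "good"), ((9, 8), "good")]
print(knn_label(train, (2, 1)))   # -> "spam"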
Naïve Bayes Classifier

Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A) = Pr(A|B)*Pr(B)/Pr(A)

Idea: Pr(B|A) can be defined in terms of Pr(A|B)

Here:
  Pr(S|W) = Pr(S,W)/Pr(W) = Pr(W|S)*Pr(S)/Pr(W)
  - W means: "the msg has the word W (or several words…)"
  - S means: "it's spam"

Easy: We can (mathematically) compute Pr(W|S)
Goal: We want to get Pr(S|W) – i.e., is this spam?
Naïve Bayes Classifier

This is supervised learning, but training and use are intermingled
  - Spam filter improves over time
  - But let's assume we're given labeled sets of spam and good messages

Training phase:
  - For each word Wi, compute Pr(Wi)
  - For each word Wi, compute Pr(Wi|S)
  - Compute Pr(S)
  - That's it!

Now, we wait for email to arrive…
Naïve Bayes Classifier

Msgs may have many words W = W1..Wn

So we need to predict Pr(S|W) = Pr(S|W1..Wn) = Pr(W1..Wn|S)*Pr(S)/Pr(W1..Wn)

Assuming total independence*, we get
  - Pr(W1..Wn) = Pr(W1)*…*Pr(Wn)
  - Pr(W1..Wn|S) = Pr(W1|S)*…*Pr(Wn|S)

We've precomputed the numbers on the right, so now we can compute what we need (see the sketch below)

* Which isn't true…
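A toy Python sketch of the training and scoring just described. The word lists and the +1 smoothing are illustrative assumptions, and this version expands Pr(W1..Wn) as Pr(W|S)Pr(S) + Pr(W|not S)Pr(not S) rather than as the product of the Pr(Wi); both rest on the same independence assumption.

from collections import Counter

# Toy Naive Bayes spam scorer: train on labeled word lists, then score a new msg.
def train(spam_msgs, good_msgs):
    n_spam, n_good = len(spam_msgs), len(good_msgs)
    pr_s = n_spam / (n_spam + n_good)                     # Pr(S)
    spam_counts = Counter(w for m in spam_msgs for w in set(m))
    good_counts = Counter(w for m in good_msgs for w in set(m))
    vocab = set(spam_counts) | set(good_counts)
    # Pr(W|S) and Pr(W|not S), with +1 smoothing so unseen words don't zero out
    pr_w_s = {w: (spam_counts[w] + 1) / (n_spam + 2) for w in vocab}
    pr_w_g = {w: (good_counts[w] + 1) / (n_good + 2) for w in vocab}
    return pr_s, pr_w_s, pr_w_g

def pr_spam(msg, model):
    pr_s, pr_w_s, pr_w_g = model
    num, den = pr_s, 1 - pr_s
    for w in set(msg):
        if w in pr_w_s:                                   # ignore words never seen in training
            num *= pr_w_s[w]                              # builds Pr(W1..Wn|S)*Pr(S)
            den *= pr_w_g[w]                              # builds Pr(W1..Wn|~S)*Pr(~S)
    return num / (num + den)                              # = Pr(S|W1..Wn)

spam = [["viagra", "free"], ["free", "money"]]
good = [["meeting", "tomorrow"], ["lunch", "tomorrow"]]
model = train(spam, good)
print(pr_spam(["free", "viagra"], model))                 # ≈ 0.86 -> probably spam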
Naïve Bayes Classifier

To decide the spam status of a new msg, we just compute Pr(S|W)
  - If this is > 90% (say), call it spam
  - False positives are very bad (why?)

For each new message (spam/not), we update our probabilities

Simple and unprincipled, but works very well
  - Can write it in a page of Perl code
  - See also Paul Graham: http://www.paulgraham.com/spam.html
New topic: Clustering

Unsupervised learning
  - Partition items into clusters of the same type
  - What different types of customers do we have?
  - What different types of webpages are there?

Each item becomes a point in the space
  - One dimension for every property
  - Possible plotting for webpages: a dimension for each word
  - Value in dimension d is (# occurrences of d) / (# of all word occurrences) (see the sketch below)

k-means
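A small Python sketch of the word-frequency plotting just described; the sample pages and vocabulary are made up for illustration:

from collections import Counter

# Turn each webpage (here, a list of words) into a point: one dimension per word,
# with value = (# occurrences of the word) / (# of all word occurrences on the page).
def to_point(words, vocab):
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total for w in vocab]

pages = [["cheap", "widgets", "cheap"], ["database", "systems", "database", "lecture"]]
vocab = sorted(set(w for p in pages for w in p))
points = [to_point(p, vocab) for p in pages]
print(vocab)
print(points)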
k-means

Simple clustering algorithm

Want to partition data points into sets of "similar" instances
  - Plot them in n-space
  => partition points into clumps in space

Like an unsupervised analog to k-NN
  - But iterative
k-means

Alg:
  Choose initial means (centers)
  do {
      assign each point to the closest mean
      recompute the means of the clumps
  } until change in means < e

(a runnable sketch follows below)

Visualization:
  http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
  Gone!
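A minimal Python version of the pseudocode above; the 2-D points, the use of Euclidean distance via math.dist, and the starting centers are illustrative choices:

import math

def kmeans(points, centers, eps=1e-6):
    """Iterate: assign each point to its closest mean, then recompute the means."""
    while True:
        # assignment step: put each point in the clump of its closest center
        clumps = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clumps[i].append(p)
        # update step: mean of each clump (keep the old center if a clump is empty)
        new_centers = [
            tuple(sum(c) / len(clump) for c in zip(*clump)) if clump else centers[i]
            for i, clump in enumerate(clumps)
        ]
        change = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if change < eps:                    # "until change in means < e"
            return centers, clumps

pts = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, centers=[(0, 0), (10, 10)]))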
New topic: Frequent Sets

Find sets of items that go together
Intuition: market-basket data

Suppose you're Wal-Mart, with 460TB of data
  http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
  - Might not know customer history (but maybe: club cards!)
  - One thing you do know: groupings of items
    - Each set of items rung up in an individual purchase

Famous (claimed) example: beer and diapers are positively correlated
  - Intuition: fathers can't take babies to the bar
  => colocate beer and diapers
Finding Frequent Sets

Situation: one table
  - Baskets(id, item)
  - One row for each purchase of each item
  - Alternative interpretation: words on webpages

Goal: given support s, find sets of items that appear together in >= s baskets

For pairs, the obvious attempt:

SELECT B1.item, B2.item, COUNT(B1.id)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id
  AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(B1.id) >= s;

Can we do better?
A Priori Property

a priori ~ prior to
  - Prior to investigation
  - Deductive / from reason / non-empirical

As opposed to a posteriori ~ posterior to
  - After investigation/experience

Logical observation: if support(X) >= s, then for every individual item x in X, support(x) >= s
  => any item y with support < s can be ignored
A Priori e.g.
You sell 10,000 items; the store's records contain 1,000,000 baskets, with an average of 4 items each
  => 4,000,000 rows in Baskets

# pairs of items from 1,000,000 baskets = C(4,2)*1,000,000 = 6,000,000 pairs!

One idea:
  1. Find the support-s items
  2. Run the old query on them
A Priori Property

Suppose we're looking for sup-10,000 pairs
  - Each item in the pair must be sup-10,000

Counting argument: only 4,000,000 individual items were purchased
  - So there can be at most 400 = 4,000,000/10,000 popular items
  => we get to ignore most of the rows

The naïve solution above is now much cheaper

Can we do better?
A Priori Algorithm

But: a frequent itemset may have > 2 items
Idea: build frequent sets (not just pairs) iteratively

Alg (see the sketch below):
  First, find all frequent items
    - These are the size-1 frequent sets
  Next, for each size-k frequent set:
    For each frequent item:
      - Check whether the union is frequent
      - If so, add it to the collection of size-(k+1) frequent sets
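A small Python sketch of this iterative scheme. The baskets are made up, and it grows each frequent set one frequent item at a time, exactly as outlined above, rather than using the full Apriori candidate join:

def frequent_sets(baskets, s):
    """Return all itemsets appearing in at least s baskets, built level by level."""
    baskets = [set(b) for b in baskets]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    # size-1 frequent sets: the frequent items themselves
    items = {i for b in baskets for i in b}
    freq_items = {i for i in items if support({i}) >= s}
    level = [frozenset([i]) for i in freq_items]
    result = list(level)

    # grow each size-k frequent set by one frequent item at a time
    while level:
        next_level = set()
        for fs in level:
            for i in freq_items - fs:
                candidate = fs | {i}
                if support(candidate) >= s:
                    next_level.add(frozenset(candidate))
        result.extend(next_level)
        level = list(next_level)
    return result

baskets = [["beer", "diapers", "chips"], ["beer", "diapers"], ["beer", "chips"], ["diapers"]]
print(frequent_sets(baskets, s=2))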
Related goal: rules

Would like to infer: bought A → interested in B

Related to frequent sets, but now:
  - A,B should be frequent (have high support), and
  - the rule should be true (have high confidence)

More advanced algorithms… (a small confidence check is sketched below)
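For concreteness, a tiny Python sketch of the standard support/confidence definitions behind such rules; the basket data is made up:

def support(itemset, baskets):
    """Number of baskets containing every item in itemset."""
    return sum(1 for b in baskets if set(itemset) <= set(b))

def confidence(a, b, baskets):
    """Of the baskets that bought A, what fraction also bought B?"""
    return support(set(a) | set(b), baskets) / support(a, baskets)

baskets = [["beer", "diapers"], ["beer", "chips"], ["beer", "diapers", "chips"]]
print(confidence({"beer"}, {"diapers"}, baskets))   # 2/3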
Next: Media Failures

Rule of thumb: Pr(hard drive has a head crash within 10 years) = 50%

Simpler rule of thumb: Pr(hard drive has a head crash within 1 year) = (say) 10%
  - If you have many drives, then failure is a regular occurrence

Solution: different RAID strategies
  - RAID: Redundant Arrays of Independent Disks
RAID levels

RAID level 1: each disk gets a mirror

RAID level 4: one disk is the xor of all the others
  - Each bit is the sum mod 2 of the corresponding bits
  - E.g.:
      Disk 1: 10110011
      Disk 2: 10101010
      Disk 3: 00111000
      Disk 4:

How to recover? (see the sketch below)

What's the disadvantage of R4?

Various other RAID levels in the text…
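A quick Python check of the xor-parity idea: it computes the parity disk for the example bits above and rebuilds a "lost" disk by xoring the survivors:

# RAID-4 style parity: the parity disk is the bitwise xor (sum mod 2) of the data disks.
def xor_disks(*disks):
    return "".join(str(sum(int(bits[i]) for bits in disks) % 2)
                   for i in range(len(disks[0])))

d1, d2, d3 = "10110011", "10101010", "00111000"
parity = xor_disks(d1, d2, d3)          # this is what Disk 4 would hold
print("Disk 4:", parity)

# If Disk 2 dies, xor the surviving disks with the parity to rebuild it
recovered = xor_disks(d1, d3, parity)
print("Recovered Disk 2:", recovered, recovered == d2)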
New topic: Regular Expressions

Finite Automata are studied in automata theory
  - FAs are the simplest/weakest kind of computer
  - Turing Machines are the strongest
  - Chomsky's Hierarchy

Each FA is equivalent to some regular expression
  - Expressions that specify a pattern
  - Can check whether a string matches the pattern
  - The regex generates words; the FA accepts them
RegEx matching

Use REGEXP_LIKE in Oracle
  - The metacharacter for any single character is .

First, get the employee_comment table:
  http://pages.stern.nyu.edu/~mjohnson/oracle/empcomm.sql
  - And install it…
Working with SQL*Plus

Command on sales: sqlplus
  - See my login.sql for settings…

Run a local file of scripts:
  SQL> @filename

Get the list of my tables:
  SQL> select * from cat;

Describe a table:
  SQL> describe table-name
RegEx matching

Now do a search:

SELECT emp_id, text
FROM employee_comment
WHERE REGEXP_LIKE(text, '...-....');

So far, a lot like LIKE
RegEx matching

Can also pull out the matching text with REGEXP_SUBSTR:

SELECT emp_id,
       REGEXP_SUBSTR(text, '...-....') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '...-....');

If we want only numbers, we can specify a set of chars rather than a dot:

SELECT emp_id,
       REGEXP_SUBSTR(text, '[0123456789]..-...[0123456789]') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0123456789]..-...[0123456789]');
RegEx matching

Or we can specify a range of chars:

SELECT emp_id,
       REGEXP_SUBSTR(text, '[0-9]..-...[0-9]') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]..-...[0-9]');

Or, finally, we can repeat a match several times:

SELECT emp_id,
       REGEXP_SUBSTR(text, '[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{4}');
RegEx matching

Other operators:
  - *  0 or more matches
  - +  1 or more matches
  - ?  0 or 1 match
  - |  OR two options together

Here: we should catch area codes, if they appear:

SELECT emp_id,
       REGEXP_SUBSTR(text, '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}');
RegEx matching

There's a shared structure between the two alternatives, but the area code is just optional
  - Can use the ? operator
  - Must use parens to indicate that the whole area-code regex is optional

SELECT emp_id,
       REGEXP_SUBSTR(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}');
RegEx matching

Also, there are different kinds of phone-number separators: dash, dot, blank

One solution:
  - Write a RegEx for each case
  - OR them all together

Better: just use a set of choices for each separator
  - Characters in [ ] mean "match any one of these":

SELECT emp_id,
       REGEXP_SUBSTR(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}');
RegEx matching

One other thing: area codes in parentheses
  - Parens are part of the regex language
  - So these parens must be escaped: \( \)

SELECT emp_id,
       REGEXP_SUBSTR(text, '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}');
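For comparison outside Oracle, the same final pattern works with Python's standard re module; the sample strings here are made up:

import re

# Same phone-number pattern as the Oracle query: optional "xxx-" / "(xxx) " area
# code, then xxx<sep>xxxx, where <sep> is a dash, dot, or blank.
phone = re.compile(r'([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}')

for text in ["call me at (212) 555-1234", "home: 555.6789", "no number here"]:
    m = phone.search(text)
    print(text, "->", m.group(0) if m else None)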
Future

- Project Presentations
- Practice hw3
- Final Exam
  - Info up soon…