Data Structures & Algorithms and The Internet:

Download Report

Transcript Data Structures & Algorithms and The Internet:

Data Structures & Algorithms
and The Internet:
A different way of thinking
Comparison Sort vs Bucket Sort
• What is bigger? The list of numbers or the
alphabet of possible numbers?
• O(n log n) vs. O(n+m)
Password cracking
• Is it easier to try to crack one account by
trying a million passwords,
• Or is it easier to try to crack a million
accounts by trying the same password?
Histograms and large vectors
• How do search engines represent documents for
searching/indexing?
• As a histogram of words, e.g. a vector of say 100000
entries for each document, containing word counts
• Term frequency – inverse document frequency:
enhances rare words, diminishes common words
• Math done on vectors to determine similarity / relevance
Good Enough is OK
• Not the fastest algorithm? Put more computers
on it. Cache results.
• Not the best results? Not a big deal if user can
try again.
• The data may be wrong/ambiguous anyway.
• Britney Spears / Paris Hilton
Graphs are everywhere
• Web is a graph.
Page rank algorithm says important/relevant pages are
linked to by important/relevant pages.
• Reverse lookup problem with web links, basically
building inverse edges to a directed graph
• Spam / porn links to each other… look for bad cliques?
• Social networking. Graph of friends. If you respond well
to an ad, maybe your friends will too.
Liars, spam, hackers, etc.
• Algorithms are easier when you can trust
your data
• Is this email spam?
• Does this website have useful content?
• Can you make it where if you do N work to
detect spammers, they have to do N^2
work to avoid you?
Sample Server Architecture
cache1
front1
back1
back4
back2
back5
back3
back6
cache2
Load
balancer
front2
cache3
front3
cache4
Denial of Service
• A little amount of work by a client causes a
lot of work for the server, slowing down or
blocking out other users
• Either malicious or accidental
• Example: fast queries to a search engine
that cause cache misses
• How to catch?
Log IP addresses
• Block addresses causing too many cache
misses within 10 sec, 1 min, 1 hr
• They may be hitting different front ends
• HashMap of IP addresses to counters?
• But how many IP addresses?
Data structures are too large for
one computer
• Store pieces of data structures across
many computers
• How do you know what is where?
• One solution is a master computer serving
as an index, but this is a single point of
failure
Partitioning
• How to divide 4 billion IP addresses among
several machines?
• We need to know deterministically, given an IP,
what machine to look it up on?
• Use a hash function that knocks it down to the
four machines, i.e. hash has good entropy on
AAA.BBB.CCC.DDD
• Sometimes can also use a small set of sorted
data to divide into ranges that should evenly
distribute the larger data
Captchkas
• Computation easy for person but hard for
computer, vision-based problems
• How to stop computers from getting fake
email addresses? Make it hard to
automatically solve a puzzle required for
registration
• How do spammers bypass? Pay 3rd-world
folks to solve puzzles all day, or offer free
porn for solving puzzles
Databases
• Persistent storage for your data
• Let’s you do complex queries using a query
language
• Can add a layer of abstraction over sets, lists,
hashes, etc.
• Boolean/set logic very important
Redundancy and robustness
•
•
•
•
Is your data fine if a computer goes down
Is your server fine if a computer goes down
Can you undo/revert changes?
If something changes somewhere, how
does everyone know about it?
• Average runtime vs guaranteed runtime
Scaling to more users/data
• Is growth as simple as adding another
machine? (e.g. linear time / space)
• If you get dugg, can you handle a burst of
users?
• Can you fail softly?