MFM Observation Of Magnetization Reversal Process In

Download Report

Transcript MFM Observation Of Magnetization Reversal Process In

Finding Needles in the Internet Haystack
Ron K. Cytron
Washington University in Saint Louis
Department of Computer Science
http://www.cs.wustl.edu/~cytron/
Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood,
George Varghese (UCSD)
Mahesh Jayaram
Thanks: Ben Brodie
Center for Distributed Object Computing
Department of Computer Science
Washington University
Century Club May 2002
Outline
• Computers have come a long way
Outline
• Computers have come a long way
• Today’s computers are never lonely
Outline
• Computers have come a long way
• Today’s computers are never lonely
• Volumes and volumes of data
Outline
•
•
•
•
Computers have come a long way
Today’s computers are never lonely
Volumes and volumes of data
Fast searching of magnetic media
needle
needel
needle
Outline
•
•
•
•
•
Computers have come a long way
Today’s computers are never lonely
Volumes and volumes of data
Fast searching of magnetic media
Internet packet filtering
Outline
•
•
•
•
•
•
Computers have come a long way
Today’s computers are never lonely
Volumes and volumes of data
Fast searching of magnetic media
Internet packet filtering
Conclusion
A Grandchild’s Gift
1966
Cost:
$60
Cost:
Memory ½ char
$35
Memory 16 M chars
1999
Speed:
1 cycle/s
Speed:
16 M cycles/s
Fails:
10 seconds
Fails:
5 years
If cars improved that much in 30 years …
•
•
•
•
•
$4000
60,000 miles per hour
Seats 10,000 people
Gets 20,000 miles per gallon
Breaks every 70 years
The Haystack
• The Internet is large
and growing
• Content on the
Internet is growing
even faster
• A haystack sits still,
but the Internet….
Growth of the Internet
(why computers aren’t lonely anymore)
Interconnected Computers
3,500,000
3,000,000
Y2K Problem (?):
2,500,000
More computers sold than
TVs
2,000,000
1,500,000
1,000,000
500,000
0
1969
1971
1973
1977
1983
Year
1991
1993
1994
Growth of Internet Content
(volumes and volumes of data)
Articles per Day
30,000
25,000
20,000
15,000
Anybody can publish
Problem is how to find
what you want
10,000
5,000
0
1979
1980
1988
Year
1993
Page 6B
9/17/2001
What can tech companies do? Some
say they're at a loss, but others offer
budding solutions
By Kevin Maney
On July 7, 1940, as the nation edged toward World War II, IBM put out a
statement that made headlines. The company offered all its facilities for
national defense, ready to convert to making anything the government
needed.
Other leaders in the electro-mechanical technology of the day -- Ford
Motor, General Motors, General Electric -- also threw their weight into
defense efforts. They switched from making cars and washing machines
to building tanks, aircraft engines and machine guns.
So here we are in 2001, readying for another war. The U.S. technology
industry is the best and most innovative in the world. It is the nation's
pride and joy.
Shouldn't it do something?
...
One possibility is in data-mining technology. Data mining is
a way to collect millions of pieces of information in a
computer system, sift through that data, make sense of
them and come up with something useful. ''We (the U.S.
tech industry) are experts at data mining and have vast
resources of data to mine,'' says Tom Evslin, CEO of
Internet communications company ITXC. ''We have used it
to target advertising. We can probably use it to identify
suspicious activity or potential terrorists.''
...
Fast searching of magnetic media
with
Roger Chamberlain,
Mark Franklin, Ron Indeck,
John Lockwood
Enabling Technology: Disk Drives
Almost 10,000,000x increase in 45 years!
Magnetic disk storage areal density
vs. year of IBM product introduction
(From D. A. Thompson)
Cost per Megabyte
Cost decreasing 3% per week!
Price history of hard disk product
vs. year of product introduction
(From D. A. Thompson)
Massive Storage & Data
• Storage industry will ship
4,000,000,000,000,000,000 Bytes this year
• FedEx generated 14 Terabytes of data last
year
• US intelligence collects data equaling the
printed collection of the US library every
day!
Massive Data Sets
• Employee records
• Consumer information
• Maps/mission/intelligence data
• Genome maps
 Data sets now measured in Terabytes, and
are dynamic!
Genome Application
• Genome maps growing expanded daily
– Wash U sequencing center
– Each of us has 80,000 genes found among 3 billion
characters of DNA (A,C,G,T)
• Look for matches
– Identify function
– Disease: understand, diagnose, detect, medicine,
therapy
– Biofuels, warfare, toxic waste
– Understand evolution
– Forensics, organ donors, authentication
– More effective crops, disease resistance
DNA String Matching
• Looking for CACGTTAGT…TAGC
• Interested in matches and near matches
• Search human genome and other gene oceans
– Need to search entire data sets
Bio Computation Problem
*BIG*
Genome
Databases
A C G
T G
DNA sequence
T A C
A G
DNA pattern
Approximate matches are just as useful
Match?
Finding a needel in a heystuck
• DNA and live text can contain errors
• We often seek an approximate match, for
example
needle
• No match? Try 2-transpositions
enedle, needle, nedele, neelde, needel
• No match? Try 1-deletions
eedle, nedle, nedle, neele, neede, needl
• No match? Try insertions, larger edits, …
• An exponential number of possibilities
How is this done today?
• Think of every way a word can be misspelled
• Present each misspelling to the computer for an
exact match
enedle
needle
nedele
neelde
needel
Yes
No
How can we do better?
• Data is present on magnetic media
• Hardware at the disk is
– Already fault tolerant (more on this later)
needel
needel  needle
– Distributed across all surfaces
We win if number of misspellings is large,
and the number of false hits is small
needle
Another Application:Intelligence Data
• Lots of data
• Changing constantly
• Many perturbations
– Tzar, tsar, czar, . . .
• Don’t know what we want to look for
beforehand
Google Search Engine
• Crawls the web once per month
• Caches web pages
• Fast, exact text-based search (see how soon)
needle
needel
needle
Image Database Applications
•
•
•
•
Challenging database
Unstructured
Massive data sets
Don’t know what we need to look for in
each picture
Satellite Data
• Low-orbit fly-over every 90 minutes
• Look for differences in images
– Large objects
– Troops
– Changes to landscape
• Flag, transmit these differences immediately
• National Reconnaissance Office
• City assessors . . .
Washington University
Hilltop Campus
How do we find what we’re
looking for?!
Conventional Structured Database
Did
1
2
3
4
Document
Agent James Bond
Agent mobile computer
James Madison movie
James Bond movie
Word
Inverted list - pointers
agent
<1,2>
Bond
<1,4>
computer
<2>
James
<1,3,4>
Madison
<3>
mobile
<2>
movie
<3,4>
Challenges in Searching
Massive Databases
Know what to search for
– need to build index beforehand
– maintain index as it changes
Do not know what to search for
– need to search the whole database!
Conventional Search
Processor
Hard drive
Memory bus
I/O bus
Memory
Conventional Search
find ….
Processor
Hard drive
Memory bus
I/O bus
Memory
Conventional Search
yes, no, no,
yes, yes ….
Processor
contents
Hard drive
Memory bus
I/O bus
Memory
Conventional Approach
WUSTL’s Approach
Streaming Approach
Processor
Hard drive
Reconfigurable
hardware
Memory/
processing
Memory
Bus
Memory
I/O bus
Streaming Approach
find
Processor
Hard drive
Reconfigurable
hardware
Memory/
processing
Memory
Bus
Memory
I/O bus
Streaming Approach
Processor
find
Hard drive
Reconfigurable
hardware
Memory/
processing
Memory
Bus
Memory
I/O bus
Streaming Approach
yes, no, no,
yes, yes
Processor
find
Hard drive
Reconfigurable
hardware
Memory/
processing
Memory
Bus
Memory
I/O bus
Parallelism through each transducer and drive
Magnetic Recording Channel
Schematic
Channel Bits
Input
User
Data
Head
Disk
Encoder
A
To Bus
or Cache
Decoded
User
Data C
Decoder
Detector
B
Analog
Readback
Key streaming over Data
Disk Level Implementation
matches
s
c
o
r
e
100-bit-key matching through a pseudo-random binary series
Status: Prototype in progress
Host
ATAPI
Controller
FPX
RAD
Loopback module
IDE bus
15bit CTRL
Tap
NID
32 RAD
test pins
IDE_to_ATM
module
module
16bit Data
Hard drive
Custom PCB for
Electrical Termination &
5V to 3.3V Conversion
Setup reused
from FPX
Internet Packet Filtering
with
Mahesh Jayaram
and
George Varghese
Finding Needles in a Moving
Haystack
Cost of Internet Request
As technology improves, transmission time decreases but
latency stays the same
Transmission
Time
Latency
Year
Example: Garden Hose
Bandwidth ~ hose diameter
Water Supply
Latency (first drop) ~ distance
Fire department and gardener suffer the same wait
Example: Hot Shower
You want this water
Latency (time to get hot water) ~ distance
Latency-Free Hot Shower
Convection circuit continuously
circulates hot water
Latency ~ 0
Better to receive than to give
•
•
•
•
•
Cable broadcast
Radio broadcast
TV guide channel
Gate connection announcements in flight
Winning lottery number
Modern name: push technology
Better to receive than to give
How do you get what you want?
Packet Filters
Filter F
(Weather)
Packet Filters
Filter F
(Weather)
Existing Approach
IBM Quote
Weather
Flight Schedule
Our approach
Flight
IBM
Weather
Schedule
Quote
Composite filter makes just one pass
How we do it
IBM Quote
Grammar 1
Weather
Grammar 2
Flight Schedule
Grammar 3
Parsing Engine
Sample grammar for TCP packet
TCPConnHeader : EtherType IPHeader
TCPPortPair
EtherType : #IP_TYPE
IPHeader : Vers HlenPlusRest
Vers : HalfByte
HlenPlusRest : 0 1 0 1 FixedRest
| 0 1 1 0 FixedRest OneIPOption
| 0 1 1 1 FixedRest TwoIPOption
| 1 0 0 0 FixedRest ThreeIPOption
| 1 0 0 1 FixedRest FourIPOption
| 1 0 1 0 FixedRest FiveIPOption
| 1 0 1 1 FixedRest FiveIPOption OneIPOption
| 1 1 0 0 FixedRest FiveIPOption TwoIPOption
| 1 1 0 1 FixedRest FiveIPOption ThreeIPOption
| 1 1 1 0 FixedRest FiveIPOption FourIPOption
| 1 1 1 1 FixedRest FiveIPOption FiveIPOption
ServiceType : Byte
TotalLength : TwoByte
Identification : TwoByte
Flags : bit bit bit
FragmentOffset : bit Byte HalfByte
TimeToLive : Byte
Protocol : #TCP_PROTOCOL
HeaderChecksum : TwoByte
IPAddrPair : #IP_SRC_DST_PAIR
FiveIPOption : ThreeIPOption TwoIPOption
FourIPOption : TwoIPOption TwoIPOption
ThreeIPOption : TwoIPOption OneIPOption
TwoIPOption : OneIPOption OneIPOption
OneIPOption : Option Padding
Option : ThreeByte
Padding : Byte
TCPPortPair : #TCP_PORT_PAIR
FourByte : TwoByte TwoByte
ThreeByte : TwoByte Byte
TwoByte : Byte Byte
FixedRest : ServiceType TotalLength
Identification Flags
FragmentOffset TimeToLive
Protocol HeaderChecksum
IPAddrPair
Byte : HalfByte HalfByte
HalfByte : bit bit bit bit
bit : 0
|1
Results
The more things you want, the
slower existing approaches get
Our performance doesn’t degrade
Conclusions
• The Internet and its content are growing
explosively
• Disk storage is abundant, cheap, reliable
• Technology must provide fast, inexact
searching of text and images
• As more data is hurled at and past us, fast
filtering of Internet traffic is a must
Questions?