Other Topics in Experiment
Design
CS 239
Experimental Methodologies for
System Software
Peter Reiher
May 17, 2007
CS 239, Spring 2007
Lecture 12
Page 1
Outline
• Experiment order and randomization
• Important traces
• Useful models for experimentation
Randomization of
Experimental Order
• Uncontrollable parameters may vary during
experimentation
– In non-random ways
• Plotting error vs. experiment number detects
this
– But doesn’t control it
• Randomization controls the problem
– Order effects become part of the random error term
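A minimal sketch of the error-versus-experiment-number check, assuming each run is recorded as a (level, measurement) pair in the order it was executed. The function name and the sample data are hypothetical, and the error is taken here to be the measurement minus the mean of its level.

```python
from collections import defaultdict
from statistics import mean

def residuals_in_run_order(runs):
    """Return each run's error (measurement minus its level's mean), in execution order."""
    by_level = defaultdict(list)
    for level, value in runs:
        by_level[level].append(value)
    level_mean = {level: mean(values) for level, values in by_level.items()}
    return [value - level_mean[level] for level, value in runs]

# Hypothetical data: all A runs first, then all B runs.
runs = [("A", 10.0), ("A", 12.0), ("A", 11.0),
        ("B", 20.0), ("B", 20.5), ("B", 19.5)]
errors = residuals_in_run_order(runs)
for number, error in enumerate(errors, start=1):
    print(number, error)
```

Plotting the printed pairs (experiment number on the x-axis, error on the y-axis) shows whether the spread of the errors changes as the experiment proceeds.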
An Example
[Chart: error vs. experiment number; x-axis 0 to 18, y-axis -1 to 1]
• Data from sample one factor experiment
with replications
• Assumed order is all A levels first, then B
levels, etc.
What Does This Chart Tell Us?
• Bigger errors for early replications of the
experiment
• Eventually settling down to a narrow range
• So maybe our A experiments observed some
different conditions than later experiments
• Might get different results if A experiments
were run last, instead of first
Why Might This Kind of Thing
Happen?
• Consider measuring disk performance:
– Benchmark creates 1000 small files, 10 large
ones, writes them, then deletes them
– File size is varied as experimental parameter
– One run takes several hours
– Other people use system daily
• Disk fragmentation may increase over time,
changing results
Another Possible Reason
• Cyclic effects
• Something is happening on the computer
every hour/day/week
• Experiments run while this thing is
happening behave differently
• Ideally, should get rid of cyclic effect
– But that’s not always possible
• There are many other similar reasons for
this kind of behavior
Another Reason
• These kinds of effects are very common when you
run live tests
• Also when you run raw traces
– Of sufficient length and complexity to capture
them
– Not a problem if all tests get same trace
– But potentially a problem if you divide the trace
into pieces for different runs
– That includes dividing traces for training
purposes
Complete Randomization
• Plan experiment first
– Levels of each parameter
– Number of replications
• List experiments by levels and replication
number
• Choose experiments from list randomly
– Selection without replacement
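The steps on this slide can be sketched as follows. This is only an illustration: the function name, the `file_size_kb` factor, and its levels are hypothetical, and the seed is there just to make the order reproducible.

```python
import itertools
import random

def complete_randomization(factor_levels, replications, seed=None):
    """List every (levels, replication) experiment, then draw them all in random order."""
    factors = list(factor_levels)
    planned = [
        (dict(zip(factors, combo)), rep)
        for combo in itertools.product(*(factor_levels[f] for f in factors))
        for rep in range(1, replications + 1)
    ]
    rng = random.Random(seed)
    # Sampling the whole list is selection without replacement:
    # every planned experiment appears exactly once.
    return rng.sample(planned, k=len(planned))

plan = complete_randomization({"file_size_kb": [4, 64, 1024]}, replications=3, seed=1)
for levels, rep in plan:
    print(levels, "replication", rep)
```

Fixing the plan before running anything keeps the order from being influenced by early results.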
More Advanced Techniques
• Complete randomization sometimes impossible
– E.g., might need to install different hardware
for each level
• Too much intervention to potentially change
HW after each run
• Divide experiments into blocks
– Randomize within each block
– Not that helpful if only one factor
• Block effect confounded with true effect
An Example
• Testing DDoS defense boxes
• Your experiment has three factors
– Which of three boxes
– Varying number of attack sites (3
levels)
– Makeup of DDoS traffic (3 levels)
• The boxes are hardware appliances
Why Is This Problematic?
• Boxes need to be put in-line in testing
framework
• Requiring someone to switch cables (at
least)
• With complete randomization, need to
switch cables on roughly 2/3 of
experimental runs
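A quick check of the "roughly 2/3" figure, assuming a single replication of the full design (3 boxes x 3 attack-site levels x 3 traffic mixes = 27 runs) executed in a uniformly random order. The function name is hypothetical.

```python
from fractions import Fraction

def switch_probability(boxes, runs_per_box):
    """Probability that two consecutive runs in a uniformly random order use different boxes."""
    total = boxes * runs_per_box
    # The run that follows a given run is equally likely to be any of the
    # other total - 1 runs, and total - runs_per_box of those use another box.
    return Fraction(total - runs_per_box, total - 1)

# 9 runs per box: 3 attack-site levels x 3 traffic mixes, one replication assumed.
p = switch_probability(boxes=3, runs_per_box=9)
print(p, float(p))
```

Under these assumptions the result is 9/13, about 0.69, so a cable change is needed before roughly two out of every three runs.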
A Block Design for This Case
• Set up blocks of experiments with
single box tested in each block
– But multiple blocks for each box
• E.g., all tests for box A with maximum
number of attack sites are in one block
• Randomize order of block testing
• Randomize within the block
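A minimal sketch of such a block design, assuming one block per (box, attack-site level) pair as in the example above. The level names and function names are hypothetical.

```python
import itertools
import random

def blocked_plan(boxes, attack_site_levels, traffic_levels, seed=None):
    """Build one block per (box, attack-site level), then randomize blocks and runs."""
    rng = random.Random(seed)
    blocks = [
        [(box, sites, traffic) for traffic in traffic_levels]
        for box, sites in itertools.product(boxes, attack_site_levels)
    ]
    rng.shuffle(blocks)        # randomize order of block testing
    for block in blocks:
        rng.shuffle(block)     # randomize within the block
    return [run for block in blocks for run in block]

def cable_changes(plan):
    """Count consecutive runs that test different boxes."""
    return sum(1 for prev, cur in zip(plan, plan[1:]) if prev[0] != cur[0])

plan = blocked_plan(["A", "B", "C"], ["low", "medium", "high"],
                    ["mix1", "mix2", "mix3"], seed=7)
print(cable_changes(plan), "cable changes in", len(plan), "runs")
```

With 9 blocks the box can change only at a block boundary, so this layout needs at most 8 cable changes for 27 runs.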
What Have We Gained?
• Many fewer cable changes than with complete
randomization
• But less danger than with a fixed order that
unforeseen effects depending on experiment order
will cause problems
• Haven’t removed the problem, but
have decreased it
Something To Keep In Mind
• Experimenters tend to think of periodic or
startup effects as a nuisance
• They are actually real phenomena
– Possibly important phenomena
• When designing experiments, think
seriously about whether you want to avoid
these effects
– Or, alternately, capture them
– The latter requires careful thought
Traces
• Traces are often an important part of a
workload
• Many kinds of traces are hard to gather
for yourself
• In some cases, traces are publicly
available
• Sometimes you can use those
Some Useful Traces
• NLANR packet header traces
• CAIDA traces and data sets
• U. of Oregon Routeviews traces
• File system traces
• Web traces
• Crawdad wireless traces
NLANR Network Traces
• Traces of Internet packet activities
– Just packet headers
• Variety of traces gathered at different places in
Internet
• Of varying length
• Useful if you want to generate “realistic” internal
Internet traffic
• http://pma.nlanr.net//
• NLANR is out of business, now run by CAIDA
CAIDA Traces and Data Sets
• CAIDA is organization dedicated to
measuring Internet phenomena
• They’ve gathered a bunch of useful data
– Some of which they’ve made publicly
available
• Likely to be adding more over course of
time
• http://www.caida.org
Some CAIDA Datasets
• Skitter topology data
• Denial-of-service backscatter data
• Internet worm activity data
• Packet traces from OC12 and OC48
ISP points
• DNS root server traffic activity
Skitter Data Sets
• Skitter is CAIDA project to gather Internet
topology data
• Skitter sends probe packets from many sites
around globe to Internet addresses
• Gathers data based on responses
• Data can be used to build map of current
topology/routing state of Internet
Denial of Service Backscatter Data
• Typical DoS attacks result in victim’s
sending lots of response packets
– If attack spoofed addresses, they go to
random sites
– This is called backscatter
• CAIDA watches backscatter and has made
some backscatter data available
• Provides insight into DoS attack numbers,
sizes, targets, etc.
Internet Worm Activity
• Worms spread to randomly chosen
addresses
• CAIDA has data on worm probe attempts to
their addresses
• For Code Red and Witty
• Some parts of data available to all
• Others available on a restricted access basis
• Useful for modeling worm activity
Routeviews Data
• Gathered at University of Oregon
• BGP updates and routing tables from
several participating ASes
– From 2001 to date
– Gathered every two hours, mostly
• http://www.routeviews.org/
What Does Routeviews Data Show?
• Full picture of routing from
perspective of particular points on
Internet
• Partial view of overall Internet
topology and routing
• Data can be used to deduce lots of
useful things
What Could Experimenters Use
Routeviews Data For?
• Generating Internet topology maps
• Generating realistic BGP update traffic
• Generating models of path lengths in
Internet
File System Traces
• Surprisingly few traces of significant
amounts of file system activity
• But some are available
– Many are old
• More might become available in the near future
• Best place to start looking is SNIA IOTTA
repository
– http://iotta.snia.org/
Some File System Traces
• Seer trace
– Gathered in my research group (1996/1997)
– Real activity by real users
– 575 Mbytes
• LASR trace
– Also gathered in my group (2000/2001)
– Real activity by real users
– 3.2 Gbytes
• TraceFS data
– 16 minutes worth of activity (2007)
– Based on running benchmarks
– 58 Mbytes
Typical File System Trace Contents
• Records of file system related system
calls
• Recorded every time file system was
invoked
• Indicates file accessed, type of access,
time, size, perhaps user and process
– With significant anonymization
What Can You Do With File System
Traces?
• Replay them when testing file systems
• Use them to build models of file system
activity
• Use them to generate profiles of what files
in a file system are actually used
– One big weakness in most traces is that they
show only what was accessed
– No info about the rest of the file system’s
contents
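A minimal sketch of building a usage profile from a trace, assuming each record carries the fields listed on the "Typical File System Trace Contents" slide. The record layout, field names, and sample data are hypothetical; real traces such as Seer or LASR have their own formats.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TraceRecord:
    timestamp: float                 # time of the call
    path: str                        # file accessed (anonymized in real traces)
    access: str                      # type of access, e.g. "open", "read", "write"
    size: int                        # bytes involved
    user: Optional[str] = None       # perhaps present, anonymized
    process: Optional[str] = None    # perhaps present

def access_profile(records):
    """Count how many times each file appears in the trace."""
    return Counter(record.path for record in records)

trace = [
    TraceRecord(0.0, "/f1", "open", 0, user="u17"),
    TraceRecord(0.2, "/f1", "read", 4096, user="u17"),
    TraceRecord(1.5, "/f2", "write", 512, user="u03"),
]
profile = access_profile(trace)
print(profile.most_common())
```

The profile has entries only for files that appear in the trace, which is exactly the weakness noted above: files that were never accessed are invisible.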
Other Interesting File System Traces
• Cello traces
– block level access to disk
• Plan 9 traces
– Possibly deceptive, due to unusual system
model of Plan 9
– Seem to have disappeared from web
• Werner Vogels Windows traces
– Also seem to have disappeared
Web Server Traces
• Usually traces of HTTP requests made
to some web server
– Suitably anonymized
• Many available
– But many are old
– Web moves fast enough that it’s not
clear how representative they are
Lawrence Berkeley Web Trace
Repository
• Various web traces kept at LBL
– http://ita.ee.lbl.gov/
• Some are quite extensive
– E.g., 1.3 billion web requests for
1998 World Cup site
• None from after 2000
IRCache Traces
• Weekly traces of a proxy cache
• Latest currently available from January
2007
• ftp://ftp.ircache.net/Traces/
• Free for academic users
• Commercial users have to pay
Web Caching Trace Site
• Run by Brian D. Davison
• http://www.web-caching.com/
• Contains pointers to several web caches
• Except IRCache, none newer than 1999
• Many are pointers to same traces as LBL
– But not all
Crawdad Wireless Traces
• Crawdad is project to gather useful
data on wireless networks
– Based at Dartmouth
– http://crawdad.cs.dartmouth.edu/
• Contains large quantity of data on
various wireless phenomena
The Dartmouth Wireless Traces
• Maybe the best stuff in the Crawdad data
archives
• Dartmouth’s campus has had complete
wireless coverage for several years
– And all students have wireless-enabled
computers
• They’ve kept complete data on associations
to wireless access points for five full years
– Still gathering and making data available
What Can You Do With Dartmouth’s
Data?
• Lots of stuff
• Traces of activity at wireless access points
• Models of user mobility
• Analysis of malware propagation via user
movement
• Models of typical patterns of user network
access
Other Neat Stuff in Crawdad
Repository
• Other records of user mobility through
wireless networks
• Data on Bluetooth activity in various
environments
• Placelab data on use of wireless for
localization
• Link quality information for mesh networks
• Ongoing data gathering project, so more
will be added
Useful Experimental Models
• In many cases, we can’t test in real conditions
• Typically try to mimic real conditions by using
models
– Workload models
– Network topology models
– Models of other experimental conditions
• There are already useful models for many things
– Often widely accepted as valid within certain
research communities
– Might be better using them than trying to create
your own
Some Important Model Categories
• Network topology models
• Network traffic models
Network Topology Models
• Many experiments nowadays investigate
network/distributed systems behavior
• They need a realistic network to test the
system
– Usually embedded in testbed hardware
• Where do you get that from?
• In some cases, it’s obvious or you have a
map of a suitable network
• In other cases, more challenging
Some Challenging Cases
• You need the Internet in the middle
• You are investigating a large enterprise
network
• You are doing scalability testing that
requires networks of several sizes
Network Generation Models
• The typical response to this problem
• Run a program that generates a suitable network
• Map the resulting network onto your available
hardware
– Could be challenging, if you don’t have enough
machines
– Some generators create networks of specified
size
• But theoretically like whatever they’re
modeling
Network Topologies and Power Law
Behavior
• Much debate on whether the Internet (and other
computer networks) follow power law behavior
$P(k) \sim k^{-\gamma}$
– Where P(k) is probability a node connects to k
other nodes and γ > 0 is a constant exponent
• Generally some agreement that power law
topology generators do a better job than hierarchical
models
– Less agreement on how power law properties
arise in networks like Internet
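A small numeric illustration of what a power law degree distribution looks like, assuming the usual form in which P(k) is proportional to k raised to the power -γ. The exponent value and the degree cutoff `k_max` are arbitrary choices for this sketch, not measured Internet values.

```python
def power_law_pmf(gamma, k_max):
    """Normalized P(k) proportional to k**(-gamma) for k = 1..k_max."""
    weights = [k ** -gamma for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = power_law_pmf(gamma=2.0, k_max=100)
print("P(1) =", pmf[0])
print("P(10) =", pmf[9])
print("P(1) / P(10) =", pmf[0] / pmf[9])
```

With γ = 2, a node of degree 1 is 100 times more likely than a node of degree 10: most nodes have few links, and a small number of hubs have many.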
Some Popular Topology Generators
• GT-ITM
• BRITE
• INET
GT-ITM
• Supports various ways to randomly generate
network graphs
– Including transit-stub model
• Which doesn’t produce power law
graphs
• Still, very widely used
• http://www.cc.gatech.edu/projects/gtitm/
BRITE
• Parameterizable network generation tool
• Outputs its networks in NS-2 syntax
• Places nodes randomly in a plane
• Randomly selects some number of nodes to
connect to each new node
– From a limited set of candidates
• Some experiments suggest it produces graphs
matching power law behavior
• Topology generator of choice for Emulab
• http://www.cs.bu.edu/brite/
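A loose sketch of incremental topology growth in the spirit of the bullets above. This is not BRITE's algorithm or its output format; the function name, the parameters, and the choice of "nearest existing nodes" as the limited candidate set are all assumptions made for illustration.

```python
import math
import random

def grow_random_topology(n_nodes, links_per_node, candidate_pool, seed=None):
    """Place nodes at random points in the unit square and link each new node
    to a few randomly chosen nodes from a limited candidate set."""
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random())]  # the first node has no links yet
    edges = []
    for new in range(1, n_nodes):
        point = (rng.random(), rng.random())
        # Limited candidate set: the nearest existing nodes (assumption of this sketch).
        nearest = sorted(range(new), key=lambda i: math.dist(point, positions[i]))
        candidates = nearest[:candidate_pool]
        chosen = rng.sample(candidates, k=min(links_per_node, len(candidates)))
        edges.extend((new, old) for old in chosen)
        positions.append(point)
    return positions, edges

positions, edges = grow_random_topology(n_nodes=50, links_per_node=2,
                                        candidate_pool=5, seed=3)
print(len(positions), "nodes,", len(edges), "links")
```

Because every new node links to at least one older node, the resulting graph is connected. Whether its degree distribution follows a power law depends on how candidates are chosen, which this sketch does not try to reproduce.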
INET
• Topology generator specifically
intended to produce Internet-like
graphs
• Much effort to match various network
characteristics
• http://topology.eecs.umich.edu/inet/
A Different Approach
• Map the real Internet accurately
• Use that map for your topology
• Rocketfuel project is one approach to this
mapping
– http://www.cs.washington.edu/research/networking/rocketfuel/
• Issue of producing small representative
topology you can actually test with remains
Network Traffic Models
• Frequently necessary to feed network
traffic into an experiment
• Could use a trace
• But sometimes better to use a generator
• The generator needs a model to tell it
how to generate traffic
• What kind of model?
Different Network Traffic Model
Approaches
• Trace analysis
– Derive properties from traces of network
behavior
– Generate traffic according to those
properties
• Structural models
– Pretend you’re running an application
– Generate traffic as it would do
Harpoon
• Discussed in earlier lecture
• Uses network traces to determine type
of network traffic to mimic
– Gathered with other tools
• Generates traffic from TCP and UDP
sessions
Swing
• A trace-based generator
• Analyzes trace
– Looking at users, networks, apps
• Calculate CDFs based on these parameters
• Traffic generator creates traffic based on these
• Produces very realistic results
– Improves on Harpoon by allowing application-based
variation of traffic
– And produces fidelity at finer time scales
(~RTT)
• Apparently not yet available for general use
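The general trace-analysis idea (derive a CDF from a trace, then generate traffic that follows it) can be sketched with an empirical CDF and inverse-transform sampling. This is not Swing's or Harpoon's implementation; the function names and the sample flow sizes are hypothetical.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return the sorted distinct values and the cumulative probability at each one."""
    ordered = sorted(samples)
    n = len(ordered)
    values, cumulative = [], []
    for i, v in enumerate(ordered, start=1):
        if values and values[-1] == v:
            cumulative[-1] = i / n       # repeated value: raise its cumulative share
        else:
            values.append(v)
            cumulative.append(i / n)
    return values, cumulative

def draw(values, cumulative, rng):
    """Inverse-transform sampling: pick the first value whose CDF reaches a uniform draw."""
    u = rng.random()
    index = bisect.bisect_left(cumulative, u)
    return values[min(index, len(values) - 1)]

# Hypothetical flow sizes (bytes) observed in a trace.
observed = [1500, 1500, 40, 40, 40, 9000]
values, cumulative = empirical_cdf(observed)
rng = random.Random(0)
generated = [draw(values, cumulative, rng) for _ in range(1000)]
print(values, cumulative)
```

A real generator would keep one such distribution per modeled parameter (for example per application or per user class) and draw from each when creating a session.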
Netspec
• A structural model generator
• Able to emulate traffic generation behavior of
multiple types of applications
– HTTP, FTP, Telnet, voice, video, etc.
• You decide how many you want of each
• Netspec generates them
• Doesn’t seem to be downloadable, at the moment
– No actual link on the “distribution” place on
Netspec web page