A vital aspect of auditing is system logging

More Data! Please!
Marcus J. Ranum
CSO, Tenable, Inc.
<[email protected]>
What?
• System log analysis, considered by many
to be the most boring topic in computer
security
– Except for formal software verification
or
– Writing security policies
So, why now?
• It used to just be a good idea
• Now it’s a federally-mandated good
idea
• You can ignore a normal good idea, but
once there’s a fine for management
associated with it, even good ideas get a
lot more attention
A Doctrine of System Log
Analysis
• Workflow is everything
– Have a predictable process through which
logs are examined
– This process should correctly identify new
things happening
– This process should correctly identify outliers
in rates at which things happen
– This process should be repeatable and have
evidentiary value
Predictable
• Should follow a simple filtering path;
relying on complex statistics or analytics
makes it hard to be sure you’ll get the
same results as other people
– Statistics are simplifiers
• They take large amounts of data and turn them
into manageable numbers
• In the process of doing so they obscure details
• Statistics are noise-reducers that identify broad
trends! We want noise-amplifiers instead!
Identification of New Things
• With log data, new variant forms are
more interesting than specific forms
• Think:
BSD: wd0a sense retry failure, mapping sector 0x1443A bad
on your firewall...
Identification of Outliers
• Hypothesize that attacks are nonsymmetrical activity
• Hypothesize that most factors measured
(CPU, network use, mails sent/sec, etc)
all scale with load, if normalized
– Therefore: You’re more likely to find attacks
by multi-plotting all normalized attributes and
looking for a departure from a fitted curve
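A minimal sketch of that outlier test, under hypothetical per-interval samples: fit each normalized attribute against load with a least-squares line and flag the points that depart from the fitted curve. The metric name, sample values, and threshold below are illustrative.

    import numpy as np

    def outliers(load, metric, threshold=2.0):
        """Return indices whose residual from the fitted metric~load line
        exceeds `threshold` standard deviations."""
        slope, intercept = np.polyfit(load, metric, 1)    # the fitted curve
        residuals = metric - (slope * load + intercept)   # departure from it
        z = residuals / residuals.std()                   # normalized departure
        return np.nonzero(np.abs(z) > threshold)[0]

    # Hypothetical samples: mails/sec should track overall load; the one
    # interval that does not is the candidate attack.
    load      = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
    mails_sec = np.array([ 5, 11, 14, 21, 26, 90, 34, 41], dtype=float)
    print(outliers(load, mails_sec))   # -> [5], the 90 mails/sec interval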
Evidentiary Value
• Tamper-proofing, time synchronization, etc., are going to have less effect on your logs from an evidentiary perspective than:
– Time-in-service: “We have used these for
years”
– Offsite backup at 3rd party: “What, do you
think we changed the copy at Iron Mountain,
too?”
– Order: “All records are in order received”
Parse Trees
• Most log analysis is done with matching
• I.e.:
– If - log record contains string “su: badsu”
then apply regex to split it into fields and do
something with those fields
– Else - do nothing
• The problem is that the “else” case is the
variant/new form - that’s most likely the
important one!
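A minimal sketch of that match-then-split approach, with the “else” branch kept instead of thrown away. The regex and field names are made up for illustration, not a real syslog grammar.

    import re

    # illustrative pattern, not a real su(1) log format
    BADSU = re.compile(r"su: badsu (?P<user>\S+) as (?P<target>\S+) on (?P<tty>\S+)")

    def handle(record):
        m = BADSU.search(record)
        if m:
            # known form: split it into fields and do something with them
            return ("badsu", m.group("user"), m.group("target"), m.group("tty"))
        # the "else" case is the variant/new form -- keep it, don't do nothing
        return ("unmatched", record)

    print(handle("su: badsu mjr as root on /dev/ttyp0"))
    print(handle("su: mjr to root on /dev/ttyp0"))   # a new variant form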
Parse Trees
(cont)
• If you build a fully parseable tree then you can instantly detect variant forms as well as parse optimally
– Only need to compare branches while true
[Parse-tree diagram: literal branches such as “su:”, “BadSu”, “as”, “(sp)”, and “/dev/tty”, with “%s” wildcard fields at the variable positions]
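A sketch of one way to build such a tree, assuming whitespace-tokenized messages: literal tokens become branches, “%s” marks a wildcard field, and a walk that falls off the tree is by definition a new variant form. The templates and records are illustrative.

    class Node:
        def __init__(self):
            self.children = {}      # literal token -> Node
            self.wildcard = None    # "%s" branch, if any
            self.template = None    # set on the last token of a known form

    def add(root, template):
        node = root
        for tok in template.split():
            if tok == "%s":
                node.wildcard = node.wildcard or Node()
                node = node.wildcard
            else:
                node = node.children.setdefault(tok, Node())
        node.template = template

    def parse(root, record):
        node, fields = root, []
        for tok in record.split():
            if tok in node.children:
                node = node.children[tok]
            elif node.wildcard:
                fields.append(tok)          # a field value
                node = node.wildcard
            else:
                return None                 # no branch matches: a new variant form
        return (node.template, fields) if node.template else None

    root = Node()
    add(root, "su: badsu %s as %s on %s")   # illustrative template
    print(parse(root, "su: badsu mjr as root on /dev/ttyp0"))
    print(parse(root, "su: mjr to root on /dev/ttyp0"))   # -> None: new variant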
Black / White / Grey list
• Black List: stuff to throw away
• White List: stuff to flag
• Grey List: stuff that somehow wound up
on neither the black list nor the white list
A Workflow
• Input → Is it on the black list? → No → Is it on the white list? → No → add it to the grey list
• If it is on either list: compute the frequency distribution of this message variant
• If it lands on the grey list: queue it up for an analyst to look at
• The analyst examines the message variant and specific values, then updates either the black list or the white list
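A sketch of that workflow in code, assuming records have already been reduced to a variant form plus fields by the parse stage; the list contents are hypothetical.

    from collections import Counter, deque

    black_list = {"su: badsu %s as %s on %s"}           # stuff to throw away (hypothetical)
    white_list = {"sendmail: %s stat=sent bytes=%s"}    # stuff to flag (hypothetical)
    grey_list  = set()                                  # neither, so far
    frequency  = Counter()
    analyst_queue = deque()

    def workflow(variant, fields):
        if variant in black_list or variant in white_list:
            frequency[variant] += 1                     # frequency distribution of this variant
        else:
            grey_list.add(variant)
            analyst_queue.append((variant, fields))     # analyst moves it to black or white

    workflow("su: badsu %s as %s on %s", ["mjr", "root", "/dev/ttyp0"])
    workflow("kernel: %s soft error on %s", ["wd0a", "sector 0x1443A"])
    print(frequency, list(analyst_queue))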
Workflow
(Cont)
• You can envision building that workflow
fairly trivially around a web-based front
end with some drag and drop
– Do your white/blacklist processing using
anchored, optimized matching trees and it
would run extremely fast
Statistics
• At each point where the messages are
parsed you can apply statistics
– Summarize frequency of field occurrence
– Summarize frequency of variant occurrence
– Summarize change rate in field occurrence
• I.e. is it 90% of the time “badSU”?
– Examine values for hysteresis
• I.e.: count the frequency with which the field value
changes as a percentage of the total number of
times it is seen
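A sketch of those per-field summaries for one variant: value frequencies per field, plus the change rate (“hysteresis”) as a fraction of the times the field is seen. The field values below are made up.

    from collections import Counter, defaultdict

    class FieldStats:
        def __init__(self):
            self.values = defaultdict(Counter)   # field index -> value counts
            self.last = {}                       # field index -> last value seen
            self.changes = Counter()             # field index -> number of changes
            self.seen = Counter()                # field index -> total observations

        def update(self, fields):
            for i, value in enumerate(fields):
                self.values[i][value] += 1
                self.seen[i] += 1
                if i in self.last and self.last[i] != value:
                    self.changes[i] += 1
                self.last[i] = value

        def change_rate(self, i):
            return self.changes[i] / self.seen[i] if self.seen[i] else 0.0

    stats = FieldStats()
    for fields in (["mjr", "root"], ["mjr", "root"], ["backup", "root"]):  # made-up values
        stats.update(fields)
    print(stats.values[0].most_common())   # e.g. is it "mjr" 90% of the time?
    print(stats.change_rate(1))            # field 1 never changes -> 0.0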
Advanced Statistics
• Perform Bayesian set calculations on individual field values, on a per-field-within-variant basis
– I.e: what is the probability that the string after
“su:” will be “root”? Float up instances where
it is outside of the 90% range
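A simple frequency-based stand-in for that per-field calculation (not a full Bayesian treatment): estimate the probability of each value of a field within a variant, and float up values that fall outside the 90% range. The names and threshold are illustrative.

    from collections import Counter

    class FieldModel:
        def __init__(self, floor=0.10):     # illustrative 90% cutoff
            self.counts = Counter()
            self.total = 0
            self.floor = floor

        def observe(self, value):
            self.counts[value] += 1
            self.total += 1

        def is_rare(self, value):
            # True if this value's observed probability is below the floor
            p = self.counts[value] / self.total if self.total else 0.0
            return p < self.floor

    target_after_su = FieldModel()           # the string after "su:"
    for v in ["root"] * 95 + ["operator"] * 5:
        target_after_su.observe(v)
    print(target_after_su.is_rare("root"))       # False: ~95% of observations
    print(target_after_su.is_rare("operator"))   # True: outside the 90% range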
Ultra-Advanced Statistics
• Eventually, when you do enough Bayesian set classifiers, you wind up doing “never before seen” anomaly detection
– I.e.: Float up instances where you see
something that is outside of your historical
reference
NBS and Bayesian classifiers
• NBS = “Never Before Seen” anomaly
detection
• Bayesian classifiers = given that X has occurred before, what is the probability that it occurs again in the same context? (usually this is in the context of groupings of words)
• Both are notations for improbability, where NBS means 100% improbable
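A sketch of NBS detection at the field level, assuming parsed (variant, fields) input: keep the set of values seen so far per field and float up anything not already in it.

    from collections import defaultdict

    seen = defaultdict(set)   # (variant, field index) -> values seen so far

    def never_before_seen(variant, fields):
        new = []
        for i, value in enumerate(fields):
            key = (variant, i)
            if value not in seen[key]:
                new.append((i, value))   # 100% improbable by historical reference
                seen[key].add(value)
        return new

    v = "su: badsu %s as %s on %s"       # illustrative variant form
    print(never_before_seen(v, ["mjr", "root", "/dev/ttyp0"]))    # first time: all new
    print(never_before_seen(v, ["mjr", "root", "/dev/ttyp0"]))    # nothing new
    print(never_before_seen(v, ["guest", "root", "/dev/ttyp0"]))  # floats up "guest"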
Insane idea #1: Massive
Compression and Parse Trees
• To achieve massive (optimal) database compression you simply record only the parse-tree ID and pointers to the tuple-set of variant fields
– Most log records compress down to 2 32-bit
words and a date/time stamp
– It’s a darned shame the lawyers need “exact
copies” of the logs
– Moderately invariant logs would compress
to very small size with this scheme
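A sketch of that storage scheme, assuming parsed input: each record becomes a variant (parse-tree) ID, a tuple of pointers into an interned value table, and a timestamp, so repeated strings are stored only once. The dictionaries here are in-memory stand-ins for the real store.

    variants = {}      # template -> variant id
    values = {}        # field value -> value id (stored once)
    records = []       # (variant id, tuple of value ids, timestamp)

    def intern(table, item):
        # assign the next id the first time an item is seen
        return table.setdefault(item, len(table))

    def store(template, fields, timestamp):
        rec = (intern(variants, template),
               tuple(intern(values, f) for f in fields),
               timestamp)
        records.append(rec)
        return rec

    print(store("su: badsu %s as %s on %s", ["mjr", "root", "/dev/ttyp0"], 1136073600))
    print(store("su: badsu %s as %s on %s", ["mjr", "root", "/dev/ttyp0"], 1136073660))
    # the second record repeats every string: only the timestamp differs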
Relational Databases and
Logs
• RDBMSes are one of the worst technologies you can use for log storage
– Optimized for query speed, not write speed
– A typical RDBMS expects a ratio of 99 queries per 1 write
– A typical log application observes a ratio of 10,000 writes per query
• Index updates murder you
Getting Around RDBMS
Problems
• Maintain a full inverted index
– Build the index when a query is performed, updating only the records within the query window
– Delayed update of indexes means a “hiccup”
on the first query but subsequently it is fast
• Contrast with RDBMS approach which “hiccups”
constantly unless run on $$$ive hardware or kept
lightly loaded
– Inverted indexes parallelize linearly
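A toy sketch of the delayed inverted-index idea: appends touch no index at all, and the postings are only caught up when a query arrives, so the first query hiccups and later ones are lookups. The whitespace tokenizer and in-memory store are simplifications.

    from collections import defaultdict

    log = []                      # record id -> raw text (toy store)
    index = defaultdict(list)     # token -> list of record ids
    indexed_up_to = 0             # high-water mark of indexed records

    def append(record):
        log.append(record)        # writes are cheap: no index update here

    def query(token):
        global indexed_up_to
        # delayed update: index only the records added since the last query
        for rec_id in range(indexed_up_to, len(log)):
            for tok in log[rec_id].split():
                index[tok].append(rec_id)
        indexed_up_to = len(log)
        return index[token]

    append("su: badsu mjr as root on /dev/ttyp0")
    append("sendmail: q123 stat=sent bytes=1024")
    print(query("root"))        # the first query pays the indexing cost
    print(query("stat=sent"))   # subsequent queries are plain lookups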
Getting Around RDBMS
Problems
• Index the parse tree branch #
– It turns out that 99% of the time all you care
about is the variant form (I.e.: “count the
number of sendmail stat=sent messages and
skin the bytes= value and summarize it”)
– Think of cases where you need to cross-correlate variant forms! Are there any?
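A sketch of what the sendmail example looks like once records are tagged with their parse-tree branch number; the branch ID and field layout are made up.

    SENDMAIL_SENT = 7   # hypothetical parse-tree branch number for the stat=sent variant

    # (branch id, fields) as the parse stage would have produced them
    records = [
        (SENDMAIL_SENT, {"queue_id": "q123", "bytes": 1024}),
        (3,             {"user": "mjr", "target": "root"}),
        (SENDMAIL_SENT, {"queue_id": "q124", "bytes": 2048}),
    ]

    sent = [fields for branch, fields in records if branch == SENDMAIL_SENT]
    print(len(sent))                        # count of stat=sent messages
    print(sum(f["bytes"] for f in sent))    # skin the bytes= value and summarize it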
The Splunk Approach
• Lexical matching and text searching with
a power-assist
• Shared sub-parse rules for matched text
(splunkbase)
• In splunk-land the matching queries
become the parse tree branch #, sort of
Insane idea #2: Lex-a-generator
• Want to parse logs really really fast?
– Write a tool that converts sets of matchable
substrings into lex(1) rules
– Run the output through lex, which generates
an optimal parse tree
– Run it through a compiler
– Parse logs at wire speed
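A sketch of such a generator, assuming a small table of matchable substrings; the rule names and actions are illustrative, and the output is meant to be fed to lex/flex and a C compiler, which build the optimized scanner.

    # illustrative substrings -> rule names
    patterns = {
        "BADSU":   r"su: badsu",
        "SENT":    r"stat=sent",
        "DISKERR": r"sense retry failure",
    }

    def emit_lex(rules):
        lines = ["%%"]
        for name, substring in rules.items():
            # each substring becomes a rule reporting which known form matched
            lines.append('"%s"\t{ printf("%s\\n"); }' % (substring, name))
        # anything the rules do not cover is a candidate new variant form
        lines.append(".|\\n\t{ /* new form: hand off for analysis */ }")
        lines.append("%%")
        return "\n".join(lines)

    print(emit_lex(patterns))
    # then: lex rules.l && cc -O2 lex.yy.c -ll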
Insane idea #3: Enumerate
Them All
• Build a community-oriented tool for
sharing parse tree structures
– New users “pay” for service by contributing 3
parses for messages never seen before
– Pretty quickly the trees are all up to date
– Humans can make sense of log messages quickly, computers can’t
• Why not leverage our brains and the fact that
there are a hell of a lot of humans?
Insane Idea #3:
(Continued)
50,000 variant forms in 10 years of San Diego Supercomputer Center’s Logs
The Future
• What will the future log analysis “next
generation” look like?
– Palantir
• with a semantic forest generator behind it
– with a grid database behind that
Someone: Please get busy and build this
Hint: You will make a lot of money
Where is it all going?
• Eventually it all leads back to parsing
– You can fudge it with expressions
– You can accelerate expressions using fast
search engines
• …but it all comes back to being able to build an
efficient parse-tree