2014_03_15 - NDSU Computer Science

Transcript 2014_03_15 - NDSU Computer Science

Wikipedia: Amazon Web Services (abbreviated AWS) is a collection of remote computing services (also called web services) that together make up a cloud
computing platform, offered over the Internet by Amazon.com. The most central and well-known of these services are Amazon EC2 and Amazon S3. The
service is advertised as providing a large computing capacity (potentially many servers) much faster and cheaper than building a physical server farm.[2]
AWS is located in 8 geographical "regions": US East (Northern Virginia), where the majority of AWS servers are based,[3] US West (Northern California), US
West (Oregon), Brazil (São Paulo), Europe (Ireland), Southeast Asia (Singapore), East Asia (Tokyo), and Australia (Sydney).
List of products
Compute Amazon Elastic Compute Cloud (EC2) provides scalable virtual private servers using Xen.
Amazon Elastic MapReduce (EMR) allows developers and others to easily and cheaply process big data, using hosted Hadoop running on EC2 and Amazon S3.
Networking Amazon Route 53 provides a highly available and scalable Domain Name System (DNS) web service.
Amazon Virtual Private Cloud (VPC) creates a logically isolated set of Amazon EC2 instances which can be connected to an existing network using a VPN.
AWS Direct Connect provides dedicated network connections into AWS data centers, providing faster and cheaper data throughput.
Content delivery Amazon CloudFront, a content delivery network (CDN) for distributing objects to so-called "edge locations" near the requester.
Storage and content delivery Amazon Simple Storage Service (S3) provides Web Service based storage.
Amazon Glacier provides a low-cost, long-term storage option (compared to S3), with high redundancy and availability but slow, infrequent access; ideal for archiving.
AWS Storage Gateway, an iSCSI block storage virtual appliance with cloud-based backup.
Amazon Elastic Block Store (EBS) provides persistent block-level storage volumes for EC2.
AWS Import/Export, accelerates moving large amounts of data into and out of AWS using portable storage devices for transport.
Database Amazon DynamoDB provides a scalable, low-latency NoSQL online Database Service backed by SSDs.
Amazon ElastiCache provides in-memory caching for web applications. This is Amazon's implementation of Memcached and Redis.
Amazon Relational Database Service (RDS) provides a scalable database server with MySQL, Informix,[20] Oracle, SQL Server, and PostgreSQL support.[21]
Amazon Redshift provides petabyte-scale data warehousing with column-based storage and multi-node compute.
Amazon SimpleDB allows developers to run queries on structured data. It operates in concert with EC2 and S3 to provide "the core functionality of a database".
AWS Data Pipeline provides reliable service for data transfer between different AWS compute and storage services
Amazon Kinesis, streams data in real time with the ability to process thousands of data streams on a per-second basis.
Deployment Amazon CloudFormation provides a file-based interface for provisioning other AWS resources.
AWS Elastic Beanstalk provides quick deployment and management of applications in the cloud.
AWS OpsWorks for configuration of EC2 services using Chef.
Management Amazon Identity and Access Management (IAM), an implicit service, the authentication infrastructure used to authenticate access to services.
Amazon CloudWatch, provides monitoring for AWS cloud resources and applications, starting with EC2.
AWS Management Console: a web-based point-and-click interface to manage and monitor EC2, EBS, S3, SQS, Amazon Elastic MapReduce, and Amazon CloudFront.
Application services
Amazon CloudSearch provides basic full-text search and indexing of textual content.
Amazon DevPay, currently in limited beta, is a billing and account management system for applications that developers have built atop Amazon Web Services.
Amazon Elastic Transcoder (ETS) provides video transcoding of S3 hosted videos, marketed primarily as a way to convert source files into mobile-ready versions.
Amazon Flexible Payments Service (FPS) provides an interface for micropayments.
Amazon Simple Email Service (SES) provides bulk and transactional email sending.
Amazon Simple Queue Service (SQS) provides a hosted message queue for web applications.
Amazon Simple Notification Service (SNS) provides a hosted multi-protocol "push" messaging for applications.
Amazon Simple Workflow (SWF) is a workflow service for building scalable, resilient applications.
Wikipedia:
Amazon S3 (Simple Storage Service) is an online file storage web service offered by Amazon Web Services. It provides storage through web service interfaces (REST, SOAP, and BitTorrent).
Amazon S3 is reported to store more than 2 trillion objects as of April 2013.
Details of S3's design are not made public by Amazon, though it clearly manages data with an object storage architecture. According to Amazon, S3's design aims
to provide scalability, high availability, and low latency at commodity costs.
S3 stores arbitrary objects (computer files) up to 5 terabytes in size, each accompanied by up to 2 kilobytes of metadata. Objects are organized into buckets (each
owned by an Amazon Web Services account), and identified within each bucket by a unique, user-assigned key.
Buckets and objects can be created, listed, and retrieved using either a REST-style HTTP interface or a SOAP interface. Requests are authorized using an access
control list associated with each bucket and object.
Bucket names and keys are chosen so that objects are addressable using HTTP URLs: http://s3.amazonaws.com/bucket/key http://bucket.s3.amazonaws.com/key
http://bucket/key (where bucket is a DNS CNAME record pointing to bucket.s3.amazonaws.com)
Because objects are accessible by unmodified HTTP clients, S3 can be used to replace significant existing (static) web hosting infrastructure.[16]
A user can construct a URL that can be handed off to a third party, granting access for a limited period such as the next 30 minutes or the next 24 hours.
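The transcript does not show how such a time-limited URL is built. As a rough sketch of the legacy S3 query-string signing scheme (Signature Version 2, in use at the time), assuming OpenSSL for HMAC-SHA1 and hypothetical "ExampleKeyId"/"ExampleSecret" credentials, the idea is roughly:

```cpp
// Sketch only: builds a time-limited S3 GET URL with the legacy (pre-SigV4)
// query-string authentication scheme.  The exact string-to-sign should be
// checked against the AWS documentation; the credentials are hypothetical.
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <cctype>
#include <cstdio>
#include <ctime>
#include <string>

static std::string base64(const unsigned char* d, unsigned int n) {
    static const char t[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (unsigned int i = 0; i < n; i += 3) {
        unsigned int v = d[i] << 16;
        if (i + 1 < n) v |= d[i + 1] << 8;
        if (i + 2 < n) v |= d[i + 2];
        out += t[(v >> 18) & 63];
        out += t[(v >> 12) & 63];
        out += (i + 1 < n) ? t[(v >> 6) & 63] : '=';
        out += (i + 2 < n) ? t[v & 63] : '=';
    }
    return out;
}

static std::string urlEncode(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') out += c;
        else { snprintf(buf, sizeof buf, "%%%02X", (unsigned)c); out += buf; }
    }
    return out;
}

// String-to-sign: VERB\nContent-MD5\nContent-Type\nExpires\nResource,
// signed with HMAC-SHA1 of the secret key, base64'd, then URL-encoded.
std::string presignedGetUrl(const std::string& keyId, const std::string& secret,
                            const std::string& bucket, const std::string& object,
                            long expires) {
    std::string toSign = "GET\n\n\n" + std::to_string(expires) +
                         "\n/" + bucket + "/" + object;
    unsigned char mac[EVP_MAX_MD_SIZE];
    unsigned int macLen = 0;
    HMAC(EVP_sha1(), secret.data(), (int)secret.size(),
         (const unsigned char*)toSign.data(), toSign.size(), mac, &macLen);
    return "https://s3.amazonaws.com/" + bucket + "/" + object +
           "?AWSAccessKeyId=" + keyId +
           "&Expires=" + std::to_string(expires) +
           "&Signature=" + urlEncode(base64(mac, macLen));
}

int main() {
    // Hypothetical credentials and object; the URL is valid for the next 30 minutes.
    long expires = (long)time(nullptr) + 30 * 60;
    printf("%s\n", presignedGetUrl("ExampleKeyId", "ExampleSecret",
                                   "mybucket", "report.pdf", expires).c_str());
    return 0;
}
```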
Every item in a bucket can also be served up as a BitTorrent feed. The S3 store can act as a seed host for a torrent and any BitTorrent client can retrieve the file.
This drastically reduces the bandwidth costs for the download of popular objects.
Amazon S3 provides options to host static websites with Index document support and error document support.[19]
Photo hosting service SmugMug has used S3 since April 2006. They experienced a number of initial outages and slowdowns, but after one year they described it
as being "considerably more reliable than our own internal storage" and claimed to have saved almost $1 million in storage costs.[21]
There are various FUSE (Filesystem in Userspace)-based file systems for Unix-like operating systems (Linux, etc.) that can be used to mount an S3 bucket as a file
system. Note that since the semantics of the S3 file system are not those of a POSIX file system, the file system may not behave entirely as expected.[22]
Apache Hadoop file systems can be hosted on S3, as its requirements of a file system are met by S3. As a result, Hadoop can be used to run MapReduce
algorithms on EC2 servers, reading data and writing results back to S3.
Dropbox,[23] Bitcasa,[24] StoreGrid, Jumpshare, SyncBlaze,[25] Tahoe-LAFS-on-S3,[26] Zmanda, Ubuntu One,[27] and Fiabee are some of the many online
backup and synchronization services that use S3 as their storage and transfer facility.
Minecraft hosts game updates and player skins on the S3 servers.[28]
Tumblr, Formspring, Pinterest, and Posterous images are hosted on the S3 servers.
Alfresco, the open-source enterprise content management provider, hosts data for its Alfresco in the Cloud service on S3.
LogicalDOC, the open-source document management system, provides a disaster-recovery tool based on S3.
S3 was used in the past by some enterprises as a long-term archiving solution, until Amazon Glacier was released.
Examples of competing S3 compliant storage implementations include:
Google Cloud Storage
Cloud.com’s CloudStack[31]
Cloudian, Inc.[32] an S3-compatible object storage software package.
Connectria Cloud Storage[33][34] in 2011 became the first US cloud storage service provider based on the Scality RING organic storage technology[35][36]
Eucalyptus
Nimbula (acquired by Oracle)
Riak CS,[37] which implements a subset of the S3 API including REST and ACLs on objects and buckets.
Ceph with RADOS gateway.[38]
Caringo, Inc., which implements the S3 API and extends bucket policies with the addition of domain policies.
From: Arjun Roy
I did some tests to compare C/C++ vs C# on some basic ops to see if compilers handle cases differently. I executed Dr. Wettstein's code in the libPree library on Linux (Ubuntu 12.04) and my C# code on Windows 7, 4 GB RAM.

Op            | Bit count   | Time (sec) | Time (millisec)
--------------|-------------|------------|----------------
OR            | 1 million   | 1.1        | 0.1013
OR            | 10 million  | 23.33      | 1.016
OR            | 100 million | 266.38     | 10.4494
AND           | 1 million   | 0.99       | 0.0989
AND           | 10 million  | 23.29      | 1.0235
AND           | 100 million | 279.92     | 10.6166
Number of 1's | 1 million   | 0.49       | 0.9647
Number of 1's | 10 million  | 5.11       | 6.6821
Number of 1's | 100 million | 55.73      | 57.8274

Reply:
So a 100 million bit OR op on my workstation, which is somewhat dated in the tooth now, consistently yields about a 76 ms operation time. Doing the math suggests an effective application memory bandwidth of 178.5 megabytes/second. Put in another frame of reference, this suggests that my workstation could easily saturate a 1 gigabit/second LAN connection with a streaming PTree computation. I also ran a 3-tuple of a count test just to make sure that it was consistent:

PTree count test: bitcount/iteration/population 100000000/1/500000
Count time: 110000 ticks [0.110000 secs.], Elapsed: 0
real    0m0.122s
user    0m0.112s
sys     0m0.008s

PTree count test: bitcount/iteration/population 100000000/1/500000
Count time: 120000 ticks [0.120000 secs.], Elapsed: 0
real    0m0.123s
user    0m0.120s
sys     0m0.000s

PTree count test: bitcount/iteration/population 100000000/1/500000
Count time: 110000 ticks [0.110000 secs.], Elapsed: 0
real    0m0.121s
user    0m0.112s
sys     0m0.008s

3-run replicate of a 100M bit PTree OR op done with 32-bit code (gcc 4.3.4) with -O2 optimization:

PTree OR test: bitcount/iteration = 100000000/1
Count time: 60000 ticks [0.060000 secs.], Elapsed: 0
real    0m0.076s
user    0m0.048s
sys     0m0.028s

PTree OR test: bitcount/iteration = 100000000/1
Count time: 60000 ticks [0.060000 secs.], Elapsed: 0
real    0m0.076s
user    0m0.048s
sys     0m0.024s

PTree OR test: bitcount/iteration = 100000000/1
Count time: 70000 ticks [0.070000 secs.], Elapsed: 0
real    0m0.076s
user    0m0.040s
sys     0m0.032s
The above test reflects the amount of time required to root-count a PTree with 100 million entries and 500,000 randomly populated one-bits. This test suggests a count time of
120 milliseconds, for an effective application memory bandwidth of 109.2 megabytes/second, just under what it would take to saturate a LAN connection.
This test brings up an additional question that I would have with respect to the validity of the testing environment. This test uses the default PTree counting implementation,
which uses word-by-word byte scanning with a 256-entry bitcount lookup table. This means we impose an overhead of 12,500,000 memory references for the table
lookups and an additional 9,375,000 unaligned memory references for the three byte-pointer lookups per 32-bit word.
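The byte-scan counting approach described here is easy to sketch. The following is a minimal stand-alone illustration (not the libPree code, which is not shown in this document) of counting ones in a 100-million-bit array via a 256-entry lookup table, mirroring the test parameters quoted above:

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <random>
#include <vector>

int main() {
    // 256-entry table: number of 1-bits in each possible byte value.
    unsigned char bitCount[256];
    for (int v = 0; v < 256; ++v) {
        int c = 0;
        for (int b = v; b != 0; b >>= 1) c += b & 1;
        bitCount[v] = (unsigned char)c;
    }

    const size_t nBits  = 100000000;          // 100 million bits
    const size_t nWords = (nBits + 31) / 32;  // 32-bit backing array
    std::vector<uint32_t> ptree(nWords, 0);

    // Sparsely populate ~500,000 random one-bits, as in the count test above.
    std::mt19937_64 rng(1);
    std::uniform_int_distribution<size_t> pick(0, nBits - 1);
    for (int i = 0; i < 500000; ++i) {
        size_t bit = pick(rng);
        ptree[bit / 32] |= 1u << (bit % 32);
    }

    // Word-by-word byte scan: four table lookups per 32-bit word.
    clock_t t0 = clock();
    size_t ones = 0;
    for (size_t w = 0; w < nWords; ++w) {
        uint32_t x = ptree[w];
        ones += bitCount[x & 0xFF] + bitCount[(x >> 8) & 0xFF]
              + bitCount[(x >> 16) & 0xFF] + bitCount[x >> 24];
    }
    clock_t t1 = clock();

    printf("ones = %zu  count time = %.3f s\n",
           ones, (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
```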
PTree operations are by definition cache pessimal as we have discussed previously. Lookup tables are also notorious for their cache pressure characteristics. It is always
difficult to predict how the hardware prefetch logic handles this but at a minimum the lookup table is going to bounce across two L1 cache-lines.
Given this, it is somewhat unusual that, in their test results, the '1 counting' was five times faster than the basic AND/OR ops.
There are multiple '1 counting' implementations in the source code, but those would have had to be selected explicitly. Included in those counting implementations were reduction
counting and a variant which unrolls the backing array onto the MMX vector registers. Those strategies provided incremental performance improvements, but
nothing on the order of a 5-fold increase in counting throughput.
With respect to memory-optimizing PTree operations, I experimented with non-temporal loads and prefetch hinting and was never able to demonstrate significant performance
increases. As we have discussed many times, PTree technology, and data mining in general, is ultimately constrained by memory latency.
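The experimental code is not shown in the transcript; purely as an illustration of that kind of tuning (not Dr. Wettstein's actual code), an SSE2-style OR loop with prefetch hints and non-temporal (streaming) stores might look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_or_si128, _mm_stream_si128
#include <xmmintrin.h>   // _mm_prefetch, _mm_sfence

// OR two bit vectors 128 bits at a time.  Operands are prefetched a little
// ahead with the non-temporal hint, and the result is written with streaming
// stores so it does not pollute the cache.  dst, a, and b must be 16-byte
// aligned and nWords (32-bit words) a multiple of 4.
void ptreeOr(uint32_t* dst, const uint32_t* a, const uint32_t* b, size_t nWords) {
    const size_t lookahead = 64;                       // 64 words = 256 bytes ahead
    for (size_t i = 0; i < nWords; i += 4) {
        if (i + lookahead < nWords) {
            _mm_prefetch((const char*)(a + i + lookahead), _MM_HINT_NTA);
            _mm_prefetch((const char*)(b + i + lookahead), _MM_HINT_NTA);
        }
        __m128i va = _mm_load_si128((const __m128i*)(a + i));
        __m128i vb = _mm_load_si128((const __m128i*)(b + i));
        _mm_stream_si128((__m128i*)(dst + i), _mm_or_si128(va, vb));
    }
    _mm_sfence();                                      // flush the streaming stores
}
```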
So that would be my spin on the results.
The C# test results are approximately consistent with expected memory bandwidth speeds. I took a few minutes and ran the 100 million count test on one of the nodes in our
'chicken cruncher' and I am seeing around 20 milliseconds, which is about twice as fast as their C# implementation; reasonable given hardware/OS/compiler/interpreter differences.
Some Interesting Articles. We should know as much as we can about what "the client" is doing in order to best help them do it. From Popular Mechanics:
Client Data Mining: How It Works: PRISM, XKeyscore, and plenty more classified info about the client's vast surveillance program has come to light. How
much data is there? How does the government sort through it? What are they learning about you? Here's a guide.
Most people were introduced to the arcane world of data mining by a security breach revealing the government gathers billions of pieces of data—phone
calls, emails, photos, and videos—from Google, Facebook, Microsoft, and other communications giants, then combs through the information for leads
on national security threats. Here's a guide to big-data mining, NSA-style.
The Information Landscape: How much data do we produce? An IBM study estimates 2.5 quintillion bytes (2.5 × 10^18) of data every day. (If these data
bytes were pennies laid out flat, they would blanket the earth five times.) That total includes stored information — photos, videos, social-media
posts, word-processing files, phone-call records, financial records, and results from science experiments — and data that normally exists for mere
moments, such as phone-call content and Skype chats.
Veins of Useful Information Digital info can be analyzed to establish connections between people, and these links can generate investigative leads.
But in order to examine data, it has to be collected, from everyone. As the data-mining saying goes: to find a needle in a haystack, you first need to build a haystack.
Data Has to Be Tagged Before It's Bagged Data mining relies on metadata tags that enable algorithms to identify connections. Metadata is data about data—
e.g., the names and sizes of files. The label placed on data is called a tag. Tagging data enables analysts to classify and organize the info so it can be
searched and processed. Tagging also enables analysts to parse the info without examining the contents. This is an important legal point because the
communications of U.S. citizens and lawful permanent resident aliens cannot be examined without a warrant. Metadata can!
Finding Patterns in the Noise The data-analysis firm IDC estimates that only 3 percent of the info in the digital universe is tagged when it's created, so the
client has a sophisticated software program that puts billions of metadata markers on the info it collects. These tags are the backbone of any system that
makes links among different kinds of data—such as video, documents, and phone records. For example, data mining could call attention to a suspect on
a watch list who downloads terrorist propaganda, visits bomb-making websites, and buys a pressure cooker. (This pattern matches behavior of the
Tsarnaev brothers, who are accused of planting bombs at the Boston Marathon.) This tactic assumes terrorists have well-defined data profiles—
something many security experts doubt.
Open Source and Top Secret: Accumulo was designed precisely for tagging billions of pieces of unorganized, disparate data. The custom tool is based on
Google programming and is open source. Sqrrl commercialized it and hopes the healthcare and finance industries will use it to manage their own big-data sets.
The Miners: Who Does What. The client is authorized to snoop on foreign communications and also collects a vast amount of data—trillions of pieces of
communication generated by people across the globe. It does not chase the crooks, terrorists, and spies it identifies; it sifts info on behalf of other
government players such as the Pentagon, CIA, and FBI. Here are the basic steps:
1. A judge on a secret Foreign Intelligence Surveillance (FISA) Court gets an application from an agency to authorize a search of data collected by the client.
2. Once authorized, the request goes to the FBI's Electronic Communications Surveillance Unit (ECSU), a legal safeguard; the unit reviews the request to ensure no U.S. citizens are targeted.
3. The ECSU passes appropriate requests to the FBI Data Intercept Technology Unit, which obtains the info from Internet company servers and then passes it
to the client to be examined. (Many companies have denied they open their servers. As of press time, it's not clear who is correct.)
4. The client then passes relevant information to the government agency that requested it.
What is the client up to? Phone-Metadata Mining Dragged Into the Light: The controversy began when it was revealed that the U.S. government was
collecting the phone-metadata records of every Verizon customer—including millions of Americans. At the request of the FBI, FISA Court judge Roger
Vinson issued an order compelling the company to hand over its phone records. The content of the calls was not collected, but national security officials
call it "an early warning system" for detecting terror plots.
PRISM Goes Public: Every collection platform or source of raw intelligence is given a name, called a Signals Intelligence Activity Designator (SIGAD), and a code name. SIGAD US984XN is better known by its code name: PRISM. PRISM involves the collection of digital photos, stored data, file transfers, emails, chats, videos, and video conferencing from nine Internet companies. U.S. officials say this tactic helped snare Khalid Ouazzani, a naturalized U.S. citizen who the FBI claimed was plotting to blow up the New York Stock Exchange.
Mining Data as It's Created: Our client also operates real-time surveillance tools. Analysts can receive "real-time notification of an email event such as a login or sent message" and "real-time notification of a chat login". Whether real-time info can stop unprecedented attacks is subject to debate. Alerting a credit-card holder of sketchy purchases in real time is easy; building a reliable model of an impending attack in real time is infinitely harder.
XKeyscore is software that can search hundreds of databases for leads. It enables low-level analysts to access communications without oversight, circumventing
the checks and balances of the FISA court. The client vehemently denies this, and the documents don't indicate any misuse. It seems to be a powerful tool that
allows analysts to find hidden links inside troves of info. "My target speaks German but is in Pakistan—how can I find him?" "My target uses Google
Maps to scope target locations—can I use this info to determine his email address?" This program enables analysts to submit one query to search 700
servers around the world at once, combing disparate sources to find the answers to these questions.
How Far Can the Data Stretch? Oops—False Positives: Bomb-sniffing dogs sometimes bark at explosives that are not there. This kind of mistake is
called a false positive. In data mining, the equivalent is a computer program sniffing around a data set and coming up with the wrong conclusion.
This is when having a massive data set may be a liability. When a program examines trillions of connections between potential targets, even a very
small false-positive rate produces on the order of 10K dead-end leads that agents must chase down, not to mention the unneeded incursions into innocent people's lives.
Analytics to See the Future; Ever wonder where those Netflix recommendations or suggested reading lists on Amazon come from?
Your previous interests directed an algorithm to pitch those products to you. Big companies believe more of this kind of targeted marketing will boost sales
and reduce costs. For example, this year Walmart bought a predictive analytics startup called Inkiru. They make software that crunches data to help
retailers develop marketing campaigns that target shoppers when they are most likely to buy certain products
Pattern Recognition or Prophecy? In 2011 British researchers created a game that simulated a van-bomb plot, and 60% of "terrorist" players were spotted
by a program called DScent, based on their "purchases" and "visits" to the target site. The ability of a computer to automatically match security-camera footage with records of purchases may seem like a dream to law-enforcement agents trying to save lives, but it's the kind of ubiquitous
tracking that alarms civil libertarians.
Client's Red Team: Secret Operations with the Government's Top Hackers, by Glenn Derene. When it comes to the U.S. government's computer security, we in the tech press have a habit of reporting only the bad news—for instance, last year's hacks into Oak Ridge and Los Alamos National Labs, a break-in to an e-mail server used by Defense Secretary Robert Gates ... the list goes on and on. Frankly, that's because the good news is usually a bunch of nonevents: "Hackers deterred by diligent software patching at Army Corps of Engineers." Not too exciting.
So, in the world of IT security, it must seem that the villains outnumber the heroes—but there are some good-guy celebrities in the world of cyber
security. In my years of reporting on the subject, I've often heard the red team referred to with a sense of breathless awe by security pros.
These guys are purported to be just about the stealthiest, most skilled firewall-crackers in the game. Recently, I called up the secretive
government agency and asked if it could offer up a top red teamer for an interview, and, surprisingly, the answer came back, "Yes".
What are red teams? They're sort of like the special forces units of the security industry—highly skilled teams that clients pay to break into the clients'
own networks. These guys find the security flaws so they can be patched before someone with more nefarious plans sneaks in. The client
has made plenty of news in the past few years for warrantless wiretapping and massive data-mining enterprises of questionable legality, but
one of the agency's primary functions is the protection of the military's secure computer networks, and that's where the red team comes in.
In exchange for the interview, I agreed not to publish my source's name. When I asked what I should call him, the best option I was offered
was: "An official within the client's Vulnerability Analysis and Operations Group." So I'm just going to call him OWCVAOG for short.
And I'll try not to reveal any identifying details about the man whom I interviewed, except to say that his disciplined, military demeanor
shares little in common with the popular conception of the flippant geek-for-hire familiar to all too many movie fans (Dr. McKittrick in
WarGames) and code geeks (n00b script-kiddie h4x0r, in leetspeak).
So what exactly does the red team actually do? They provide "adversarial network services to the rest of the DOD," says OWCVAOG. That
means that "customers" from the many branches of the Pentagon invite OWCVAOG and his crew to act like our country's shadowy
enemies (from the living-in-his-mother's-basement code tinkerer to a "well-funded hacker who has time and money to invest in the effort"),
attempting to slip in unannounced and gain unauthorized access.
These guys must conduct their work without doing damage to or otherwise compromising the security of the networks they are tasked to
analyze—that means no denial-of-service attacks, malicious Trojans, or viruses. "The first rule," says OWCVAOG, "is 'do no harm.'"
So the majority of their work consists of probing their customers' networks, gaining user-level access and demonstrating just how compromised the
network can be. Sometimes, the red team will leave an innocuous file on a secure part of a customer's network as a calling card, as if to
say, "This is your friendly red team. We danced past the comical precautionary measures you call security hours ago. This file isn't doing
anything, but if we were anywhere near as evil as the hackers we're simulating, it might just be deleting the very government secrets you
were supposed to be protecting. Have a nice day!"
I'd heard from one of the Department of Defense clients who had previously worked with the red team that OWCVAOG and his team had a success
rate of close to 100 percent. "We don't keep statistics on that," OWCVAOG insisted when I pressed him on an internal measuring stick.
"We do get into most of the networks we target. That's because every network has some residual vulnerability. It is up to us, given the time
and the resources, to find the vulnerability that allows us to access it
It may seem unsettling to you—it did at first to me—to think that the digital locks protecting our government's most sensitive info are picked so
constantly and seemingly with such ease. But I've been assured that these guys are only making it look easy because they're the best, and
that we all should take comfort, because they're on our side. The fact that they catch security flaws early means that, hopefully, we can
patch up the holes before the black hats get to them.
And like any good geek at a desk talking to a guy with a really cool job, I wondered where the client finds the members of its superhacker squad.
"The bulk is military personnel, civilian government employees and a small cadre of contractors," OWCVAOG says. The military guys mainly
conduct the ops (the actual breaking and entering stuff), while the civilians and contractors mainly write code to support their endeavors.
For those of you looking for a gig in the ultrasecret world of red teaming, this top hacker says the ideal profile is someone with "technical
skills, an adversarial mind-set, perseverance and imagination."
Speaking of high-level, top-secret security jobs, this much I now know: The world's most difficult IT department to work for is most certainly lodged
within the Pentagon. Network admins at the Defense Department have to constantly fend off foreign governments, criminals and wannabes
trying to crack their security wall—and worry about ace hackers with the same DOD stamp on their paychecks.
Security is an all-important issue for the corporate world, too, but in that environment there is an acceptable level of risk that can be built into the
business model. And while banks build in fraud as part of the cost of doing business, there's no such thing as an acceptable loss when it
comes to national security. I spoke about this topic recently with Mark Morrison, chief info assurance officer of the Defense Intell Agency
"We meet with the financial community because there are a lot of parallels between what the intelligence community needs to protect and what the
financial community needs," Morrison said. "They, surprisingly, have staggeringly high acceptance levels for how much money they're
willing to lose. We can't afford to have acceptable loss. So our risk profiles tend to be different, but in the long run, we end up accepting
similar levels of risk because we have to be able to provide actionable intelligence to the war fighter."
OWCVAOG agrees that military networks should be held to higher standards of security, but perfectly secure computers are perfectly unusable.
"There is a perfectly secure network," he said. "It's one that's shut off. We used to keep our info in safes. We knew that those safes were
good, but they were not impenetrable, and were rated on the number of hours it took for people to break into them. This is a similar equation."
FAUST
A Clusterer (an unsupervised analytic) analyzes data objects without consulting a known class label (usually none are present).
Objects are clustered (grouped) based on maximizing intra-class similarity and minimizing inter-class similarity.
FAUST Count Change (FCC) Clusterer
1. Choose a nextD recursion plan, e.g.: a. initially use D = the diagonal with maximum Standard Deviation (STD) or maximum STD/Spread; b. always use AM (Average-Median); c. always use AFFA (Average-FurthestFromAverage); d. always use FFAFFF (FurthestFromAverage-FurthestFromFurthest); e. cycle through the diagonals e1, ..., en, e1e2, ...; f. cycle through AM, AFFA, FFAFFF; ...
Also choose a DensityThreshold (DT), a DensityUniformityThreshold (DUT), and a Precipitous Count Change Definition (PCCD).
2. If DT (and/or DUT) is not exceeded at a cluster C, partition C by cutting at each gap and each PCC in C o D, using the next D in the recursion plan. (A minimal sketch of the gap-cut step follows.)
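As a rough, self-contained sketch of the gap-cut step of this clusterer (PCC cuts, the DT/DUT checks, and the nextD recursion are omitted), assuming integer-valued attributes and a hypothetical gapCut helper:

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical helper: one pass of the gap-cut idea on a single projection.
// F(x) = x o D is histogrammed, and the data are cut wherever two consecutive
// occupied F values are separated by at least gapThreshold (3 in the runs below).
std::vector<std::vector<int>> gapCut(const std::vector<std::vector<int>>& X,
                                     const std::vector<int>& D,
                                     int gapThreshold) {
    if (X.empty()) return {};
    std::map<int, std::vector<int>> hist;          // F value -> row indices
    for (size_t r = 0; r < X.size(); ++r) {
        int f = 0;
        for (size_t k = 0; k < D.size(); ++k) f += D[k] * X[r][k];
        hist[f].push_back((int)r);
    }

    std::vector<std::vector<int>> clusters(1);
    int prev = hist.begin()->first;
    for (const auto& [f, rows] : hist) {
        if (f - prev >= gapThreshold)              // gap found: start a new cluster
            clusters.emplace_back();
        clusters.back().insert(clusters.back().end(), rows.begin(), rows.end());
        prev = f;
    }
    return clusters;
}

int main() {
    // Toy data with an obvious gap along D = (1, 0, 1, 1), as in the IRIS runs below.
    std::vector<std::vector<int>> X = {{1, 5, 1, 0}, {2, 4, 1, 1}, {9, 5, 8, 7}, {8, 6, 9, 8}};
    auto clusters = gapCut(X, {1, 0, 1, 1}, 3);
    printf("clusters found: %zu\n", clusters.size());   // expect 2
    return 0;
}
```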
FAUST Anomaly Detector (FAD)
(outlier detection analytic) identifies objects not complying with the data model.
In most cases, outliers are discarded as noise or exceptions, but in some applications, e.g., fraud detection, rare events are the interesting events.
Outliers can be detected with a. Statistics (statistical tests assume a distribution/probability model), b. Distance, c. Density or d. Deviation
(deviation uses a dissimilarity to reduce overall dissimilarity by removing "deviation outliers"). Outlier mining can mean:
1. Given a set of n objects and given a k, find the top k objects in terms of dissimilarity from the rest of the objects.
2. Given a Training Set, identify outlier objects within each class (correctly classified but noticeably dissimilar to their fellow class members).
3. Determine "fuzzy" clusters, i.e., assign a weight for each (object, cluster) pair. (Does a dendogram do that?).
We believe that the FAUST Count Change clusterer is the best Anomaly Detector we have at this time. It seems important to be able to identify and
remove large clusters in order to find small ones (anomalies).
For each D, compute the midpoint of D, or the low and high of x o D over each class, then build a polygonal hull: tightly for Linear and loosely for Medoid. For a loose
hull, "none of the above" must be examined separately, at great expense. If we always build a tight polygonal hull, we end up with just one method:
FAUST Polygon classifier: Take the above series of D vectors. (Add additional Ds if you wish, the more the merrier, but watch out for the cost of computing {Ck o D} for k = 1..K;
e.g., add CAFFAs (Class Average-FurthestFromAverage) and CFFAFFFs (Class FurthestFromAverage-FurthestFromFurthest).)
For each D in the D-series, let lD,k = min(Ck o D) (or the 1st PCI) and hD,k = max(Ck o D) (or the last PCD).
y isa Ck iff y ∈ Hk, where Hk = {z ∈ Space | lD,k ≤ D o z ≤ hD,k for every D in the series}.
If y falls in the hull of more than one class, say y ∈ Hi1 ∩ ... ∩ Hih, y can be fuzzy-classified by weight(y,k) = OneCount{PCk & PHi1 & ... & PHih} and, if we wish, we can
declare y isa Ck where weight(y,k) is a maximum.
Or we can let Sphere(y,r) = {z | (y-z) o (y-z) < r²} vote (but this requires the construction of Sphere(y,r)). A sketch of the tight-hull steps follows.
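A minimal sketch of the tight-hull idea, keeping only the intervals [lD,k, hD,k] per class per D (the PTree OneCount fuzzy weighting, PCI/PCD trimming, and Sphere voting are not shown; the data layout and helper names are illustrative, not from the FAUST code):

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Hull for one class: for each D in the series, [min, max] of Ck o D.
struct ClassHull {
    std::vector<std::pair<double, double>> interval;   // one entry per D
};

// Training: record the low/high of x o D over the class rows for every D.
ClassHull buildHull(const std::vector<std::vector<double>>& classRows,
                    const std::vector<std::vector<double>>& Dseries) {
    ClassHull h;
    for (const auto& D : Dseries) {
        double lo = std::numeric_limits<double>::max();
        double hi = std::numeric_limits<double>::lowest();
        for (const auto& x : classRows) {
            double f = dot(x, D);
            lo = std::min(lo, f);
            hi = std::max(hi, f);
        }
        h.interval.push_back({lo, hi});
    }
    return h;
}

// y isa Ck iff lD,k <= D o y <= hD,k for every D in the series.
bool inHull(const std::vector<double>& y, const ClassHull& h,
            const std::vector<std::vector<double>>& Dseries) {
    for (size_t d = 0; d < Dseries.size(); ++d) {
        double f = dot(y, Dseries[d]);
        if (f < h.interval[d].first || f > h.interval[d].second) return false;
    }
    return true;
}
```

A sample that falls inside more than one class hull would then be resolved by the fuzzy weighting described above, or by the Sphere(y,r) vote.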
DT = 1. PCCD: PCCs must involve a high ≥ 5 and a ≥ 60% change (≥ 2 if the high = 5) from that high. Gaps must be ≥ 3.
FCC on IRIS150
Is there a better measure of gap potential than STD? How about normalized STD, NSTDk = Std(Xk - min Xk) / Spread(Xk)?
Set Dk to 1 for each column with NSTDk > NSTDT = 0.2 (computed NSTD values include .22 and .18), so D = 1011. Make Gap and PCC cuts on Y o D.
The FCC clusterer algorithm: if Density < DT at a dendrogram node C (a cluster), partition C at each gap and PCC in C o D, using the next D in the recursion plan.
FCC on IRIS150, 1st round: D = 1011, F(x) = (x - p) o D, randomly selected p = (15 6 2 5). [Slide table of F values, counts, and gaps omitted; the gap/PCC cuts yield sub-clusters including C1(0 4 1), C2(0 17 1), C3(0 3 2), and C4(0 9 37).] 91.3% accurate after the 1st round.
FCC on IRIS150, 2nd round: D = 0100, p = (15 6 2 5), applied to C1, C2, C3, and C4. [Slide tables of F values, counts, and gaps omitted; among the results, C4 splits off C41(0 5 5).] 95.3% accurate after the 2nd round.
FCC on IRIS150, 3rd round: D = 1000, p = (15 6 2 5), applied to C41. [Slide table omitted.] Approximately 97% accurate after the 3rd round.
We can develop a fusion method by looking for non-gaps in projections onto the vector connecting the medians of cluster pairs.
FCC on IRIS150
DT = 1. PCCD: PCCs must involve a high ≥ 5 and be at least a 60% change (≥ 2 if the high = 5) from that high. Gaps must be ≥ 3.
1st round: D = 1010 (the highest-STD diagonal). [Slide table of F values, counts, and gaps omitted; cuts yield (50 0 0), C1(0 18 1), C2(0 22 18), and several small clusters.] 87.3% accurate after the 1st round.
2nd round: D = 0001, applied to C1 and C2, giving C21(0 17 3) and C22(0 1 12) among others. 97.3% accurate after the 2nd round.
FCC on IRIS150 (DT = 1; PCCD: PCCs must involve a high ≥ 5 and be at least a 60% change, ≥ 2 if the high = 5, from that high; gaps must be ≥ 3).
1st round: D = 1-111 (the highest STD/spread diagonal). [Slide table omitted; cuts yield (50 0 0), C1(0 25 2), C2(0 16 11), and (0 0 37).] 91.3% accurate after the 1st round.
2nd round: D = 1-1-1-1 (the highest STD/spread of the diagonals perpendicular to 1-111), applied to C1 and C2, giving C21(0 3 10) among others. 97.3% accurate after the 2nd round.
FCC on SEED150
DT = 1. PCCs must involve a high of at least 5 and be at least a 60% change from that high. Gaps must be ≥ 3.
NSTDk = Std(Xk - min Xk) / Spread(Xk); for SEED the four column NSTD values are 0.28, 0.28, 0.22, 0.47.
The Ultimate PCC (UPCC) clusterer algorithm(?):
Set Dk to 1 for each column with NSTDk > ThresholdNSTD (NSTDT = 0.25).
Shift the X column values as in Gap Preservation above, giving the shifted table Y.
Make Gap and PCC cuts on Y o D.
If Density < DT at a dendrogram node C (a cluster), partition C at each gap and PCC in C o D, using the next D in the recursion plan. (A sketch of the NSTD-based choice of D appears below.)
Using UPCC with D = 1111, first round only: [slide table of F values, counts, and gaps omitted] the cuts give clusters with compositions (44L, 1M, 47H), (1L, 0M, 0H), (3L, 9M, 3H), (2L, 31M, 0H), and (0L, 9M, 0H); errs = 53.
Using UPCC with D = 1001, first round only: [slide table omitted] the cuts give (42L, 1M, 50H), (0L, 22M, 0H), (8L, 21M, 0H), and (0L, 6M, 0H); errs = 51.
D = 1101, 1st round: [slide tables omitted; reported cluster compositions include C1(50L, 22M, 50H), C2(0L, 21M, 0H), C3(0L, 7M, 0H), and several smaller clusters such as (6L, 17M, 0H), (13L, 1M, 2H), (16L, 0M, 6H), (2L, 0M, 12H), and (3L, 0M, 28H)]; errs = 72.
2nd round, D = AFFA, splits C1 into sub-clusters C1,1 through C1,8; 85% accurate.
3rd round, D = 0010, applied to C1,2, C1,3, C1,6, and C1,8: [slide tables omitted] 97.3% accurate.
FCC on CONC150
PCCs: high ≥ 5 with a ≥ 60% change from that high. Gaps: ≥ 3. DT = 1.
NSTDk = Std(Xk - min Xk) / Spread(Xk); the four column NSTD values are 0.25, 0.25, 0.24, 0.22.
CONC class counts are L = 43, M = 52, H = 55.
1st round, D = 1111:
[Slide tables of F values, counts, and gaps omitted. The 1st-round cuts produce clusters including C2(2 2 0), C3(24 13 5), C4(6 4 1), C5(2 0 4), C6(5 6 11), C9(0 11 16), C10(1 5 2), C11(1 4 4), and C15(0 5 5).] 53% accurate after the 1st round.
2nd round, D = 1100, applied to C2, C3, C4, C5, C6, C9, C10, C11, and C15: [slide tables omitted] 90% accurate after the 2nd round.
FCC on WINE150
NSTDk = Std(Xk - min Xk) / Spread(Xk). Column statistics for WINE: STD = .017, .005, .001, .037; SPREAD = 10, 39, 112, 5.6; STD/SPR = .18, .23, .21, 2.38.
WINE class counts are L = 57, M = 75, H = 18. DT = 1. PCCs: high ≥ 5 with a ≥ 60% change from that high.
1st round, D = 0001: [slide table of F values, counts, and gaps omitted; clusters include C2(22 4 1), C3(14 17 1), C4(18 28 7), C5(1 20 4), and C6(1 6 3).] 64% accurate after the 1st round.
2nd round (D = 0100 on C2; D = 0010 on C4 and C6): [slide tables omitted; sub-clusters include C21, C22, C31, C32(1 5 3), C33(1 2 1), C34(4 7 1), and C35(4 8 1).]