Pattern-based data mining - ICL Database & Commentary
Surveillance by the National Defence Radio Establishment (FRA) and Data Mining
Mark Klamberg, doctoral candidate
1
1 November 2008
Background
•Legislation adopted June 18th 2008, the “FRA Law”
•Six members of parliament from the Government coalition threatened to
join the opposition in repealing the law during the summer of 2008
•Agreement between the six critics and the Government coalition
on September 25th 2008 to amend the law
2
How will FRA be able to access information when an increasing
number of users choose to encrypt their messages?
This is especially relevant, as there has been a tendency for
encryption techniques to develop at a faster rate than decryption
techniques.
What will happen to all this incoming electronic traffic once it has
been re-routed and fed into the FRA agency?
3
The digital revolution affects our privacy more than we think.
We leave electronic ‘footprints’ whatever we
do: paying by credit card, visiting websites, calling
friends on the phone or sending them an e-mail. Imagine that
someone decides to collect all this information and assemble it in
a massive database. Using the right tools they will be able to
identify your lifestyle patterns and gain insight into your
personality. This is called social network analysis, a technique
that falls under the wider concept of data mining.
4
Recurring personality patterns can be graphically illustrated by means of a
sociogram. A sociogram is a graphic representation of the relationships between
persons, organisations, homepages, etc., with a view to determining personal
social networks, positions of power, views and beliefs and other personal
information.
The actual message is less important than
the information about the sender, recipient,
the time of transaction, and means of
communication.
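A minimal sketch of how a sociogram could be assembled from traffic data alone, assuming each record holds only sender, recipient, time and means of communication. The record layout, the function name and the sample records are hypothetical and are not taken from any FRA system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficRecord:
    """One item of traffic data; the message content is deliberately absent."""
    sender: str
    recipient: str
    timestamp: str  # kept as an opaque string in this sketch
    channel: str    # e.g. "phone" or "email"


def build_sociogram(records):
    """Count contacts per unordered pair of parties."""
    edges = Counter()
    for record in records:
        pair = frozenset((record.sender, record.recipient))
        if len(pair) == 2:  # ignore a party contacting itself
            edges[pair] += 1
    return edges


# Hypothetical sample data.
records = [
    TrafficRecord("A", "B", "2008-10-01T09:00", "phone"),
    TrafficRecord("B", "A", "2008-10-01T09:05", "email"),
    TrafficRecord("A", "C", "2008-10-02T14:30", "phone"),
]
sociogram = build_sociogram(records)
```

The resulting edge counts are the raw material of a sociogram: who is linked to whom, and how often, without any message text being read.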
5
Different individuals can be linked to different sociograms: we have different
everyday experiences, social relations, interests, views and beliefs, all of which
are reflected in our electronic communication contacts. Sociograms have
applications in a plethora of areas. With the help of a powerful computer and
appropriate analytical tools we might thus be able to build up a profile of and
identify a typical benefit scrounger, a refugee in hiding, a data hacker, a
homosexual couple, or a political activist, to give just a few examples.
6
If we also monitor cross-border traffic we will be able to – at least
theoretically – build sociograms identifying currency speculators, or foreign
political and military leaders.
It is a well-known fact, however, that the best results are obtained from
monitoring a public that is unaware of being watched, or those who cannot
protect themselves against it.
7
Key feature of the “FRA-law”
IT and telecom operators are obliged to transfer to the State all
traffic in cables crossing Swedish borders
8
Definitions
Distinguish between
• Transfer to the state (stage 1) and collection and processing
(stage 2)
• Content data (text of the message) and traffic data (who is
contacting whom, when and how)
9
Collection and processing of data (stage 2)
The FRA has a mandate to monitor and collect content data for
certain purposes including external military threats, terrorism and
IT-attacks. The Agreement of September 25th specifies these
purposes. The FRA may under certain conditions collect
and process content data when an individual is targeted.
FRA can also provide assistance to the Police within the
parameters set by the specific purposes of the “FRA-law”.
The question about assistance to the Police has not been finally
settled.
10
Intelligence court
According to the agreement of September 25th, the collection of
data would be placed under the control of an “intelligence court”
which operates behind closed doors. The court will, among
other things, limit the FRA’s collection and processing of content
data.
11
Traffic Data
No restrictions on collection of traffic data, the basis of the FRA
operations (analysis of traffic patterns)
Traffic data on Swedes (and others) have been collected for
more than 10 years without legal basis. The FRA has stated in an
internal document that it intends to continue to collect
traffic data to the same extent.
According to the adopted law the FRA has the mandate to
collect, process and store all available traffic data. This is
necessary for analysis and targeting. Targeting relates to
what content data should be collected and processed.
13
Profiling and targeting
Profiling and targeting can be done
using phone numbers and technical
parameters (for example, Internet
Protocol addresses).
In addition, according to the law a
person’s race, ethnicity, political
views, religious and philosophical
beliefs, membership of a labour union,
health or sexuality may under certain
conditions be used for targeting.
Targeting is, inter alia, done by the
use of traffic data
14
A problem
A significant problem is that data of
this kind must be collected over a long
period of time, and that we cannot
know beforehand who will satisfy the
deviance criterion linked to an
external threat. This is why the FRA
has to store data on a great
number of people, which means
keeping close tabs on practically
everybody, whether they are innocent
or not.
15
Transfer of personal data
According to the law personal data collected by the
FRA may be transferred to other countries.
16
A critical remark
Is this kind of data collection and surveillance…
• Consistent with the right to privacy? This is both a
human right and a constitutional right.
• Efficient?
• Proportional?
• Reliable, in the sense that it gives
accurate results and not false alarms?
17
Data Mining
18
Terrorist cells
[Network diagram; nodes:]
• Mohammad Atta, American Airlines Flight 11
• Marwan al-Shehhi, United Airlines Flight 175
• Hani Hanjour, American Airlines Flight 77
• Ziad Jarrah, United Airlines Flight 93
• Khalid Sheikh Mohammed, architect of the attacks
19
Disclaimer: This network scheme is partly fictional. Klamberg
U.S. National Research Council, report October 2008
“Protecting Individual Privacy in the Struggle Against Terrorists: A
Framework for Program Assessment”
20
Two general types of data mining techniques
1. Subject-based data mining
2. Pattern-based data mining
U.S. National Research Council
“Protecting Individual Privacy in the Struggle Against
Terrorists: A Framework for Program Assessment”
21
Subject-based data mining
Subject-based data mining uses an initiating individual or
other datum that is considered, based on other information,
to be of high interest, and the goal is to determine what other
persons or financial transactions or movements, etc., are
related to that initiating datum.
U.S. National Research Council
22
Pattern-based data mining
Pattern-based data mining looks for patterns (including
anomalous data patterns) that might be associated
with terrorist activity—these patterns might be regarded as
small signals in a large ocean of noise.
U.S. National Research Council
23
When to use the two different techniques
In the case of the decentralized group, subject-based data
mining is likely to augment and enhance traditional police
investigations by making it possible to access larger volumes
of data more quickly. Furthermore, communications networks
can more easily be identified and mapped if one or a few
individuals in the network are known with high confidence.
By contrast, pattern-based data mining may be more useful
in finding the larger information footprint that characterizes
centrally organized terrorist groups.
U.S. National Research Council
24
Subject Based Data Mining
[Diagram labels: Terrorists; Strong transaction; Initiating individual]
1. Assumptions
a) Use of an initiating individual
b) Studies strong transactions (double line)
c) Ignores weak transactions (single line)
2. Method
Search for communication patterns that match the
abovementioned assumptions
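The method above can be sketched as a breadth-first walk that starts from the initiating individual and follows strong transactions only. This is an illustration under stated assumptions, not the FRA's procedure: the threshold that makes a transaction "strong", the hop limit, the function name and the sample data are all hypothetical.

```python
from collections import deque


def subject_based_search(edges, initiating, strong_threshold=3, max_hops=2):
    """Return {person: hops} for people reachable over strong edges only.

    edges maps frozenset({a, b}) to a contact count (two distinct parties).
    """
    neighbours = {}
    for pair, count in edges.items():
        if count >= strong_threshold:  # weak transactions are ignored
            a, b = tuple(pair)
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    hops = {initiating: 0}
    queue = deque([initiating])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    hops.pop(initiating)
    return hops


# Hypothetical contact counts.
edges = {
    frozenset(("S", "P")): 5,
    frozenset(("P", "Q")): 4,
    frozenset(("S", "W")): 1,  # weak link: never followed
    frozenset(("Q", "R")): 6,  # strong, but three hops from S
}
result = subject_based_search(edges, "S")
```

In this sketch W is never reached because the link is weak, which is how a false negative can arise, while P and Q are swept in purely because they communicate often, whether or not they are involved.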
25
Subject Based Data Mining
[Network diagram marking false negatives and false positives; nodes:]
• Nawaf al-Hazmi, American Airlines Flight 77
• Hani Hanjour, American Airlines Flight 77
• Mohammad Atta, American Airlines Flight 11
• Ziad Jarrah, United Airlines Flight 93
• Mohammed Jahanshahi
• Khalid Sheikh Mohammed, architect of the attacks
• Marwan al-Shehhi, United Airlines Flight 175
26
Disclaimer: This network scheme is partly fictional. Klamberg
Pattern Based Data Mining
[Diagram labels: Terrorists; Ring leader; Terrorist cell]
1. Assumptions
a) A terrorist cell consists of 4-5 members
b) The terrorist cell has one ring leader
c) The terrorist cell is part of a larger network
d) Only the ring leader communicates with
other parts of the terrorist network
e) The members of the terrorist cell only
communicate with the ring leader
2. Method
Search for communication patterns that match the
abovementioned assumptions
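A minimal sketch of such a pattern search, assuming the communication network is available as a mapping from each party to the set of parties it contacts. It encodes assumptions a) to e) literally; the function name, the data layout and the sample network are hypothetical.

```python
def find_candidate_cells(adjacency, min_size=4, max_size=5):
    """Return (leader, members) pairs that match assumptions a) to e)."""
    cells = []
    for leader, contacts in adjacency.items():
        # e) members communicate with the ring leader and nobody else
        members = {c for c in contacts if adjacency.get(c, set()) == {leader}}
        # c), d) the leader alone links the cell to the wider network
        external = contacts - members
        # a), b) the cell is the members plus exactly one ring leader
        cell_size = len(members) + 1
        if min_size <= cell_size <= max_size and external:
            cells.append((leader, sorted(members)))
    return cells


# Hypothetical network: one cell-shaped group and one sports team.
adjacency = {
    "L": {"m1", "m2", "m3", "X"},
    "m1": {"L"}, "m2": {"L"}, "m3": {"L"},
    "X": {"L", "Y"},
    "Y": {"X"},
    "coach": {"p1", "p2", "p3", "league"},
    "p1": {"coach"}, "p2": {"coach"}, "p3": {"coach"},
    "league": {"coach", "office"},
    "office": {"league"},
}
cells = find_candidate_cells(adjacency)
```

Both groups match, because the pattern describes a shape of communication rather than an intent. The coach and the three players are flagged exactly as the group around L is, which is how false positives arise with this technique.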
27
Pattern Based Data Mining
[Network diagram marking a false negative and false positives; nodes:]
• Nawaf al-Hazmi, American Airlines Flight 77
• Hani Hanjour, American Airlines Flight 77
• Mohammad Atta, American Airlines Flight 11
• Marwan al-Shehhi, United Airlines Flight 175
• Ziad Jarrah, United Airlines Flight 93
• Khalid Sheikh Mohammed, architect of the attacks
• Mohammed Jahanshahi
28
Disclaimer: This network scheme is partly fictional. Klamberg
Utility of pattern-based data mining
The utility of pattern-based data mining is found
primarily if not exclusively in its role in helping humans
make better decisions about how to
deploy scarce investigative resources, and action
(such as arrest, search, denial of rights) should never
be taken solely on the basis of a data mining result.
Automated terrorist identification through data mining
(or any other known methodology) is neither feasible
as an objective nor desirable as a goal of technology
development efforts.
U.S. National Research Council
29
No single demographic profile
Those who become terrorists “are a diverse collection of
individuals, fitting no single demographic profile, nor do they
follow a typical pathway to violent extremism”.
MI5
30
The problem with false positives
In addition to the highly desirable true positives and true
negatives that are produced, there will be the very
troublesome false positives (i.e., a person telling the truth is
thought to be lying) and false negatives (i.e., a person lying is
thought to be telling the truth). Such errors are linked to the
probabilistic nature of behavioral signals
U.S. National Research Council
31
Trade-off between false positives and
false negatives?
Are the consequences of a false negative (a terrorist plan is not detected
and many people die) much larger than the consequences of a false
positive (an innocent person loses privacy or is detained)?
There is no reason to expect that false negatives and false positives trade
off against one another in a one-for-one manner. In practice, the trade-off
will almost certainly entail one false negative against an enormous
number of false positives, and a society that tolerates too much harm to
innocent people based on a large number of false positives is no longer a
society that respects civil liberties.
U.S. National Research Council
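The imbalance described above follows from simple arithmetic once genuine targets are rare. The sketch below uses invented figures purely for illustration: a population of 9 million, 100 genuine targets, a detector that finds 99% of them and wrongly flags 0.1% of everyone else. None of these numbers come from the FRA or from the National Research Council report.

```python
def screening_outcome(population, true_targets, sensitivity, false_positive_rate):
    """Expected outcome counts when a whole population is screened."""
    innocents = population - true_targets
    true_positives = true_targets * sensitivity
    false_negatives = true_targets - true_positives
    false_positives = innocents * false_positive_rate
    flagged = true_positives + false_positives
    # Share of flagged people who are genuine targets.
    precision = true_positives / flagged if flagged else 0.0
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "precision": precision,
    }


# Hypothetical figures, chosen only to show the effect of a low base rate.
outcome = screening_outcome(
    population=9_000_000,
    true_targets=100,
    sensitivity=0.99,
    false_positive_rate=0.001,
)
```

Under these assumed figures about 9,000 innocent people are flagged alongside 99 genuine hits, so only about 1% of flags are correct, and pushing the single remaining false negative down further would mean flagging even more innocent people.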
32
Questions?
33
Thanks!
Contact:
[email protected]
+46 8 16 11 90
34