Transcript Slide 1
Fifth Workshop on Link Analysis,
Counterterrorism, and Security.
Antonio Badia
David Skillicorn
Open Problems
An individualized list
(with some feedback from workshop participants)
Process improvements:
• Better overall processes;
Defence in depth is the key to lower error rates; what is genuinely good/normal should look that way from every direction;
• Handling multiple kinds of data at once (e.g., attribute data together with relational data);
We know of very few algorithms that exploit more than one type of data within the same algorithm;
• Using graph analysis techniques more widely;
Although there are good reasons to expect a graph approach to be more robust than a direct approach, this is hardly ever done, for understandable reasons: it is harder and messier;
• Better ways to exploit the fact that normality implies internal consistency;
This only makes sense in an adversarial setting, so it has received little attention, but it is a good, basic technique (a small sketch follows this list);
• Legal and social frameworks for preemptive data analysis;
The arguments for widespread data collection, and ways to mitigate its downsides, need to be developed further and explained by the knowledge-discovery community to those who have legitimate concerns about the cost/benefit tradeoff;
• Challenges of open virtual worlds;
New virtual worlds, such as the Multiverse, make it much harder to gather data using any kind of surveillance; the consequences need to be understood;
• Focus on emergent properties rather than collected ones;
Attributes that are derived from the collective properties of many individual
records are much more resistant to manipulation than those collected directly in
individual records;
• Collaboration with linguists, sociologists, anthropologists, etc.;
Applying technology well depends on deeper understanding of context, and computing
people do not necessarily do this well;
• Better use of visualization, especially multiple views;
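
A note on the internal-consistency point above: the Python sketch below (with entirely hypothetical field names and records) scores each record by how common its pairs of field values are across the whole dataset, on the assumption that fabricated records often combine individually plausible values in implausible ways. It is only a sketch of the idea, not a worked-out technique.

from collections import Counter
from itertools import combinations

# Hypothetical records: the last one combines individually common values
# (a real city, a real area code) in a way the rest of the data never does.
records = [
    {"city": "Louisville", "area_code": "502", "employer": "u-louisville"},
    {"city": "Louisville", "area_code": "502", "employer": "u-louisville"},
    {"city": "Louisville", "area_code": "502", "employer": "u-louisville"},
    {"city": "Kingston",   "area_code": "613", "employer": "queens-u"},
    {"city": "Kingston",   "area_code": "613", "employer": "queens-u"},
    {"city": "Kingston",   "area_code": "613", "employer": "queens-u"},
    {"city": "Louisville", "area_code": "613", "employer": "queens-u"},
]

# Count how often each pair of field values co-occurs across the dataset.
fields = list(records[0])
pair_counts = {pair: Counter() for pair in combinations(fields, 2)}
for r in records:
    for f1, f2 in pair_counts:
        pair_counts[(f1, f2)][(r[f1], r[f2])] += 1

def consistency(record):
    """Average relative frequency of the record's field-value pairs."""
    scores = [counts[(record[f1], record[f2])] / len(records)
              for (f1, f2), counts in pair_counts.items()]
    return sum(scores) / len(scores)

# The internally inconsistent record gets a visibly lower score.
for r in records:
    print(f"{consistency(r):.2f}  {r}")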
“Easy” technical advances:
• Hardening standard techniques against manipulation (by insiders and outsiders);
Most existing algorithms are seriously vulnerable to manipulation by, e.g., adding a few particular data records (a sketch of this vulnerability follows this list);
• Distinguishing the bad from the unusual;
It’s straightforward to identify the normal records in a dataset, but once these have been removed, it still remains to separate the bad from the merely unusual; little has been done to attack this problem;
• Getting graph techniques to work as well as they should;
Although graph algorithms have known theoretical advantages, it has been
surprisingly difficult to turn these into practical advantages;
• Strong but transparent predictors;
We know predictors that are strong, and predictors that are transparent (they explain their predictions), but we don’t know any that are both at once;
• Detecting when models need to be updated because the setting has changed;
In adversarial settings there is a constant arms race, and so a greater need to update models regularly; automatic ways to detect when this is necessary are largely lacking;
• Clustering to find ‘fringe’ records;
In adversarial settings, the records of interest are likely to lie close to the normal data rather than being outliers; techniques for detecting such fringe clusters are needed (see the sketch after this list);
• Better 1-class prediction techniques;
In many settings, only normal data is available; existing 1-class prediction is unusably fragile (a sketch of the basic approach follows this list);
• Temporal change detection (trend/concept drift in every analysis);
One way to detect manipulation is to notice change for which there seems to be no explanation; being able to flag such change automatically would be useful (see the change-detection sketch after this list);
• Keyless fusion algorithms, and an understanding of the limits of fusion;
Most fusion uses key attributes that are thought of as describing identity, but, anecdotally, almost any set of attributes can play this role, and we need to understand the theory and the limits (a sketch follows this list);
• Better symbiotic knowledge discovery – humans and algorithms
coupled together;
Many analysis systems have a loop between analyst and knowledge-discovery tools,
but there seem to be interesting ways to make this loop more productive;
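
To make the manipulation point above concrete, here is a small numpy sketch (synthetic data, deliberately simplistic detector) in which an adversary hides an anomalous record from a z-score outlier test by injecting a couple of dozen widely scattered records that inflate the estimated spread. All numbers are arbitrary; the point is how little effort the manipulation takes.

import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # background data
target = np.array([6.0, 6.0])                             # record the adversary wants hidden

def flagged(data, point, threshold=3.0):
    """Crude detector: flag a point whose z-score exceeds the threshold on any attribute."""
    mean, std = data.mean(axis=0), data.std(axis=0)
    return bool(np.any(np.abs((point - mean) / std) > threshold))

print("flagged before injection:", flagged(normal, target))   # expect True

# The adversary inserts a small number of widely scattered records, inflating
# the estimated spread so that the target record no longer stands out.
decoys = rng.normal(loc=0.0, scale=20.0, size=(20, 2))
print("flagged after injection: ", flagged(np.vstack([normal, decoys]), target))   # expect False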
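
For the ‘fringe’ records point, the following numpy sketch (synthetic two-dimensional data, arbitrary thresholds) ranks records by distance from the centroid and then looks in the band just outside the dense core, rather than in the extreme tail where conventional outlier detectors look. It is only meant to show where ‘fringe’ sits relative to ‘normal’ and ‘outlier’.

import numpy as np

rng = np.random.default_rng(1)

def ring(n, radius):
    """Points at roughly the given distance from the origin."""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    return radius * np.column_stack([np.cos(angles), np.sin(angles)])

core    = rng.normal(0.0, 1.0, size=(500, 2))   # bulk of the normal records
fringe  = ring(8, 3.5)                          # close to normal, but not quite normal
extreme = ring(4, 10.0)                         # obvious outliers
data = np.vstack([core, fringe, extreme])

# Rank every record by distance from the centroid, then look in the band just
# outside the dense core rather than in the extreme tail. The percentile cuts
# are illustrative only.
dists = np.linalg.norm(data - data.mean(axis=0), axis=1)
core_cut, outlier_cut = np.percentile(dists, [95, 99])
candidates = np.where((dists > core_cut) & (dists <= outlier_cut))[0]

print("candidate fringe records (indices 500-507 are the planted ones):", candidates)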
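
For the 1-class prediction point, this is the standard baseline being complained about: a one-class SVM trained only on normal records (scikit-learn, synthetic data, arbitrary parameters). The fragility referred to above shows up in practice as extreme sensitivity to the nu and gamma parameters.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
normal_train = rng.normal(0.0, 1.0, size=(500, 2))        # only normal data is available
new_records  = np.array([[0.2, -0.1],                     # looks like the training data
                         [4.0,  4.0]])                    # does not

# nu bounds the fraction of training data treated as outliers; gamma shapes the
# boundary. Small changes to either can change the answers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
print(model.predict(new_records))    # +1 = consistent with normal, -1 = not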
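
For temporal change detection, one simple baseline is to compare each new window of a stream against a reference window with a two-sample test and flag windows that differ with no apparent explanation. The sketch below uses a Kolmogorov-Smirnov test on a synthetic stream whose behaviour shifts part-way through; the window size and significance level are arbitrary.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
# A synthetic stream whose distribution shifts after 600 observations.
stream = np.concatenate([rng.normal(0.0, 1.0, 600),
                         rng.normal(0.8, 1.0, 400)])

window = 100
reference = stream[:window]
for start in range(window, len(stream), window):
    stat, p = ks_2samp(reference, stream[start:start + window])
    if p < 0.01:
        print(f"possible change in window starting at {start} (p={p:.4f})")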
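
For keyless fusion, the anecdote above is that almost any set of attributes can act as an identity. The toy sketch below (hypothetical fields and records) links records from two sources simply by the fraction of ordinary attributes on which they agree; a real system would need per-attribute similarity functions and a theory of when such matches are trustworthy.

# Two sources with no shared key attribute.
source_a = [
    {"city": "Kingston",   "age_band": "30-39", "occupation": "engineer"},
    {"city": "Louisville", "age_band": "50-59", "occupation": "professor"},
]
source_b = [
    {"city": "Louisville", "age_band": "50-59", "occupation": "professor", "phone": "555-0101"},
    {"city": "Kingston",   "age_band": "20-29", "occupation": "engineer",  "phone": "555-0102"},
    {"city": "Kingston",   "age_band": "30-39", "occupation": "engineer",  "phone": "555-0103"},
]

def agreement(a, b):
    """Fraction of a's attributes on which the two records agree."""
    return sum(a[k] == b.get(k) for k in a) / len(a)

# Link each record in source_a to its best match in source_b.
for a in source_a:
    best = max(source_b, key=lambda b: agreement(a, b))
    print(a, "->", best, f"(agreement {agreement(a, best):.2f})")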
Difficult technical advances:
• Finding larger structures in text;
Very little structure above the level of named entities is extracted at present, but there are opportunities to extract larger structures, both to check for normality and to understand content better;
• Authorship detection from small samples;
The web has become a place where authors are plentiful, and it would be useful to detect that the same person wrote this blog post and that forum comment (a sketch appears at the end of this list);
• Unusual region detection in graphs;
Most graph algorithms focus either on clustering or on exploring the region around a single node; it is also interesting to find regions that are somehow anomalous (see the sketch at the end of this list);
• Performance improvements to allow scaling to very large datasets;
Changes of three orders of magnitude in quantity require changes in the qualitative properties of algorithms; scalability issues need more attention;
• Better use of second-order algorithms;
Approaches in which an algorithm is run repeatedly under different conditions, and it is the change from one run to the next that is significant, have potential but are hardly ever used (a sketch appears at the end of this list);
• Systemic functional linguistics for content/mental state extraction
from text;
SFL takes into account the personal and social dimensions of language, and brings
together texts that look very different on the surface; this will have payoffs in
several dimensions of text exploitation;
• Adversarial parsing (cf. error correction in compilers);
When text has been altered for concealment, compiler techniques may help to spot where the changes have occurred and what they might have been.
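
On authorship detection from small samples, a common starting point is to compare character n-gram profiles, which hold up reasonably well for short texts. The sketch below (made-up snippets) computes cosine similarity between 3-gram profiles; real attribution would need longer samples, better features, and calibrated thresholds.

from collections import Counter
import math

def profile(text, n=3):
    """Character n-gram counts, with whitespace normalised."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm

blog_post = "Frankly, I don't buy the argument; the data simply isn't there."
forum_a   = "Frankly, I don't buy it; the evidence simply isn't there, is it?"
forum_b   = "This analysis is correct and well supported by the available data."

print("candidate same author:", round(cosine(profile(blog_post), profile(forum_a)), 2))
print("different author:     ", round(cosine(profile(blog_post), profile(forum_b)), 2))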
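
On unusual region detection in graphs, one simple family of approaches scores each node's egonet (the node, its neighbours, and the edges among them) against what is typical for its size, in the spirit of egonet-based anomaly detection. The networkx sketch below plants an overly dense region in a random graph and recovers it from the residuals of a rough power-law fit; every detail of the construction is illustrative only.

import networkx as nx
import numpy as np

# A random background graph with an unusually dense planted region.
G = nx.erdos_renyi_graph(200, 0.03, seed=4)
planted = range(10)
G.add_edges_from((i, j) for i in planted for j in planted if i < j)

# For each node, record the size and edge count of its egonet.
nodes = list(G.nodes())
sizes = np.array([nx.ego_graph(G, n).number_of_nodes() for n in nodes])
edges = np.array([nx.ego_graph(G, n).number_of_edges() for n in nodes])

# Fit a rough power law between egonet size and edge count; nodes whose
# neighbourhoods have far more edges than expected for their size are flagged.
log_s, log_e = np.log(sizes), np.log(edges + 1)
slope, intercept = np.polyfit(log_s, log_e, 1)
residuals = log_e - (slope * log_s + intercept)
print("nodes with unusual neighbourhoods:", [nodes[i] for i in np.argsort(residuals)[-10:]])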
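
On second-order algorithms, the sketch below (scikit-learn k-means, synthetic data) reruns the same clustering with each record left out and asks which record's removal changes the result most; the signal is the change between runs rather than the output of any single run. The influence measure is deliberately simplistic.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(6.0, 1.0, size=(50, 2)),
                  [[3.0, 3.0]]])          # one record sitting between the two groups

def centroids(points):
    """Cluster centres, sorted so that different runs are comparable."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    return km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]

base = centroids(data)
influence = [np.linalg.norm(centroids(np.delete(data, i, axis=0)) - base)
             for i in range(len(data))]
most = int(np.argmax(influence))
print("most influential record:", most, data[most])   # expect the in-between record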