What is AISEC

Download Report

Transcript What is AISEC

Internal Presentation by:
Lei Wang
Pervasive and Artificial Intelligenge research group
http://diuf.unifr.ch/pai
On:
An Artificial Immune System for
E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of Kent
Canterbury, Kent, UK
http://www.cs.kent.ac.uk/~ads3
19/02/2004
An Artificial Immune System
for E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of Kent
Canterbury, Kent, UK
http://www.cs.kent.ac.uk/~ads3
19/02/2004
Significance

With the increase in information on the Internet, the strive to find more
effective tools for distinguishing between interesting and non-interesting
material is increasing.

This paper provides an immune-inspired algorithm called AISEC that is
capable of continuously classifying electronic mail as interesting and
non-interesting without the need for re-training.

Comparing with a naïve Bayesian classifier, the system proposed in
this paper performs as well as the naïve Bayesian system and has a
great potential for augmentation.
19/02/2004
AISEC, immunity-inspired system

Immune system


Human body constantly under attack. Immune system must adapt and
respond
The (natural) immune system is:
1.
2.
3.
4.

Dynamic
Adaptive
Robust
Etc.
Artificial Immune Systems (AIS) use principles and
process from observed and theoretical immunology to
solve problems
19/02/2004
Artificial Immune Systems

Engineering framework


Representation of individual immune cells
Affinity measures
• Evaluate interaction of individuals with environment
and/or each other


Algorithms
• Procedures of adaptation manipulate populations of
immune cells
AIS as a classifier

AIRS
• A successful supervised AIS algorithm for classification
19/02/2004
AIS for Web Mining

Web mining, an umbrella term used to describe three quite
different types of data mining:
 Content mining
• A process of extracting useful information from the text, images
and other forms of content that make up the pages
• The mining of textual data is a common task, often for the
purposes of information retrieval
Usage mining
 Structure mining
AISEC research goal




To develop a highly adaptive system capable of retrieving interesting
information from the internet based on user’s current interests
The authors believe AIS may offer a number of advantages
19/02/2004
What is AISEC ?

AISEC isn’t a spam filter


It has no methods to penalize false positives (loss of important
e-mail)
Without a very low false positive rate, a spam filter would not be
trusted
19/02/2004
What is AISEC ?

AISEC is

A first step towards an AIS for web mining.
•

A study of performance and characteristics of an AIS applied to text
mining in a dynamic domain
A text classification algorithm capable of continuous adaptation, which
may yield a classification accuracy comparable to a Bayesian
approach.
•
•
•
•
User behaviour and interaction with e-mail can be similar to web pages
Supervised classification algorithm
 E-mail classified as interesting and uninteresting
 Uses constant(ish) feedback from user
 Capable of continuous adaptation
This tracks concept drift and can also handle concept shift
A specialised AIS algorithm based in part on the immune principle of
clonal selection
 No previously documented algorithm was suited for use in this
situation without extensive changes
19/02/2004
Representation

Each cell contains 3 sets of words (+ state)
 Punctuation is removed from fields
 Research literature has suggested header information is
enough to accurately classify e-mail*
A = [<free,DVD> , <sales,com> , < canterbury,UK>]
Subject field
Sender field
Return field
Title of the E-mail
Sender’s name
(Sender’s address)
* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000
19/02/2004
Affinity

Affinity value is proportion of words in one cell found in another

More features would require a less naïve distance measure

Cosine distance is an obvious choice

Resultant value always between 0 and 1
PROCEDURE affinity (bc1, bc2)
IF(bc1 has a shorter feature vector than bc2)
bshort ← bc1, blong ← bc2
ELSE
bshort ← bc2, blong ← bc1
count ← the number of words in bshort present in blong
bs_len ← the length of bshort’s feature vector
RETURN count/bs_len
A = [<free,DVDs> , <offers,DVD,com> , <offers,DVD,com>]
B = [<half,price,sale>,<sales,DVD,com>,<sales,DVD,com>]
affinity(A,B) = 4/9
19/02/2004
Clone-Mutation

One mutation takes a word
previously used in subject or
address and replaces single
location


Subject, sender and return
address libraries are kept
separately
Usually >1 mutation per cell
takes place
Subjectlib
SenderLib
ReturnLib
= free,DVD
= sales,DVD,com
= sales,DVD,com
PROCEDURE clone_mutate(bc1,bc2)
aff ← affinity(bc1,bc2)
clones ← ∅
num_clones ← | aff * Kl |
num_mutate ← | (1-aff) * bc’s feature
vector length * Km |
DO(num_clones)TIMES
bcx ← a copy of bc1
DO(num_mutate)TIMES
p ← a random point in bcx’s
feature vector
w ← a random word from the
appropriate gene library
replace word in bcx’s feature
vector at location p with w
bcx’s stimulation level ← Ksb
clones ← clones ∪ {bcx}
RETURN clones
A = [<free,DVD> , <sales,DVD,com> , <sales,DVD,com>]
A = [<free,free> , <sales,DVD,com> , <sales,DVD,com>]
19/02/2004
The algorithm - classification
1. System is initialised with
known uninteresting email
Memory cells
Naive cells
2. E-mail presented for
classification. Classified
as uninteresting as it
stimulates close cells
19/02/2004
The algorithm – correct classification
Classification
Region
Stimulation
Region
3. Highly stimulated cell
reproduces 7 times. Less
stimulated cell produces
only 2 clones but with
higher mutation rate
4. Cell with highest
affinity is known to be
useful therefore
rewarded by becoming
memory cell.
19/02/2004
The algorithm cont…
Incorrect classification
5. Any cell responsible for
incorrect classification is
removed (memory or
otherwise)
Cell removal
•
Aged naïve cells deleted.
Memory cells placed in
already covered areas
also deleted.
19/02/2004
Results – Classification accuracy
2268 e-mails (742 uninteresting) received over 6 months
E-mails presented in the order of date received
Feedback given after EVERY classification
AISEC run 10 times, results show mean
C5.0, neural network and C&R tree all run in “Clementine” data
mining package
Bayesian algorithm used feedback to update like AISEC






Traditional Learning
Continuous Learning
C5.0
83.9%
Naïve Bayesian
88.05%
Naïve Bayesian
85.0%
AISEC
89.09%  0.97
Neural Network
85.6%
AISEC
86.0% 1.29
C&R tree
87.7%
19/02/2004
Results – variation of population size
450
Naïve B-cells
Memory B-Cells
400
350
Number of B-cells
300
250
200
150
100
50
0
100
200
300
400
500
600
700
800
900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200
Generation Number
19/02/2004
User point of view
AISEC runs as a proxy on local machine
Advantages





No need to switch e-mail client
Can collect mail from multiple locations
AISEC’s user interface would require minimal interaction

19/02/2004
User point of view
Collect mail
Client
Interesting
Collect mail
AISEC
Server(s)
Classifier
Return mail
Positive user
response
Local
machine
Uninteresting
Store
User interaction
Negative user
response
19/02/2004
Results cont…

Standard measures of quality


Precision is the proportion of positive documents retrieved
compared with the total number of positive documents
Recall is the proportion of positive documents actually classified as
positive
Precision
Recall
Naïve Bayesian
93.93%
67.76%
AISEC
82.20%  2.63
81.13%  4.71
19/02/2004
Results – variation of time between user
feedback
90.5%
AISEC
Bayeseian
90.0%
Classification Accuracy
89.5%
89.0%
88.5%
88.0%
87.5%
87.0%
86.5%
86.0%
0 between
13 between
38 between
88 between
176 between
E-mails classified between feedback
300 between
600 between
19/02/2004
Conclusion

AISEC has produced promising results and appears robust


Interesting note: Typical accuracy similar to published results from
other AIS for text classification (both traditional and continuous
learning)
Use a larger training set and optimise (the many) parameters
• Detect when there are the optimum number of cells

AISEC has been useful providing some evidence AIS applied to
this domain would be possible

Research on adaptive systems for retrieval of interesting
information, not necessarily purely accurate information
19/02/2004
Questions & Discussions
An Artificial Immune System for E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of Kent
More information:
http://www.cs.kent.ac.uk/~ads3c
19/02/2004