Dark Web Forum Collection, Presentation, and Analysis


Dark Web
Collection, Search, and Analysis
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab
University of Arizona
[email protected]
http://ai.arizona.edu
Acknowledgements: NSF CRI; NSF EXP-LA; DTRA,
DOD CTFP, NPS; (ARFL WMD, CIA, FBI)
Leaderless Jihad and the Internet
• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• Before 2004, face-to-face interactions, 26-year-old
• After 2004, interactions on the Internet: Madrid, Dutch Hofstad, Cairo, Toronto… Irhabi007 and Muntada, 20-year-old
Intelligence and Security Informatics
“Intelligence and Security Informatics (ISI): Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
Data, text, and web mining
• From COPLINK to Dark Web


COPLINK project in the press
The New York Times, November 2, 2002
COPLINK assisted in DC sniper investigation
ABC News, April 15, 2003
Google for Cops: Coplink software helps police search for cyber clues to bust criminals
Newsweek Magazine, March 3, 2003
A computerized way for police to coordinate crime databases
Washington Post, March 6, 2008
National dragnet is a click away!
COPLINK in use in 1,600 police agencies in US!
Dark Web Overview
• Dark Web: Terrorists’ and cyber criminals’ use of the Internet
• Collection: Web sites, forums, blogs, YouTube, Second Life
• Analysis and Visualization: Link and content analysis; Web metrics analysis; Authorship analysis; Sentiment analysis; Multimedia analysis
• Our collection is about 2 TB in size, with close to 500M pages/files/messages from more than 10,000 Dark Web sites.
Dark Web project in the press
Project Seeks to Track Terror Web Posts, 11/11/2007
Researchers say tool could trace online posts to terrorists, 11/11/2007
Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

Dark Web Forum Crawler System
Middle Eastern Web Collection File Types
• Dynamic files (e.g., PHP, ASP, JSP, etc.) are widely used in extremist Web sites, indicating a high level of technical sophistication.
• Multimedia files (videos, images) are also heavily used in extremist Web sites.
Terrorist Collection     # of Files     Volume (Bytes)
Total                    222,687        12,362,050,865
Indexable Files          179,223        4,854,971,043
  HTML Files             44,334         1,137,725,685
  Word Files             278            16,371,586
  PDF Files              3,145          542,061,545
  Dynamic Files          130,972        3,106,537,495
  Text Files             390            45,982,886
  Powerpoint Files       6              6,087,168
  XML Files              98             204,678
Multimedia Files         35,164         5,915,442,276
  Image Files            31,691         525,986,847
  Audio Files            2,554          3,750,390,404
  Video Files            919            1,230,046,468
Archive Files            1,281          483,138,149
Non-Standard Files       7,019          1,108,499,397
Number of Files Distribution (Arabic): Indexable Files 80%, Multimedia Files 16%, Non-Standard Files 4%, Archive Files 0%
Volume Distribution (Arabic): Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%
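The shares shown in the two distribution charts can be checked against the top-level rows of the file-type table. The following is a minimal Python sketch of that arithmetic; the names `CATEGORIES` and `shares` are illustrative and not part of the original system, and the charts are labelled "(Arabic)", so they may have been computed on a subset of the collection.

```python
# Top-level rows of the file-type table: (number of files, volume in bytes).
CATEGORIES = {
    "Indexable Files": (179_223, 4_854_971_043),
    "Multimedia Files": (35_164, 5_915_442_276),
    "Archive Files": (1_281, 483_138_149),
    "Non-Standard Files": (7_019, 1_108_499_397),
}


def shares(column):
    """Return each category's rounded percentage share of one column.

    column 0 is the number of files, column 1 is the volume in bytes.
    """
    total = sum(row[column] for row in CATEGORIES.values())
    return {
        name: round(100 * row[column] / total)
        for name, row in CATEGORIES.items()
    }


file_shares = shares(0)    # share of the file count per category
volume_shares = shares(1)  # share of the byte volume per category
```

Under these assumptions the volume shares round to 39%, 48%, 4%, and 9%, which agrees with the volume chart. The file-count shares round to 80%, 16%, 1%, and 3%, close to but not identical with the charted values.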
CyberGate System: Analysis & Visualization
Measuring Hate and Violence: US vs. Middle Eastern Groups
7. Results: Intensity Relationship
                 U.S.       Middle Eastern
N                4676       3349
beta (slope)     0.079      0.682
t-Stat           21.354     48.265
P-Value          0.000      0.000
R-Square         0.076      0.486
[Figure: U.S. Forum Scores, Violence Scores vs. Hate Scores]
[Figure: Middle Eastern Forum Scores, Violence Scores vs. Hate Scores]
Strong hate and violence correlation, especially for Middle-Eastern groups.
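The slope, t-statistic, and R-square reported for each group are the standard outputs of a simple linear regression of violence scores on hate scores. The sketch below shows how such values can be computed; it assumes ordinary least squares with one predictor, which the slide does not state explicitly, and the function name and toy scores are hypothetical.

```python
import math


def simple_regression(hate, violence):
    """Ordinary least squares of violence on hate (one predictor).

    Returns the slope (beta), its t-statistic, and R-square.
    Assumes at least three points and a fit that is not perfect.
    """
    n = len(hate)
    mean_x = sum(hate) / n
    mean_y = sum(violence) / n
    sxx = sum((x - mean_x) ** 2 for x in hate)
    syy = sum((y - mean_y) ** 2 for y in violence)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hate, violence))

    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + beta * x)) ** 2 for x, y in zip(hate, violence))
    r_square = 1 - sse / syy
    # Standard error of the slope, with n - 2 degrees of freedom.
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return {"beta": beta, "t_stat": beta / std_err, "r_square": r_square}


# Toy scores for illustration only; these are not data from the study.
toy_hate = [0, 10, 20, 30, 40]
toy_violence = [1, 8, 13, 22, 26]
toy_result = simple_regression(toy_hate, toy_violence)
```

A larger slope and R-square, as reported for the Middle Eastern forums, mean that violence scores rise more steeply and more predictably with hate scores.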
Number of Posts By Month: Al-Firdaws vs. Montada
[Figure: Al-Firdaws Posts By Month, # posts, Jan-05 to Jul-07]
Al-Firdaws has consistently had between 2,500 and 3,000 posts per month since the second half of 2006.
[Figure: Montada Posts By Month, # posts, Sep-00 to May-07]
Montada very active in 2002 and 2005.
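Monthly activity charts of this kind come from counting messages per calendar month. A minimal sketch, assuming each collected message carries a posting date; the function name and sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def posts_per_month(post_dates):
    """Count posts per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in post_dates)
    return dict(sorted(counts.items()))


# Hypothetical posting dates, for illustration only.
sample_dates = [date(2006, 7, 1), date(2006, 7, 15), date(2006, 8, 2)]
sample_counts = posts_per_month(sample_dates)
```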

Affect Intensities: Al-Firdaws vs. Montada
[Figure panels: Al-Firdaws - Violence, Al-Firdaws - Anger, Montada - Violence, Montada - Anger]
Al-Firdaws has considerably higher violence and also greater anger intensity.
Arabic Writeprint Feature Set
Feature Set (418)
Lexical (79)
  Char-Based (48): Char-Level, Letter Frequency, Special Char.
  Word-Based (31): Word-Level, Vocab. Richness, Word Length Dist., Elongation
Syntactic (262): Word Roots (50), Function Words (200), Punctuation (12)
Structural (62)
  Word Structure (14): Message Level, Paragraph Level, Contact Information
  Technical Structure (48): Font Color, Font Size, Embedded Images, Hyperlinks
Content Specific (15): Race/Nationality, Violence
Lowest-level counts as extracted, in slide order: (7) (8) (4) (29) (3) (6) (5) (15) (2) (8) (6) (9) (35) (4) (4) (11)
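To make the feature categories concrete, the sketch below computes a handful of lexical and syntactic features of the kinds named above (word-level counts, vocabulary richness, word length distribution, punctuation). It is an illustration only: the slide does not list the individual 418 features, so the punctuation set, the 15 length buckets, and all names here are assumptions.

```python
import re
from collections import Counter

# Illustrative punctuation marks; the actual punctuation features are not listed.
PUNCTUATION = ".,;:!?"


def extract_features(text, max_word_len=15):
    """Return a small dictionary of stylometric features for one message."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    n_words = len(words)

    features = {
        "char_total": len(text),
        "word_total": n_words,
        # Vocabulary richness as the type/token ratio.
        "vocab_richness": len(counts) / n_words if n_words else 0.0,
        # Words used exactly once in the message.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }

    # Word length distribution: share of words of each length,
    # with longer words folded into the last bucket.
    lengths = Counter(min(len(w), max_word_len) for w in words)
    for k in range(1, max_word_len + 1):
        features[f"word_len_{k}"] = lengths[k] / n_words if n_words else 0.0

    # Raw occurrence count of each punctuation mark.
    for mark in PUNCTUATION:
        features[f"punct_{mark}"] = text.count(mark)
    return features


example_features = extract_features("the cat saw the dog!")
```

Each message becomes a fixed-length vector of such values, which is what makes messages from different authors comparable.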
Arabic Feature Extraction Component
1. Incoming Message
2. Elongation Filter (Count +1, Degree +5), producing the Filtered Message
3. Root Clustering Algorithm: Similarity Scores (SC) against the Root Dictionary; max(SC) +1
4. Generic Feature Extractor: All Remaining Features Values
Output: Feature Set
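The elongation filter and root clustering steps can be sketched as follows. This is a hedged illustration rather than the system's implementation: it assumes elongation is written with the Arabic tatweel character (U+0640), and it uses Python's `difflib.SequenceMatcher` as a stand-in similarity score because the slide does not say how the scores (SC) are computed. The function names and the two-root dictionary are hypothetical.

```python
from difflib import SequenceMatcher

TATWEEL = "\u0640"  # Arabic elongation (kashida) character


def elongation_filter(message):
    """Strip elongation and return (filtered_message, count, degree).

    count is the number of elongated words, degree the total number of
    elongation characters removed, so a word stretched with five tatweels
    adds 1 to count and 5 to degree.
    """
    count = 0
    degree = 0
    filtered_words = []
    for word in message.split():
        n = word.count(TATWEEL)
        if n:
            count += 1
            degree += n
        filtered_words.append(word.replace(TATWEEL, ""))
    return " ".join(filtered_words), count, degree


def cluster_roots(words, root_dictionary):
    """Assign each word to its most similar root and count the assignments."""
    root_counts = dict.fromkeys(root_dictionary, 0)
    for word in words:
        # Similarity score of the word against every root in the dictionary.
        scores = {
            root: SequenceMatcher(None, word, root).ratio()
            for root in root_dictionary
        }
        best_root = max(scores, key=scores.get)  # root with max(SC)
        root_counts[best_root] += 1              # that root's count +1
    return root_counts


# A word stretched with five elongation characters.
stretched = "\u062c" + TATWEEL * 5 + "\u0647\u0627\u062f"
filtered, count, degree = elongation_filter(stretched)

# Hypothetical two-root dictionary (k-t-b and d-r-s) and one word to classify.
roots = ["\u0643\u062a\u0628", "\u062f\u0631\u0633"]
root_usage = cluster_roots(["\u0643\u062a\u0627\u0628"], roots)
```

Removing elongation first keeps stretched spellings from being treated as different words in the later steps, while the count and degree are retained as features in their own right.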
Sliding Window + PCA: Turning Text into Dots
Message Text
1. Compute eigenvectors for 2 principal components of feature group
2. Extract feature usage vectors
3. Transform into 2-dimensional space
Repeat steps 2 and 3
Eigenvectors:
x: 0.533, -0.541, 0.034, 0.653, 0.975, 0.143
y: 0.956, 0.445, 0.089, 0.456, -0.085, -0.381
Feature Usage Vector Z: 1,0,0,2,1,2; 0,1,3,0,1,0
x = Σ Zx
y = Σ Zy
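Steps 2 and 3 can be illustrated with the numbers on the slide. The sketch below assumes that each coordinate is the dot product of the feature usage vector Z with the corresponding eigenvector, and it takes the two eigenvectors as given rather than computing them (step 1 is omitted). The function names, window size, and step are hypothetical.

```python
# Eigenvectors of the two principal components, copied from the slide.
EIGENVECTOR_X = [0.533, -0.541, 0.034, 0.653, 0.975, 0.143]
EIGENVECTOR_Y = [0.956, 0.445, 0.089, 0.456, -0.085, -0.381]


def project(usage_vector):
    """Map a feature usage vector Z to a point (x, y) in 2-D space."""
    x = sum(z * e for z, e in zip(usage_vector, EIGENVECTOR_X))
    y = sum(z * e for z, e in zip(usage_vector, EIGENVECTOR_Y))
    return x, y


def sliding_windows(tokens, size, step):
    """Yield consecutive, overlapping windows over the token list."""
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        yield tokens[start:start + size]


def usage_vector(window, features):
    """Count how often each feature occurs inside one window."""
    return [window.count(f) for f in features]


def writeprint_points(tokens, features, size, step):
    """Repeat steps 2 and 3 for every window, giving one dot per window."""
    return [
        project(usage_vector(window, features))
        for window in sliding_windows(tokens, size, step)
    ]


point_a = project([1, 0, 0, 2, 1, 2])
point_b = project([0, 1, 3, 0, 1, 0])
```

With the slide's numbers, the usage vector 1,0,0,2,1,2 maps to roughly (3.100, 1.021) and 0,1,3,0,1,0 to roughly (0.536, 0.627). Plotting one dot per window produces the pattern that serves as the author's writeprint.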
Author Writeprints
Anonymous Messages
Author A: 10 messages
Author B: 10 messages
ClearGuidance.com (Toronto Plot): Participant Network Visualization
ClearGuidance Forum “Experts”
The series of overlapping circular patterns for bag-of-word features indicates that the author’s discussion revolves around a related set of topics.
Bag-of-words features are predominantly related to religious topics, e.g., Adam, angels, etc.
Many large red blots are indicative of the presence of features unique to this author, e.g., Adam, angels, etc.
This author was later arrested as a major culprit in the Toronto terror plot (“Soldier of God”). He uses many violent affect terms.
Radar chart showing violent affect feature usages.
Selected feature (i.e., “jihad”) is shown in red.
Selected feature is use of the term “jihad”, which is the highest in the forum.
This author constantly attempts to justify acts of violence and terrorism.
“…there are so many paid sheikhs stuck in this life….no point going to them for fatwas…personally speaking…cuz they don’t even agree with jihad in the first place”
Dark Web Forum Tools
• Information contained within Dark Web forums represents a significant source of knowledge for security and intelligence organizations.
• We have developed tools supporting the large-scale collection, search, and analysis of Dark Web forums, specifically addressing the needs of security analysts.
Collection: AZ Forum Spider
Search: AZ Forum Portal, AZ Sentiment Analyzer
Analysis: AZ CyberGate Text Analyzer
AZ Forum Spider
Collection – AZ Forum Spider
• Automated collection of forum communications; weekly update
• Proxy servers and parameters
• Site map, URL ordering, and forum extraction
• Incremental spider
• Collection visualization
[Screenshots: Forum List, Spidering Status, Collection Statistics, Spidering Profile]
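The slide names incremental spidering and URL ordering but does not describe how the AZ Forum Spider implements them. One plausible way to plan a weekly update is sketched below: compare the message count each thread shows on the forum index with the count already stored, and revisit only threads that have grown. The function name, inputs, and ordering rule are all assumptions made for illustration.

```python
def plan_incremental_crawl(thread_index, collected):
    """Return the thread URLs worth fetching on this update.

    thread_index: {thread_url: message count shown on the forum index}
    collected:    {thread_url: message count already stored locally}
    Threads with no new messages are skipped; the rest are ordered so
    that threads with the most new messages come first.
    """
    to_fetch = []
    for url, remote_count in thread_index.items():
        new_messages = remote_count - collected.get(url, 0)
        if new_messages > 0:
            to_fetch.append((new_messages, url))
    # Most new messages first; ties broken by URL for a stable order.
    to_fetch.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in to_fetch]


# Hypothetical thread identifiers and counts.
index_now = {"thread-1": 10, "thread-2": 5, "thread-3": 7}
already_collected = {"thread-1": 10, "thread-2": 2}
crawl_plan = plan_incremental_crawl(index_now, already_collected)
```

Skipping unchanged threads is what keeps a recurring update cheap compared with re-crawling a whole forum.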
AZ Forum Portal
Dark Web Forum Portal
• Current version: 13M messages (340K members) across 29 major Jihadi forums in English, Arabic, French, German and Russian
• Forum analysis
  • By forum, thread, member, time period, or topic
  • Social network analysis and visualization
  • Google Translation
Forum Portal Data Set
Name              Language   Time Span                  Number of Members   Number of Threads   Number of Messages
Al-Boraq          Arabic     01/08/2006 - 01/02/2010    3,503               52,322              223,648
Al-Fallujah       Arabic     09/19/2006 - 01/02/2010    5,853               74,899              547,712
Al-Firdaws*       Arabic     01/02/2005 - 12/06/2007    2,187               9,359               39,715
Midad al-Suyuf    Arabic     03/18/2006 - 01/02/2010    1,597               11,232              38,382
Alokab            Arabic     04/08/2005 - 12/31/2009    1,547               8,096               55,947
Al-Qimmah         Arabic     11/23/2007 - 01/02/2010    287                 12,097              23,709
Alsayra           Arabic     04/05/2001 - 12/31/2009    66,705              147,598             1,227,207
Ansar             Arabic     11/07/2008 - 01/02/2010    1,224               12,041              46,928
At-tahadi         Arabic     04/14/2008 - 01/02/2010    313                 2,599               5,406
Hanin Net         Arabic     11/27/2006 - 01/12/2010    2,837               96,239              821,478
Hawaa World       Arabic     01/01/2001 - 01/02/2010    113,579             40,501              2,251,553
Hadramout         Arabic     11/25/2000 - 12/29/2009    29,491              151,694             1,552,227
Ma’arik           Arabic     07/29/2007 - 01/03/2010    1,880               15,288              57,047
Al-Mujahidin      Arabic     11/09/2007 - 01/02/2010    4,259               29,980              140,930
Montada           Arabic     09/25/2000 - 12/29/2009    40,291              120,181             1,412,028
Data Set (Cont’d)
Name                    Language   Time Span                  Number of Members   Number of Threads   Number of Messages
Ana al-Muslim           Arabic     10/08/1985 - 11/26/2009    12,215              179,791             1,343,370
Shumukh                 Arabic     03/21/2007 - 01/02/2010    3,938               46,666              289,201
Ansar                   English    12/08/2008 - 01/02/2010    377                 11,133              29,056
Gawaher                 English    10/24/2004 - 01/01/2010    6,790               210,656             569,709
Islamic Awakening       English    04/28/2004 - 12/31/2009    2,361               25,112              116,009
Islamic Network*        English    06/09/2004 - 05/07/2008    1,573               11,974              87,314
Islamic WebCommunity    English    11/14/2000 - 12/31/2009    745                 6,262               24,850
Turn To Islam           English    06/02/2006 - 01/01/2010    9,926               38,702              308,970
Ummah                   English    04/01/2002 - 12/31/2009    14,349              71,218              1,192,583
Al Minha Dj             French     06/01/2008 - 01/04/2010    313                 2,007               6,421
Forums d’aslama         French     10/06/2004 - 01/03/2010    2,665               20,468              131,559
Al-Mourabitoune         French     05/05/2002 - 03/27/2009    3,198               7,905               72,140
Ansar                   German     02/27/2009 - 01/02/2010    62                  726                 1,645
KavkazChat              Russian    03/21/2003 - 01/03/2010    5,634               6,144               558,042
Total                                                         339,699             1,422,890           13,174,786
Forum Statistics Summary (Cont’d)
Cross Forum Search
Single Forum Search & Translation
Search: bomb, iraq
Translations of thread titles
SNA Replay Network
1. Bint ul Islam (290 postings)
2. Iloveislam (239 postings)
3. Abuhannah (173 postings)
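A ranking of members by postings, together with the links drawn between them, can be derived from the collected messages. The sketch below assumes the network connects members who reply to one another and that each message records its author and, where known, the author being replied to; the slide does not specify either point, and the names and sample data are hypothetical.

```python
from collections import Counter


def build_reply_network(posts):
    """Build posting counts and weighted reply edges.

    posts: iterable of (author, replied_to), where replied_to is None
    for a message that starts a thread.
    """
    postings = Counter()
    edges = Counter()
    for author, replied_to in posts:
        postings[author] += 1
        # Ignore thread starters and self-replies when drawing edges.
        if replied_to is not None and replied_to != author:
            edges[(author, replied_to)] += 1
    return postings, edges


def top_members(postings, k=3):
    """Return the k most active members as (member, postings) pairs."""
    return postings.most_common(k)


# Hypothetical messages between three members.
sample_posts = [
    ("member_a", None),
    ("member_b", "member_a"),
    ("member_a", "member_b"),
    ("member_c", "member_a"),
    ("member_a", "member_c"),
]
sample_postings, sample_edges = build_reply_network(sample_posts)
most_active = top_members(sample_postings, k=1)
```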
AZ Sentiment Analyzer
Search – AZ Sentiment Analyzer
• Portal for the sentiment and affect analysis of forums, measuring member opinions and emotions
• Characterizes the affects conveyed in forum text, and the underlying sentiment polarity
• By forum, thread, member, or time period
• Keyword search
AZ CyberGate Text Analyzer
Analysis – AZ CyberGate Text Analyzer
• Comprehensive system for the analysis and visualization of forum communications
  • Shows all text features
  • Utilizes Writeprint and Ink Blot techniques in text analysis
  • Incorporates rich visualization based upon multi-dimensional scaling and parallel coordinates
Conclusion
• The web offers extremists a rich medium for recruiting, communication, and radicalization.
• Information contained within Dark Web sites, forums, blogs, multimedia, etc. represents a significant source of knowledge for security and intelligence organizations.
• A computational approach to Dark Web research spans collection, search, and analysis.
• Dark Web research could potentially assist in terrorism research and intelligence analysis.
• Dark Web Forum Portal available now!!!
Dark Web
Collection, Search, and Analysis
For more information:
Dr. Hsinchun Chen, University of Arizona
[email protected]
http://ai.arizona.edu