Cross-Analysis of Botnet Victims: New Insights and Implications
Seungwon Shin, Raymond Lin, Guofei Gu
Presented by Bert Huang
Introduction
Botnet
A network of compromised Internet-connected computers
Also known as a zombie army
Architecture
Command & Control (C&C) via Internet Relay Chat (IRC)
Peer to Peer (P2P)
What can they do?
DDoS
Spam
Facilitate other malware
Introduction
Propagation Method
Auto-self-propagating (Type I)
Network Scanning
Dictionary attack on Admin Share
Non-auto-self-propagating (Type II)
Phishing
Drive-by-download
Pay-per-installation
Introduction
Research Question
Are there any similarities/differences in infection patterns between these two
types of botnets?
E.g. Distribution of victims
Motivation
Further understand the nature of botnets
Develop more accurate/targeted malware monitoring, detection, and
prediction systems
Three Major Botnets
Conficker (Type I)
MegaD (Type II)
Srizbi (Type II)
Data Collection
Conficker
Gain total control of infected computers
Uses domain-fluxing to generate C&C domain names for
victims to contact
Methodology
shadowserver.org
Sinkhole servers
Register the same domain names that Conficker's master servers would use, so that queries from infected computers are redirected to the sinkhole.
Captures communication activities
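The domain-fluxing idea above can be sketched in a few lines. This is a toy generator and not Conficker's actual algorithm: the SHA-256 based scheme, the `.example` suffix, and the name `candidate_domains` are assumptions made purely for illustration. What it shows is that the list depends only on the date, so a defender who knows the algorithm can compute the same names and register them for a sinkhole.

```python
import hashlib
from datetime import date


def candidate_domains(day: date, count: int = 5, suffix: str = ".example") -> list[str]:
    """Derive a deterministic list of pseudo-random domain names from a date.

    Hypothetical scheme for illustration only; real botnets use their own algorithms.
    """
    domains = []
    for index in range(count):
        seed = f"{day.isoformat()}-{index}".encode("ascii")
        digest = hashlib.sha256(seed).hexdigest()
        # Map the first 10 hex digits (0-15) to the letters a-p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:10])
        domains.append(label + suffix)
    return domains


if __name__ == "__main__":
    # Bot and sinkhole operator would both obtain this same list for the same day.
    for name in candidate_domains(date(2009, 4, 1)):
        print(name)
```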
Data Collection
MegaD & Srizbi
Spam bots
Methodology
BOTLAB project
Spam trap servers
Crawling URLs
DNS monitoring
Determine via correlation
Data Collection
Limitations
Dynamic IP addresses make it hard to identify individual hosts
Generalize IP addresses to the subnet/network level
/24 subnet
If 155.245.23.70 is infected, the whole subnet (155.245.23.*)
will be considered as an infected network
Unlikely to collect complete data for some botnets (in this case, MegaD and Srizbi)
Such is life
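The /24 generalization can be expressed with Python's standard `ipaddress` module. A minimal sketch, where the helper names `to_slash24` and `infected_networks` are hypothetical:

```python
import ipaddress


def to_slash24(address: str) -> str:
    """Return the /24 network that contains an IPv4 address."""
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network)


def infected_networks(victims: list[str]) -> set[str]:
    """Collapse a list of victim IP addresses into the set of infected /24 networks."""
    return {to_slash24(ip) for ip in victims}


if __name__ == "__main__":
    # The slide's example address maps to the 155.245.23.* network.
    print(to_slash24("155.245.23.70"))
```

Several victims in the same subnet collapse to one infected network, which is why the network counts on the slides are much smaller than the victim counts.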
Data Collection
Conficker
24,912,492 victims
1,339,699 infected networks
MegaD
83,316 victims
71,896 infected networks
Srizbi
106,446 victims
77,934 infected networks
Data Collection
Type I: 1,339,699 infected networks
Type II: 137,902 infected networks
Common (Type I & Type II): 97,290 infected networks
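The grouping above amounts to set arithmetic over infected /24 networks. A small sketch with made-up networks (the addresses are placeholders, not data from the study). Reading the "EX" labels that appear later in the slides as "infected exclusively by one type" is an assumption.

```python
def group_networks(conficker: set[str], megad: set[str], srizbi: set[str]) -> dict[str, set[str]]:
    """Split infected networks into Type I, Type II, common, and exclusive groups."""
    type_1 = set(conficker)
    type_2 = megad | srizbi          # a network hit by either spam botnet counts once
    return {
        "type_1": type_1,
        "type_2": type_2,
        "common": type_1 & type_2,   # infected by both types
        "type_1_ex": type_1 - type_2,  # assumed meaning of "Type I EX"
        "type_2_ex": type_2 - type_1,  # assumed meaning of "Type II EX"
    }


if __name__ == "__main__":
    groups = group_networks(
        conficker={"10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"},
        megad={"10.0.2.0/24", "10.0.4.0/24"},
        srizbi={"10.0.3.0/24", "10.0.4.0/24", "10.0.5.0/24"},
    )
    for name, networks in groups.items():
        print(name, len(networks))
```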
Cross Analysis – Initial Analysis
Hypothesis I – Since the two types of botnets have very
different infection vectors, they may exhibit different infection
patterns (e.g. distribution of their infected networks)
The distributions of Type I and Type II botnets are VERY SIMILAR
The IP address ranges (77.* - 96.*), (109.* - 125.*) and (186.* - 222.*) are highly infected by BOTH types of botnets
These IP address ranges are contiguous, which may imply that vulnerable networks are close to each other
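One simple way to look at such a distribution is to count infected networks per leading octet. The bucket size (one bucket per first octet) is an assumption chosen to match how the ranges are written on the slide, and the function name is hypothetical.

```python
from collections import Counter


def first_octet_histogram(networks):
    """Count infected /24 networks per leading octet (0-255)."""
    return Counter(int(net.split(".", 1)[0]) for net in networks)


if __name__ == "__main__":
    sample = ["77.1.2.0/24", "77.9.9.0/24", "186.0.0.0/24"]  # placeholder data
    for octet, count in sorted(first_octet_histogram(sample).items()):
        print(f"{octet}.*: {count}")
```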
Cross Analysis – Initial Analysis
Only considers the distribution over the IP address space, with no semantic meaning
What are the reasons for these highly infected ranges?
Where are these networks located?
Who are using these networks?
What about those non-infected machines?
Need more fine-grained analysis and investigation
Cross Analysis – Initial Analysis
Follow up investigation
Geographical distribution of infected networks
IP address population/density
Remote accessibility of networks
Dynamism of IP addresses
Build a hypothesis for each point based on intuition
Cross Analysis – Geographical Location
Hypothesis II – Type I and Type II infected networks are
mainly distributed over similar countries.
Infected networks spread all over the world with some concentrated areas.
Some countries are prone to be infected by both Type I and Type II
Some countries are more likely to be infected by one type of botnet
E.g. China is highly prone to Type I, but not so much to Type II
Countries that are more prone to Type I… discuss later!
Countries that are more prone to Type II… discuss later too!
Cross Analysis – IP Address Population
IP addresses are not assigned evenly over networks
or locations
Some addresses are registered only for special
purposes
E.g. (224.* - 239.*) is assigned for multicast addresses
IP address population is different for every country
> 37% of IP addresses are assigned to United States
< 0.5% of IP addresses are assigned to Turkey
Cross Analysis – IP Address Population
Hypothesis III – Countries with more IP addresses (high IP
address population countries) might contain more of both
types of infected networks than low IP address population
countries.
The numbers of infected networks for Type I, II, I EX, and II EX are roughly proportional to IP address population
The more IP addresses a country has, the more infected networks it contains
Observed a spike of infections in low IP address population countries.
Cross Analysis – IP Address Population
What’s up with the spike?
Security education/knowledge of people in these countries may not be as prevalent as in others
People may open vulnerable services or click suspicious URLs without serious consideration
Network configuration/protection may not be as up to date as elsewhere, making it easier for malware to exploit vulnerabilities
Cross Analysis – IP Address Population
Since high IP address population countries are likely to have more infected networks…
The infected networks could still be a relatively small percentage of the address population
Purely comparing absolute infection counts could overshadow countries with a high percentage of infections
Need ways of finding countries with…
Low IP address population
High infection percentage
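One way to surface such countries is to divide a country's share of infected networks by its share of the IP address population. The sketch below is an illustrative normalization only, and it is not claimed to be the exact measure computed in the slides. The function name and all numbers are made up.

```python
def infection_ratio(infected_in_country: int, infected_total: int, ip_share: float) -> float:
    """Share of infected networks divided by share of IP addresses.

    Values above 1 mean the country holds more infections than its
    address population alone would suggest (illustrative measure only).
    """
    if infected_total <= 0 or ip_share <= 0:
        raise ValueError("totals and IP share must be positive")
    return (infected_in_country / infected_total) / ip_share


if __name__ == "__main__":
    # Hypothetical country A: 37% of addresses, 20 of 100 infected networks.
    print(round(infection_ratio(20, 100, 0.37), 2))
    # Hypothetical country B: 0.5% of addresses, 2 of 100 infected networks.
    print(round(infection_ratio(2, 100, 0.005), 2))
```

In this made-up example, country B has far fewer infections in absolute terms but a much higher ratio, which is the kind of case an absolute ranking would hide.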
Cross Analysis – IP Address Population
Used data from IP2Location.com Report
2,505,141,392 IP addresses observed
Percentage of IP addresses observed from each country
Did some calculations…
𝑖 ∈ {1, 2} (botnet type)
𝑗 ∈ {1, …, 240} (country)
Cross Analysis – IP Address Population
Top-ranked countries in the previous graphs show low 𝜂 values
Russia, China, US, Korea…
More vulnerable countries were revealed
Peru, Turkey, Colombia, Mexico, Brazil…
Cross Analysis – IP Address Population
What’s up with the countries with high 𝜂 values?
Bad geographical neighbors
Turkey has been suffering from large cyber attacks originating from neighboring countries such as Russia
Peru is also surrounded by several highly infected countries such as Brazil and Colombia
Cross Analysis – Remote Accessibility
Hypothesis IV – Networks that are more open (more directly
accessible from remote hosts) might have more infected
networks of Type I botnets than that of Type II botnets.
Check network accessibility via Ping
5 ICMP echo request packets
Regard a network as reachable if the ping succeeds
Only shows the lower bound of reachable networks
Perimeter defending systems (Firewall, IDS, etc.)
Host may be offline
Assume each /24 subnet has the same network access control policy
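A reachability probe along these lines could be sketched with the system `ping` utility. This is an assumption about tooling, since the slides do not say how the probes were sent. The helper names are hypothetical, and the meaning of the exit code differs between platforms, so the result should be read as a rough indicator.

```python
import platform
import subprocess


def ping_command(host: str, count: int = 5) -> list[str]:
    """Build a ping command that sends `count` ICMP echo requests."""
    # Windows uses -n for the request count; Unix-like systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str, count: int = 5, run=subprocess.run) -> bool:
    """Return True if the ping utility exits with status 0.

    `run` is injectable so the function can be exercised without network access.
    """
    result = run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print(ping_command("192.0.2.1"))  # documentation-range address, not a real target
```

A failed probe does not prove a network is closed: a firewall may drop ICMP or the host may be offline, which is why the slide treats the result as a lower bound.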
Cross Analysis – Remote Accessibility
Able to access 54.32% of Type I infected networks
Lower bound limitation
Networks are aware of malware scanning attacks, so they’re more
defensive
Able to access 46.85% of Type II infected networks
Ratio could be higher
Type II botnets do not depend on remote accessibility anyway…
Able to access >60% of common infected networks
Remotely accessible networks are much more vulnerable to malware attacks
Cross Analysis – Dynamism of IP Address
Hypothesis V – Places (or networks) with more dynamic IP
addresses are more prone to be infected by both types of
botnets.
Analyze the percentage of infected networks that use dynamic IP addresses
Look for keywords in the reverse DNS PTR (pointer) record
E.g. dynamic-host.abcd.com
Only shows the lower bound of dynamic IP addresses
Limitation of reverse DNS lookup & selected keywords
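The keyword check can be sketched as follows. The keyword list is an assumption, since the slides do not list the keywords actually used, and `looks_dynamic` is a hypothetical helper name.

```python
import re

# Assumed keywords; the slides only give "dynamic" as an example.
DYNAMIC_KEYWORDS = {"dynamic", "dyn", "dhcp", "pool", "dsl", "dialup", "ppp"}


def looks_dynamic(ptr_name: str) -> bool:
    """Guess whether a reverse DNS name belongs to a dynamically assigned address."""
    # Split on anything that is not a letter, so "dhcp12-34" yields the token "dhcp".
    tokens = re.split(r"[^a-z]+", ptr_name.lower())
    return any(token in DYNAMIC_KEYWORDS for token in tokens)


if __name__ == "__main__":
    for name in ("dynamic-host.abcd.com", "mail.abcd.com"):
        print(name, looks_dynamic(name))
```

The PTR name itself could be obtained with `socket.gethostbyaddr`. Addresses with no PTR record, or with names that contain none of the keywords, are counted as static, which is why the measured share of dynamic addresses is only a lower bound.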
Cross Analysis – Dynamism of IP Address
Type         Dynamic IP   Static IP
Common       62%          38%
Type I       50.1%        49.9%
Type II      58.4%        41.6%
Type I EX    49.08%       50.92%
Type II EX   51.87%       48.13%
Dynamic IP addresses are more vulnerable
Type I has no network preference
Scans the address space regardless of whether the victim uses a dynamic or static IP
Type II prefers dynamic addresses
Most likely home users with less security awareness
Victim Prediction
K-Nearest Neighbour Classification (k-NN)
Popular machine learning algorithm
Type I botnet
Done in previous work
>90% accuracy
Strong neighbourhood correlation due to network
scanning
Attempt to do k-NN for Type II botnet
Victim Prediction
K-NN for Type II botnet
Select the same features for the k-NN classifier
/24 subnet address
Physical location of infected networks
Data preparation
Infected network as malicious network
Clean network as benign network
Determined by looking up several DNS blacklists
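For readers unfamiliar with k-NN, a minimal pure-Python sketch follows. The two-dimensional feature vectors stand in for a scaled subnet address and a location coordinate. How the features were actually encoded is not stated on the slides, so that encoding and the toy data are assumptions.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `training` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: (scaled subnet position, scaled location) -> label. Entirely made up.
TRAINING = [
    ((0.10, 0.20), "malicious"),
    ((0.12, 0.22), "malicious"),
    ((0.11, 0.25), "malicious"),
    ((0.80, 0.90), "benign"),
    ((0.82, 0.85), "benign"),
    ((0.79, 0.88), "benign"),
]

if __name__ == "__main__":
    print(knn_predict(TRAINING, (0.13, 0.21), k=3))
    print(knn_predict(TRAINING, (0.81, 0.90), k=1))
```

With an odd k and two classes the vote cannot tie. In practice the features would need to be put on comparable scales before distances are meaningful.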
Victim Prediction
Botnet   K   Accuracy   False positive
MegaD    1   88.35%     7.35%
MegaD    3   88.25%     7.36%
MegaD    5   88.14%     7.54%
Srizbi   1   88.20%     6.23%
Srizbi   3   87.70%     6.04%
Srizbi   5   88.30%     5.77%
Around 88% accuracy
Strong neighbourhood correlation
Type II infection depends heavily on who receives spam emails
Spammers specifically harvest email addresses
Similar locations
E.g. Same company and same university
Cross-Botnet Prediction
Confirm similarity between botnets
Calculated the Manhattan distance between the distributions of the two types of botnets
The Manhattan distance between two items is the sum of the absolute differences of their values across all features.
It is frequently used to indicate whether two data distributions are similar or not.
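A short sketch of the computation, treating each distribution as a mapping from a bucket (for example an address range) to a probability. The bucket labels and values below are placeholders.

```python
def manhattan_distance(p: dict, q: dict) -> float:
    """Sum of absolute differences between two distributions.

    Buckets missing from one distribution are treated as probability 0.
    """
    keys = set(p) | set(q)
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


if __name__ == "__main__":
    a = {"77.*": 0.5, "109.*": 0.5}
    b = {"77.*": 0.5, "186.*": 0.5}
    print(manhattan_distance(a, b))  # half the probability mass overlaps
```

For two probability distributions the value ranges from 0 (identical) to 2 (no overlap at all), so smaller distances indicate more similar distributions.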
Cross-Botnet Prediction
Probability distribution of infected networks for Conficker, MegaD, and Srizbi
Botnet pair          Manhattan distance
Conficker / MegaD    1.1427
Conficker / Srizbi   1.1604
MegaD / Srizbi       0.8404
Cross-Botnet Prediction
Use currently available information to predict future unknown botnets
Use k-NN classification again, but train on one botnet and test on the other
Training   Testing   K   Accuracy   False positive
MegaD      Srizbi    1   87.80%     7.41%
MegaD      Srizbi    3   86.75%     7.49%
MegaD      Srizbi    5   86.45%     7.69%
Srizbi     MegaD     1   84.09%     6.53%
Srizbi     MegaD     3   83.89%     6.31%
Srizbi     MegaD     5   83.65%     5.09%
More than 83% accuracy
Even if we have no knowledge of a botnet, if it shares a similar infection vector with a known botnet, we may be able to predict unknown infected networks
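The two reported metrics can be computed from a classifier's predictions as sketched below. The slides do not define "false positive" precisely, so taking it as the fraction of benign networks wrongly flagged as malicious is an assumption, as are the function name and the toy labels.

```python
def accuracy_and_false_positive_rate(actual, predicted, positive="malicious"):
    """Return (accuracy, false positive rate) for paired label sequences.

    False positive rate is taken as FP / (FP + TN), an assumed definition.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    correct = sum(1 for truth, guess in pairs if truth == guess)
    negatives = [guess for truth, guess in pairs if truth != positive]
    false_positives = sum(1 for guess in negatives if guess == positive)
    fpr = false_positives / len(negatives) if negatives else 0.0
    return correct / len(pairs), fpr


if __name__ == "__main__":
    # Train on one botnet, then score predictions made on the other (toy labels).
    actual = ["malicious", "malicious", "benign", "benign", "benign"]
    predicted = ["malicious", "benign", "benign", "malicious", "benign"]
    print(accuracy_and_false_positive_rate(actual, predicted))
```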
Conclusion
Type I and Type II botnets have similar infection distributions
Some countries are more vulnerable than others
Can be largely influenced by:
Geographical location
Security education/knowledge
Malware owners' intentions (specific or general targeting)
Remote accessibility
Dynamism of IP address
Prediction and classification of future malware is possible if it shares some common traits with existing botnets
Criticism
Limitations in the dataset can cause bias in these evaluations
Conclusions drawn from each hypothesis are speculative
Assumes every /24 subnet has the same network access control policy
Many evaluations rely on previous work, which may not be applicable or directly comparable due to differences in methodology
The End
Any Questions?