BotFinder: Finding Bots in Network Traffic Without Deep Packet Inspection


Oct 16, 2013



F. Tegeler, X. Fu (U Goe), G. Vigna, C. Kruegel (UCSB)

Motivation


Sophisticated type of malware: Bots

Multiple bots under single control form a botnet

Distinct characteristic: command and control (C&C) channel

Threats raised by bots:

Spam

Information theft (e.g., credit card data)

Identity theft

Click fraud

Distributed denial of service (DDoS) attacks

CoNEXT 2012

Figure: a C&C server controlling victim hosts. Estimated revenue for a single botnet: $2M to $600M.

Challenge


How to detect bot infections?

Classically: End host

Anti-virus scanner

But: Requires installation on every machine

Complementary approach: Network-based

Vertical correlation (single end host) (Rishi, BotHunter, Wurzinger et al., …)

Typical behavior (spam, DDoS traffic)

Anomaly detection (Giroire et al.)

Packet analysis: HTTP structure, payloads, typical signatures

Horizontal correlation (multiple end hosts) (BotSniffer, BotMiner, TAMD, …):

Two or more hosts show the same malicious behavior

Challenge and Solution Approach


Existing vertical: Typically relies on scanning, spam, DDoS traffic and requires packet inspection.

Existing horizontal: Requires multiple hosts in a single domain to be infected. Also triggered by noisy activity (e.g., BotMiner)

Contribution: Vertical detection of single bot infections without packet inspection!

Botmaster establishes C&C connections frequently to disseminate orders. C&C connections show patterns.

Use these statistical properties of C&C communication! Core assumption: Periodic behavior!

Methodology

Basic machine learning approach:

Learn about bot behavior: Training phase (a)

Use learned behavior: Detection phase (b)

Training:

Observe malware in a controlled environment

Extract flows and build traces

Perform statistical analysis to obtain “features”

Create models to describe malware

Methodology - Detection Phase

Detection:

Obtain traffic

Perform analysis analogous to training

Compare statistical features of the traffic with models

During the whole process:

No deep packet inspection!

Methodology - Details

Analysis performed on flows

A flow is a connection from A to B:

Source IP address

Destination IP address

Source port

Destination port

Transport protocol ID

Start time

Duration of connection

Number of bytes

Number of packets

This information is easy to obtain in real-world environments! Example: NetFlow
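The nine flow fields listed above map naturally onto a small record type. The following is a minimal sketch, not BotFinder's actual data structure: the field names, the units, and the comma-separated input layout read by `parse_flow` are assumptions made for illustration and do not correspond to any real NetFlow export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow record with the nine fields listed on the slide."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int      # transport protocol ID, e.g. 6 = TCP, 17 = UDP
    start_time: float  # seconds (assumed unit)
    duration: float    # seconds (assumed unit)
    n_bytes: int
    n_packets: int


def parse_flow(line: str) -> Flow:
    """Parse one comma-separated record (hypothetical layout, fields in the order above)."""
    f = line.strip().split(",")
    return Flow(f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
                float(f[5]), float(f[6]), int(f[7]), int(f[8]))


# Example record using documentation-range IP addresses.
flow = parse_flow("10.0.0.5,203.0.113.7,49152,80,6,1000.0,1.5,420,6")
```

None of these fields requires looking at packet payloads, which is the point of working on flows.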

Methodology - Details cont’d

Trace: Chronologically ordered sequence of flows.

Represents long-term communication behavior!

Figure: example trace in two dimensions, time and duration.
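A trace as described above can be sketched by grouping flows and sorting each group by start time. The slides do not say which flows belong to the same trace; grouping by source IP, destination IP and destination port is an assumption here, as are the dictionary keys.

```python
from collections import defaultdict


def build_traces(flows):
    """Group flows per (src_ip, dst_ip, dst_port) and order each group by start time.

    The grouping key is an assumption for illustration.
    """
    traces = defaultdict(list)
    for flow in flows:
        key = (flow["src_ip"], flow["dst_ip"], flow["dst_port"])
        traces[key].append(flow)
    for trace in traces.values():
        trace.sort(key=lambda f: f["start_time"])
    return dict(traces)


flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 80, "start_time": 1200.0, "duration": 1.4},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.9", "dst_port": 443, "start_time": 30.0, "duration": 9.0},
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 80, "start_time": 0.0, "duration": 1.5},
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 80, "start_time": 600.0, "duration": 1.6},
]
traces = build_traces(flows)

# The two dimensions from the figure: (start time, duration) per flow.
c2_trace = traces[("10.0.0.5", "203.0.113.7", 80)]
points = [(f["start_time"], f["duration"]) for f in c2_trace]
```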

Distinguishing Characteristics

Bot traffic is more regular than normal, benign traffic!

Figure note: the lower the bar, the more periodic the traffic.

Methodology - Features

Use statistical features to describe a trace!

Average time between two flows.

Average duration of flows.

Average number of source bytes.

Average number of destination bytes.

A Fourier transform to detect underlying communication frequencies. More robust than simple averaging.
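The features above can be sketched as follows. This is an illustration under stated assumptions, not BotFinder's implementation: the dictionary keys (`src_bytes`, `dst_bytes`, ...) are hypothetical, and a direct discrete Fourier transform over binned start times stands in for an FFT to keep the example dependency-free. The example trace contains a long gap, which skews the average interval while the frequency peak still recovers the 60-second period.

```python
import cmath
from statistics import mean


def average_features(trace):
    """The four averages from the slide, for a time-ordered list of flow dicts."""
    starts = [f["start_time"] for f in trace]
    return {
        "avg_interval": mean(b - a for a, b in zip(starts, starts[1:])),
        "avg_duration": mean(f["duration"] for f in trace),
        "avg_src_bytes": mean(f["src_bytes"] for f in trace),
        "avg_dst_bytes": mean(f["dst_bytes"] for f in trace),
    }


def dominant_period(start_times, step, n_bins):
    """Period (in seconds) of the strongest non-zero frequency of the binned start times."""
    # Binary signal: a bin is "on" when at least one flow starts in it.
    bins = sorted({int(t // step) for t in start_times if 0 <= t < step * n_bins})

    def magnitude(k):
        return abs(sum(cmath.exp(-2j * cmath.pi * k * b / n_bins) for b in bins))

    # Highest magnitude wins; among (rounded) ties the lowest frequency is preferred,
    # so the fundamental is reported rather than one of its harmonics.
    best_k = max(range(1, n_bins // 2 + 1), key=lambda k: (round(magnitude(k), 6), -k))
    return n_bins * step / best_k


# Flows every 60 s, with a 360 s gap in the middle.
times = [0.0, 60.0, 120.0, 180.0, 240.0, 600.0, 660.0, 720.0, 780.0, 840.0]
trace = [{"start_time": t, "duration": 1.5, "src_bytes": 400, "dst_bytes": 1200} for t in times]

features = average_features(trace)              # avg_interval is about 93.3 s
period = dominant_period(times, step=10.0, n_bins=90)  # 60.0 s
```

The window length (`n_bins`) is passed explicitly so that the example period divides it evenly; with real traffic the peak would be spread over neighboring frequencies and some smoothing would likely be needed.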

Methodology - Models

Example scenario:

Multiple binary versions of the same bot family generated traces

Example: time interval feature:

Figure: feature clustering. Observed intervals: 20min, 18min, 8min, 7.5min, 17min, 22min, 24min, 9min, 230min, 190min, 912min. Cluster centroids: 20min, 8.2min, 210min.

“Intervals of 8, 20, or 210 minutes are typical for this bot.”

Clusters with low standard deviation are trustworthy representations of malware behavior

Drop very small (one-element) clusters
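The model-building step can be illustrated with the interval values from the figure. The slides do not name the clustering algorithm, so the sketch below swaps in a deliberately simple one-dimensional heuristic (start a new cluster when a sorted value exceeds its predecessor by more than a fixed ratio); `max_ratio` is a hypothetical parameter. The stray "24min" label is assumed here to belong to the same figure.

```python
from statistics import mean, pstdev


def cluster_1d(values, max_ratio=1.5):
    """Split sorted positive values wherever consecutive values differ by more than max_ratio."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v / clusters[-1][-1] > max_ratio:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters


def build_model(values, max_ratio=1.5):
    """Keep centroid and standard deviation per cluster; drop one-element clusters."""
    model = []
    for members in cluster_1d(values, max_ratio):
        if len(members) < 2:
            continue  # very small clusters are dropped, as on the slide
        model.append({"centroid": mean(members), "std": pstdev(members), "size": len(members)})
    return model


# Interval values (minutes) read off the slide.
intervals_min = [20, 18, 8, 7.5, 17, 22, 24, 9, 230, 190, 912]
model = build_model(intervals_min)
centroids = [round(c["centroid"], 1) for c in model]
```

With these inputs the heuristic yields centroids of roughly 8.2, 20.2 and 210 minutes, close to the centroids shown on the slide, and the single 912-minute value is discarded.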

Methodology - Model Matching

Compare a trace to the cluster centers of a malware family model:

1. If a trace feature “hits” a model: increase the scoring value based on cluster quality

2. Take the model with the highest scoring value

3. If scoring value > threshold: consider the model matched

Some more math is involved (quality of matching trace, clustering algorithm, minimal trace length, etc.)
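The three matching steps above can be sketched as below. The slides leave the actual math out, so everything quantitative here is an assumption: a feature "hits" when it lies within `tolerance` standard deviations of a centroid, cluster quality is taken as `exp(-std / centroid)`, and the score is averaged over the model's features. Feature and family names are hypothetical.

```python
import math


def score_trace(trace_features, model, tolerance=2.0):
    """Sum cluster-quality weights over all features that hit a cluster of the model."""
    score = 0.0
    for name, clusters in model.items():
        value = trace_features.get(name)
        if value is None:
            continue
        for cluster in clusters:
            spread = max(cluster["std"], 1e-9)
            if abs(value - cluster["centroid"]) <= tolerance * spread:
                # Tighter clusters (low relative deviation) contribute more.
                score += math.exp(-cluster["std"] / cluster["centroid"])
                break
    return score / max(len(model), 1)


def match(trace_features, models, threshold=0.6):
    """Return (family, score) for the best model, or (None, score) below the threshold."""
    best_family, best_score = None, 0.0
    for family, model in models.items():
        s = score_trace(trace_features, model)
        if s > best_score:
            best_family, best_score = family, s
    if best_score > threshold:
        return best_family, best_score
    return None, best_score


models = {
    "family_a": {
        "interval_min": [{"centroid": 20.0, "std": 2.0}, {"centroid": 210.0, "std": 20.0}],
        "duration_s": [{"centroid": 1.5, "std": 0.1}],
    },
    "family_b": {
        "interval_min": [{"centroid": 60.0, "std": 1.0}],
        "duration_s": [{"centroid": 30.0, "std": 2.0}],
    },
}

suspicious = match({"interval_min": 19.0, "duration_s": 1.55}, models)
benign = match({"interval_min": 500.0, "duration_s": 12.0}, models)
```

Raising the threshold trades detection rate for fewer false positives, which is the knob varied in the evaluation.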

Evaluation

Method is implemented in BotFinder

Six representative malware families

Dataset LabCapture: 2.5 months of lab traffic with 60 machines

Full traffic capture allows verification

Should contain benign traffic only

Dataset ISPNetFlow: one month of NetFlow data from a large network

Reflects 540 terabytes of data or 150 megabytes (!) per second of traffic.

No ground truth, but the possibility to compare to blacklisted IP addresses and to judge usability.


Evaluation - Cross Validation

Execution:

Split the ground truth malware dataset randomly into a training set and a detection set

Mix the detection set with all traces from the LabCapture dataset

Train BotFinder on the training set

Run BotFinder against the detection set

Repeat the experiment 50 times per acceptance threshold

Result summary:

77% detection rate with low false positives (1 out of 5 million traces)

Figure: the training data is split into a training set (Train) and a detection set, which is mixed with LabCapture (Detect).
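The repeated random-split experiment can be sketched as a small harness. This is an illustration only: the split fraction is not given on the slides and is assumed to be one half, and `train` and `detect` are hypothetical callables standing in for BotFinder's training and detection phases.

```python
import random
from statistics import mean


def cross_validate(malware_traces, benign_traces, train, detect,
                   runs=50, train_fraction=0.5, seed=0):
    """Average detection rate and false positive rate over repeated random splits."""
    rng = random.Random(seed)
    detection_rates, false_positive_rates = [], []
    for _ in range(runs):
        shuffled = list(malware_traces)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        training_set, detection_set = shuffled[:cut], shuffled[cut:]
        model = train(training_set)
        # The detection set is mixed with benign traces; both parts are scored.
        hits = sum(1 for t in detection_set if detect(model, t))
        false_alarms = sum(1 for t in benign_traces if detect(model, t))
        detection_rates.append(hits / len(detection_set))
        false_positive_rates.append(false_alarms / len(benign_traces))
    return mean(detection_rates), mean(false_positive_rates)


# Toy stand-ins: a "trace" is just its average interval in minutes.
def toy_train(traces):
    return mean(traces)


def toy_detect(model, trace, threshold=5.0):
    return abs(trace - model) < threshold


malware = [20, 21, 19, 20, 22, 18, 20, 21]
benign = [100, 300, 5, 60]
detection_rate, false_positive_rate = cross_validate(malware, benign, toy_train, toy_detect)
```

Running the harness once per acceptance threshold gives one point of the detection-rate versus false-positive curve.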


Evaluation - Comparison to BotHunter

BotHunter* is an optimized Snort Intrusion Detection System. It requires packet inspection and leverages anomaly detection.

Many false positives for BotHunter, typically raised by IRC activity or binary downloads.

Detection Results:

BotFinder Detection Rate: 77.5%

BotHunter Detection Rate: 10%

BotFinder outperformed BotHunter and shows relatively high detection rates and low false positives.

*: http://www.bothunter.net

Evaluation - ISPNetFlow

Challenging to analyze as minimal information (only internal IP ranges) is available

542 traces (from >1 billion traces) are identified by BotFinder as malicious

On average 14.6 alerts per day


Evaluation - ISPNetFlow

Speed is sufficient for large networks:

3min for 15M NetFlow records (~15min of ISPNetFlow, 800MB file size)

Processing is dominated by feature extraction

Easy to parallelize

Detailed IP address investigation of raised alarms:

Comparison of external IPs with publicly available blacklists*

Result: 56% of all IPs are known to be malicious!

The “false positives” show a large cluster of connections to Apple

With Apple whitelisted: 61% of all raised alerts connect to known malicious pages

Strong support that BotFinder works!

*=rbls.org
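The blacklist comparison, and why whitelisting a benign service raises the confirmed share, can be sketched with toy data. The addresses and counts below are invented for illustration (documentation-range IPs) and are not the evaluation data; the function name is hypothetical.

```python
def blacklisted_fraction(alert_ips, blacklist, whitelist=frozenset()):
    """Share of alerted external IPs that appear on a blacklist, ignoring whitelisted IPs."""
    considered = [ip for ip in alert_ips if ip not in whitelist]
    if not considered:
        return 0.0
    return sum(1 for ip in considered if ip in blacklist) / len(considered)


# Toy data: 10 alerts, 5 blacklisted, 2 pointing at a known benign service.
alerts = [f"203.0.113.{i}" for i in range(1, 11)]
blacklist = {f"203.0.113.{i}" for i in range(1, 6)}
benign_service = {"203.0.113.9", "203.0.113.10"}

before = blacklisted_fraction(alerts, blacklist)                  # 0.5
after = blacklisted_fraction(alerts, blacklist, benign_service)   # 0.625
```

Whitelisting removes alerts from the denominator only, so the blacklisted share rises without any additional confirmed detections.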

Bot Evolution

Botmasters may try to evade detection by changing communication patterns:

Introduction of randomized intervals

Introduction of large gaps between flows

IP or domain flux (fast changing C&C servers)

Randomization impact:

Randomizing individual features does not significantly impact detection

Figure annotation: Lower limit!

FFT Peak Detection with Gaps


Anti-Domain Flux

Problem: Fast C&C domain/IP changes

Figure: a change of IP address “breaks” the trace into Subtrace 1 (A to C&C IP 1) and Subtrace 2 (A to C&C IP 2).

Problem: BotFinder can’t create a sufficiently long trace

Idea:

Look at each source IP and compare all connections with each other

When two connections look very similar, combine them into one!

Inherently horizontal correlation per source IP!
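The recombination idea can be sketched as a greedy merge per source IP. The slides do not define "very similar"; the sketch assumes two sub-traces are similar when their mean flow intervals differ by at most a relative tolerance (`max_rel_diff`, a hypothetical parameter), and compares only that one feature.

```python
from statistics import mean


def mean_interval(times):
    """Mean gap between consecutive start times (needs at least two flows)."""
    return mean(b - a for a, b in zip(times, times[1:]))


def merge_subtraces(subtraces, max_rel_diff=0.1):
    """Combine sub-traces of the same source IP whose mean intervals are close."""
    merged = []
    for sub in sorted(subtraces, key=lambda s: s["start_times"][0]):
        interval = mean_interval(sub["start_times"])
        for group in merged:
            ref = group["interval"]  # interval of the group's first sub-trace
            same_source = group["src_ip"] == sub["src_ip"]
            if same_source and abs(ref - interval) <= max_rel_diff * max(ref, interval):
                group["dst_ips"].append(sub["dst_ip"])
                group["start_times"] = sorted(group["start_times"] + sub["start_times"])
                break
        else:
            merged.append({"src_ip": sub["src_ip"], "dst_ips": [sub["dst_ip"]],
                           "start_times": list(sub["start_times"]), "interval": interval})
    return merged


subtraces = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.1", "start_times": [0, 600, 1200]},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.2", "start_times": [1800, 2400, 3000]},
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.80", "start_times": [100, 130, 900]},
]
merged = merge_subtraces(subtraces)
```

In this toy input the two C&C sub-traces (10-minute interval, different destination IPs) are recombined into one longer trace, while the unrelated third connection stays separate.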


Additional Pre-Processing

How can one check that it is working?

Split real C&C traces and random other, long traces (from real traffic). Does BotFinder recombine them?

Figure annotation: Large distance! Good!

“Low” overhead: 85% increase in the ISPNetFlow.

Conclusion

High detection rates (nearly 80%) with low false positives and no need for packet inspection!

BotFinder shows better results than BotHunter.

61% of BotFinder-flagged connections in the ISPNetFlow dataset were destined to known, blacklisted hosts!

BotFinder is robust against potential evasion strategies.

Questions

Thank you for your attention!

Any questions?