The majority of the deep learning applications that we see in the community are usually geared towards fields like marketing, sales, finance, etc. We hardly ever read articles or find resources about deep learning being used to protect these products, and the business, from malware and hacker attacks.
While the big technology companies like Google, Facebook, Microsoft, and Salesforce have already embedded deep learning into their products, the cybersecurity industry is still playing catch up. It’s a challenging field but one that needs our full attention.
In this article, we briefly introduce Deep Learning (DL) along with a few existing Information Security (hereafter referred to as InfoSec) applications it enables. We then dive deep into the interesting problem of detecting anonymous TOR traffic and present a DL-based solution to detect it.
The target audience for this article is data science professionals who are already working on machine learning projects. The content assumes that you have foundational knowledge of machine learning and are either a beginner in deep learning or exploring it and its use cases.
The below pre-reads are highly recommended to get the most out of this article:
Deep learning is not a silver bullet that can solve all the InfoSec problems because it needs extensive labeled datasets. Unfortunately, no such labeled datasets are readily available. However, there are several InfoSec use cases where the deep learning networks are making significant improvements to the existing solutions. Malware detection and network intrusion detection are two such areas where deep learning has shown significant improvements over the rule-based and classic machine learning-based solutions.
Network intrusion detection systems are typically rule-based and signature-based controls that are deployed at the perimeter to detect known threats. Adversaries change the malware signatures and easily evade the traditional network intrusion detection systems. Quamar et al. [1], in their IEEE transaction paper, showed deep learning (DL)-based systems using self-taught learning to be promising in detecting unknown network intrusions. Traditional security use cases such as malware detection and spyware detection have been tackled with deep neural net-based systems [2].
DL-based techniques generalize better than traditional ML-based approaches. Jung et al.’s [3] DL-based system can even detect zero-day malware. Daniel Gibert [2], a Ph.D. graduate from the University of Barcelona, has done extensive work related to convolutional neural networks (CNN, a type of DL architecture) and malware detection. In his Ph.D. thesis, he says that CNNs can detect even polymorphic malware.
DL-based neural nets are now being used in User and Entity Behaviour Analytics (UEBA). Traditionally, UEBA employs anomaly detection and machine learning algorithms that distill security events to profile and baseline every user and network element in the enterprise IT environment. Any significant deviation from a baseline is flagged as an anomaly, which raises an alert to be investigated by security analysts. UEBA enhanced the detection of insider threats, albeit to a limited extent.
Now, deep learning-based systems are used to detect many other types of anomalies. Paweł Kobojek from Warsaw university, Poland [4] uses keystroke dynamics to verify the user using an LSTM network. Jason Trost, director of security data engineering at Capital One, has published several blogs [5] that have a list of technical papers and talks on applying deep learning in InfoSec.
The artificial neural network is inspired by the biological neural network. Neurons are the atomic units of a biological neural network. Each neuron consists of dendrites, a nucleus, and an axon. It receives signals through its dendrites and sends signals out through its axon (Figure 1 below). The computations are performed in the nucleus. The entire network is made up of a chain of neurons.
AI researchers borrowed this idea to develop the artificial neural network (ANN). In this setting, each neuron accomplishes three actions:
Each neuron can thus classify whether a set of inputs belongs to one class or another. This power is limited when only a single neuron is used. However, combining a set of neurons yields a powerful mechanism for classification and sequence labelling tasks.
Figure 1: Nature is often the greatest source of inspiration. The figure depicts a biological neuron and an artificial neuron.
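As a minimal sketch of what a single artificial neuron computes in the common textbook formulation, the snippet below weights its inputs, sums them with a bias, and passes the result through an activation function. The sigmoid activation, the function name `neuron_output`, and the example weights are illustrative assumptions and are not taken from the experiments in this article.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weight each input, sum the results with a bias, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Hypothetical values chosen only for illustration.
activation = neuron_output(inputs=[1.0, 0.0], weights=[2.0, -1.0], bias=-1.0)
label = 1 if activation >= 0.5 else 0  # threshold the activation to pick a class
```

Thresholding the activation at 0.5 is what turns the neuron into a simple two-class classifier; a single neuron of this kind can only separate classes with a straight line, which is why many neurons are combined.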
A set of neuron layers can be used to create a neural network. The network architecture differs based on the objective it needs to achieve. A common network architecture is the Feed Forward Neural Network (FFN). In an FFN, neurons are arranged in layers without any cycles. It is called feed forward because information travels only in the forward direction inside the network: first through the input layer, then through the hidden layers, and finally through the output layer (Figure 2 below).
Figure 2: A feed forward network with two hidden layers
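The forward flow of information can be sketched in a few lines of plain Python. This is an illustrative toy, not the network used later in the article: the layer sizes (2-3-2-1), the weights, and the helper names `dense` and `feed_forward` are all assumptions made for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass the signal through each layer in order; there are no cycles."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal

# Hypothetical 2-3-2-1 network with two hidden layers and made-up weights.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], relu),
    ([[0.2, 0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, 0.0], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
output = feed_forward([1.0, 2.0], layers)
```

Each layer only consumes the output of the layer before it, which is the property that gives the architecture its name.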
Like any supervised machine learning model, the FFN needs to be trained using labeled data. The training is in the form of optimizing the parameters by reducing the error between the output value and the true value. One such important parameter to optimize is the weight each neuron gives to each of its input signals. For a single neuron, the weight can be easily computed using the error.
However, when neurons are arranged in multiple layers, it is challenging to optimize the weights in every layer based on the error computed at the output layer. The backpropagation algorithm addresses this issue [6]. Backpropagation is a long-established technique and a special case of automatic differentiation: it applies the chain rule to calculate the gradient of the error with respect to each weight, and that gradient is then used to update the weights in the network.
In an FFN, the output is obtained from the activations of the linked neurons, layer by layer. The error is then calculated by comparing this output with the true outcome. This error is in turn propagated backward, layer by layer, to correct the weights of the internal neurons. The parameters are optimized by repeating this process over the data instances for multiple iterations.
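The forward pass, error calculation, and backward correction described above can be sketched for a very small network. This is a hedged illustration under stated assumptions: one hidden layer of two sigmoid neurons, a squared-error loss, plain stochastic gradient descent, and the logical OR function as toy training data. None of these choices come from the article's experiments, and the function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return hidden activations and the output (bias is the last weight)."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr):
    """One backpropagation update for a single instance (squared-error loss)."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output for loss = 0.5 * (output - target) ** 2.
    delta_out = (output - target) * output * (1.0 - output)
    # Chain rule: push the error back through the output weights.
    delta_hidden = [
        delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    # Gradient-descent updates for the output layer, then the hidden layer.
    hb = hidden + [1.0]
    for j in range(len(w_out)):
        w_out[j] -= lr * delta_out * hb[j]
    xb = x + [1.0]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * xb[i]

def mean_loss(data, w_hidden, w_out):
    return sum(0.5 * (forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

random.seed(0)
# Toy data: logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

loss_before = mean_loss(data, w_hidden, w_out)
for _ in range(2000):  # multiple passes over the data instances
    for x, target in data:
        train_step(x, target, w_hidden, w_out, lr=0.5)
loss_after = mean_loss(data, w_hidden, w_out)
```

Note that the hidden-layer error signal is computed before the output weights are updated, so both layers are corrected using the same forward pass.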
The primary goal of cyber-attacks is to steal enterprise customer data, sales data, intellectual property documents, source code, and software keys. The adversaries exfiltrate the stolen data to remote servers in encrypted traffic that is mixed with the regular traffic.
Most often, adversaries use an anonymous network that makes it difficult for security defenders to trace the traffic. Moreover, the exfiltrated data is typically encrypted, rendering rule-based network intrusion tools and firewalls ineffective. Recently, anonymous networks have also been used for command and control (C&C) by specific variants of ransomware/malware. For instance, Onion Ransomware [7] uses the TOR network to communicate with its C&C server.
Figure 3: An illustration of TOR communication between Alice and a destination server. The communication starts with Alice requesting a path to the server. TOR network gives the path which is AES encrypted. The randomization of the path happens inside the TOR network. The encrypted path of the packet is shown in red. Upon reaching the exit node, which is the periphery node of the TOR network, the plain packet is transferred to the server.
Anonymous network/traffic can be accomplished through various means. They can be broadly classified into:
Among them, TOR is one of the more popular choices. TOR is free software that enables anonymous communication over the internet through a specialized routing protocol known as the onion routing protocol [9]. The protocol redirects internet traffic through various freely hosted relays across the world. Like the layers of an onion, each HTTP packet is wrapped in successive layers of encryption, each applied using the public key of the relay that will receive it.
At each receiver point, the packet can be decrypted using the private key. Upon decryption, the address of the next relay is revealed. This carries on until the exit node of the TOR network is reached, where the decryption of the packet ends and a plain HTTP packet is forwarded to the original destination server. An example routing scheme between Alice and the server is depicted in Figure 3 above.
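The layer-by-layer idea can be illustrated with a deliberately simplified toy. This is not TOR's actual protocol and is not secure: it assumes one shared symmetric key per relay instead of public-key cryptography, and it uses a home-made XOR keystream purely so the example runs with the standard library. The relay names, keys, and helper functions are all hypothetical.

```python
import hashlib

def xor_stream(data, key):
    """Toy symmetric cipher: XOR with a SHA-256-derived keystream. NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def wrap_onion(payload, route):
    """Wrap the payload in one layer per relay, innermost layer first.

    `route` lists (relay_name, relay_key) in travel order; each layer hides
    the name of the next hop together with everything inside it.
    """
    message = payload
    next_hop = "destination"
    for name, key in reversed(route):
        message = xor_stream(next_hop.encode() + b"|" + message, key)
        next_hop = name
    return message

def peel_layer(message, key):
    """A relay removes its own layer and learns only the next hop."""
    plain = xor_stream(message, key)
    next_hop, _, inner = plain.partition(b"|")
    return next_hop.decode(), inner

# Hypothetical three-relay path (entry, middle, exit) with made-up keys.
route = [("entry", b"key-entry"), ("middle", b"key-middle"), ("exit", b"key-exit")]
onion = wrap_onion(b"GET /index.html", route)

hops = []
message = onion
for _, key in route:
    next_hop, message = peel_layer(message, key)
    hops.append(next_hop)
```

After the last relay peels its layer, `message` holds the plain request again, and each relay along the way has only learned where to send the packet next.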
The original intent of launching TOR was to safeguard the privacy of users. However, adversaries have hijacked this well-meaning objective and use the network for various nefarious purposes instead. As of 2016, around 20% of TOR traffic was attributed to illegal activities. In an enterprise network, TOR traffic is curtailed by not allowing the installation of the TOR client or by blocking the IP addresses of the Guard (Entry) nodes.
However, there are numerous means through which adversaries and malware can get access to the TOR network to transfer data and information. IP blocking is not a sound strategy, because adversaries can spawn different IPs to carry out the communication. A bad bot landscape report by Distil Networks [5] shows that 70% of automated attacks in 2015 used multiple IPs, and 20% of automated attacks used over 100 IPs.
TOR traffic can be detected by analyzing the traffic packets. This analysis can be done on the TOR node, or in between the client and the entry node. The analysis is performed on a single flow of packets. Each flow is identified by a tuple of source address, source port, destination address, and destination port.
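Grouping packets into flows by that four-part tuple can be sketched as follows. The packet dictionaries, their field names, and the addresses (taken from documentation-reserved ranges) are made up for illustration; a real flow extractor would read them from a capture file and would typically also pair each flow with its reverse direction.

```python
from collections import defaultdict

def group_into_flows(packets):
    """Group packets by (source address, source port, destination address, destination port)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
        flows[key].append(pkt)
    return dict(flows)

# Made-up packets: timestamp in seconds, endpoints, and size in bytes.
packets = [
    {"ts": 0.00, "src": "10.0.0.5", "sport": 50123, "dst": "198.51.100.7", "dport": 443, "size": 512},
    {"ts": 0.20, "src": "10.0.0.5", "sport": 50123, "dst": "198.51.100.7", "dport": 443, "size": 1400},
    {"ts": 0.25, "src": "10.0.0.9", "sport": 40001, "dst": "203.0.113.20", "dport": 9001, "size": 586},
    {"ts": 0.90, "src": "10.0.0.5", "sport": 50123, "dst": "198.51.100.7", "dport": 443, "size": 60},
]
flows = group_into_flows(packets)
first_flow = flows[("10.0.0.5", 50123, "198.51.100.7", 443)]
```

Once packets are grouped this way, per-flow statistics such as byte counts and packet timings can be computed independently for each key.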
Network flows for different time intervals are extracted and analysis is carried out on them. G. He et al., in their paper “Inferring Application Type Information from Tor Encrypted Traffic”, extracted burst volumes and directions to create an HMM to detect the TOR applications that might be generating that traffic. Most of the popular works in this area leverage time-based features along with other features like size and port information to detect TOR traffic.
We take inspiration from Habibi Lashkari et al.’s paper “Characterization of Tor Traffic using Time based Features” and follow a time-based approach over extracted network flows to detect TOR traffic for this article. However, our architecture also uses a range of other meta-information that can be obtained to classify the traffic. This is possible because of the deep learning architecture that has been chosen to solve this problem.
We obtained the data from Habibi Lashkari et al. [11] at the University of New Brunswick for the data experiments done in this article. Their data consists of features extracted from the analysis of the university internet traffic. Extracted meta information from the data is given in the table below:
Table 1: Meta information parameters obtained from [11]
| Meta-information parameter | Parameter explanation |
| --- | --- |
| FIAT | Forward Inter Arrival Time, the time between two packets sent in the forward direction (mean, min, max, std). |
| BIAT | Backward Inter Arrival Time, the time between two packets sent in the backward direction (mean, min, max, std). |
| FLOWIAT | Flow Inter Arrival Time, the time between two packets sent in either direction (mean, min, max, std). |
| ACTIVE | The amount of time a flow was active before going idle (mean, min, max, std). |
| IDLE | The amount of time a flow was idle before becoming active (mean, min, max, std). |
| FB PSEC | Flow bytes per second. |
| FP PSEC | Flow packets per second. |
| DURATION | The duration of the flow. |
Apart from these parameters, other flow-based parameters are also included. A sample instance from the dataset is shown in Figure 4 below:
Figure 4: An instance of the dataset used for this article.
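To make the inter-arrival-time features in Table 1 concrete, here is a minimal sketch of how the mean, min, max, and std of packet gaps might be derived from timestamps. The timestamps are made up, the function name `inter_arrival_stats` is hypothetical, and the use of the population standard deviation is an assumption; the dataset's authors may have computed these values differently.

```python
import statistics

def inter_arrival_stats(timestamps):
    """Return mean, min, max, and std of the gaps between consecutive packets."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:  # fewer than two packets: no gap to measure
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "std": 0.0}
    return {
        "mean": statistics.fmean(gaps),
        "min": min(gaps),
        "max": max(gaps),
        "std": statistics.pstdev(gaps),  # population std, an assumption
    }

forward_ts = [0.0, 0.5, 1.5, 3.0]  # made-up forward-direction timestamps (s)
backward_ts = [0.25, 1.25]         # made-up backward-direction timestamps (s)

fiat = inter_arrival_stats(forward_ts)
biat = inter_arrival_stats(backward_ts)
flowiat = inter_arrival_stats(forward_ts + backward_ts)
```

The same helper covers FIAT, BIAT, and FLOWIAT; only the set of packets passed in changes.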
Please note that the source IP/port and destination IP/port, along with the protocol field, have been removed from the instance because they cause the model to overfit. We process all other features using a deep feed forward neural network with N hidden layers. The architecture of the neural network is shown in Figure 5 below.
Figure 5: Deep learning network representation used for TOR traffic detection.
The number of hidden layers was varied between 2 and 10. We found N=5 to be optimal. ReLU is used as the activation for all the hidden layers. Each hidden layer is dense (fully connected) and of dimension 100.
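```python
from keras.models import Sequential
from keras.layers import Dense

def build_model(feature_dim, hidden_layers=5, neurons_num=100):
    model = Sequential()
    # First hidden layer is as wide as the input feature vector
    model.add(Dense(feature_dim, input_dim=feature_dim, kernel_initializer='normal', activation='relu'))
    for _ in range(hidden_layers - 1):
        model.add(Dense(neurons_num, kernel_initializer='normal', activation='relu'))
    # Single sigmoid output node: Tor vs. Non-Tor
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```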
Figure 6: A Python Code Snippet of the FFN in Keras.
The output node is activated by a sigmoid function. A sigmoid is used because the task is a binary classification: Tor or Non-Tor.
We used Keras with TensorFlow as the backend to train the DL module. Binary cross-entropy loss was used for optimizing the FFN. The model was trained for different numbers of epochs. Figure 7 below shows the training run, with performance increasing and the loss value decreasing as the number of epochs increases.
Figure 7: TensorBoard-generated statistics depicting the network training process
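For readers who want to see what the binary cross-entropy loss measures, the following plain-Python sketch computes it for a handful of made-up labels and predicted probabilities. The clipping constant `eps` and the example values are assumptions for illustration; Keras computes this loss internally and this is not its implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]  # 1 = Tor, 0 = Non-Tor (made-up examples)
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
unsure = binary_cross_entropy(labels, [0.6, 0.4, 0.5, 0.5])
```

Predictions that are both correct and confident give a lower loss than hesitant ones, which is why the loss curve falls as training progresses.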
The results of the deep learning system were compared with various other estimators. Standard classification metrics of Recall, Precision and F-Score were used to measure the efficacy of the estimators. Our DL-based system was able to detect the TOR class well. However, it is the Non-Tor class that we need to give more importance to. It is seen that a Deep Learning-based system can reduce the false positive cases for Non-Tor category samples. The results are shown in the table below:
Table 2: The output of ML and DL Models for the Tor Traffic Detection experiment
| Classifier used | Precision | Recall | F-Score |
| --- | --- | --- | --- |
| Logistic Regression | 0.87 | 0.87 | 0.87 |
| SVM | 0.9 | 0.9 | 0.9 |
| Naïve Bayes | 0.91 | 0.6 | 0.7 |
| Random Forest | 0.96 | 0.96 | 0.96 |
| Deep Learning | 0.95 | 0.95 | 0.95 |
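As a reminder of how the metrics in Table 2 are defined, the sketch below computes precision, recall, and F-score for one class from true and predicted labels. The label lists are invented for the example and do not reproduce the experiment's predictions; how the per-class scores were averaged in Table 2 is not stated in this article.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-score for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 1 = Tor, 0 = Non-Tor.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

tor_scores = precision_recall_f1(y_true, y_pred, positive=1)
non_tor_scores = precision_recall_f1(y_true, y_pred, positive=0)
```

Scoring each class separately makes it visible when a classifier does well on Tor samples but produces false positives on Non-Tor traffic.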
Among the classifiers, the Random Forest and deep learning-based approaches perform better than the rest. The results shown are based on 55,000 training instances. This dataset is small compared with those typically used to train DL-based systems. As the amount of training data increases, the performance of both the DL-based and Random Forest classifiers would be expected to improve further.
However, for large datasets, a DL-based classifier typically outperforms other classifiers, and it can be generalised to similar types of applications. For example, to train a classifier that detects the application used over TOR, only the output layer needs retraining, and all the other layers can be kept the same. Other ML classifiers, in contrast, would need to be retrained on the entire dataset. Keep in mind that retraining a model may take significant computing resources for large datasets.
Anonymized traffic detection is a nuanced challenge that every enterprise faces. Adversaries use TOR channels to exfiltrate data anonymously. Current approaches by TOR traffic detection vendors depend on blocking known entry nodes of the TOR network. This approach does not scale and can be easily bypassed. A more general method is to use deep learning-based techniques.
In this article, we presented a deep learning-based system to detect the TOR traffic with high recall and precision. Let us know your take on the current state of deep learning, or if you have any alternate approaches, in the comments section below.
Dr. Satnam Singh, Chief Data Scientist – Acalvio Technologies
Dr Satnam Singh is currently leading security data science development at Acalvio Technologies. He has more than a decade of work experience in successfully building data products from concept to production in multiple domains. In 2015, he was named as one of the top 10 data scientists in India. To his credit, he has 25+ patents and 30+ journal and conference publications.
Apart from holding a PhD degree in ECE from University of Connecticut, Satnam also holds a Masters in ECE from University of Wyoming. Satnam is a senior IEEE member and a regular speaker in various Big Data and Data Science conferences.
Balamurali A R, Member Technical Staff (Data Science) at Acalvio
Balamurali A R is a member of the data science team at Acalvio. He is a graduate from IIT Mumbai, holds a Ph.D in Computer Science and has previously worked with companies like Samsung and IBM.