When you search for security data science on the internet, it’s difficult to find resources with crisp and clear information about the use cases, methods and limitations in Information Security (hereby referred to as InfoSec). There’s usually always some marketing material attached to it. So, I thought of summarising my knowledge and InfoSec experience in this article.
The intended audience for this article is:
There are several reasons for this and a few major ones are listed below:
When the attackers are within the enterprise network, they first need to figure out where they are. Once they accomplish this, they move towards their targets, and carry out the attack. During these reconnaissance queries and movements, they usually leave some traces or signals. These signals are present in the data, and their presence can be detected using data science to raise timely alerts.
Earlier, we used to bring all the data to a security data lake called SIEM (security information and event management). But now with the advancements in data science, correlations across multiple events can be performed in real-time. Using algorithms, we can connect the dots and find the patterns which used to be difficult to find manually owing to the lack of security analysts.
One of the key advantages of data science-based systems is that they learn from the decisions taken by the security analysts. After training the systems extensively, they can also start taking the same preventive measures/actions as the security analysts.
The problems in InfoSec are multi-dimensional, that is, thousands of features present in tons of data sources. We need to detect the adversary presence by mining petabytes of machine logs. This is a complex and difficult problem, because the signal to noise ratio is very low. Also, connecting the attack sequences among isolated and rare signal events is a significant challenge.
The majority of the security data has no labels, which makes it difficult to apply deep learning networks to a large number of InfoSec use cases. However, the industry is tackling this problem by generating class labels for a few use cases at a time.
For example, detection of malware, and the ranking of malicious websites and DNS domains, is primarily done using Machine Learning techniques. Another successful use case of data science for security is making a baseline of each user/network device/entity within the network and comparing it with the real-time data to find rare/abnormal behavior and raising anomalies.
These user behavior-based anomalies are certainly more than 100 times lesser than rule-based anomalies. However, their magnitudes are still quite high and a large number of them end up being false positives. In short, security data science is not a silver bullet for InfoSec. We need to marry multiple technologies along with it to improve the defense.
Figure 1: Data Sources and Use Cases for Security Data Science
The InfoSec domain has a large number of logs. The data volume and variety depend on the organisation’s size and domain. Most of the big MNCs use 20-50 InfoSec tools and record the data into hot and cold storage. They use so-called “security data lakes” or Security Information and Event Management (SIEM) tools to store recent data (e.g. DNS logs, authentication logs, Windows security logs, etc.) for monitoring the threats. Data older than a few months, or high volume data (e.g. NetFlow, Bro logs etc.), is pushed to cold storage in Hadoop-based systems.
Here is a list of typical data sources in InfoSec:
All these logs provide a lot of visibility about the adversary’s presence and activities. The table below summarises various use cases according to the data source type. Figure 1 (above) shows that these use cases are typically solved using anomaly detection and ML techniques.
Table 1: Use Cases for the Security Data Science
Network logs – Use cases | Endpoint logs – Use cases | Authentication logs – Use cases | |
1 | Unusual volume of network traffic from a host/network device | Anomalous New Listening Ports/Services/Processes | Excessive Failed Logins – Brute Force Attack |
2 | Network intrusion detection (Scanning, Spoofing detection etc.) | Host with Excessive No. of Listening Ports/Services/Processes | Default Account Usage |
3 | Application attack detection (Top 10 OWASP attacks) | Malware detection and classification | User Behavior Analytics |
4 | Reputation of DNS servers and CnC Detection | Spyware, Ransomware detection | Active directory and Privilege user monitoring |
5 | Substantial increase in Port activity/Events | Prohibited Process/Service creation | Geographically improbable authentication detection |
6 | Detection of unapproved port activity | Host with multiple infections | Brute force access behavior detection |
7 | DNS tunnel attack detection | Unusual registry changes | Spam Mitigation |
Security data science has evolved in three phases as shown in Figure 2 below.
Since the 1990s, data science has played an increasingly important role in information security. This started with rules-based approaches to finding anomalies in intrusion detection system (IDS) and intrusion prevention system (IPS). Most of the firewall, network/host IDS/IPS are either rule-based or anomaly detection-based systems.
Rules are written by security experts and the system raises alerts based on the rules, for instance, failed authentication beyond a specific count indicates a brute force attack. However, these rules don’t capture the dynamic nature of events and context around the events.
Anomaly detection systems are based on the normal behavior models of hosts and networks. Whenever there is significant deviation from the normal behavior, then they raise alerts. Anomaly detection algorithms, such as Clustering, Robust-PCA, SVD, One-Class SVM, DB Scan and KDE, are used to detect anomalous events.
Anomaly-based algorithms are used in networks to detect:
Unfortunately, most of the AD systems raise high false alarms and need a lot of security analysts to validate the alerts.
Figure 2: Evolution of the Security Data Science
In the early 2000s, the second generation of security tools evolved. These facilitated triaging the alerts by correlating multiple data sources in a security data lake called as security information and event management (SIEM) tools. SIEM was successful when the data was large, but in the Big data era, they are slow and are missing an intelligence layer.
With the advances in Big data frameworks, a new form of security data science has evolved. Now, it is possible to boil the ocean of raw logs in real time and raise alerts. This gave rise to user and entity behavior analytics (UEBA) that leverages Hadoop/Spark and anomaly detection techniques to raise real-time alerts whenever there is abnormal behavior of hosts/users within the enterprise network.
This has enabled enterprises to detect insider attacks. However, the anomaly-based solutions have a drawback of generating a large number of false-positive alerts. Each investigation of a false-positive alert adds a significant burden to an already overloaded security analyst.
Another emerging area that is rapidly gaining traction is endpoint security where deep learning is used to detect and classify malware in real time. Supervised ML algorithms such as Deep Learning Networks (ANN, RNN, CNN), Random Forest and XGBoost are used to classify malicious scripts vs benign scripts, detect DNS tunnels, detect C&C servers, detect malware, detect known network scans, application attacks, and many more known threats that have labels available for training the system.
In this evolution, we are bringing a new paradigm shift for the InfoSec field. In this security defense, we first deploy deceptions (reincarnation of honeypots, honeynets, honeywords etc.) in the enterprise network. Then, we leverage data science to profile adversary behavior and their movements within the network. We termed this research “deception-triggered data science.”
Deception-triggered data science is significantly different from conventional security data science. The latter primarily leverages anomaly detection techniques to identify anomalous behavior in network traffic, or user/host/network element behavior. Whereas deception-triggered data science starts from a real attack, i.e., anomaly announced by a deception event, and hence does not require anomaly detection algorithms.
Deception alerts are high fidelity alerts. Data science correlates other security event data with these high fidelity alerts to generate a lot of insights about the adversary behavior. In this approach, we collect and describe the context around a deception alert instead of looking for anomalies like a needle in a haystack. Instead, this kind of data science can focus on capturing everything about how an attack begins and proceeds as it progresses.
To draw on a metaphor, comparing deception-triggered data science to brute force security data science is like boiling a cup of tea rather than boiling an entire ocean. The former is practical, clever and elegant; the latter is expensive, cumbersome and impractical. Deception triggered data science significantly reduces the false positives thereby reducing the overall infrastructure and maintenance cost associated with security-related chores.
More details about this topic can be found in my talk at Splunk .conf 2016 (#4 in the references at the bottom of this article). To read more about deception, please refer to Almeshekah and Spafford’s paper listed in point #5.
Figure 3: Security Data Science Methods
Most of the InfoSec problems can be modeled using anomaly detection and machine learning techniques, as shown with an example in Figure 3 above. I have shared the details of algorithms, feature engineering and data science pipeline for several InfoSec case studies during my webinars.
The video titles mentioned below, along with the timeline, contain the InfoSec use cases. The links to these videos are in the references section at the bottom of this article.
I have put together a link to the datasets, papers and talks related to Security Data Science. In the references below, use [6,7,8] Github links to learn further. Enjoy learning 🙂
[1] TechGig Webinar, Demystifying Security Data Science – Part 1
[2] TechGig Webinar, Demystifying Security Data Science – Part 2
[3] DataHack Summit
[4] Splunk .conf 2016 Talk, “Deception-Triggered Security Data Science to Detect Adversary Movements”.
[6] GitHub – Awesome ML for Cyber Security
[7] The Definitive Security Data Science and Machine Learning Guide
[8] SANS Institute: Reading Room
Dr. Satnam Singh, Chief Data Scientist – Acalvio Technologies
Dr Satnam Singh is currently leading security data science development at Acalvio Technologies. He has more than a decade of work experience in successfully building data products from concept to production in multiple domains. In 2015, he was named as one of the top 10 data scientists in India. To his credit, he has 25+ patents and 30+ journal and conference publications.
Apart from holding a PhD degree in ECE from University of Connecticut, Satnam also holds a Masters in ECE from University of Wyoming. Satnam is a senior IEEE member and a regular speaker in various Big Data and Data Science conferences.
Neatly summarized. Much needed blog.
Very good information. Lucky me I ran across your site by accident (stumbleupon). I have saved it for later!