Demystifying Information Security Using Data Science

Guest Blog | Last Updated: 02 Feb, 2018


Introduction

When you search for security data science on the internet, it’s difficult to find resources with crisp and clear information about the use cases, methods and limitations in Information Security (hereafter referred to as InfoSec). There is usually some marketing material attached. So, I thought of summarising my knowledge and InfoSec experience in this article.

The intended audience for this article is:

Budding security data scientists
Security analysts
Threat hunters
InfoSec professionals
Anyone who wants to explore a career path in InfoSec and data science.

The article answers the following questions:

Why are so many ransomware attacks and data breaches happening now?
What are the challenges in InfoSec?
Why does InfoSec need data science?
What are the data science challenges for InfoSec?
What are the key data sources and use cases for security data science?
How did security data science evolve over time?
How to model an InfoSec use case into a data science problem?

Why are so many ransomware attacks and data breaches happening now?

There are several reasons for this and a few major ones are listed below:

The attack surface is increasing and the network perimeter has now been dissolved due to mobile, cloud, BYOD, etc.
Attackers have found a highly efficient way to make quick money using ransomware. In fact, ransomware is now available as a service on the dark web. Due to this, novice attackers can also simply leverage the ransomware service and focus more on the ransom extortion.
Attackers are also using more advanced techniques, such as polymorphic malware and zero-day exploits, to evade current InfoSec tools.
The InfoSec defense team has a limited number of sensors/cameras to watch adversary movements within the enterprise network. The adversaries are almost always in an advantageous position: once they have compromised a few users, they can move freely within the enterprise network, much like the so-called “insider threats”.

Source: CSO

What are the challenges in InfoSec?

Information Security is a highly skewed and asymmetric problem. The defense team may need to write nearly 10,000 lines of code to fix a vulnerability and secure the system. However, an adversary just needs to find another vulnerability and come up with just 5-6 lines of code that can easily evade the security patch.
There are multiple ‘doors’ for an adversary to ‘walk’ into the enterprise network. It’s difficult to guard all the gates because the security tools/gates (like firewalls, network-based intrusion detection/prevention tools, host-based intrusion detection tools, anti-virus, etc.) at times cannot distinguish between a genuine user and an adversary who has compromised a user’s account.
The adversaries use the same commands, scripts, and tools that are used by the system administrators. Depending on their skill set, attackers use either existing tools, like Nmap, Metasploit and PowerSploit, or home-grown scripts to execute their attack.

Why does InfoSec need data science?

When the attackers are within the enterprise network, they first need to figure out where they are. Once they accomplish this, they move towards their targets, and carry out the attack. During these reconnaissance queries and movements, they usually leave some traces or signals. These signals are present in the data, and their presence can be detected using data science to raise timely alerts.

Earlier, we used to bring all the data into a security data lake, the SIEM (security information and event management) system. But now, with the advancements in data science, correlations across multiple events can be performed in real time. Using algorithms, we can connect the dots and find patterns that used to be difficult to find manually owing to the shortage of security analysts.

One of the key advantages of data science-based systems is that they learn from the decisions taken by the security analysts. After extensive training, these systems can also start taking the same preventive measures/actions as the security analysts.

What are the data science challenges for InfoSec?

The problems in InfoSec are high-dimensional: thousands of features are spread across many data sources. We need to detect the adversary's presence by mining petabytes of machine logs. This is a complex and difficult problem, because the signal-to-noise ratio is very low. Also, connecting the attack sequences among isolated and rare signal events is a significant challenge.

The majority of security data has no labels, which makes it difficult to apply supervised techniques such as deep learning networks to a large number of InfoSec use cases. However, the industry is tackling this problem by generating class labels for a few use cases at a time.

For example, the detection of malware and the ranking of malicious websites and DNS domains are primarily done using Machine Learning techniques. Another successful use case of data science for security is building a baseline for each user/network device/entity within the network and comparing it with real-time data to find rare/abnormal behavior and raise anomaly alerts.

Behavior-based systems raise far fewer alerts than rule-based systems, easily more than 100 times fewer. However, the number of alerts is still quite high and a large share of them end up being false positives. In short, security data science is not a silver bullet for InfoSec. We need to marry multiple technologies along with it to improve the defense.
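The baselining idea can be sketched in a few lines. This is only an illustration under simple assumptions: the user names, the daily counts, and the 3-standard-deviation threshold are all hypothetical, and a production system would use richer features and more robust statistics.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Return {entity: (mean, standard deviation)} from historical daily counts."""
    return {entity: (mean(counts), stdev(counts)) for entity, counts in history.items()}

def deviates_from_baseline(baseline, entity, observed, threshold=3.0):
    """True if `observed` is more than `threshold` standard deviations from the entity's mean."""
    mu, sigma = baseline[entity]
    if sigma == 0:
        # A perfectly constant history: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Hypothetical data: number of distinct hosts each user logged in to per day.
history = {
    "alice": [3, 4, 3, 5, 4, 3, 4],
    "bob": [10, 12, 11, 9, 10, 11, 12],
}
baseline = build_baseline(history)

print(deviates_from_baseline(baseline, "alice", 4))   # within alice's usual range
print(deviates_from_baseline(baseline, "alice", 25))  # far outside alice's usual range
```

Note that 25 hosts would be flagged for alice but a similar jump relative to bob's own history is what matters for bob; the comparison is always against the entity's own baseline.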

Figure 1: Data Sources and Use Cases for Security Data Science

What are the key data sources and use cases for security data science?

The InfoSec domain has a large number of logs. The data volume and variety depend on the organisation’s size and domain. Most large multinational companies (MNCs) use 20-50 InfoSec tools and record the data into hot and cold storage. They use so-called “security data lakes” or Security Information and Event Management (SIEM) tools to store recent data (e.g. DNS logs, authentication logs, Windows security logs, etc.) for monitoring threats. Data older than a few months, or high-volume data (e.g. NetFlow, Bro logs, etc.), is pushed to cold storage in Hadoop-based systems.

Here is a list of typical data sources in InfoSec:

Endpoints: Processes, applications, host-based IDS alerts, file system changes, registry changes, operating system logs, anti-virus alerts.
Network: Network packets and flows, network IDS/IPS alerts, network topology, firewall logs, HTTP proxy logs, DNS logs, Netflow, Bro logs.
Authentication: Windows/Mac/Linux authentication logs, Windows security logs, Active directory logs, Privilege user management logs.
Threat Intelligence: Indicators of compromise, malicious domain names, IP addresses from peer organizations and open source communities, malware signatures.
Asset management logs
Vulnerability logs

All these logs provide a lot of visibility about the adversary’s presence and activities. The table below summarises various use cases according to the data source type. Figure 1 (above) shows that these use cases are typically solved using anomaly detection and ML techniques.

Table 1: Use Cases for Security Data Science

	Network logs – Use cases	Endpoint logs – Use cases	Authentication logs – Use cases
1	Unusual volume of network traffic from a host/network device	Anomalous New Listening Ports/Services/Processes	Excessive Failed Logins – Brute Force Attack
2	Network intrusion detection (Scanning, Spoofing detection etc.)	Host with Excessive No. of Listening Ports/Services/Processes	Default Account Usage
3	Application attack detection (Top 10 OWASP attacks)	Malware detection and classification	User Behavior Analytics
4	Reputation of DNS servers and CnC Detection	Spyware, Ransomware detection	Active directory and Privilege user monitoring
5	Substantial increase in Port activity/Events	Prohibited Process/Service creation	Geographically improbable authentication detection
6	Detection of unapproved port activity	Host with multiple infections	Brute force access behavior detection
7	DNS tunnel attack detection	Unusual registry changes	Spam Mitigation
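To make one row of the table concrete, here is a rough sketch of "geographically improbable authentication detection". It assumes each login record has already been enriched with a latitude and longitude (for example from an IP geolocation lookup, which is not shown), and it uses a hypothetical speed limit of 900 km/h, roughly that of a commercial flight.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def improbable_travel(login_a, login_b, max_speed_kmh=900.0):
    """Each login is (epoch_seconds, lat, lon). True if the implied travel speed exceeds the limit."""
    t1, lat1, lon1 = login_a
    t2, lat2, lon2 = login_b
    hours = abs(t2 - t1) / 3600.0
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        # Simultaneous logins from two different places cannot be one person.
        return distance > 0
    return distance / hours > max_speed_kmh

# Hypothetical logins for the same account.
bangalore = (0, 12.97, 77.59)
new_york_one_hour_later = (3600, 40.71, -74.01)
chennai_six_hours_later = (6 * 3600, 13.08, 80.27)

print(improbable_travel(bangalore, new_york_one_hour_later))
print(improbable_travel(bangalore, chennai_six_hours_later))
```

In practice VPNs and mobile carriers make geolocation noisy, so a check like this is usually one signal among several rather than an alert on its own.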

How did security data science evolve over time?

Security data science has evolved in four phases, described below. Figure 2 illustrates this evolution.

Phase 1 – Rule-based and Anomaly Detection systems

Since the 1990s, data science has played an increasingly important role in information security. This started with rule-based approaches to finding anomalies in intrusion detection systems (IDS) and intrusion prevention systems (IPS). Most firewalls and network/host IDS/IPS products are either rule-based or anomaly detection-based systems.

Rules are written by security experts and the system raises alerts based on those rules. For instance, failed authentications beyond a specific count indicate a brute force attack. However, these rules don’t capture the dynamic nature of events and the context around them.

Anomaly detection systems are based on models of the normal behavior of hosts and networks. Whenever there is a significant deviation from the normal behavior, they raise alerts. Anomaly detection algorithms, such as clustering, Robust PCA, SVD, One-Class SVM, DBSCAN and KDE, are used to detect anomalous events.
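A rule of this kind might look like the sketch below. The threshold of 5 failures within 60 seconds and the event format are assumptions made for illustration; real products express such rules in their own rule languages.

```python
from collections import defaultdict, deque

def brute_force_alerts(events, max_failures=5, window_seconds=60):
    """Scan (timestamp_seconds, user, success) events sorted by time.

    Returns a list of (timestamp, user) for every failed login that pushes the
    user's failure count within the sliding window above `max_failures`.
    """
    recent_failures = defaultdict(deque)
    alerts = []
    for timestamp, user, success in events:
        if success:
            continue
        window = recent_failures[user]
        window.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            alerts.append((timestamp, user))
    return alerts

# Hypothetical log: one account fails every 10 seconds, another fails once.
events = [(t, "svc_backup", False) for t in range(0, 70, 10)]
events.append((65, "alice", False))
events.sort()

alerts = brute_force_alerts(events)
print(alerts)
```

The fixed threshold shows the limitation mentioned above: the rule fires the same way for a service account with a stale password as for a real attack, because it knows nothing about context.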
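As one small example from that list, a kernel density estimate (KDE) can score how typical a new observation is compared with past behavior. The sketch below is a one-dimensional Gaussian KDE written from scratch; the traffic numbers, the bandwidth and the density cut-off are hypothetical and would need tuning on real data.

```python
from math import exp, pi, sqrt

def gaussian_kde_density(x, samples, bandwidth):
    """Estimate the probability density at x from `samples` using Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))
    return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def is_low_density(x, samples, bandwidth=2.0, min_density=1e-3):
    """Flag x as anomalous when its estimated density under normal behavior is very low."""
    return gaussian_kde_density(x, samples, bandwidth) < min_density

# Hypothetical baseline: megabytes uploaded per hour by one host.
normal_traffic = [48, 52, 50, 47, 53, 49, 51, 50]

print(is_low_density(50, normal_traffic))  # typical volume
print(is_low_density(90, normal_traffic))  # far from anything seen before
```

The choice of `min_density` directly controls the trade-off between missed detections and false alarms.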

Anomaly-based algorithms are used in networks to detect:

anomalous ports
unusual traffic from a host
excessive DNS failures
endpoints having unusual processes/applications/registry changes
users/hosts having unusual behaviors

Unfortunately, most anomaly detection (AD) systems raise a high number of false alarms and need a lot of security analysts to validate the alerts.

Figure 2: Evolution of Security Data Science

Phase 2 – Security Data Lakes/SIEM

In the early 2000s, the second generation of security tools evolved. These facilitated triaging alerts by correlating multiple data sources in a security data lake, known as security information and event management (SIEM) tools. SIEM tools worked well at moderate data volumes, but in the Big Data era they are slow and lack an intelligence layer.

Phase 3 – UEBA, Malware detection

With the advances in Big data frameworks, a new form of security data science has evolved. Now, it is possible to boil the ocean of raw logs in real time and raise alerts. This gave rise to user and entity behavior analytics (UEBA) that leverages Hadoop/Spark and anomaly detection techniques to raise real-time alerts whenever there is abnormal behavior of hosts/users within the enterprise network.

This has enabled enterprises to detect insider attacks. However, the anomaly-based solutions have a drawback of generating a large number of false-positive alerts. Each investigation of a false-positive alert adds a significant burden to an already overloaded security analyst.

Another emerging area that is rapidly gaining traction is endpoint security where deep learning is used to detect and classify malware in real time. Supervised ML algorithms such as Deep Learning Networks (ANN, RNN, CNN), Random Forest and XGBoost are used to classify malicious scripts vs benign scripts, detect DNS tunnels, detect C&C servers, detect malware, detect known network scans, application attacks, and many more known threats that have labels available for training the system.
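Supervised models need numeric features. As a rough illustration for the DNS tunnel use case, the sketch below turns a query name into a few features that a classifier such as Random Forest could be trained on. The feature set is my own assumption, not a standard one, and the code naively treats the last two labels as the registered domain. The classifier itself is not shown.

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in Counter(text).values())

def dns_query_features(query_name):
    """Extract simple features from a DNS query name for a downstream classifier."""
    labels = query_name.rstrip(".").split(".")
    # Assumption: the registered domain is the last two labels (wrong for e.g. .co.uk).
    subdomain = ".".join(labels[:-2])
    digits = sum(ch.isdigit() for ch in subdomain)
    return {
        "label_count": len(labels),
        "subdomain_length": len(subdomain),
        "subdomain_entropy": shannon_entropy(subdomain),
        "digit_ratio": digits / len(subdomain) if subdomain else 0.0,
    }

# Hypothetical queries: an ordinary lookup and one carrying encoded data.
benign = dns_query_features("www.example.com")
suspicious = dns_query_features("a3f9c1e07b2d4468.5e8d2b7f.example.com")

print(benign)
print(suspicious)
```

Tunnelled queries tend to have long, high-entropy subdomains, which is why these features are a common starting point, though labeled examples are still needed to train and validate a model.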

Phase 4 – Deception-triggered data science

In this phase, we are bringing a paradigm shift to the InfoSec field. In this security defense, we first deploy deceptions (a reincarnation of honeypots, honeynets, honeywords, etc.) in the enterprise network. Then, we leverage data science to profile adversary behavior and movements within the network. We termed this research “deception-triggered data science.”

Deception-triggered data science is significantly different from conventional security data science. The latter primarily leverages anomaly detection techniques to identify anomalous behavior in network traffic, or in user/host/network element behavior. In contrast, deception-triggered data science starts from a real attack, i.e., an anomaly announced by a deception event, and hence does not require anomaly detection algorithms.

Deception alerts are high-fidelity alerts. Data science correlates other security event data with these high-fidelity alerts to generate insights about the adversary's behavior. In this approach, we collect and describe the context around a deception alert instead of looking for anomalies like a needle in a haystack. This kind of data science can focus on capturing everything about how an attack begins and how it progresses.

To draw on a metaphor, deception-triggered data science compared with brute-force security data science is like boiling a cup of tea rather than boiling an entire ocean. The former is practical, clever and elegant; the latter is expensive, cumbersome and impractical. Deception-triggered data science significantly reduces false positives, thereby reducing the overall infrastructure and maintenance cost associated with security-related chores.
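The "collect the context around a deception alert" step can be pictured with a minimal sketch. The event fields (`timestamp`, `src_host`, `dst_host`, `source`), the host names and the 5-minute window are all hypothetical; this is an illustration of the idea, not how any particular product implements it.

```python
def context_around_alert(alert, events, window_seconds=300):
    """Return events that involve the alert's source host within the time window, oldest first."""
    host = alert["src_host"]
    t0 = alert["timestamp"]
    related = [
        event for event in events
        if host in (event["src_host"], event.get("dst_host"))
        and abs(event["timestamp"] - t0) <= window_seconds
    ]
    return sorted(related, key=lambda event: event["timestamp"])

# Hypothetical events from different log sources.
events = [
    {"timestamp": 960, "src_host": "ws-17", "dst_host": "fileserver", "source": "netflow"},
    {"timestamp": 900, "src_host": "ws-17", "dst_host": "dc-01", "source": "auth"},
    {"timestamp": 980, "src_host": "ws-42", "dst_host": "printer", "source": "netflow"},
    {"timestamp": 5000, "src_host": "ws-17", "dst_host": "dc-01", "source": "auth"},
]

# Hypothetical deception alert: ws-17 touched a decoy database.
alert = {"timestamp": 1000, "src_host": "ws-17", "decoy": "honeypot-db"}

context = context_around_alert(alert, events)
for event in context:
    print(event["timestamp"], event["source"], event["src_host"], "->", event["dst_host"])
```

Because the search starts from one trusted alert and one host, it only needs to touch a small slice of the logs rather than scoring every event in the network.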

More details about this topic can be found in my talk at Splunk .conf 2016 (#4 in the references at the bottom of this article). To read more about deception, please refer to Almeshekah and Spafford’s paper listed in point #5.

Figure 3: Security Data Science Methods

How to model an InfoSec use case into a data science problem?

Most of the InfoSec problems can be modeled using anomaly detection and machine learning techniques, as shown with an example in Figure 3 above. I have shared the details of algorithms, feature engineering and data science pipeline for several InfoSec case studies during my webinars.

The InfoSec use cases below are listed with the webinar in which they are discussed and the corresponding timeline. The links to these videos are in the references section at the bottom of this article.

Data exfiltration detection using anomaly detection [Webinar [1], timeline – 26:36-37:50]
Detect Command and Control (C&C) Center [Webinar [2], timeline – 19:42-29:20]
PowerShell Obfuscation and Detection [Webinar [2], timeline – 29:20-38:45]

End Notes

I have put together links to datasets, papers and talks related to security data science. Use the links in references [6], [7] and [8] below to learn further. Enjoy learning 🙂

References

[1] TechGig Webinar, Demystifying Security Data Science – Part 1

[2] TechGig Webinar, Demystifying Security Data Science – Part 2

[3] DataHack Summit

[4] Splunk .conf 2016 Talk, “Deception-Triggered Security Data Science to Detect Adversary Movements”.

[5] Mohammed H. Almeshekah and Eugene H. Spafford, “Planning and Integrating Deception into Computer Security Defenses,” Proceedings of the 2014 New Security Paradigms Workshop, 2014.

[6] GitHub – Awesome ML for Cyber Security

[7] The Definitive Security Data Science and Machine Learning Guide

[8] SANS Institute: Reading Room

About the Author

Dr. Satnam Singh, Chief Data Scientist – Acalvio Technologies

Dr. Satnam Singh is currently leading security data science development at Acalvio Technologies. He has more than a decade of work experience in successfully building data products from concept to production in multiple domains. In 2015, he was named as one of the top 10 data scientists in India. To his credit, he has 25+ patents and 30+ journal and conference publications.

Apart from holding a PhD degree in ECE from the University of Connecticut, Satnam also holds a Masters in ECE from the University of Wyoming. Satnam is a senior IEEE member and a regular speaker at various Big Data and Data Science conferences.




