Master Generative AI with 10+ Real-world Projects in 2025!

Machine Learning in Cyber Security — Malicious Software Installation

Guest Blog Last Updated : 16 Sep, 2020

4 min read

Introduction

Monitoring of user activities performed by local administrators is always a challenge for SOC analysts and security professionals. Most of the security framework will recommend the implementation of a whitelist mechanism.

Machine Learning Cyber Security

However, the real world is often not ideal. You will always have different developers or users having local administrator rights to bypass controls specified. Is there a way to monitor the local administrator activities?

Let’s talk about the data source

machine learning cyber security — An example of how the dataset looks like — the 3 entries listed above are referring to the same software

We have a regular batch job to retrieve the software installed on each of the workstations which are located in different regions. Most of the software installed is displayed in their local languages. (Yes, you name it — it could be Japanese, French, Dutch …..) So you will meet a situation that the software installed is displayed as 7 different names while it is referring to the same software in the whitelist. Not to mention, we have thousands of devices.

Attributes of the dataset

Hostname — The hostname of the devices
Publisher Name — the software publisher
Software Name — Software Name in Local Language and different versions number

Is there a way we could identify non-standard installation?

My idea is that legit software used in the company — should have more than 1 installation and the software name should be different. In such a case, I believe it will be effective to use machine learning to help a user classify the software and highlight any outlier.

Char processing using Term Frequency — Inverse Document Frequency (TF-IDF)

Natural Language Processing (NLP) is a sub-field of artificial intelligence that deals with understanding and processing human language. In light of new advancements in machine learning, many organizations have begun applying natural language processing for translation, chatbots, and candidate filtering.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF is usually used for word extraction. However, I was thinking about whether it could also be applicable for char extraction. The idea is to explore how well we could apply TF-IDF to extract the features related to each character in the software name by exporting the importance of each character to the software name.
An example of the script below is going through how I apply the TF-IDF to the software name field in my data set.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer# Import the dataset 
df=pd.read_csv("your dataset") # Extract the Manufacturer into List 
field_extracted = df['softwarename']# initialize the TF-IDF 
vectorizer = TfidfVectorizer(analyzer='char')
vectors = vectorizer.fit_transform(field_extracted)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
result = pd.DataFrame(denselist, columns=feature_names)

A snippet of the result:

In the above diagram, you could see that a calculation is performed to evaluate how “important” each char is on the software name. This could also be interpreted as how “many” of the char specified is available on each of the software names. In this way, you could present statistically on the characteristic of each “software name” and we could put these features into the machine learning model of your choice.

Other features I extracted and believe it will also be meaning to the models:

The entropy of the Software Name

import math
from collections import Counter# Function of calculating Entropy 
def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }if len(data) <= 1:
        return 0counts = Counter()for d in data:
        counts[d] += 1ent = 0probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])return ententropy  = [eta(x) for x in field_extracted]

Space Ratio — How many spaces the software name has
Vowel Ratio — How many vowels (aeiou) the software name has

At last, I have these features listed above with labels to run against randomtreeforest classifier. You could select any classifier of your choice as long as it could give you a satisfactory result.

Thanks for reading!

About the Author

Elaine Hung

Elaine is a machine learning enthusiast, digital forensic and incident response consultant. Interested in applying ML and NLP on cyber security topics.

Guest Blog

Free Courses

4.6

Exploratory Data Analysis with Python & GenAI

Learn EDA with Python: Transform data into insights using PandasAI & more.

4.5

Ace a Data Scientist Interview in 2025

Build a powerful 2025-ready data science resume using AI tools.

4.5

No Code Predictive Analytics with Orange

No-code AI course for business pros with real-world ML use cases.

4.5

Learn to Build Intelligent Chatbots using AI

Build ethical chatbots via OpenAI & LangChain using PDF data.

4.7

Adaptive Email Agents with DSPy

Build adaptive email agents with DSPy using context and smart learning.

Responses From Readers

Become an Author

Share insights, grow your voice, and inspire the data community.

Reach a Global Audience
Share Your Expertise with the World
Build Your Brand & Audience

Join a Thriving AI Community
Level Up Your AI Game
Expand Your Influence in Genrative AI

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques

Reading list

Machine Learning in Cyber Security — Malicious Software Installation

Introduction

Let’s talk about the data source

Attributes of the dataset