The world of auditing data can be complex, with many challenges to overcome. One of the biggest challenges is handling categorical attributes while dealing with datasets. In this article, we will delve into the world of auditing data, anomaly detection, and the impact of encoding categorical attributes on models.
One of the main challenges associated with anomaly detection for auditing data is handling categorical attributes. Encoding categorical attributes is mandatory because the models cannot interpret text input. Commonly, this is done using Label encoding or One Hot encoding. However, in a large dataset, One-hot encoding can lead to poor model performance due to the curse of dimensionality.
Learning Objectives
To understand the concept of auditing data and the challenges associated with handling categorical attributes while dealing with datasets.
To evaluate different methods of deep unsupervised anomaly detection.
To understand the impact of encoding categorical attributes on models used for anomaly detection in auditing data.
How Does Encoding Categorical Attributes Impact the Models?
8.1 t-SNE representation of the Car Insurance dataset
8.2 t-SNE representation of the Vehicle Insurance dataset
8.3 t-SNE representation of the Vehicle Claims dataset
Conclusion
What is Auditing Data?
Auditing data can include Journals, Insurance Claims, and Intrusion Data for information systems; in this article, the examples provided are insurance claims of vehicles. Insurance claims are distinguishable from anomaly detection datasets, e.g., KDD, by a larger number of categorical features.
Categorical features are discrete attributes in our data that can be either of type integer or character. Numerical features are continuous attributes in our data that are always real-valued. Datasets with numerical features are popular in the anomaly detection community like Credit Card fraud data. Most of the publically available datasets contain fewer categorical features than insurance claims data. Categorical features are more in number than numerical features in the insurance claims datasets.
An insurance claim includes features like Model, Brand, Income, Cost, Issue, Color, etc. The number of categorical features is higher in auditing data than in the Credit Card and KDD datasets. These datasets are benchmarks in unsupervised anomaly detection methods. As seen in the below table, Insurance claim datasets have more categorical features, which are important to understanding the behavior of fraudulent data.
The auditing datasets used to evaluate the impact of categorical encodings are Car Insurance, Vehicle Insurance, and Vehicle Claims.
What is Anomaly Detection?
An anomaly is an observation located far away from normal data in a dataset by a specific distance (Threshold). In terms of auditing data, we prefer the term fraudulent data. Anomaly detection distinguishes between normal and fraudulent data using machine learning or deep learning model. Different methods can be used for anomaly detection, like density estimation, reconstruction error, and classification methods.
Density Estimation – These methods estimate normal data distribution and classify anomalous data if it has not been sampled from the learned distribution.
Reconstruction Error – Reconstruction error-based methods are based on the principle that normal data can be reconstructed with smaller losses than anomalous data. The higher the reconstruction loss increases the chances that the data is an anomaly.
Classification Methods –Classification methods like Random Forest, Isolation Forest, One Class – Support Vector Machines, and Local Outlier Factors can be used for anomaly detection. Classification in anomaly detection involves identifying one of the classes as the anomaly. Still, the classes are divided into two groups (0 and 1) in the multi-class scenario, and the class with fewer data is the anomalous class.
The output of the above methods is anomaly scores or reconstruction errors. Then we have to decide on a threshold, according to which we classify the anomalous data.
Major Challenges Faced While Auditing Data
Handling of Categorical Attributes:Encoding categorical attributes is mandatory because the model cannot interpret text input. So, the values are encoded with Label encoding or One Hot encoding. But in a large dataset, One hot encoding transforms the data into a high dimensional space by increasing the number of attributes. The model performs poorly due to the curse of dimensionality.
Selecting threshold for classification: If the data is not labeled, it is difficult to evaluate the model’s performance because we do not know the number of anomalies present in the dataset. The prior knowledge about the dataset makes it easier to determine the threshold. Let’s say we have 5 out of 10 anomalous samples in our data. So, we can select the threshold at the 50 percentile score.
Public Datasets:Most auditing datasets are confidential because they belong to corporate companies and contain sensitive and personal information. One possible way to mitigate confidentiality issues is to train using synthetic datasets (Vehicle Claims).
Auditing Datasets for Anomaly Detection
Insurance claims for vehicles include information about the properties of the vehicle, like the model, brand, price, year, and fuel type. It includes information about the driver, date of birth, gender, and profession. Additionally, the claim may include information about the total cost of repairing. The datasets used in this article are all from a single domain, but they vary in the number of attributes and the number of instances.
The Vehicle Claims dataset is large, containing over 250,000 rows, and its categorical attributes have a cardinality of 1171. Due to its large size, this dataset suffers from the curse of dimensionality.
The Vehicle Insurance dataset is medium-sized, with 15,420 rows and 151 unique categorical values. This makes it less prone to suffer from the curse of dimensionality.
The Car Insurance dataset is small, with labels and 25% anomalous samples, and it contains a similar number of numerical and categorical features. With 169 unique categories, it does not suffer from the curse of dimensionality.
Encoding of Categorical Attributes
Different encodings of categorical values
Label Encoding – In label encoding, the categorical values are replaced with numeric integer values between 1 and the number of categories. Label encoding represents the categories in the intended way for ordinal values. Still, when the features are nominal, the representation is incorrect as the categorical values do not conform to a specific order.
For example, if we have categories like Automatic, Hybrid, Manual, and Semi-Automatic in a feature, label encoding transforms these values into {1: Automatic, 2: Hybrid, 3: Manual, 4:Semi-Automatic}. This representation provides no information about the categorical values, but a representation such as {0: Low, 1: Medium, 2: High} provides a clear representation because the feature variable Low is assigned a lower numerical value. Therefore, label encoding is better for ordinal values but disadvantageous for nominal values.
One Hot Encoding – One Hot encoding is used to address the problem of nominal encoding values, which transforms each categorical value into a distinct feature in the dataset consisting of binary values. For example, in the case of four different categories encoded as {1, 2, 3, 4}, One Hot encoding would create new features such as {Automatic: [1,0,0,0], Hybrid: [0,1,0,0], Manual: [0,0,1,0], Semi-Automatic: [0,0,0,1]}.
The dimension of the dataset then directly depends on the number of categories present in the dataset. As a result, One Hot encoding can lead to the curse of dimensionality, which is a drawback of this encoding method.
GEL Encoding– GEL encoding is an embedding technique that can be used in supervised and unsupervised learning methods. It is based on the principle of One Hot encoding and can be used to decrease the dimensionality of categorical features that have been encoded using One Hot encoding.
Embedding Layer – Word embeddings provide a way to use a compact and dense representation in which similar words have similar encodings. An embedding is a dense vector of floating-point values that are trainable parameters. Word embeddings can range from 8-dimensional (for small datasets) to 1024-dimensional (for large datasets).
A higher dimensional embedding can capture more detailed relationships between words, but it requires more data to learn. The embedding layer is a lookup table that converts each word present in the matrix to a vector of a specific size.
Unsupervised Anomaly Detection Models
In the real world, data is not labeled in most cases, and labeling data is expensive and time-consuming. Therefore, we will use unsupervised models for our evaluations.
SOM –The Self-Organizing Map (SOM) is a competitive learning method where the neurons’ weights are updated competitively rather than using backpropagation learning. SOM consists of a map of neurons, each with a weight vector of the same size as the input vector. The weight vector is initialized with random weights before training starts. During the training, each input is compared to the neurons of the map based on a distance metric (e.g., Euclidean distance) and is mapped to the Best Matching Unit (BMU), which is the neuron with the minimum distance to the input vector.
The weights of the BMU are updated with the weights of the input vector, and the neighboring neurons are updated based on the neighborhood radius (sigma). Since the neurons compete with each other to be the best matching unit, this process is known as competitive learning. In the end, the neurons for normal samples are closer than the anomalous ones. Anomaly scores are defined by the quantization error, which is the difference between the input sample and the weights of the best matching unit. A higher quantization error indicates a higher probability of the sample being an anomaly.
DAGMM – The Deep Autoencoding Gaussian Mixture Model (DAGMM) is a density estimation method that assumes that anomalies lie in a low-probability region. The network is divided into two parts: a compression network, which is used to project data into lower dimensions using an autoencoder, and an estimation network, which is used to estimate the parameters of the Gaussian mixture model. DAGMM estimates k number of Gaussian mixtures, where k can be any number from 1 to N (the number of data points), and it is assumed that normal points lie in a high-density region, meaning that the probability of being sampled from a Gaussian mixture is higher for normal points than for anomalous samples. Anomaly scores are defined by the estimated energy of the sample.
RSRAE –The Robust Surface Recovery Layer for Unsupervised Anomaly Detection is a reconstruction error method that first projects the data to a lower dimension using an autoencoder. The latent representation is then subjected to an orthogonal projection onto a linear subspace that is robust to outliers. The decoder then reconstructs the output from the linear subspace. In this method, a higher reconstruction error indicates a higher probability of the sample being an anomaly.
SOM-DAGMM- A Self-organizing Map (SOM) – Deep Autoencoding Gaussian Mixture Model (DAGMM) is also a density estimation model. Like DAGMM, it also estimates the probability distribution of normal data points and classifies a data point as an anomaly if it has a low probability of being sampled from the learned distribution. The main difference between SOM-DAGMM and DAGMM is that SOM-DAGMM includes the normalized coordinates of SOM for the input sample, which provides the missing topological information in the case of DAGMM to the estimation network. The objective is also similar to DAGMM in that anomaly scores are defined by the estimated energy of the sample, and low energy indicates a higher probability of the sample as an anomaly.
Next, we will address the challenge of handling categorical attributes.
How Does Encoding Categorical Attributes Impact the Models?
To understand the impact of different encodings on datasets, we will use t-SNE to visualize the low-dimensional representations of the data for different encodings. t-SNE projects high-dimensional data into a lower-dimensional space, making it easier to visualize. By comparing the t-SNE visualizations and numerical results of different encodings of the same dataset, the difference is observed in the resulting representations and understanding of the impact of the encoding on the dataset.
t-SNE representation of the Car Insurance dataset
Since it is a small dataset, even after One Hot encoding, the data is loosely bound and clusters far apart.
One Hot encoding and GEL encoding have the best performance for the dataset.
t-SNE representation of the Vehicle Insurance dataset
The data is closer to each other because the number of rows is higher than in the Car Insurance dataset. It becomes difficult to separate with increased dimensionality in One Hot encoding.
GEL encoding is better than One Hot encoding in all cases except DAGMM.
t-SNE representation of the Vehicle Claims dataset
The data is tightly bound in all cases, making it difficult to separate with increased dimensionality. This is one of the reasons for the poor performance of models due to increased dimensionality.
SOM outperforms all other models for this dataset. Still, the embedding layer is more suitable in most cases, which allows us an alternative to encoding categorical attributes for anomaly detection.
Conclusion
This article presents a brief overview of auditing data, anomaly detection, and categorical encodings. It is important to understand that handling categorical attributes in auditing data is challenging. By understanding the impact of encoding the attributes on models, we can improve anomaly detection accuracy in the datasets. The key takeaways from this article are:
As the size of data increases, it is important to use alternative encoding approaches for categorical attributes, like GEL encoding and Embedding layers, because One Hot encoding is unsuitable.
One model does not work for all datasets. For tabular datasets, domain knowledge is extremely important.
The choice of encoding method depends upon the choice of model.
The code for the evaluation of models is available on GitHub.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Detecting Fraud and Anomalies with Unsupervised Learning, Graph Neural Networks, Graph Representation Learning, and Bayesian Inference. Like to experiment with new open-source tools and contribute to Machine Learning and Artificial Intelligence community.
We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.
Show details
Powered By
Cookies
This site uses cookies to ensure that you get the best experience possible. To learn more about how we use cookies, please refer to our Privacy Policy & Cookies Policy.
brahmaid
It is needed for personalizing the website.
csrftoken
This cookie is used to prevent Cross-site request forgery (often abbreviated as CSRF) attacks of the website
Identityid
Preserves the login/logout state of users across the whole site.
sessionid
Preserves users' states across page requests.
g_state
Google One-Tap login adds this g_state cookie to set the user status on how they interact with the One-Tap modal.
MUID
Used by Microsoft Clarity, to store and track visits across websites.
_clck
Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_clsk
Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.
SRM_I
Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
SM
Use to measure the use of the website for internal analytics
CLID
The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.
SRM_B
Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
_gid
This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.
_ga_#
Used by Google Analytics, to store and count pageviews.
_gat_#
Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.
collect
Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.
AEC
cookies ensure that requests within a browsing session are made by the user, and not by other sites.
G_ENABLED_IDPS
use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.
test_cookie
This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.
_we_us
this is used to send push notification using webengage.
WebKlipperAuth
used by webenage to track auth of webenagage.
ln_or
Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.
JSESSIONID
Use to maintain an anonymous user session by the server.
li_rm
Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.
AnalyticsSyncHistory
Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.
lms_analytics
Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.
liap
Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.
visit
allow for the Linkedin follow feature.
li_at
often used to identify you, including your name, interests, and previous activity.
s_plt
Tracks the time that the previous page took to load
lang
Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings
s_tp
Tracks percent of page viewed
AMCV_14215E3D5995C57C0A495C55%40AdobeOrg
Indicates the start of a session for Adobe Experience Cloud
s_pltp
Provides page name value (URL) for use by Adobe Analytics
s_tslv
Used to retain and fetch time since last visit in Adobe Analytics
li_theme
Remembers a user's display preference/theme setting
li_theme_set
Remembers which users have updated their display / theme preferences
We do not use cookies of this type.
_gcl_au
Used by Google Adsense, to store and track conversions.
SID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SAPISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
__Secure-#
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
APISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
HSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
DV
These cookies are used for the purpose of targeted advertising.
NID
These cookies are used for the purpose of targeted advertising.
1P_JAR
These cookies are used to gather website statistics, and track conversion rates.
OTZ
Aggregate analysis of website visitors
_fbp
This cookie is set by Facebook to deliver advertisements when they are on Facebook or a digital platform powered by Facebook advertising after visiting this website.
fr
Contains a unique browser and user ID, used for targeted advertising.
bscookie
Used by LinkedIn to track the use of embedded services.
lidc
Used by LinkedIn for tracking the use of embedded services.
bcookie
Used by LinkedIn to track the use of embedded services.
aam_uuid
Use these cookies to assign a unique ID when users visit a website.
UserMatchHistory
These cookies are set by LinkedIn for advertising purposes, including: tracking visitors so that more relevant ads can be presented, allowing users to use the 'Apply with LinkedIn' or the 'Sign-in with LinkedIn' functions, collecting information about how visitors use the site, etc.
li_sugr
Used to make a probabilistic match of a user's identity outside the Designated Countries
MR
Used to collect information for analytics purposes.
ANONCHK
Used to store session ID for a users session to ensure that clicks from adverts on the Bing search engine are verified for reporting purposes and for personalisation
We do not use cookies of this type.
Cookie declaration last updated on 24/03/2023 by Analytics Vidhya.
Cookies are small text files that can be used by websites to make a user's experience more efficient. The law states that we can store cookies on your device if they are strictly necessary for the operation of this site. For all other types of cookies, we need your permission. This site uses different types of cookies. Some cookies are placed by third-party services that appear on our pages. Learn more about who we are, how you can contact us, and how we process personal data in our Privacy Policy.
Great article Ajay!