In a future where jet engines can anticipate their own failures before they occur, millions of dollars and possibly lives could be saved. This research uses NASA jet engine simulation data to explore a novel approach to predictive maintenance. We explore how machine learning can assess the condition of these vital components by analyzing jet engine sensor data that records variables such as temperature and pressure. This study demonstrates the potential of artificial intelligence (AI) to revolutionize engine maintenance and improve safety by walking through data preparation, feature selection, and the use of sophisticated algorithms like Random Forest and Neural Networks. Come along as we explore the complexities of predictive modeling and data processing to anticipate engine failures before they happen.
This article was published as a part of the Data Science Blogathon.
The United States space agency, popularly known as NASA, some time ago shared a dataset containing jet engine simulation data. This data includes sensor readings from a jet engine, covering its operation from initial use until failure. It is certainly interesting to discuss how we can recognize sensor data patterns and then perform classification to determine whether a jet engine is still functioning normally or has failed. This project will explore how machine learning models analyze sensor data to predict engine health. This project follows the CRISP-DM concept, a workflow that organizes the data mining process. For more details, let’s take a look together!
This stage will explain the project’s background, define the problems faced, and outline the ultimate goal of the jet engine predictive maintenance project to address the defined issues.
Jet engines play a crucial role in NASA’s space industry, serving as the power source for vehicles like airplanes by generating thrust. Given their importance, we need to analyze and predict engine health to determine whether an engine is functioning normally or requires maintenance. The aim is to avoid sudden engine failure that could endanger the vehicle. One way to monitor engine performance is with sensors, which measure variables such as temperature, rotation speed, pressure, and vibration. This project therefore analyzes sensor data to predict engine health before the engine actually fails.
Not knowing the machine’s health can potentially lead to sudden machine failure during use.
Classify machine health into normal or failure categories based on sensor data.
This stage is about getting to know the data. It loads the data and displays the initial dataset before further processing.
The dataset used in this project comes from the CMAPSS Jet Engine Simulated Data. It consists of several files, broadly grouped into three categories: train, test, and RUL. This project uses only the training data, train_FD001.txt, which has 26 columns and 20,631 rows.
| Parameters | Symbol | Description | Unit |
|------------|--------|-------------|------|
| Engine | – | – | – |
| Cycle | – | – | t |
| Setting 1 | – | Altitude | ft |
| Setting 2 | – | Mach Number | M |
| Setting 3 | – | Sea-level Temperature | °F |
| Sensor 1 | T2 | Total temperature at fan inlet | °R |
| Sensor 2 | T24 | Total temperature at LPC outlet | °R |
| Sensor 3 | T30 | Total temperature at HPC outlet | °R |
| Sensor 4 | T50 | Total temperature at LPT outlet | °R |
| Sensor 5 | P2 | Pressure at fan inlet | psia |
| Sensor 6 | P15 | Total pressure in bypass-duct | psia |
| Sensor 7 | P30 | Total pressure at HPC outlet | psia |
| Sensor 8 | Nf | Physical fan speed | rpm |
| Sensor 9 | Nc | Physical core speed | rpm |
| Sensor 10 | epr | Engine pressure ratio | – |
| Sensor 11 | Ps30 | Static pressure at HPC outlet | psia |
| Sensor 12 | phi | Ratio of fuel flow to Ps30 | pps/psi |
| Sensor 13 | NRf | Corrected fan speed | rpm |
| Sensor 14 | NRe | Corrected core speed | rpm |
| Sensor 15 | BPR | Bypass ratio | – |
| Sensor 16 | farB | Burner fuel-air ratio | – |
| Sensor 17 | htBleed | Bleed enthalpy | – |
| Sensor 18 | Nf_dmd | Demanded fan speed | rpm |
| Sensor 19 | PCNfR_dmd | Demanded corrected fan speed | rpm |
| Sensor 20 | W31 | HPT coolant bleed | lbm/s |
| Sensor 21 | W32 | LPT coolant bleed | lbm/s |
We can check the dimensions and view raw data before processing it further.
import pandas as pd
# Read dataset files and convert to dataframes
data = pd.read_csv("/content/train_FD001.txt", sep=" ", header=None)
# Show dataset dimension
print("Shape of data :", data.shape)
# Show initial data
data
Notes:
From the dataset, you can see that the column names are not representative (they are still numbers) and that the last two columns contain only NaN (Not a Number) values. The data therefore needs further cleaning, which is performed during the data preparation stage.
This stage cleans the data, producing a clean dataset ready for the machine learning modeling process. The term Garbage In, Garbage Out (GIGO) captures the idea: if a model is trained on garbage data, it will produce a garbage model, one that performs poorly at prediction. To avoid this, a data preparation process is needed. Some of the processes carried out at this stage include:
Remove the NaN columns from the dataset because they carry no information. In addition, it is important to rename the columns to make them easier to read and more representative.
# Remove NaN values from the last 2 columns of the dataset
data.drop(columns=[26, 27], inplace=True)
# List the column names according to the dataset description
columns = [
'engine', 'cycle', 'setting1', 'setting2', 'setting3', 'sensor1',
'sensor2', 'sensor3', 'sensor4', 'sensor5', 'sensor6', 'sensor7',
'sensor8', 'sensor9', 'sensor10', 'sensor11', 'sensor12', 'sensor13',
'sensor14', 'sensor15', 'sensor16', 'sensor17', 'sensor18', 'sensor19',
'sensor20', 'sensor21'
]
# Rename a column in the dataset
data.columns = columns
Renaming the columns according to the dataset description makes the meaning of the predictors easier to understand. The dataset now has 26 columns (predictors).
This process computes statistical details from the data, such as the mean, standard deviation, minimum, Q1, median (Q2), Q3, and maximum for each column.
# View statistics of the dataset
data.describe().transpose()
The data reveals that several predictors have identical min and max values, meaning those predictors are constant: they hold the same value in every row. Constant predictors cannot affect the target, so we remove them to reduce computation time. Here is the function to remove constant-value columns.
def drop_constant_value(dataframe):
    '''
    Function:
    - Deletes constant-value columns in the dataset.
    - A constant value is one that is the same for all rows in the dataset.
    - A column is considered constant if its minimum (min) and maximum (max) values are the same.
    Args:
        dataframe -> dataset to validate
    Returned value:
        dataframe -> dataset cleared of constant-value columns
    '''
    # Temporary list to store names of columns with a constant value
    constant_column = []
    # Find constant columns by comparing the minimum and maximum values
    for col in dataframe.columns:
        min_val = dataframe[col].min()  # avoid shadowing the built-in min()
        max_val = dataframe[col].max()  # avoid shadowing the built-in max()
        # Record the column name if the min and max values are equal
        if min_val == max_val:
            constant_column.append(col)
    # Delete the columns with constant values
    dataframe.drop(columns=constant_column, inplace=True)
    # Return the cleaned data
    return dataframe

# Call the function to drop constant-value columns
data = drop_constant_value(data)
data
After the constant-value removal, 19 of the original 26 columns remain, which shows that 7 predictors had constant values.
Since this is a classification task and the dataset doesn’t have a target column, it is necessary to create a target column manually. We will create a target that classifies the machine as either normal or failed (binary classification). In this project, we will label normal status as 0 and failure as 1.
We use a threshold value of 20 cycles to determine whether a cycle is labeled failure or normal. The value is subjective; we chose 20 so that an engine is flagged 20 cycles before complete failure, giving technicians time to inspect the engine earlier and prepare a replacement, which helps avoid sudden engine failure during use. For each engine, every cycle greater than (maximum cycle - threshold) is labeled as failure. For example, if engine 1 has a maximum cycle of 120, then cycles 101 to 120 are labeled as failure. Here is the function that creates the machine status label.
def assign_label(data, threshold):
    '''
    Function:
    - Labeling the dataset
    Args:
    - data -> dataset to be labeled
    - threshold -> threshold value of cycles before failure
    Return:
    - data -> labeled dataset
    '''
    # Iterate over all 100 engines in FD001
    for i in range(1, 101):
        # Get the max cycle of each engine
        max_cycle = data.loc[(data['engine'] == i), 'cycle'].max()
        # Determine from which cycle onwards a row is labeled as failure
        start_warning = max_cycle - threshold
        # Assign label 1 (failure) to the last `threshold` cycles
        data.loc[(data['engine'] == i) & (data['cycle'] > start_warning), 'status'] = 1
    # Assign label 0 (normal) to all remaining rows
    data['status'] = data['status'].fillna(0)
    # Return the labeled dataset
    return data
# Determine the threshold value
threshold = 20
# Call assign_label function to get label
data = assign_label(data, threshold)
# Show data after labelling
data
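As a quick sanity check (not part of the original walkthrough), we can verify that every engine received exactly `threshold` failure labels:

# Sanity check: each engine should have exactly `threshold` cycles labeled 1
failure_per_engine = data.groupby('engine')['status'].sum()
print(failure_per_engine.unique())  # expected: [20.]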
The strength of influence, known as the correlation value, is commonly divided into 5 categories based on its absolute value (a common convention): very weak (0.00-0.20), weak (0.21-0.40), moderate (0.41-0.60), strong (0.61-0.80), and very strong (0.81-1.00). We will use a heatmap visualization to see the correlation between the predictors and the target, with a threshold value of 0.20 in this project.
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap for checking the correlation
threshold = 0.2
plt.figure(figsize=(12, 10))
sns.set(font_scale=0.7)
sns.set_style("whitegrid", {"axes.facecolor": ".0"})
cluster = data.corr()
# Mask out cells whose absolute correlation is below the threshold
mask = cluster.where((abs(cluster) >= threshold)).isna()
sns.heatmap(cluster,
            cmap='RdYlBu',
            annot=True,
            mask=mask,
            linewidths=0.2,
            linecolor='lightgrey').set_facecolor('white')
plt.title("Feature Correlation using Heatmap")
plt.show()
The heatmap visualization displays only predictors with an absolute correlation greater than or equal to the threshold. We use a threshold of 0.2 because, for this project, correlations below 0.2 are considered too weak to be useful.
A negative correlation indicates that a predictor moves in the opposite direction to another predictor. For example, sensor 2 and sensor 7 have a correlation value of -0.7: when the value of sensor 2 increases, the value of sensor 7 tends to decrease, and vice versa. The larger the absolute correlation, the stronger the relationship between the two variables. The absolute correlation value ranges from 0 (no correlation) to 1 (very strong correlation).
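To spot-check a single pair from the heatmap, we can compute the correlation directly; a minimal sketch, assuming both sensor columns survived the constant-value removal:

# Pairwise Pearson correlation between sensor 2 and sensor 7
print(data[['sensor2', 'sensor7']].corr())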
In some cases, not all predictors (columns) in the dataset have a strong enough influence on the target. For this reason, a feature selection process is needed to remove features with little influence. The goal is to reduce the time and computational load of the learning process. As in the previous stage, a threshold of 0.2 is used: predictors with an absolute correlation below 0.2 against the target will be removed. Here is the code for feature selection.
# Show predictors whose absolute correlation with the target >= threshold
correlation = data.corr()
relevant_features = correlation[abs(correlation['status']) >= threshold]
relevant_features['status']
# Keep the relevant features (correlation value >= threshold)
list_relevant_features = list(relevant_features.index[1:])
# Applying feature selection
data = data[list_relevant_features]
After the feature selection process, we are left with 15 columns consisting of 14 predictors and 1 target.
The next step is to look at the proportion of classes in the dataset. We will look at the proportion of normal (0) and failure (1) classes. This is done to determine the balance of the dataset.
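The plotting code for this step is not shown in the original article; here is a minimal sketch that produces the proportion plot described below, reusing the seaborn and matplotlib imports from the heatmap step:

# Count and plot the number of cycles per class before sampling
sns.countplot(x='status', data=data)
plt.title("Class proportion before sampling")
plt.show()
print("0 (normal) : ", len(data[data['status'] == 0]), " rows")
print("1 (failure): ", len(data[data['status'] == 1]), " rows")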
The visualization above shows that the dataset contains 18,631 cycles classified as normal and 2,000 classified as failure, so the minority class makes up 9.7% of the total dataset. This is known as an imbalanced dataset, and since the proportion falls into the moderate category, a sampling process is needed to increase the number of minority data points. The article about imbalanced datasets can be seen here.
Before balancing the data (sampling process), first divide it into two parts: train data and test data. Use the train data to build machine learning models and the test data to evaluate the performance of the resulting models.
In this project, we will use an 80:20 scheme for splitting the data, meaning 80% of the data is used for training and 20% for testing. There is no fixed rule for choosing this split; projects commonly use 60:40, 70:30, 75:25, 80:20, or 90:10 schemes, but the test data should never exceed the train data. Additionally, we will divide the data into predictor columns (prefix X) and target columns (prefix y).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# Determine predictor (X) and target (y)
X = data.iloc[:,:-1]
y = data.iloc[:,-1:]
# Split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Change y_train into 1 dimension form
y_train = y_train.squeeze()
After the dataset is divided, we look at the number of train data and test data by using the shape function.
# Check dimension of data train and test
print("Shape of train : ", X_train.shape)
print("Shape of test : ", X_test.shape)
Out of the total 20,631 data points in the dataset, we will use 16,504 for training and 4,127 for testing. The number 14 signifies the 14 predictors that will be analyzed for patterns during the learning process.
The sampling process is used to overcome the problem of unbalanced datasets. The purpose of this process is to balance the proportion of classes in the dataset so that the normal and failure classes will have the same amount of data. This will make the machine learning model sensitive to both classes of data (normal and failure) not just to one of them.
To prevent data leakage from the test data, you should perform the sampling process only on the train data. Therefore, in the previous stage, we first divided the data into training and testing sets.
In this project, we will use the oversampling technique to generate synthetic data for the minority class (failure) to match the number of samples in the majority class (normal). The algorithm used is Synthetic Minority Oversampling Technique (SMOTE). Read more about SMOTE at the following link.
from imblearn.over_sampling import SMOTE
# Oversampling process to overcome the imbalanced dataset
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
# Class proportion checking
# Note: `data` references the same object as X_train, so the added 'status'
# column is dropped again before scaling
data = X_train
data['status'] = y_train
sns.countplot(x='status', data=data)
plt.title("Class proportion after sampling")
plt.xlabel('Engine Status')
plt.ylabel('Number of Samples')
plt.show()
print("0: ", len(data[data['status'] == 0]), " data")
print("1: ", len(data[data['status'] == 1]), " data")
The barplot above shows that after the oversampling process, the data for normal and failure machines is balanced, with each status having 14,861 data points.
Just like the sampling process, we should perform the scaling process only on the train data to prevent data leakage from the test data. Additionally, we must scale the data after sampling, not before. Therefore, we first divide the data into train and test sets, then perform sampling, and finally apply scaling.
The scaling process equalizes the range of values across all features, which reduces the computational burden during training and can improve the performance of the resulting model. Scaling matters most when some predictors take values far larger than others.
In this project, the Z-Score method will be used for the scaling process. More information about Z-Score normalization can be found at the following link.
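As a quick illustration (a toy example, not part of the project code), the Z-Score formula z = (x - mean) / standard deviation is exactly what StandardScaler computes:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0]])
manual = (x - x.mean()) / x.std()            # z = (x - mean) / std
scaled = StandardScaler().fit_transform(x)   # StandardScaler does the same
print(np.allclose(manual, scaled))           # True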
# Convert X_train back to a dataframe with the original predictor columns
# (this also drops the 'status' column added during the proportion check)
X_train = pd.DataFrame(X_train, columns = X.columns)
# Scaling process
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Show data after scaling process
X_train_scaling = pd.DataFrame(X_train, columns = X.columns)
X_train_scaling
From the scaling results, it can be seen that all predictors have a range of data that is not much different. This will facilitate the process of building machine learning models and reduce the time and computational resources required.
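To confirm the transformation worked, here is a small check (assuming the `X_train_scaling` dataframe above) that each predictor now has a mean near 0 and a standard deviation near 1:

# Each scaled column should have mean ~0 and std ~1
print(X_train_scaling.describe().loc[['mean', 'std']])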
This stage creates the machine learning model that will later be used for prediction. It covers building the model, training it on the train data, predicting the test data, and evaluating the predictions. This stage produces a trained model that is ready for the prediction process.
Random forest is a popular classification algorithm due to its excellent performance. This article does not discuss the details of random forest, so you can read more about it in the following sources.
After the data is cleaned in the pre-processing process, the next step is to build a machine learning model. To create an ML model from random forest, we will use the library provided by scikit-learn.
# Creating object from RandomForestClassifier() class
model = RandomForestClassifier()
# Training process
model = model.fit(X_train, y_train)
# Predicting test data
y_predict = model.predict(X_test)
After successfully predicting the data using the predict() function, then we will evaluate the prediction results to find out whether the resulting model is good or not. To evaluate, we will use several measures: accuracy, precision, recall, and F1 score. First, we will use the confusion matrix to determine the values of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) before calculating these evaluation metrics. More information about confusion matrix can be seen in the following link.
# Visualize confusion matrix table
matrix = metrics.confusion_matrix(y_test, y_predict)
matrix_display = metrics.ConfusionMatrixDisplay(confusion_matrix = matrix, display_labels = ["normal", "failure"])
matrix_display.plot()
plt.grid(False)
plt.show()
The confusion matrix table above shows how many test cycles fall into each of the four outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
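For reference, the four evaluation metrics computed below follow the standard definitions based on these counts:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)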
print("Accuracy : ", metrics.accuracy_score(y_test, y_predict))
print("Precision : ", metrics.precision_score(y_test, y_predict))
print("Recall : ", metrics.recall_score(y_test, y_predict))
print("F1 Score : ", metrics.f1_score(y_test, y_predict))
The evaluation scores above summarize how well the Random Forest model distinguishes normal cycles from failure cycles on the test data.
ANN (Artificial Neural Network) is one of the machine learning algorithms that is the forerunner of deep learning. It is called neural because it mimics how neurons in the human brain transfer signals to other neurons. Further discussion about ANN can be seen in the following article.
In this project, the TensorFlow library (through its Keras API) will be used to build the ANN model. Here is the code to build the ANN architecture.
# Import library to build neural network architecture
from keras.layers import Dense, LeakyReLU
from keras.models import Sequential
# Import library for optimization
from keras.optimizers import Adam
# Import library to prevent overfitting
from keras.callbacks import EarlyStopping
from keras.regularizers import l2
# Build neural network architecture
model = Sequential()
model.add(Dense(512, input_dim=X_train.shape[1], activation = LeakyReLU(), kernel_regularizer=l2(0.01)))
model.add(Dense(256, activation = LeakyReLU(), kernel_regularizer=l2(0.01)))
model.add(Dense(128, activation = LeakyReLU(), kernel_regularizer=l2(0.01)))
model.add(Dense(1, activation = 'sigmoid'))
opt = Adam(learning_rate = 0.0001) # optimizer
model.compile(optimizer = opt,
loss = 'binary_crossentropy',
metrics=['accuracy'])
# Create a object from EarlyStopping class
earlystopper = EarlyStopping(
monitor = 'val_loss',
min_delta = 0,
patience = 5,
verbose= 1)
# Fitting network
history = model.fit(
X_train,
y_train,
epochs = 200,
batch_size = 128,
validation_split = 0.20,
verbose = 1,
callbacks = [earlystopper])
history_dict = history.history
The neural network used has the following architecture: an input layer matching the 14 predictors, three hidden Dense layers of 512, 256, and 128 neurons with LeakyReLU activations and L2 regularization (0.01), and a single sigmoid output neuron. The model is compiled with the Adam optimizer (learning rate 0.0001) and binary cross-entropy loss, then trained for up to 200 epochs with batch size 128, a 20% validation split, and early stopping on the validation loss (patience 5).
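The `history_dict` captured above is not used further in the original article; here is a minimal sketch of how it could be used to inspect training, assuming the matplotlib import from earlier:

# Plot training vs. validation loss per epoch
plt.plot(history_dict['loss'], label='train loss')
plt.plot(history_dict['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Binary cross-entropy loss')
plt.legend()
plt.show()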
After completing the training process, we will evaluate the ANN model’s performance, similar to the approach used with Random Forest. The following is the confusion matrix code from ANN.
# Predicting test data
y_predict = (model.predict(X_test) > 0.5).astype('int32')
# Show confusion matrix table
matrix = metrics.confusion_matrix(y_test, y_predict)
matrix_display = metrics.ConfusionMatrixDisplay(confusion_matrix = matrix, display_labels = ["normal", "failure"])
matrix_display.plot()
plt.grid(False)
plt.show()
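The original article does not show the metric calculations for the ANN; mirroring the Random Forest evaluation (and reusing the `metrics` import from earlier), they can be computed as follows:

# Same four metrics as in the Random Forest evaluation
print("Accuracy  : ", metrics.accuracy_score(y_test, y_predict))
print("Precision : ", metrics.precision_score(y_test, y_predict))
print("Recall    : ", metrics.recall_score(y_test, y_predict))
print("F1 Score  : ", metrics.f1_score(y_test, y_predict))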
From the evaluation scores above, the ANN’s performance can be compared directly with the Random Forest results.
This article underscores the transformative potential of machine learning in predictive maintenance for jet engines. By leveraging NASA’s comprehensive simulation data, we demonstrated how advanced algorithms like Random Forest and Neural Networks can effectively forecast engine failures, thus significantly enhancing operational safety and efficiency. The successful application of feature selection, data preparation, and sophisticated modeling techniques highlights the critical role of predictive analytics in preempting equipment failures. As we advance, these insights not only pave the way for more reliable engine maintenance strategies but also set a precedent for future innovations in predictive maintenance across various industries.
The full code is available on GitHub.
Frequently Asked Questions
Q1. What is predictive maintenance for jet engines?
A. Predictive maintenance uses data and algorithms to forecast when jet engine components might fail, allowing for timely repairs and minimizing downtime.
Q2. Why is predictive maintenance important?
A. It enhances safety, reduces unexpected failures, and lowers maintenance costs by addressing issues before they lead to significant problems.
Q3. Which machine learning models are commonly used for predictive maintenance?
A. Common models include Random Forest and Neural Networks, which analyze historical data to predict potential failures.
Q4. How does NASA’s data help with predictive maintenance?
A. NASA provides simulation data that helps develop and refine predictive maintenance algorithms for jet engines.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.