Explore the power of TensorFlow Keras preprocessing layers! This article will show you the tools that TensorFlow Keras gives you to get your data ready for neural networks quickly and easily. Keras’s flexible preprocessing layers are extremely handy when working with text, numbers, or images. We’ll examine the importance of these layers and how they simplify the process of preparing data, including encoding, normalization, resizing, and augmentation.
The TensorFlow-Keras preprocessing layers API allows developers to construct input processing pipelines that seamlessly integrate with Keras models. These pipelines are adaptable for use both within Keras workflows and as standalone preprocessing routines in other frameworks. They can be effortlessly combined with Keras models, ensuring efficient and unified data handling. Additionally, these preprocessing pipelines can be saved and exported as part of a Keras SavedModel, facilitating easy deployment and sharing of models.
Preprocessing plays a crucial role in the data preparation pipeline, before the data is fed into the neural network model. With Keras preprocessing layers you can construct end-to-end model pipelines that incorporate phases for both data preparation and model training. By combining the entire workflow into a single Keras model, this feature simplifies the development process and promotes reproducibility.
There are two approaches to using these preprocessing layers. Let us explore them.
The first approach is to incorporate preprocessing layers directly into the model architecture. This involves integrating preprocessing steps as part of the model’s computational graph, ensuring that data transformations occur synchronously with the rest of the model execution. This approach leverages the computational power of devices such as GPUs, enabling efficient preprocessing alongside model training. It is particularly advantageous for operations like normalization, image preprocessing, and data augmentation, which benefit most from GPU acceleration.
The second approach is to apply preprocessing inside the input data pipeline. Here the preprocessing runs asynchronously on the CPU, and the preprocessed data is buffered before being fed into the model. By using techniques such as dataset mapping and prefetching, preprocessing can proceed efficiently in parallel with model training, optimizing overall performance; the sketch below shows the idea. This approach is a natural fit for layers such as TextVectorization.
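To make the contrast concrete, here is a minimal sketch of the second approach (the layer, text, and labels are made up for illustration):

import tensorflow as tf

# Hypothetical example: an adapted TextVectorization layer and a tiny (text, label) dataset
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=20)
vectorize_layer.adapt(["sample text one", "sample text two"])

dataset = tf.data.Dataset.from_tensor_slices((["sample text one", "sample text two"], [0, 1]))

# Preprocessing runs on the CPU inside the input pipeline, in parallel with training;
# prefetching buffers preprocessed batches so the accelerator is never starved
dataset = dataset.batch(2)
dataset = dataset.map(lambda x, y: (vectorize_layer(x), y), num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)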
Image preprocessing layers, such as tf.keras.layers.Resizing, tf.keras.layers.Rescaling, and tf.keras.layers.CenterCrop, prepare image inputs by resizing, rescaling, and cropping them to standardized dimensions and ranges.
Image data augmentation layers, like tf.keras.layers.RandomCrop, tf.keras.layers.RandomFlip, tf.keras.layers.RandomTranslation, tf.keras.layers.RandomRotation, tf.keras.layers.RandomZoom, and tf.keras.layers.RandomContrast, introduce random transformations to augment the training data, enhancing the model’s robustness and generalization.
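One detail worth knowing: these augmentation layers are only active during training; at inference time they pass inputs through unchanged unless called with training=True. A tiny illustration on random data:

import tensorflow as tf

# A single random image stands in for real data
augment = tf.keras.layers.RandomFlip("horizontal")
image = tf.random.uniform((1, 224, 224, 3))

out_inference = augment(image, training=False)  # no augmentation at inference time
out_training = augment(image, training=True)    # the random flip may be applied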
Let us use these layers on the emergency vehicle classification dataset from Kaggle to learn how they can be implemented (note that label 1 indicates the presence of an emergency vehicle).
import pandas as pd
import numpy as np
import cv2

# Load the training labels
data = pd.read_csv('/kaggle/input/emergency-vehicles-identification/Emergency_Vehicles/train.csv')
data.head()

# Read each training image into memory
x = []
for i in data.image_names:
    img = cv2.imread('/kaggle/input/emergency-vehicles-identification/Emergency_Vehicles/train/' + i)
    x.append(img)
x = np.array(x)
y = data['emergency_or_not']
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Resizing, Rescaling, Conv2D,
                                     MaxPooling2D, Flatten, Dense)
target_size = (224, 224)
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
    tf.keras.layers.RandomRotation(factor=0.2),
    tf.keras.layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    tf.keras.layers.RandomContrast(factor=0.2)
])
# Define the model
model = Sequential([
    Input(shape=(target_size[0], target_size[1], 3)),  # Define input shape
    Resizing(*target_size),
    Rescaling(1./255),
    data_augmentation,
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Display model summary
model.summary()
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=45, test_size=0.3, shuffle=True, stratify=y)
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10)
# Load the test images
data = pd.read_csv('/kaggle/input/emergency-vehicles-identification/Emergency_Vehicles/test.csv')
x_test = []
for i in data.image_names:
    img = cv2.imread('/kaggle/input/emergency-vehicles-identification/Emergency_Vehicles/test/' + i)
    x_test.append(img)
x_test = np.array(x_test)

# Predict probabilities and threshold at 0.5
y_preds = model.predict(x_test)
y_predictions = [1 if p > 0.5 else 0 for p in y_preds]
import matplotlib.pyplot as plt

# Create a figure and axes outside the loop
fig, axes = plt.subplots(2, 2, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(x_test[i][:, :, ::-1])  # Convert BGR (OpenCV) to RGB for display
    if y_predictions[i] == 1:
        ax.set_title("Emergency")
    else:
        ax.set_title("Non-Emergency")
    ax.axis('off')
plt.tight_layout()
plt.show()
For text preprocessing we use tf.keras.layers.TextVectorization, which turns raw text into an encoded representation that can be fed directly to an Embedding layer or a Dense layer.
Let me demonstrate the use of TextVectorization using the Tweets dataset from Kaggle:
import pandas as pd
import tensorflow as tf
import re
# Read the CSV file into a pandas DataFrame
data = pd.read_csv('train.csv')

# Define a function to remove special characters from text
def remove_special_characters(text):
    pattern = r'[^a-zA-Z0-9\s]'
    cleaned_text = re.sub(pattern, '', text)
    return cleaned_text

# Apply the remove_special_characters function to the 'tweet' column
data['tweet'] = data['tweet'].apply(remove_special_characters)
# Drop the 'id' column
data.drop(['id'], axis=1, inplace=True)
# Define the TextVectorization layer
preprocessing_layer = tf.keras.layers.TextVectorization(
    max_tokens=100,             # Adjust the number of tokens as needed
    output_mode='int',          # Output integers representing tokens
    output_sequence_length=10   # Adjust the sequence length as needed
)
# Adapt the TextVectorization layer to the training text (this builds the vocabulary)
preprocessing_layer.adapt(data['tweet'].values)
# Convert the pandas DataFrame to a TensorFlow Dataset
dataset = tf.data.Dataset.from_tensor_slices((data['tweet'].values, data['label'].values))
# Apply the preprocessing layer to the text and give the labels a feature axis
dataset = dataset.map(lambda x, y: (preprocessing_layer(x), tf.expand_dims(y, -1)))

# Split into training and validation sets
train_size = int(0.8 * data.shape[0])
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

# Batch and prefetch the data for efficient processing
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(preprocessing_layer.get_vocabulary()) + 1, output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),  # Pool over the sequence dimension so the model outputs one score per tweet
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy')
history = model.fit(train_dataset, epochs=10, validation_data=val_dataset)
The TextVectorization layer is exposed to the training data through the adapt() method: preprocessing layers like this are non-trainable, so their state must be set before model training. adapt() lets the layer analyze the training data and configure its internal state accordingly. Once adapted, the same layer can be reused on the test data later on.
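As a toy illustration of this adapt-then-reuse pattern (the sentences are made up):

import tensorflow as tf

layer = tf.keras.layers.TextVectorization(output_mode='int', output_sequence_length=5)
# The state (the vocabulary) is computed from the training text only
layer.adapt(["fire truck on the road", "a quiet street"])
# The same adapted layer is then reused on unseen test text
print(layer(["fire on a quiet road"]).numpy())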
tf.data.AUTOTUNE dynamically adjusts the data processing operations in TensorFlow to maximize CPU utilization. Applying prefetching to the pipeline lets the system automatically tune the number of elements to prefetch, optimizing performance during training and validation.
Let’s compare TextVectorization with the Tokenizer class from tf.keras.preprocessing.text, another way to convert text to numerical values:
import tensorflow as tf
# Define the sample text data
text_data = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks loudly in the night.",
    "A brown cat sleeps peacefully on the windowsill."
]
# Define TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(output_mode='int', output_sequence_length=10)
# Adapt the TextVectorization layer to the text data
vectorizer.adapt(text_data)
# Vectorize the text data
vectorized_text = vectorizer(text_data)
print("Vectorized Text (using TextVectorization):")
print(vectorized_text.numpy())
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Initialize Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
# Convert text to matrix using texts_to_matrix
matrix = tokenizer.texts_to_matrix(text_data, mode='count')
print("\nMatrix (using texts_to_matrix):")
print(matrix)
At first glance we can see that the output dimensions differ: TextVectorization returns one padded integer sequence per sentence (here 3 × 10, one token id per position), whereas texts_to_matrix with mode='count' returns a document-term matrix with one row per sentence and one column per vocabulary word, each entry counting how often that word occurs.
Other Preprocessing Layers in TensorFlow Keras
Numeric preprocessing layers such as tf.keras.layers.Normalization and tf.keras.layers.Discretization can easily be implemented in the following way:
import numpy as np
import tensorflow as tf
import keras
from keras import layers
data = np.array(
    [
        [0.1, 0.4, 0.8],
        [0.8, 0.9, 1.0],
        [1.5, 1.6, 1.7],
    ]
)
layer = layers.Normalization()
layer.adapt(data)
normalized_data = layer(data)
print("Normalized features: ", normalized_data)
print()
print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))
data = np.array([[-1.5, 1.0, 3.4, .5], [0.0, 3.0, 1.3, 0.0]])
layer = tf.keras.layers.Discretization(num_bins=4, epsilon=0.01)
layer.adapt(data)
print(layer(data))
The Normalization layer shifts and scales each feature to have a mean close to 0 and a standard deviation close to 1, which is the hallmark of standardized data.
It is worth noting that, instead of calling adapt(), we can supply the normalization statistics ourselves through the layer’s mean and variance arguments.
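Here is a short sketch of supplying the statistics by hand (the values are illustrative):

import tensorflow as tf

# Supply the statistics directly: output = (input - mean) / sqrt(variance)
layer = tf.keras.layers.Normalization(mean=0.8, variance=0.25)
print(layer(tf.constant([[0.3], [0.8], [1.3]])).numpy())  # [[-1.], [0.], [1.]]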
Coming to the output of the latter code: when Discretization is adapted with num_bins, the bin boundaries are learned from the quantiles of the data, so each bin receives roughly the same number of samples. In the first row, the first feature -1.5 belongs to bin 0, the second feature 1.0 belongs to bin 2, the third feature 3.4 belongs to bin 3, and the fourth feature 0.5 belongs to bin 2.
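If fixed, hand-chosen boundaries are preferred over learned ones, Discretization also accepts them directly; a small sketch with illustrative boundaries:

import tensorflow as tf

# Explicit boundaries define the bins (-inf, 0), [0, 1), [1, 2), [2, +inf)
layer = tf.keras.layers.Discretization(bin_boundaries=[0., 1., 2.])
print(layer(tf.constant([[-1.5, 1.0, 3.4, 0.5]])).numpy())  # [[0 2 3 1]]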
Let’s explore how to preprocess categorical features:
import tensorflow as tf
# Sample data
data = [3, 2, 0, 1]
# Category encoding
encoder_layer = tf.keras.layers.CategoryEncoding(num_tokens=4, output_mode="one_hot")
category = encoder_layer(data)
print("Category Encoding:")
print(category)
# Feature hashing
hashing_layer = tf.keras.layers.Hashing(num_bins=3)
data = [['A'], ['B'], ['C'], ['D'], ['E']]
hashed = hashing_layer(data)
print(hashed)
The elements in the matrix are float values representing the one-hot encoding of each category.
For example, the first row [0. 0. 0. 1.] corresponds to the category 3 (as indexing starts from 0), indicating that the original data item was 3.
Each element of the hashing output represents the bin assigned to the corresponding item.
For example, the first row [1] indicates that the hashing algorithm assigned the first item to bin 1.
Similarly, the second row [0] indicates that the second item was assigned to bin 0.
There are multiple applications of TF-Keras preprocessing layers. Let us look into a few of the most important ones:
By integrating preprocessing layers into the model itself, it becomes easier to export an inference-only end-to-end model. This ensures that all the necessary preprocessing steps are encapsulated within the model, making it portable.
Users of the model don’t need to worry about the details of how each feature is preprocessed, encoded, or normalized. Whether it’s raw images or structured data, the inference model can handle them seamlessly without requiring users to understand the preprocessing pipelines.
Exporting models to other runtimes, such as TensorFlow.js, becomes more straightforward when the model includes preprocessing layers within it. There’s no need to reimplement the preprocessing pipeline in the target language or framework.
With preprocessing layers integrated into the model, the inference model can directly process raw data. This is advantageous as it simplifies the deployment process and eliminates the need for users to preprocess data separately before feeding it into the model.
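As a rough sketch of such an inference-only, end-to-end export (the layer sizes, training text, and file name here are illustrative; on older TensorFlow versions you may save to the SavedModel directory format instead):

import tensorflow as tf

# Preprocessing baked into the model: raw strings in, probabilities out
vectorize = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=20)
vectorize.adapt(["example training text", "another example"])

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorize(inputs)
x = tf.keras.layers.Embedding(input_dim=len(vectorize.get_vocabulary()) + 1, output_dim=16)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
end_to_end_model = tf.keras.Model(inputs, outputs)

# The saved artifact carries the preprocessing with it
end_to_end_model.save('end_to_end_model.keras')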
Preprocessing layers are compatible with the tf.distribute API, enabling training across multiple machines or workers. For optimal performance, place these layers inside a tf.distribute.Strategy.scope().
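A minimal sketch of that placement, assuming a MirroredStrategy and toy data:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Create (and adapt) the preprocessing layer inside the strategy scope
    norm = tf.keras.layers.Normalization()
    norm.adapt(tf.constant([[0.1], [0.8], [1.5]]))
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        norm,
        tf.keras.layers.Dense(1),
    ])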
The text can be encoded using different schemes such as multi-hot encoding or TF-IDF weighting. These preprocessing steps can be included within the model, simplifying the deployment process.
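For example, a small sketch using output_mode='tf_idf' (the sentences are made up):

import tensorflow as tf

texts = ["the cat sat", "the dog barked", "the cat and the dog"]
# Each text becomes one fixed-length vector of TF-IDF weights over the vocabulary
tfidf_layer = tf.keras.layers.TextVectorization(output_mode='tf_idf')
tfidf_layer.adapt(texts)
print(tfidf_layer(texts).numpy())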
The TensorFlow Keras preprocessing layers API empowers developers to create Keras-native input processing pipelines. It facilitates building end-to-end models that handle raw data, perform feature normalization, and apply categorical feature encoding or hashing. You can integrate these preprocessing layers, adaptable to training data, directly into Keras models or employ them independently. Whether processed within the model or as part of the dataset, these functionalities enhance model portability and mitigate training/serving discrepancies, offering flexibility and efficiency in model deployment across diverse environments.
Q. How do I use TensorFlow preprocessing layers?
A. To utilize TensorFlow preprocessing layers, you can employ the tensorflow.keras.layers module. First, import the layers needed for your preprocessing tasks, such as Normalization and TextVectorization.
Q. Can I create custom preprocessing layers in Keras?
A. Yes, you can define custom layers in Keras by subclassing tf.keras.layers.Layer and implementing the __init__ and call methods to specify the layer’s configuration and computation, respectively.
Q. What preprocessing tasks do TensorFlow Keras preprocessing layers support?
A. TensorFlow Keras preprocessing layers support a wide range of preprocessing tasks, including:
- Normalization and standardization of numerical features.
- Encoding categorical features using one-hot encoding, integer encoding, or embeddings.
- Text vectorization for natural language processing tasks.
- Handling missing values and feature scaling.
- Feature discretization and bucketization.
- Image preprocessing such as resizing, cropping, and data augmentation.