Top 4 Pre-Trained Models for Image Classification with Python Code

Purva Huilgol Last Updated : 18 Dec, 2024

The human brain can quickly recognize and tell apart objects in an image, like distinguishing a cat from a dog in milliseconds. Computer Vision aims to replicate this ability in machines. Today, machines can classify images, detect objects, recognize faces, and even generate images of people who don’t exist. Transfer Learning has played a key role in improving image classification by allowing models trained on large datasets to be reused for new tasks. This article covers four popular pre-trained models for image classification that are widely used in the industry.

What is the Image Classification Model?

Image classification involves recognizing and grouping images into distinct categories or labels according to their content. For instance, a model could categorize pictures as either “cats,” “dogs,” or “cars.” This is achieved through algorithms trained with numerous labeled images, aiding the model in identifying patterns and characteristics.

Along the way, we will walk through four pre-trained models used for image classification.

Setting Up the System

Since we started with cats and dogs, let us use the Cat and Dog images dataset. The original training dataset on Kaggle has 25,000 images of cats and dogs, and the test dataset has 12,500 unlabelled images. I have taken a much smaller dataset since we only aim to understand these models. You can run this and the rest of the code on Google Colab, so let us get started!

!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O /tmp/cats_and_dogs_filtered.zip

Let us also import the basic libraries. I will cover the model-specific imports as we reach each model:

Python Code:

import os 
import zipfile 
import tensorflow as tf 
from tensorflow.keras.preprocessing.image import ImageDataGenerator 
from tensorflow.keras import layers 
from tensorflow.keras import Model 
import matplotlib.pyplot as plt

Preparing the Dataset

Pre-trained models for image classification are usually trained on large public benchmark datasets such as ImageNet, which is the source of the weights used throughout this article.

We will first prepare our own dataset and separate out the images:

  • The extracted archive is already split into train and validation directories.
  • Each of these contains a cats directory with only cat images and a dogs directory with only dog images; we build the paths to all four below.

local_zip = '/tmp/cats_and_dogs_filtered.zip'
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')

base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')

# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')

# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')

# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')
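Before plotting anything, it can help to confirm that each directory actually contains images. The following is a minimal sketch of such a check; the helper `count_images` and the extension list are my own assumptions, and the demonstration runs on a throwaway directory so that it works outside the notebook as well.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of image file types


def count_images(directory):
    """Return the number of image files directly inside `directory`."""
    return sum(
        1 for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('cat.0.jpg', 'cat.1.jpg', 'notes.txt'):
        open(os.path.join(demo_dir, name), 'w').close()
    demo_count = count_images(demo_dir)

print(demo_count)  # 2
```

In the notebook, you would call `count_images(train_cats_dir)`, `count_images(train_dogs_dir)`, and so on for the four directories defined above.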

The following code will let us check if the images have been loaded correctly:

# Set up matplotlib fig, and size it to fit 4x4 pics
import matplotlib.image as mpimg
nrows = 4
ncols = 4

fig = plt.gcf()
fig.set_size_inches(ncols*4, nrows*4)
pic_index = 100
train_cat_fnames = os.listdir( train_cats_dir )
train_dog_fnames = os.listdir( train_dogs_dir )


next_cat_pix = [os.path.join(train_cats_dir, fname) 
                for fname in train_cat_fnames[ pic_index-8:pic_index] 
               ]

next_dog_pix = [os.path.join(train_dogs_dir, fname) 
                for fname in train_dog_fnames[ pic_index-8:pic_index]
               ]

for i, img_path in enumerate(next_cat_pix+next_dog_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()

Now that our dataset is ready, let’s move to the model-building stage. We will use four different pre-trained models on this dataset.


Now let's look at the four pre-trained models for image classification:

Very Deep Convolutional Networks for Large-Scale Image Recognition(VGG-16)

VGG-16 is one of the most popular pre-trained models for image classification. Introduced at the ILSVRC 2014 challenge, it remains a widely used baseline even today. Developed by the Visual Geometry Group at the University of Oxford, VGG-16 beat the then-standard AlexNet and was quickly adopted by researchers and the industry for their image classification tasks.

Here is the architecture of VGG-16:

Figure: VGG-16 architecture

Here is a more intuitive layout of the VGG-16 model:

Figure: VGG-16 layout

The following are the layers of the model:

  • Convolutional Layers = 13
  • Pooling Layers = 5
  • Dense Layers = 3

Let us Explore the Layers in Detail

  1. Input: Image of dimensions (224, 224, 3).
  2. Convolution Layer Conv1:
    • Conv1-1: 64 filters
    • Conv1-2: 64 filters and Max Pooling
    • Image dimensions: (224, 224)
  3. Convolution layer Conv2: Now, we increase the filters to 128
    • Input Image dimensions: (112,112)
    • Conv2-1: 128 filters
    • Conv2-2: 128 filters and Max Pooling
  4. Convolution Layer Conv3: Again, double the filters to 256, and now add another convolution layer
    • Input Image dimensions: (56,56)
    • Conv3-1: 256 filters
    • Conv3-2: 256 filters
    • Conv3-3: 256 filters and Max Pooling
  5. Convolution Layer Conv4: Similar to Conv3, but now with 512 filters
    • Input Image dimensions: (28, 28)
    • Conv4-1: 512 filters
    • Conv4-2: 512 filters
    • Conv4-3: 512 filters and Max Pooling
  6. Convolution Layer Conv5: Same as Conv4
    • Input Image dimensions: (14, 14)
    • Conv5-1: 512 filters
    • Conv5-2: 512 filters
    • Conv5-3: 512 filters and Max Pooling
    • The output dimensions here are (7, 7). At this point, we flatten the output of this layer to generate a feature vector
  7. Fully Connected/Dense FC1: 4096 nodes, generating a feature vector of size (1, 4096)
  8. Fully Connected/Dense FC2: 4096 nodes, generating a feature vector of size (1, 4096)
  9. Fully Connected/Dense FC3: 1000 nodes, one for each of the 1000 classes. This is then passed on to a Softmax activation function
  10. Output layer

As you can see, the model is sequential in nature and uses many filters. At each stage, small 3 × 3 filters are used to reduce the number of parameters. All the hidden layers use the ReLU activation function. Even then, the number of parameters is about 138 million, which makes it a slower and much larger model to train than others.
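To make the layer walkthrough concrete, here is a small sketch that recomputes the final feature-map size and the total parameter count from the layer list. It assumes 3 × 3 convolutions that preserve spatial size and 2 × 2 max pooling that halves it; the names `VGG16_LAYOUT` and `vgg16_summary` are made up for this illustration.

```python
# Filters per convolutional layer; 'M' marks a 2x2 max-pooling layer.
VGG16_LAYOUT = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
                512, 512, 512, 'M', 512, 512, 512, 'M']
DENSE_UNITS = [4096, 4096, 1000]


def vgg16_summary(input_size=224, input_channels=3):
    """Return (final feature-map side, total parameter count)."""
    size, channels, params = input_size, input_channels, 0
    for item in VGG16_LAYOUT:
        if item == 'M':
            size //= 2  # pooling halves height and width
        else:
            params += 3 * 3 * channels * item + item  # 3x3 kernels plus biases
            channels = item
    features = size * size * channels  # flattened vector fed to FC1
    for units in DENSE_UNITS:
        params += features * units + units  # weights plus biases
        features = units
    return size, params


final_size, total_params = vgg16_summary()
print(final_size)    # 7
print(total_params)  # 138357544
```

Most of the roughly 138 million parameters sit in the first dense layer, which connects the flattened 7 × 7 × 512 feature map to 4096 nodes.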

Additionally, there are variations of the VGG16 model, such as the deeper VGG19 (19 layers).

Let us now explore how to train a VGG-16 model on our dataset:

Step 1: Image Augmentation

Since we used a much smaller dataset of images earlier, we can compensate by augmenting this data and increasing our dataset size. If you are working with the original larger dataset, you can skip this step and build the model.

# Add our data-augmentation parameters to ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1. / 255.,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(
    rescale=1.0 / 255.
)
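If you are curious what two of these options do to the pixel values, the following framework-free sketch mimics `rescale` and `horizontal_flip` on a tiny image stored as nested lists. It is only an illustration under my own simplifications (a single channel, hypothetical pixel values), not how ImageDataGenerator is implemented internally.

```python
def rescale(image, factor=1.0 / 255.0):
    """Multiply every pixel by `factor`, mapping 0-255 values into [0, 1]."""
    return [[pixel * factor for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]


# A tiny 2x3 single-channel "image" with hypothetical pixel values.
tiny_image = [[0, 51, 255],
              [102, 153, 204]]

flipped = horizontal_flip(tiny_image)
scaled = rescale(flipped)

print(flipped)  # [[255, 51, 0], [204, 153, 102]]
```

The label of the image does not change under either operation, which is why such transformations can be used to enlarge a small training set.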

Step 2: Training and Validation Sets

# Flow training images in batches of 20 using train_datagen generator
train_generator = train_datagen.flow_from_directory(
    train_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(224, 224)
)

# Flow validation images in batches of 20 using test_datagen generator
validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(224, 224)
)
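With class_mode='binary', each image gets a label of 0 or 1 based on its subdirectory. To my understanding, the class names are sorted alphanumerically and the resulting mapping is available as `train_generator.class_indices`. The sketch below imitates that mapping and shows how a sigmoid output could be turned back into a class name; the helper names and the 0.5 threshold are assumptions of mine.

```python
def build_class_indices(class_names):
    """Map class names to integer labels in alphanumeric order."""
    return {name: index for index, name in enumerate(sorted(class_names))}


def decode_prediction(probability, class_indices, threshold=0.5):
    """Turn a sigmoid output into a class name.

    The sigmoid output is read as the probability of the class labelled 1.
    """
    index_to_name = {index: name for name, index in class_indices.items()}
    return index_to_name[1] if probability >= threshold else index_to_name[0]


class_indices = build_class_indices(['dogs', 'cats'])
print(class_indices)                           # {'cats': 0, 'dogs': 1}
print(decode_prediction(0.93, class_indices))  # dogs
print(decode_prediction(0.08, class_indices))  # cats
```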

Step 3: Loading the Base Model

We will use only the basic models, with changes to the final layer. This is because this is just a binary classification problem, while these models are built to handle up to 1000 classes.

from tensorflow.keras.applications.vgg16 import VGG16

base_model = VGG16(
    input_shape=(224, 224, 3),  # Shape of our images
    include_top=False,  # Leave out the fully connected layers at the top
    weights='imagenet'
)

Since we don’t have to train all the layers, we make them non-trainable:


for layer in base_model.layers:
    layer.trainable = False

Step 4: Compile and Fit

We will then build the last fully connected layer. I have just used the basic settings, but feel free to experiment with different dropout values, optimizers, and activation functions.

# Flatten the output layer to 1 dimension
x = layers.Flatten()(base_model.output)

# Add a fully connected layer with 512 hidden units and ReLU activation
x = layers.Dense(512, activation='relu')(x)

# Add a dropout rate of 0.5
x = layers.Dropout(0.5)(x)

# Add a final sigmoid layer with 1 node for classification output
x = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(base_model.input, x)

# `learning_rate` replaces the deprecated `lr` argument
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0001),
    loss='binary_crossentropy',
    metrics=['acc']
)
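Because the convolutional base is frozen, only the new head is trained. As a rough check, the sketch below counts the parameters of that head by hand, assuming the base ends in a 7 × 7 × 512 feature map for 224 × 224 inputs. The figure for the frozen base is derived from the standard 13-layer convolutional layout rather than taken from the training run.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


# With include_top=False and 224x224 inputs, VGG-16 ends in a 7x7x512 feature map.
flattened = 7 * 7 * 512

# Dense(512) followed by Dense(1); Dropout adds no parameters.
head_params = dense_params(flattened, 512) + dense_params(512, 1)

# Frozen convolutional base (13 layers of 3x3 convolutions), computed by hand.
frozen_params = 14_714_688

print(flattened)    # 25088
print(head_params)  # 12846081
```

So roughly 12.8 million parameters are updated during training, while about 14.7 million stay fixed at their ImageNet values.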

We will now fit the final model using the training and validation generators we created earlier. Note that if you are working with the original, larger dataset, you should point the generators at its directories rather than at the smaller augmented set I have used here. I have used just 10 epochs, but you can increase them to get better results:

vgghist = model.fit(train_generator, validation_data = validation_generator, steps_per_epoch = 100, epochs = 10)
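The value steps_per_epoch = 100 follows from the dataset size and the batch size. Here is a small sketch of that arithmetic; note that the image counts (2,000 training and 1,000 validation images) are my assumption about the filtered dataset and are not stated above.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


# Assumed sizes of the filtered dataset: 2,000 training and 1,000 validation images.
print(steps_per_epoch(2000, 20))  # 100
print(steps_per_epoch(1000, 20))  # 50
```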

Awesome! As you can see, we achieved a validation accuracy of 93% with just 10 epochs and without any major changes to the model. This is where we realize how powerful Transfer Learning for Image Classification is and how useful pre-trained models for image classification can be. A caveat here, though: VGG16 takes a long time to train compared to other models, which can be a disadvantage when dealing with huge datasets.

Inception

While researching this article, one thing was clear: significant progress in pre-trained models for image classification came with the introduction of GoogLeNet (also known as Inception). The Inceptionv1 model, with only about 7 million parameters, was much smaller than contemporaries like VGG and AlexNet, making it a breakthrough. Its major innovation was the Inception Module, which performs convolutions with different filter sizes and max pooling in parallel, then concatenates the results.

The introduction of the 1×1 convolution operation also helped drastically reduce the number of parameters, improving efficiency.
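To see why the 1×1 convolution helps, compare the number of weights in a direct 5×5 convolution with a 1×1 reduction followed by the same 5×5 convolution. The channel counts below are illustrative values I picked for this sketch, and biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


in_channels, reduced, out_channels = 192, 16, 32  # illustrative channel counts

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, in_channels, out_channels)

# 1x1 convolution first shrinks the channels, then the 5x5 runs on fewer channels.
bottleneck = conv_weights(1, in_channels, reduced) + conv_weights(5, reduced, out_channels)

print(direct)      # 153600
print(bottleneck)  # 15872
```

In this example, the reduced version needs about a tenth of the weights for the same number of output channels.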

Figure: Inception module with dimension reduction

Though the number of layers in Inceptionv1 is 22, the massive parameter reduction makes it a formidable model to beat.

Figure: Auxiliary classifiers

The Inceptionv2 model was a major improvement on the Inceptionv1 model, which increased its accuracy and made it less complex. In the same paper as Inceptionv2, the authors introduced the Inceptionv3 model with a few more improvements on v2.

The following are the major improvements included:

  • Introduction of Batch Normalisation
  • More factorization
  • RMSProp Optimiser

Figure: Inception architecture

While it is not possible to provide an in-depth explanation of Inception in this article, you can go through this comprehensive article covering the Inception Model in detail: Deep Learning in the Trenches: Understanding Inception Network from Scratch

As you can see, the number of layers is 42, compared to VGG16’s paltry 16 layers. Also, a single Inceptionv3 model reduced the top-5 error rate to only 4.2%.

Let’s see how to implement it in Python:

Step 1: Data Augmentation

You will note that I am not performing extensive data augmentation. The code is the same as before. I have just changed the image dimensions for each model.

# Add our data-augmentation parameters to ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1. / 255.,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

test_datagen = ImageDataGenerator(
    rescale=1.0 / 255.
)

Step 2: Training and Validation Generators

train_generator = train_datagen.flow_from_directory(
    train_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(150, 150)
)

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(150, 150)
)

Step 3: Loading the Base Model

from tensorflow.keras.applications.inception_v3 import InceptionV3

base_model = InceptionV3(
    input_shape=(150, 150, 3),
    include_top=False,
    weights='imagenet'
)

Step 4: Compile and Fit

Just like VGG-16, we will only change the last layer.

for layer in base_model.layers:
    layer.trainable = False

We perform the following operations:

  • Flatten the output of our base model to 1 dimension
  • Add a fully connected layer with 1,024 hidden units and ReLU activation
  • This time, we will go with a dropout rate of 0.2
  • Add a final Fully Connected Sigmoid Layer
  • We will again use RMSProp, though you can try out the Adam Optimiser too

from tensorflow.keras.optimizers import RMSprop

x = layers.Flatten()(base_model.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)

# Add a final sigmoid layer with 1 node for classification output
x = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(base_model.input, x)

model.compile(optimizer=RMSprop(learning_rate=0.0001), loss='binary_crossentropy', metrics=['acc'])

We will then fit the image classification model:

# model.fit accepts generators directly; fit_generator is deprecated
inc_history = model.fit(
    train_generator,
    validation_data=validation_generator,
    steps_per_epoch=100,
    epochs=10
)

Figure: Training output per epoch

As a result, we can see that we get 96% Validation accuracy in 10 epochs. Also, note that this model is much faster than VGG16. Each epoch takes around only 1/4th the time that each epoch in VGG16 does. Of course, you can always experiment with the different hyperparameter values and see how much better/worse it performs.

I liked studying the Inception model. While most models at that time were merely sequential and followed the premise that the deeper and larger the model, the better it would perform, Inception and its variants broke this mold. The Inceptionv3 paper, presented at CVPR 2016, reported a top-5 error rate of only about 3.5% using an ensemble of models.
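Since the top-5 error rate comes up repeatedly, here is a short sketch of how such a metric can be computed: a prediction counts as correct if the true label is among the five highest-scoring classes. The scores and labels below are made-up values for illustration.

```python
def top_k_error(scores, true_labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for sample_scores, label in zip(scores, true_labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Two hypothetical samples over six classes.
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 0 has the lowest score
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 1 is ranked second
]
labels = [0, 1]

print(top_k_error(scores, labels, k=5))  # 0.5
print(top_k_error(scores, labels, k=1))  # 1.0
```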

ResNet50

Just like Inceptionv3, ResNet50 is not the first image classification model from the ResNet family. The original model, the Residual net or ResNet, was another milestone in the CV domain back in 2015.

The main motivation behind this model was to avoid the loss of accuracy that occurs as models become deeper. Additionally, if you are familiar with Gradient Descent, you would have encountered the Vanishing Gradient issue, which the ResNet model also aimed to tackle. Here is the architecture of an early variant, ResNet34 (ResNet50 follows a similar technique with more layers):

Figure: ResNet architecture

You can see that after starting with a single convolutional layer and Max Pooling, there are 4 similar stages with varying numbers of filters, all of them using 3 × 3 convolutions. Also, after every 2 convolutions, a shortcut bypasses the layers in between. This is the main concept behind ResNet models. These skip connections are called “identity shortcut connections” and form what are called residual blocks:

Figure: Residual blocks

In simple terms, the authors of ResNet propose that fitting a residual mapping F(x) = H(x) - x is much easier than fitting the desired mapping H(x) directly, so each block learns the residual and adds the input x back. Another interesting point is that, in the authors’ view, stacking more layers should never make the model perform worse, since the extra layers can always fall back to the identity mapping.
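The idea of a residual block can be sketched in a few lines without any framework. In this simplified version, I use two small fully connected layers in place of convolutions, which is an assumption made purely for readability; the point is the shortcut that adds the input back to the block's output.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def linear(weights, vector):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    residual = linear(w2, relu(linear(w1, x)))           # F(x)
    return relu([r + xi for r, xi in zip(residual, x)])  # add the shortcut


x = [1.0, 2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block passes its input straight through.
print(residual_block(x, zero_weights, zero_weights))  # [1.0, 2.0, 3.0]
```

This is why extra blocks need not hurt: a block only has to push its weights towards zero to behave like an identity mapping for non-negative inputs.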

Unlike Inception, ResNet resembles VGG16 in that it simply stacks layers on top of each other; what it changes is the underlying mapping that each block has to learn.

The ResNet model has many variants, of which the deepest ImageNet model in the original paper is ResNet152. The following is the architecture of the ResNet family in terms of the layers used:

Figure: ResNet family architecture

Let us now use ResNet50 on our dataset:

Step 1: Data Augmentation and Generators

# Add our data-augmentation parameters to ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255., rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True)

test_datagen = ImageDataGenerator(rescale = 1.0/255.)

train_generator = train_datagen.flow_from_directory(train_dir, batch_size = 20, class_mode = 'binary', target_size = (224, 224))

validation_generator = test_datagen.flow_from_directory( validation_dir, batch_size = 20, class_mode = 'binary', target_size = (224, 224))

Step 2: Import the base model

from tensorflow.keras.applications import ResNet50

base_model = ResNet50(input_shape=(224, 224,3), include_top=False, weights="imagenet")

Again, we are using only the basic ResNet model, so we will keep the layers frozen and only modify the last layer:


for layer in base_model.layers:
    layer.trainable = False

Step 3: Build and Compile the Model

I would like to show you an even shorter way of using the ResNet50 model. We will use this pre-trained model as a layer in a Sequential model and add a single Fully Connected Layer.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# pooling='max' reduces the final feature map to a vector, so no Flatten layer is needed
resnet = ResNet50(include_top=False, weights='imagenet', pooling='max')
resnet.trainable = False  # keep the pre-trained weights frozen, as described above

base_model = Sequential()
base_model.add(resnet)
base_model.add(Dense(1, activation='sigmoid'))

We compile the image classification model, and this time, let us try the SGD optimizer:

base_model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0001),
    loss='binary_crossentropy',
    metrics=['acc']
)

Step 4: Fitting the model

resnet_history = base_model.fit(
    train_generator,
    validation_data=validation_generator,
    steps_per_epoch=100,
    epochs=10
)

The following is the result we get:

Figure: ResNet50 result

You can see how well it performs on our dataset, making ResNet50 one of the most widely used pre-trained models. Like VGG, it also has other variations, as seen in the table above. Remarkably, ResNet not only has its own variants, but it also spawned a series of architectures based on ResNet. These include ResNeXt, ResNet as an Ensemble, etc. Additionally, ResNet50 is among the most popular image classification models out there and achieved a top-5 error rate of around 5%.

EfficientNet

We finally come to the latest of these four models, one that has caused waves in this domain, and of course, it is from Google. In EfficientNet, the authors propose a new scaling method called Compound Scaling. The long and short of it is this: earlier models like ResNet follow the conventional approach of scaling a single dimension arbitrarily, for example by adding more layers.

However, the paper proposes that if we scale depth, width, and input resolution simultaneously, each by a fixed ratio, we achieve much better performance. The compound coefficient that controls how far to scale can, in fact, be decided by the user.

Though this scaling technique can be used for any CNN-based model, the authors started off with their own baseline model called EfficientNetB0:

Figure: EfficientNetB0 architecture

MBConv stands for mobile inverted bottleneck convolution (similar to MobileNetv2). They also propose the Compound Scaling formula with the following scaling coefficients:

  • Depth = 1.20
  • Width = 1.10
  • Resolution = 1.15
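Here is a small sketch of how these coefficients are applied. Each one is raised to the power of a compound coefficient, which I call `phi` here, and the cost in FLOPs grows roughly with depth × width² × resolution². The function names are made up for this illustration, and the sketch ignores the rounding that real models apply to layer counts and channel widths.

```python
ALPHA, BETA, GAMMA = 1.20, 1.10, 1.15  # depth, width, resolution coefficients


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_growth(phi):
    """Approximate FLOPs multiplier: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2),
          round(flops_growth(phi), 2))
```

With these coefficients, 1.20 × 1.10² × 1.15² is about 1.92, so each increment of `phi` roughly doubles the computational cost.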

This formula is used to build a family of EfficientNets, from EfficientNetB0 to EfficientNetB7. The following is a simple graph showing the comparative performance of this family vis-a-vis other popular models:

Figure: Accuracy versus number of parameters

As you can see, even the baseline B0 model starts at a much higher accuracy, which only increases, and that too with fewer parameters. For instance, EfficientNetB0 has only 5.3 million parameters!

The simplest way to implement EfficientNet is to install it. The rest of the steps are similar to what we have seen above.

Installing EfficientNet

!pip install -U efficientnet

Import it

# Use the tf.keras variant so it matches the tensorflow.keras imports above
import efficientnet.tfkeras as efn

Step 1: Image Augmentation

We will use the same image dimensions that we used for VGG16 and ResNet50. By now, you would be familiar with the Augmentation process:

# Add our data-augmentation parameters to ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1. / 255.,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

test_datagen = ImageDataGenerator(
    rescale=1.0 / 255.
)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(224, 224)
)

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    batch_size=20,
    class_mode='binary',
    target_size=(224, 224)
)

Step 2: Loading the Base Model

We will use the B0 version of EfficientNet since it is the simplest of the 8. I urge you to experiment with the rest of the models, though do keep in mind that they are becoming increasingly complex, which might not be best suited for a simple binary classification task.

base_model = efn.EfficientNetB0(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)

Again, let us freeze the layers:

for layer in base_model.layers:
    layer.trainable = False

Step 3: Build the Model

Just like Inceptionv3, we will perform these steps at the final layer:

x = base_model.output
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# Add a final sigmoid layer with 1 node for classification output
predictions = layers.Dense(1, activation="sigmoid")(x)
# Model (imported earlier from tensorflow.keras) takes `inputs` and `outputs`
model_final = Model(inputs=base_model.input, outputs=predictions)

Step 4: Compile and Fit

Let us again use the RMSProp Optimiser, though here, I have introduced a decay parameter:

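# InverseTimeDecay reproduces the legacy decay=1e-6: lr = 0.0001 / (1 + 1e-6 * step)
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.0001, decay_steps=1, decay_rate=1e-6
)
model_final.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
    loss='binary_crossentropy',
    metrics=['accuracy']
)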

We finally fit the model on our data:

eff_history = model_final.fit(
    train_generator,
    validation_data=validation_generator,
    steps_per_epoch=100,
    epochs=10
)

There we go: we achieved a whopping 98% accuracy on our validation set in only 10 epochs. I urge you to try training the larger dataset with EfficientNetB7 and share the results below.

Result

In this article, we looked at four state-of-the-art pre-trained models for image classification. Here is a handy table for you to refer to these models and their performance:

Pre-Trained Models for Image Classification with Python

Conclusion

The exploration of pre-trained models for image classification reveals the remarkable advancements in the field of Computer Vision. Each model discussed—VGG-16, Inception, ResNet50, and EfficientNet—represents significant strides in achieving near-human-level accuracy in recognizing and categorizing images. Thus, pre-trained models have transformed the landscape of image classification, making state-of-the-art techniques accessible for a wide range of applications. By understanding and utilizing these models, practitioners can significantly enhance the efficiency and accuracy of their computer vision tasks, paving the way for further innovations and applications in the field. As the technology continues to evolve, it will be exciting to see how future models build upon these foundations to achieve even greater feats in artificial intelligence.

I hope you found this guide to the best image classification models useful, along with how they enhance our understanding of visual data through advanced techniques.


Frequently Asked Questions

Q1. What are pre-trained models for image classification?

A. Pre-trained models for image classification are models previously trained on large datasets like ImageNet. They can be fine-tuned for specific tasks, saving time and computational resources.

Q2. Which pre-trained models are used for medical image classification?

A. For medical image classification, pre-trained models like VGG16, ResNet, and DenseNet, particularly those adapted with domain-specific datasets such as CheXNet for chest X-rays, are effective.

Q3. What does pre-training a model involve?

A. Pre-training models involves training on a large, general dataset before fine-tuning on a specific task. This leverages learned features and accelerates convergence.

Q4. Which models are best for image classification?

A. The best models for image classification include CNN-based architectures like ResNet, VGG, Inception, and EfficientNet, known for their high accuracy and efficiency.

Associate of Data Science @ JP Morgan

Responses From Readers


Mansi

Hi, thank you for this article. I tried the InceptionV3 model on my custom data but I found drastically bad predictions. I found out that the model was predicting 1 class 99% of the time. I followed the step in this article and tried changing parameters also but same problem. What could the issue be?

Vandana Lingampally

i executed all the above code in google colab mam, iam doing research with medical images. These medical images will be of gray colour, but for these above networks we need colour images or of depth 3. How to get the gray images to colour or of depth 3 without any distortions in image.

Muhammad Rifqi

Hi, Mrs. Purva Huilgol, what a content, thanks for it! i've been try the code on my case. It's all work for VGG-16, Inception and RestNet50. But there are some error for EfficientNet. When i try to build the model, it said that name 'model' is not defined. When i try to add the model from the RestNet50, antoher error appeared TypeError: ('Keyword argument not understood:', 'inputs'). Is there any reason why this happen? However, i'm new in python. Sorry for my lack of experience.
