Google Has Released the Latest Open Images Dataset! Every Data Scientist Should Work with This

Overview

Open Images is a massive dataset containing more than 9 million images
The bounding boxes were drawn manually by professional annotators, and the images also carry image-level labels
The dataset is divided into training (9 million+ images), validation (41k+ images), and test (125k+ images) sets
Google has also announced an object detection challenge for data scientists

Introduction

For a data scientist, finding large datasets to work with is a challenge. Most organizations treasure their data and prefer not to release it to the community. Google, however, is one of the few that has consistently open sourced much of its research, both to speed up studies and to help budding data scientists.

This week, the company released version 4 of its popular Open Images dataset – free and available for anyone to download and work with.

Open Images is a massive image dataset that Google first released in 2016. It consists of roughly 9 million images that have already been labelled by the team. According to their site, “The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations”.

These bounding boxes were drawn manually by professional annotators to ensure accuracy and consistency. The images cover diverse subject matter, with 8.4 objects per image on average. As icing on the cake, the data is also annotated with image-level labels that span thousands of classes!
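
As a quick sanity check on the 8.4 figure, the quoted training-set numbers can simply be divided. This is only a back-of-the-envelope sketch: it assumes the average is taken over the 1.74M training images that carry boxes, and the variable names are ours, not Google's.

```python
# Figures quoted from the Open Images V4 description above.
total_boxes = 14.6e6     # bounding boxes in the training set
boxed_images = 1.74e6    # training images that carry those boxes

# Assumption: the average is boxes divided by boxed images.
avg_objects_per_image = total_boxes / boxed_images
print(f"{avg_objects_per_image:.1f} objects per image")
```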

The Open Images dataset is pre-split into the training, validation and test sets. The training set contains 9,011,219 images, the validation set has 41,260 images and the test set has 125,436 images. All of these images come with proper labels to help you get down to building a model as quickly as possible.
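
To put those split sizes in perspective, here is a small sketch that adds them up and works out what share of the dataset each split represents. The counts are the ones quoted above; the dictionary keys are just illustrative names.

```python
# Split sizes as quoted in the article.
splits = {"train": 9_011_219, "validation": 41_260, "test": 125_436}

total = sum(splits.values())
print(f"total images: {total:,}")

# Share of the whole dataset that falls in each split.
shares = {name: count / total for name, count in splits.items()}
for name, share in shares.items():
    print(f"{name:<10} {share:.2%}")
```

The training split makes up the overwhelming majority of the images, so the validation and test sets are small by comparison.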

Along with this dataset release, Google has announced the ‘Open Images Challenge 2018’, an object detection challenge scheduled to be held at the European Conference on Computer Vision. The competition offers a far broader range of object classes than any previous challenge. It will have two tracks:

Object Class Detection: predicting a tight bounding box around all instances of the 500 classes
Visual Relationship Detection: detecting pairs of objects in particular relations, e.g. “woman playing guitar”. This is done by adding a large number of images with multiple object annotations
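
How "tight" a predicted box is relative to the ground truth is commonly measured with intersection over union (IoU). The sketch below is a minimal illustration, not the challenge's own code: the `Box` type, its field layout, and the sample coordinates are all assumptions made for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box; the (xmin, ymin, xmax, ymax) layout is an assumption."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(box: Box) -> float:
    # Clamp at zero so an empty or inverted box contributes no area.
    return max(0.0, box.xmax - box.xmin) * max(0.0, box.ymax - box.ymin)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    overlap = Box(max(a.xmin, b.xmin), max(a.ymin, b.ymin),
                  min(a.xmax, b.xmax), min(a.ymax, b.ymax))
    overlap_area = area(overlap)
    union_area = area(a) + area(b) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0


# Made-up boxes in normalized image coordinates.
ground_truth = Box(0.1, 0.1, 0.5, 0.5)
prediction = Box(0.2, 0.2, 0.6, 0.6)
print(f"IoU = {iou(ground_truth, prediction):.3f}")
```

An IoU of 1.0 means the boxes coincide exactly, while 0.0 means they do not overlap at all.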

The deadline for submission of results is 1st September, 2018. The evaluation metric for this challenge will be mean Average Precision (mAP) over the given 500 classes.
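
For readers new to the metric, here is a simplified sketch of mAP. It uses plain non-interpolated average precision and assumes each detection has already been marked as a true or false positive; the official challenge protocol may differ in its details, and the class names and scores below are invented toy data.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence, is_true_positive)


def average_precision(detections: List[Detection], num_ground_truth: int) -> float:
    """Non-interpolated AP for one class.

    Assumes matching predictions to ground-truth boxes is already done,
    so each detection is simply flagged as a true or false positive.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            # Precision at this rank, weighted by the recall step it adds.
            ap += (true_positives / rank) / num_ground_truth
    return ap


def mean_average_precision(per_class: Dict[str, List[Detection]],
                           gt_counts: Dict[str, int]) -> float:
    """Mean of the per-class APs over every class that has a ground-truth count."""
    aps = [average_precision(per_class.get(name, []), count)
           for name, count in gt_counts.items()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example with two hypothetical classes.
toy_detections = {
    "Cat": [(0.9, True), (0.8, False), (0.7, True)],
    "Dog": [(0.6, True)],
}
toy_gt_counts = {"Cat": 2, "Dog": 1}
print(f"mAP = {mean_average_precision(toy_detections, toy_gt_counts):.3f}")
```

Averaging over classes rather than over detections means that rare classes count as much as common ones, which matters when there are 500 of them.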

This is the fourth update the team has released in the last 2 years. You can download the dataset from Google’s page here.

Our take on this

This is a treasure trove for data scientists! Anyone interested in deep learning and image classification can download and work on this dataset. The fact that Google has worked on labelling the images is a testament to their team and to the power of their resources. The training set, with its massive size, is expected to stimulate research on more complex detection models. The hope is that this release will help in improving current state-of-the-art models.

Their open challenge is already generating a huge buzz in the ML community and we are expecting to see some serious competition. We will be sure to cover any major projects that come up in this challenge.

Whether you’re a newcomer to image processing or have been working in this field for a while, this dataset is perfect for you. Use the comments section below to tell us how you plan on using it!

Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Aditya Malte

This is a breakthrough!! However, I am unable to download data of a specific category (eg. cat images) to my computer from the given link. Any suggestions?

Hi Aditya, I don't think that is available anywhere on their site. You have to download the entire dataset (or the train/test/validation splits separately). I'll look into it more and give you an update in case I come across this particular feature.

ddflower

Is there a place that describes the 500 class labels, i.e. what types of objects they are? Thanks!

Pulkit Sharma

Hi, you can download the CSV file that contains the description of each class from here.
