For a data scientist, finding large datasets to work with is a challenge. Most organizations guard their data and prefer not to release it to the community. But Google has been one of the few that have consistently open sourced a lot of their research, both to speed up studies and to help budding data scientists.
This week, they have released version 4 of their popular Open Images dataset – free and available for anyone to download and work with.
Open Images is a massive dataset of images which was released by Google back in 2016. The dataset consists of 9 million images that have already been labelled by the team. According to their site, “The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations”.
These annotations have been drawn manually by professional annotators in order to ensure accuracy and consistency. The subject matter in the images is diverse in nature. There are 8.4 objects per image on average in this dataset. To add the icing on the cake, the data is annotated with image-level labels that span thousands of classes!
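The per-image average can be sanity-checked against the figures in Google's quote. The short sketch below simply divides the quoted number of training-set bounding boxes by the quoted number of boxed images; the variable names are ours, and the inputs are the rounded values from the quote rather than exact counts.

```python
# Rounded figures quoted from the Open Images V4 description.
total_boxes = 14_600_000   # 14.6M bounding boxes in the training set
boxed_images = 1_740_000   # 1.74M training images that carry box annotations

# Average number of annotated boxes per boxed image.
boxes_per_image = total_boxes / boxed_images
print(f"{boxes_per_image:.1f} boxes per image on average")
```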
The Open Images dataset is pre-split into the training, validation and test sets. The training set contains 9,011,219 images, the validation set has 41,260 images and the test set has 125,436 images. All of these images come with proper labels to help you get down to building a model as quickly as possible.
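To put the three split sizes in proportion, here is a small sketch that computes each split's share of the total. It uses only the image counts stated above; the dictionary and variable names are illustrative.

```python
# Image counts per split, as stated in the article.
split_sizes = {"train": 9_011_219, "validation": 41_260, "test": 125_436}

total_images = sum(split_sizes.values())

# Print each split's size and its share of all images.
for name, count in split_sizes.items():
    print(f"{name:<10} {count:>9,} images  {count / total_images:6.2%}")
print(f"{'total':<10} {total_images:>9,} images")
```

Nearly all of the images sit in the training split, which is worth keeping in mind when estimating download and storage requirements.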
Along with this dataset release, Google has announced the ‘Open Images Challenge 2018’. It is an object detection challenge, scheduled to be held at the European Conference on Computer Vision. This competition offers a far broader range of object classes than any previous challenge. It will have two tracks:
The deadline for submission of results is 1st September, 2018. The evaluation metric for this challenge will be mean Average Precision (mAP) over the given 500 classes.
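For readers new to the metric, the sketch below shows one common way to compute mean Average Precision: a non-interpolated Average Precision per class, averaged over classes. It is a generic illustration, not the challenge's official evaluation code. It assumes each detection has already been marked as a true or false positive (for example by an overlap test against the ground-truth boxes), and the function names and toy numbers are made up for the example.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence score, matched a ground-truth box?)


def average_precision(detections: List[Detection], num_ground_truth: int) -> float:
    """Non-interpolated AP for one class.

    Sums the precision at every rank where a true positive occurs,
    weighted by the recall gained at that rank (1 / num_ground_truth).
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_score, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
            ap += (true_positives / rank) / num_ground_truth
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Average the per-class AP values over all classes."""
    scores = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(scores) / len(scores)


# Toy input: class name -> (detections, number of ground-truth boxes).
toy_results = {
    "Cat": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "Dog": ([(0.6, True)], 2),
}
print(f"mAP = {mean_average_precision(toy_results):.3f}")
```

In the challenge setting the average would run over all 500 classes instead of the two toy classes used here.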
This is the fourth update the team has released in the last 2 years. You can download the dataset from Google’s page here.
This is a treasure trove for data scientists! Anyone interested in deep learning and image classification can download and work on this dataset. The fact that Google has worked on labelling the images is a testament to their team and to the power of their resources. The training set, with its massive size, is expected to stimulate research on more complex detection models. The hope is that this release will help in improving current state-of-the-art models.
Their open challenge is already generating a huge buzz in the ML community and we are expecting to see some serious competition. We will be sure to cover any major projects that come up in this challenge.
Whether you’re a newcomer to image processing or have been working in this field for a while, this dataset is perfect for you. Use the comments section below to tell us how you plan on using this!
This is a breakthrough!! However, I am unable to download data of a specific category (e.g. cat images) to my computer from the given link. Any suggestions?
Hi Aditya, I don't think that is available anywhere on their site. You have to download the entire dataset (or the train/test/validation splits separately). I'll look into it more and give you an update in case I come across this particular feature.
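In the meantime, one possible workaround is to filter the annotation files locally and fetch only the matching images. The sketch below is a rough illustration under assumptions: it supposes a class-description CSV with rows of the form `label id, class name` and a box-annotation CSV with `ImageID` and `LabelName` columns. The sample rows, the label ids `/m/aaa` and `/m/bbb`, and the function names are all placeholders, so check them against the real files before relying on this.

```python
import csv
import io
from typing import List, Optional

# Tiny in-memory stand-ins for the two CSV files (placeholder contents).
CLASS_DESCRIPTIONS = "/m/aaa,Cat\n/m/bbb,Dog\n"
BOX_ANNOTATIONS = (
    "ImageID,LabelName,XMin,XMax,YMin,YMax\n"
    "img001,/m/aaa,0.10,0.50,0.20,0.60\n"
    "img002,/m/bbb,0.30,0.90,0.10,0.80\n"
    "img003,/m/aaa,0.05,0.40,0.15,0.70\n"
    "img003,/m/bbb,0.50,0.95,0.20,0.90\n"
)


def label_id_for(class_name: str, descriptions_csv: str) -> Optional[str]:
    """Look up the label id for a human-readable class name (case-insensitive)."""
    for label_id, name in csv.reader(io.StringIO(descriptions_csv)):
        if name.lower() == class_name.lower():
            return label_id
    return None


def image_ids_with_label(label_id: str, annotations_csv: str) -> List[str]:
    """Return the sorted, de-duplicated image ids that have a box with this label."""
    reader = csv.DictReader(io.StringIO(annotations_csv))
    return sorted({row["ImageID"] for row in reader if row["LabelName"] == label_id})


cat_id = label_id_for("cat", CLASS_DESCRIPTIONS)
cat_images = image_ids_with_label(cat_id, BOX_ANNOTATIONS) if cat_id else []
print(cat_images)
```

With real files, the `io.StringIO(...)` calls would be replaced by `open(path, newline="")`, and the resulting image ids could then be used to download just those images.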
Is there a place that describes the 500 class labels, i.e. what types of objects they cover? Thanks!
Hi, you can download the CSV file from here, which contains the description of each class.