Computer vision models allow digital systems to recognize and interpret the information contained in images, much as humans use their eyes and minds to make sense of the world around them. Unlike humans, who learn to understand the visual world almost instantly, computer vision systems must go through several training stages, using machine learning, deep learning algorithms, and neural networks, before they reach even a basic level of image and object identification. To train such visual perception models, you need curated computer vision datasets that help the models detect and distinguish objects in images.
In a broader sense, computer vision algorithms can deconstruct visual material and turn it into metadata, which can then be stored, classified, and analyzed like any other dataset. Because the quality of a computer vision model depends on its training data, that data should be of the highest possible quality.
While it is good to have a dataset that comprises a wide range of images and video sequences, too few carefully selected training examples (i.e., labeled images) can lead to under-fitted models. It is also better to use datasets that are relevant to the industry for which you are developing the solution.
However, manually labeling data is a strenuous process. Thanks to pre-labeled computer vision datasets, organizations can now quickly get the data they need to train computer vision models. Instead of collecting and labeling data, developers and researchers can focus their resources on building and training models. Furthermore, the more open-source datasets become available, the better the available data tends to become.
In the following list, we have compiled some of the most widely used computer vision datasets.
1. CIFAR-10 and CIFAR-100
The Canadian Institute For Advanced Research provides both CIFAR-10 and CIFAR-100. The CIFAR-10 dataset was developed by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is used for object recognition and contains 60,000 32×32 color images divided into ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), each with 6,000 images. It is split into five training batches and one test batch of 10,000 images each, for a total of 50,000 training and 10,000 test images. CIFAR-100 is similar in that it also has 60,000 images altogether, but they are spread over 100 classes of 600 images each.
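As a rough illustration of how such images are stored, the sketch below decodes one CIFAR-style row into a 32×32 grid of RGB pixels. It assumes the row layout documented for the Python version of the dataset (3,072 values per image: 1,024 red, then 1,024 green, then 1,024 blue, each in row-major order). The function name `row_to_image` is hypothetical and the sample row is synthetic.

```python
SIDE = 32               # CIFAR images are 32x32 pixels
PLANE = SIDE * SIDE     # 1,024 values per color channel


def row_to_image(row):
    """Convert a flat 3,072-value row into a 32x32 grid of (r, g, b) tuples."""
    if len(row) != 3 * PLANE:
        raise ValueError(f"expected {3 * PLANE} values, got {len(row)}")
    red = row[:PLANE]
    green = row[PLANE:2 * PLANE]
    blue = row[2 * PLANE:]
    return [
        [(red[y * SIDE + x], green[y * SIDE + x], blue[y * SIDE + x])
         for x in range(SIDE)]
        for y in range(SIDE)
    ]


# Synthetic all-red row, used only to exercise the function.
sample_row = [255] * PLANE + [0] * PLANE + [0] * PLANE
image = row_to_image(sample_row)

# Split sizes for CIFAR-10: five training batches plus one test batch.
BATCH_SIZE = 10_000
train_total = 5 * BATCH_SIZE
test_total = 1 * BATCH_SIZE
```

In practice most frameworks ship their own CIFAR loaders, so a hand-written decoder like this is mainly useful for understanding the file layout.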
2. MNIST
The Modified National Institute of Standards and Technology (MNIST) database of handwritten digits, compiled by Professor Yann LeCun, is among the most commonly used datasets in computer vision. It comprises 70,000 28×28 grayscale images of handwritten digits from 0 to 9. The release is pre-split into two sets: a training set of 60,000 images and a test set of 10,000. All digits are centered in the image. It is typically used for an introductory computer vision project: handwritten digit recognition.
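The MNIST image files are distributed in the simple binary IDX format. The sketch below parses such a buffer, assuming the usual image-file header of four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function name `parse_idx_images` is hypothetical, and the demo buffer is synthetic rather than real MNIST data.

```python
import struct

IMAGE_MAGIC = 2051   # 0x00000803: unsigned-byte data with 3 dimensions


def parse_idx_images(raw):
    """Parse IDX image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header")
    size = rows * cols
    return [
        [list(pixels[i * size + r * cols:i * size + (r + 1) * cols])
         for r in range(rows)]
        for i in range(count)
    ]


# Synthetic buffer with two blank 28x28 images, used only as a demo.
demo = struct.pack(">IIII", IMAGE_MAGIC, 2, 28, 28) + bytes(2 * 28 * 28)
images = parse_idx_images(demo)
```

The published files are gzip-compressed, so a real loader would typically read them with `gzip.open(path, "rb")` before passing the bytes to a parser like this one.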
3. Fashion MNIST
This computer vision dataset, based on article images from the fashion retailer Zalando, includes a training set of 60,000 examples and a test set of 10,000. Each example is a 28×28 grayscale image with a label from one of ten classes: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. An automated benchmarking system based on scikit-learn covers 129 classifiers with various parameters.
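The labels are stored as integers rather than names, so a lookup table is usually the first thing a project needs. The sketch below assumes the label order given in the dataset's README (0 for T-shirt/top through 9 for ankle boot); the helper name `label_name` is hypothetical.

```python
# Class names indexed by integer label (assumed order: 0..9).
FASHION_MNIST_CLASSES = (
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
)


def label_name(label):
    """Return the class name for an integer label in the range 0-9."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label out of range: {label}")
    return FASHION_MNIST_CLASSES[label]


first = label_name(0)
last = label_name(9)
```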
4. Labeled Faces in the Wild
This open-source computer vision dataset consists of images of people’s faces and was created to study the problem of unconstrained face recognition. More than 13,000 face images were collected from the internet. Each face is labeled with the name of the person pictured, and 1,680 of the people have two or more distinct photos in the dataset. There are currently four sets of LFW images: the original and three types of “aligned” images, namely “funneled” images, LFW-a, and “deep funneled” images. Compared with the original and funneled images, LFW-a and the deep funneled images give better results for most face verification methods.
5. ImageNet
This deep learning computer vision dataset was created jointly by researchers at Stanford University and Princeton University. It is the basis of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual computer vision competition in which participating teams were challenged with five main tasks: object classification, object localization, object detection, object detection from video, and scene recognition. The dataset is organized according to the WordNet hierarchy (WordNet is a lexical database for English), and only nouns are included. Each node of the hierarchy has over 500 images on average. In all, there are more than 14 million images representing over 20,000 categories. It is one of the world’s largest categorized image collections, and it is available free of charge for non-commercial research.
6. ObjectNet
This image dataset for computer vision was developed by researchers at the MIT-IBM Watson AI Lab with the aim of removing biases found in existing image datasets. Instead of curating photographs from existing online sources, the researchers crowdsourced them through Mechanical Turk, Amazon’s micro-task platform. The Turkers photographed each object individually and submitted the pictures for review. This procedure ensured that background, lighting, rotation, and other aspects varied sufficiently. ObjectNet contains 50,000 images spread across 313 object classes.
7. PatchCamelyon
This medical image classification dataset is available through the TensorFlow website. PatchCamelyon (PCam) is a challenging image classification dataset made up of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image has a binary label that indicates the presence of metastatic tissue. PCam is a machine learning benchmark that is larger than CIFAR-10, smaller than ImageNet, and trainable on a single GPU.
8. IMDB-Wiki Dataset
This is said to be the largest publicly available training dataset of face images with gender and age labels. The collection comprises 460,723 face images of 20,284 celebrities from IMDb and 62,328 face images from Wikipedia, for a total of 523,051. The data includes important meta-information such as the location of the person’s face in the image, their name, date of birth, and gender. Typically, this dataset is used for gender and age prediction tasks.
9. DOTA
DOTA (Dataset of Object deTection in Aerial Images) is a large-scale dataset for aerial object detection that can be used to develop and evaluate object detectors for images taken from high altitude. The images were collected from a variety of sensors and platforms. Image sizes range from 800 × 800 pixels to 20,000 × 20,000 pixels, and the images contain objects of various sizes, orientations, and shapes. Experts in aerial image interpretation annotate the instances in DOTA images with arbitrary (8 d.o.f.) quadrilaterals.
The dataset covers 15 common categories (e.g., ship, plane, car, and swimming pool). Objects are annotated with bounding boxes defined by four corner points, that is, four (x, y) coordinate pairs. The data is divided into three sets: training, validation, and testing.
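To make the quadrilateral annotation concrete, here is a minimal sketch that parses one annotation line and derives two common quantities from it: the polygon area (via the shoelace formula) and the enclosing axis-aligned box. It assumes the plain-text line layout `x1 y1 x2 y2 x3 y3 x4 y4 category difficult` used in DOTA label files. The function names and the sample line are hypothetical.

```python
def parse_dota_line(line):
    """Parse 'x1 y1 x2 y2 x3 y3 x4 y4 category difficult' into a dict."""
    parts = line.split()
    if len(parts) < 9:
        raise ValueError("expected 8 coordinates and a category")
    coords = [float(v) for v in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))   # four (x, y) pairs
    difficult = int(parts[9]) if len(parts) > 9 else 0
    return {"points": points, "category": parts[8], "difficult": difficult}


def quad_area(points):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def enclosing_box(points):
    """Axis-aligned box (xmin, ymin, xmax, ymax) around the quadrilateral."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Synthetic annotation line, not taken from the dataset.
obj = parse_dota_line("10 10 50 10 50 30 10 30 ship 0")
area = quad_area(obj["points"])
box = enclosing_box(obj["points"])
```

Converting to an axis-aligned box discards the orientation information, which is often the main reason for using DOTA, so detectors built for oriented objects usually keep the four points as they are.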
10. MPII Human Pose
This dataset is used to evaluate articulated human pose estimation. It contains around 25K images of over 40K people with annotated body joints. Each image was extracted from a YouTube video and is accompanied by a description. The collection covers around 410 human activities, and each image is labeled with the activity it shows.
11. MS COCO
Microsoft’s COCO, short for Common Objects in Context, is a large-scale dataset for object detection, segmentation, and captioning. It covers 80 object categories and 91 stuff categories. The dataset has over 120,000 images with over 880,000 labels (each image can have several). It also includes annotations for more advanced computer vision tasks, including multi-object labeling, segmentation masks, image captioning, panoptic segmentation, stuff segmentation, DensePose human pose estimation, and keypoint detection. It comes with an easy-to-use API that makes loading, parsing, and visualizing COCO annotations a breeze.
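The sketch below shows the general shape of a COCO-style detection annotation file and how one might group boxes by image. It deliberately does not use the official COCO API. Instead it reads the JSON structure directly, assuming the standard top-level keys `images`, `annotations`, and `categories` and boxes stored as `[x, y, width, height]`. The function name `boxes_by_image` and the tiny dictionary are illustrative only.

```python
from collections import defaultdict


def boxes_by_image(coco):
    """Map each image id to a list of (category name, (x1, y1, x2, y2))."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]   # COCO boxes are [x, y, width, height]
        corners = (x, y, x + w, y + h)
        grouped[ann["image_id"]].append((names[ann["category_id"]], corners))
    return dict(grouped)


# Synthetic, minimal annotation structure (normally loaded with json.load).
tiny = {
    "images": [{"id": 1, "file_name": "example.jpg",
                "width": 640, "height": 480}],
    "categories": [{"id": 18, "name": "dog"}],
    "annotations": [{"id": 7, "image_id": 1, "category_id": 18,
                     "bbox": [100, 50, 200, 150]}],
}
result = boxes_by_image(tiny)
```

For real work the official API is usually the better choice, since it also handles segmentation masks and evaluation.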
12. Embrapa Wine Grape Instance Segmentation Dataset
This agricultural computer vision dataset aims to provide images and annotations for research into object recognition and instance segmentation in viticulture, for image-based monitoring and field robots. It includes examples of five distinct grape varieties photographed in the field. These instances illustrate the variation in grape position, lighting, focus, and genetic and phenological characteristics such as shape, color, and compactness. The dataset includes 300 images with 4,432 grape clusters identified using bounding boxes. Binary masks that identify the pixels of each cluster are included for a subset of 137 images.
13. Bosch Small Traffic Lights Dataset
When developing an automated driving vehicle for urban environments, it is crucial that the computer vision model performs well at vision-only traffic light detection and tracking. This dataset comprises 13,427 camera images with a resolution of 1280×720 pixels and around 24,000 annotated traffic lights. The annotations contain traffic light bounding boxes along with the current state (active light) of each traffic signal.
The camera images are provided as raw 12-bit HDR images shot with a red-clear-clear-blue filter and as reconstructed 8-bit RGB color images. The RGB images are available for troubleshooting and training purposes. It’s vital to remember that this dataset was produced to test traffic light detection methods; it’s not meant to cover all eventualities and shouldn’t be used in production.
14. Google Open Images
This open-source computer vision dataset from Google is a collection of about 9 million image URLs, with the images annotated with labels spanning over 6,000 categories. It has 16 million bounding boxes for 600 object classes on 1.9 million images, making it the biggest collection with object location annotations currently available. The boxes were mostly drawn by hand by skilled annotators to ensure accuracy and consistency. The images are diverse and frequently show complex scenes with several objects (8.3 per image on average).
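Bounding boxes in the Open Images annotation CSV files are stored as coordinates normalized to the range 0 to 1, so they have to be scaled by the image size before drawing or cropping. The sketch below assumes the column names `XMin`, `XMax`, `YMin`, and `YMax` from the published box files and shows only a subset of the columns. The helper name `to_pixel_box`, the sample row, and the image size are hypothetical.

```python
import csv
import io


def to_pixel_box(row, width, height):
    """Scale normalized XMin/XMax/YMin/YMax values to pixel coordinates."""
    return (
        float(row["XMin"]) * width,
        float(row["YMin"]) * height,
        float(row["XMax"]) * width,
        float(row["YMax"]) * height,
    )


# Synthetic CSV snippet; real files contain additional columns.
sample_csv = (
    "ImageID,LabelName,XMin,XMax,YMin,YMax\n"
    "example_id,/m/example,0.25,0.75,0.5,1.0\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
pixel_box = to_pixel_box(rows[0], width=1024, height=768)
```

Note the column order in the file (both x values first, then both y values), which is easy to mix up when converting to the more common (xmin, ymin, xmax, ymax) tuple.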
It also has visual relationship annotations, which show pairings of items in certain relationships (e.g., “woman playing guitar,” “beer on the table”), object qualities (e.g., “table is wooden”), and human behaviors (e.g., “woman is jumping”). It comprises 3.3 million annotations from 1,466 different relationship triplets in all.
15. Waymo Open Dataset
The Waymo Open Dataset is one of the most extensive and diverse multimodal autonomous driving datasets released to date. It includes images from several high-resolution cameras and point clouds from several high-quality LiDAR sensors, as well as 12 million LiDAR box annotations and around 12 million camera box annotations. It has 798 video sequences for training, 202 for validation, and 150 for testing, each of which lasts 20 seconds. Each video sequence has five views, and each camera captures 171-200 frames at an image resolution of 1920 × 1280 pixels or 1920 × 886 pixels.
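A quick back-of-the-envelope calculation puts the split sizes above in perspective. The sequence counts and the 20-second duration come from the text. The 10 Hz frame rate is an assumption (it matches the upper bound of about 200 frames per camera per sequence), and all variable names are illustrative.

```python
SEGMENT_SECONDS = 20
SPLITS = {"training": 798, "validation": 202, "testing": 150}
FRAME_RATE_HZ = 10   # assumption: sensors are sampled at 10 Hz

# Total number of sequences across all three splits.
total_segments = sum(SPLITS.values())

# Total recorded driving time, in hours.
total_hours = total_segments * SEGMENT_SECONDS / 3600

# Maximum frames per camera in one sequence under the 10 Hz assumption.
frames_per_camera = SEGMENT_SECONDS * FRAME_RATE_HZ

# Share of sequences in each split.
split_share = {name: count / total_segments for name, count in SPLITS.items()}
```

Under these assumptions the dataset contains 1,150 sequences, or a little under six and a half hours of driving, with roughly 69% of the sequences in the training split.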