Top computer vision datasets for manufacturing?

Top computer vision datasets for manufacturing?

Manufacturing Datasets for Enhanced AI in Industry

Image source

Artificial intelligence (AI) has dramatically changed the manufacturing industry, putting manufacturers in a position to increase productivity and optimize operations and processes. With the increasing complexity of production processes and the ever-growing volume of data generated by sensors and machines, manufacturers are turning to machine learning (ML) as a powerful tool to gain insights, optimize operations, and drive innovation.

Manufacturing datasets and machine learning models are the fundamental building blocks at the heart of AI solutions in the manufacturing industry. They are critical to the success of AI solutions. Gaps exist between manufacturing datasets and developing the ML models used in manufacturing AI solutions.

This article discusses the challenges the manufacturing industry faces with manufacturing datasets when implementing AI solutions and highlights a handful of high-quality, publicly-available manufacturing datasets that can serve as a foundation for developing manufacturing AI solutions.  

Manufacturing Datasets: Demystifying the use of AI in the manufacturing industry

To address manufacturing inefficiencies, AI solutions analyze large amounts of data and identify patterns to improve the quality and efficiency of operations and processes. Unfortunately, manufacturing data isn’t readily available. Manufacturers usually have to build their datasets from scratch, which typically involves a tedious, time-consuming, and manual process of taking sensor readings from running machines and preprocessing them afterward. 

Different production lines and manufacturing industries require different datasets to develop their solutions, which is normal with AI solutions. However, in most cases, manufacturers are highly confidential with their data, so there aren’t many publicly available manufacturing datasets. Some manufacturers also hesitate to share data due to privacy and security concerns and the fear of giving competitors an advantage.

These significant drawbacks of manufacturing datasets discourage the ability to rapidly scale or develop ML models for manufacturing AI solutions. Other manufacturing dataset problems faced with implementing an AI solution in the manufacturing ecosystem include the following:

  • Access to quality data: Companies find it difficult to store manufacturing data or to come across data that matches their AI use case, and when this data is encountered, it tends to be private or expensive to acquire.
  • Shortage of AI talents: Not enough AI talents are invested in the industry. The manufacturing industry needs experts who don’t just understand AI but can translate their domain knowledge to build powerful manufacturing AI models, and these experts are hard to find.
  • Edge Deployments: The industry needs AI models that can be pushed into production. This is the major problem impeding the spread of AI in the manufacturing industry. Many models are built that cannot be pushed into production or make real-time predictions because of some constraints on edge devices. 
  • Lack of continuous improvements: This is another issue facing the acceptance of AI in the manufacturing industry. An AI model learns from the data fed into it, and once it is deployed, it stops learning and would most likely perform poorly on new data (that is, data that is different from the data used during training). AI models cannot automatically learn from new data once deployed; they must be retrained. This causes companies to accrue additional maintenance costs to maintain the model's efficiency. This flaw is gradually eliminated with MLOps due to CI/CD/CT integration. 
  • Lack of standardization: The absence of standardization in manufacturing datasets poses a significant challenge for AI development within the factory environment. This issue arises from the diverse use of formats, units, and definitions across datasets, creating a state of data disarray. Consequently, the lack of standardization has several detrimental effects, such as data integration conflicts, misleading AI models, and time-consuming detours for data scientists.

Despite these challenges with manufacturing datasets, the potential benefits of using machine learning in manufacturing are vast. By understanding and addressing the issues associated with manufacturing datasets, developers can build robust and reliable solutions that improve efficiency, quality, and safety in production processes. some solutions include:

  • Employ robust data cleaning and pre-processing techniques.
  • Promote strong collaboration between data scientists and domain experts to enhance domain-specific knowledge and insights
  • Promote data democratization by dismantling data silos and implementing cohesive data governance practices.
  • Initiative ways to generate realistic synthetic data to augment existing datasets and fill in missing pieces. 
  • Implement active learning techniques and empower the AI model to identify areas needing additional data autonomously.
  • Utilize transfer learning techniques by leveraging pre-trained models from similar domains to expedite AI development, focusing on constructing a subset of manufacturing AI models instead of a singular golden model.

Notwithstanding, AI has many application areas in the manufacturing industry, including anomaly detection, remote monitoring of facilities, production automation, supply chain optimization, demand forecasting, energy management, etc.

The right data has to be applied to build robust models that automate tedious industrial processes. Manufacturing data can be any data or information gathered during the manufacturing of any good or material. This could be production, quality, machine, or energy consumption data. 

Common Manufacturing datasets

Manufacturing data can contain various data types and come in multiple formats. Tabular formats and images are two of the most common formats for manufacturing datasets. If you’re looking to build scalable AI solutions that can be applied in manufacturing industries, here are some quality, publicly available manufacturing datasets you can use. These datasets involve common manufacturing processes or companies.

Visual Anomaly Dataset (VisA)

source: papperswithcode

VisA contains 12 classes of electrical boards and instruments. There are 10,821 images, of which 1,200 are anomalous and 10,821 are not. Anomalies and defects are unwanted in materials, especially circuit boards and electronic equipment, as they render the devices useless. 

This manufacturing dataset can be used to automatically detect which manufactured equipment is defective and which is not. This relieves industries from having to test every product they produce manually. The dataset contains subsets of PCBs, cashews, chewing gums, macaroni, capsules, and so on. It is a vast dataset that can be used for various use cases.

MVTEC Anomaly Dataset (MVTecAD)

source: mvtec

MVTecAD contains 5,000 high-resolution images that can be used for benchmarking anomaly detection in industrial inspection. These images are further divided into 15 object and texture categories, where each category comprises a set of defect-free training images and a test dataset that contains defect-free images mixed with defective images.

With this type of test dataset, it will be possible to test the model's accuracy in identifying defective images in a collection of images. Two metrics can be used to evaluate this anomaly detection model trained on MVTecAD; detection AUROC and segmentation. The detection method outputs a single float per input test image, while the segmentation method outputs the anomaly probability for each pixel.

Personal Protective Equipment (PPE) Dataset

Image source

PPE contains images of personal protective equipment that factory workers use. To ensure the safety of people and equipment in the industry, workers must put on certain PPE, such as goggles, overalls, and so on. Industrial supervisors will not always be available to inspect workers. Still, it is possible to train a computer vision model to detect whether or not workers are putting on their PPE.This dataset contains 11,978 images, which are subdivided into 12 unique classes. These are:

  • goggles
  • helmet
  • mask
  • no-suit
  • no_goggles
  • no_helmet
  • no_mask
  • no_shoes
  • shoes
  • suit
  • no_glove
  • glove

Casting Product Image Data for Quality Inspection

Image source

The casting product image data contains images of product images before casting. This data collection aims to enable data scientists and machine learning engineers to build models that can detect faults in product images before they are cast. Faults in products are undesirable, as they can cause the product to be faulty or less appealing to the user.

A casting defect is an undesired irregularity in a metal casting process. There are many types of defects in casting, like blow holes, pinholes, burrs, shrinkage defects, mold material defects, pouring metal defects, metallurgical defects, etc. The dataset contains 7348 grey-scaled images with a dimension of 300 by 300. These images are divided into “Defective” and “Ok” sub-categories.

Synthetic Corrosion Dataset

Image source

The synthetic corrosion dataset contains images of corroded pipes. Corroded pipes are unwanted in manufacturing industries as they have negative environmental impacts, cause quality control issues, and cause production loss. This dataset can be used to train a detective computer vision model that can detect when a pipe has corroded and flag it appropriately. This dataset contains 76 images of corroded pipes and metals that can be used for training and validation.

LeakDB (Leakage Diagnosis Benchmark)

LeakDB is a realistic leakage dataset for water distribution networks. The dataset is comprised of a large number of artificially created but realistic leakage scenarios on different water distribution networks under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.

Leakages are undesirable in industries as they cause companies to lose valuable resources. This dataset can be applied in a computer vision model that is capable of automatically detecting leakages in industrial pipes and flagging them as required. The dataset can be found here.

Water Bottle Image Classification Dataset

Image source

The water bottle image classification dataset contains images of bottles partially or fully filled with water. This dataset can be used to train a quality control computer vision model that can detect which bottles are not full in a production plant. The images are classified into three classes: full water level, half water level, and overflowing. This dataset can be used to train a machine-learning model that can be used to develop automated systems for monitoring and managing liquid levels in containers. 

Another good application of this dataset could be to classify and separate water bottles based on their water levels, which can help streamline the manufacturing process in industries.


Despite facing challenges, the manufacturing industry is increasingly embracing a shift towards a culture of open data sharing and engaging in collaborative AI development. The potential benefits of collaboration, the emergence of secure data-sharing platforms, and mounting pressure to adopt AI are just a few factors motivating the emerging cultural trend. 

Various collaborative efforts within the industry exemplify this trend. The Industrial Internet Consortium (IIC) is globally promoting the use of data and AI in manufacturing by developing standards and best practices. The Manufacturing Leadership Council (MLC), comprising manufacturing executives, actively engages in initiatives focused on data sharing and collaboration to advance AI adoption. Major players like Siemens and SAP are also contributing to the collaborative landscape.

The overall trajectory points to a promising future for AI in manufacturing. With increased openness and a unified data landscape, AI is expected to foster the emergence of more innovative and powerful AI solutions.

Start managing your AI data the right way.

Request a demo

Recommended for you:

french language
english language