Abstract:
Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt that describes anomalies, and utilizing the same models across many products, would save significant effort during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark data.
Promising ideas for prompting during quality inspection are, e.g., a description of the normal state of the product or a description of the visual or physical properties of the defects. The reduced labeling effort, combined with the additional opportunity to integrate domain expert knowledge via text input, would enable easier scaling across several products, i.e., with significantly lower effort per product. These seemingly clear advantages over state-of-the-art (SOTA) approaches motivate the question of how suitable FMs are for image-based quality inspection tasks. To close these gaps for industrial use cases, we analyze the applicability of various recent FMs on real-world industrial data in this work. The achieved performance is compared to the performance of the same models on a public dataset.
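To make the prompt-based inspection idea concrete, the decision rule of a zero-shot FM can be sketched as follows: the model embeds the image and each candidate text prompt into a shared space, and the prompt with the highest cosine similarity determines the verdict. The sketch below mocks the embeddings with random vectors; in practice they would come from the image and text encoders of an FM such as CLIP. All prompt wordings and helper names here are illustrative assumptions, not taken from this work.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image_emb: np.ndarray, prompt_embs: dict) -> str:
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(prompt_embs, key=lambda p: cosine_sim(image_emb, prompt_embs[p]))

# Mock embeddings standing in for a real FM's encoder outputs.
rng = np.random.default_rng(0)
normal_emb = rng.normal(size=64)
prompt_embs = {
    "a photo of a flawless metal surface": normal_emb,
    "a photo of a scratched metal surface": normal_emb + rng.normal(scale=0.8, size=64),
}

# Simulate an image whose embedding lies close to the defect prompt.
image_emb = prompt_embs["a photo of a scratched metal surface"] + rng.normal(scale=0.1, size=64)
label = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

Note that the two prompts encode exactly the kinds of descriptions mentioned above: the normal state of the product versus the visual properties of a defect.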
The main objective during quality inspection is the distinction between defective and defect-free products. A classification model can perform such an inspection task with minimal setup. However, it might not fully cover all desired functionalities: in some cases, the classification of an AI model is re-checked by a human operator. Furthermore, a high level of explainability is preferred during model monitoring. A model that outputs a full segmentation mask, as opposed to only a single class label, makes both manual re-checking and model monitoring easier. As such, we include both classification and segmentation models in our study.
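The relationship between the two output types can be illustrated with a small sketch: a segmentation mask can always be collapsed into a part-level classification, while retaining the per-pixel localization that supports operator re-checks and monitoring. The function name and the pixel threshold below are hypothetical choices for illustration, not part of the evaluated pipelines.

```python
import numpy as np

def mask_to_verdict(mask: np.ndarray, pixel_threshold: int = 25):
    """Collapse a binary per-pixel defect mask into a part-level verdict.

    The mask itself stays available, so an operator can inspect *where*
    the model localized the defect instead of trusting a bare class label.
    The threshold is an assumed, application-specific tuning parameter.
    """
    defect_pixels = int(mask.sum())
    verdict = "defective" if defect_pixels >= pixel_threshold else "defect-free"
    return verdict, defect_pixels

# Example: a predicted mask with a 10x10 defect region.
mask = np.zeros((128, 128), dtype=np.uint8)
mask[40:50, 60:70] = 1
verdict, area = mask_to_verdict(mask)
print(verdict, area)  # → defective 100
```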
The remainder of this work is structured as follows: The following section contains an overview of related work regarding, e.g., FMs and SOTA pipelines for image segmentation. The datasets that we use for benchmarking are introduced in Section 3. The setup of our experiments is described in Section 4, and the respective results are shown in Section 5. Section 6 contains our interpretation of the results and their potential implications.