Teaching computers to understand what they see is the problem that keeps computer vision engineers awake at night. Although the field of Image Recognition has made a lot of progress over the past few years, many puzzle pieces are still missing before we have a complete and clear picture of how to teach machines to make sense of what they see.
For a long time, Image Classification was not treated as a statistical problem. That changed when a partial solution came from the Machine Learning field in the form of Neural Networks, in particular Convolutional Neural Networks (CNNs). A CNN is a special type of Artificial Neural Network that can deliver human-like results in image classification tasks.
This article explains how the human brain reconstructs the visual world, how machines learn to understand visuals and what the applications of Image Classification are.
Image Recognition: How hard can it be?
Image Recognition is the process of identifying what an image depicts.
For humans, interpreting the visual world comes easily. When humans see something, there is an inherent understanding of what it is. In most cases, there is no need for a conscious study of the object to make sense of it.
However, for computers it is a challenging task because they can only manipulate numbers. For example, to a computer, a 3×3-pixel patch on Albert Einstein’s forehead is a 3×3×3 array of numbers: one 3×3 grid of intensity values for each of the primary colors red, green, and blue.
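To make this concrete, here is a minimal Python sketch of how such a 3×3 patch could be stored and split into its color channels. The pixel values are invented for illustration, and the helper name `split_channels` is hypothetical rather than part of any library.

```python
# A 3x3 RGB patch as nested lists: patch[row][col] = (red, green, blue).
# The values are made up for illustration; each is an intensity in 0..255.
patch = [
    [(201, 172, 150), (205, 176, 154), (198, 170, 149)],
    [(203, 174, 152), (210, 181, 158), (200, 171, 150)],
    [(199, 169, 148), (204, 175, 153), (197, 168, 147)],
]


def split_channels(patch):
    """Return three 3x3 grids, one per color channel (red, green, blue)."""
    return [
        [[pixel[c] for pixel in row] for row in patch]
        for c in range(3)
    ]


red, green, blue = split_channels(patch)

# Count every number the computer has to handle for this tiny patch.
total_numbers = sum(len(row) for channel in (red, green, blue) for row in channel)

print(red[0])         # first row of the red channel
print(total_numbers)  # 3 rows x 3 columns x 3 channels
```

Even this tiny patch is described by 27 numbers, and a full photograph contains millions of them, which is why the computer needs a systematic way to turn numbers into meaning.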
Even though humans can interpret images in a fraction of a second, a complex cognitive process takes place in the visual cortex of the brain to make that possible.
The visual cortex is divided into layers (V1-V8), and it processes visual information coming from the eyes. When a stimulus is present in the receptive field, its representation first reaches the V1 layer; in other words, the neurons in the V1 layer fire first. This layer is a map that preserves the spatial information of the stimulus in the world and also detects its edges. The V1 layer is strongly connected to the V2 layer, which in turn is involved in discriminating shapes, orientations, colors, and other low-level features. Higher-level visual features involve the brain’s understanding of the context and relationships within the image and are only perceived in the higher layers, such as V6-V8.
Let’s say the perceived stimulus is your dad. The initial detection of the stimulus happens in layer V1, but the semantic information (recognizing that this is your dad) is only perceived in layers V6-V8.
It is important to stress that the exact role of each layer remains a matter of debate, as research keeps bringing new discoveries. However, it is generally accepted that the higher the layer, the more abstract the representation becomes.
Apart from this high-level architecture, the mechanics of individual neurons have also been used to simulate the processes in the visual cortex layers. In particular, each neuron receives input through its dendrites, applies a complex non-linearity to that input, and fires if the summed result exceeds some threshold. Although this explanation is very simplified, it was enough for researchers to invent the first Artificial Neural Network.
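The firing rule described above can be sketched in a few lines of Python. This is a deliberately simplified threshold unit, not necessarily the exact model used in the original research: the function name `neuron_fires`, the weights, and the threshold are all assumptions chosen for illustration.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold


# Hypothetical example: three input signals with hand-picked weights.
# A negative weight plays the role of an inhibitory connection.
weights = [0.5, -0.2, 0.8]

print(neuron_fires([1.0, 0.0, 1.0], weights, threshold=1.0))  # sum is 1.3
print(neuron_fires([0.0, 1.0, 0.0], weights, threshold=1.0))  # sum is -0.2
```

The first input pattern pushes the sum above the threshold and the neuron fires; the second does not. An artificial neural network connects many such units and adjusts the weights during training instead of setting them by hand.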
Inspired by the human visual system, engineers tried to replicate this process with machines. To enable computers to understand objects, it was necessary to create a system that would extract high-level features from visual “stimuli” using only numerical manipulations. That’s where Convolutional Neural Nets come into play. When fed with enough clean and well-defined data, a CNN can extract high-level features that are common to each category the data encompasses.
How does a CNN work in Image Recognition?
The representations learned by a CNN are similar to how the layers of the human visual system represent visual information: the first convolutional layers extract low-level features, such as edges and blobs, and the last layers assign semantic meaning to the image.
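As a rough illustration of what an early convolutional layer computes, the Python sketch below slides a small filter over a toy grayscale image. The filter here is a hand-written, Sobel-like vertical-edge kernel; in a real CNN the filter weights are learned from data rather than chosen by hand. The function name `convolve2d` and the image values are assumptions made for this example, and the operation is the unflipped "convolution" (cross-correlation) commonly used in deep learning.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Returns the response map as a list of rows.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4x6 grayscale image: dark on the left (0), bright on the right (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-written vertical-edge kernel; a CNN would learn such weights itself.
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = convolve2d(image, kernel)
for row in response:
    print(row)
```

The response is zero where the image is uniform and large where the dark region meets the bright one, so the filter acts as an edge detector. Deeper layers combine many such responses into progressively more abstract features.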
All in all, image classification for a computer translates into the problem of identifying common features by “looking” at the numbers and doing mathematical manipulations to find a function (i.e. a model) that can generalize to unseen data.
The state-of-the-art performance of Convolutional Neural Nets in image classification tasks can match human performance, but only if the following conditions are met: plenty of data is provided (gigabytes), enough training time is allotted, and an appropriate neural network architecture is in place.
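The idea of finding a function that generalizes can be shown with a toy example. The sketch below is not a CNN: it uses a much simpler nearest-centroid classifier, chosen only because it fits in a few lines, and the data, labels, and function names (`centroid`, `fit`, `predict`) are all invented for illustration. The point is the workflow: fit a model on labeled examples, then measure it on examples it has never seen.

```python
def centroid(samples):
    """Average the samples feature by feature."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]


def fit(train):
    """'Training': compute one centroid per label from labeled examples."""
    by_label = {}
    for pixels, label in train:
        by_label.setdefault(label, []).append(pixels)
    return {label: centroid(samples) for label, samples in by_label.items()}


def predict(model, pixels):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((p - q) ** 2 for p, q in zip(pixels, c))
    return min(model, key=lambda label: dist(model[label]))


# Invented toy data: flattened 2x2 grayscale "images" labeled dark or bright.
train = [
    ([10, 20, 15, 5], "dark"),
    ([30, 25, 20, 35], "dark"),
    ([220, 240, 230, 250], "bright"),
    ([200, 210, 190, 205], "bright"),
]
# Held-out examples that were not used to fit the model.
unseen = [([40, 50, 45, 30], "dark"), ([180, 200, 210, 190], "bright")]

model = fit(train)
correct = sum(predict(model, pixels) == label for pixels, label in unseen)
print(f"accuracy on unseen data: {correct}/{len(unseen)}")
```

A CNN follows the same pattern at a much larger scale: the "function" has millions of learned parameters, and its quality is judged by how well it labels images that were held out of training.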
Implementing CNN in ProcessMaker IDP for Image Classification
ProcessMaker IDP is about smart content management, and Image Recognition is part of a larger chain of Machine Learning solutions that we offer. Although there are a number of open APIs available for gaining insights from images, ProcessMaker IDP develops its own classifier to protect clients’ sensitive data. Using open image classification services, such as those offered by Google, implies sharing clients’ data with third parties. While that is not an issue if you need to classify images of cats and dogs, it is a compliance problem when IDs and credit cards need to be classified.
Maintaining a high level of security while keeping the ML classifier accurate comes with some challenges. Since ProcessMaker IDP does not have large image libraries, it relies heavily on clients’ data or open-source datasets, which are usually not ready for direct use. Cleaning and manually labeling them requires a lot of time.
Another challenge is finding the right architecture. In most cases, building an in-house architecture is more efficient. However, if we don’t have enough data, then using a pre-trained architecture is a better option.
By investing in dedicated hardware, we were able to overcome these challenges and significantly reduce the time required to train a model.