Clustering: What is it and how can it help you in big data management?
How do you manage large data volumes at your organization?
Analysing them manually is inefficient, and it’s not always easy to pick out the most important facts or find the information you need. Cluster analysis can help you reduce manual work and reveal hidden value in your data.
Cluster analysis is part of a wide range of Machine Learning technologies that help to uncover hidden structures found in big data sets and group data elements with similar characteristics together.
In this article, we explain how clustering works and walk you through some use cases.
Let’s start with the basic concept
You don’t have to label an apple as an apple to see that it is different from an orange. With clustering, we try to recognize groups of similar objects without having a human label those objects. This has many advantages and also raises some interesting questions.
For example, we see that an apple and an orange are different. However, if we compare them with a potato, we might conclude that an apple and an orange are quite similar because they are both fruits.
The need for clustering comes from the size of the datasets that have become available.
If I want to compare ten types of fruit, I can compare them all relatively easily. However, if I want to analyze a dataset the size of Wikipedia, I will have to make 31 trillion comparisons. Even if a single comparison takes just 1 microsecond, it will still take almost a year (about 359 days) to execute the comparisons one after another. Therefore, it is useful to split these sets up into smaller chunks that are easier to process. We can use clustering to split the dataset up in a sensible way without having a human look at it.
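The arithmetic behind this is easy to check. The short sketch below takes the two figures quoted above (31 trillion comparisons, 1 microsecond each) and converts them into days. It also estimates how much work is left if items are only compared within their own chunk. The function names and the assumption of equally sized chunks are ours, added purely for illustration.

```python
# Figures taken from the text: 31 trillion comparisons at 1 microsecond each.
COMPARISONS = 31 * 10**12
SECONDS_PER_COMPARISON = 1e-6
SECONDS_PER_DAY = 86_400


def days_needed(comparisons: float) -> float:
    """Return the days needed to run the comparisons one after another."""
    return comparisons * SECONDS_PER_COMPARISON / SECONDS_PER_DAY


def comparisons_after_split(total_comparisons: float, chunks: int) -> float:
    """Estimate the comparisons left when items are only compared within a chunk.

    Assumption: the chunks are equally sized. Each chunk then holds 1/chunks of
    the items and needs about 1/chunks**2 of the comparisons, and there are
    `chunks` of them, so the total shrinks by a factor of `chunks`.
    """
    return total_comparisons / chunks


if __name__ == "__main__":
    print(f"all pairs: {days_needed(COMPARISONS):.0f} days")
    for chunks in (10, 100, 1000):
        remaining = comparisons_after_split(COMPARISONS, chunks)
        print(f"{chunks:>5} chunks: {days_needed(remaining):.1f} days")
```

Real clusters are rarely equal in size, so treat the per-chunk numbers as a rough lower bound on the work rather than a prediction.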
A powerful approach to clustering is to order the data into general groups and then, within these groups, make smaller, more specific groups. This allows the Machine Learner to extract both a general overview and a more detailed structure of the data. This is called hierarchical clustering, and it is a useful tool for gaining insight into large datasets.
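To make the idea of groups within groups concrete, here is a minimal sketch of one common variant, agglomerative (bottom-up) clustering with single linkage. It starts with every value in its own cluster and keeps merging the two closest clusters. Stopping early gives the specific groups, and stopping later gives the general ones. The measurements are made-up numbers, and this is an illustration of the technique in general, not the method used in any particular product.

```python
from itertools import combinations

# Hypothetical one-dimensional measurements, invented for this example.
VALUES = [1.0, 1.2, 5.0, 5.3, 20.0, 20.5, 26.0, 26.4]


def single_link_distance(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(values, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


if __name__ == "__main__":
    print("general groups: ", agglomerate(VALUES, 2))
    print("specific groups:", agglomerate(VALUES, 4))
```

Because each step only merges existing clusters, every specific group sits entirely inside one general group, which is exactly the nested structure described above. This naive version compares all cluster pairs at every step, so it only suits small inputs.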
Types of clustering
Different clustering algorithms work best for different types of data. Some data contains natural subgroups. Other data might follow a normal distribution, so an algorithm that assumes this works best.
For big data, it is also important to keep in mind that some algorithms work more efficiently for certain distributions of data. If you want to cluster cats by the length of their tails, then an algorithm that is designed for continuous data works best, since the length can be any value within a certain range. If you want to cluster stars, a different algorithm might work better.
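As an illustration of the cat example, the sketch below runs plain k-means (Lloyd's algorithm) on a single continuous feature. K-means is one algorithm commonly used for continuous numeric data. It is our choice for this example, not a recommendation from the text. The tail lengths are invented, and the starting centroids are picked deterministically so the result is reproducible.

```python
# Hypothetical tail lengths in centimetres, invented for this example.
TAIL_LENGTHS_CM = [4.5, 5.0, 5.5, 24.0, 25.5, 26.0, 27.0, 30.5]


def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with plain k-means (Lloyd's algorithm)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Assignment step: each value joins the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        updated = [sum(g) / len(g) if g else centroids[c] for c, g in enumerate(groups)]
        if updated == centroids:
            break  # Nothing moved, so the clustering has converged.
        centroids = updated
    return centroids, groups


if __name__ == "__main__":
    centers, clusters = kmeans_1d(TAIL_LENGTHS_CM, k=2)
    for center, members in zip(centers, clusters):
        print(f"centre {center:.1f} cm: {members}")
```

The algorithm relies on averaging values, which only makes sense for continuous measurements such as lengths. That is why a different kind of data may call for a different algorithm.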
Text has its own twists when it comes to clustering.
There are millions of different keywords (every proper name can be considered a keyword), but each document contains only a few of them. As a result, most pairs of documents have hardly any keywords in common, so it is hard to find two documents that show any similarity at all. This poses some challenges when designing an efficient Machine Learner.
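A tiny example shows how quickly this sparsity appears. The sketch below represents each document as a set of keywords and scores every pair with Jaccard similarity (shared keywords divided by all keywords in either document). The documents, their keywords, and the choice of Jaccard similarity are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical documents, each reduced to a handful of keywords.
DOCS = {
    "doc_a": {"contract", "employee", "salary"},
    "doc_b": {"contract", "freelancer", "invoice"},
    "doc_c": {"apple", "orange", "harvest"},
    "doc_d": {"telescope", "star", "orbit"},
}


def jaccard(a: set, b: set) -> float:
    """Shared keywords divided by all keywords that occur in either document."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_similarity(docs):
    """Return the Jaccard similarity for every pair of document names."""
    return {
        (x, y): jaccard(docs[x], docs[y])
        for x, y in combinations(sorted(docs), 2)
    }


if __name__ == "__main__":
    for pair, score in pairwise_similarity(DOCS).items():
        print(pair, round(score, 2))
```

Only one of the six pairs shares a keyword at all. With millions of possible keywords and far more documents, the share of pairs with zero overlap becomes even larger.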
There are two approaches to tackle this problem. We can either use our knowledge of linguistics to get a better grip on the dataset, or we can design an algorithm with low computational complexity specifically for clustering text documents.
How does ProcessMaker IDP use clustering in content management?
For ProcessMaker IDP, we combined our expertise in computational linguistics, mathematics and software implementation to provide state-of-the-art clustering for documents.
However, this still does not take us all the way to the quality that we want to offer to our clients. No matter how good our AI is, there will always be cases where a human expert is needed to make an accurate decision. With this in mind, we can create a Machine Learner that cooperates with a human expert, so that the Machine Learner does the bulk of the work and the human expert only has to evaluate the cases where their expertise is most valuable.
Some features of documents are easier for the Machine Learner to recognize than others.
For example, if two documents contain the word ‘contract’, they probably belong to the same cluster together with other contracts. However, it is not always so obvious. Is a temporary contract more similar to a permanent contract or to a freelancer contract? This is not obvious to the Machine Learner, and it is a good opportunity to ask a human expert for help.
How to teach a Machine Learner?
Your grocer has domain expertise and knows where to put his veggies and fruits. How can we teach this to a Machine Learner?
The Machine Learner is aware of the distance between clusters and the cohesion within clusters. Based on this, it can give an indication of confidence for each cluster. A human expert can then guide the Machine Learner by indicating the best decisions for the few clusters with the lowest confidence. This improves the quality of the resulting clusters, which will closely match the natural grouping the human expert would give, without the need for too many interactions with the human expert. This combination of Machine Learner and human expert is part of interactive learning.
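One way to turn distance and cohesion into a confidence value is sketched below. For each cluster, it compares how far the members lie from their own centre (cohesion) with how far that centre is from the nearest other cluster (separation), and then picks the lowest-scoring clusters to show to the expert. The scoring formula is a simplified, silhouette-style measure chosen for illustration. The cluster names, the coordinates, and the function names are all hypothetical and do not describe how ProcessMaker IDP computes confidence.

```python
from math import dist

# Hypothetical clusters of documents, each document reduced to a 2-D point.
CLUSTERS = {
    "permanent": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "temporary": [(2, 0), (2, 2), (4, 0), (4, 2)],
    "invoices": [(10, 10), (10, 11), (11, 10), (11, 11)],
}


def centroid(points):
    """Mean position of a list of coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def cluster_confidence(clusters):
    """Score each cluster from -1 to 1; higher means tight and well separated.

    Needs at least two clusters, because separation is measured against
    the nearest other cluster.
    """
    centers = {name: centroid(pts) for name, pts in clusters.items()}
    scores = {}
    for name, pts in clusters.items():
        # Cohesion: average distance from the members to their own centre.
        cohesion = sum(dist(p, centers[name]) for p in pts) / len(pts)
        # Separation: distance from this centre to the nearest other centre.
        separation = min(
            dist(centers[name], c) for other, c in centers.items() if other != name
        )
        scale = max(separation, cohesion)
        scores[name] = (separation - cohesion) / scale if scale else 0.0
    return scores


def ask_expert_about(clusters, how_many=1):
    """Return the names of the clusters with the lowest confidence."""
    scores = cluster_confidence(clusters)
    return sorted(scores, key=scores.get)[:how_many]


if __name__ == "__main__":
    print(cluster_confidence(CLUSTERS))
    print("review first:", ask_expert_about(CLUSTERS))
```

In this toy data, the loosely packed cluster that sits close to a neighbour scores lowest, so it is the one the expert would be asked to review first. The well-separated clusters need no attention.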
Combining powerful clustering algorithms with human expertise results in maximum accuracy with minimum effort.
With the right approach, we can let the Machine Learner tell apples and oranges apart but group them together when comparing with potatoes. Now, where would you put a tomato?