Discovering knowledge with KDD

Aditya Kadam · Published on: June 27, 2022
Every day we deal with loads of data. We read things on social media, in magazines, discussions, forums and much more. Every business tries to utilise its data as efficiently as possible. It would be safe to say that data has become like oxygen for business.
Gathering data might not be challenging, but storing it, segregating and cleaning it, and discovering the important and useful insights in it is a challenge in itself. In simple language, this entire process can be termed Knowledge Discovery in Databases (KDD). Data mining is one of the core steps of the KDD process.
KDD goes by various names: some call it knowledge mining, some call it business intelligence and some call it pattern analysis.
Now let's try to break down this entire process and see each component.
Let's make it concrete and assume that we have a company, ABC, which has built a SaaS headless CMS app. We want to create a solid marketing campaign to drive more customer registrations and sales. To begin with, assume that we have tons of data from social media. It could be tweets, images, articles, posts, transcripts and whatnot. Our aim is to identify useful patterns in this unstructured raw data to create a robust marketing campaign for our business. Please note that this is just an example and not a 100% accurate marketing or business plan. Now let's refer to the diagram below and walk through each step towards our end goal.
We begin with tons of raw data, some of it useful and some of it not. The first step is to select the data relevant to our end goal. Since our app is a headless CMS, our target market will most likely be developers, so let's say we keep only tweets or posts written by developers. In short, we select the relevant data (data related to developers) and discard everything that is not required for this campaign.
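To make the selection step concrete, here is a minimal Python sketch that keeps only developer-authored posts. The data shape (a list of dicts with an `author_bio` field) and the keyword heuristic are assumptions for illustration, not part of any real pipeline.

```python
# A selection sketch: keep only posts whose author looks like a developer.
# The post structure and the keyword list are illustrative assumptions.
posts = [
    {"author_bio": "Frontend developer, Vue fan", "text": "Trying a headless CMS"},
    {"author_bio": "Foodie and traveller", "text": "Best pizza in town!"},
    {"author_bio": "Backend engineer", "text": "REST vs GraphQL, thoughts?"},
]

DEV_KEYWORDS = {"developer", "engineer", "programmer", "devops"}

def is_developer(post):
    """Crude heuristic: does the author's bio mention a developer keyword?"""
    bio = post["author_bio"].lower()
    return any(keyword in bio for keyword in DEV_KEYWORDS)

# Keep developer posts, drop everything else
selected = [post for post in posts if is_developer(post)]
```

In a real pipeline the selection rule would be far richer (follower graphs, hashtags, bio parsing), but the shape of the step is the same: a predicate applied to every record.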
The selection is done, but our data is still raw. So we clean the selected data further and try to remove the outliers and noise. As the data comes from various sources, we also need to consolidate it into a single source. For example, let's say we have 200k tweets, 100k LinkedIn posts and 100k Facebook posts from various developers. We keep only the tweets and posts that are relevant to some tech stack (the JAMstack, say) or that are professional in tone, and we remove all the random tweets and posts. We can take it further by stripping tags (@xyz) from the tweets and posts, keeping the text/body only.
In short, we are removing the noise and outliers from our selected data and bringing it into one source, say one big file.
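A rough sketch of this cleaning step in Python, assuming each post is a plain string: strip @mentions (and URLs, a common companion artefact) with a regular expression, then merge the per-platform lists into one consolidated source.

```python
import re

# A cleaning sketch: remove @mentions and URLs, then consolidate the
# per-platform lists into one source. The sample texts are made up.
tweets = [
    "Loving @vuejs with a headless CMS https://t.co/abc",
    "@xyz check this out",
]
linkedin_posts = ["Shipped a JAMstack site this week"]
facebook_posts = ["Random weekend post"]

# Matches either an @mention or an http(s) URL
MENTION_OR_URL = re.compile(r"(@\w+|https?://\S+)")

def clean(text):
    """Strip mentions/URLs and surrounding whitespace, keep the body text."""
    return MENTION_OR_URL.sub("", text).strip()

# One consolidated "big file" of cleaned text from all sources
consolidated = [clean(t) for t in tweets + linkedin_posts + facebook_posts]
```

Real cleaning would also handle emoji, duplicate posts and language detection, but the pattern holds: normalise each record, then pool everything into one place.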
Now that we have cleaned the data, we need to flatten it, i.e. transform it into a row-and-column (tabular) format. The purpose of this step is to make it easy for our algorithm or tool to produce accurate results. The exact shape depends on which tool you are using, which algorithm you are going to run, and so on.
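As a sketch, the transformation step could flatten each cleaned post into a fixed set of columns. The feature columns chosen here (source, text length, framework mentions) are assumptions for illustration; a real pipeline would pick features to suit its algorithm.

```python
# A transformation sketch: flatten cleaned posts into rows and columns
# that a mining tool can consume. Column choices are illustrative.
cleaned_posts = [
    {"source": "twitter", "text": "Loving Vue with a headless CMS"},
    {"source": "linkedin", "text": "Shipped a JAMstack site this week"},
]

COLUMNS = ["source", "length", "mentions_vue", "mentions_react"]

def to_row(post):
    """Turn one post dict into a flat row matching COLUMNS."""
    text = post["text"].lower()
    return [post["source"], len(post["text"]), "vue" in text, "react" in text]

# Header row followed by one row per post
table = [COLUMNS] + [to_row(p) for p in cleaned_posts]
```

From here the table can be written out as CSV or loaded straight into whichever mining tool you use.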
With clean, transformed data in hand, we can feed it to the algorithm or tool of our choice. This gives us good insights and helps us identify previously unknown, interesting patterns. Most of the time the algorithm used in this process will be a classification algorithm, but that is not a hard and fast rule.
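As a toy stand-in for the mining step, the sketch below labels each post's sentiment towards the product by counting keyword hits. A real pipeline would train a proper classification algorithm on labelled data; the keyword lists and labels here are assumptions for illustration.

```python
# A toy mining sketch: label each post by counting positive/negative
# keyword hits. Keyword lists are illustrative assumptions, not a model.
POSITIVE = {"love", "easy", "free", "great"}
NEGATIVE = {"bug", "slow", "confusing", "expensive"}

def classify(text):
    """Score a post by keyword overlap and map the score to a label."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

posts = [
    "I love how easy this headless CMS is",
    "The editor feels slow and confusing",
]
labels = [classify(p) for p in posts]
```

Aggregating these labels across hundreds of thousands of posts is what surfaces the campaign-worthy patterns the next step evaluates.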
If all the above steps are done right, we now have interesting and useful patterns which will help us to create a solid marketing campaign. For example, let's say we have the following insights (please note these are just examples):
Based on these insights, we can now create a post as a part of our social media campaign. The post can have text which shows the app is easy to use, is free and there will be continuous support. Other posts could be tutorials on the headless CMS using Vue or React.
Again, the above is only an illustration, not a 100% accurate marketing plan. It simply shows how we can use the KDD process to extract knowledge from raw data and make smart business decisions.
In this article, I would like to mention two great tools which might be a good fit for you.
Aggua is a data-ops startup based in Israel that has created a great tool to help you with data management. It is suited for companies dealing with huge amounts of data (big data). The tool is mostly used for finding and defining data, understanding the connections between data, mapping upstream and downstream dependencies, performing impact analysis and enforcing data-regulation policies.
Another interesting tool is columns.ai. Columns is a Seattle-based startup. It has a free tier and is a great data visualisation tool. It lets you import data from various sources such as REST APIs, Google Sheets, CSV files and much more, then creates great charts and graphs based on your data which you can further customise. The best thing is that you can do all of this with zero programming knowledge.