The terms data mining and web scraping are often used interchangeably, but they actually don’t mean the same thing.
Although the two are closely related, and often used alongside each other, there are significant differences between web scraping and data mining techniques.
In this short article, we’ll explain what each process means, how it is performed, and what you can use it for. From there, we’ll walk through the key differences between web scraping and data mining.
At the end of this post, you should have a clear understanding of how data mining and web scraping are different. Ready to find out?
Web scraping explained
Let’s start with web scraping, because technically speaking, that comes before data mining. Other common terms used for web scraping are data harvesting or data scraping.
Web scraping is the process of extracting data from a web page. It is an automated process that uses a bot (web scraper).
Web scraping consists of a few basic steps:
- The bot visits a certain web page
- The bot fetches all or part of the content on that page
- The bot extracts the relevant data from that content
- The bot parses the data (i.e. converts the unstructured data into a pre-defined format)
- The bot copies the data into a database, cloud storage, or a file (e.g. an Excel sheet)
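The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production scraper. The sample HTML, the `product`, `name` and `price` class names, and the helpers `fetch`, `scrape` and `to_csv` are all made up for this example. The `fetch` function is defined but not called, so the sketch runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


def fetch(url):
    """Steps 1-2: visit the page and fetch its raw HTML."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class ProductParser(HTMLParser):
    """Step 3: extract the name and price text from each product item."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # class of the <span> currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Step 4: parse the extracted text into a pre-defined format."""
    parser = ProductParser()
    parser.feed(html)
    return [{"name": r["name"], "price": float(r["price"])} for r in parser.rows]


def to_csv(rows):
    """Step 5: copy the parsed rows into a structured format (CSV here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape(SAMPLE_HTML)
print(to_csv(rows))
```

To run the same sketch against a live page, you would pass the result of `fetch(url)` to `scrape` instead of `SAMPLE_HTML`, and adapt the class names to that page’s markup.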
Web scraping uses
Businesses use web scraping techniques in a variety of ways. The important thing to bear in mind is that in all these cases the emphasis is on the extraction of data. We’ll see why that’s important when we move on to explaining data mining.
Some common use cases of web scraping are:
- Competitive price monitoring – The extraction of competitors’ pricing data (e.g. on Google Shopping or Amazon) to gather insights to inform a company’s own pricing strategy.
- Comparison sites – These sites extract product data from all across the web and present it on their own site so you can easily compare product prices.
- News aggregators – Similar to comparison sites, news aggregators extract information from many different news sites (anything from scraping the sites of local newspapers to building their own Google News API) and combine it all on their own site.
- Lead generation – The extraction of all sorts of publicly available contact data, like company email addresses or phone numbers, from places like Google Maps. This information is then used for lead generation.
As you can see, all these use cases are centered around the extraction of data from one place and copying it to another. This doesn’t, however, involve any further processing of the data (beyond the parsing of data to change it into a more accessible format).
Data mining can be seen as the next step after this data extraction.
Data mining explained
Data mining is the process of analyzing large sets of data to identify patterns and trends. In many cases, this analysis is done with the help of machine learning, a branch of artificial intelligence.
As such, data mining is the science of working through vast amounts of data to extract knowledge from that data.
Data mining includes many different steps in trying to make sense of the data. These steps include:
- Data pre-processing
- Model and inference considerations
- Interestingness metrics
- Complexity considerations
- Post-processing of the found structures in the data
- Visualization of the data
As you can see, data mining is rooted in statistical and analytical disciplines.
Data mining uses
Over the years, data mining has been used in almost any area imaginable. As long as there is a large set of data, data mining techniques can try to process it to discover patterns.
Some common areas where data mining is used in business include:
- Market basket analysis – The data analyst looks at customers’ previous purchases, in particular which products tend to be bought together, to predict what they are likely to buy in the future.
- Sales forecasting – Aside from looking at what customers might buy in the future (basket analysis), data analysts try to predict when customers will buy based on sets of customer data.
- Database marketing – The data analyst will analyze large amounts of data about a certain type of buyer and will try to base the company’s marketing strategy on their predicted preferences.
And these are just three of the many possible applications. Note that each of these examples centers on a data analyst analyzing data, whereas the web scraping examples were all about extracting data.
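To make the idea of pattern discovery more concrete, here is a minimal sketch of one building block of market basket analysis: counting how often pairs of products appear in the same basket. The transactions are invented for illustration and `pair_support` is a hypothetical helper. Real analyses work on far larger data sets and usually rely on dedicated algorithms and libraries.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions: each set is one customer's basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def pair_support(baskets):
    """Return the share of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)

# List pairs from most to least frequent.
for pair, share in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, f"{share:.0%}")
```

In this toy data set, bread and butter show up together in three of the five baskets, which is the kind of pattern an analyst might then use for recommendations or promotions.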
Data mining vs web scraping
By now, we hope you understand the difference between data mining and web scraping. Whereas data mining analyzes data, web scraping extracts data. This means you could first use web scraping to extract data, and then use data mining to analyze that data.
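As a rough sketch of that hand-off, the snippet below takes records as a scraper might have stored them and runs a very simple analysis on them: a least-squares estimate of how a price changes per day. The `scraped_prices` values are invented and `price_trend` is a hypothetical helper. Real data mining would use much more data and more sophisticated models.

```python
from statistics import mean

# Stand-in for scraper output: (day, price) records for one product.
# The numbers are invented for illustration.
scraped_prices = [(1, 20.0), (2, 21.0), (3, 22.0), (4, 23.0), (5, 24.0)]


def price_trend(records):
    """Least-squares slope: average price change per day."""
    days = [d for d, _ in records]
    prices = [p for _, p in records]
    day_mean, price_mean = mean(days), mean(prices)
    covariance = sum((d - day_mean) * (p - price_mean) for d, p in records)
    variance = sum((d - day_mean) ** 2 for d in days)
    return covariance / variance


slope = price_trend(scraped_prices)
print(f"Price changes by about {slope:.2f} per day")
```

The extraction step only produces the list of records. The analysis step is what turns that list into a statement about a trend.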
Still, the two are easy to confuse, and that’s mostly because of the terminology. The term data mining is often considered a misnomer because it implies that you are mining (i.e. extracting) data from a source. Data extraction, however, is the definition of web scraping, not data mining. This may explain why the terms data mining and web scraping so often get mixed up.
As Jiawei Han and Micheline Kamber note in their book Data Mining: Concepts and Techniques (2001), “data mining should have been more appropriately named ‘knowledge mining from data’”.