Effective data cleaning is a vital part of the data analytics process. But what is it, why is it important, and how do you do it?
Good data hygiene is so important for business. For starters, it’s good practice to keep on top of your data, ensuring that it’s accurate and up-to-date. However, data cleaning is also a vital part of the data analytics process. If your data has inconsistencies or errors, you can bet that your results will be flawed, too. And when you’re making business decisions based on those insights, it doesn’t take a genius to figure out what might go wrong!
In a field like marketing, bad insights can mean wasting money on poorly targeted campaigns. In a field like healthcare or the sciences, it can quite literally mean the difference between life and death. In this article, I’ll explore exactly what data cleaning is and why it’s so vital to get it right. We’ll also provide an overview of the key steps you should take when cleaning your data.
Why not get familiar with data cleaning and the rest of the data analytics process in our free 5-day data short course?
We’ll answer the following questions:
- What is data cleaning?
- Why is data cleaning important?
- How do you clean data?
- What are some of the most useful data cleaning tools?
First up…
1. What is data cleaning?
Data cleaning (sometimes also known as data cleansing, and closely related to the broader task of data wrangling) is an important early step in the data analytics process.
This crucial exercise, which involves preparing and validating data, usually takes place before your core analysis. Data cleaning is not just a case of removing erroneous data, although that’s often part of it. The majority of work goes into detecting rogue data and (wherever possible) correcting it.
What is rogue data?
‘Rogue data’ includes things like incomplete, inaccurate, irrelevant, corrupt or incorrectly formatted data. The process also involves deduplicating, or ‘deduping’. This effectively means merging or removing identical data points.
Why is it important to correct rogue data?
The answer is straightforward enough: if you don’t, these rogue values will skew the results of your analysis.
Since data analysis is commonly used to inform business decisions, results need to be accurate. In this case, it might seem safer simply to remove rogue or incomplete data. But this poses problems, too: an incomplete dataset will also impact the results of your analysis. That’s why one of the main aims of data cleaning is to keep as much of a dataset intact as possible. This helps improve the reliability of your insights.
Data cleaning is not only important for data analysis. It’s also important for general business housekeeping (or ‘data governance’). The sources of big data are dynamic and constantly changing. Regularly maintaining databases, therefore, helps you keep on top of things. This has several additional benefits, which we’ll cover in the next section.
Want to try your hand at cleaning a dataset? You may be interested in this introductory tutorial to data cleaning, hosted by Dr. Humera Noor Minhas.
2. Why is data cleaning important?
A common refrain you’ll hear in the world of data analytics is: ‘garbage in, garbage out’. This maxim, so often used by data analysts, even has its own acronym… GIGO. But what does it mean?
Essentially, GIGO means that if the quality of your data is sub-par, then the results of any analysis using those data will also be flawed. Even if you follow every other step of the data analytics process to the letter, if your data is a mess, it won’t make a difference.
For this reason, the importance of properly cleaning data can’t be overstated. It’s like creating a foundation for a building: do it right and you can build something strong and long-lasting. Do it wrong, and your building will soon collapse. This mindset is why good data analysts will spend anywhere from 60-80% of their time carrying out data cleaning activities. Beyond data analytics, good data hygiene has several other benefits. Let’s look at those now.
Key benefits of data cleaning
As we’ve covered, data analysis requires effectively cleaned data to produce accurate and trustworthy insights. But clean data has a range of other benefits, too:
- Staying organized: Today’s businesses collect lots of information from clients, customers, product users, and so on. These details include everything from addresses and phone numbers to bank details and more. Cleaning this data regularly means keeping it tidy. It can then be stored more effectively and securely.
- Avoiding mistakes: Dirty data doesn’t just cause problems for data analytics. It also affects daily operations. For instance, marketing teams usually have a customer database. If that database is in good order, they’ll have access to helpful, accurate information. If it’s a mess, mistakes are bound to happen, such as using the wrong name in personalized mail-outs.
- Improving productivity: Regularly cleaning and updating data means rogue information is quickly purged. This saves teams from having to wade through old databases or documents to find what they’re looking for.
- Avoiding unnecessary costs: Making business decisions with bad data can lead to expensive mistakes. But bad data can incur costs in other ways too. Simple things, like processing errors, can quickly snowball into bigger problems. Regularly checking data allows you to detect blips sooner. This gives you a chance to correct them before they require a more time-consuming (and costly) fix.
- Improving mapping: Increasingly, organizations are looking to improve their internal data infrastructures. For this, they often hire data analysts to carry out data modeling and to build new applications. Having clean data from the start makes it far easier to collate and map, meaning that a solid data hygiene plan is a sensible measure.
Data quality
Key to data cleaning is the concept of data quality. Data quality measures the objective and subjective suitability of any dataset for its intended purpose.
There are a number of characteristics that affect the quality of data, including accuracy, completeness, consistency, timeliness, validity, and uniqueness. You can learn more about data quality in this full article.
3. How to clean your data (step-by-step)
So far, we’ve covered what data cleaning is and why it’s important. In this section, we’ll explore the practical aspects of effective data cleaning. Since there are multiple approaches you can take for completing each of these tasks, we’ll focus instead on the high-level activities.
Step 1: Get rid of unwanted observations
The first stage in any data cleaning process is to remove the observations (or data points) you don’t want. This includes irrelevant observations, i.e. those that don’t fit the problem you’re looking to solve.
For instance, if we were running an analysis on vegetarian eating habits, we could remove any meat-related observations from our data set. This step of the process also involves removing duplicate data. Duplicate data commonly occurs when you combine multiple datasets, scrape data online, or receive it from third-party sources.
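To make this concrete, here’s a minimal sketch using Python and pandas (a library we’ll mention again later). The tiny dataset and its column names (`dish`, `contains_meat`) are invented purely for illustration:

```python
import pandas as pd

# Hypothetical survey data on eating habits
df = pd.DataFrame({
    "dish": ["lentil curry", "beef burger", "veggie chilli", "veggie chilli"],
    "contains_meat": [False, True, False, False],
})

# Remove irrelevant observations (meat dishes don't fit a vegetarian analysis)
df = df[~df["contains_meat"]]

# Remove duplicate observations ('deduping')
df = df.drop_duplicates()

print(df)
```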
Step 2: Fix structural errors
Structural errors usually emerge as a result of poor data housekeeping. They include things like typos and inconsistent capitalization, which often occur during manual data entry.
Let’s say you have a dataset covering the properties of different metals. ‘Iron’ (uppercase) and ‘iron’ (lowercase) may appear as separate classes (or categories). Ensuring that capitalization is consistent makes that data much cleaner and easier to use. You should also check for mislabeled categories.
For instance, ‘Iron’ and ‘Fe’ (iron’s chemical symbol) might be labeled as separate classes, even though they’re the same. Other things to look out for are the use of underscores, dashes, and other rogue punctuation!
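Here’s what fixing those structural errors might look like in pandas. The `metal` column and the mapping from ‘Fe’ to ‘iron’ are assumptions made just for this example:

```python
import pandas as pd

df = pd.DataFrame({"metal": ["Iron", "iron", "Fe", " iron_", "Copper"]})

# Strip stray whitespace/underscores and make capitalization consistent
df["metal"] = df["metal"].str.strip(" _").str.lower()

# Merge mislabeled categories: map the chemical symbol onto the full name
df["metal"] = df["metal"].replace({"fe": "iron"})

print(df["metal"].unique())  # ['iron' 'copper']
```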
Step 3: Standardize your data
Standardizing your data is closely related to fixing structural errors, but it takes things a step further. Correcting typos is important, but you also need to ensure that every value of a given type follows the same rules.
For instance, you should decide whether values should be all lowercase or all uppercase, and keep this consistent throughout your dataset. Standardizing also means ensuring that things like numerical data use the same unit of measurement.
As an example, combining miles and kilometers in the same dataset will cause problems. Even dates have different conventions, with the US putting the month before the day, and Europe putting the day before the month. Keep your eyes peeled; you’ll be surprised what slips through.
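As a rough sketch, here’s how you might standardize units and dates with pandas. The column names, the conversion factor, and the assumption that every date string puts the day first are all specific to this made-up example:

```python
import pandas as pd

df = pd.DataFrame({
    "distance": [5.0, 3.1, 10.0],
    "unit": ["km", "miles", "km"],
    "date": ["03/04/2023", "15/04/2023", "28/04/2023"],
})

# Standardize distances to a single unit (kilometres)
MILES_TO_KM = 1.60934
df["distance_km"] = df["distance"].where(
    df["unit"] == "km", df["distance"] * MILES_TO_KM
)

# Parse dates, stating explicitly that these strings put the day before the month
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

print(df[["distance_km", "date"]])
```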
Step 4: Remove unwanted outliers
Outliers are data points that dramatically differ from others in the set. They can cause problems with certain types of data models and analysis.
For instance, while decision tree algorithms are generally accepted to be quite robust to outliers, outliers can easily skew a linear regression model. While outliers can affect the results of an analysis, you should always approach removing them with caution.
Only remove an outlier if you can prove that it is erroneous, e.g. if it is obviously due to incorrect data entry, or if it doesn’t match a comparison ‘gold standard’ dataset.
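One common (though by no means the only) way to spot outliers is the interquartile range rule. The sketch below flags, rather than silently deletes, anything outside 1.5 times the IQR; the `height_cm` values are invented:

```python
import pandas as pd

# Hypothetical measurements with one suspicious value
df = pd.DataFrame({"height_cm": [172, 168, 181, 175, 1750, 166]})

# Rule of thumb: flag values more than 1.5x the interquartile range from the quartiles
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df["is_outlier"] = ~df["height_cm"].between(lower, upper)
print(df[df["is_outlier"]])  # investigate these before deciding anything

# Only drop the outlier once you're confident it's erroneous
# (here, 1750 cm looks like a data-entry slip for 175.0)
df_clean = df[~df["is_outlier"]]
```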
Step 5: Fix contradictory data errors
Contradictory (or cross-set) data errors are another common problem to look out for. These occur when a single record contains values that are inconsistent or incompatible with one another.
An example could be a log of athlete racing times. If the column showing the total time spent running doesn’t equal the sum of the individual race times, you’ve got a cross-set error.
Another example might be a pupil’s numerical grade appearing in a field that only allows ‘pass’ or ‘fail’, or an employee’s taxes being greater than their total salary.
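A simple way to catch cross-set errors like the racing example is to recompute the total and compare it against the recorded one. The lap columns below are hypothetical:

```python
import pandas as pd

# Hypothetical race log: per-lap times plus a (possibly wrong) recorded total
df = pd.DataFrame({
    "lap_1": [61.2, 59.8, 63.0],
    "lap_2": [60.4, 61.1, 62.5],
    "total_seconds": [121.6, 150.0, 125.5],
})

# A cross-set error: the recorded total doesn't match the sum of the laps
lap_sum = df[["lap_1", "lap_2"]].sum(axis=1)
df["contradictory"] = ~lap_sum.round(1).eq(df["total_seconds"].round(1))

print(df[df["contradictory"]])  # these rows need investigating or correcting
```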
Step 6: Type conversion and syntax errors
Once you’ve tackled other inconsistencies, the content of your spreadsheet or dataset might look good to go.
However, you need to check that everything is in order behind the scenes, too. Type conversion means making sure each value is stored as the correct data type. A simple example: quantities should be stored as numerical data, while prices should be stored as a currency value. You should ensure that numbers are stored as numerical data, text as text input, dates as date objects, and so on.
In case you missed any part of Step 2, you should also remove syntax errors and stray white space (erroneous gaps before, in the middle of, or between words).
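Here’s a rough pandas sketch of type conversion and whitespace removal. The columns (quantities, prices, dates, names) are placeholders:

```python
import pandas as pd

# Everything here has been read in as text, which is a common starting point
df = pd.DataFrame({
    "quantity": ["3", "12", "7"],
    "price": ["£4.99", "£10.00", "£2.50"],
    "order_date": ["01/02/2023", "14/02/2023", "28/02/2023"],
    "customer": ["  Alice ", "Bob", " Carol"],
})

# Convert each column to an appropriate type
df["quantity"] = df["quantity"].astype(int)
df["price"] = df["price"].str.lstrip("£").astype(float)
df["order_date"] = pd.to_datetime(df["order_date"], dayfirst=True)

# Remove stray white space around text values
df["customer"] = df["customer"].str.strip()

print(df.dtypes)
```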
Step 7: Deal with missing data
When data is missing, what do you do? There are three common approaches to this problem.
The first is to remove the entries associated with the missing data. The second is to impute (or guess) the missing data, based on other, similar data. In most cases, however, both of these options negatively impact your dataset in other ways. Removing data often means losing other important information. Guessing data might reinforce existing patterns, which could be wrong.
The third option (and often the best one) is to flag the data as missing. To do this, ensure that empty fields have the same value, e.g. ‘missing’ or ‘0’ (if it’s a numerical field). Then, when you carry out your analysis, you’ll at least be taking into account that data is missing, which in itself can be informative.
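The sketch below shows all three approaches side by side in pandas. The customer records are made up, and the placeholder-plus-flag approach follows the advice above:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records with some gaps
df = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol", "Dan"],
    "age": [34, np.nan, 51, np.nan],
})

# Option 1: drop rows with missing values (loses Bob's and Dan's other data)
dropped = df.dropna(subset=["age"])

# Option 2: impute, e.g. with the median (may reinforce existing patterns)
imputed = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 3 (often best): flag the missingness explicitly so the analysis can use it
df["age_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(0)  # consistent placeholder value, as described above

print(df)
```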
Step 8: Validate your dataset
Once you’ve cleaned your dataset, the final step is to validate it. Validating data means checking that the process of making corrections, deduping, standardizing (and so on) is complete.
This often involves using scripts that check whether or not the dataset agrees with validation rules (or ‘check routines’) that you have predefined. You can also carry out validation against existing, ‘gold standard’ datasets.
This all sounds a bit technical, but all you really need to know at this stage is that validation means checking the data is ready for analysis. If there are still errors (which there usually will be) you’ll need to go back and fix them…there’s a reason why data analysts spend so much of their time cleaning data!
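As a simple illustration, validation ‘check routines’ can be as lightweight as a handful of rules that your cleaned dataset must satisfy. The rules and columns below are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 29, 51],
    "country": ["UK", "US", "DE"],
})

# Predefined validation rules the cleaned dataset should pass
rules = {
    "no duplicate customer IDs": df["customer_id"].is_unique,
    "ages within a plausible range": df["age"].between(0, 120).all(),
    "no missing values anywhere": df.notna().all().all(),
    "country codes from the allowed set": df["country"].isin(["UK", "US", "DE", "FR"]).all(),
}

failures = [name for name, passed in rules.items() if not passed]
if failures:
    print("Validation failed:", failures)  # go back and fix, then re-run
else:
    print("Dataset passed all checks and is ready for analysis")
```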
4. Data cleaning tools
Now we’ve covered the steps of the data cleaning process, it’s clear that this isn’t something you’ll want to tackle entirely by hand. So, what tools might help? The answer depends on factors like the data you’re working with and the systems you’re using. But here are some baseline tools to get to grips with.
Microsoft Excel
MS Excel has been a staple of computing since its launch in 1985. Love it or loathe it, it remains a popular data-cleaning tool to this day. Excel comes with many inbuilt functions to automate the data cleaning process, from deduping to replacing numbers and text, shaping columns and rows, or combining data from multiple cells. It’s also relatively easy to learn, making it the first port of call for most new data analysts.
Programming languages
Often, data cleaning is carried out using scripts that automate the process. This is essentially what Excel can do, using pre-existing functions. However, carrying out specific batch processing (running tasks without end-user interaction) on large, complex datasets often means writing scripts yourself.
This is usually done with programming languages like Python, Ruby, SQL, or—if you’re a real coding whizz—R (which is more complex, but also more versatile). While more experienced data analysts may code these scripts from scratch, many ready-made libraries exist. Python, in particular, has a tonne of data cleaning libraries that can speed up the process for you, such as Pandas and NumPy.
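To give a flavour of what such a script looks like, here’s a minimal batch-style sketch in pandas. The file names and columns are placeholders, and a real pipeline would chain together many more of the steps from section 3:

```python
import pandas as pd

# A tiny batch-style cleaning script: read raw data, clean it, write it back out
raw = pd.read_csv("raw_customers.csv")

cleaned = (
    raw.drop_duplicates()                                      # dedupe
       .assign(name=lambda d: d["name"].str.strip().str.title())  # tidy text
       .dropna(subset=["email"])                                # drop unusable rows
)

cleaned.to_csv("clean_customers.csv", index=False)
```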
Visualizations
Using data visualizations can be a great way of spotting errors in your dataset. For instance, a bar plot is excellent for visualizing unique values and might help you spot a category that has been labeled in multiple different ways (like our earlier example of ‘Iron’ and ‘Fe’). Likewise, scatter graphs can help spot outliers so that you can investigate them more closely (and remove them if needed).
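Here’s a quick sketch of both ideas using pandas and Matplotlib; the metals dataset (with one suspicious density value) is invented:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "metal": ["Iron", "iron", "Fe", "Copper", "Copper", "Zinc"],
    "density": [7.9, 7.8, 7.87, 8.9, 8.96, 71.4],  # 71.4 looks like an outlier
})

# A bar plot of value counts makes inconsistent labels ('Iron', 'iron', 'Fe') obvious
df["metal"].value_counts().plot(kind="bar", title="Category labels")
plt.show()

# A scatter plot makes the rogue density value stand out for closer investigation
df.reset_index().plot(kind="scatter", x="index", y="density", title="Density values")
plt.show()
```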
Proprietary software
Many companies are cashing in on the data analytics boom with proprietary software. Much of this software is aimed at making data cleaning more straightforward for non-data-savvy users. Since there are tonnes of applications out there (many of which are tailored to different industries and tasks) we won’t list them here. But we encourage you to go and see what’s available. To get you started, play around with some of the free, open-source tools. Popular ones include OpenRefine and Trifacta.
You’ll find a more thorough comparison of some of the best data cleaning tools in this guide.
Final thoughts
Data cleaning is probably the most important part of the data analytics process. Good data hygiene isn’t just about data analytics, though; it’s good practice to maintain and regularly update your data anyway. Clean data is a core tenet of data analytics and the field of data science more generally.
In this post, we’ve learned that:
- Clean data is hugely important for data analytics: Using dirty data will lead to flawed insights. As the saying goes: ‘Garbage in, garbage out.’
- Data cleaning is time-consuming: With great importance comes great time investment. Data analysts spend anywhere from 60-80% of their time cleaning data.
- Data cleaning is a complex process: Data cleaning means removing unwanted observations and outliers, fixing structural errors, standardizing your data, dealing with missing information, and validating your results. This is not a quick task, nor one you’d want to do entirely by hand!
- There are tools out there to help you: Fear not, tools like MS Excel and programming languages like Python are there to help you clean your data. There are also many proprietary software tools available.
Why not try your hand at data analytics with our free, five-day data analytics short course? Alternatively, read the following to find out more: