Top 3 Tips for Data Classification
By Lee Baker
Here we deal with the classification of data, and I’m going to give you my Top 3 Tips.
OK, so we learnt how to collect, record and store our data (in Excel) and we set up a way of working that allows us to follow the trail of data from the beginning to the end and, more importantly, from end to beginning so that we can identify and rectify mistakes.
We learnt how to clean our data, use the functions and formulae built into Excel to ward off ‘bleeding eye syndrome’ and report any errors back to the original source to save time, effort and money later on.
So you now have a perfectly clean dataset, but you still have some work to do before you start analysing it. You need to learn how to classify data, and that is what we’ll learn here.
1. Keep a Code Sheet
OK, so what is data classification?
Well, simply put, it’s the process of organising your data into categories for its most effective and efficient use.
But you already knew that, didn’t you?
You’ve already classified your data at the collection and cleaning stages. You’ve probably used codes to represent categories, like 1 for male and 2 for female, but what’s important right now is that you note what your data means.
After all, they’re not a secret are they?
If you go away on holiday for a month and come back to your dataset, will you still understand it? What if you hand it to someone else to work with. Will they understand it?
To make sure that the dataset can be easily understandable by anyone at any time, we need a code sheet.
Say you’ve entered the data for a variable as 1, 2 or 3. What does that mean?
- Small, Medium or Large?
- Pig, Sheep or Goat?
It matters because you shouldn’t be expected to remember all the details of how and why you coded your data that way.
Keep your codes in a separate worksheet and name it ‘Codes’. For each column make a note of what codes you’ve used and what they really mean.
If you’ve used additional codes using ‘illegal’ entries such as negative numbers or letters, make a note of what they mean too, and you’ll have a code sheet that might look something like this:
When you come back to the dataset after a couple of weeks away from it, you’ll be glad you got organised like this. You’ll also make your boss, colleagues and local friendly statistician happy too, and that’s never a bad thing…
2. Identify Your Data Types
When you get to the analysis stage you’ll need to know your data types – Ratio, Interval, Ordinal and Nominal – so take a little time to decide which of these are appropriate for each variable, and note this down in your code sheet.
When you have a variable that has more than 2 categories, check whether there is some kind of order or progression to the data. If the categories are descriptive (Nominal), like ‘Pig’, ‘Sheep’ or ‘Goat’, rather than ordered (Ordinal), like ‘Small’, ‘Medium’ or ‘Large’, you’ll need to create a new variable for each category, like this:
It may seem like you’re just creating more work for yourself, but it’ll save you time later on – analysis of variables with only 2 categories is much easier than with 3 or more categories because the question of whether there is a natural order to the data becomes irrelevant if your data has only 2 categories.
3. Check That Your Data is Sensible
Just because you’ve got a perfectly clean, classified, codified and organised dataset, it doesn’t mean that the data are correct.
Real life follows rules, and your data must too.
I once discovered that we had the oldest man in the world currently being treated in the hospital. At well over 300 years old it could be said that he’d had ‘a good innings’.
In the dataset I was analysing, the difference between his date of birth (somewhere in the 18th century) and date of hospital admission (21st century) meant that he was very old indeed.
Or perhaps his DOB wasn’t quite right…
The error in his DOB couldn’t be detected by standard error-checking in Excel because it was a perfectly legitimate date.
Sometimes, putting together 2 or more pieces of data can reveal errors that otherwise can be difficult to find, so it is sensible to do a few simple calculations on each variable to check that the data conform to sensible rules, such as:
- Calculate the minimum, maximum and mean
- Keep a count for each variable and each category
- Check differences between dates
Making these checks (in a separate worksheet!) lets you find outliers, such as people who have a negative age or are several hundred years old, and gives you a good feel for your data.
Something doesn’t feel right about the answers?
Then dive back in and take a look.
Don't be afraid of getting your hands dirty!
Bonus Tip: DataKleenr – Automated Data Cleaning & Classification
While it’s great to get your hands dirty and learn the basics of data cleaning and classification there’s really no substitute for cleaning and classifying your data quickly, accurately and automatically, so let us do the heavy lifting for you.
We’ve created DataKleenr, an automated data cleaning tool that cleans and classifies your data for you effortlessly, saving you time and money. It might just be the fastest and easiest data classification tool out there.
Well, that’s it up for this blog post. I hope you found something valuable here.
If you would like an eBook of this series of blogs about data collection, data cleaning and data classification you can get it here.
blog comments powered by Disqus