Top 3 Tips for Data Classification
By Lee Baker
OK, so we learnt how to collect, record and store our data (in Excel) and we set up a way of working that allows us to follow the trail of data from the beginning to the end and, more importantly, from end to beginning so that we can identify and rectify mistakes.
We learnt how to clean our data, use the functions and formulae built into Excel to ward off ‘bleeding eye syndrome’ and report any errors back to the original source to save time, effort and money later on.
So you now have a perfectly clean dataset, but you still have some work to do before you start analysing it. You need to learn how to classify data, and that is what we’ll learn here.
1. Keep a Code Sheet
OK, so what is data classification?
Well, simply put, it’s the process of organising your data into categories for its most effective and efficient use.
But you already knew that, didn’t you?
You’ve already classified your data at the collection and cleaning stages. You’ve probably used codes to represent categories, like 1 for male and 2 for female, but what’s important right now is that you note what your data means.
After all, they’re not a secret are they?
If you go away on holiday for a month and come back to your dataset, will you still understand it? What if you hand it to someone else to work with. Will they understand it?
To make sure that the dataset can be easily understandable by anyone at any time, we need a code sheet.
Say you’ve entered the data for a variable as 1, 2 or 3. What does that mean?
- Small, Medium or Large?
- Pig, Sheep or Goat?
It matters because you shouldn’t be expected to remember all the details of how and why you coded your data that way.
Keep your codes in a separate worksheet and name it ‘Codes’. For each column make a note of what codes you’ve used and what they really mean.
If you’ve used additional codes using ‘illegal’ entries such as negative numbers or letters, make a note of what they mean too, and you’ll have a code sheet that might look something like this:
When you come back to the dataset after a couple of weeks away from it, you’ll be glad you got organised like this. You’ll also make your boss, colleagues and local friendly statistician happy too, and that’s never a bad thing…
2. Identify Your Data Types
When you get to the analysis stage you’ll need to know your data types – Ratio, Interval, Ordinal and Nominal – so take a little time to decide which of these are appropriate for each variable, and note this down in your code sheet.
When you have a variable that has more than 2 categories, check whether there is some kind of order or progression to the data. If the categories are descriptive (Nominal), like ‘Pig’, ‘Sheep’ or ‘Goat’, rather than ordered (Ordinal), like ‘Small’, ‘Medium’ or ‘Large’, you’ll need to create a new variable for each category, like this:
It may seem like you’re just creating more work for yourself, but it’ll save you time later on – analysis of variables with only 2 categories is much easier than with 3 or more categories because the question of whether there is a natural order to the data becomes irrelevant if your data has only 2 categories.
Enjoying this blog post? Share it with the world...
3. Check That Your Data is Sensible
Just because you’ve got a perfectly clean, classified, codified and organised dataset, it doesn’t mean that the data are correct.
Real life follows rules, and your data must too.
I once discovered that we had the oldest man in the world currently being treated in the hospital. At well over 300 years old it could be said that he’d had ‘a good innings’.
In the dataset I was analysing, the difference between his date of birth (somewhere in the 18th century) and date of hospital admission (21st century) meant that he was very old indeed.
Or perhaps his DOB wasn’t quite right…
The error in his DOB couldn’t be detected by standard error-checking in Excel because it was a perfectly legitimate date.
Sometimes, putting together 2 or more pieces of data can reveal errors that otherwise can be difficult to find, so it is sensible to do a few simple calculations on each variable to check that the data conform to sensible rules, such as:
- Calculate the minimum, maximum and mean
- Keep a count for each variable and each category
- Check differences between dates
Making these checks (in a separate worksheet!) lets you find outliers, such as people who have a negative age or are several hundred years old, and gives you a good feel for your data.
Something doesn’t feel right about the answers?
Then dive back in and take a look.
Don't be afraid of getting your hands dirty!
Bonus Tip: DataKleenr – Automated Data Cleaning & Classification
While it’s great to get your hands dirty and learn the basics of data cleaning and classification there’s really no substitute for cleaning and classifying your data quickly, accurately and automatically, so let us do the heavy lifting for you.
We’ve created DataKleenr, an automated data cleaning tool that cleans and classifies your data for you effortlessly, saving you time and money. Check it out – it might not be the worst thing that you do today!
Well, that’s it for this blog post. I hope you found something valuable here.
Want a FREE Excel cheatsheet with 22 Essential data cleaning formulae?
Of course you do! Well here you go:
If you're interested in learning more about the content in this blog post we've sought out the best blogs, books, video courses and other stuff from around the internet for you. Some may be free while others may not, and to help you decide we use the following ratings:
- FREE content
- costs less than 10 £/$/Euro
- costs less than 50 £/$/Euro
- costs less than 100 £/$/Euro
- costs more than 100 £/$/Euro
Disclosure: some of these resources may be affiliate links, and we may earn an affiliate commission for purchases you make when using these links
You can find further details in our TCs
11 Essential Tips for Effective Data Collection
Collecting data is serious business. Get it right and your analysis can go relatively smoothly, but get it wrong and you're in for a world of pain. Here are 11 essential and completely painless tips that'll get you off to a great start.
5 Productivity Tips for Efficient Data Cleansing
Now that you've collected your data, your next job is to clean it. This is a job that can be painful, thankless and can take you weeks, so it pays to be organised. Here are 5 tips that will help you get through it quickly and easily.
Videos & Video Courses
DataKleenr - all your data cleaned in minutes, not months
DataKleenr cleans and classifies your data - including outliers - as it is being uploaded. A few moments later, your data is analysis-ready!
DataKleenr is fast, simple and accurate, leaving you the time you need to do the really important parts of your job.
blog comments powered by Disqus