How to Clean Text Data Without Manual Data Manipulation
There is plenty of information in Excel’s help files about the way Microsoft suggests that you use them.
You can learn a lot from them but did you know that there are actually alternative uses of these functions and formulae?
It’s actually these alternative uses that turn out to be the most valuable because they can help you save a lot of time in your data cleaning and preparation.
In this blog post I will show you an example of these “alternative uses” of Excel functions and formulae.
In particular I'll show you how to combine Remove Duplicates and VLOOKUP in a way that is different to what Microsoft intended but which is actually a lot more powerful.
The first time that I ever used Excel, way back in 1999, I’d been given a worksheet with 2 columns of numbers in it. All I needed to do was add column A to column B to get column C.
There were about 30,000 rows or so, and I really didn’t feel like adding columns A and B manually but I had absolutely no idea how to get started completing this simple task.
I asked a fellow student. He had no idea, so I asked another. She had no idea either. Eventually I discovered that there were loads of different ways to complete the task using Excel functions and formulae.
And here is where my Excel journey began…
Remove Duplicates & VLOOKUP
Over the past couple of decades I’ve done a lot more work using Excel and, although I still wouldn’t really call myself an expert (there are lots of different uses for Excel, many of which I’ve never experienced), I think I’m pretty handy when it comes to cleaning and analysing data.
Microsoft have built a huge library of functions and formulae into Excel which are very powerful, but I’ve noticed over the years that I often use these functions and formulae in different ways to which Microsoft intended.
It’s not that I’m using them wrongly, just differently, and I think that some of Excel’s functions and formulae are underused and underrated because most people don’t realise they can be used in alternative ways.
This blog post is all about using the function Remove Duplicates and the formula VLOOKUP, and how they can be used together to clean and translate text data into an analysis-ready form.
Most statistics programs have their own little quirks. Some allow the use of text data and some don’t, so if you find that you’ve got columns of text categories, you might find that you need to transform them into integer categories. Worse still, there might be speeling misteaks or typos, and you’re going to have to clean these up too before you can get started.
Let’s have a look at a sample of text data to see the kinds of problems we might typically encounter. We have a categorical variable called ‘Size’ and possible category names of [Small, Medium, Large]. Unfortunately there are a number of typographical errors that we’re going to have to deal with:
There are lots of ways to clean and translate text data like these, but here we’re going to use Remove Duplicates and VLOOKUP.
Our first task is to find out just what is in our Size variable. We can’t see to the bottom of our dataset and we don’t really want to take hours scanning it by eye. That’s a sure way to make your eyes bleed. Not good.
What we need to do is use Remove Duplicates to give us a list of all the unique elements.
The first step is to make a copy of our Size column. I’ve put it in column F, selected the whole of column F containing our copy and clicked Remove Duplicates under the Data Tools section of the Data tab.
Here’s what that looks like:
Make sure that ‘My data has headers’ is ticked, then click OK. Column F will now list all the unique elements of our Size column:
If you look closely at our original data and the list of unique elements you should notice something quite odd – Excel has ignored the case of our data. Row 10 (corresponding to the UniqueID of 9) has the entry ‘large’ (all lower case), which is not present in our list of unique elements. That’s because Excel ignores the case and can deal with upper, lower and proper case text as if they were all the same. This is both really smart and really dumb at the same time. It’s great for doing work in Excel but awful for transferring data into other programs that don’t have this functionality.
In the context of what we’re trying to do today, it doesn’t matter one jot – we’re going to get around this issue with no extra work!
OK, so now we have a list of all the (case-insensitive) unique elements, we’re going to transform our data from what we have into what we need. The list of what we have is what’s in column F, so we need to put what we need into column G, like this:
Here, I’ve made sure there are no incorrect spellings in column G because this is what our data is going to become. Mistakes here will give us extra work to do (more cleaning – just what I always wanted…).
Columns F and G are now a ‘translation matrix’, as in ‘we want all of these (column F) to be like these (column G)’.
The next step is to put VLOOKUP to work.
In case you haven’t used VLOOKUP before, we build it like this:
=VLOOKUP(Value you want to look up, range where you want to lookup the value, the column number in the range containing the return value, Exact Match (FALSE) or Approximate Match (TRUE).
In column C (titled ‘Cleaned’) we want to have the cleaned data, so in cell C2 we type the following:
This means ‘look up the value in B2 from the table in F2:G7 and return the matching value from the 2nd column (column G) in cell C2’.
So Excel will go off and look up ‘Small’ (the value in B2) in the table, and when it finds it (in row 2) it will return the corresponding row from the 2nd column (column G), which is ‘Small’. Not surprisingly, it is returning exactly the same value as it started with.
The next step is to fill in this formula to the bottom of the data. You can do this by double-clicking on the little square in the bottom right-hand corner of the selected cell (C2).
This will apply the formula to the whole column and your data will now look like this:
All our data has now been cleaned, and there’s been no manual manipulation.
VLOOKUP – Again!
We’re not finished though. The data may be clean, but it’s not analysis-ready – our stats program won’t accept text data so we’ll need to translate these data into integers.
For this we can use VLOOKUP again.
We know that our categories are [Small, Medium, Large] with no typos and all case-corrected, so we can use a simpler translation matrix. In columns I and J we enter the following:
This time our VLOOKUP formula looks up data from column C (the cleaned data) in the translation matrix in columns I and J and returns the matching values from the 2nd column (column J). Our VLOOKUP formula in cell D2 looks like this:
This time, entries in column C matching the entry ‘Small’ will be translated to the integer ‘1’, ‘Medium’ will translate to ‘2’ and ‘Large’ to ‘3’.
Here’s what the data look like at the end of this operation:
Our data is now in integer classes rather than in text form, and can be used in pretty much all stats programs.
So you can see that by using Excel functions and formulae in ways that were not originally intended, we can discover much more powerful solutions to our problems. In this case we’ve managed to clean and case-control a text column with just a couple of minutes work, and translate the text into integers ready to upload to a stats program.
And all this with absolutely no data manipulation whatsoever!
I hope you learnt something new in this blog post and that you start to use it in your own datasets.
We’ve learnt here that Microsoft Excel is not all that it seems – sometimes useful functions and formulae can be really useful and much more powerful when we find alternative ways of using them.
I hope you’ve also learnt that data cleaning and preparation doesn’t have to involve a lot of work – by the careful use of Excel’s built-in tools we can make the process much easier and very much quicker.
If you would like to explore the issues highlighted in this blog post in more detail, why not join us for our next webinar? It’s free, fun, informative and – best of all – free. Oh, did I mention, it’s FREE…
blog comments powered by Disqus