Top 12 Tips - What To Do With Data
This is the first entry in our Discover Data Blog Series, so it seems appropriate to ask the questions 'what is data?', 'why is it so important?' and 'what can we do with it?'. We give you the Top 12 Tips on what you can do with your data...
So what is data?
Information is all around us. It is in everything we see, hear, smell, touch and taste. It can be found in the largest event, like the formation of a new galaxy, and in the smallest, such as the spin-state of an electron (you can tell I'm a physicist, can't you?)
Simply put, data is a collection of facts and information that we have gathered and translated into a form that is convenient to process
It can be numbers, words, measurements, observations or even just descriptions of things
So why do we collect data?
Information on its own can be interesting, but it is not really very useful. We need to collect data so we can find out 'what the world is like'
We might observe things like:
- It's rained every day this week
- My daughter is taller than most of her classmates
- I seem to be diagnosing more cases of lung cancer than usual just lately
The questions to ask might be:
- Are the current rainfall patterns unusual for this time of year?
- Is my daughter tall for her age or is her height within accepted limits?
- Are my observations correct, and if so, why are there more cases of lung cancer than usual?
In each of these cases we need to gather the information, observe it, measure it, count it and categorise it so that we can begin to understand the 'story' behind the information
What do we do with the data?
We typically collect data to answer one of 2 questions:
- What is the world like?
- What is the world going to be like?
The infographic below might help to explain the difference between these questions:
Infographic: We analyse data to find features, patterns and trends that enable us to describe what the world is like and predict what it will be like in the future
We might want to analyse the data to find features that describe to us what the world was like at the time the data was collected
There is a whole branch of statistics dedicated to finding these features, and typically we use descriptive methods to measure things such as:
- averages (mean, median, mode)
- variation (standard deviation, confidence intervals)
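As a rough sketch, the measures listed above can be computed with Python's standard `statistics` module. The rainfall figures and the `describe` helper are invented for illustration. The confidence interval uses the normal approximation (1.96 standard errors), which is an assumption that suits large samples better than a small one like this. Strictly speaking, it describes the uncertainty in the mean rather than the spread of the data.

```python
import math
import statistics

def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    # Approximate 95% confidence interval for the mean using the normal
    # critical value 1.96; a t-based interval would be wider for small n.
    half_width = 1.96 * sd / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "ci95": (mean - half_width, mean + half_width),
    }

# Hypothetical daily rainfall in mm for one week, invented for illustration
rainfall_mm = [2, 4, 4, 5, 7, 9, 4]
summary = describe(rainfall_mm)
print(summary)
```

Every number in `summary` describes the week that was measured and says nothing, on its own, about next week.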
What these can't do is tell you the future
For this we need to create models that can spot patterns and trends that allow us to predict what the world will be like in the future
There are many different ways of producing predictions and forecasts from data, but they can be broadly grouped into two families of techniques:
- regression (linear, multivariate, logistic, etc.)
- machine learning (ANNs, SVMs, etc.)
I'll talk about these in greater detail in future posts
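In the meantime, here is a minimal sketch of the simplest member of the regression family: a straight line fitted by ordinary least squares. The ages and heights are invented and deliberately tidy, and the `fit_line` helper is a hypothetical name, so treat this as an illustration of the idea, not a recommended modelling workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical ages (years) and heights (cm), invented for illustration
ages = [6, 7, 8, 9, 10]
heights = [115, 121, 127, 133, 139]

intercept, slope = fit_line(ages, heights)
predicted_at_11 = intercept + slope * 11
print(f"height = {intercept:.1f} + {slope:.1f} * age")
print(f"predicted height at age 11: {predicted_at_11:.1f} cm")
```

Note that age 11 lies outside the range of the data, so the prediction assumes the straight-line trend carries on. That is exactly the kind of assumption to check before trusting a forecast.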
Ultimately, data is information that can tell us how the world works, and this is important if we want to be able to predict the future with any degree of accuracy
If we want accurate predictions, then we need accurate data, so it is of the utmost importance that we take care when we observe and measure
As a statistical consultant I have lost count of the number of times that I have had to tell a researcher that their data is not fit for purpose, and that if they want their questions answered correctly and accurately they need to start again
For a third-year PhD student with just a month to go before submitting their thesis, this is not what they want to hear - and not what I want to tell them!
So how do we know when our data is not up to scratch?
There is a whole branch of statistics dedicated to answering this question (which I'm not going to go into here), but one of the questions we can ask is:
- Is our data biased?
An example of how to detect bias in data is to check the remainder (the right-hand side of the decimal point) of continuous measurements
Say that we are measuring the heights of 10-year-old children to 1 decimal place
We'll have a list of measurements, each with a single digit after the decimal point
Now leave off everything to the left of the decimal point, so that only that final digit is left
Count up all the .0s, the .1s, the .2s, etc., and plot the counts against the remainder
We expect to see approximately the same number of children at each of the ten possible remainders (the .0s, .1s, .2s and so on), so the plot should be roughly flat and square-ish (below left):
Figure: Example of how to detect bias in the data (left: roughly equal counts at each remainder; right: counts piled up at .0 and .5)
If we see the graph on the right, we'll know that something has gone wrong with our measuring procedures
Most likely the person/people doing the measuring have rounded to the nearest .0 or .5 for some (but not all) of the measurements, and this has inevitably introduced bias into the data
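The counting step described above can be sketched in a few lines of Python. The heights below are invented for illustration, with many of them deliberately ending in .0 or .5, and the helper names are hypothetical. The chi-square statistic is one common way (an assumption here, not something prescribed in this post) to put a number on how far the counts are from "square-ish".

```python
from collections import Counter

def final_digit_counts(measurements):
    """Count how often each digit 0-9 appears in the first decimal place."""
    # Format to 1 decimal place and take the last character, which avoids
    # floating-point surprises from arithmetic such as (x * 10) % 10.
    digits = [int(f"{m:.1f}"[-1]) for m in measurements]
    counts = Counter(digits)
    return [counts.get(d, 0) for d in range(10)]

def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts for every digit."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Hypothetical heights in cm, invented for illustration
heights = [137.0, 140.5, 138.0, 141.5, 139.0, 136.5, 142.0, 138.5,
           140.0, 137.5, 139.3, 141.0, 135.5, 138.7, 140.0, 139.5,
           137.0, 142.5, 136.2, 141.0]

counts = final_digit_counts(heights)
stat = chi_square_uniform(counts)
print(counts)  # position 0 holds the .0s, position 5 the .5s
print(f"chi-square = {stat:.1f}")
```

With ten possible digits there are 9 degrees of freedom, and the 5% critical value is about 16.92, so a much larger statistic suggests the digits are not evenly spread. Twenty measurements is far too few for the chi-square approximation to be reliable (a common rule of thumb asks for at least 5 expected per digit), so in practice you would want a much bigger sample.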
Is this a problem?
Well, it might be, but it all depends on what questions you are asking and how accurate you need the answers to be
Only you can answer that question, and it would be a really good idea to discuss this with your local friendly statistician before you begin collecting your data rather than just hours before an important deadline!
I wish I had a pound for every time I'd told that to someone...
So what have we learnt from this? As promised, here are our Top 12 Tips about what to do with data:
Tip #1: We can use data to describe what the world is like
Tip #2: We can use data to predict what the world will be like
Tips #3-12: The accuracy of our data is 10 times more important than what we plan to do with it!
So the next time you collect data, remember GIGO:
Garbage In, Garbage Out...
And maybe, just maybe, your statistician won't ruin your day by telling you to scrap your data and start again