The One Reason Your Correlation Results Are Probably Wrong
I'm sure you all know how to do a correlation analysis to test if one variable is related to another. You probably even know when to use a Spearman correlation test over a Pearson correlation test.
But did you know that the answers from these tests are probably wrong?
When you have a pair of variables, finding a statistical relationship between the pair of them is pretty straightforward. It’s easy to see that where you have maybe a dozen variables you can analyse them pairwise using standard correlation tests and find out which relationships exist in your dataset, and which do not.
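If you'd like to see what that pairwise analysis looks like in practice, here is a minimal sketch in plain Python. The data and the variable names (`dose`, `growth`, `noise`) are made up for illustration, and the helper functions are hand-rolled so the example has no dependencies; in real work you would more likely reach for a statistics library.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: "growth" doubles with each unit of "dose"; "noise" just alternates.
data = {
    "dose": [1, 2, 3, 4, 5, 6, 7, 8],
    "growth": [2, 4, 8, 16, 32, 64, 128, 256],
    "noise": [5, 3, 5, 3, 5, 3, 5, 3],
}

# Analyse every pair of variables in turn.
results = {}
for a, b in combinations(data, 2):
    results[(a, b)] = (pearson(data[a], data[b]), spearman(data[a], data[b]))
    r, rho = results[(a, b)]
    print(f"{a} vs {b}: Pearson={r:.2f}, Spearman={rho:.2f}")
```

Because `growth` always rises with `dose` but not in a straight line, Spearman rates that pair as a perfect monotonic relationship while Pearson scores it lower, which is the usual reason for preferring one test over the other.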
It's so simple it's enough to make you feel smug and self-satisfied - maybe even a stats hotshot.
Not so fast, cowboy – it’s not quite that simple!
You see, when you analyse a pair of variables using univariate tests, you’re testing to see whether there is a relationship between these variables without taking into account any other potential factors.
There are loads of ways in which your variables might be interacting with and influencing each other, so when you have a significant p-value from univariate analysis you can’t be sure that the answer you get is correct.
Let me make it easy for you.
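If you get a non-significant p-value (larger than 0.05), the test has found no convincing evidence of a relationship, so a direct relationship between your variables is unlikely. Strictly speaking, that's not the same as being 95% sure there is no relationship: the 0.05 threshold only limits how often you'd see a false positive when nothing is going on. That’s not to say that one variable does not influence the other indirectly (it may do), but there is not likely to be an independent relationship between them.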
On the other hand, if you get a significant p-value (smaller than 0.05), the best you can say is that there may be a relationship between them. The relationship might be independent, but equally it might not.
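To see how a significant result can come from a relationship that isn't independent, here is a small simulation. It is only a sketch under stated assumptions: the data are randomly generated, the variables `x`, `y` and `z` are hypothetical, and I've used a first-order partial correlation as one simple way of "taking `z` into account". Neither `x` nor `y` affects the other; both are driven by `z`.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated data (an assumption for illustration): z drives both x and y,
# but x and y have no direct link to each other.
rng = random.Random(42)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"Pairwise correlation of x and y:    {r_xy:.2f}")
print(f"Correlation of x and y, given z:    {r_xy_given_z:.2f}")
```

The pairwise correlation comes out clearly above zero (around 0.5 in expectation for this setup), yet once `z` is accounted for it drops to roughly zero. The pairwise test was telling the truth about a relationship existing, just not about it being a direct one.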
This flow chart might help you a little:
Feeling a little less smug now, aren’t we?
So if univariate tests don’t give us the answers we need, where do we go from here? Well, the univariate tests are still useful to us. Remember that univariate tests are pretty good at telling us when there isn’t a direct relationship between a pair of variables.
This is useful information: it lets us narrow the field, separating the variables that might be related to your main variable (aka hypothesis variable) from those that aren’t.
So, in turn you test each variable against your hypothesis variable to see which of them are not related. Then you discard them. What remains are the variables that might be related to it.
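Here is a rough sketch of that screening step. Everything in it is an assumption for illustration: the data are made up, the names `hypothesis`, `a`, `b` and `c` are hypothetical, and the p-value uses the Fisher z approximation so the example runs with the standard library alone (a statistics package such as SciPy would give you an exact test).

```python
from math import atanh, sqrt
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def approx_p_value(r, n):
    """Approximate two-sided p-value for 'no correlation' (Fisher z transform)."""
    if abs(r) >= 1:
        return 0.0  # perfect correlation: atanh is undefined at +/-1
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up data: the hypothesis variable and three candidate variables.
hypothesis = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12, 14, 13]
candidates = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "b": [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11],
    "c": [5, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 3],
}

alpha = 0.05
kept, discarded = [], []
for name, values in candidates.items():
    r = pearson(values, hypothesis)
    p = approx_p_value(r, len(values))
    # Non-significant variables are discarded; the rest go forward.
    (kept if p < alpha else discarded).append(name)
    print(f"{name}: r={r:.2f}, p={p:.3f}")

print("Might be related:", kept)
print("Discarded:", discarded)
```

With this data, `a` and `b` survive the screen and `c` is discarded. Note that surviving only means "might be related"; the screen says nothing yet about which of the survivors has a direct relationship with the hypothesis variable.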
The next step gets tricky because we now need to test the relationship between the hypothesis variable and all of these variables whilst taking into account all the possible interactions between them. Sounds scary!
We’re now dipping our toes into the world of multivariate analysis.
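To give you a flavour of what that looks like, here is a toy sketch using the simplest multivariate case: a regression with just two predictors, where the standardized coefficients can be worked out from the three pairwise correlations. The data and the names `hypothesis`, `a` and `b` are made up for illustration, and this closed-form shortcut only applies to the two-predictor case; with more variables you would use a proper regression routine.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def standardized_betas(y, x1, x2):
    """Standardized regression coefficients of y on x1 and x2 together."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Made-up data: both a and b correlate strongly with the hypothesis variable,
# and also with each other.
hypothesis = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12, 14, 13]
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]

print(f"Pairwise: r(a)={pearson(hypothesis, a):.2f}, r(b)={pearson(hypothesis, b):.2f}")

beta_a, beta_b = standardized_betas(hypothesis, a, b)
print(f"Together: beta(a)={beta_a:.2f}, beta(b)={beta_b:.2f}")
```

On their own, both `a` and `b` look strongly related to the hypothesis variable. Analysed together, `a` keeps almost all of its effect while `b` shrinks to nearly nothing, which suggests that in this toy dataset `b` only looked related because it tracks `a`.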
I'm not going to go into detail about univariate and multivariate correlations here because I explain all about them in my FREE eBook Beginner's Guide to Correlation Analysis.
In this book you'll learn:
- the difference between correlations, associations and statistical relationships
- how to analyse your data to get a better understanding of what it's trying to tell you
- how to use both univariate and multivariate statistical tests to pin down the correct story of your data first time, every time
I will give you a little advice though: do univariate analyses on your data first to get a good understanding of the underlying patterns of your data, then confirm or deny these patterns with the more powerful multivariate analyses. This way you get the best of both worlds and when you discover a new relationship, you can have confidence in it because it has been discovered and confirmed by two different statistical analyses.
When pressed for time I’ve often just jumped straight into the multivariate analysis. Whenever I’ve done this, it has always ended up costing me more time – I find that some of the results don’t make sense and I have to go back to the beginning and do the univariate analyses before repeating the multivariate analyses.
I advise that you think like the tortoise rather than the hare – slow and methodical wins the race…
Correlation Analysis Resources
This blog post is an accompaniment to Beginner's Guide to Correlation Analysis, and is here to help you take the next steps.
Below you'll find the best resources on the web for learning about correlations, associations, univariate and multivariate analysis. We update the list frequently with new books, video courses, software and whatever else we can find (and create ourselves), so feel free to bookmark us, share us on the web and call in regularly to top up your ninja correlation skills.
Just so you know, some of these resources may be free while others may not, and to help you decide we use the following ratings:
- FREE content
- costs less than 10 £/$/Euro
- costs less than 50 £/$/Euro
- costs less than 100 £/$/Euro
- costs more than 100 £/$/Euro
Disclosure: some of these resources may be affiliate links, and we may earn an affiliate commission for purchases you make when using these links.
You can find further details in our T&Cs.
Udemy Video Courses
Udemy is a great place to learn new stuff, not just about data, stats and AI, but about making model trains, how to apply make-up and, oh, just about anything else you can think of.
Courses (when they're on sale, which is very often) are typically priced at about 10-15 £/$/Euro. The upside is that the courses are very cheap, and usually very good. The downside is that courses aren't part of any formal programme, so you won't get any kind of certification.
If you want to fill gaps in your education or even learn whole topics, then Udemy is a great place to go.
4 hour Udemy Video Course with animated videos. Although it only briefly covers correlation, it will help get you started with basic statistics. Perfect for beginners
7 hour Udemy Video Course. Great for learning about linear regression and logistic regression, and takes you all the way from beginner to more advanced levels
5 hour Udemy Video Course. Gets started with the simple stuff - linear regressions - then moves on to logistic regressions and machine learning. Gives examples in Excel, R and Python
Coursera Video Courses
Coursera offers a more considered approach to learning, with both individual courses and full degree courses, and you get certificates that you can display on your LinkedIn profile to impress your future boss.
Prices are usually around the 50 £/$/Euro mark per course.
The great thing about Coursera is that all courses are taught by University professionals, so you know that these guys are the best and brightest in their field. On the downside, these courses are quite intensive and you need to be able to set aside a fair chunk of your time over a few weeks to complete the course. Unless you enjoy pulling all-nighters...
5 week Coursera Video Course. In-depth course covering causal effects, the difference between association and causation, causal graphs and inference methods
8 week Coursera Video Course. Great primer for basic statistics, and deals with correlation and regression right from the beginning. Also deals with many other aspects of statistics including descriptive stats, probabilities, inference, confidence intervals and significance tests
4 week Coursera Video Course. Here you will learn to fit and use simple and multiple linear regression models to examine relationships between multiple variables, using the free statistical software R and RStudio
DataCamp Video Courses
DataCamp are the new kid on the block when it comes to offering online courses, but don't let that put you off - the guys that do the teaching are seriously high spec, and they really know their onions.
Pricing at DataCamp is different to that at Udemy and Coursera. Rather than paying for an individual course, at DataCamp you'll be expected to subscribe monthly for around $25, but for that you get access to all their courses. The whole lot of them. And that's a serious amount of learning.
If you need to learn a boat-load of new stuff, then DataCamp might just be a more cost-effective option than both Udemy and Coursera.
4 hour DataCamp Video Course. Learn how to describe relationships between two numerical quantities and characterize them graphically, in the form of summary statistics and through simple linear regression models
4 hour DataCamp Video Course. By learning multiple and logistic regression techniques you will gain the skills to model and predict both numeric and categorical outcomes using multiple input variables
4 hour DataCamp Video Course. This course gives you a chance to think about how different samples can produce different linear models, where your goal is to understand the underlying population model
CorrelViz - visualise all the correlations in your data in minutes
CorrelViz is completely automated and gives you the Story of Your Data in minutes, with one click - saving you months of manual analysis and shed-loads of cash!
Analyse all your data, discover all the correlations you seek - and some you never even dreamed of...