5 'Correlation Is Not Causation' Traps that Even Pros Fall Into (sometimes)
How many times have you heard that ‘correlation is not causation’? Many times, I’m sure. So you know not to say things like ‘wow, the correlation between A and B has a p-value of 0.000001 so A must be causing B…’. I wish I had a pound for every time a very experienced, intelligent scientist with PhDs and titles and stuff has said exactly that.
“Yeah, but just because you’ve got strong evidence of a correlation…” I say, “…it doesn’t mean that A causes B”.
“But look at that small p-value”, they say, “surely when the p-value is so small, then A must be causing B…”.
Well, no. Not at all.
You see, when it comes to storytelling, we have a problem.
It’s not our fault though – as human beings we are hard-wired from birth to look for patterns and explain why they happen. This problem doesn’t go away when we grow up; it becomes worse the more intelligent we think we are. We convince ourselves that now we are older, wiser and smarter, our conclusions are closer to the mark than when we were younger.
The smarter we think we are, the more likely we are to try putting an explanation to a pattern that we see, even when we don’t have enough information to reach such a conclusion. We can’t help it.
This is the thing about being human. We seek explanation for the events that happen around us. If something defies logic, we try to find a reason why it might make sense. If something doesn’t add up, we make it up.
This reminds me of a rookie error I made a few years ago with the results of some analyses I'd done.
I was doing a survival audit of breast cancer patients and trying to figure out which variables were correlated. The details aren't important.
After a few weeks of analysis I was digging deeper and deeper into the dataset, getting results that I expected to see, but one in particular leapt out at me. I discovered that patients that were receiving chemotherapy for their breast cancer had a much worse survival rate than patients that were not receiving chemotherapy.
This result slapped me in the face like a wet kipper and I exclaimed out loud 'Oh my God, the chemo is killing the patients!'.
I felt like running out into the clinic and screaming at the doctors to stop chemo treatments immediately. Fortunately, I took a deep breath, thought about it for a moment, then slapped my forehead and exclaimed 'Doh!'.
I had made 2 rookie mistakes at the same time.
The first mistake is what this blog post is all about - that correlation does not necessarily imply causation.
Just because there is a correlation between chemotherapy and poor survival, it does not necessarily follow that chemotherapy is the cause of the poor survival.
The second mistake was that there actually was a causal link between chemotherapy and poor prognosis, but I'd got it the wrong way round - it was the poor prognosis that was the cause of the chemotherapy. The patients with the more aggressive breast cancers were given more aggressive treatment, hence the chemotherapy, whereas patients who had less aggressive cancers didn't need chemo - they received alternative treatments.
So you see, even experienced analysts make mistakes when it comes to correlation and causation. I fell into the trap of Wrong Direction Causation.
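A rough way to see how this trap arises is to simulate it. The sketch below uses made-up numbers, not the real audit data: the share of aggressive cancers, the chance of being given chemo and the survival rates are all assumptions chosen for illustration, and the function names are hypothetical. In this toy world chemo actually helps, yet the pooled figures make it look harmful, because the sicker patients are the ones who get it.

```python
import random


def simulate_patients(n=20000, seed=0):
    """Return a list of (aggressive, chemo, survived) tuples for made-up patients."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        # Assumption: 40% of cancers are aggressive.
        aggressive = rng.random() < 0.4
        # Assumption: prognosis drives treatment, so aggressive cancers
        # are far more likely to be given chemo.
        chemo = rng.random() < (0.9 if aggressive else 0.1)
        # Assumption: chemo adds 10 percentage points to survival in both groups.
        p_survive = (0.5 if aggressive else 0.85) + (0.1 if chemo else 0.0)
        survived = rng.random() < p_survive
        patients.append((aggressive, chemo, survived))
    return patients


def survival_rate(patients, chemo, aggressive=None):
    """Share of survivors among patients with the given chemo status.

    If aggressive is None, patients of both cancer types are pooled.
    """
    outcomes = [
        s for a, c, s in patients
        if c == chemo and (aggressive is None or a == aggressive)
    ]
    return sum(outcomes) / len(outcomes)


patients = simulate_patients()

# Pooled comparison: chemo patients survive less often.
print("All patients:    chemo %.2f, no chemo %.2f" % (
    survival_rate(patients, True), survival_rate(patients, False)))

# Split by cancer type: chemo patients survive more often in both groups.
for aggressive in (True, False):
    label = "Aggressive:     " if aggressive else "Non-aggressive: "
    print("%s chemo %.2f, no chemo %.2f" % (
        label,
        survival_rate(patients, True, aggressive),
        survival_rate(patients, False, aggressive)))
```

The pooled comparison and the split comparison point in opposite directions. Nothing about the correlation itself tells you which way the arrow runs; you only find out by asking how patients ended up in each group.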
Actually, if we uncover a correlation between A and B, there are five alternatives to A being the direct cause of B:
- Wrong Direction Causation
- The Third Cause Fallacy
- Indirect Causation
- Cyclic Causation
- Coincidental Causation
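To make one of these concrete, here is a minimal sketch of the Third Cause Fallacy using invented data. A hidden variable (called `c` here, a hypothetical name) drives both `a` and `b`, while `a` and `b` have no effect on each other. The sketch computes the ordinary Pearson correlation, the usual t statistic for testing whether it is zero, and the partial correlation once `c` is controlled for. It is an illustration under those assumptions, not a recipe for real data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(1)
n = 5000

# Assumption: c is the hidden third cause; a and b each depend on c
# plus their own independent noise, and not on each other.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

r_ab = pearson_r(a, b)
# t statistic for the null hypothesis of zero correlation.
t_stat = r_ab * math.sqrt((n - 2) / (1 - r_ab ** 2))

print("Correlation between a and b: %.2f" % r_ab)
print("t statistic: %.1f" % t_stat)
print("Partial correlation given c: %.2f" % partial_r(a, b, c))
```

The t statistic here is large enough to give a vanishingly small p-value, yet the partial correlation sits close to zero. A tiny p-value tells you the correlation is unlikely to be a fluke of sampling; it says nothing about why the correlation is there.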
I explain all about these alternatives in a FREE book, Correlation Is Not Causation.
In this book, you'll learn:
- 5 reasons you should be sceptical about your correlation
- The alternatives as to why correlation is not necessarily causation
- How to avoid falling into these traps
You can get your copy right here:
Correlation and Causation Resources
This blog post is an accompaniment to Correlation Is Not Causation, and is here to help you take the next steps.
Below you'll find the best resources on correlation and causation that we've found on the web. We update the list frequently with new books, video courses, software and whatever else we can find (and create ourselves), so feel free to bookmark us, share us on the web and call in regularly to top up your ninja correlation skills.
Just so you know, some of these resources may be free while others may not, and to help you decide we use the following ratings:
- FREE content
- costs less than 10 £/$/Euro
- costs less than 50 £/$/Euro
- costs less than 100 £/$/Euro
- costs more than 100 £/$/Euro
Disclosure: some of the links to these resources may be affiliate links, and we may earn an affiliate commission for purchases you make when using these links.
You can find further details in our T&Cs.
Udemy Video Courses
Udemy is a great place to learn new stuff, not just about data, stats and AI, but about making model trains, how to apply make-up and, oh, just about anything else you can think of.
Courses (when they're on sale, which is very often) are typically priced at about 10-15 £/$/Euro. The upside is that the courses are very cheap, and usually very good. The downside is that courses aren't part of any formal programme, so you won't get any kind of certification.
If you want to fill gaps in your education or even learn whole topics, then Udemy is a great place to go.
4 hour Udemy Video Course with animated videos. Although it only briefly covers correlation, it will help get you started with basic statistics. Perfect for beginners
7 hour Udemy Video Course. Great for learning about linear regression and logistic regression, and takes you all the way from beginner to more advanced levels
5 hour Udemy Video Course. Gets started with the simple stuff - linear regressions - then moves on to logistic regressions and machine learning. Gives examples in Excel, R and Python
Coursera Video Courses
Coursera offer a more considered approach to learning, with both individual courses and full degree courses. You also get certificates that you can display on your LinkedIn profile to impress your future boss.
Prices are usually around the 50 £/$/Euro mark per course.
The great thing about Coursera is that all courses are taught by University professionals, so you know that these guys are the best and brightest in their field. On the downside, these courses are quite intensive and you need to be able to set aside a fair chunk of your time over a few weeks to complete the course. Unless you enjoy pulling all-nighters...
5 week Coursera Video Course. In-depth course covering causal effects, the difference between association and causation, causal graphs and inference methods
8 week Coursera Video Course. Great primer for basic statistics, and deals with correlation and regression right from the beginning. Also deals with many other aspects of statistics including descriptive stats, probabilities, inference, confidence intervals and significance tests
4 week Coursera Video Course. Here you will learn to fit and use simple and multiple linear regression models to examine relationships between multiple variables, using the free statistical software R and RStudio
DataCamp Video Courses
DataCamp are the new kid on the block when it comes to offering online courses, but don't let that put you off - the guys that do the teaching are seriously high spec, and they really know their onions.
Pricing at DataCamp is different to that at Udemy and Coursera. Rather than paying for an individual course, at DataCamp you'll be expected to subscribe monthly for around $25, but for that you get access to all their courses. The whole lot of them. And that's a serious amount of learning.
If you need to learn a boat-load of new stuff, then DataCamp might just be a more cost-effective option than both Udemy and Coursera.
4 hour DataCamp Video Course. Learn how to describe relationships between two numerical quantities and characterize them graphically, in the form of summary statistics and through simple linear regression models
4 hour DataCamp Video Course. By learning multiple and logistic regression techniques you will gain the skills to model and predict both numeric and categorical outcomes using multiple input variables
4 hour DataCamp Video Course. This course gives you a chance to think about how different samples can produce different linear models, where your goal is to understand the underlying population model
CorrelViz - visualise all the correlations in your data in minutes
CorrelViz is completely automated and gives you the Story of Your Data in minutes, with one click - saving you months of manual analysis and shed-loads of cash!
Analyse all your data, discover all the correlations you seek - and some you never even dreamed of...