The biggest trap of Simpson's paradox is that the result can flip at every level of granularity.
If you take the example of Treatment A vs Treatment B for tumors, you can get layer after layer of seemingly contradictory statements:
- Overall, Treatment A has better average results
- But if you add tumor size, Treatment B is always better
- But if you add gender to size, Treatment B is always better
- But if you add age category to gender and size, Treatment A is always better
- etc...
It totally contradicts our instincts, and shows statistics can be profoundly misleading (intentionally or not).
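For anyone who wants to see the first reversal (overall vs. split by tumor size) with actual numbers, here is a minimal sketch in plain Python. The counts are illustrative only and are not the author's data; the names `data`, `rate`, and `overall` are just made up for this example.

```python
from fractions import Fraction

# (successes, patients) per treatment and tumor-size stratum.
# Illustrative counts only; they are not taken from the thread.
data = {
    "A": {"small": (234, 270), "large": (55, 80)},
    "B": {"small": (81, 87), "large": (192, 263)},
}

def rate(successes, total):
    """Exact success rate as a fraction, to avoid rounding surprises."""
    return Fraction(successes, total)

def overall(treatment):
    """Success rate for a treatment with both strata pooled together."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return rate(successes, patients)

# Pooled comparison: does A look better overall?
a_wins_overall = overall("A") > overall("B")

# Stratified comparison: does B look better within every tumor size?
b_wins_each_stratum = all(
    rate(*data["B"][size]) > rate(*data["A"][size])
    for size in ("small", "large")
)

for treatment in ("A", "B"):
    for size, (s, n) in data[treatment].items():
        print(f"{treatment} {size}: {s}/{n} = {s / n:.3f}")
    print(f"{treatment} overall: {float(overall(treatment)):.3f}")

print("A better overall:", a_wins_overall)
print("B better in every stratum:", b_wins_each_stratum)
```

The flip happens because, in these counts, A is mostly given to small tumors (the easy cases) while B is mostly given to large ones, so the pooled average mostly reflects who got which treatment.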
To back up my answer, I actually coded a Z3 program to prove it! The 3-variable version takes too long to solve, but I got results for the 2-variable version (tumor size + gender):
For pedagogues and practitioners alike: there is a subtle connection between Simpson’s paradox and the wild geometry of relative entropy. This might be partly why effect sizes are also contentious.
Besides Ellenberg’s mind-altering discussion of that link[1], see hints on the second page of:
[1] "[the point of Simpson’s paradox] isn't really to tell us which viewpoint to take but to insist that we keep both the parts and the whole in mind at once."
Ellenberg, from Shape: The Hidden Geometry of Information, Biology, Strategy, Democracy, and Everything Else (2021)
I think the real-world resolution to this problem is straightforward though. You should look at the finest level of granularity available, and pick the best treatment in the relevant subpopulation for the patient.
Unfortunately, our level of certainty generally falls off as we increase the granularity. For example, imagine the patient is a 77yo Polish-American man, and we're lucky enough to have one historical result for 77yo Polish-American men. That man got treatment A and did better than expected. But say that if we widen the category to 70-79y white men, we have 1,000 people, of which 500 got treatment A and generally did significantly worse than the 500 who got treatment B. While the more granular category gives us a little information, the sample size is so small that we would be foolish to discard the less granular information.
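To put rough numbers on that trade-off, here is a small sketch using the Wilson score interval for a proportion. The success counts are assumptions (the example above gives group sizes, not outcomes), and `wilson_interval` is just a helper written for this illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical outcomes, chosen only to illustrate the point.
fine_a = wilson_interval(1, 1)        # the one 77yo Polish-American man, treatment A
coarse_a = wilson_interval(200, 500)  # 70-79y white men on A, assuming 40% did well
coarse_b = wilson_interval(300, 500)  # 70-79y white men on B, assuming 60% did well

for label, (lo, hi) in [
    ("fine-grained A (n=1)", fine_a),
    ("coarse A (n=500)", coarse_a),
    ("coarse B (n=500)", coarse_b),
]:
    print(f"{label}: [{lo:.3f}, {hi:.3f}] width={hi - lo:.3f}")
```

Under these assumed counts, the single fine-grained result is compatible with almost any success rate, while the two coarse intervals do not even overlap, which is why the coarse data should not be thrown away.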
This is all true. I originally added a disclaimer to my post that said "assuming you have enough data to support the level of granularity", but I removed it for brevity because I thought it was implied: small sample size isn't part of Simpson's paradox. My apologies for being unclear.