Science, bad statistics, and socks
Scientists warn not to put periods in text messages. Also, how to pick up socks in a drawer to get a matched pair.
Thursday, February 20, 2025 | science | socks
Should scientific articles on frivolous topics be statistically rigorous?
This question goes to the heart of what science is. It's also a good
excuse for a teaching moment.
A psychology journal recently published an article asking whether adding a period at the end of a text message affects people's perception of it. The authors claimed the effect was real and that their results were statistically significant at p<0.05. They were not.
Here is their table. The last column shows my calculations.

Experiment | Comparison | Mean ± SD (first condition) | Mean ± SD (second condition) | Claimed p-value | Actual p-value |
---|---|---|---|---|---|
1 | No period vs period | 5.13 ± 0.74 | 5.45 ± 0.74 | <0.05 | 0.060 (ns) |
2 | Single vs multi text | 5.65 ± 0.85 | 5.89 ± 0.70 | <0.05 | 0.177 (ns) |
3 | Single vs multi line | 5.22 ± 0.79 | 5.32 ± 0.82 | (ns) | 0.585 (ns) |
There were 78 students in each experiment. That means, in the best case, that half of them (39) gave one answer and half gave the other. So N, the number of subjects in each group, is 39. The correct two-tailed p-values are 0.060 and 0.177. Since these were the only results in the paper, nothing they reported was significant.
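For anyone who wants to check, those p-values can be recomputed directly from the published means, SDs, and group size. A minimal sketch in Python, assuming an ordinary two-sample Student's t-test with N = 39 per group:

```python
# Recompute the two-tailed p-values from the published summary statistics.
from scipy import stats

n = 39  # subjects per group
experiments = [
    ("No period vs period",  5.13, 0.74, 5.45, 0.74),
    ("Single vs multi text", 5.65, 0.85, 5.89, 0.70),
    ("Single vs multi line", 5.22, 0.79, 5.32, 0.82),
]

for name, m1, s1, m2, s2 in experiments:
    t, p = stats.ttest_ind_from_stats(m1, s1, n, m2, s2, n)
    print(f"{name}: t = {t:.2f}, two-tailed p = {p:.3f}")
# Output: p is about 0.060, 0.177, and 0.585; none of them below 0.05.
```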
What happened? The only way I can see to get their numbers is to compare the experimental group against a fixed value (in effect a one-sample test), for instance testing 5.45 ± 0.74 against the constant 5.13. This throws away the variance in the control group. It's incorrect because that variance was not zero, which they had to know, because they reported it. Thus there is no statistically significant difference between groups shown text messages with or without a period at the end.
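As a sanity check on that explanation, testing 5.45 ± 0.74 against the fixed value 5.13, as described above, does push the first experiment below p = 0.05. A sketch of that (incorrect) calculation:

```python
# The "fixed value" version: treat the control mean as a constant with
# no variance and run what amounts to a one-sample t-test (n = 39).
from math import sqrt
from scipy import stats

n = 39
m_ctrl = 5.13                 # treated as an exact number
m_exp, sd_exp = 5.45, 0.74    # experiment 1, group with periods

t = (m_exp - m_ctrl) / (sd_exp / sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)
print(f"t = {t:.2f}, p = {p:.3f}")   # p is about 0.010: "significant", but wrongly so
```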
Of course, no one cares about this particular result. What the paper really shows is that increasingly the goal of academic researchers is not to contribute to human understanding but to add to their publication count.
A two-tailed test is essential unless it is physically impossible for the result to change in the opposite direction from the one you expect. The canonical example is a hole-punching machine, since you can never end up with fewer than zero holes. But when you give somebody a drug, it's quite possible that the drug will make the patient worse. In fact, it's more likely to do so. We're lucky if it doesn't kill them.*
It gets worse. I know a guy who measured something in two populations, with maybe 3 mice in each group. Then he counted 10 cells from each mouse and reported a highly significant result because he incorrectly stated his N as 30 (the number of cells) instead of 3 (the number of mice).
The SD of measurements within one mouse is important to know, but it says nothing about mouse-to-mouse variation, which is what the experiment is measuring. Though it's possible to combine the two numbers, they mean different things and it doesn't really make sense. It can only make your error term bigger, which is probably the real reason nobody does it.
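To see how much difference this makes, here is a toy simulation with invented numbers (not that lab's actual data): two groups of three mice, ten cells measured per mouse, and a group difference that is small compared with the mouse-to-mouse spread. Treating the 30 cells as independent observations makes the result look wildly significant; using one number per mouse does not.

```python
# Toy demonstration of the N = 30 vs N = 3 mistake (invented numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# True per-mouse means; the 0.4 shift between groups is comparable to
# the mouse-to-mouse spread, so the "effect" is not convincing.
mice_a = [5.0, 5.3, 5.6]
mice_b = [5.4, 5.7, 6.0]

# Ten cells per mouse, with small within-mouse scatter
cells_a = np.concatenate([m + rng.normal(0, 0.05, 10) for m in mice_a])
cells_b = np.concatenate([m + rng.normal(0, 0.05, 10) for m in mice_b])

# Wrong: pretend the 30 cells per group are 30 independent animals
t_bad, p_bad = stats.ttest_ind(cells_a, cells_b)
print(f"cells as N = 30: p = {p_bad:.1e}")   # tiny, "highly significant"

# Right: average within each mouse, so N = 3 per group
means_a = [cells_a[i*10:(i+1)*10].mean() for i in range(3)]
means_b = [cells_b[i*10:(i+1)*10].mean() for i in range(3)]
t_ok, p_ok = stats.ttest_ind(means_a, means_b)
print(f"mice as N = 3: p = {p_ok:.2f}")      # around 0.2, not significant
```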
Statistical errors happen in serious studies as well, including the famous Raoult paper on hydroxychloroquine (HCQ) for Covid. That error sent people on a wild goose chase and hindered proper testing of the drug. You'll recall the paper claimed that HCQ alone had an effect. Re-do their statistics and you'll find that only the combination of HCQ and Azithromycin was significant. It seems that most of the researchers who reported negative results with HCQ didn't bother to check Raoult's calculations. Could people have died from this mistake?
I'd go so far as to say you haven't read a paper unless you've at least verified they didn't mess up their statistics. You'd be amazed how often that happens.
Not all topics that seem frivolous are useless. The problem of picking matched socks in the dark is a good example.
Here's the problem. Suppose you had only white socks and black socks. You keep them jumbled together in a drawer and you get dressed in the dark. You don't care which ones you wear, but they must match. How many socks do you have to pull out of the drawer to get a matched pair?
Bob Dylan once asked the question “How many roads must a man walk down before you can call him a man?” That's always bothered me about questions like this: they never tell you the answer. Here we're asking “How many socks must a man pick up before you stop calling him a dork?” And we'll give you the answer!
If you only have one foot, you're in luck: no matter how many socks you pick and how many different colors there are, you'll never be one. But suppose, for the sake of argument, that you have two feet and there are two different colors in the drawer. If you picked 3 you'd be 100% certain to get two that were the same color. If there were three colors, you'd need to grab four to be certain to get two of the same color. (Obviously, you could get lucky and get them on the first two you grab, but let's suppose there's a law that says you can only reach into the drawer once.)
What if you had three feet? One sock, whichever color it is, gets probability 1, because we define it to be the color the others must match. Each subsequent sock has a 1/c chance (where c is the number of colors) of matching the first one. To be certain rather than merely lucky, consider the worst case: you could pull f−1 socks of every color before any color reaches f, so one more sock forces a match. The total number of socks you must pick up is therefore n = 1 + (f−1) × c (where f is the number of feet), as the table below shows.
c (no. of colors) | f (no. of feet) | n (no. to grab) | Probabilities |
---|---|---|---|
any | 1 | 1 | 1 |
1 | 2 | 2 | 1 + 1/1 |
2 | 2 | 3 | 1 + 1/2 + 1/2 |
3 | 2 | 4 | 1 + 1/3 + 1/3 + 1/3 |
4 | 2 | 5 | 1 + 1/4 + 1/4 + 1/4 + 1/4 |
1 | 3 | 3 | 1 + 1/1 + 1/1 |
2 | 3 | 5 | 1 + 1/2 + 1/2 + 1/2 + 1/2 |
3 | 3 | 7 | 1 + 1/3 + 1/3 + 1/3 + 1/3 + 1/3 + 1/3 |
4 | 3 | 9 | etc. |
If a man has four feet and there are 17 different kinds of socks, he must pick up n = 1 + (4−1) × 17 or 52 socks to be certain of not looking like a dork. Of course, this works for women as well.
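If you don't trust the algebra, the formula is easy to check mechanically: the worst case is pulling f−1 socks of every color before any color is complete, and the next sock forces a match. A small sketch (the function names are mine):

```python
# Guaranteed number of socks to grab for f feet and c colors.
def socks_needed(feet: int, colors: int) -> int:
    if feet == 1:
        return 1                      # a single sock always "matches" itself
    return 1 + (feet - 1) * colors

# Worst-case check: an adversarial drawer always hands you a sock of the
# color you have drawn the fewest of, until some color reaches f.
def worst_case_draws(feet: int, colors: int) -> int:
    counts = [0] * colors
    draws = 0
    while max(counts) < feet:
        counts[counts.index(min(counts))] += 1
        draws += 1
    return draws

for colors, feet in [(2, 2), (3, 2), (4, 3), (17, 4)]:
    assert worst_case_draws(feet, colors) == socks_needed(feet, colors)
    print(f"{colors} colors, {feet} feet: grab {socks_needed(feet, colors)} socks")
# 17 colors, 4 feet -> 52 socks, as in the example above.
```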
This seemingly frivolous exercise tells us something important: change the question slightly, from how likely a match is to how many draws guarantee one, and a statistical problem turns from a measure of probability into a way of gaining certain knowledge. Try it with your own socks if you don't believe me.
* There's not really any fundamental difference between a one-sample test and a two-sample test. The only difference is that in the one-sample version the variance in one group is zero. So you could always just use a two-sample, two-tailed test.
feb 20 2025, 6:37 am
Hydroxychloroquine is great again
Two highly publicized papers, one on HCQ and one on dexamethasone, show the dangers of relying too much on statistics
The Earth is Round (p<0.05)
After 23 years, the paper with that title still raises uncomfortable questions
Statistics do not decide scientific truth
Some people think statistical validity is a criterion for whether a scientific finding is true. They're wrong
There is no such thing as an irreproducible result
There are no irreproducible results, only badly described ones
Problems with linear regression
First, a tedious statistical question. We'll fix the end of the world later
Meta-analysis of junk science is still junk science
A paper on gender violence and global warming reminds us that meta-analysis doesn't make something true
The standard model of sock physics
Socks actually tell us a great deal about quantum mechanics. Unfortunately, most of what they tell us is wrong.