February 19th, 2016
- Show your ads, as usual
- Randomly divide people into control and experimental groups, 50-50.
- Experimental group sees anti-meat page, control group sees something irrelevant.
- Use retargeting cookies to advertise to the same people and pull them back in for a follow-up.
- Ask people whether they eat meat.
Well, some people planned a study along these lines (methodology) and the results are now out. They randomized who saw the anti-meat videos, followed up with retargeting cookies, and asked people questions about their consumption of various animal products. This is the biggest study of its type I know of, and I'm very excited that it's now complete.
The biggest problem I see is that they ended up surveying many fewer people than they set out to. The methodology considered how many people would need to complete the survey to pick up changes of varying sizes, and concluded:
We need to get at minimum 3.2k people to take the survey to have any reasonable hope of finding an effect. Ideally, we'd want say 16k people or more.
They only got 2k responses, however, and only 1.8k valid ones. This means the study is "underpowered": even if an effect exists at the size the experimenters expected, there's a large chance the study wouldn't be able to clearly show the effect.
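
To make "underpowered" concrete, here is a rough sketch of how power changes with sample size for a comparison of two proportions. It uses a normal approximation, and the 5% and 7% vegetarianism rates are hypothetical values chosen only for illustration; they are not taken from the study or its methodology. The group sizes correspond roughly to splitting 1.8k, 3.2k, and 16k responses evenly.

```python
import math


def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power(p_control, p_experimental, n_per_group, z_crit=1.96):
    """Approximate power of a two-tailed two-proportion z-test.

    z_crit=1.96 corresponds to a 5% significance level.
    """
    # Standard error of the difference between the two observed rates.
    se = math.sqrt(p_control * (1 - p_control) / n_per_group
                   + p_experimental * (1 - p_experimental) / n_per_group)
    shift = abs(p_experimental - p_control) / se
    # Chance the test statistic lands beyond either critical value.
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


# Hypothetical rates (5% vs 7%), for illustration only.
for n in (900, 1600, 8000):
    print(n, round(approx_power(0.05, 0.07, n), 2))
```

Under these assumed rates the printed power rises with the group size, which is the sense in which collecting fewer responses than planned makes a real effect easier to miss.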
Still, let's work with what we have. To compensate for having minimal data, we should run a single test, the one we think is most applicable. Running multiple tests would mean we'd need to use a Bonferroni correction or something similar, and that dramatically decreases your statistical power.
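
As a small illustration of why multiple tests cost power, a Bonferroni correction divides the significance threshold by the number of tests. The helper name below is made up for this sketch, and 0.05 is just the conventional threshold.

```python
def bonferroni_threshold(alpha, n_tests):
    # Each individual test must clear alpha / n_tests so that the chance of
    # any false positive across all the tests stays at or below alpha.
    return alpha / n_tests


for n_tests in (1, 5, 20):
    print(n_tests, bonferroni_threshold(0.05, n_tests))
```

With a single test the threshold stays at 0.05; with twenty tests each one would need a p-value below 0.0025, which a small sample is much less likely to reach.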
Before looking at the data or reading the writeup, I committed (via email to David and Allison) to an approach: what I thought of as the simplest, most straightforward way of looking at it. I would categorize each sample as "meat-eating" or "vegetarian" based on whether they reported eating any meat in the past two days, compute an effect size as the difference in vegetarianism between the two groups, and compute a p-value with a standard two-tailed t-test.
So what do we have to work with for questions? The survey asked, among other things:
In the past two days, how many servings have you had of the following foods? Please give your best guess.
- Pork (ham, bacon, ribs, etc.)
- Beef (hamburgers, meatballs, in tacos, etc.)
- Dairy (milk, yogurt, cheese, etc.)
- Eggs (omelet, in salad, etc.)
- Chicken and Turkey (fried chicken, turkey sandwich, in soup, etc.)
- Fish and Seafood (tuna, crab, baked fish, etc.)
This is potentially rich data, except I don't expect people's responses to be very good. If I tried to answer it, I'm sure I'd miss things for silly reasons, like forgetting what I had for dinner yesterday or not being sure what counts as a serving. On the other hand, if I had a policy for myself of not eating meat, it would be very easy to answer those questions! So I categorized people just as "eats meat" vs "doesn't eat meat".
There were 970 control and 1054 experimental responses in the dataset they released. Of these, only 864 (89%) and 934 (89%) fully filled out this set of questions. I counted someone as a meat-eater if they answered anything other than "0 servings" to any of the four meat-related questions, and a vegetarian otherwise. Totaling up responses I see:
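
The categorization rule can be sketched roughly as follows. The question keys and the assumption that servings are stored as numbers (with None for an unanswered question) are hypothetical; the released dataset may be laid out differently.

```python
# Hypothetical keys; the released dataset may label these differently.
MEAT_QUESTIONS = ["pork", "beef", "chicken_turkey", "fish_seafood"]
ALL_QUESTIONS = MEAT_QUESTIONS + ["dairy", "eggs"]


def classify(response):
    """Return "meat-eater", "vegetarian", or None for an incomplete response.

    `response` maps a question key to a reported number of servings,
    with None (or a missing key) standing in for an unanswered question.
    """
    if any(response.get(q) is None for q in ALL_QUESTIONS):
        return None  # did not fully fill out this set of questions
    if any(response[q] > 0 for q in MEAT_QUESTIONS):
        return "meat-eater"
    return "vegetarian"


# Made-up responses, for illustration only.
print(classify({"pork": 0, "beef": 0, "chicken_turkey": 0,
                "fish_seafood": 0, "dairy": 2, "eggs": 1}))
print(classify({"pork": 0, "beef": 1, "chicken_turkey": 0,
                "fish_seafood": 0, "dairy": 0, "eggs": 0}))
print(classify({"pork": 0, "beef": None, "chicken_turkey": 0,
                "fish_seafood": 0, "dairy": 0, "eggs": 0}))
```

Dairy and eggs only affect whether a response counts as complete here; they never make someone a meat-eater.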
The bottom line is, 2% more people in the experimental group were vegetarians than in the control group (p=0.108). Honestly, this is far higher than I expected. We're surveying people who saw a single video four months ago, and we're seeing that about 2% more of them are vegetarian than they would have been otherwise.
Update 2016-02-20: I computed the p-value wrong; 0.053 was from a one-tailed test instead of a two-tailed test. The right p-value is 0.108. (I had used an online calculator intended for evaluating A/B tests that give you conversion numbers. It didn't specify one- or two-tailed, but since two-tailed is what you should use for A/B tests, that's what I thought it would be using. After Alexander, Michael, and Dan pointed out that it looked wrong, I computed a p-value computationally.)
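
The two p-values in the update differ by roughly a factor of two, which is what a one- versus two-tailed mix-up produces when the test statistic is symmetric. Here is a minimal sketch using a pooled two-proportion z-test in Python 3. This is not necessarily what the online calculator did, and the counts in the example call are hypothetical, not the study's.

```python
import math


def two_proportion_p_values(n_con, s_con, n_exp, s_exp):
    """Return (one_tailed, two_tailed) p-values from a pooled z-test."""
    p_con = s_con / n_con
    p_exp = s_exp / n_exp
    # Pooled rate under the null hypothesis of no difference.
    pooled = (s_con + s_exp) / (n_con + n_exp)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n_con + 1.0 / n_exp))
    z = (p_exp - p_con) / se
    # Tail probability of a standard normal beyond |z|, i.e. a one-tailed
    # test in the direction of the observed difference.
    one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    # A two-tailed test counts equally extreme results in both directions.
    return one_tailed, 2 * one_tailed


# Hypothetical counts, for illustration only.
one_tailed, two_tailed = two_proportion_p_values(1000, 50, 1000, 70)
print(one_tailed, two_tailed)
```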
This is a very different way of interpreting the study results than any of the writeups I've seen. Edge's Report, Mercy for Animals, and Animal Charity Evaluators all conclude that there was basically no effect. I think this mostly comes from their asking questions where I'd expect the data to be noisier, like looking at how much of various things people think they eat or their attitudes toward meat consumption, plus their asking lots of different questions and so needing to correct downward to compensate for the multiple comparisons.
(There's probably something interesting you could do comparing the responses to the attitude questions with whether people reported eating any meat. I started looking at this some, just roughly, but didn't get very far. Maybe there are hints that the ads do their work by reducing recidivism instead of convincing people to give up meat, but I'm too sleepy to figure this out. My work is all in this sheet.)
This drops category labels and assigns people to the two groups, drawing with replacement, and looks at what fraction of the time we get a result at least this extreme in either direction:

    import random
    import sys


    def delta(n_con, n_exp, s_con, s_exp):
        # Absolute difference in vegetarian fraction between the two groups.
        return abs(1.0 * s_con / n_con - 1.0 * s_exp / n_exp)


    def draw_sample(haystacks, needles):
        # True with probability needles / haystacks.
        return random.random() < 1.0 * needles / haystacks


    def start(n_con, n_exp, s_con, s_exp, trials):
        # The observed difference is the bar each simulated trial must meet.
        threshold = delta(n_con, n_exp, s_con, s_exp)
        n_this_extreme = 0
        for i in range(trials):
            i_con = 0
            i_exp = 0
            # Both groups draw from the pooled vegetarian rate (the null).
            for _ in range(n_con):
                if draw_sample(n_con + n_exp, s_con + s_exp):
                    i_con += 1
            for _ in range(n_exp):
                if draw_sample(n_con + n_exp, s_con + s_exp):
                    i_exp += 1
            if delta(n_con, n_exp, i_con, i_exp) >= threshold:
                n_this_extreme += 1
        print("Got absolute difference at "
              "least this big %0.2f%% (%s/%s) "
              "of the time" % (
                  100.0 * n_this_extreme / trials,
                  n_this_extreme, trials))


    if __name__ == "__main__":
        # Arguments: n_con n_exp s_con s_exp trials
        start(*[int(x) for x in sys.argv[1:]])