Jeff Kaufman's Writing
http://www.jefftk.com/news/
Name gender over time
https://www.jefftk.com/p/name-gender-over-time
gender, names · 23 Feb 2016<p>
There are various gendered patterns you hear people talk about with
names, one of which is that names go from male to female <a
href="http://www.dailylife.com.au/life-and-love/parenting-and-families/the-subconscious-misogyny-of-unisex-baby-name-trends-20160119-gm9cay.html">but
not the other way around</a>. After a friend put up a challenge to
find even a single name that went the other way, I decided to look
into the data.
<p>
In the US, the Social Security Administration <a
href="https://www.ssa.gov/oact/babynames/limits.html">provides a
list</a> of all names that at least five social security card holders
have had, broken down by year and gender. I looked at this list and
found all the names that had a decade where they were 60%+ one gender
and then a later decade where they were 60%+ the other gender. For
example, 86% of babies named "Gale" were male in the 1920s, but by the
1950s 81% were female:
<p>
<img src="/gale-name-popularity.png" srcset="/gale-name-popularity-2x.png 2x">
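<p>
A rough sketch of that filtering in Python, working from the SSA files (<code>yob1880.txt</code> through <code>yob2009.txt</code>, one <code>name,sex,count</code> line per row). This isn't the exact script I ran; the <code>flipped_names</code> helper is illustrative, and it only reports a name's first flip:

```python
import csv
import os
from collections import defaultdict

def flipped_names(names_dir, threshold=0.6):
    # counts[name][decade] = [male_count, female_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for year in range(1880, 2010):
        with open(os.path.join(names_dir, "yob%d.txt" % year)) as f:
            for name, sex, count in csv.reader(f):
                idx = 1 if sex == "F" else 0
                counts[name][year // 10 * 10][idx] += int(count)
    flips = []
    for name, decades in counts.items():
        seen_male = seen_female = False
        for decade in sorted(decades):
            male, female = decades[decade]
            if male >= threshold * (male + female):
                if seen_female:
                    flips.append((name, "F->M"))
                    break
                seen_male = True
            elif female >= threshold * (male + female):
                if seen_male:
                    flips.append((name, "M->F"))
                    break
                seen_female = True
    return flips
```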
<p>
Overall, I found 62 popular male-to-female names and 26 female-to-male
ones. What do these look like?
<p>
<h5>Names that moved from male to female</h5>
<table border=1 style="margin: 10px">
<tr><th>Name</th><th>Male % (n)</th><th>Male<br>Decade</th>
<th>Female % (n)</th><th>Female<br>Decade</th>
<tr><td>Addison</td><td>100% (n=170)</td><td>1880s</td><td>93%(n=51348)</td><td>2000s</td>
<tr><td>Allison</td><td>100% (n=70)</td><td>1880s</td><td>100%(n=74418)</td><td>1990s</td>
<tr><td>Angel</td><td>87% (n=1088)</td><td>1930s</td><td>60%(n=16816)</td><td>1970s</td>
<tr><td>Ariel</td><td>67% (n=1477)</td><td>1970s</td><td>86%(n=28135)</td><td>1990s</td>
<tr><td>Ashley</td><td>100% (n=80)</td><td>1880s</td><td>100%(n=301779)</td><td>1990s</td>
<tr><td>Aubrey</td><td>97% (n=3371)</td><td>1950s</td><td>95%(n=29506)</td><td>2000s</td>
<tr><td>Avery</td><td>100% (n=132)</td><td>1880s</td><td>72%(n=41680)</td><td>2000s</td>
<tr><td>Bailey</td><td>100% (n=78)</td><td>1880s</td><td>90%(n=4079)</td><td>1980s</td>
<tr><td>Beverly</td><td>100% (n=107)</td><td>1880s</td><td>100%(n=3347)</td><td>1990s</td>
<tr><td>Billie</td><td>78% (n=123)</td><td>1880s</td><td>89%(n=9978)</td><td>1970s</td>
<tr><td>Blair</td><td>100% (n=43)</td><td>1880s</td><td>67%(n=3562)</td><td>1990s</td>
<tr><td>Brook</td><td>70% (n=709)</td><td>1960s</td><td>88%(n=3009)</td><td>1990s</td>
<tr><td>Carey</td><td>96% (n=144)</td><td>1880s</td><td>67%(n=5603)</td><td>1970s</td>
<tr><td>Courtney</td><td>67% (n=628)</td><td>1940s</td><td>97%(n=113275)</td><td>1990s</td>
<tr><td>Dana</td><td>67% (n=90)</td><td>1880s</td><td>91%(n=7930)</td><td>2000s</td>
<tr><td>Dee</td><td>76% (n=276)</td><td>1880s</td><td>88%(n=4829)</td><td>1960s</td>
<tr><td>Elisha</td><td>100% (n=336)</td><td>1880s</td><td>83%(n=4840)</td><td>1980s</td>
<tr><td>Emerson</td><td>100% (n=185)</td><td>1880s</td><td>62%(n=6199)</td><td>2000s</td>
<tr><td>Emery</td><td>100% (n=449)</td><td>1880s</td><td>68%(n=3546)</td><td>2000s</td>
<tr><td>Gale</td><td>86% (n=2008)</td><td>1920s</td><td>81%(n=9479)</td><td>1950s</td>
<tr><td>Garnett</td><td>79% (n=46)</td><td>1880s</td><td>63%(n=219)</td><td>1900s</td>
<tr><td>Harley</td><td>100% (n=685)</td><td>1880s</td><td>62%(n=8192)</td><td>2000s</td>
<tr><td>Harper</td><td>100% (n=52)</td><td>1880s</td><td>80%(n=5663)</td><td>2000s</td>
<tr><td>Hollie</td><td>72% (n=52)</td><td>1880s</td><td>100%(n=2883)</td><td>1990s</td>
<tr><td>Jackie</td><td>66% (n=15674)</td><td>1930s</td><td>73%(n=25895)</td><td>1960s</td>
<tr><td>Jaime</td><td>91% (n=769)</td><td>1940s</td><td>66%(n=22772)</td><td>1970s</td>
<tr><td>Jodie</td><td>65% (n=324)</td><td>1910s</td><td>95%(n=4256)</td><td>1980s</td>
<tr><td>Kelley</td><td>100% (n=81)</td><td>1900s</td><td>92%(n=9786)</td><td>1980s</td>
<tr><td>Kelly</td><td>100% (n=119)</td><td>1880s</td><td>93%(n=21607)</td><td>2000s</td>
<tr><td>Kendall</td><td>100% (n=355)</td><td>1910s</td><td>81%(n=21020)</td><td>2000s</td>
<tr><td>Kennedy</td><td>99% (n=1128)</td><td>1960s</td><td>94%(n=27554)</td><td>2000s</td>
<tr><td>Kerry</td><td>90% (n=535)</td><td>1930s</td><td>71%(n=16128)</td><td>1970s</td>
<tr><td>Kim</td><td>68% (n=2724)</td><td>1940s</td><td>93%(n=94380)</td><td>1960s</td>
<tr><td>Lacy</td><td>84% (n=873)</td><td>1940s</td><td>97%(n=8404)</td><td>1980s</td>
<tr><td>Lauren</td><td>100% (n=383)</td><td>1910s</td><td>100%(n=97115)</td><td>2000s</td>
<tr><td>Leigh</td><td>100% (n=33)</td><td>1880s</td><td>95%(n=11412)</td><td>1970s</td>
<tr><td>Lesley</td><td>100% (n=74)</td><td>1900s</td><td>98%(n=4121)</td><td>2000s</td>
<tr><td>Leslie</td><td>93% (n=10629)</td><td>1910s</td><td>97%(n=31036)</td><td>2000s</td>
<tr><td>Lindsay</td><td>100% (n=62)</td><td>1880s</td><td>99%(n=12258)</td><td>2000s</td>
<tr><td>Lindsey</td><td>100% (n=109)</td><td>1880s</td><td>99%(n=21629)</td><td>2000s</td>
<tr><td>Lindy</td><td>87% (n=546)</td><td>1920s</td><td>97%(n=1777)</td><td>1980s</td>
<tr><td>Loren</td><td>100% (n=230)</td><td>1880s</td><td>64%(n=3519)</td><td>1990s</td>
<tr><td>Lynn</td><td>95% (n=272)</td><td>1880s</td><td>89%(n=51935)</td><td>1960s</td>
<tr><td>Madison</td><td>100% (n=218)</td><td>1880s</td><td>99%(n=193063)</td><td>2000s</td>
<tr><td>Morgan</td><td>100% (n=252)</td><td>1880s</td><td>90%(n=91179)</td><td>1990s</td>
<tr><td>Paris</td><td>100% (n=64)</td><td>1880s</td><td>91%(n=10321)</td><td>2000s</td>
<tr><td>Pat</td><td>100% (n=220)</td><td>1880s</td><td>75%(n=16149)</td><td>1940s</td>
<tr><td>Patsy</td><td>60% (n=1845)</td><td>1910s</td><td>98%(n=40274)</td><td>1940s</td>
<tr><td>Reese</td><td>100% (n=81)</td><td>1880s</td><td>68%(n=13819)</td><td>2000s</td>
<tr><td>Regan</td><td>62% (n=583)</td><td>1960s</td><td>90%(n=6041)</td><td>2000s</td>
<tr><td>Rosario</td><td>85% (n=821)</td><td>1910s</td><td>77%(n=1380)</td><td>1950s</td>
<tr><td>Sandy</td><td>100% (n=191)</td><td>1880s</td><td>94%(n=2849)</td><td>2000s</td>
<tr><td>Shelby</td><td>100% (n=115)</td><td>1880s</td><td>97%(n=31283)</td><td>2000s</td>
<tr><td>Shelly</td><td>100% (n=54)</td><td>1890s</td><td>99%(n=9095)</td><td>1980s</td>
<tr><td>Sidney</td><td>97% (n=12794)</td><td>1910s</td><td>75%(n=8400)</td><td>2000s</td>
<tr><td>Skylar</td><td>75% (n=1576)</td><td>1980s</td><td>78%(n=21860)</td><td>2000s</td>
<tr><td>Stacy</td><td>64% (n=586)</td><td>1940s</td><td>96%(n=36989)</td><td>1980s</td>
<tr><td>Stevie</td><td>99% (n=3571)</td><td>1960s</td><td>71%(n=2842)</td><td>1990s</td>
<tr><td>Sydney</td><td>91% (n=137)</td><td>1880s</td><td>99%(n=76409)</td><td>2000s</td>
<tr><td>Taylor</td><td>100% (n=276)</td><td>1880s</td><td>85%(n=100918)</td><td>2000s</td>
<tr><td>Tracy</td><td>84% (n=846)</td><td>1930s</td><td>89%(n=28489)</td><td>1980s</td>
<tr><td>Whitney</td><td>100% (n=335)</td><td>1910s</td><td>99%(n=32883)</td><td>1990s</td>
</table>
<p>
<h5>Names that moved from female to male</h5>
<table border=1 style="margin: 10px">
<tr><th>Name</th><th>Female % (n)</th><th>Female<br>Decade</th>
<th>Male % (n)</th><th>Male<br>Decade</th>
<tr><td>Angel</td><td>60% (n=16816)</td><td>1970s</td><td>79%(n=94237)</td><td>2000s</td>
<tr><td>Artie</td><td>85% (n=699)</td><td>1890s</td><td>81%(n=570)</td><td>1960s</td>
<tr><td>Ashton</td><td>63% (n=3609)</td><td>1980s</td><td>87%(n=32400)</td><td>2000s</td>
<tr><td>Audie</td><td>92% (n=54)</td><td>1880s</td><td>93%(n=632)</td><td>1960s</td>
<tr><td>Dell</td><td>68% (n=169)</td><td>1900s</td><td>71%(n=599)</td><td>1960s</td>
<tr><td>Donnie</td><td>91% (n=179)</td><td>1880s</td><td>97%(n=3824)</td><td>1980s</td>
<tr><td>Elisha</td><td>83% (n=4840)</td><td>1980s</td><td>67%(n=3456)</td><td>2000s</td>
<tr><td>Frankie</td><td>100% (n=439)</td><td>1880s</td><td>82%(n=3656)</td><td>1980s</td>
<tr><td>Garnett</td><td>63% (n=219)</td><td>1900s</td><td>62%(n=505)</td><td>1940s</td>
<tr><td>Germaine</td><td>100% (n=230)</td><td>1900s</td><td>63%(n=972)</td><td>1970s</td>
<tr><td>Gerry</td><td>76% (n=800)</td><td>1920s</td><td>86%(n=1339)</td><td>1970s</td>
<tr><td>Jackie</td><td>94% (n=150)</td><td>1900s</td><td>66%(n=15674)</td><td>1930s</td>
<tr><td>Jaime</td><td>66% (n=22772)</td><td>1970s</td><td>84%(n=12040)</td><td>2000s</td>
<tr><td>Jan</td><td>84% (n=12827)</td><td>1960s</td><td>96%(n=2054)</td><td>2000s</td>
<tr><td>Jean</td><td>98% (n=80360)</td><td>1940s</td><td>76%(n=2613)</td><td>2000s</td>
<tr><td>Jessie</td><td>83% (n=9186)</td><td>1880s</td><td>67%(n=8697)</td><td>1960s</td>
<tr><td>Joan</td><td>100% (n=121)</td><td>1880s</td><td>64%(n=1805)</td><td>2000s</td>
<tr><td>Kris</td><td>63% (n=7157)</td><td>1960s</td><td>72%(n=1182)</td><td>1980s</td>
<tr><td>Lennie</td><td>92% (n=511)</td><td>1900s</td><td>76%(n=568)</td><td>1960s</td>
<tr><td>Maxie</td><td>87% (n=48)</td><td>1880s</td><td>72%(n=579)</td><td>1950s</td>
<tr><td>Merle</td><td>65% (n=717)</td><td>1890s</td><td>91%(n=929)</td><td>1970s</td>
<tr><td>Pat</td><td>75% (n=16149)</td><td>1940s</td><td>77%(n=788)</td><td>1970s</td>
<tr><td>Patsy</td><td>97% (n=193)</td><td>1880s</td><td>60%(n=1845)</td><td>1910s</td>
<tr><td>Robbie</td><td>96% (n=293)</td><td>1890s</td><td>78%(n=2487)</td><td>1980s</td>
<tr><td>Theo</td><td>64% (n=212)</td><td>1890s</td><td>65%(n=680)</td><td>1930s</td>
<tr><td>Toby</td><td>65% (n=947)</td><td>1930s</td><td>91%(n=9889)</td><td>1970s</td>
</table>
<p>
The first thing I noticed about these two lists is that the split
isn't as lopsided as I expected. While there are definitely more male
names that have become female, it's only 2-3x more, and there are
quite a few that have gone the other way:
<p>
<img src="/ashton-name-popularity.png" srcset="/ashton-name-popularity-2x.png 2x">
<p>
<img src="/merle-name-popularity.png" srcset="/merle-name-popularity-2x.png 2x">
<p>
And even some that have switched more than once:
<p>
<img src="/jackie-name-popularity.png" srcset="/jackie-name-popularity-2x.png 2x">
<p>
<img src="/elisha-name-popularity.png" srcset="/elisha-name-popularity-2x.png 2x">
<p>
One thing that jumps out at me about the female-to-male list, though,
is that there are a lot of diminutives, like "Frankie".
<p>
<img src="/frankie-name-popularity.png" srcset="/frankie-name-popularity-2x.png 2x">
<p>
My guess is that what's happening here isn't a traditionally female
name becoming a common male name; instead we're mostly seeing
changing norms around what to put down as an official name for Social
Security purposes. The idea is that maybe there have always been more
men going by "Frankie," but initially most of these men had been
given the name "Frank," and later more parents started putting down
the diminutive as the official name. I don't know how I'd test this
guess, though.
<p><i>Comment via: <a href="https://plus.google.com/103013777355236494008/posts/3gW8Mz8stcT">google plus</a>, <a href="https://www.facebook.com/jefftk/posts/772529163242">facebook</a></i>

MFA Veg Ads Study
https://www.jefftk.com/p/mfa-veg-ads-study
veg · 19 Feb 2016<p>
If you would like to get people to stop eating animals, there are a
lot of things you could do: protest meat-serving restaurants, hand out
leaflets, show online ads, lobby companies to do "meatless mondays,"
etc. To compare these, it would be useful to know how much of an
impact they have. A while ago I <a
href="/p/vegetarian-survey-proposal">proposed</a> a simple survey to
measure the impact of online ads:
<p>
<ul>
<li>Show your ads, as usual</li>
<li>Randomly divide people into control and experimental groups,
50-50.</li>
<li>Experimental group sees <a
href="http://www.meatvideo.com/">anti-meat page</a>, control
group sees something irrelevant.</li>
<li>Use retargeting cookies to pull people back in for a
follow-up.</li>
<li>Ask people whether they eat meat.</li>
</ul>
<p>
Well, some people planned a study along these lines (<a
href="https://docs.google.com/document/d/1nA3VYi3-UbAOefSWCNcFQQ2fTIceiX1CkyZMIHO-q2Q/preview">methodology</a>)
and the results are now <a
href="http://www.mercyforanimals.org/impact-study">out</a>. They
randomized who saw the anti-meat videos, followed up with retargeting
cookies, and asked people questions about their consumption of various
animal products. This is the biggest study of its type I know of, and
I'm very excited that it's now complete.
<p>
The biggest problem I see is that they ended up surveying many fewer
people than they set out to. The methodology considered how many
people would need to complete the survey to pick up changes of varying
sizes, and concluded:
<blockquote>
We need to get at minimum 3.2k people to take the survey to have any
reasonable hope of finding an effect. Ideally, we'd want say 16k
people or more.
</blockquote>
They only got 2k responses, however, and only 1.8k valid ones. This
means the study is "underpowered": even if an effect exists at the
size the experimenters expected, there's a large chance the study
wouldn't be able to clearly show the effect.
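<p>
For intuition about where a minimum like 3.2k comes from, here's the standard two-proportion sample-size calculation using the normal approximation. The baseline rate (5%) and effect (2 percentage points) below are made up for illustration, not taken from the methodology doc:

```python
# Rough sample-size calculation for comparing two proportions (normal
# approximation).  The rates below are hypothetical, for illustration.
from math import ceil

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    # z_alpha: two-tailed 5% significance; z_beta: 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_group(0.05, 0.07))  # prints 2207
```

With those made-up numbers you'd want over two thousand valid responses per group, which is the right ballpark for why 1.8k total responses leaves the study underpowered.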
<p>
Still, let's work with what we have. To compensate for having
minimal data, we should run a single test, the one we think is most
applicable. Running multiple tests would mean we'd need to use a <a
href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni
correction</a> or something similar, and that dramatically decreases
our statistical power.
<p>
Before looking at the data or reading the writeup, I committed (via
email to <a href="http://davidchudzicki.com">David</a> and <a
href="http://www.animalcharityevaluators.org/about/meet-our-team/#allison">Allison</a>)
to an approach, what I thought of as the simplest, most
straightforward way of looking at it. I would categorize each sample
as "meat-eating" or "vegetarian" based on whether they reported eating
any meat in the past two days, compute an effect size as the
difference in vegetarianism between the two groups, and compute a
p-value with a standard two-tailed t-test.
<p>
So what do we have to work with for questions? The survey asked,
among other things:
<p>
<blockquote>
In the <span style="text-decoration: underline;">past two
days</span>, how many servings have you had of the following foods?
Please give your best guess.
<ul>
<li>Pork (ham, bacon, ribs, etc.)</li>
<li>Beef (hamburgers, meatballs, in tacos, etc.)</li>
<li>Dairy (milk, yogurt, cheese, etc.)</li>
<li>Eggs (omelet, in salad, etc.)</li>
<li>Chicken and Turkey (fried chicken, turkey sandwich, in soup,
etc.)</li>
<li>Fish and Seafood (tuna, crab, baked fish, etc.)</li>
</ul>
</blockquote>
<p>
This is potentially rich data, except I don't expect people's
responses to be very good. If I tried to answer it, I'm sure I'd miss
things for silly reasons, like forgetting what I had for dinner
yesterday or not being sure what counts as a serving. On the other
hand, if I had a policy for myself of not eating meat, it would be
very easy to answer those questions! So I categorized people just as "eats
meat" vs "doesn't eat meat".
<p>
There were 970 control and 1054 experimental responses in the dataset
they released. Of these, only 864 (89%) and 934 (89%) fully filled
out this set of questions. I counted someone as a meat-eater if they
answered anything other than "0 servings" to any of the four
meat-related questions, and a vegetarian otherwise. Totaling up
responses I see:
<p>
<table border=1 style="margin: 20px" cellpadding=6>
<tr><td></td><th>valid responses</th>
<th>vegetarians</th>
<th>%</th></tr>
<tr><th>control</th>
<td>864</td>
<td>55</td>
<td>6.4%</td></tr>
<tr><th>experimental</th>
<td>934</td>
<td>78</td>
<td>8.4%</td></tr>
</table>
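<p>
The classification rule can be sketched like this; the field names are hypothetical, since the released dataset uses its own column labels:

```python
# Sketch: a response counts as "meat" if any of the four meat questions
# got a nonzero answer; responses that skipped any of the six food
# questions are dropped.  Field names here are made up.
MEAT_FIELDS = ["pork", "beef", "chicken_turkey", "fish_seafood"]
ALL_FIELDS = MEAT_FIELDS + ["dairy", "eggs"]

def classify(response):
    """Return 'meat', 'veg', or None for an incomplete response."""
    if any(response.get(f) is None for f in ALL_FIELDS):
        return None  # didn't fully fill out this set of questions
    if any(response[f] > 0 for f in MEAT_FIELDS):
        return "meat"
    return "veg"
```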
<p>
The bottom line is, 2% more people in the experimental group were
vegetarians than in the control group (<i><strike>p=0.053</strike>
p=0.108</i>). Honestly, this is far higher than I expected. We're
surveying people who saw a single video four months ago, and we're
seeing that about 2% more of them are vegetarian than they would have
been otherwise.
<p>
<a name="update-2016-02-20"></a><b>Update 2016-02-20</b>: I computed the p-value wrong; 0.053 was from
a one-tailed test instead of a two-tailed test. The right p-value is
0.108. (I had used an <a
href="https://docs.google.com/spreadsheets/d/1q7x9aGGsM20SlCVKq54qYC6EK7-a9KH83AERc9TLOBU/edit#gid=0">online
calculator</a> intended for evaluating A/B tests that give you
conversion numbers. It didn't specify one- or two-tailed, but since
two-tailed is what you should use for A/B tests that's what I thought
it would be using. After <a
href="/p/mfa-veg-ads-study#fb-771959135582_771961875092">Alexander</a>, <a href="/p/mfa-veg-ads-study#6sn">Michael</a>,
and <a href="/p/mfa-veg-ads-study#6sr">Dan</a> pointed out that it
looked wrong, I computed the p-value by simulation. [1])
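<p>
As a cross-check on that number, the standard two-proportion z-test (normal approximation, standard library only) gives nearly the same answer:

```python
# Two-proportion z-test (normal approximation) on the table above.
from math import sqrt, erfc

def two_tailed_p(n_con, s_con, n_exp, s_exp):
    p_con = s_con / n_con
    p_exp = s_exp / n_exp
    pooled = (s_con + s_exp) / (n_con + n_exp)
    se = sqrt(pooled * (1 - pooled) * (1 / n_con + 1 / n_exp))
    z = abs(p_exp - p_con) / se
    return erfc(z / sqrt(2))  # = 2 * (1 - Phi(z)), two-tailed

print(round(two_tailed_p(864, 55, 934, 78), 3))  # prints 0.108
```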
<p>
This is a very different way of interpreting the study results than
any of the writeups I've seen. <a
href="http://www.mercyforanimals.org/files/Edge-Report.pptx">Edge's
Report</a>, <a
href="http://www.mercyforanimals.org/impact-study">Mercy for
Animals</a>, and <a
href="http://www.animalcharityevaluators.org/blog/our-initial-thoughts-on-the-mfa-facebook-ads-study/">Animal
Charity Evaluators</a> all conclude that there was basically no
effect. I think this mostly comes from their asking questions where
I'd expect the data to be noisier, like looking at how much of various
things people think they eat or their attitudes toward meat
consumption, plus their asking lots of different questions and so
needing to correct downward to compensate for the multiple
comparisons.
<p>
(There's probably something interesting you could do comparing the
responses to the attitude questions with whether people reported
eating any meat. I started looking at this some, just roughly, but
didn't get very far. Maybe there are hints that the ads do their work
by reducing recidivism instead of convincing people to give up meat,
but I'm too sleepy to figure this out. My work is all in <a
href="https://docs.google.com/spreadsheets/d/1qS20H6fiH2Bnz3Wz1cro9TvxyXeB_RlBtvzq85ziXDA/edit#gid=565125460">this
sheet</a>.)
<p>
<br>
[1] This drops category labels and assigns people to the two
groups, drawing with replacement, and looks at what fraction of
the time we get a result at least this extreme in either
direction:
<pre>
import sys
import random

def delta(n_con, n_exp, s_con, s_exp):
    return abs(1.0*s_con/n_con -
               1.0*s_exp/n_exp)

def draw_sample(haystacks, needles):
    return (random.random() < 1.0 *
            needles / haystacks)

def start(n_con, n_exp,
          s_con, s_exp,
          trials):
    threshold = delta(n_con, n_exp,
                      s_con, s_exp)
    n_this_extreme = 0
    for i in range(trials):
        i_con = 0
        i_exp = 0
        for _ in range(n_con):
            if draw_sample(n_con + n_exp,
                           s_con + s_exp):
                i_con += 1
        for _ in range(n_exp):
            if draw_sample(n_con + n_exp,
                           s_con + s_exp):
                i_exp += 1
        if delta(n_con, n_exp,
                 i_con, i_exp) >= threshold:
            n_this_extreme += 1
    print("Got absolute difference at "
          "least this big %0.2f%% (%s/%s) "
          "of the time" % (
              100.0 * n_this_extreme / trials,
              n_this_extreme, trials))

if __name__ == "__main__":
    start(*[int(x) for x in sys.argv[1:]])
</pre>
<p><i>Comment via: <a href="https://plus.google.com/103013777355236494008/posts/awqrCnQoqra">google plus</a>, <a href="https://www.facebook.com/jefftk/posts/771959135582">facebook</a>, <a href="http://effective-altruism.com/ea/tz">the EA Forum</a></i>

One more
https://www.jefftk.com/p/one-more
parenting · 15 Feb 2016<p>
When Lily wants something, she often wants more than I'm ok giving
her. Maybe she wants me to sing forever and not put her down in her
crib for the night, maybe she wants to keep playing and it's time to
go, maybe she wants to keep eating chocolates; I find this comes up a
lot. Something that's been useful in these cases is having a little
routine around "one more":
<p>
<blockquote>
A: more sing<br>
B: [sings]<br>
A: more<br>
B: I'm going to do one more song, and then night-night, ok?<br>
A: [sad]<br>
B: One more?<br>
A: yes<br>
B: [sings]<br>
A: more<br>
B: We did one more song, and now it's time to go night-night.<br>
</blockquote>
<p>
She still asks for more at the end or protests some, but because she
had the heads up she's not really expecting me to say yes, and doesn't
get that upset when I tell her no.
<p>
We started doing this at maybe a year, when she could say just a few
words and mostly cried to indicate that things weren't to her
satisfaction. At that point it looked more like:
<p>
<blockquote>
B: [starts to put A down in the crib]<br>
A: [cries]<br>
B: [picks A back up]<br>
A: [stops crying]<br>
B: Would you like one more?<br>
A: [cries a little]<br>
B: [sings]<br>
B: Time for night night<br>
B: [puts A down again]<br>
A: [cries a little]<br>
</blockquote>
<p>
Even though she wasn't at talking age yet, she could still learn the
pattern around "one more". It communicated that we were nearly done,
that we weren't going to have more of the thing if she cried a lot,
and that it was time to adjust.
<p>
I've been pretty happy with how this has worked out, but n=1 so this
may not generalize.
<p><i>Comment via: <a href="https://plus.google.com/103013777355236494008/posts/RwseL92JzNo">google plus</a>, <a href="https://www.facebook.com/jefftk/posts/771386672802">facebook</a></i>

Gaussian Apartment Price Map
https://www.jefftk.com/p/gaussian-apartment-price-map
housing, map · 12 Feb 2016<p>
I've reworked my <a href="/apartment_prices/">apartment price map</a>
again. Here's what it used to look like:
<p>
<img src="/apartment-map-closeup-old.png" srcset="/apartment-map-closeup-old-2x.png 2x">
<br>
<i><a href="/apartment_prices/old#2016-01-18&2">old version, live</a></i>
<p>
And here's what it looks like now:
<img src="/apartment-map-closeup-new.png" srcset="/apartment-map-closeup-new-2x.png 2x">
<br>
<i><a href="/apartment_prices/index#2016-01-18&2">new version, live</a></i>
<p>
The basic problem of the map is that I have a large number of samples
(points with prices), and I want to make predictions for what new
samples would be if I had them. Then I can color each pixel on the
map based on that prediction.
<p>
People say location matters a lot in prices, but imagine it really
were the only thing: all units the same size and condition, and every
unit listed at exactly the right price for its location. In that case
the advertised price you'd see for a unit would be a completely
accurate view of the underlying distribution of location values.
<p>
In the real world, though, there's a lot of noise. There's still a
distribution of location value over space, everyone knows some areas
are going to cost more than others, but some units are bigger, newer,
or more optimistically advertised which makes this hard to see. So
we'd like to smooth our samples, to remove some of this noise.
<p>
When I first wrote this, back in 2011, I just wanted something that
looked reasonable. I figured some kind of averaging made sense, so I
wrote:
<pre>
def weight(loc, p_loc):
    return 1 / distance(loc, p_loc)

def predict(loc):
    total = sum(weight(loc, p_loc)
                for price, p_loc in samples)
    return sum(weight(loc, p_loc) * price
               for price, p_loc in samples) / total
</pre>
<p>
This is a weighted average of all the samples, where the weight is
<code>1/distance</code>. Every sample contributes to our prediction
for a new point, but we consider the closer ones more. A lot more:
<p>
<img src="/graph-one-over-x-city.png" srcset="/graph-one-over-x-city-2x.png 2x">
<p>
The problem is, though, if we have a sample that's very close to the
point we're interested in, then the weight is enormous. We end up
overvaluing that sample and overfitting.
<p>
What if we add something to distance, though? Then nothing will be
too close and our weights won't get that high:
<p>
<img src="/graph-one-over-x-plus-city.png" srcset="/graph-one-over-x-plus-city-2x.png 2x">
<p>
This still isn't great. One problem is that it overweights the tails;
we need something that falls off fast enough that it stops caring
about things that are too far away. One way to do that is to add an
arbitrary "ignore things farther than this" cutoff:
<p>
<img src="/graph-one-over-x-plus-cutoff-city.png" srcset="/graph-one-over-x-plus-cutoff-city-2x.png 2x">
<p>
This was as far as I got in 2011, all as a product of fiddling with my
map until I had something that looked vaguely reasonable. At this
point it did, so I went with it. But it had silly things, like places
where it would color the area around one high outlier sample red, and
then just up the street color the area around a low outlier sample
green:
<p>
<img src="/apartment-clear-overfitting.png" srcset="/apartment-clear-overfitting-2x.png 2x">
<p>
The problem is that <code>1/distance</code> or even
<code>1/(distance+a) if distance < b else 0</code> is just not a
good smoothing function. While doing the completely right thing here
is hard, gaussian smoothing would be better:
<p>
<img src="/graph-gaussian-simple-city.png" srcset="/graph-gaussian-simple-city-2x.png 2x">
<p>
This doesn't need arbitrary corrections, and it's continuous. Good stuff.
<p>
How wide should we make it, though? The wider it is, the more it
considers points that are farther away. This makes it less prone to
overfitting, but also means it may miss some detailed structure. To
figure out a good width we can find one that minimizes the prediction
error.
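<p>
Concretely, here's a sketch of that search, using simple leave-one-out error over a grid of candidate widths. This is a toy version with made-up helper names, not the map's actual code:

```python
# Toy sketch: gaussian-kernel prediction, and picking the width (sigma)
# that minimizes leave-one-out squared prediction error.
from math import exp

def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def predict(loc, sigma, samples):
    # gaussian-weighted average of sample prices
    weights = [exp(-distance(loc, p_loc) ** 2 / (2 * sigma ** 2))
               for _, p_loc in samples]
    return (sum(w * price for w, (price, _) in zip(weights, samples))
            / sum(weights))

def loo_error(sigma, samples):
    # predict each sample from all the *other* samples
    err = 0.0
    for i, (price, loc) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        err += (predict(loc, sigma, others) - price) ** 2
    return err

def best_sigma(samples, candidates):
    return min(candidates, key=lambda s: loo_error(s, samples))
```

A width that's too small chases individual outliers; one that's too large smears detail away; the leave-one-out error penalizes both.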
<p>
If we go through our samples and ask the question, "what would we
predict for this point if we didn't have any data here," the
difference between the prediction and the sample is our error.
I ran this predictor across the last several years of data, and
found the variance that minimized this error. Judging by eye it's a
bit smoother than I'd like, but apparently the patterns you get when
you fit it more tightly are just misleading.
<p>
So: a new map, with gaussian smoothing and a data-derived variance
instead of the hacks I had before.
<p>
Still, something was bothering me. I've been assuming that unit
pricing can be represented in an affine way. That is, while a 2br
isn't twice as expensive as a 1br, I was assuming the difference
between a 1br and a 2br was the same as between a 2br and a 3br. While
this is close to correct, here's a graph I generated <a
href="/p/better-apartment-price-map">last time</a> that shows it's a
bit off:
<p>
<img src="/costs-by-bedroom.png" srcset="/costs-by-bedroom-2x.png 2x">
<p>
You can see that if you drew a trend line through those points, 1br
and 2br units would lie a bit above the trend while the others would
be a bit below.
<p>
There's also a more subtle problem: areas with cheaper units overall
tend to have larger units. Independence is an imperfect assumption
for unit size and unit price.
<p>
Now that I'm computing errors, though, I can partially adjust for
these. In my updated model, I compute the average error for each unit
size, and then scale my predictions appropriately. So if studios cost
3% more than I would have predicted, I bump their predictions up.
The predictions still get individual units wrong just as often, but
at least they're no longer systematically high or low for a given
unit size. Here's what
this looks like for the most recent month:
<p>
<table border=1 cellpadding=5 style="margin: 20px">
<tr><td>studio</td><td>+3%</td></tr>
<tr><td>1br</td><td>-3%</td></tr>
<tr><td>2br</td><td>-5%</td></tr>
<tr><td>3br</td><td>-2%</td></tr>
<tr><td>4br</td><td>+5%</td></tr>
<tr><td>5br</td><td>+11%</td></tr>
</table>
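<p>
The adjustment itself is simple; here's a sketch, with hypothetical names, of deriving the per-size scaling from prediction errors:

```python
# Sketch of the per-size correction: compute the mean ratio of actual
# to predicted price for each unit size, then scale future predictions
# by it.  Names and data shapes here are hypothetical.
from collections import defaultdict

def size_corrections(samples):
    # samples: iterable of (bedrooms, actual_price, predicted_price)
    totals = defaultdict(lambda: [0.0, 0])
    for bedrooms, actual, predicted in samples:
        totals[bedrooms][0] += actual / predicted
        totals[bedrooms][1] += 1
    return {b: s / n for b, (s, n) in totals.items()}

def corrected_prediction(raw_prediction, bedrooms, corrections):
    return raw_prediction * corrections.get(bedrooms, 1.0)
```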
<p>
I'm still not completely satisfied with this, but it's a lot better
than it was before.
<p><i>Comment via: <a href="https://plus.google.com/103013777355236494008/posts/HUNfo2FTq9h">google plus</a>, <a href="https://www.facebook.com/jefftk/posts/771094333652">facebook</a></i>

Dirname is Evil
https://www.jefftk.com/p/dirname-is-evil
tech · 11 Feb 2016<p>
I recently was writing some <a
href="https://github.com/pagespeed/mod_pagespeed/pull/1260">code</a> [1]
that needed to know the parent directory of a file:
<p>
<pre>
size_t final_slash = filename.find_last_of('/');
return filename.substr(0, final_slash);
</pre>
<p>
Why do this when there's <a
href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/dirname.html"><code>dirname(3)</code></a>?
Because <code>dirname</code> is evil:
<p>
<ul>
<li>It has big traps.</li>
<li>Different implementations have different traps.</li>
</ul>
<p>
On some systems, <code>dirname</code> modifies its input. For
example, here's an implementation that's nearly [2] POSIX-conforming:
<pre>
char* dirname(char* path) {
  static char dot[] = ".";
  if (!path) return dot;
  char* last_slash = NULL;
  for (char* p = path; *p; p++) {
    if (*p == '/') last_slash = p;
  }
  if (!last_slash) return dot;
  *last_slash = '\0';
  return path;
}
</pre>
There are nice things about this: it doesn't need to allocate any
memory and it's thread safe. This is what <a
href="http://alien.cern.ch/cache/glibc-2.3.2/misc/dirname.c">glibc
does</a> and is probably the most common behavior. Still, modifying
your input string may not be what you want!
<p>
Systems can choose, however, to define it in other ways. For example,
here's an implementation that leaves its input alone, but instead
isn't thread-safe.
<pre>
char* dirname(char* path) {
  static char buffer[PATH_MAX];
  static char dot[] = ".";
  if (!path) return dot;
  size_t last_slash_pos = (size_t)-1;
  for (size_t i = 0; path[i]; i++) {
    if (i >= PATH_MAX) return dot;
    if (path[i] == '/') last_slash_pos = i;
  }
  if (last_slash_pos == (size_t)-1) return dot;
  strncpy(buffer, path, last_slash_pos);
  buffer[last_slash_pos] = '\0';
  return buffer;
}
</pre>
<p>
Instead of modifying its argument, this version of
<code>dirname</code> uses internal storage. This means that it's not
thread safe, and you can't trust its return value to stick around if
you call anything that might possibly also call <code>dirname</code>.
<p>
One more thing: <code>dirname</code> returns a <code>char*</code>
not a <code>const char*</code> but it's not always safe to modify
its return value. For example, glibc does:
<p>
<pre>
char *dirname (char *path) {
  static const char dot[] = ".";
  ...
  /* This assignment is ill-designed
     but the XPG specs require to
     return a string containing "."
     in any case no directory part is
     found and so a static and constant
     string is required. */
  path = (char *) dot;
  return path;
}
</pre>
<p>
This means if you give <code>dirname</code> a slashless string and
pass the output to something that modifies its input, you'll pass
compile-time const checking but you're in for problems at runtime. [3]
<p>
So if you're going to use <code>dirname</code> you have to treat it
as being both thread-unsafe and input-modifying, at which point it's
much easier to use something else that's better specified.
<p>
(Warning: I haven't actually tried running or even compiling these
code samples.)
<p>
<br>
[1] <a name="update-2016-02-12"></a><b>Update 2016-02-12</b>: that code no longer needs anything like
dirname at all because I rewrote it to handle everything with pipes
instead of PID files.
<p>
[2] I've left out the bit where it's supposed to ignore trailing '/'
characters.
<p>
[3] Either changing the return value of <code>dirname</code> for
future calls, or undefined behavior, I'm not sure which.
<p><i>Comment via: <a href="https://plus.google.com/103013777355236494008/posts/Af7Rk5EXMGU">google plus</a>, <a href="https://www.facebook.com/jefftk/posts/770652449192">facebook</a></i>