
  • Two Distribution Graphs

    April 30th, 2016
    math, stats
    At last year's big EA conference, Tyler Alterman gave a talk on "succeeding at EA Global" (video). He opened with an argument that people should be open to changing their minds about what they should be doing. He described two worlds: a "Gaussiana" where outcomes are approximately Gaussian, and a "Paretonia" where they follow a Pareto distribution. Citing Toby Ord's arguments in The Moral Imperative Towards Cost-Effectiveness (pdf) that a Pareto distribution is closer to what we see [1], and noting that in a Pareto distribution the top outcomes are far greater than average outcomes, he argued that people should give a lot of thought to whether they could be working on something higher impact.

    He showed two graphs to illustrate this:

    These two graphs are how people commonly visualize these two distributions, but they're not equivalent graphs. The Pareto graph's axis labels are correct [2], but on the Gaussian graph the x-axis should be labeled "impact in expectation" and the y-axis should be labeled something like "probability". So what are these different ways of looking at distributions?

    The Pareto graph above is the kind of graph where you line up all your samples from lowest to highest. These are great graphs for non-statisticians trying to understand this sort of data, and are a discrete (and sideways) kind of CDF: swap the axes and you have the empirical cumulative distribution. The Gaussian graph is a histogram (the bars) plus a probability density function (the line).
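As a rough sketch of the two views (this is not the code behind the graphs in this post; the names `sorted_view` and `histogram_view` are made up for illustration, and it sticks to the standard library):

```python
import random

def sorted_view(samples):
    """Line the samples up from lowest to highest.

    Bar i has a height equal to the i-th smallest sample,
    so height means "how good is this outcome".
    """
    return sorted(samples)

def histogram_view(samples, bins=10):
    """Count samples in equal-width bins.

    Returns (left_edge, count) pairs, so height means
    "how frequent is an outcome about this good".
    """
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all samples are equal
    counts = [0] * bins
    for x in samples:
        i = min(int((x - lo) / width), bins - 1)  # put the max in the last bin
        counts[i] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]

random.seed(0)  # arbitrary seed, only so the sketch is repeatable
samples = [abs(random.gauss(0, 1)) for _ in range(1000)]
ranked = sorted_view(samples)
hist = histogram_view(samples)
```

The same list of samples feeds both functions. Only the meaning of the bar height changes: a value in `ranked`, a count in `hist`.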

    What this means is that on the Paretonia graph the bar heights represent how good the outcome is, while on the Gaussiana graph the bar heights represent how frequent an outcome that good is. When he showed what moving from a random plan to the best outcome looked like on the Gaussian graph, he was actually showing what it looks like to go from a random outcome to the most common outcome.

    Let's generate some synthetic data, one Pareto and one Gaussian [3], and see what that looks like:

    This graph shows all the samples, from lowest to highest, like the Paretonia graph above. You can see that these are very different distributions, but they have similar basic shapes: lots of lower samples, a few higher ones. Zooming in on the bottom half of the graph by hiding samples >5 (the 13/1000 highest Pareto samples) we can see this more clearly:
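For a sense of how rare samples >5 should be, here is a sketch of the theoretical tail probabilities. It assumes a classical Pareto (minimum value `scale`) and a half-normal that both have mean 1, with alpha = 1.161 as in footnote [3]; the function names are illustrative.

```python
from math import erfc, pi, sqrt

ALPHA = 1.161  # 80-20 Pareto shape, taken from footnote [3]
MEAN = 1.0

def pareto_tail(x, alpha=ALPHA, mean=MEAN):
    """P(X > x) for a classical Pareto with the given mean."""
    scale = mean * (alpha - 1) / alpha  # smallest possible sample
    if x <= scale:
        return 1.0
    return (scale / x) ** alpha

def half_normal_tail(x, mean=MEAN):
    """P(|Z| > x) where Z ~ Normal(0, sigma) and E|Z| = mean."""
    if x <= 0:
        return 1.0
    sigma = mean * sqrt(pi) / sqrt(2)
    # 2 * (1 - Phi(x / sigma)) written with the complementary error function
    return erfc(x / (sigma * sqrt(2)))

p_pareto = pareto_tail(5)
p_gauss = half_normal_tail(5)
```

Under these assumptions `p_pareto` comes out around 1.5%, in line with the 13/1000 seen in the samples, while `p_gauss` is below one in ten thousand. That is why hiding samples >5 removes only Pareto samples.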

    The main difference is that the Pareto distribution is much more uneven than the Gaussian one: its very highest samples are so much bigger than the rest. This is where the histogram view is helpful:


    This shows that in the Gaussian distribution there is a wide range of samples, and only a small number that are much bigger than the rest, while in the Pareto distribution nearly all the samples are concentrated in a narrow band, with just a few samples much higher than the rest. This is even true if we make a histogram excluding the 13/1000 samples >5:

    Overall, these two kinds of graph are both very useful ways of visualizing a set of samples, and I typically generate and examine both when trying to understand a dataset.

    (His point was still correct, however: in a Pareto universe it really is much more valuable to identify the very best options.)
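One way to put numbers on this is to compare quantiles. This is a sketch under the same assumptions as footnote [3] (classical Pareto with alpha = 1.161 and a half-normal, both with mean 1); the helper names are illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

def pareto_quantile(q, alpha=1.161, mean=1.0):
    """Value below which a fraction q of Pareto outcomes fall (0 <= q < 1)."""
    scale = mean * (alpha - 1) / alpha
    return scale * (1 - q) ** (-1 / alpha)

def half_normal_quantile(q, mean=1.0):
    """Value below which a fraction q of half-normal outcomes fall (0 <= q < 1)."""
    sigma = mean * sqrt(pi) / sqrt(2)
    # P(|Z| <= z) = q  is the same as  Phi(z / sigma) = (1 + q) / 2
    return sigma * NormalDist().inv_cdf((1 + q) / 2)

for q in (0.5, 0.9, 0.999):
    print(q, pareto_quantile(q), half_normal_quantile(q))
```

With the same mean, the median Pareto outcome is lower than the median half-normal outcome, but the one-in-a-thousand Pareto outcome is more than ten times larger than its half-normal counterpart.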


    [1] I also talked about Ord's argument in my post on the unintuitive power laws of giving, and after the conference there was more discussion of it, in which Hanson thought there was more disagreement than there actually was.

    [2] Though the y-axis does look clipped.

    [3] Specifically, I'm going to compare a half-normal distribution (the positive half of a Gaussian distribution centered on 0) with an 80-20 Pareto distribution (alpha = 1.161), where both have a mean of 1. Here's the code I used:

    import numpy as np
    from math import sqrt, pi
    
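    def sample_pareto(mean=1,
                      alpha=1.161,
                      samples=1000):
      # mean is
      #  (alpha * scale) / (alpha - 1)
      # which means:
      scale = mean * (alpha - 1) / alpha
    
      # np.random.pareto draws from the
      # Lomax (Pareto II) distribution,
      # which starts at 0. Adding 1 before
      # scaling gives a classical Pareto
      # that starts at scale, matching the
      # mean formula above.
      return (np.random.pareto(
        alpha, size=samples) + 1) * scale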
    
    def sample_gaussian(mean=1,
                        samples=1000):
      # mean is sigma * sqrt(2) / sqrt(pi)
      # which means:
      sigma = mean * sqrt(pi) / sqrt(2)
    
      return [abs(x)
              for x in np.random.normal(
                  loc=0.0,
                  scale=sigma,
                  size=samples)]
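As a sanity check on the alpha value, here is a sketch of where 1.161 comes from. It relies on the standard result that for a classical Pareto with shape alpha > 1, the top fraction p of outcomes holds p ** (1 - 1/alpha) of the total; `top_share` is a name invented here.

```python
from math import log

def top_share(p, alpha):
    """Share of the total held by the top fraction p of a Pareto (alpha > 1)."""
    return p ** (1 - 1 / alpha)

# Solving top_share(0.2, alpha) == 0.8 for alpha gives log(5) / log(4).
alpha_80_20 = log(5) / log(4)
share = top_share(0.2, alpha_80_20)
```

Here `alpha_80_20` rounds to 1.161 and `share` is 0.8 up to floating-point error, which is the "80-20" in the name.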
    
    Log normal could also make sense to compare here, but it's pretty similar to Pareto and I'm feeling done writing.

    To see the samples I generated, have a look at my scratch sheet.
