• Posts
  • RSS
  • ◂◂RSS
  • Contact

  • Hit frequencies

    December 2nd, 2016
    tech  [html]
    When load testing server-side software there are two straightforward options:
    1. Stress all endpoints evenly
    2. Stress one endpoint in isolation
    For example, if you're testing a dynamic website with siege you could (1) give it a list of all the URLs on your site and have it rotate among them, or you could (2) pick a representative URL and just hit that one repeatedly. Like any synthetic test, neither is completely realistic, but in this case you're missing something pretty important: if you want to test something that uses caching, you need to approximate the distribution of a real workload.

    Why? How well caching works depends on your hit rate, which depends on what the distribution of your requests looks like. Situation (1) is a caching worst case, (2) is a caching best case, and your real situation is somewhere in the middle. For example, from my access logs over the past couple of years, here's the distribution of requests I've seen:

    1275358 /news.rss
    562369 /
    437312 /favicon.ico
    406583 /index
    168296 /robots.txt
    162728 /p/
    116353 /simple_piano_recordings/piano_chords.svg
    94048 /icdiff
    92680 /p/mercury-spill
    63728 /wsgi/json-comments-cached/gp/i6WrmjSHXLg
    [snip ~1k entries]
    1179 /news/2011-11-02
    1178 /wsgi/json-comments/gp/YTnbXQoRPRN
    1178 /fiddle-clip-on/pictures/11-top.jpg
    [snip ~10k entries]
    113 /news/back_from_564.rss
    113 /news/2012-07-24
    113 /news/2012-07-09.html
    [snip ~100k entries]
    3 /nextbus/omnitrans/82/7387/
    3 /nextbus/art/1135/759763/
    3 /nextbus/mbta/39/6460/
    [snip ~1M entries]
    1 /ngx_pagespeed_beacon?ets=load:906&rload=1605...
    1 /nextbus/jtafla/17/1433/next/
    1 /news/all/trillion-dollar-platinum-coin

    This is kind of a mess, but the main idea is that we have a few "hot" endpoints and then a long tail of less popular entries. I can use this distribution in load testing to get something between uniform sampling (option 1) and singleton sampling (option 2) that better represents what real load on the site would look like.
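
    To make the caching argument concrete, here's a rough simulation of the three kinds of workload against a small LRU cache. Everything in it is an assumption for illustration: the cache is a toy, the URLs are made up, and the long-tail weights (1/k for the k-th most popular URL) are a stand-in rather than the distribution from my logs.

```python
import random
from collections import OrderedDict


def hit_rate(requests, cache_size):
    """Fraction of requests served from a simple LRU cache."""
    cache = OrderedDict()
    hits = 0
    for url in requests:
        if url in cache:
            hits += 1
            cache.move_to_end(url)  # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)


def compare(n_urls=10000, n_requests=100000, cache_size=100, seed=0):
    """Hit rate for rotating, single-URL, and long-tail workloads."""
    rng = random.Random(seed)
    urls = ["/page/%d" % i for i in range(n_urls)]  # hypothetical URLs
    # Made-up long tail: the k-th most popular URL gets weight 1/k.
    weights = [1.0 / (k + 1) for k in range(n_urls)]
    workloads = {
        "rotate": [urls[i % n_urls] for i in range(n_requests)],  # option 1
        "single": [urls[0]] * n_requests,                         # option 2
        "long_tail": rng.choices(urls, weights=weights, k=n_requests),
    }
    return {name: hit_rate(reqs, cache_size)
            for name, reqs in workloads.items()}


if __name__ == "__main__":
    for name, rate in compare().items():
        print("%-10s %.3f" % (name, rate))
```

    When there are more URLs than cache slots, rotating through them never hits this cache, hitting a single URL misses only once, and the long-tail workload lands somewhere in between. Where exactly depends on the cache size and on how skewed the real distribution is.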

    In case this is useful to other people, here's the frequency list (hits-frequency.txt.gz) and a short script to sample from it (generate_urls.py).
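
    The linked script isn't reproduced here, but a minimal sketch of the idea might look like the following. It assumes the frequency list has one "<count> <path>" entry per line, like the excerpt above. The function names and the base URL are hypothetical, picked just for this example.

```python
import random


def parse_frequencies(lines):
    """Parse lines of the form '<count> <path>' into (paths, counts)."""
    paths, counts = [], []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines and markers like "[snip ~1k entries]"
        counts.append(int(parts[0]))
        paths.append(parts[1].strip())
    return paths, counts


def sample_urls(paths, counts, n, base="http://localhost:8080", seed=None):
    """Draw n URLs, each path chosen in proportion to its hit count."""
    rng = random.Random(seed)
    return [base + p for p in rng.choices(paths, weights=counts, k=n)]


if __name__ == "__main__":
    # A few lines from the listing above, as example input.
    sample = """1275358 /news.rss
562369 /
437312 /favicon.ico
[snip ~1k entries]
1179 /news/2011-11-02"""
    paths, counts = parse_frequencies(sample.splitlines())
    for url in sample_urls(paths, counts, n=5, seed=1):
        print(url)
```

    Written out one URL per line, the result can be given to siege as a URL file with its -f option. Since the sampling already happened when the file was generated, siege can just walk the list in order.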
