::  Posts  ::  RSS  ::  ◂◂RSS  ::  Contact

Hit frequencies

December 2nd, 2016
tech
When load testing server-side software there are two straightforward options:
  1. Stress all endpoints evenly
  2. Stress one endpoint in isolation
For example, if you're testing a dynamic website with siege you could (1) give it a list of all the URLs on your site and have it rotate among them, or you could (2) pick a representative URL and just hit that one repeatedly. Like any synthetic test this isn't completely realistic, but in this case you're missing something pretty important: if you want to test something that uses caching, you need to approximate the distribution of a real workload.

Why? How well caching works depends on your hit rate, which depends on what the distribution of your requests looks like. Situation (1) is a caching worst case, (2) is a caching best case, and your real situation is somewhere in the middle. For example, from my access logs over the past couple of years, here's the distribution of requests I've seen:

1275358 /news.rss
562369 /
437312 /favicon.ico
406583 /index
168296 /robots.txt
162728 /p/
116353 /simple_piano_recordings/piano_chords.svg
94048 /icdiff
92680 /p/mercury-spill
63728 /wsgi/json-comments-cached/gp/i6WrmjSHXLg
[snip ~1k entries]
1179 /news/2011-11-02
1178 /wsgi/json-comments/gp/YTnbXQoRPRN
1178 /fiddle-clip-on/pictures/11-top.jpg
[snip ~10k entries]
113 /news/back_from_564.rss
113 /news/2012-07-24
113 /news/2012-07-09.html
[snip ~100k entries]
3 /nextbus/omnitrans/82/7387/
3 /nextbus/art/1135/759763/
3 /nextbus/mbta/39/6460/
[snip ~1M entries]
1 /ngx_pagespeed_beacon?ets=load:906&rload=1605...
1 /nextbus/jtafla/17/1433/next/
1 /news/all/trillion-dollar-platinum-coin

This is kind of a mess, but the main idea is that we have a few "hot" endpoints, and then a long tail of less popular entries. I can use this distribution in load testing, to get something between uniform sampling (option 1) and singleton sampling (option 2) that better represents what real load on the site would look like.
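
A toy simulation makes the gap between the two options concrete. This is only a sketch under made-up assumptions: the `/page/N` URLs, the 1/rank popularity weights, the cache size of 50, and the LRU eviction policy are all hypothetical stand-ins, not measurements from a real server.

```python
import random
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    """Replay requests through an LRU cache and return the fraction that hit."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for url in requests:
        total += 1
        if url in cache:
            hits += 1
            cache.move_to_end(url)          # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used
    return hits / total if total else 0.0

# Hypothetical site: 1000 URLs, popularity falling off as 1/rank.
urls = ["/page/%d" % i for i in range(1000)]
weights = [1.0 / (rank + 1) for rank in range(len(urls))]

rng = random.Random(0)   # fixed seed so runs are repeatable
n = 20000                # requests per workload
capacity = 50            # cache holds far fewer entries than there are URLs

workloads = {
    "rotate all urls (1)": [urls[i % len(urls)] for i in range(n)],
    "single url (2)": [urls[0]] * n,
    "weighted sample": rng.choices(urls, weights=weights, k=n),
}

results = {name: lru_hit_rate(reqs, capacity) for name, reqs in workloads.items()}
for name, rate in results.items():
    print("%-20s %.3f" % (name, rate))
```

With these numbers the rotation never hits, because each URL is evicted long before it comes around again, while the single URL misses only on its first request. The weighted sample lands in between, and where exactly it lands depends on how skewed the weights are and how big the cache is.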

In case this is useful to other people, here's the frequency list (hits-frequency.txt.gz) and a short script to sample from it (generate_urls.py).
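
The linked script isn't reproduced here, but the general idea can be sketched in a few lines. What follows is a minimal illustration, not necessarily what generate_urls.py does: it assumes each line of the list looks like `count path` (as in the excerpt above), and the helper names `parse_frequencies` and `sample_urls` and the `http://localhost:8080` base are placeholders.

```python
import random

def parse_frequencies(lines):
    """Parse 'count path' lines into parallel lists of paths and counts."""
    paths, counts = [], []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines and anything not shaped like 'count path'
        counts.append(int(parts[0]))
        paths.append(parts[1].strip())
    return paths, counts

def sample_urls(paths, counts, k, base="http://localhost:8080", rng=random):
    """Draw k URLs, each path chosen with probability proportional to its count."""
    return [base + path for path in rng.choices(paths, weights=counts, k=k)]

# A few lines copied from the excerpt in this post. For the full list,
# gzip.open("hits-frequency.txt.gz", "rt") could supply the lines instead
# (assuming the file uses the same 'count path' layout).
EXCERPT = """\
1275358 /news.rss
562369 /
437312 /favicon.ico
406583 /index
168296 /robots.txt
3 /nextbus/mbta/39/6460/
1 /news/all/trillion-dollar-platinum-coin
"""

paths, counts = parse_frequencies(EXCERPT.splitlines())
sampled = sample_urls(paths, counts, k=10, rng=random.Random(0))
for url in sampled:
    print(url)
```

Writing the output to a file, one URL per line, gives you something you can hand to siege with `-f`. Sampling with replacement means the hot URLs show up many times in the file, which is the point.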


