• Posts
  • RSS
  • ◂◂RSS
  • Contact

  • Sharding the Brigade

    November 13th, 2020
    audio, singing, tech  [html]
    The Secular Solstice is planning on using the bucket brigade singing app that some friends and I have been building. While the events we've hosted so far have been something like 20 people, this might be 500 or more. I've spent some time over the last couple weeks figuring out how to scale it, and I think it's in a good place now.

    I started, as one always should, with profiling. By far the biggest amount of time was being spent in Opus encoding and decoding. This is what lets us send audio over the internet at decent quality without using an enormous amount of bandwidth, so it's not optional. We're also already doing it in C (libopus with Python bindings) so this isn't a case where we can get large speedups by moving to a lower-level language.

    The first question was, if we pulled off encoding and decoding into other processes, could we keep everything else in the same process? The app is built around a shared buffer, that everyone is constantly reading and writing from at various offsets, and it's a lot nicer if that can stay in one process. Additionally, there is a bunch of shared state that represents things like "are we currently in a song and when did it start?" or "who's leading?" that, while potentially separable, would be easier to keep together.

    I split the code into a "outer" portion that implemented decoding and encoding, and a "inner" portion that handled everything else. Running a benchmark [1], I got the good news that the inner portion was fast enough to stay all in one process, even with a very large number of clients:

    $ python3 unit-stress.py
    2.45ms each; est 245 clients
    2.44ms each; est 246 clients
    2.46ms each; est 243 clients
    $ python3 unit-stress.py inner
    0.05ms each; est 11005 clients
    0.05ms each; est 11283 clients
    0.05ms each; est 11257 clients

    Since encoding and decoding are stateful, I decided that those should run in long-lived processes. Each user can always talk to just one of these processes, and it will always have the appropriate state. This means we don't have to do any locking, or any moving the state between CPUs. I don't know of a great way to manage many sharded processes like this, but since we only need about eight of them we can do it manually:

    location /echo/api/01 {
      include uwsgi_params;
    location /echo/api/02 {
      include uwsgi_params;

    This also meant creating echo-01.service, echo-02.service, etc. to listen on 7101, 7102, etc.

    Once we have our codec processes running, we need to way for them all to interact with global state. After playing around with a bunch of ideas, I decided on each codec process (client) having a shared memory area which is also open in a singleton global process (server). The client can make blocking RPCs to the server, and because it's so fast that's not a problem that it's blocking.

    I decided on a buffer layout of:

    1 byte: status (whose turn)
    2 bytes: json length
    N bytes: json
    4 bytes: data length
    N bytes: data
    To make an RPC, the client fills the buffer and, as a final step, updates the status byte to tell the server to take its turn. The server is constantly looping over all of the shared memory buffers, and when it sees one that is ready for processing it decodes it, updates the global state, writes its response, and updates the status byte to tell the client the response is ready.

    The protocol is the same in both directions: arbitrary JSON (10kB max), then audio data (200kB max). This means that when we want to pass a new value through shared memory we don't need to update the protocol, but it does mean the server has a bit more processing to do on every request to decode / encode JSON. This is a fine trade-off, since the only part of the system that is too slow is dealing with the audio codec.

    I set up a reasonably realistic benchmark, sending HTTP requests directly to uWSGI servers (start_stress_servers.sh) from python (stress.py). I needed to use two other computers to send the requests, since running it on the same machine was enough to hurt the server's performance, and one additional machine was not able to push the server to its limit.

    Initially I ran into a problem where we are sending a user summary which, when the number of users get sufficiently high, uses more than our total available space for JSON in the shared-memory protocol. We are sending this for a portion of the UI that really doesn't make sense for a group this large, so I turned it off for the rest of my testing.

    With no sharding I measure a maximum of ~216 simulated clients, while with sharding I get ~1090.

    Looking at CPU usage, the server process (python3) is at 73%, so still some headroom:

    While it would be possible to make various parts more efficient and get even larger speed ups, I think this is will be sufficient for Solstice as long as we run it on a sufficiently parallel machine.

    [1] All benchmarks in this post taken on the same 6-processor 12-thread Intel i7-8700 CPU @ 3.20GHz running Linux, courtesy of the Swarthmore CS department.

    Comment via: facebook, lesswrong

    Recent posts on blogs I like:

    Governance in Rich Liberal American Cities

    Matt Yglesias has a blog post called Make Blue America Great Again, about governance in rich liberal states like New York and California. He talks about various good government issues, and he pays a lot of attention specifically to TransitMatters and our …

    via Pedestrian Observations November 19, 2020

    Collections: Why Military History?

    This week, I want to talk about the discipline of military history: what it is, why it is important and how I see my own place within it. This is going to be a bit of an unusual collections post as it is less about the past itself and more about how we st…

    via A Collection of Unmitigated Pedantry November 13, 2020

    Misalignment and misuse: whose values are manifest?

    Crossposted from world spirit sock puppet. AI related disasters are often categorized as involving misaligned AI, or misuse, or accident. Where: misuse means the bad outcomes were wanted by the people involved, misalignment means the bad outcomes were wan…

    via Meteuphoric November 13, 2020

    more     (via openring)

  • Posts
  • RSS
  • ◂◂RSS
  • Contact