
How Much Can a Webpage Inflate?

    August 23rd, 2013
    tech  [html]
    When your browser loads a page it's probably using gzip compression. When it asks for http://en.wikipedia.org/wiki/Cabbage it adds a note telling the server that it understands gzip:
      GET /wiki/Cabbage HTTP/1.1
      Host: en.wikipedia.org
      Accept-Encoding: gzip
      ...
    
    Wikipedia's server responds:
      HTTP/1.0 200 OK
      Content-Encoding: gzip
      Content-Length: 37924
      ...
    
    This tells us that the server did decide to use gzip to compress the page, and that after compression it was 37924 bytes. We can verify this manually:
    $ curl -H 'Accept-Encoding: gzip' \
           -s http://en.wikipedia.org/wiki/Cabbage \
           | wc -c
    37924
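
    We can also ask curl to print the response headers and throw away the body, to check the Content-Encoding and Content-Length for ourselves (the exact numbers will have changed since then):
    $ curl -s -D - -o /dev/null \
           -H 'Accept-Encoding: gzip' \
           http://en.wikipedia.org/wiki/Cabbage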
    
    Your browser can't work with gzipped data directly, so it has to expand it first:
    $ curl -H 'Accept-Encoding: gzip' \
           -s http://en.wikipedia.org/wiki/Cabbage \
           | gunzip | wc -c
    157194
    
    Unzipping it we have 157194 bytes, which means gzip managed to make the page 4x smaller. Great! Compression saves you gobs of time, bandwidth, and mobile data charges every year.

    You might wonder, though: how much bigger can a file get when you inflate it? That Wikipedia page got 4x bigger, but could we have a small page that gets 100x bigger? A million times bigger? If someone with a devious attitude crafted a maximally pathological page, how bad could it get?

    To answer this we need to look at what gzip is doing. Like zip, gzip uses the deflate compression algorithm (RFC 1951). It has two pieces: first it condenses duplicate sections, then it encodes symbols with lengths inversely proportional to their frequency. What does that mean?

    Imagine I have "abc a abc abc ab ab a a cab abq". There's a lot of duplication: "abc" is repeated over and over. When "abc " comes up for the second time, however, we can replace it with a note saying "go back six and take four from there". That's what LZ77 does. Another thing I can do is notice that in those 31 characters there are 10 'a's, 9 spaces, 7 'b's, 4 'c's and a 'q'. While normal writing uses just as much space for 'q' as 'a', we can notice that 'a' is being used more often and use a small number of bits to represent it while letting 'q' take up more bits. That's the main idea of Huffman coding.
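
    You can watch the duplicate-condensing step pay off by comparing how gzip handles very repetitive input against random input. The exact byte counts depend on your gzip version, but the gap is dramatic:
    $ yes 'abc' | head -c 100000 | gzip | wc -c
    $ head -c 100000 /dev/urandom | gzip | wc -c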

    The most likely candidate for the worst possible file is one that is all the same character: "aaaaaaaaaaa...". This can be encoded as "go back 1 and copy the maximum number of characters allowed". That requires two symbols: Code 0 for "go back one" and Code 285 for "copy the maximum allowed", which happens to be 258 bytes. [1] Huffman coding two symbols is easy: '0' for one symbol and '1' for the other, so each copy operation costs just two bits. A single byte of "01010101" then encodes four copies and expands to 258*4 = 1032 output bytes, or 1032x expansion. [2]

    With real usage there's overhead, but it should be small compared to the total size. What do we see if we feed this to gzip?

    $ dd if=/dev/zero bs=1000 count=1000 | gzip | wc -c
    
    input bytes         output bytes   expansion potential
    1000        (1K)             29                    34x
    10000       (10K)            45                   222x
    100000      (100K)          132                   758x
    1000000     (1M)           1003                   997x
    10000000    (10M)          9737                  1027x
    100000000   (100M)        97071                  1030x
    1000000000  (1G)         970501                  1030x
    10000000000 (10G)       9704731                  1030x
    We very quickly get close to 1032x, but even with 10G to compress we don't quite get there, so either I'm missing something about the format or the compressor I'm using could be slightly more efficient.
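
    If you want to regenerate the table, a loop along these lines works; the exact output sizes will vary a little with your gzip version and compression level:
    $ for count in 1 10 100 1000 10000; do
          dd if=/dev/zero bs=1000 count=$count 2>/dev/null \
              | gzip | wc -c
      done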

    It's surprising that the limit on the potential size of a gzip bomb turns out to be that the longest match a backreference can copy is 258 bytes. I wonder how much thought went into that limit?

    (If we used a format that supported run length encoding, for example bzip2, then we could have far larger potential for expansion.)
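
    For comparison, assuming you have bzip2 installed, you can rerun the same experiment with it; its run-length encoding front end handles this kind of input much more compactly:
    $ dd if=/dev/zero bs=1000 count=1000 | bzip2 | wc -c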


    [1] You would think we would need a third symbol to represent the initial character to copy, but Code 0 and Code 285 are going to be repeated so many more times that we really want to keep their encodings small. Luckily deflate lets us divide the stream into blocks, each with their own encoding system. We can start with a tiny block that just encodes a single character, and then follow that with one very large block that is optimized for repeating that character over and over in sequences of 258.

    [2] Technically I've only shown a lower bound on the amount of expansion, but after reading the gzip format several times I don't see how you can do better.
