• Posts
  • RSS
  • ◂◂RSS
  • Contact

  • Weissman Scores: Useful?

    September 23rd, 2015
    tech  [html]
    A TV show got an information theory professor to make up a metric for evaluating compression algorithms so they could make claims of a breakthrough. This is great—who doesn't like striving for accuracy in media—until people start saying we should be using the metric to evaluate real data compression algorithms. Now we move from the fluffy world of "good enough to be passable on TV" to the far stricter "actually captures what we care about" and, unfortunately, this metric fails the harder test.

    The metric is:

          efficiency_new
        -------------------
        efficiency_baseline
    
    Where efficiency is:
           compression_ratio
         ---------------------
         log(compression_time)
    
    The compression_time is just how long it takes to run the algorithm while the compression_ratio is:
        uncompressed_size
        -----------------
         compressed_size
    
    There are two ideas here: (a) the ratio between the efficiency of a new algorithm and a standard baseline algorithm like gzip is a good way to control for how compressable a corpus is and (b) efficiency captures what matters about a compression algorithm. While (a) seems plausible, (b) is way off. The problem with this metric is that it pays much more attention to the time than the ratio, when we usually care much more about the ratio. So let's look again at how efficiency is defined:
           compression_ratio
         ---------------------
         log(compression_time)
    
    Say we currently can manage 10% compression (ratio = 100/90 = 1.11) and it takes us 16ms. That's
        compression_ratio / log(compression_time)
          = 1.11 / log(16)
          = 0.2775
    
    Now we have two proposals. One brings us to from 10% compression to 55% compression (ratio = 100/45 = 2.22) while the other one drops compression time to 4ms:
       compression_ratio / log(compression_time)
          = 2.22 / log(16)
          = 0.555
    
    vs
        compression_ratio / log(compression_time)
          = 1.11 / log(4)
          = 0.555
    
    These have the same efficiency, but improving compression by 5x will nearly always be better than improving speed by 4x. Take this to the extreme: a compressor that exits immediately leaving its input unchanged has the best possible efficiency score, despite being useless.

    This score is also missing some other things we care a lot about. Many things are compressed once but decompressed millions of times. Consider an image on a popular website. The site owner probably compressed the image once on uploading, while every single site visitor will need to decompress the image before they can view it. So in this sort of case we're often willing to put much more work into compression if it can save a little work each time it's decompressed.

    Another thing that matters a lot for practical adoption of compression algorithms is resource efficiency, primarily memory. When your browser downloads a web page it tells the server Accept-Encoding: gzip and the server will typically gzip-compress its response on the fly. The server tags it Content-Encoding: gzip, and the browser automatically unzips it before using the response. Both the browser and server are processing lots of simultaneous transfers, and each one of those needs to handle compression, so there's a big difference between each needing about ~50K of memory with gzip and needing up to several MB with brotli.

    This also illustrates another important issue with compression: different use cases need different metrics. For on-the-fly compression you don't want to take so much time compressing that this beats the time savings from needing to transfer fewer bytes, while for advance compression small increases in the compression ratio matter a huge amount.

    I'm definitely in favor of TV shows seeking out expert advice, but this metric should probably stay on TV.

    Comment via: google plus, facebook

    Recent posts on blogs I like:

    Pulses (Hoisted from Comments)

    Robert Jackel asked me an excellent question in comments: what is a pulse? I’ve talked about timed transfers a lot in the last almost 10 years of this blog, but I never wrote a precise definition. This is a critical tool for every public transportation op…

    via Pedestrian Observations February 23, 2021

    Collections: The Universal Warrior, Part III: The Cult of the Badass

    This is the third and final part of a discussion (I, IIa, IIb) discussion of the notion that there is a ‘universal warrior’ – a transcendent sameness about either the experience of war or ‘warrior values’ which might provide some sort of useful blueprint …

    via A Collection of Unmitigated Pedantry February 19, 2021

    The Troubling Ethics of Writing (A Speech from Ancient Sumer)

    (Translated from a transcript of an ancient Sumerian speech by Uruk's most well-respected Scriptological Ethicist) Writing is a profoundly dangerous technology: Access to writing was initially, and still remains, uneven. What's worse, the rich are m…

    via BLOG - Cullen O'Keefe February 15, 2021

    more     (via openring)


  • Posts
  • RSS
  • ◂◂RSS
  • Contact