• Posts
  • RSS
  • ◂◂RSS
  • Contact

  • Repeated HTML Text Is Cheap

    February 19th, 2014
    tech  [html]
    The Guardian recently moved from guardian.co.uk to www.theguardian.com and wrote up their experiences. Among them:

    At this point, however, all the URLs on the site still pointed to www.guardian.co.uk. We attempted to fix this by implementing relative URLs across our site, but a lengthy investigation proved that this would be more difficult than it should have been. Instead, we wrote a filter which detected the HTTP Host header. If the host was www.theguardian.com, we would rewrite all the URLs on the site to be www.theguardian.com. If the Host was www.guardian.co.uk we would rewrite all the URLs on the site to be www.guardian.co.uk. This was a simple configuration change that swapped one domain for another, per request.
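
The filter described in that quote might look roughly like the sketch below. This is not the Guardian's code: the function name `rewrite_host`, the `KNOWN_HOSTS` tuple, and the regex approach are all assumptions made for illustration, using only the two hostnames mentioned in the quote.

```python
import re

# The two hosts named in the quote; everything else here is hypothetical.
KNOWN_HOSTS = ("www.guardian.co.uk", "www.theguardian.com")

# Match an absolute http(s) URL prefix for either known host, but not a
# longer hostname that merely starts with one of them.
_HOST_RE = re.compile(
    r"(https?://)(?:" + "|".join(re.escape(h) for h in KNOWN_HOSTS) + r")(?![\w.-])"
)

def rewrite_host(html, request_host):
    """Point every same-site absolute URL at the host this request used."""
    if request_host not in KNOWN_HOSTS:
        return html  # unknown Host header: leave the page untouched
    return _HOST_RE.sub(lambda m: m.group(1) + request_host, html)

page = '<a href="http://www.guardian.co.uk/us">US</a> <a href="http://example.com/">x</a>'
print(rewrite_host(page, "www.theguardian.com"))
```

Links to other sites pass through unchanged, and the same function works in either direction depending on the Host header it is given.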

    Wait, they're using absolute urls instead of relative? Isn't that inefficient, repeating http://www.theguardian.com for every single link? That's 295 times:

        $ curl -s http://www.theguardian.com/us | sed 's~http:~^http:~g' \
           | tr '^' '\n' | grep -c http://www.theguardian.com/
        295
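
The pipeline above puts each `http:` at the start of its own line and then counts the lines that contain the Guardian prefix, which amounts to counting occurrences of that prefix. A rough Python equivalent, run here on a small made-up sample rather than the real page (`count_occurrences` and `sample` are names invented for this sketch):

```python
def count_occurrences(html, needle="http://www.theguardian.com/"):
    """Count non-overlapping occurrences of needle in html."""
    return html.count(needle)

# A made-up fragment: two same-site links and one external link.
sample = (
    '<a href="http://www.theguardian.com/us">US</a>'
    '<a href="http://www.theguardian.com/world">World</a>'
    '<a href="http://example.com/">elsewhere</a>'
)
print(count_occurrences(sample))  # → 2
```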
    

    But how much bigger is this making their site, after all?

        $ curl -s http://www.theguardian.com/us | wc -c
        222491
        $ curl -s http://www.theguardian.com/us \
           | sed 's~http://www.theguardian.com~~g' | wc -c
        214795
        $ python -c 'print(222491-214795)'
        7696
        $ python -c 'print("%.2f%%" % (7696.0 / 214795 * 100))'
        3.58%
    
    So they're using an extra 7.7kB and making their site 3.6% bigger, right? Except almost everyone will be downloading the site with gzip enabled:

        $ curl -s http://www.theguardian.com/us | wc -c
        222491
        $ curl -s -H 'Accept-Encoding: gzip' \
           http://www.theguardian.com/us | wc -c
        33576
    
    In other words, a plain request for the page returns 222kB, but if your browser sends Accept-Encoding: gzip with the request, as any browser you're likely to use does, then it's only 34kB. This is equivalent to downloading the page and then gzipping it ourselves:

        $ curl -s http://www.theguardian.com/us | gzip | wc -c
        33576
    

    Now gzip compression does well with simple repeated strings, so how well does it handle these absolute urls? Let's repeat the test from above, this time encoding with gzip before counting bytes:

        $ curl -s http://www.theguardian.com/us | gzip | wc -c
        33576
        $ curl -s http://www.theguardian.com/us \
           | sed 's~http://www.theguardian.com~~g' | gzip | wc -c
        33347
        $ python -c 'print(33576-33347)'
        229
        $ python -c 'print("%.2f%%" % (229.0 / 33347 * 100))'
        0.69%
    
    So yes, they could save some bytes by switching to relative urls, but the savings are under 1%.
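
The same effect can be reproduced offline. The sketch below builds a toy page with 295 links, once with the absolute prefix and once without, and compares raw and gzipped sizes. The markup is invented and far more repetitive than a real page, so the byte counts will not match the measurements above; `build_page` and `sizes` are names made up for this example.

```python
import gzip

HOST = "http://www.theguardian.com"

def build_page(prefix, n_links=295):
    """Build a toy page with n_links links, each href starting with prefix."""
    links = [
        '<li><a href="%s/section-%d/story-%d">Story %d</a></li>' % (prefix, i % 7, i, i)
        for i in range(n_links)
    ]
    body = "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"
    return body.encode("utf-8")

def sizes(page):
    """Return (raw byte count, gzipped byte count) for a page."""
    return len(page), len(gzip.compress(page))

abs_raw, abs_gz = sizes(build_page(HOST))  # absolute urls
rel_raw, rel_gz = sizes(build_page(""))    # relative urls

print("raw:  absolute=%d relative=%d extra=%d" % (abs_raw, rel_raw, abs_raw - rel_raw))
print("gzip: absolute=%d relative=%d extra=%d" % (abs_gz, rel_gz, abs_gz - rel_gz))
```

The raw difference is exactly 295 copies of the prefix, while the gzipped difference should be a small fraction of that, because DEFLATE can encode each repeat of the prefix as a short back-reference to an earlier copy instead of storing it again.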



