
Repeated HTML Text Is Cheap

February 19th, 2014
tech, html
The Guardian recently moved from guardian.co.uk to www.theguardian.com and wrote up their experiences. Among them:

At this point, however, all the URLs on the site still pointed to www.guardian.co.uk. We attempted to fix this by implementing relative URLs across our site, but a lengthy investigation proved that this would be more difficult than it should have been. Instead, we wrote a filter which detected the HTTP Host header. If the host was www.theguardian.com, we would rewrite all the URLs on the site to be www.theguardian.com. If the Host was www.guardian.co.uk we would rewrite all the URLs on the site to be www.guardian.co.uk. This was a simple configuration change that swapped one domain for another, per request.
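
Their filter presumably lives inside their own serving stack, but the idea is easy to sketch. Here's a minimal hypothetical version in Python, my illustration rather than their code: whatever Host the request came in on, rewrite every absolute url in the page to use that same domain.

    # Hypothetical sketch of a Host-based url-rewriting filter,
    # not the Guardian's actual implementation.
    DOMAINS = ['www.theguardian.com', 'www.guardian.co.uk']

    def rewrite_urls(html, host):
        # Serve every absolute url on whichever of our domains the
        # request came in on, defaulting to the new canonical one.
        if host not in DOMAINS:
            host = DOMAINS[0]
        for domain in DOMAINS:
            html = html.replace('http://' + domain, 'http://' + host)
        return html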

Wait, they're using absolute urls instead of relative? Isn't that inefficient, repeating http://www.theguardian.com for every single link? That's 295 times, counting by splitting each occurrence onto its own line since grep -c counts matching lines, not matches:

    $ curl -s http://www.theguardian.com/us | sed 's~http:~^http:~g' \
       | tr '^' '\n' | grep -c http://www.theguardian.com/
    295

But how much bigger does this actually make their site?

    $ curl -s http://www.theguardian.com/us | wc -c
    222491
    $ curl -s http://www.theguardian.com/us \
       | sed 's~http://www.theguardian.com~~g' | wc -c
    214795
    $ python -c 'print 222491-214795'
    7696
    $ python -c 'print "%.2f%%" % (7696.0 / 214795 * 100)'
    3.58%

So they're using an extra 7.7kB and making their site 3.6% bigger, right? Except almost everyone will be downloading the site with gzip enabled:

    $ curl -s http://www.theguardian.com/us | wc -c
    222491
    $ curl -s -H 'Accept-Encoding: gzip' \
       http://www.theguardian.com/us | wc -c
    33576

In other words, if you request the page without compression it's 222k, but if your browser sends Accept-Encoding: gzip with the request, as any browser you're likely to use does, then it's only 34k. This is equivalent to downloading the page and then gzipping it ourselves:

    $ curl -s http://www.theguardian.com/us | gzip | wc -c
    33576

Now gzip compression does well with simple repeated strings, so how well does it handle these absolute urls? Let's repeat the test from above, this time encoding with gzip before counting bytes:

    $ curl -s http://www.theguardian.com/us | gzip | wc -c
    33576
    $ curl -s http://www.theguardian.com/us \
       | sed 's~http://www.theguardian.com~~g' | gzip | wc -c
    33347
    $ python -c 'print 33576-33347'
    229
    $ python -c 'print "%.2f%%" % (229.0 / 33347 * 100)'
    0.69%
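
This is deflate, the compression algorithm inside gzip, doing exactly what it's designed for: after the first copy of a string, each repetition is encoded as a short back-reference rather than stored again. A quick demonstration of my own, using the 27-byte url repeated 295 times (zlib and gzip share the same deflate core):

    $ python -c 'import zlib; s = "http://www.theguardian.com/" * 295; print len(s), len(zlib.compress(s))'

The 7965 bytes of repetition should compress down to just a few dozen bytes.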
So yes, they could save some bytes by switching to relative urls, but the savings are under 1%.
