February 19th, 2014
The Guardian moved from www.guardian.co.uk to www.theguardian.com and wrote up their experiences. Among them:
At this point, however, all the URLs on the site still pointed to www.guardian.co.uk. We attempted to fix this by implementing relative URLs across our site, but a lengthy investigation proved that this would be more difficult than it should have been. Instead, we wrote a filter which detected the HTTP Host header. If the host was www.theguardian.com, we would rewrite all the URLs on the site to be www.theguardian.com. If the host was www.guardian.co.uk, we would rewrite all the URLs on the site to be www.guardian.co.uk. This was a simple configuration change that swapped one domain for another, per request.
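Their post doesn't include the filter itself, but the idea is simple enough to sketch. Here's a minimal, hypothetical version in Python (the names and structure are mine, not the Guardian's): whatever host the request arrived on, point every absolute link in the response at that same host.

# Hypothetical sketch of a Host-based rewrite filter, not the
# Guardian's actual code.  Assumes the response body is HTML text
# containing absolute http:// links.
DOMAINS = ("www.theguardian.com", "www.guardian.co.uk")

def rewrite_links(host, body):
    # Whichever host the request came in on, make every absolute
    # link in the page use that same host.
    if host not in DOMAINS:
        return body
    for other in DOMAINS:
        if other != host:
            body = body.replace("http://" + other, "http://" + host)
    return body

On a www.theguardian.com request this swaps every link to theguardian.com; on a www.guardian.co.uk request it swaps the other way, which matches their "one domain for another, per request" description.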
Wait, they're using absolute urls instead of relative? Isn't that an extra http://www.theguardian.com for every single link? That's 295 times:
$ curl -s http://www.theguardian.com/us | sed 's~http:~^http:~g' \
    | tr '^' '\n' | grep -c http://www.theguardian.com/
295
But how much bigger is this making their site, after all?
$ curl -s http://www.theguardian.com/us | wc -c
222491
$ curl -s http://www.theguardian.com/us \
    | sed s'~http://www.theguardian.com~~' | wc -c
214795
$ python -c 'print 222491-214795'
7696
$ python -c 'print "%.2f%%" % (7696.0 / 214795 * 100)'
3.58%

So they're using an extra 7.7kB and making their site 3.6% bigger, right? Except almost everyone will be downloading the site with gzip enabled:
$ curl -s http://www.theguardian.com/us | wc -c
222491
$ curl -s -H 'Accept-Encoding: gzip' \
    http://www.theguardian.com/us | wc -c
33576

In other words, if you request the page simply it's 222k, but if your browser sends Accept-Encoding: gzip with the request, and any browser you're likely to use does this, then it's only 34k. This is equivalent to downloading the page and then gzipping it ourselves:
$ curl -s http://www.theguardian.com/us | gzip | wc -c
33576
Now gzip compression does well with simple repeated strings, so how well does it handle these absolute urls? Let's repeat the test from above, this time encoding with gzip before counting bytes:
$ curl -s http://www.theguardian.com/us | gzip | wc -c
33576
$ curl -s http://www.theguardian.com/us \
    | sed s'~http://www.theguardian.com~~' | gzip | wc -c
33347
$ python -c 'print 33576-33347'
229
$ python -c 'print "%.2f%%" % (229.0 / 33347 * 100)'
0.69%

So yes, they could save some bytes by switching to relative urls, but the savings are under 1%.
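That under-1% figure makes sense once you remember how well gzip handles repetition. As a rough sanity check, here's the same idea using Python's zlib, which uses the same DEFLATE compression gzip does; this is just an illustration, not part of the Guardian's pipeline. A link repeated 295 times compresses down to almost nothing:

import zlib

# The hostname prefix from every link, repeated as often as it appears
# on the page: 27 bytes * 295 = 7,965 bytes uncompressed.
repeated = "http://www.theguardian.com/" * 295
print(len(repeated))                          # 7965
print(len(zlib.compress(repeated.encode())))  # only a few dozen bytes

Once the compressor has seen the hostname once, every later copy costs a couple of bytes at most, which is why the absolute URLs nearly disappear over the wire.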