ngx_pagespeed: how does it work?

April 29th, 2013
ngx_pagespeed, tech
I've been very busy the past few days getting ngx_pagespeed out in beta, and then even busier with the flood of interest. I wrote a post for the Google Developers blog, TechCrunch wrote it up, and it hit #1 on HN.

So, how does it work? First, let's look at what happens normally when you just have the browser and the server, without PageSpeed. Imagine we have a page like this:

  /index.html
    -> navbar.js
    -> site.css
    -> cat.jpg
We'll have:
  1. The browser requests index.html.
  2. The server reads /var/www/index.html from disk and sends it out.
  3. The browser parses the html, learns about navbar.js, site.css, and cat.jpg, and sends requests for them.
  4. The server reads each of them from disk and sends them out.
This is simple in many ways: we're not proxying, generating dynamic content, requesting resources or html asynchronously with javascript, nesting images or more css inside css, or caching anything. But it's still complex enough to be interesting.

Let's add PageSpeed to the picture:

  1. The browser requests index.html.
  2. The server reads /var/www/index.html from disk.
  3. The response passes through PageSpeed on the way out, giving an opportunity for optimization.
  4. PageSpeed sees references in index.html to navbar.js, site.css, and cat.jpg, but doesn't immediately know their contents. To find out it requests them from the server.
  5. The fetches would take too long to block the response on, so PageSpeed lets them continue in the background and sends out index.html without optimizing the resources.
  6. The browser parses the html, learns about navbar.js, site.css, and cat.jpg, and sends requests for them.
  7. The server reads each of them from disk and sends them out.
What's this? PageSpeed hasn't done anything useful, just given the server more work to do. But look at the second request for the page:
  1. The browser requests index.html.
  2. The server reads /var/www/index.html from disk.
  3. The response passes through PageSpeed on the way out, giving an opportunity for optimization.
  4. PageSpeed sees references in index.html to navbar.js, site.css, and cat.jpg, and this time knows what they contain because the fetches from before had time to complete.
  5. PageSpeed sees that navbar.js is only a few lines. At that size it's probably not worth it to force the browser to make another round trip just to retrieve it, so PageSpeed inlines it. The css and image are large enough that inlining doesn't make sense, partly because inlining keeps caching from working, so for those it just wants to send optimized versions.
  6. PageSpeed sends out index.html with some substitutions. Aside from inlining navbar.js, it replaces site.css with A.site.css.pagespeed.cf.KM5K8SbHQL.css and cat.jpg with 256x192xcat.jpg.pagespeed.ic.AOSDvKNItv.jpg. These longer urls contain a hash of the contents, which means it's safe to serve them with a very long cache lifetime: when the content changes they'll get a different hash, and so a different url. (There's a toy sketch of this naming idea after this list.)
  7. The browser parses the html, learns about A.site.css.pagespeed.cf.KM5K8SbHQL.css and 256x192xcat.jpg.pagespeed.ic.AOSDvKNItv.jpg, and requests them from the server.
  8. PageSpeed handles these requests and sends their contents out.
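To make the hashing idea concrete, here's a toy standalone C program. It is not PageSpeed's actual hashing, encoding, or url format, just an illustration that the name is derived from the optimized bytes, so new content always gets a new url and the old url can safely be cached for a very long time:

  /* Toy illustration of content-hashed naming; not PageSpeed's real code. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  static uint64_t toy_hash(const char *data, size_t len) {
      uint64_t h = 1469598103934665603ULL;        /* FNV-1a as a stand-in */
      for (size_t i = 0; i < len; i++) {
          h = (h ^ (unsigned char) data[i]) * 1099511628211ULL;
      }
      return h;
  }

  int main(void) {
      const char *css = "body{margin:0;padding:0}";  /* pretend optimized site.css */
      char url[128];

      snprintf(url, sizeof url, "A.site.css.pagespeed.cf.%016llx.css",
               (unsigned long long) toy_hash(css, strlen(css)));
      printf("%s\n", url);   /* changes whenever the css content changes */
      return 0;
  }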
So how does PageSpeed integrate with nginx to make these changes? There are two places it needs to run:
  • PageSpeed needs to intercept outgoing html and rewrite it.
  • PageSpeed needs to respond to requests for rewritten resources like A.site.css.pagespeed.cf.KM5K8SbHQL.css.
It does this with a body filter and a content handler. The body filter intercepts all outgoing responses, but doesn't make any changes to ones that aren't html. The content handler intercepts all incoming requests, but declines to handle them unless they're for .pagespeed. rewritten resources. What does this look like?
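In nginx-module terms, hooking into those two places looks roughly like this. This is a simplified sketch rather than the actual ngx_pagespeed source, and the content-type and url checks are crude stand-ins, but the registration pattern (chaining onto ngx_http_top_body_filter, and adding a content-phase handler that returns NGX_DECLINED for requests it doesn't want) is the standard nginx one:

  #include <ngx_config.h>
  #include <ngx_core.h>
  #include <ngx_http.h>

  static ngx_http_output_body_filter_pt  ngx_http_next_body_filter;

  /* Body filter: sees every outgoing response; passes non-html through. */
  static ngx_int_t
  ps_body_filter(ngx_http_request_t *r, ngx_chain_t *in)
  {
      if (r->headers_out.content_type.len < sizeof("text/html") - 1
          || ngx_strncasecmp(r->headers_out.content_type.data,
                             (u_char *) "text/html",
                             sizeof("text/html") - 1) != 0)
      {
          return ngx_http_next_body_filter(r, in);
      }

      /* ...hand these buffers to the optimizer and send its output instead... */
      return ngx_http_next_body_filter(r, in);
  }

  /* Content handler: sees every request; only claims .pagespeed. resources. */
  static ngx_int_t
  ps_content_handler(ngx_http_request_t *r)
  {
      if (ngx_strnstr(r->uri.data, ".pagespeed.", r->uri.len) == NULL) {
          return NGX_DECLINED;   /* not ours; let nginx try other handlers */
      }

      /* ...reconstruct or fetch the optimized resource and send it... */
      return NGX_DONE;
  }

  /* Registration, from the module's postconfiguration hook: chain onto the
   * head of the body filter list and add a content-phase handler. */
  static ngx_int_t
  ps_init(ngx_conf_t *cf)
  {
      ngx_http_handler_pt        *h;
      ngx_http_core_main_conf_t  *cmcf;

      ngx_http_next_body_filter = ngx_http_top_body_filter;
      ngx_http_top_body_filter = ps_body_filter;

      cmcf = ngx_http_conf_get_module_main_conf(cf, ngx_http_core_module);
      h = ngx_array_push(&cmcf->phases[NGX_HTTP_CONTENT_PHASE].handlers);
      if (h == NULL) {
          return NGX_ERROR;
      }
      *h = ps_content_handler;

      return NGX_OK;
  }

With both hooks in place, a request flows through like this: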
  1. Nginx receives a request for `http://example.com/index.html`
        GET /index.html HTTP/1.1
        Host: example.com
  2. Nginx calls PageSpeed's content handler, which looks at the url and determines whether this request is for an optimized .pagespeed. resource. In this case it isn't, so the content handler declines this request.
  3. Nginx continues trying other content handlers until it finds one that can handle the request. That might be proxy_pass, fastcgi_pass, try_files, the static file handler, or anything else the webmaster has configured Nginx to use.
  4. Whatever content handler Nginx selects will start streaming a response as a linked list of buffers ("buffer chain").
        ngx_chain_t in:
          ngx_buf_t* buf:
            u_char* start
            u_char* end
          ngx_chain_t* next
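        (simplified: the real ngx_buf_t also tracks pos and last, the bytes currently in use, plus flags like last_buf)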
  5. Nginx passes that chain of buffers through all registered body filters, which includes PageSpeed's. If this were not html being sent, PageSpeed's body filter would immediately pass the buffers on to the next registered body filter.
  6. The body filter will see one buffer chain at a time, and it might not be the whole file's worth. For static files on disk it usually will be, but it might not be if, say, we're proxying from an upstream that quickly dumps some layout html but takes much longer to generate personalized content. (There's a sketch of walking a buffer chain after this list.)
  7. We pass this to PageSpeed via a ProxyFetch. PageSpeed runs in another thread, and ProxyFetch handles all the thread-safety complexity here. Nginx doesn't normally use threads at all, so we need to be pretty careful.
  8. We need to give PageSpeed time to optimize this html, and since it's running in a different thread we're not going to have output ready for Nginx immediately. Nginx uses an event loop, so we can't just wait around here, or Nginx wouldn't be able to handle other requests until this one finishes. Instead we create a pipe and tell Nginx to watch the output end. Once PageSpeed has some data ready it writes a byte to the pipe, which notifies Nginx. (There's a sketch of this pipe trick after this list too.)
  9. PageSpeed parses this html, identifies the resources in it, and tells the fetcher to retrieve them. This means a "loopback fetch" where PageSpeed requests the resources from Nginx over http.
  10. There's a scheduler thread that keeps this optimization under a very tight deadline. If Nginx takes too long to respond with the resources, or anything else makes us take too long, we send the html out with whatever optimizations we've completed so far. Imagine that in this case only site.css has been fetched and optimized by the time we hit our rewrite deadline.
  11. PageSpeed writes a byte to the pipe Nginx is watching, which makes Nginx invoke our code on its main thread (the only thread it knows about). We copy the output bytes from PageSpeed to an Nginx buffer chain and then Nginx sends them out to the user's browser:
        index.html
         -> navbar.js
         -> A.site.css.pagespeed.cf.KM5K8SbHQL.css
         -> cat.jpg
  12. All html will go through the same path: as it comes into Nginx it will go to the body filter and then to PageSpeed via the ProxyFetch, then after optimizing or hitting the deadline PageSpeed wakes up Nginx with the pipe, and Nginx sends out the rewritten html.
  13. When the user's browser sees navbar.js and cat.jpg it will request them from Nginx. The content handler and then the body filter will see each request, but neither will do anything with them.
  14. The request for A.site.css.pagespeed.cf.KM5K8SbHQL.css, however, will be answered by the content handler. It passes the request to PageSpeed via a ResourceFetch, and PageSpeed pulls the rewritten resource out of cache. (In the unlikely event that the resource is not in cache, there is enough information in the requested filename that PageSpeed can fully reconstruct the optimized resource.) We go through the same output flow, writing a byte to a pipe to notify Nginx that we have data to send out.
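Two sketches to make the buffer and pipe machinery above more concrete. First, roughly how a body filter walks one incoming buffer chain (steps 4 through 6). The field names are real nginx ones, but the function itself is an illustration rather than the actual ngx_pagespeed code, and it assumes the same nginx headers as the registration sketch earlier:

  /* Illustration only: each call may carry just part of the response;
   * the last_buf flag marks the end. */
  static ngx_int_t
  ps_feed_buffers(ngx_http_request_t *r, ngx_chain_t *in)
  {
      ngx_chain_t  *cl;

      for (cl = in; cl != NULL; cl = cl->next) {
          /* the bytes for this piece are [cl->buf->pos, cl->buf->last) */
          /* ...feed them to PageSpeed through the ProxyFetch... */

          if (cl->buf->last_buf) {
              /* end of the html: tell PageSpeed no more input is coming */
          }
      }

      return NGX_OK;
  }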
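Second, the wakeup pipe from steps 8 and 11. The function names here (ps_setup_wakeup, ps_output_ready, ps_wake_nginx) are hypothetical, and the real module keeps this state per-request rather than in globals, but the mechanism, an ngx_connection_t wrapped around the read end of a pipe that another thread writes a byte to, is the one described above:

  #include <ngx_config.h>
  #include <ngx_core.h>
  #include <ngx_http.h>
  #include <unistd.h>

  static int  ps_pipe_fds[2];   /* [0]: nginx watches, [1]: PageSpeed writes */

  /* Runs on nginx's thread whenever the pipe becomes readable: drain it,
   * then copy PageSpeed's output into nginx buffers and send it on. */
  static void
  ps_output_ready(ngx_event_t *ev)
  {
      u_char  drain[256];

      while (read(ps_pipe_fds[0], drain, sizeof(drain)) > 0) { /* empty it */ }

      /* ...copy optimized bytes into an ngx_chain_t and pass them down the
       * filter chain... */
  }

  /* Set up while handling the request, on nginx's thread. */
  static ngx_int_t
  ps_setup_wakeup(ngx_http_request_t *r)
  {
      ngx_connection_t  *c;

      if (pipe(ps_pipe_fds) != 0) {
          return NGX_ERROR;
      }
      ngx_nonblocking(ps_pipe_fds[0]);

      /* Wrap the read end in a connection so nginx's event loop watches it. */
      c = ngx_get_connection(ps_pipe_fds[0], r->connection->log);
      if (c == NULL) {
          return NGX_ERROR;
      }
      c->read->handler = ps_output_ready;
      c->read->log = r->connection->log;

      return ngx_handle_read_event(c->read, 0);
  }

  /* Later, on the PageSpeed thread, when some optimized html is ready: */
  static void
  ps_wake_nginx(void)
  {
      char  byte = 'w';

      (void) write(ps_pipe_fds[1], &byte, 1);   /* nginx calls ps_output_ready */
  }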
