{"items": [{"author": "Al", "source_link": "https://www.facebook.com/jefftk/posts/276158789111652?comment_id=276167142444150", "anchor": "fb-276167142444150", "service": "fb", "text": "Could this be a result of Google now sourcing both web and social (Google+) for results?  I've heard of some interesting search results.", "timestamp": "1326294401"}, {"author": "Jeff&nbsp;Kaufman", "source_link": "https://www.facebook.com/jefftk/posts/276158789111652?comment_id=276193895774808", "anchor": "fb-276193895774808", "service": "fb", "text": "@Al: In this case the comment is a G+ one, but this also happens for Facebook comments.  (Image added above.)", "timestamp": "1326297486"}, {"author": "BDan", "source_link": "https://www.facebook.com/jefftk/posts/276158789111652?comment_id=276230669104464", "anchor": "fb-276230669104464", "service": "fb", "text": "It may be a violation of how you're used to bots working, in that most bot writers used to be too lazy to include JavaScript, but I think it makes more sense: the bot is indexing the content on the page as it would appear to anyone actually looking at it.  If they're doing it correctly, that probably also means that they're not including content that's *removed* via JavaScript, which should help reduce spurious results.", "timestamp": "1326301626"}, {"author": "James", "source_link": "https://plus.google.com/106345404829653994850", "anchor": "gp-1326307597536", "service": "gp", "text": "I once saw Googlebot do a GET on a POST-only form URL, without filling out the form, where the URL was in JavaScript and only ever added to the DOM if form validation passed. Based on this, some priors about how I'd do it, and the requirements for indexing AJAX-using pages, I think I have a reasonable idea of what Googlebot's architecture looks like.\n<br>\n<br>\nGooglebot is a modified, headless Chrome. 
After the page finishes loading, it freezes the page state and walks the DOM and the JavaScript heap looking for strings that look like URLs. Chrome contains all of the infrastructure needed for this, in the \"content script\" extension API. My own project, Textcelerator, uses that API for a similar sort of text extraction to what Googlebot would have to do, and I can attest that it is well suited for it. Googlebot would then associate each URL it found with the DOM element(s) where it was found. Then it tries each onclick handler: freezing state, doing a similar traversal, then rolling back once the page either reaches steady state (no HTTP requests or short timers pending), opens a new window, or tries to redirect; in the latter two cases it associates the destination URL with the element that was clicked. (There's a fair bit of extra complexity here; some pages will add a bunch of content to the DOM without triggering an actual page load. So it would have to decide whether to try a second click, avoid infinite tunnels, and so on, much like with regular dynamic pages, except that complete JavaScript heaps take the place of URLs. I don't know how far they've advanced down that path, though.)\n<br>\n<br>\nWith each DOM traversal, it does text extraction on the page. A pile of hacks classifies bits of text as hidden text, visible text, title, author name, date, or boilerplate, and groups them into posts/comments, using a combination of text length, rectangle bounds, CSS attributes, node and class names, and ancestors. (Of these, Textcelerator contains heuristics to distinguish hidden from visible, and regular from title, and I've put some effort into text vs boilerplate for future versions. I recall seeing reference to a patent with Google's name on it, for specific aspects of inferring structure from rectangle bounds, though I don't recall where.) All of these distinctions are potentially tricky, of course. 
The hidden/visible text distinction hits nasty corner cases, like out-of-bounds text, pages with custom JavaScript scrollbars, white on white, or opacity:0 in an ancestor, and Google has faced considerable abuse from spammers injecting hidden links into others' pages. Based on Google's general philosophy, I'd guess that they manually tagged the structure of a bunch of web pages, and applied some machine learning to generate heuristics for them. I'd also guess that links in posts and comments confer more pagerank if they bear the name of a prolific author.\n<br>\n<br>\nAll this seems complicated, but Google has spent massive resources on search, and this stuff is all low-hanging fruit.", "timestamp": 1326307597}, {"author": "David&nbsp;German", "source_link": "https://plus.google.com/111229345142780712481", "anchor": "gp-1326329143747", "service": "gp", "text": "@Lucas: I'm pretty sure that's a spam-bot.  Its IP address is in the Ukraine, and its posts include BBCode and sketchy URLs.", "timestamp": 1326329143}, {"author": "David", "source_link": "https://www.facebook.com/jefftk/posts/276158789111652?comment_id=277215275672670", "anchor": "fb-277215275672670", "service": "fb", "text": "If you set robots.txt to block retrieving the JavaScript, this might prevent it (or maybe some meta tag on the JS itself). This sort of thing is really needed for some web pages that are little more than a shell and use JS to load the content.", "timestamp": "1326419727"}, {"author": "Jeff&nbsp;Kaufman", "source_link": "https://www.facebook.com/jefftk/posts/276158789111652?comment_id=277215812339283", "anchor": "fb-277215812339283", "service": "fb", "text": "@David: I don't actually mind; I think I like it.  It's just not what I was expecting.", "timestamp": "1326419789"}]}