• Posts
  • RSS
  • ◂◂RSS
  • Contact

  • Parsing HTML With Regular Expressions

    February 22nd, 2013
    html, tech
    Perhaps you need to get some information out of HTML. Regular expressions look promising, but you get stuck, so you ask for help. A typical response would be:
    "HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag. Regular expressions can only match regular languages but HTML is a context-free language."
    This is true, but only in a sense so narrow it's useless. When someone asks about "regular expressions" they don't mean the restricted computer science kind but the implementations available in various programming languages. Those are much more powerful: features such as backreferences (and, in some engines, recursive patterns) take them beyond regular languages, making them capable of parsing HTML.
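
    As a rough illustration of that extra power, here is a minimal sketch in Python. The backreference `\1` forces the closing tag name to equal the opening one, which a textbook regular expression cannot express. The names `PAIRED_TAG` and `first_paired_element` are made up for this example, and since Python's `re` module has no recursive patterns, this sketch does not handle an element nested inside another of the same name.

```python
import re

# \1 refers back to whatever tag name group 1 captured, so the closing
# tag must match the opening tag. Attribute-free tags only, for brevity.
PAIRED_TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def first_paired_element(html):
    """Return (tag, inner_html) for the first matched pair, or None."""
    match = PAIRED_TAG.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

    Calling `first_paired_element("<b>bold</i>")` finds nothing, because no closing tag repeats the captured name.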

    Here again we're using dangerously precise words. When someone says "HTML" they don't mean HTML as defined by the spec but HTML that they are likely to encounter in the real world. That means you might need to handle monstrosities like "<i>italic <b>bold-italic</i> bold</b>", which have no parse tree but which your browser still renders as italic bold-italic bold. [1]

    But again with the precision! For typical tasks you don't need to 'parse' HTML, just extract some information from it. Regular expressions and other ad-hoc techniques can do that well and are likely to be less trouble in practice [2] than trying to use a parser.
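
    For instance, pulling link targets out of a page is an extraction task, not a parsing task. The sketch below is one ad-hoc way to do it in Python, assuming quoted `href` values; the names `HREF` and `extract_links` are hypothetical, and the pattern deliberately ignores whether the surrounding markup is well formed.

```python
import re

# Match an <a ...> start tag and capture a quoted href value.
# Assumption: the href is quoted with " or '; unquoted values are skipped.
HREF = re.compile(r"""<a\s[^>]*?href\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def extract_links(html):
    """Return every quoted href found in <a> start tags, in document order."""
    return HREF.findall(html)
```

    Because the pattern only looks at individual start tags, unclosed or misnested elements elsewhere in the page do not affect it.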

    (Because mod_pagespeed needs to make context-dependent changes to arbitrary web pages, regular expressions are not a good fit. You might think using the parsing code from a web browser would work well, but it turns out that browsers mix HTML-parsing with HTML-cleanup [1]. So pagespeed instead goes token-by-token, triggering callbacks for start and end tags.)
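
    To give a feel for the token-by-token style, here is a small sketch using Python's standard `html.parser` module. It is not mod_pagespeed's code, only an analogous approach: the tokenizer fires a callback per start and end tag and makes no attempt to repair nesting. The names `TagEventLogger` and `tag_events` are invented for this example.

```python
from html.parser import HTMLParser


class TagEventLogger(HTMLParser):
    """Record start and end tags in the order the tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once per start tag; attrs is a list of (name, value) pairs.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once per end tag, whether or not it matches the open element.
        self.events.append(("end", tag))


def tag_events(html):
    """Feed html through the tokenizer and return the recorded tag events."""
    logger = TagEventLogger()
    logger.feed(html)
    logger.close()
    return logger.events
```

    Fed the misnested example from above, this reports the tags exactly as written (start i, start b, end i, end b), leaving any cleanup decision to the caller.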


    [1] WebKit turns <i>italic <b>bold-italic</i> bold</b> into <i>italic <b>bold-italic</b></i> <b>bold</b> at an early stage of interpreting the page.

    [2] The advice to use an XML parser is just bad, as funny as the author is. XML parsers get to assume they're only going to be given valid XML and just reject anything that isn't. Extremely few pages are (or even try to be) XML, so your XML parser isn't going to be helpful.
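
    A quick way to see this is to hand ordinary real-world markup to a strict XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `parses_as_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts markup, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Misnested or unclosed tags are fatal errors for an XML parser.
        return False
    return True
```

    Both the misnested example from [1] and a bare `<br>` inside a paragraph are rejected outright, even though browsers render them without complaint.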


