Quotation Detection

July 20th, 2012
algorithms, tech
I want to thread comments: when a comment quotes another comment it's probably a reply, so attach them together to make the conversation clearer. This means figuring out which comments quote which other comments. I tried a brute force way:
  for each comment A:
    for each later comment B:
      for each word X in A:
        for each word Y in B:
          do A and B match for N words starting at X and Y?
This is psuedocode for an O(n^2*m^2) solution. Not so good. I coded it up in javascript and while I think it's fast enough in Chrome, Firefox, and even my phone, it took IE8 nearly a minute. While I could probably run this on my server and do fine with some combination of caching and C, it seems inefficient. Is there a better algorithm? What is the general version of this problem called?

Update 2012-07-21: Several commenters suggested a better algorithm: to find all quotes of length N, build a dictionary from all sequences of N words to a list of comments in which they appeared. This is O(n*m) and running it it's much faster. I tested it in IE8, and it loaded quickly instead of freezing the browser. I thought briefly about doing something like this earlier, but wrote it off as using insane amounts of memory. After more people suggested it, I realized it only uses N times as much memory as just storing the comments.

Comment via: google plus, facebook

Recent posts on blogs I like:

Pay For Fiction

Against piracy

via Thing of Things February 29, 2024

When Nurses Lie to You

When the nurse comes to give you the flu shot, they say it won't hurt at all, right? And you trust them. Then they give you the shot, and it hurts! They lied to you. A lot of nurses lie to children about shots and blood draws. Part of it is they probabl…

via Lily Wise's Blog Posts February 28, 2024

How I build and run behavioral interviews

This is an adaptation of an internal doc I wrote for Wave. I used to think that behavioral interviews were basically useless, because it was too easy for candidates to bullshit them and too hard for me to tell what was a good answer. I’d end up grading eve…

via benkuhn.net February 25, 2024

more     (via openring)