
  • Archiving Yahoo Groups

    October 18th, 2019
    logging, tech
    On December 14th Yahoo will shut down Yahoo Groups. Since my communities have mostly moved away from @yahoogroups.com hosting, to Facebook, @googlegroups, and other places, the bit that hit me was that they are deleting all the mailing list archives.

    Digital archives of text conversations are close to ideal from the perspective of a historian: unlike in-person or audio-based interaction this naturally leaves a skimmable and easily searchable record. If I want to know, say, what people were thinking about in the early days of GiveWell, their early blog posts (including comments) are a great source. Their early mailing list archives, however, are about to be deleted.

    Luckily we still have two months to export the data before it's wiped, and people have written tools to automate this. Here's how to download a backup of all the conversations in a group:

    # Download the archiver
    $ git clone https://github.com/andrewferguson/YahooGroups-Archiver.git
    
    $ cd YahooGroups-Archiver/
    
    # Start archiving the group.
    $ python archive_group.py [group-name]
    

    If things are going well it will start spitting out messages like:

    Archiving message 1 of 8098
    Archiving message 2 of 8098
    Archiving message 3 of 8098
    

    And it will be creating files:

    $ ls [group-name]/
    1.json
    2.json
    3.json
    ...
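
To see how far along a download is, you can compare the files on disk against the total the archiver reports. This is a minimal sketch, not part of YahooGroups-Archiver: `missing_messages` is a hypothetical helper, it is written for Python 3, and it assumes message numbers run from 1 to the reported total with one `N.json` file per message.

```python
from pathlib import Path


def missing_messages(group_dir, total):
    """Return message numbers in 1..total that have no N.json file yet.

    Hypothetical helper; assumes messages are numbered 1..total.
    """
    present = set()
    for path in Path(group_dir).glob("*.json"):
        # Only count files named like 123.json; ignore anything else.
        if path.stem.isdigit():
            present.add(int(path.stem))
    return [n for n in range(1, total + 1) if n not in present]
```

For the run above you would call `missing_messages("[group-name]", 8098)`, with the real group directory in place of `[group-name]`.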
    

    If you get a message like:

    Archiving message 5221 of 8098
    Archiving message 5222 of 8098
    Archiving message 5223 of 8098
    Cannot get message 5223, attempt
       1 of 3 due to HTTP status code 500
    Cannot get message 5223, attempt
       2 of 3 due to HTTP status code 500
    Cannot get message 5223, attempt
       3 of 3 due to HTTP status code 500
    Archive halted - it appears Yahoo has blocked you. Check if you can
       access the group's homepage from your browser. If you can't, you
       have been blocked. Don't worry, in a few hours (normally less than
       3) you'll be unblocked and you can run this script again - it'll
       continue where you left off.
    
    It may mean that you have been blocked, but it may also just mean that an individual message can't be downloaded for some reason. In that case, to tell the tool to give up on that message and continue, create an empty JSON file named with the stuck message number:
    $ touch [group-name]/5223.json
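
If you end up skipping several stuck messages, the same step can be scripted. This is a rough Python 3 equivalent of the `touch` command above; `skip_message` is a name I made up for illustration, not something the archiver provides.

```python
from pathlib import Path


def skip_message(group_dir, message_number):
    """Create an empty <message_number>.json placeholder, like `touch`."""
    path = Path(group_dir) / "{}.json".format(message_number)
    # touch() creates the file if it is missing and leaves the contents of
    # an existing file alone, so an already archived message is not wiped.
    path.touch(exist_ok=True)
    return path
```

Calling `skip_message("[group-name]", 5223)` with the real group directory has the same effect as the `touch` line.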
    
    You might also get a message like:
    Traceback (most recent call last):
      File "archive_group.py", line 150, in <module>
        archive_group(sys.argv[1])
      File "archive_group.py", line 71, in archive_group
        max = group_messages_max(groupName)
      File "archive_group.py", line 94, in group_messages_max
        raise valueError
      File "archive_group.py", line 87, in group_messages_max
        pageJson = json.loads(pageHTML)
      ...
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    
    This is what I see if I try to archive a private group. It's still possible to use the tool to archive a private group that you have access to, but it's a bit involved. First, visit Yahoo Groups in your web browser with Devtools open to the Network tab. Then look at which cookies are sent with the request for the HTML page, and find the T and Y cookies. The T cookie should start with z= and the Y cookie should start with v=. Paste these values into the cookie_T and cookie_Y variable definitions at the beginning of archive_group.py.
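
It's easy to copy the wrong cookie out of Devtools, so a quick sanity check on the pasted values can save a failed run. This is a small Python 3 sketch based only on the prefixes described above. `check_yahoo_cookies` is a hypothetical helper, not part of archive_group.py, and passing the check does not prove the cookies are still valid.

```python
def check_yahoo_cookies(cookie_T, cookie_Y):
    """Return a list of problems with the pasted cookie values (empty if none).

    Only checks the expected prefixes: T should start with z=, Y with v=.
    """
    problems = []
    if not cookie_T.startswith("z="):
        problems.append("cookie_T should start with 'z='")
    if not cookie_Y.startswith("v="):
        problems.append("cookie_Y should start with 'v='")
    return problems
```

An empty list means both values at least have the expected shape.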

    Once you've downloaded all the messages in a group you can run:

    $ pip2 install natsort
    $ python2 make_Yearly_Text_Archive_html.py [group-name]
    
    Which will create a bunch of files like [group-name]-archive/archive-YYYY.html. They're not that easy to read, because it doesn't do any kind of quote folding, but we can always do that later. If you made any empty files to get around messages that wouldn't archive (see the touch command above) you'll get an error at this stage; just delete the empty files and re-run.
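
Cleaning up the empty placeholder files by hand gets tedious if there are many of them. This is a minimal Python 3 sketch that deletes only zero-byte `.json` files in the group directory. `remove_empty_placeholders` is a hypothetical name, and it assumes the only empty files are the ones you created to skip stuck messages.

```python
from pathlib import Path


def remove_empty_placeholders(group_dir):
    """Delete zero-byte .json files in group_dir; return the names removed."""
    removed = []
    for path in sorted(Path(group_dir).glob("*.json")):
        # Real archived messages contain JSON, so only empty files go.
        if path.is_file() and path.stat().st_size == 0:
            path.unlink()
            removed.append(path.name)
    return removed
```

After running it on the group directory, re-run make_Yearly_Text_Archive_html.py as before.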

    I've archived five groups: givewell, Boston-Contra, BostonAreaContraCommunity, contrasf, and trad-dance-callers. The first two are public groups with public archives, so I've made archives available at /givewell-archive and /Boston-Contra-archive. The remaining three are private, but if you want to look at them and you were a participant or otherwise have a good reason let me know.

