Sorting with Instapaper's API

A brief story about scripting Instapaper's API to sort out hundreds of unread articles.

I’ve recently got back into using Instapaper in a big way. This is in part due to my purchase of a second-hand Onyx Boox Note 3 (perhaps a post for another time). The project at hand is to go through all the saved articles on my Instapaper account and move them into folders if they come from the same website or newsletter.

How did things get so cluttered?

I have had email rules set up for years in Gmail that auto-forward certain newsletters to Instapaper (each account has an email address for saving articles this way, which I think is great!). I’ve not really been using the service over the last year or so, which means these articles have been slowly but surely accumulating.

There’s no bulk way in the app or portal to filter and move articles based on the criteria I’m after, at least not with the free tier product.

Fixing with Instapaper’s API

Instapaper has an API that is freely available. A saved article (or web page) is called a bookmark in Instapaper’s world. They offer a very simple API just for saving bookmarks, and a full-featured API for standard CRUD operations on bookmarks, folders, highlights, etc. For our use case, which is moving bookmarks to folders, we will need the full API.

Of course, for this quick set of tidy-up scripts I reached for my favourite scripting language, Perl. This GitHub repo has the basic API wrapper and example scripts if anyone wants to follow along at home.

The sorting scripts do two things:

1. Group articles by domain name. If a group had over three articles, I made a new folder and moved the bookmarks accordingly. This gave me a nytimes.com folder with some articles, even though I don’t subscribe to any NYT newsletters.
2. For articles saved via email there is no useful URL, so I instead used a really basic title-parsing approach that just looks for the newsletter name and tries to match it with an existing folder. This worked for me because the newsletters in question happen to always have the name in the email subject, which ends up as the bookmark title.
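
The two steps above can be sketched roughly as follows. This is Python rather than the Perl in the repo, and it is only an illustration under assumptions: bookmarks are taken to be dictionaries with `url` and `title` keys, "over three" is read as four or more, and the function names and the `MIN_GROUP_SIZE` constant are hypothetical, not taken from the actual scripts.

```python
from collections import defaultdict
from urllib.parse import urlparse

MIN_GROUP_SIZE = 4  # assumption: "over three articles" means four or more


def domain_of(url):
    """Return the host of a URL without a leading 'www.', or '' if there is none."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_domain(bookmarks):
    """Map domain -> bookmarks, keeping only domains with enough articles."""
    groups = defaultdict(list)
    for bookmark in bookmarks:
        domain = domain_of(bookmark.get("url", ""))
        if domain:  # bookmarks without a usable URL are skipped here
            groups[domain].append(bookmark)
    return {d: items for d, items in groups.items() if len(items) >= MIN_GROUP_SIZE}


def folder_for_title(title, folder_names):
    """Return the first folder whose name appears in the title (case-insensitive)."""
    lowered = title.lower()
    for name in folder_names:
        if name.lower() in lowered:
            return name
    return None
```

A plain substring match is about as basic as title parsing gets; it only works when folder names are distinctive enough not to show up in unrelated titles.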

API Gotchas

The Time Attribute Changes…

Bookmarks have a time attribute, which is a Unix timestamp. Moving a bookmark to a folder changes the value to the time of the move operation, which caused great confusion when I went back to the Instapaper portal to check things were working and saw hundreds of articles I had apparently added today.
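
One way to guard against this (not something the original scripts do) would be to record each bookmark's time before moving anything. A minimal Python sketch, assuming bookmarks are dictionaries with `bookmark_id` and `time` keys; the function name and the output path are hypothetical:

```python
import json


def snapshot_times(bookmarks, path):
    """Write {bookmark_id: original time} to a JSON file before any moves."""
    # JSON object keys must be strings, so the IDs are converted up front.
    times = {str(b["bookmark_id"]): b["time"] for b in bookmarks}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(times, handle, indent=2)
    return times
```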

The API returns arrays… except when it doesn’t

The main API method you probably want to use is ‘list the bookmarks’, as this is how you get all unread articles, or all articles in a specific folder etc.

This endpoint returns an array like the others, but the objects in it are not all bookmarks. You get some user/meta stuff as well. In fairness to Instapaper, the docs do tell you this, and you did read the API docs, right? Just ignore any object whose type value isn’t bookmark and you’re fine.
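
The filtering amounts to a one-liner. A small Python sketch (again, not the Perl from the repo), assuming the response has already been decoded into a list of dictionaries that each carry a `type` key; the function name is hypothetical:

```python
def only_bookmarks(items):
    """Keep the objects whose type is 'bookmark', dropping user/meta entries."""
    return [item for item in items if item.get("type") == "bookmark"]
```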

Old articles throw an error

Articles saved via email that are ‘old’ don’t work for me anymore. The portal lists them fine, but throws an error when trying to view the full article. I’m not sure whether this is a bug or a feature (I’m still on the free version of Instapaper). If it is by design, the error is confusing.

I can’t tell you how old articles can get before this bug kicks in, as I didn’t notice it until I’d moved all the articles, and that reset the time attribute (see above).

My fix was another script to programmatically request the text for each bookmarked article. If the API threw an error rather than returning article content I just deleted the bookmark.
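
The shape of that clean-up loop, sketched in Python rather than Perl. The `get_text` and `delete` callables are hypothetical stand-ins for whatever API wrapper is in use, and it is assumed that `get_text` raises an exception when the article text cannot be fetched:

```python
def prune_unreadable(bookmarks, get_text, delete):
    """Delete every bookmark whose text cannot be fetched; return the deleted IDs."""
    deleted = []
    for bookmark in bookmarks:
        bookmark_id = bookmark["bookmark_id"]
        try:
            get_text(bookmark_id)
        except Exception:
            # Broad catch for the sketch only; a real script should catch the
            # specific error its API wrapper raises.
            delete(bookmark_id)
            deleted.append(bookmark_id)
    return deleted
```

Since deletion can't be undone from the script, it may be worth printing the IDs first and only passing a real `delete` once the list looks right.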

Progress

The scripts are a success. I may even start running them as a cron job so I can continue adding newsletters and articles without organizing as I go. To be decided.

In other news, I noticed whilst checking my initial API connectivity the following HTTP response header:

x-powered-by: a lot of coffee and Phish

Nice! That takes me back to a project from years ago where I was crawling the web for unusual/interesting HTTP headers. Possibly 2024 is the year to bring that back.

Dev

Falkus.co

06 Feb 2024