A project I worked on in some of my spare time back in Summer 2015 was to look for unusual HTTP headers. Things like X-Clacks-Overhead: GNU Terry Pratchett, or ancient use of PICS-label headers. I recently revisited the project and put it online as popular-headers.
One interesting part of the project was using an event loop to speed up the process of making a large number of HTTP requests. The first-draft Perl script read in a list of URLs and then printed out the headers used. The module used to make the requests was Mojo::UserAgent. I am a big fan of the Mojolicious/Mojo framework and libraries. The Mojolicious website provides a good overview for those unfamiliar.
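This is not the original script, but a minimal sketch of what a blocking first draft along these lines could look like. The header_names helper, the timeout values, and taking URLs from the command line are all assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;

# Hypothetical helper: make one blocking request and return the names of
# the response headers, or an empty list if no response arrived.
sub header_names {
    my ($ua, $url) = @_;
    my $tx  = $ua->get($url);    # blocks here until a response or a timeout
    my $err = $tx->error;
    return () if $err && !$err->{code};    # connection error or timeout
    return @{ $tx->res->headers->names };
}

# Timeout values are illustrative assumptions.
my $ua = Mojo::UserAgent->new(connect_timeout => 5, request_timeout => 10);

# URLs are taken from the command line here (an assumption for brevity).
for my $url (@ARGV) {
    print "$url\t$_\n" for header_names($ua, $url);
}
```

Every call to get() here waits for the network before the loop can move on to the next URL, which is where the time goes.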
Whilst this initial script functioned fine it was hardly fast. Actually, it was really slow. Initial runs averaged over a second per request. To look for some interesting HTTP headers I was planning on making a million requests (using Alexa’s free top 1 million dataset). At one second each, a million requests would have taken more than 11 days, much longer than I wanted to wait!
Enter Event Loops
A few years back Jeff Dean gave a presentation where he talked about designing large-scale distributed systems. The slides can be found here. Of particular interest is the ‘numbers everyone should know’ slide (page 13 of the PDF). Note how slow networking is compared to the rest of the tasks a computer will deal with. While requesting a million URLs one at a time, we spend a significant portion of those 11 days just idling. We make a request and then sit waiting for the result or, worse, for no result, in which case we hit the timeout and have nothing to show for it.
Non-blocking requests can provide a big improvement. We put our request out on the network then go and do something else until that request returns a result. What makes this an attractive option is that our Mojo::UserAgent library already supports non-blocking requests.
We can also easily share global state between requests. We are not managing multiple threads or processes; we’re just being more intelligent with how we use the time of our single process. This is useful for the popular-headers project, which uses a database: we can make a single database handle and use it safely in multiple callbacks. Sure, we could make multiple handles and tackle the problem differently, but for a quick increase in efficiency this approach is golden.
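To make the idea concrete, here is a rough sketch of non-blocking requests sharing state inside one process. It is an illustration, not the project's code: the tally_headers helper is hypothetical, and the %seen hash stands in for the database handle mentioned above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop;

# Shared state: one hash visible to every callback. This is a stand-in for
# the single database handle; the real project stored results in a database.
my %seen;

# Hypothetical helper: start all requests without blocking, tally header
# names as responses arrive, and return once every request has finished.
sub tally_headers {
    my ($ua, @urls) = @_;
    my $pending = @urls;
    return unless $pending;

    for my $url (@urls) {
        # Passing a callback makes get() return immediately.
        $ua->get($url => sub {
            my ($ua, $tx) = @_;
            $seen{lc $_}++ for @{ $tx->res->headers->names };
            Mojo::IOLoop->stop unless --$pending;    # last response stops the loop
        });
    }

    Mojo::IOLoop->start;    # run the event loop until stop() is called
}

my $ua = Mojo::UserAgent->new(request_timeout => 10);    # timeout is an assumption
tally_headers($ua, @ARGV);
printf "%6d  %s\n", $seen{$_}, $_ for sort { $seen{$b} <=> $seen{$a} } keys %seen;
```

Because everything runs in a single process and only one callback executes at a time, the callbacks can update %seen without any locking. Note that this sketch fires off every request at once, so it only suits a short list of URLs.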
Using Mojo::IOLoop::Delay for Management
The Mojo::IOLoop::Delay module provides an easy way to manage and control the flow of these non-blocking events in our program. There are several convenience methods provided but the only ones I ended up using were begin() and wait().
Let’s say $delay = Mojo::IOLoop->delay. For each non-blocking request we make we need to call $delay->begin, which increments our event counter and returns a callback. When our non-blocking request finishes we can execute this callback, which decrements our event counter.
$delay->wait() does as you might expect and waits for the event counter to reach 0. You could use this right at the end of your script to make sure you don’t exit before all the requests have finished. I used this to manage processing the URL list in batches of a few hundred at a time, as well as catching any left-over requests right at the end.
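The batching pattern could be sketched roughly as follows. The fetch_in_batches helper, the batch size of 200, and the timeout are assumptions for illustration. The sketch also constructs the delay with Mojo::IOLoop::Delay->new rather than Mojo::IOLoop->delay, since, as far as I know, the module was moved out of Mojolicious core into its own CPAN distribution as of Mojolicious 9.0.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop::Delay;    # assumed to be installed separately on Mojolicious 9+

# Hypothetical helper: process URLs in batches, waiting for each batch to
# drain before starting the next. Returns the number of finished requests.
sub fetch_in_batches {
    my ($ua, $batch_size, $on_response, @urls) = @_;
    my $finished = 0;

    while (my @batch = splice @urls, 0, $batch_size) {
        my $delay = Mojo::IOLoop::Delay->new;    # fresh event counter per batch

        for my $url (@batch) {
            my $end = $delay->begin;             # event counter +1
            $ua->get($url => sub {
                my ($ua, $tx) = @_;
                $on_response->($url, $tx);
                $finished++;
                $end->();                        # event counter -1
            });
        }

        $delay->wait;    # run the event loop until the counter is back at 0
    }

    return $finished;
}

my $ua = Mojo::UserAgent->new(request_timeout => 10);    # timeout is an assumption
fetch_in_batches($ua, 200, sub {
    my ($url, $tx) = @_;
    print "$url\t$_\n" for @{ $tx->res->headers->names };
}, @ARGV);
```

Waiting once per batch keeps the number of simultaneous connections bounded, and the final pass through the loop picks up whatever is left over when the list does not divide evenly.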
Further Notes
There is a good video on Mojo::IOLoop::Delay on the Utah Open Source YouTube channel from a couple of years back. The presentation is by Scott Wiersdorf. Scott gives an interesting overview of other similar Perl libraries and works through a ‘real-world’ example of how this technique can be used. It’s worth checking out.