
Whistle While Your Node.js Queue Works


First, a Story…

Let’s face it. We’ve all seen and enjoyed “Snow White and the Seven Dwarfs” at some point in our lives. It has it all: a lovable main character, an evil antagonist and, the real highlight, the seven dwarves. While we don’t get a whole lot of backstory on the dwarves, they seem like pretty upstanding miners.

Maybe it’s just me, but have you ever wondered if the dwarves have a manager? I mean, are they employed by someone, and if so, how do they coordinate what part of the mine they’re working on? Oh, maybe they unionized at some point, which, now that you mention it, would explain why they look so well-fed and happy.

Ok. At this point, you’re probably wondering where this is going. Well, so am I! You see, we all have bits of work that need to get done, and at the end of the day, most work streams need a coordinator to make sure everything runs smoothly. The last thing you want is duplicated work, incomplete work or, worse, one worker that does everything while another one sits around drinking a Slurpee while watching Family Guy. Yes, I’m looking at you, Grumpy!

Ahem… let me give you an example: we’ve been working on some fun new features here at Tiled, and one of them requires us to run through every account in our database, as well as every person who may have viewed a Microapp, and compute some basic statistics over their interactions. As you can imagine, this is a pretty complicated work stream, so let’s run through a naïve implementation of how this might work and then build our way up to something more robust and correct.

The design principles we’ll be using are:

  1. The system should only perform a work stream once; duplicated work is a no-no.
  2. While work is being performed, the Tiled website should still be able to answer web requests without a hitch.
  3. Work should be schedule-able (e.g. run every night at midnight) and it should also be able to happen on-demand.

Dopey’s Solution

So here’s a first go at it. This is a high-level view of our backend system and, in it, we have a bunch of api boxes set up behind a load balancer. This is a pretty typical setup for most early-stage startups. Now let’s say we need this expensive task, bigJob(), to run every 2 hours.

You could use the wonderful node-cron to schedule this work and, after writing the body of bigJob(), let’s face it, you’re done! Well, kind of.
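Here’s a minimal sketch of what that naïve version might look like, assuming node-cron’s schedule() API (the cron expression and the bigJob() body are placeholders):

const cron = require('node-cron');

function bigJob() {
  // run statistics over every account...
}

// Naïve approach: every api box runs this same file, so every
// box schedules (and runs) its own copy of bigJob().
cron.schedule('0 */2 * * *', bigJob); // every two hours

There are a few issues with this approach.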

  1. Because the api boxes are all running the same code, bigJob() gets scheduled on each machine separately. If we’ve written our code correctly, we might be able to guarantee that each job runs until completion, but in the process we are duplicating work for no reason. We are literally running bigJob() n times, once on each api box. Worse, let’s say the job is supposed to send out an email upon completion. With this solution, we’ve got n emails going out saying the job was completed. What a mess!
  2. What if a user in the system could request their account to be run through bigJob() on-demand? And what if n people ask for their accounts to run this job at the same time? Does that mean we can’t service any api requests during that time because all of our api boxes are chewing through data running bigJob()? This is resource starvation at its worst!
  3. Lastly, is this a stream of work we could parallelize? In our example, that’s exactly what we could do since each account is not dependent on another for its calculations. This type of problem is the textbook definition of embarrassingly parallel. However, it’s not immediately obvious how we might split the work up across multiple machines.

Ok, so clearly we’re not done. If we used this solution and each api box was a dwarf, I think the last thing they would be doing is whistling by the end of the day. Let’s try another approach.

Doc’s Solution

There has to be a better way to do this, and there is. This is where a good job queueing system will really help us. There are lots of options for Node applications, but here at Tiled we ended up going with bull. The main reason was its support for “repeatable jobs”, meaning jobs that can be repeated on a specific schedule, i.e. like a cron job.

No matter which library you choose, the concept that underpins them all is some way of adding jobs to, and removing jobs from, a queue in a thread-safe way. By thread-safe, I mean that if one computer removes a job from the queue, another computer should not be able to remove that same job again. This prevents work from being duplicated.

Let’s take a look at the setup this time. I’ve removed the Load Balancer from the diagram for the sake of screen real estate, but don’t worry. It’s still there!

Notice we now have a series of worker boxes and a Redis box. Setting up a Redis server is beyond the scope of this document but I have added one security note at the end of the article in case you need it. In this scheme, all the machines talk to Redis to coordinate the work needing to be performed.

Once you have Redis running and bull installed, you’ll need some way of designating a machine as a worker. In our case, we just added a new environment variable (WORKER=true) to the pm2 deploy script. That way, we could mark some computers as workers.
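For example, a hypothetical pm2 ecosystem file for a worker box might look like the following (the app name and script path are made up for illustration; this isn’t necessarily how our own deploy script is structured):

// ecosystem.config.js (hypothetical)
module.exports = {
  apps: [{
    name: 'worker',
    script: 'server.js',
    env: {
      WORKER: 'true' // marks this box as a worker
    }
  }]
};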

Let’s take a look at some code which should make it pretty easy to understand how it all works. Keep in mind that our api boxes and our workers are running the same code and the only difference is that worker boxes get that extra WORKER=true environment variable.

const Queue = require('bull');

// 1. create a job queue
const queue = new Queue('BIG_JOB_QUEUE', 'redis://my-redis-host:6379');

// 2. mark if the computer is a worker
const isWorker = process.env.WORKER === 'true';

// 3. our work!
function bigJob(job, done) {
  // whistle while you work!
  // (the payload passed to add() is available on job.data)
  // ...
  done();
}

// 4. only workers run this block
if (isWorker) {
  queue.process(bigJob);
  queue.add({},
    {
      jobId: 'BIG_JOB_NAME',   // 5. ensure one job gets registered
      repeat: {
        cron: '0 */2 * * *'   // every two hours
      }
    }
  );
}

In 1), we’re creating the queue that is used to add work (through the add() call) and schedule it (add() with the repeat option). The worker boxes need to run this code so they can listen for work being added to the queue, and the api boxes need to do the same in case they need to enqueue work on-demand (we don’t show this in the example above, but see the sketch below). Note that we use the same queue name, 'BIG_JOB_QUEUE'. This is the name (more or less) that is used to publish and subscribe to jobs in Redis via its pub/sub system.
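For completeness, here’s a minimal sketch of what that on-demand path might look like on an api box. The Express route and the accountId payload are assumptions for illustration; the point is simply that an api box calls add() and a worker box picks the job up:

const express = require('express');
const Queue = require('bull');

const app = express();
const queue = new Queue('BIG_JOB_QUEUE', 'redis://my-redis-host:6379');

// Hypothetical on-demand endpoint: enqueue a one-off job for one
// account; whichever worker box is free will process it.
app.post('/accounts/:id/stats', async (req, res) => {
  await queue.add({ accountId: req.params.id });
  res.status(202).send('job queued');
});

app.listen(3000);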

Then, in 2), we’re figuring out if the current computer is a worker or not.

As for 3), we’re defining what bigJob() actually does. Don’t forget to call done() when the work is done!

Finally, in 4), if the computer is a worker, we tell the queueing system two things: a) that this computer should process the job queue (i.e. subscribe), and b) that a job should get added to the queue every 2 hours (i.e. publish every 2 hours). I should also point out that the scheduled job uses the job id 'BIG_JOB_NAME'. This is an important item in our setup; without it, every worker box would schedule its own copy of the job and we’d end up in the same situation we were in before, doing duplicate work.

Ok, it looks like we have a solid solution now but does it solve the problems we set out to fix in the first place? Going back to those original problems:

  1. By using the job id 'BIG_JOB_NAME', the job will only ever get scheduled once every 2 hours, and we will not be doing any duplicate work.
  2. While a job is running, the api boxes are free to service web requests. And if a user requests a bunch of jobs to be run, they will each get added to the queue and the worker boxes will chew through the queue one job at a time. It may take some time for the worker boxes to get through all the queued jobs, but as long as the api boxes aren’t tied up along the way, we should be a-ok!
  3. If we needed to parallelize the work, we could go one step further and give each worker box an id like 0, 1, …, m-1. Then, in bigJob(), we could do something simple like take the account id modulo the number of workers (accountId % numWorkers) and, if the result equals the worker’s own id, that worker runs the job (see the sketch below). Assuming account ids were handed out with an equal distribution, we could feel pretty good knowing our worker boxes would distribute the work evenly.
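Here’s a minimal sketch of that check. The WORKER_ID and NUM_WORKERS environment variables are assumptions; they’d be set per-box, much like WORKER=true:

// Hypothetical sharding check for splitting bigJob() across workers.
const workerId = parseInt(process.env.WORKER_ID, 10);     // 0, 1, ..., m-1
const numWorkers = parseInt(process.env.NUM_WORKERS, 10); // m

function shouldProcess(accountId) {
  // Each account id maps to exactly one worker.
  return accountId % numWorkers === workerId;
}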

So that’s about it! You now have a working queueing system where every dwarf, errr, worker does its part. If they ever make another Snow White movie, maybe they’ll add a new character and her name will be Redis and… ok, you get the point.

Security: A Last Word

I mentioned that setting up Redis was beyond the scope of this document but that I would add one security note. Well, here’s that one note.

If you’re anything like us at Tiled, you want to try new technologies and so when we put up our Redis instance, we decided to opt for a containerized solution. It’s not that we would realistically need to scale Redis any time soon but we figured that it might be a good way to dip our toes into containerization in a production environment.

We run Ubuntu on our boxes and we use ufw to configure our firewall. If you do end up running a Redis image and then expose the container port to the machine, something funny happens: Redis becomes exposed to the entire world. Yes, you heard me right. In the system we designed, only the api and worker boxes should be able to talk to Redis, so you can imagine how surprised we were to find our Redis machine completely exposed on the Internet. How could it be that the ufw rules we had been using were set up correctly (i.e. we only opened the machine to other machines internal to our private network) and yet I was able to connect to our Redis instance from something like my laptop?!

Well, I’m glad you asked! It turns out that Docker writes to the computer’s iptables directly. In fact, it inserts rules higher in the chain than ufw does. So, in effect, Docker explicitly opens up the machine to the world and there’s nothing ufw can do about it. Once a Redis machine is opened to the world, lots of bad things can happen so beware (read: hackers installing Redis keys that run cryptomining scripts)!

To prevent this:

  1. Run dockerd (the Docker daemon) with the --iptables=false option. This tells Docker not to touch your iptables. You may need to clear your iptables rules before doing this, since rules that were already inserted will not be removed just by running dockerd with this option. One way to set the option persistently is shown after this list.
  2. Restart your computer to ensure the new iptables rules get picked up, and make sure your ufw rules are still in place.
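As a sketch (assuming a stock Docker installation that reads /etc/docker/daemon.json; double-check against your own setup), the persistent equivalent of the --iptables=false flag is:

{
  "iptables": false
}

After writing that file, restart Docker (or the whole machine, per step 2 above) and verify with sudo iptables -L that Docker’s rules are gone and your ufw rules are still in place.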

You can thank me for saving your computer the next time I see you!
