Recently two different administrative users of a Haskell app I maintain started a long-running background job at the same time. This resulted in some Bad Things happening which I would like to prevent in the future!
So I want a lock. This app runs on at least two servers at any given time, so it needs to be a distributed lock, and for Reasons (not relevant to this blogpost) I can't use our Redis cache. So the PostgreSQL database it is!
My first thought is that I could just make a table of "locks", write a row to the table, if the row is there, the job is locked. That probably would have worked just fine! But then I did a quick Google to see what other options were out there and learned about advisory locks. And these seem perfect for what I want to do.
From the PostgreSQL documentation:
While a flag stored in a table could be used for the same purpose, advisory locks are faster, avoid table bloat, and are automatically cleaned up by the server at the end of the session.
This blog post is a rundown of how these locks work.
So I got coding. And here's what my lock function in the Haskell app looks like:
```haskell
withLock :: Pool Connection -> Text -> IO a -> IO a
withLock connPool lockName func =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      let key = hash lockName
      locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
      case listToMaybe locked of
        Just (Only True) -> func
        _ -> throw (LockException lockName)
```
Let's go through it line-by-line:
```haskell
withResource connPool $ \conn ->
  withTransaction conn $ do
```
We use `withResource` to get a connection from the database connection pool, then we use postgresql-simple's `withTransaction` to wrap the rest of the function in a database transaction.
That sounds a little scary at first. Does this mean everything that happens inside the lock is happening in the same transaction? So, if we're charging 500 credit cards and recording 500 receipts, and the 499th charge has an error, is it going to roll back 498 receipts!? That would be terrible!! 😱
Luckily that is not what happens! Only database calls using this connection `conn` will be a part of the transaction. We don't pass `conn` as an argument into `func`, so we know that database calls within that function are going to grab their own connections from the pool. Anything that those connections do will be outside this transaction.
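To make that concrete, here's a sketch of the difference. The table names and the `logStart` helper are invented for illustration; the point is which connection each statement runs on:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Pool (Pool, withResource)
import Database.PostgreSQL.Simple
  (Connection, Only (..), execute, execute_, withTransaction)

-- Invented helper: grabs its own connection from the pool.
logStart :: Pool Connection -> IO ()
logStart pool = withResource pool $ \c ->
  () <$ execute_ c "INSERT INTO job_log (note) VALUES ('started')"

example :: Pool Connection -> Int -> IO ()
example connPool jobId =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      -- This statement runs inside the transaction on `conn`:
      _ <- execute conn "UPDATE jobs SET state = 'running' WHERE id = ?" (Only jobId)
      -- This one runs on a *different* pooled connection, so it commits
      -- independently, even if the outer transaction later rolls back:
      logStart connPool
```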
```haskell
let key = hash lockName
```
Next we `hash` the `lockName` to turn it into an integer that we can give to `pg_try_advisory_xact_lock`.
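A minimal sketch of that step, assuming `hash` is the one from the hashable package's `Data.Hashable` (on a 64-bit GHC, `Int` matches the `bigint` key the lock function expects):

```haskell
import Data.Hashable (hash)
import Data.Text (Text)

-- Turn a human-readable lock name into the integer key that
-- pg_try_advisory_xact_lock expects. Equal names always produce equal
-- keys, so every server agrees on which lock a job name refers to.
lockKey :: Text -> Int
lockKey = hash
```

One caveat: advisory lock keys share a single integer keyspace across the whole database, so two different lock names could in principle hash to the same key and make unrelated jobs contend with each other.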
```haskell
locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
```
Now we'll try to take the lock. The postgresql-simple library's `query` function takes a connection to the database, a query, and an n-tuple of arguments for the query. In this case there's only one argument, but how does the compiler distinguish a 1-tuple from just... a value? postgresql-simple has us wrap it in the `Only` constructor to make the types work.
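For reference, `Only` is roughly this one-field wrapper (paraphrased from postgresql-simple's `Database.PostgreSQL.Simple.Types`; the real definition carries more instances):

```haskell
-- A wrapper that makes a single value look like a 1-tuple of query
-- parameters (or a 1-column result row):
newtype Only a = Only { fromOnly :: a }
  deriving (Eq, Show)
```

Queries with two or more parameters just use ordinary tuples, e.g. `(x, y)`; `Only` exists purely for the one-value case.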
We know that `pg_try_advisory_xact_lock` returns just a single boolean, but as far as postgresql-simple is concerned, any query can return a list of rows containing a list or tuple of columns.
But we just want to know if that single result is true or false. That's where `listToMaybe locked` comes in. `listToMaybe` on a list of "rows" will give us `Just` the first element or `Nothing`. So then we have `Maybe` the single result we're looking for.
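The behavior is easy to see with plain lists standing in for result rows (a tiny, runnable illustration):

```haskell
import Data.Maybe (listToMaybe)

main :: IO ()
main = do
  print (listToMaybe [True, False])   -- Just True: the first "row" of a result set
  print (listToMaybe ([] :: [Bool]))  -- Nothing: the query returned no rows
```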
We only want to call the function inside the lock if the query returned at least one row (it's `Just` something), if there was `Only` one field returned from the database, and if the lock succeeded (the result is `True`). Otherwise we need to do something else.
```haskell
_ -> throw (LockException lockName)
```
Here, we're throwing a custom `LockException`. This will let us handle just these errors if we want to let the user know that a job is already in progress.

Here's what the definition of `LockException` looks like:
```haskell
newtype LockException = LockException Text deriving Show

instance Exception LockException
```
I'm not going to attempt automated tests for this function, because these sorts of tests (trying to force two things to run exactly at once) are just way too flaky.
So what kind of test can I write for this to make sure it works? Here's what I did to test as I developed:
```haskell
testLock :: Pool Connection -> IO ()
testLock connPool = do
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "✅ Successfully took first lock ✅"
            threadDelay 1000000)
          (\(LockException _) -> putStrLn "🚫 Unable to take first lock 🚫")
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "🚫 Successfully took second lock 🚫")
          (\(LockException _) -> putStrLn "✅ Unable to take second lock ✅")
  forkIO $
    catch (withLock connPool "a different lock" $ do
            threadDelay 1000000 -- this delay is just to prevent overlapping output
            putStrLn "✅ Successfully took third lock ✅")
          (\(LockException _) -> putStrLn "🚫 Unable to take third lock 🚫")
  threadDelay 1500000
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "✅ Successfully took fourth lock ✅")
          (\(LockException _) -> putStrLn "🚫 Unable to take fourth lock 🚫")
  return ()
```
We need to use `forkIO` to create two threads that can run more or less at once. They need to take a little time, so I add a `threadDelay` inside the first job to make sure they'll both try to take the lock. Next I'll make sure I can take a lock that uses a different key. Finally, we'll wait for the first job to finish and try to take a fourth lock on the original key.
This results in:
```
✅ Successfully took first lock ✅
✅ Unable to take second lock ✅
✅ Successfully took fourth lock ✅
✅ Successfully took third lock ✅
```
Yay!
I asked for feedback on this blog post from the Recurse Center community and got this super helpful thought from Kamal Marhubi:
On the locking, the only thing I can think of right now is what might happen if the connection holding the lock is broken. Will the inner work in `func` happily continue running? If so, meanwhile the other machine might try and succeed to grab the lock, and proceed to cause Bad Things to happen. Only when the first machine attempts to release the lock (by exiting the transaction holding the lock) will it learn that Something Went Wrong.
What does happen when the connection is broken? I can simulate it crudely by passing the connection through to `func` and then closing the connection in the middle of my first locked "job".
```haskell
testLostConnection :: Pool Connection -> IO ()
testLostConnection connPool = do
  forkIO $
    catch (withLock connPool "test lock" $ \conn -> do
            close conn
            putStrLn "✅ Successfully took first lock ✅"
            threadDelay 1000000)
          (\(LockException _) -> putStrLn "🚫 Unable to take first lock 🚫")
  forkIO $
    catch (withLock connPool "test lock" $ \_ -> do
            putStrLn "🚫 Successfully took second lock 🚫")
          (\(LockException _) -> putStrLn "✅ Unable to take second lock ✅")
  return ()
```
Sure enough, this does result in:
```
✅ Successfully took first lock ✅
🚫 Successfully took second lock 🚫
```
(Plus some gnarly SQL errors.)
Oh no! This is rough. It looks like an advisory lock is not actually what I'm looking for. I had only heard of advisory locks from a blog post about "application-level" locking, and I'm working on an application, so I thought it would work. (Whoops.) And it kinda does? But it's not reliable enough to be called a "lock". Maybe a "lock" made out of tissue paper.
From Michael Malis:
Advisory locks are good for controlling the concurrency of queries within the database. One example is if you have a linked list in your database. You can use advisory locks to prevent multiple queries from inserting nodes at the same time. Multiple queries inserting at the same time can cause a "lost update". As an example of how the locking would work:
```
BEGIN;
lock linked list
look up head of linked list
insert new element as the new head of the linked list that points to the old head
unlock the linked list
COMMIT;
```
In this case it's ok if the connection dies. That only means the transaction is rolled back and the transaction has no effect at all.
So, if my task consisted of database queries that could run inside the same transaction where I take the lock, and the side-effects were isolated to the database, then this setup would be fine. (Is this true?)
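If I did go that route, the change would be small: hand the job the same connection, so all of its queries join the transaction that holds the lock. A sketch of that variant (`withLockInTxn` is my name for it, not something from the post's codebase; `LockException` is repeated here so the block stands alone):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (Exception, throw)
import Data.Hashable (hash)
import Data.Maybe (listToMaybe)
import Data.Pool (Pool, withResource)
import Data.Text (Text)
import Database.PostgreSQL.Simple
  (Connection, Only (..), query, withTransaction)

newtype LockException = LockException Text deriving Show
instance Exception LockException

-- Variant of withLock that hands the job the transaction's own connection.
withLockInTxn :: Pool Connection -> Text -> (Connection -> IO a) -> IO a
withLockInTxn connPool lockName work =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      let key = hash lockName
      locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
      case listToMaybe locked of
        -- The job receives `conn` itself, so every statement it issues is
        -- part of the same transaction that holds the advisory lock:
        Just (Only True) -> work conn
        _                -> throw (LockException lockName)
```

With this shape, a dropped connection rolls back the job's database effects together with the lock, so the lock and the work can't get out of sync. Side effects outside the database (like charging credit cards) are still unprotected.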
Well, I already knew what I should do before I even started all this. I wanted to break the job up into smaller jobs that could finish faster, reducing the room for error, and make each small part of the job idempotent so it wouldn't matter if one of them was run an extra time or two.
However, that seemed like it would take a long time, and worse, it would be boring. And I was very curious about locks. So, I rationalized this would just be a "quick" fix until I could do the job properly. However, more likely, it would work well enough that I would put off fixing the legacy code until the lock failed catastrophically.
It's fun to learn new things, and I have the freedom and ability to learn something new, apply it to a project, and put the result into production all in the same day. But as the great Uncle Ben once said, "with great power comes great responsibility", and I'm glad to have the amazing resource of the Recurse Center community where I can get feedback on these things!
I found a couple problems with my approach above, but maybe there are others lurking.
I'm a developer (and soon-to-be worker-owner) at Position Development, a worker-run software development company. If you would like to support our work, please let others know about us and consider working with us on your next project!