Recently two different administrative users of a Haskell app I maintain started a long-running background job at the same time. This resulted in some Bad Things happening which I would like to prevent in the future!
So I want a lock. This app runs on at least two servers at any given time, so it needs to be a distributed lock, and for Reasons (not relevant to this blogpost) I can't use our Redis cache. So the PostgreSQL database it is!
My first thought is that I could just make a table of "locks", write a row to the table, if the row is there, the job is locked. That probably would have worked just fine! But then I did a quick Google to see what other options were out there and learned about advisory locks. And these seem perfect for what I want to do.
From the PostgreSQL documentation:
While a flag stored in a table could be used for the same purpose, advisory locks are faster, avoid table bloat, and are automatically cleaned up by the server at the end of the session.
This blog post is a rundown of how these locks work.
So I got coding. And here's what my lock function in the Haskell app looks like:
```haskell
withLock :: Pool Connection -> Text -> IO a -> IO a
withLock connPool lockName func =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      let key = hash lockName
      locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
      case listToMaybe locked of
        Just (Only True) -> func
        _ -> throw (LockException lockName)
```
Let's go through it line-by-line:
```haskell
withResource connPool $ \conn ->
  withTransaction conn $ do
```
We use `withResource` to get a connection from the database connection pool, then we use postgresql-simple's `withTransaction` to wrap the rest of the function in a database transaction.
That sounds a little scary at first. Does this mean everything that happens inside the lock is happening in the same transaction? So, if we're charging 500 credit cards and recording 500 receipts, and the 499th charge has an error, is it going to roll back 498 receipts!? That would be terrible!! 😱
Luckily that is not what happens! Only database calls using this connection `conn` will be a part of the transaction. We don't pass `conn` as an argument into `func`, so we know that database calls within that function are going to grab their own connections from the pool. Anything that those connections do will be outside this transaction.
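To make that concrete, here's a sketch of the difference. The table names and the `logStart` helper are invented for illustration; the point is which connection each statement runs on:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Pool (Pool, withResource)
import Database.PostgreSQL.Simple
  (Connection, Only (..), execute, execute_, withTransaction)

-- Invented helper: grabs its own connection from the pool.
logStart :: Pool Connection -> IO ()
logStart pool = withResource pool $ \c ->
  () <$ execute_ c "INSERT INTO job_log (note) VALUES ('started')"

example :: Pool Connection -> Int -> IO ()
example connPool jobId =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      -- This statement runs inside the transaction on `conn`:
      _ <- execute conn "UPDATE jobs SET state = 'running' WHERE id = ?" (Only jobId)
      -- This one runs on a *different* pooled connection, so it commits
      -- independently, even if the outer transaction later rolls back:
      logStart connPool
```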
```haskell
let key = hash lockName
```
Next we `hash` the `lockName` to turn it into an integer that we can give to `pg_try_advisory_xact_lock`.
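A minimal sketch of that step, assuming `hash` is the one from the hashable package's `Data.Hashable` (on a 64-bit GHC, `Int` matches the `bigint` key the lock function expects):

```haskell
import Data.Hashable (hash)
import Data.Text (Text)

-- Turn a human-readable lock name into the integer key that
-- pg_try_advisory_xact_lock expects. Equal names always produce equal
-- keys, so every server agrees on which lock a job name refers to.
lockKey :: Text -> Int
lockKey = hash
```

One caveat: advisory lock keys share a single integer keyspace across the whole database, so two different lock names could in principle hash to the same key and make unrelated jobs contend with each other.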
```haskell
locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
```
Now we'll try to take the lock. The postgresql-simple library's `query` function takes a connection to the database, a query, and an n-tuple of arguments for the query. In this case there's only one argument, but how does the compiler distinguish a 1-tuple from just... a value? postgresql-simple has us wrap it in the `Only` constructor to make the types work.
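For reference, `Only` is roughly this one-field wrapper (paraphrased from postgresql-simple's `Database.PostgreSQL.Simple.Types`; the real definition carries more instances):

```haskell
-- A wrapper that makes a single value look like a 1-tuple of query
-- parameters (or a 1-column result row):
newtype Only a = Only { fromOnly :: a }
  deriving (Eq, Show)
```

Queries with two or more parameters just use ordinary tuples, e.g. `(x, y)`; `Only` exists purely for the one-value case.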
We know that `pg_try_advisory_xact_lock` returns just a single boolean, but as far as postgresql-simple is concerned, any query can return a list of rows containing a list or tuple of columns.
But we just want to know if that single result is true or false. That's where `listToMaybe locked` comes in. `listToMaybe` on a list of "rows" will give us `Just` the first element or `Nothing`. So then we have `Maybe` the single result we're looking for.
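The behavior is easy to see with plain lists standing in for result rows (a tiny, runnable illustration):

```haskell
import Data.Maybe (listToMaybe)

main :: IO ()
main = do
  print (listToMaybe [True, False])   -- Just True: the first "row" of a result set
  print (listToMaybe ([] :: [Bool]))  -- Nothing: the query returned no rows
```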
We only want to call the function inside the lock if the query returned at least one row (it's `Just` something), if there was `Only` one field returned from the database, and if the lock succeeded (the result is `True`). Otherwise we need to do something else.
```haskell
_ -> throw (LockException lockName)
```
Here, we're throwing a custom `LockException`. This will let us handle just these errors if we want to let the user know that a job is already in progress.

Here's what the definition of `LockException` looks like:
```haskell
newtype LockException = LockException Text deriving Show

instance Exception LockException
```
I'm not going to attempt automated tests for this function, because these sorts of tests (trying to force two things to run exactly at once) are just way too flaky.
So what kind of test can I write for this to make sure it works? Here's what I did to test as I developed:
```haskell
testLock :: Pool Connection -> IO ()
testLock connPool = do
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "✅ Successfully took first lock ✅"
            threadDelay 1000000)
          (\(LockException _) -> putStrLn "🚫 Unable to take first lock 🚫")
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "🚫 Successfully took second lock 🚫")
          (\(LockException _) -> putStrLn "✅ Unable to take second lock ✅")
  forkIO $
    catch (withLock connPool "a different lock" $ do
            threadDelay 1000000 -- this delay is just to prevent overlapping output
            putStrLn "✅ Successfully took third lock ✅")
          (\(LockException _) -> putStrLn "🚫 Unable to take third lock 🚫")
  threadDelay 1500000
  forkIO $
    catch (withLock connPool "test lock" $ do
            putStrLn "✅ Successfully took fourth lock ✅")
          (\(LockException _) -> putStrLn "🚫 Unable to take fourth lock 🚫")
  return ()
```
We need to use `forkIO` to create two threads that can run more or less at once. They need to take a little time, so I add a `threadDelay` inside the first job to make sure they'll both try to take the lock. Next I'll make sure I can take a lock that uses a different key. Finally, we'll wait for the first job to finish and try to take a fourth lock on the original key.
This results in:
```
✅ Successfully took first lock ✅
✅ Unable to take second lock ✅
✅ Successfully took fourth lock ✅
✅ Successfully took third lock ✅
```
Yay!
I asked for feedback on this blog post from the Recurse Center community and got this super helpful thought from Kamal Marhubi:
On the locking, the only thing I can think of right now is what might happen if the connection holding the lock is broken. Will the inner work in `func` happily continue running? If so, meanwhile the other machine might try and succeed to grab the lock, and proceed to cause Bad Things to happen. Only when the first machine attempts to release the lock (by exiting the transaction holding the lock) will it learn that Something Went Wrong.
What does happen when the connection is broken? I can simulate it crudely by passing the connection through to `func` and then closing the connection in the middle of my first locked "job".
```haskell
testLostConnection :: Pool Connection -> IO ()
testLostConnection connPool = do
  forkIO $
    catch (withLock connPool "test lock" $ \conn -> do
            close conn
            putStrLn "✅ Successfully took first lock ✅"
            threadDelay 1000000)
          (\(LockException _) -> putStrLn "🚫 Unable to take first lock 🚫")
  forkIO $
    catch (withLock connPool "test lock" $ \_ -> do
            putStrLn "🚫 Successfully took second lock 🚫")
          (\(LockException _) -> putStrLn "✅ Unable to take second lock ✅")
  return ()
```
Sure enough, this does result in:
```
✅ Successfully took first lock ✅
🚫 Successfully took second lock 🚫
```
(Plus some gnarly SQL errors.)
Oh no! This is rough. It looks like an advisory lock is not actually what I'm looking for. I had only heard of advisory locks from a blog post about "application-level" locking, and I'm working on an application, so I thought it would work. (Whoops.) And it kinda does? But it's not reliable enough to be called a "lock". Maybe a "lock" made out of tissue paper.
From Michael Malis:
Advisory locks are good for controlling the concurrency of queries within the database. One example is if you have a linked list in your database. You can use advisory locks to prevent multiple queries from inserting nodes at the same time. Multiple queries inserting at the same time can cause a "lost update". As an example of how the locking would work:
```
BEGIN;
lock linked list
look up head of linked list
insert new element as the new head of the linked list that points to the old head
unlock the linked list
COMMIT;
```
In this case it's ok if the connection dies. That only means the transaction is rolled back and the transaction has no effect at all.
So, if my task consisted of database queries that could run inside the same transaction where I take the lock, and the side-effects were isolated to the database, then this setup would be fine. (Is this true?)
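If I did go that route, the change would be small: hand the job the same connection, so all of its queries join the transaction that holds the lock. A sketch of that variant (`withLockInTxn` is my name for it, not something from the post's codebase; `LockException` is repeated here so the block stands alone):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (Exception, throw)
import Data.Hashable (hash)
import Data.Maybe (listToMaybe)
import Data.Pool (Pool, withResource)
import Data.Text (Text)
import Database.PostgreSQL.Simple
  (Connection, Only (..), query, withTransaction)

newtype LockException = LockException Text deriving Show
instance Exception LockException

-- Variant of withLock that hands the job the transaction's own connection.
withLockInTxn :: Pool Connection -> Text -> (Connection -> IO a) -> IO a
withLockInTxn connPool lockName work =
  withResource connPool $ \conn ->
    withTransaction conn $ do
      let key = hash lockName
      locked <- query conn "SELECT pg_try_advisory_xact_lock(?)" (Only key)
      case listToMaybe locked of
        -- The job receives `conn` itself, so every statement it issues is
        -- part of the same transaction that holds the advisory lock:
        Just (Only True) -> work conn
        _                -> throw (LockException lockName)
```

With this shape, a dropped connection rolls back the job's database effects together with the lock, so the lock and the work can't get out of sync. Side effects outside the database (like charging credit cards) are still unprotected.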
Well, I already knew what I should do before I even started all this. I wanted to break the job up into smaller jobs that could finish faster, reducing the room for error, and make each small part of the job idempotent so it wouldn't matter if one of them was run an extra time or two.
However, that seemed like it would take a long time, and worse, it would be boring. And I was very curious about locks. So, I rationalized this would just be a "quick" fix until I could do the job properly. However, more likely, it would work well enough that I would put off fixing the legacy code until the lock failed catastrophically.
It's fun to learn new things, and I have the freedom and ability to learn something new, apply it to a project, and put the result into production all in the same day. But as the great Uncle Ben once said, "with great power comes great responsibility", and I'm glad to have the amazing resource of the Recurse Center community where I can get feedback on these things!
I found a couple problems with my approach above, but maybe there are others lurking.
I'm a developer (and soon-to-be worker-owner) at Position Development, a worker-run software development company. If you would like to support our work, please let others know about us and consider working with us on your next project!