While our new RDF::Repository implementation theoretically improves the concurrency story for RDF.rb, it isn't, in itself, thread safe. The underlying data representation may be purely functional, but the Repository itself is swimming in shared mutable state. Specifically, we have the potential for a data race during execution of code like @data = data; and, more generally, for race conditions wherever our changes depend on previous reads. Notably, this affects #transaction, as demonstrated in the following snippet:
require 'rdf'

repo = RDF::Repository.new
threads = []
err_count = 0

# make 10 threads, processing 1000 transactions each
10.times do |n|
  threads << Thread.new do
    1_000.times do |i|
      begin
        repo.transaction(mutable: true) do
          # insert a unique statement for each transaction
          insert RDF::Statement("thread_#{n}".to_sym,
                                RDF::URI('http://example.com/num'),
                                i)
        end
      rescue RDF::Transaction::TransactionError
        # count up the statements that fail in execution
        err_count += 1
      end
    end
  end
end

threads.each(&:join)

# not even close to 10_000!
repo.count + err_count # => 5587
(Running this in your environment may yield different results. You may even see the expected 10_000. Nevertheless, trust me, this code is not safe.)
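To see the shape of the problem without RDF.rb in the picture, here is a minimal sketch of the same read-then-swap pattern using only the standard library. ToyRepo and unsafe_insert are hypothetical names made up for illustration, not RDF.rb's API, and the Thread.pass call is there only to invite a context switch and widen the race window.

```ruby
require 'set'

# Hypothetical stand-in for a repository backed by an immutable snapshot.
# Not RDF.rb's implementation; it only mimics the read-then-swap shape.
class ToyRepo
  def initialize
    @data = Set.new.freeze
  end

  def count
    @data.size
  end

  # Read the current snapshot, build a new one from it, then swap it in.
  # Any write that lands between the read and the swap is silently lost.
  def unsafe_insert(item)
    snapshot = @data                    # read
    Thread.pass                         # deliberately widen the race window
    @data = (snapshot | [item]).freeze  # write, based on a possibly stale read
  end
end

repo = ToyRepo.new

# 10 threads, 100 unique inserts each
threads = 10.times.map do |n|
  Thread.new do
    100.times { |i| repo.unsafe_insert([n, i]) }
  end
end
threads.each(&:join)

# Typically well short of 1_000; the exact number depends on scheduling
puts repo.count
```

Each individual snapshot is immutable and perfectly safe to read; it is the reassignment of @data from a stale read that drops updates.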
The good news is that races are reasonably isolated. Any dreams of perfectly asynchronous concurrency are dashed, but the need for synchronization is minimized. For transactions, we need only synchronize #execute; in place of the transaction block above, we have:
# ...
begin
  tx = repo.transaction(mutable: true)
  tx.insert RDF::Statement("thread_#{n}".to_sym,
                           RDF::URI('http://example.com/num'),
                           i)
  # mutex is a single Mutex shared by all threads, created before they start
  mutex.synchronize { tx.execute }
rescue RDF::Transaction::TransactionError
  # ...
Still, as an implementation-specific solution, this leaves something to be desired. Giving more thought to thread safety will likely uncover better options.
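One possible direction, sketched here with toy classes rather than RDF.rb itself, is to keep the lock inside the repository's commit step so that callers don't have to bring their own mutex. ToyRepo, ToyTransaction, and commit are hypothetical names for illustration only, and I'm not claiming this is how RDF.rb does or should do it.

```ruby
require 'set'

# Hypothetical toy repository; not RDF.rb's API.
class ToyRepo
  def initialize
    @data = Set.new.freeze
    @lock = Mutex.new
  end

  def count
    @data.size
  end

  # Transactions collect their changes without holding any lock...
  def transaction
    ToyTransaction.new(self)
  end

  # ...and only the read-modify-write of the snapshot is serialized.
  def commit(additions)
    @lock.synchronize do
      @data = (@data | additions).freeze
    end
  end
end

# Hypothetical transaction that buffers inserts until #execute.
class ToyTransaction
  def initialize(repo)
    @repo = repo
    @additions = []
  end

  def insert(item)
    @additions << item
    self
  end

  def execute
    @repo.commit(@additions)
  end
end

repo = ToyRepo.new

# 10 threads, 100 single-insert transactions each
threads = 10.times.map do |n|
  Thread.new do
    100.times do |i|
      tx = repo.transaction
      tx.insert([n, i])
      tx.execute
    end
  end
end
threads.each(&:join)

puts repo.count # => 1000
```

The critical section stays as small as in the snippet above (just the swap), but the synchronization becomes the repository's responsibility instead of every caller's.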
A proof of concept for a thread safe transaction is up at ruby-rdf/rdf@c8080e2