- Merging is resource-intensive
- Multiple processes modify the VHD directory, creating some interesting race conditions (hi hanjo)
- don't move files, only update indexes
- use an existing solution to handle concurrency: a database
disk (id*, label, status) // status: e.g. DELETING, used by the merge and vacuum flows below
blockAddress (id*, hash, diskId, offset) // (diskId, offset) is unique
// don't use an id here: we want to ensure the hash of an entry is never updated
// also, SQLite needs a unique constraint on the columns referenced by a foreign key,
// so we can't use blockStorage directly as the foreign key target of blockAddress
block (hash*)
blockStorage (hash, storage, blockStatusId) // (hash, storage) is unique
blockStatus (id*, label) // UPLOADING, CREATING, CREATED, DELETING, ...
storage (id*, label)
lock (path*, from, by, taskType {UPLOADING, DELETING, ...}) // locks are handled in the database; from is a datetime, by is the process holding the lock
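A minimal SQLite DDL sketch of this schema; the column types, the quoting of reserved words, and the exact constraints are assumptions:

```sql
CREATE TABLE disk (
  id INTEGER PRIMARY KEY,
  label TEXT,
  status TEXT -- e.g. 'DELETING', set by the merge flow below
);
CREATE TABLE block (
  hash TEXT PRIMARY KEY -- natural key: the hash of an entry is never updated
);
CREATE TABLE blockStatus (
  id INTEGER PRIMARY KEY,
  label TEXT -- 'UPLOADING', 'CREATING', 'CREATED', 'DELETING', ...
);
CREATE TABLE storage (
  id INTEGER PRIMARY KEY,
  label TEXT
);
CREATE TABLE blockStorage (
  hash TEXT REFERENCES block (hash),
  storage INTEGER REFERENCES storage (id),
  blockStatusId INTEGER REFERENCES blockStatus (id),
  UNIQUE (hash, storage)
);
CREATE TABLE blockAddress (
  id INTEGER PRIMARY KEY,
  hash TEXT REFERENCES block (hash), -- references block, not blockStorage
  diskId INTEGER REFERENCES disk (id),
  "offset" INTEGER,
  UNIQUE (diskId, "offset")
);
CREATE TABLE lock (
  path TEXT PRIMARY KEY,
  "from" TEXT, -- datetime of the last refresh
  by TEXT, -- process holding the lock
  taskType TEXT -- 'UPLOADING', 'DELETING', ...
);
```

As a usage example, the "block already exists with a success state" check from the backup flow below could then be:

```sql
SELECT 1
FROM blockStorage
JOIN blockStatus ON blockStatus.id = blockStorage.blockStatusId
WHERE blockStorage.hash = :hash
  AND blockStorage.storage = :storageId
  AND blockStatus.label = 'CREATED';
```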
save the VHD metadata in the database
for each block of the VHD
    if the block does not exist in the DB with a success state
        obtain a file lock (handling stale locks)
        if the block is still missing
            create it with status UPLOADING, its hash and its size
            upload it (refreshing the lock every minute)
            update the status to CREATED
        else
            // already uploaded by another backup, nothing to do
        dispose of the lock
    add the block to the blockAddress table
upload a flat file (blockIndex => hash) plus the VHD metadata to ensure restorability even if the database is broken
the flat file also contains the ancestors' blocks, ensuring it is not invalidated when ancestors are merged. Its max size would be 32 MB for a 2 TB VHD (1 M blocks of 2 MB, one 32-byte hash each). It can be generated in one query using the WITH keyword https://www.sqlite.org/lang_with.html (see the sketch below)
the flat file should have the *.vhd extension and sit at the same path as today's backups, allowing it to be restored from an XO that doesn't have the database, as long as the installation has a VHD class able to read it
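A sketch of that query; the parentId column on disk (to model the VHD chain) is an assumption not present in the schema above, and the bare-column trick is SQLite-specific:

```sql
-- chain: the disk and all its ancestors, closest first
WITH RECURSIVE chain(id, depth) AS (
  SELECT id, 0 FROM disk WHERE id = :diskId
  UNION ALL
  SELECT disk.parentId, chain.depth + 1
  FROM disk JOIN chain ON disk.id = chain.id
  WHERE disk.parentId IS NOT NULL
)
-- for each offset, keep the hash from the closest disk in the chain;
-- SQLite's bare-column-with-MIN rule returns the hash of the row with MIN(depth)
SELECT blockAddress."offset", blockAddress.hash, MIN(chain.depth)
FROM blockAddress
JOIN chain ON chain.id = blockAddress.diskId
GROUP BY blockAddress."offset"
ORDER BY blockAddress."offset";
```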
BEGIN TRANSACTION;
-- repoint to the child every parent block that the child does not already override
UPDATE blockAddress
SET diskId = <childDiskId>
WHERE diskId = <parentDiskId>
    AND "offset" IN (
        SELECT parent."offset"
        FROM blockAddress parent
        LEFT JOIN blockAddress child
            ON child.diskId = <childDiskId>
            AND child."offset" = parent."offset"
        WHERE parent.diskId = <parentDiskId>
            AND child.diskId IS NULL -- not already on the child
    );
-- here, all blocks still in the parent's blockAddress are blocks that should be deleted
UPDATE disk SET status = 'DELETING' WHERE id = <parentDiskId>;
DELETE FROM blockAddress WHERE diskId = <parentDiskId>;
COMMIT;
// here the UI is already consistent
launch the vacuuming
// only now do we really get the space back
- using https://www.sqlite.org/lang_with.html it's possible to construct the block list in one pass from the database
- it's also possible to use the flat file as a single source of truth
- predict the space freed by querying the database for blocks only used by this disk and not referenced by any child (see the query sketch after this list)
- delete the disk and the blockAddresses linked to the backups
- launch the vacuuming
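A sketch of that prediction, assuming the 2 MB VHD block size mentioned below and a hypothetical :deletedDiskId parameter:

```sql
-- blocks addressed only by the disk being deleted and by no other disk
SELECT COUNT(*) * 2 * 1024 * 1024 AS freedBytes -- assuming 2 MB blocks
FROM blockAddress
WHERE diskId = :deletedDiskId
  AND hash NOT IN (
    SELECT hash FROM blockAddress WHERE diskId <> :deletedDiskId
  );
```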
after merge / delete backup
- list all disks with status DELETING
- remove the disks' metadata and files
- delete the disk records from the database
- obtain a lock on a batch of blocks without any blockAddress (a batch query is sketched below)
- delete the files that were successfully locked
- remove the block records from the database
- dispose of the locks
This process is the only one that can delete data from the remote. It should run in an isolated process with specific permissions, and ensure as much data consistency as possible before deleting any file.
INSERT INTO lock (path, taskType) -- taskType lets us restart a broken locked process
SELECT hash, 'DELETING' -- or 'UPLOADING'
FROM block
WHERE hash = ?
RETURNING path
The unicity constraint on the primary key makes this return an error if the lock is already held by anyone else.
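A possible batch variant for the vacuum step above; the batch size, the :processId parameter and the skip-already-locked clause are assumptions:

```sql
-- lock a batch of orphan blocks (blocks that no blockAddress references)
INSERT INTO lock (path, "from", by, taskType)
SELECT hash, datetime('now'), :processId, 'DELETING'
FROM block
WHERE hash NOT IN (SELECT hash FROM blockAddress)
  AND hash NOT IN (SELECT path FROM lock) -- skip blocks already locked
LIMIT 100
RETURNING path; -- RETURNING needs SQLite >= 3.35
```

A concurrent process can still win the race between the check and the insert; the resulting unique-constraint error is exactly the signal described above.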
- get the oldest lock (see the query sketch after this list)
- if the lock is DELETING
    - if the delete succeeded: remove the lock
    - else
        - if the file is ok: remove the lock
        - else: mark the VHDs depending on this block as broken (theoretically none)
- else (the lock is UPLOADING)
    - if the file is ok on a remote (correct size and hash)
        - remove the lock
    - else
        - delete both the file and the lock if the file is incomplete
        - delete the lock if the file is complete
        - mark the VHDs depending on this block as broken
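Finding the oldest, potentially stale lock is a single query; a sketch, where the 10-minute threshold is an assumption (uploads refresh their lock every minute):

```sql
-- a lock not refreshed for 10 minutes is assumed stale
SELECT path, by, taskType
FROM lock
WHERE "from" < datetime('now', '-10 minutes')
ORDER BY "from" ASC
LIMIT 1;
```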
- using the hash (and size) of the block as the block key gives us free deduplication
- tracking which blocks are on each remote enables scenarios where we upload to a faster SR and then move the data to a slower one (think of a local backup uploaded to S3, then to Glacier)
- blocks are never modified or moved, and can easily be checked against corruption or external modification
- indexes are not modified by merging
since no file modification occurs anymore, we can rely on S3 object lock to give our users a strong guarantee, with legal value. It's a key feature of ransomware mitigation tools
users can also activate legal hold, preventing us from deleting any file on the remote. And we can expose an API that extracts the list of files to be deleted in an AWS-compatible format, allowing the user to delete unneeded files with an external tool, without ever granting this permission to XO
having the data in a database also allows us to show some fancy visualisations of job processes and progress, without even reworking the Tasks
- database structure changes mean migrations, and they are tedious, especially on a 10 GB+ database
- backing up this database can't rely on the current backup process. We need to run a database export regularly, plus a reindex process that can rebuild the database from the listing and the BATs
- the smaller the block, the bigger the database. With 2 MB blocks like VHD that means 500 K records per TB, but with 1 KB blocks it means 1 G records => not really usable as is for NBD
- SQLite doesn't really like remote access, which means we have to think of something for the proxies, or use a full-fledged database (PostgreSQL for example)
- backup speed may be lower when writing to an SR with low concurrency, but I'm betting that dedup + compression + no merge will speed up the process globally
SQLite handles a database with ~10 M blocks quite well, which means 20 TB of backups (and with a dedup ratio of 5 and a compression ratio of 2, the user sees it as 200 TB of backups). The resulting SQLite file is 15 GB, about 10,000× smaller than the logical data it indexes.
I would advise using a client-server database like PostgreSQL: this will allow more flexibility and more security.
Let's take a scenario where the database is corrupted by bad hardware, a human error or a ransomware.
We assume that the file storage is safe thanks to object lock.
- list all blocks per remote and fill block and blockStorage
- list all VMs -> disks and fill the disk table
- list all BATs and fill blockAddress
- if a blockAddress references a missing block => mark the VHD as broken (see the query sketch below)
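The missing-block check can be done in one query against the rebuilt tables; a sketch, assuming the CREATED status label from the schema above:

```sql
-- disks whose blockAddress references a hash with no CREATED copy on any storage
SELECT DISTINCT blockAddress.diskId
FROM blockAddress
LEFT JOIN blockStorage
  ON blockStorage.hash = blockAddress.hash
  AND blockStorage.blockStatusId =
    (SELECT id FROM blockStatus WHERE label = 'CREATED')
WHERE blockStorage.hash IS NULL;
```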
- create an nbdkit plugin to read from our exploded VHD: https://archive.fosdem.org/2019/schedule/event/nbdkit/ . Don't forget to handle authentication. Example for VMDK: https://rwmj.wordpress.com/2021/01/01/read-and-writing-vmware-vmdk-disks/
- use the compression filter https://www.libguestfs.org/nbdkit-gzip-plugin.1.html to decompress
- decrypt it: https://rwmj.wordpress.com/2022/05/14/nbdkit-now-supports-luks-encryption/
- use the partition filter https://libguestfs.org/nbdkit-partitioning-plugin.1.html
- mount this NBD device
- do our magic with the partition: mount it locally and list/restore the files
1. choose the database (in fact it will be PostgreSQL)
2. create a subclass of VhdAbstract called VhdDirectoryIndexed, reading from and writing to the database; update openVhd
3. update createVhdDirectoryFromStream
4. test backups, restoration, dedup. Can be released in alpha
5. implement a merge fast track when both VHDs are VhdDirectoryIndexed
6. implement vacuum
7. measure the speed gain against 4. Test with an immutable FS. Can be released in beta
8. implement data tiering (replication)
9. implement data buffering (backup to a fast remote, then move all the data to a slower and cheaper one)
10. implement restoration without the database (disaster recovery)
11. implement file-level restoration
Alternative: don't use the VHD format (with its parsing) but the block API directly: vatesfr/xen-orchestra#3123