WORM (Write Once Read Many): use less storage by keeping track of data parts and not writing them multiple times to storage.
In-band deduplication means we handle duplicates before writing a new copy to disk. It can be expensive on the write path, but never uses more space than needed.
Offline deduplication is when deduplication is handled asynchronously, using IOPS when they are available. It is less costly on the write path, but uses more space since data is first written duplicated.
Block level is when we use file blocks, of either fixed or variable size, as the unit of deduplication. It can be inefficient on some data, but does not require knowing the content of the data being deduplicated.
File level is when we use whole file contents; it depends on precise knowledge of the data (partition type, storage layout). It can be tricky with huge files that change a lot (like database files).
Deduplication is built in, or available through third-party tools, in some filesystems like BTRFS, XFS or ReiserFS.
https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe for BTRFS
I think we should handle the deduplication ourselves, to support various storage backends and to leverage this knowledge to limit the data transferred between systems, especially when transferring to S3/Glacier.
In-band, fixed-size block (1-4MB) based deduplication (see the sketch after this list):
- compute a hash of the block
- if the block is not already in the store:
  - optionally encrypt and compress the block
  - store the block in a real filesystem
- store the block usage (which backup uses it, which offset in the source disk) in an index
- when deleting a block, delete it from the index; delete the file only if it is not used anymore
- use a storage-independent lock store (Redis) to protect blocks
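A minimal TypeScript sketch of this flow, under stated assumptions: the `storeBlock`/`deleteBlock` names, the 2MB block size, the two-character hash-prefix folder layout and the in-memory `Map` index are all illustrative; the real index would live in a shared store protected by the Redis lock, and encryption/compression are only marked as comments.

```ts
import { createHash } from 'crypto'
import { promises as fs } from 'fs'
import { join } from 'path'

const BLOCK_SIZE = 2 * 1024 * 1024 // fixed-size blocks, within the 1-4MB range discussed above

interface BlockUsage {
  backupId: string
  offset: number // offset of the block in the source disk
}

// hypothetical in-memory index; the real one must be shared and protected by the lock store (Redis)
const index = new Map<string, BlockUsage[]>()

async function storeBlock(storeDir: string, backupId: string, offset: number, block: Buffer): Promise<string> {
  const hash = createHash('sha256').update(block).digest('hex')
  if (!index.has(hash)) {
    // optionally encrypt and compress the block here before writing
    const dir = join(storeDir, hash.slice(0, 2))
    await fs.mkdir(dir, { recursive: true })
    await fs.writeFile(join(dir, hash), block)
    index.set(hash, [])
  }
  // record which backup uses this block and at which offset in the source disk
  index.get(hash)!.push({ backupId, offset })
  return hash
}

async function deleteBlock(storeDir: string, hash: string, backupId: string): Promise<void> {
  const remaining = (index.get(hash) ?? []).filter(u => u.backupId !== backupId)
  if (remaining.length === 0) {
    // no backup references this block anymore: remove the index entry and the block file
    index.delete(hash)
    await fs.unlink(join(storeDir, hash.slice(0, 2), hash))
  } else {
    index.set(hash, remaining)
  }
}
```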
If native file-level dedup is present: use it.
-
Proxy installed on the remote, with a local FS:
-
overload writeFile / outputStream to (see the sketch after this list):
- compute the hash
- if it does not already exist:
  - really write the file in the dedup folder hierarchy
  - add the checksum as an extended attribute
    setfattr -n user.checksum.sha256 -v 267607e76403760d5a2c07863ae592273105514065c67f7d5e217b9497d5f9fc ./linked.json
    This will be accessible from all hard-linked copies
- else touch the file to mark its content as new (for immutability)
- hardlink the file to its intended place
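A possible shape for the overloaded writeFile, assuming Node.js, a hypothetical `dedupRoot` content-addressed folder and a hash-prefix layout; setfattr is shelled out since Node has no built-in xattr support, and the target is assumed not to exist yet.

```ts
import { createHash } from 'crypto'
import { promises as fs } from 'fs'
import { dirname, join } from 'path'
import { execFile } from 'child_process'
import { promisify } from 'util'

const run = promisify(execFile)

async function dedupWriteFile(dedupRoot: string, target: string, data: Buffer): Promise<void> {
  const hash = createHash('sha256').update(data).digest('hex')
  const stored = join(dedupRoot, hash.slice(0, 2), hash)

  try {
    await fs.access(stored)
    // content already in the store: touch it so it is considered new again (immutability)
    const now = new Date()
    await fs.utimes(stored, now, now)
  } catch {
    // really write the file in the dedup folder hierarchy
    await fs.mkdir(dirname(stored), { recursive: true })
    await fs.writeFile(stored, data)
    // add the checksum as an extended attribute; visible from every hard-linked copy
    await run('setfattr', ['-n', 'user.checksum.sha256', '-v', hash, stored])
  }

  // hardlink the stored file to its intended place
  await fs.link(stored, target)
}
```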
-
overload unlink (WARNING: possible race condition?) (see the sketch after this list):
- get the number of links with stat.nlink
- if the file has exactly 2 links (itself + the source copy in the dedup folder):
  - get the hash from the extended attributes
    getfattr -n user.checksum.sha256 ./turbo.json
  - delete the hard-linked copy in the dedup folder
- delete the local file
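A matching sketch for the overloaded unlink, with the same assumptions (hypothetical `dedupRoot` layout, getfattr shelled out). The race condition noted above is not addressed here: between stat() and the deletions another writer could add or remove a link, so the lock store would be needed around the hash.

```ts
import { promises as fs } from 'fs'
import { join } from 'path'
import { execFile } from 'child_process'
import { promisify } from 'util'

const run = promisify(execFile)

async function dedupUnlink(dedupRoot: string, target: string): Promise<void> {
  const stat = await fs.stat(target)
  if (stat.nlink === 2) {
    // only this file and the dedup store copy remain: drop the store copy too
    const { stdout } = await run('getfattr', ['--only-values', '-n', 'user.checksum.sha256', target])
    const hash = stdout.trim()
    await fs.unlink(join(dedupRoot, hash.slice(0, 2), hash))
  }
  // finally delete the local file itself
  await fs.unlink(target)
}
```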
-
rename (see the sketch after this list):
- if dest exists:
  - unlink it properly before writing the new file
- move the link; no need to update counters, we have the same number of links after
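And the rename counterpart, reusing the `dedupUnlink` sketch above; moving the link with fs.rename keeps the link count of the content unchanged, so no counters need updating.

```ts
import { promises as fs } from 'fs'

async function dedupRename(dedupRoot: string, from: string, to: string): Promise<void> {
  try {
    // if dest exists, unlink it properly so its dedup store copy is cleaned up
    await dedupUnlink(dedupRoot, to) // from the unlink sketch above
  } catch (error: any) {
    if (error.code !== 'ENOENT') throw error // dest did not exist: nothing to clean up
  }
  // move the link; the content keeps the same number of links
  await fs.rename(from, to)
}
```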
Bonus: merge only deletes blocks if they are not used anywhere else, which should speed it up.