WORM (Write Once Read Many): use less storage by keeping track of data parts and not writing them multiple times to storage.
In-band deduplication means we handle duplicates before writing a new copy to disk. It can be expensive on the write path, but never uses more space than needed.
Offline deduplication is when deduplication is handled asynchronously, using IOPS when they are available. It is less costly on the write path, but uses more space since data is first written duplicated.
Block level is when we use file blocks, of either fixed or variable size, as the unit of deduplication. It can be inefficient on some data, but does not require knowing the content of the data being deduplicated.
File level is when we use whole file contents; it depends on precise knowledge of the data (partition type, storage layout). It can be tricky with huge files that change a lot (like database files).
Deduplication is built in, or available through third-party tools, in some filesystems like BTRFS, XFS or ReiserFS.
https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe for BTRFS
I think we should handle the deduplication ourselves, to support various storage backends and to leverage this knowledge to limit the data transferred between systems, especially when transferring to S3/Glacier.
In-band, fixed-size block (1-4MB) based deduplication (see the sketch after this list):
- compute a hash of the block
- if the block is not already in the store:
  - optionally encrypt and compress the block
  - store the block in a real filesystem
- store the block usage (which backup uses it, which offset in the source disk) in an index
- when deleting a block, delete it from the index; delete the file only if it is not used anymore
- use a storage-independent lock store (Redis) to protect blocks
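A minimal TypeScript sketch of this flow, under stated assumptions: the `storeBlock`/`deleteBlock` names, the 2MB block size, the two-character hash-prefix folder layout and the in-memory `Map` index are all illustrative; the real index would live in a shared store protected by the Redis lock, and encryption/compression are only marked as comments.

```ts
import { createHash } from 'crypto'
import { promises as fs } from 'fs'
import { join } from 'path'

const BLOCK_SIZE = 2 * 1024 * 1024 // fixed-size blocks, within the 1-4MB range discussed above

interface BlockUsage {
  backupId: string
  offset: number // offset of the block in the source disk
}

// hypothetical in-memory index; the real one must be shared and protected by the lock store (Redis)
const index = new Map<string, BlockUsage[]>()

async function storeBlock(storeDir: string, backupId: string, offset: number, block: Buffer): Promise<string> {
  const hash = createHash('sha256').update(block).digest('hex')
  if (!index.has(hash)) {
    // optionally encrypt and compress the block here before writing
    const dir = join(storeDir, hash.slice(0, 2))
    await fs.mkdir(dir, { recursive: true })
    await fs.writeFile(join(dir, hash), block)
    index.set(hash, [])
  }
  // record which backup uses this block and at which offset in the source disk
  index.get(hash)!.push({ backupId, offset })
  return hash
}

async function deleteBlock(storeDir: string, hash: string, backupId: string): Promise<void> {
  const remaining = (index.get(hash) ?? []).filter(u => u.backupId !== backupId)
  if (remaining.length === 0) {
    // no backup references this block anymore: remove the index entry and the block file
    index.delete(hash)
    await fs.unlink(join(storeDir, hash.slice(0, 2), hash))
  } else {
    index.set(hash, remaining)
  }
}
```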
If native file-level dedup is present: use it.
-
Proxy installed on the remote, with a local FS:
-
overload writeFile / outputStream to (see the sketch after this list):
- compute the hash
- if it does not already exist:
  - really write the file in the dedup folder hierarchy
  - add the checksum as an extended attribute
    setfattr -n user.checksum.sha256 -v 267607e76403760d5a2c07863ae592273105514065c67f7d5e217b9497d5f9fc ./linked.json
    This will be accessible from all hard-linked copies
- else touch the file to mark its content as new (for immutability)
- hardlink the file to its intended place
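A possible shape for the overloaded writeFile, assuming Node.js, a hypothetical `dedupRoot` content-addressed folder and a hash-prefix layout; setfattr is shelled out since Node has no built-in xattr support, and the target is assumed not to exist yet.

```ts
import { createHash } from 'crypto'
import { promises as fs } from 'fs'
import { dirname, join } from 'path'
import { execFile } from 'child_process'
import { promisify } from 'util'

const run = promisify(execFile)

async function dedupWriteFile(dedupRoot: string, target: string, data: Buffer): Promise<void> {
  const hash = createHash('sha256').update(data).digest('hex')
  const stored = join(dedupRoot, hash.slice(0, 2), hash)

  try {
    await fs.access(stored)
    // content already in the store: touch it so it is considered new again (immutability)
    const now = new Date()
    await fs.utimes(stored, now, now)
  } catch {
    // really write the file in the dedup folder hierarchy
    await fs.mkdir(dirname(stored), { recursive: true })
    await fs.writeFile(stored, data)
    // add the checksum as an extended attribute; visible from every hard-linked copy
    await run('setfattr', ['-n', 'user.checksum.sha256', '-v', hash, stored])
  }

  // hardlink the stored file to its intended place
  await fs.link(stored, target)
}
```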
-
overload unlink (WARNING: possible race condition?) (see the sketch after this list):
- get the number of links with stat.nlink
- if the file has exactly 2 links (itself + the source copy in the dedup folder):
  - get the hash from the extended attributes
    getfattr -n user.checksum.sha256 ./turbo.json
  - delete the hard-linked copy in the dedup folder
- delete the local file
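A matching sketch for the overloaded unlink, with the same assumptions (hypothetical `dedupRoot` layout, getfattr shelled out). The race condition noted above is not addressed here: between stat() and the deletions another writer could add or remove a link, so the lock store would be needed around the hash.

```ts
import { promises as fs } from 'fs'
import { join } from 'path'
import { execFile } from 'child_process'
import { promisify } from 'util'

const run = promisify(execFile)

async function dedupUnlink(dedupRoot: string, target: string): Promise<void> {
  const stat = await fs.stat(target)
  if (stat.nlink === 2) {
    // only this file and the dedup store copy remain: drop the store copy too
    const { stdout } = await run('getfattr', ['--only-values', '-n', 'user.checksum.sha256', target])
    const hash = stdout.trim()
    await fs.unlink(join(dedupRoot, hash.slice(0, 2), hash))
  }
  // finally delete the local file itself
  await fs.unlink(target)
}
```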
-
rename (see the sketch after this list):
- if dest exists:
  - unlink it properly before writing the new file
- move the link; no need to update counters, we have the same number of links after
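And the rename counterpart, reusing the `dedupUnlink` sketch above; moving the link with fs.rename keeps the link count of the content unchanged, so no counters need updating.

```ts
import { promises as fs } from 'fs'

async function dedupRename(dedupRoot: string, from: string, to: string): Promise<void> {
  try {
    // if dest exists, unlink it properly so its dedup store copy is cleaned up
    await dedupUnlink(dedupRoot, to) // from the unlink sketch above
  } catch (error: any) {
    if (error.code !== 'ENOENT') throw error // dest did not exist: nothing to clean up
  }
  // move the link; the content keeps the same number of links
  await fs.rename(from, to)
}
```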
Bonus: merge only deletes blocks if they are not used anywhere else, which should speed it up.