Scaling file uploads, storage, retrieval, & background processing

Storage

Object storage systems such as OpenStack Swift, MinIO, and Ceph are worth considering. They offer advantages in scalability and accessibility, plus features such as arbitrary object metadata. Most also expose S3-compatible APIs, which helps if existing tools or libraries already speak S3.

If we instead need or want to read files from a traditional mounted filesystem, GlusterFS is one option worth considering. With its Distributed Replicated configuration, storage space can be scaled by adding more disks, and access speed can be scaled by adding more nodes. It is supported by Kubernetes and offers asynchronous replication, typically used to sync data to another datacenter. It can back replicated applications through the ReadWriteMany access mode.

Resource versioning

Using an object store: versioning is built in; record the version ID in the HTTP server's database.
Using a file store: too many files in a single directory can become an operational bottleneck. A directory tree of the form user_id/resource_id/version_id/filename.txt lets the relative file path be computed on the fly from information in the HTTP server's database.
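
As a minimal sketch of the file store layout above, the relative path can be derived from values already held in the database. The function name versioned_path and the filename check are illustrative assumptions, not part of any existing codebase.

```python
from pathlib import PurePosixPath


def versioned_path(user_id: int, resource_id: int, version_id: int, filename: str) -> PurePosixPath:
    """Build the relative path user_id/resource_id/version_id/filename."""
    # Reject names that could escape the tree (e.g. "../x" or "a/b").
    if filename in ("", ".", "..") or "/" in filename or "\\" in filename:
        raise ValueError(f"unsafe filename: {filename!r}")
    return PurePosixPath(str(user_id), str(resource_id), str(version_id), filename)


print(versioned_path(42, 7, 3, "report.txt"))  # prints 42/7/3/report.txt
```

Because the path is a pure function of database columns, nothing beyond the IDs and the filename needs to be stored to locate a given version.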

Post-upload background processing

We want to ensure certain actions are taken for each uploaded file, for example virus scanning, parsing and storing metadata, or resizing images. A message broker (also known as a message queue) with acknowledgments and at-least-once delivery is a strong fit for this.

Using a message broker could look something like this:

Upload logic in HTTP server

Receive PUT request from user, do not return yet
Store file
- If failure between this step and the completion of the next two is a concern, consider:
  - (using an object store): set a short TTL on the object after which it will be automatically removed, and then update that TTL to infinite after the next two steps.
  - (using a file store): first take out an etcd distributed lock named after the filename, version, and user ID. If the next steps fail, the PUT request fails and the user tries again; on the retry the lock already exists but no file does, so the upload continues. If the next steps succeed, remove the lock.
Add file info to the database to support GET requests and the like. Use transactions for replica safety.
Send a message containing the object ID or file path to each action's queue (e.g. the scan_for_viruses queue, the load_metadata queue), waiting for acknowledgment that each message was received.
Return success to PUT request
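
The upload steps above can be sketched as follows. This is only an illustration under stated assumptions: InMemoryStore, InMemoryDatabase, InMemoryBroker, UploadError, and handle_put are hypothetical stand-ins for a real object or file store, database, and broker client, and the boolean returned by publish stands in for the broker's receipt acknowledgment.

```python
class UploadError(Exception):
    """Raised when a step fails; the HTTP layer would turn this into a failed PUT."""


class InMemoryStore:
    def __init__(self):
        self.files = {}

    def put(self, path, data):
        self.files[path] = data


class InMemoryDatabase:
    def __init__(self):
        self.rows = {}

    def insert_file(self, path, size):
        self.rows[path] = {"size": size}


class InMemoryBroker:
    def __init__(self):
        self.queues = {}

    def publish(self, queue, body):
        # Returning True stands in for the broker confirming receipt.
        self.queues.setdefault(queue, []).append(body)
        return True


ACTION_QUEUES = ("scan_for_viruses", "load_metadata")


def handle_put(store, db, broker, path, data):
    """Run the upload steps in order; return only after every step succeeded."""
    store.put(path, data)                # store file
    db.insert_file(path, len(data))      # add file info to the database
    for queue in ACTION_QUEUES:          # one message per action queue
        if not broker.publish(queue, path):
            raise UploadError(f"broker did not confirm message on {queue}")
    return 201                           # only now report success to the client


store, db, broker = InMemoryStore(), InMemoryDatabase(), InMemoryBroker()
status = handle_put(store, db, broker, "42/7/3/report.txt", b"hello")
print(status, sorted(broker.queues))
```

Any exception raised before the final return means the client never sees a success response, which is what lets the user's retry act as the recovery mechanism.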

Queue logic in processing worker

Receive message from queue, do not acknowledge yet
Process (e.g. scan for viruses, load metadata, etc.)
Inform the broader system of completion. Avoid the tight coupling that connecting directly to the HTTP server's database would create. Two options are:
- (using an object store): update the object metadata with info like "scanned for viruses on 7/12/2018".
- (using an object store or file store): send messages to separate "done" queues to which the HTTP server subscribes, updating the database accordingly (e.g. the scan_for_viruses_done queue).
Acknowledge original message, profit!
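
The worker steps above, including what happens when processing fails before the acknowledgment, can be sketched with a toy in-memory queue. InMemoryQueue, work_once, and flaky_scan are hypothetical names for illustration; requeue_unacked stands in for a real broker redelivering a message that was never acknowledged.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: a message stays pending until it is explicitly acknowledged."""

    def __init__(self):
        self.ready = deque()
        self.unacked = {}
        self._next_tag = 0

    def publish(self, body):
        self.ready.append(body)

    def receive(self):
        body = self.ready.popleft()
        self._next_tag += 1
        self.unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def requeue_unacked(self):
        # Stands in for the broker redelivering after a worker dies or times out.
        for tag in sorted(self.unacked):
            self.ready.append(self.unacked.pop(tag))


def work_once(queue, done_queue, process):
    """Receive one message, process it, report completion, then acknowledge."""
    tag, path = queue.receive()      # received, not yet acknowledged
    result = process(path)           # may raise; the message then stays unacked
    done_queue.publish({"path": path, "result": result})
    queue.ack(tag)                   # acknowledge last


attempts = []


def flaky_scan(path):
    # Fails on the first attempt to simulate a worker crash mid-processing.
    attempts.append(path)
    if len(attempts) == 1:
        raise RuntimeError("scanner crashed")
    return "clean"


scan_queue, scan_done = InMemoryQueue(), InMemoryQueue()
scan_queue.publish("42/7/3/report.txt")

try:
    work_once(scan_queue, scan_done, flaky_scan)
except RuntimeError:
    scan_queue.requeue_unacked()     # the unacknowledged message is redelivered

work_once(scan_queue, scan_done, flaky_scan)
print(len(attempts), list(scan_done.ready))
```

Acknowledging only after the "done" message is published means a crash at any earlier point leaves the original message eligible for redelivery.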

Notes

Failure anywhere in the processing workers will cause the message broker to redeliver the message that was not acknowledged.
Failure anywhere in the upload logic will cause the PUT request to fail, and the user will retry.
Both sides can run as multiple replicas and scale horizontally.
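
One consequence of redelivery is that the same message may be handled more than once, so handlers are easier to reason about if repeating them is harmless. Below is a minimal sketch, assuming the HTTP server records "done" messages in a SQL table; the table file_actions and the function record_done are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_actions ("
    " path TEXT NOT NULL,"
    " action TEXT NOT NULL,"
    " result TEXT NOT NULL,"
    " PRIMARY KEY (path, action))"
)


def record_done(conn, path, action, result):
    """Record a completed action; replaying the same message leaves one row."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO file_actions (path, action, result)"
            " VALUES (?, ?, ?)",
            (path, action, result),
        )


# The same "done" message delivered twice, as at-least-once delivery allows.
for _ in range(2):
    record_done(conn, "42/7/3/report.txt", "scan_for_viruses", "clean")

print(conn.execute("SELECT COUNT(*) FROM file_actions").fetchone()[0])  # prints 1
```

Keying the row on (path, action) makes a duplicate delivery overwrite the existing row instead of adding a second one.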

snarlysodboxer/scaling_flat_file_ops.md