Run `go run main.go` to test:

```
2023/04/25 17:02:26.660597 blob size is 10592348960 bytes, which is ~ 10101 MB
2023/04/25 17:02:26.660668 reading 1024 bytes from azblob from offset 10592347936
2023/04/25 17:02:26.680128 reading 20 bytes from azblob from offset 10592348918
2023/04/25 17:02:26.695184 reading 56 bytes from azblob from offset 10592348862
2023/04/25 17:02:26.710803 reading 4096 bytes from azblob from offset 10592340479
2023/04/25 17:02:26.745270 reading 4096 bytes from azblob from offset 10592344575
2023/04/25 17:02:26.759124 reading 289 bytes from azblob from offset 10592348671
2023/04/25 17:02:26.775641 archive contains 102 files
2023/04/25 17:02:26.775685 found wanted file: files/hello.txt => 12 bytes
2023/04/25 17:02:26.775699 reading 30 bytes from azblob from offset 104880592
2023/04/25 17:02:26.792228 reading 12 bytes from azblob from offset 104880665
2023/04/25 17:02:26.808158 ----start content from blob
hello world
2023/04/25 17:02:26.808210 ----end content from blob
```
Despite this zip file being ~10GB in blob storage, we only need to download the specific portion of the file we're looking for (plus some extra bytes for metadata). By implementing an `io.ReaderAt`, we can leverage the power of Go's standard library zip package (`archive/zip`), specifically `zip.NewReader`.
Implementing this reader is pretty simple, since the blob streaming API already supports reading an arbitrary byte range via an offset and count:
```go
func (b *BlobReader) ReadAt(p []byte, off int64) (n int, err error) {
	// Request only the byte range we were asked for.
	httpRange := blob.HTTPRange{
		Offset: off,
		Count:  int64(len(p)),
	}
	res, err := b.client.DownloadStream(context.Background(), &blob.DownloadStreamOptions{
		Range: httpRange,
	})
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()
	// io.ReadFull satisfies the io.ReaderAt contract: it returns a non-nil
	// error whenever fewer than len(p) bytes are read.
	return io.ReadFull(res.Body, p)
}
```
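For context, here's a minimal sketch of how this wires into `zip.NewReader` (paired with the `ReadAt` method above, and without the extra logging from the test output). The `BlobReader` struct shape is an assumption; the blob client and `archive/zip` calls are the real APIs involved:

```go
package main

import (
	"archive/zip"
	"context"
	"io"
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blob"
)

// Assumed shape of the reader: it just wraps the SDK's blob client.
type BlobReader struct {
	client *blob.Client
}

func main() {
	// The blob is public, so no credential is needed to read it.
	client, err := blob.NewClientWithNoCredential("https://robstorage123.blob.core.windows.net/zip-test/archive.zip", nil)
	if err != nil {
		log.Fatal(err)
	}

	// zip.NewReader needs the total size so it can locate the central
	// directory at the end of the archive.
	props, err := client.GetProperties(context.Background(), nil)
	if err != nil {
		log.Fatal(err)
	}

	zr, err := zip.NewReader(&BlobReader{client: client}, *props.ContentLength)
	if err != nil {
		log.Fatal(err)
	}

	// Only the byte ranges for the file we open are actually downloaded.
	for _, f := range zr.File {
		if f.Name != "files/hello.txt" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(os.Stdout, rc); err != nil {
			log.Fatal(err)
		}
		rc.Close()
		break
	}
}
```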
With the power of the ZIP format's central directory, the standard library's `zip.Reader` can efficiently read small portions of the file to find the content we're looking for. You can see this in the log output above: the first handful of reads all target the last few kilobytes of the ~10GB blob, where the end-of-central-directory record and the central directory itself live.
Make a zip containing one large 10GB file, a bunch of 1MB files, and a small file we want to read:
```sh
mkdir -p files
# make a 10GB file
dd if=/dev/urandom of=files/large_file bs=1m count=10000
# make a bunch of 1MB files
for i in {1..100}; do
  dd if=/dev/urandom of="files/file${i}" bs=1m count=1
done
# make an individual file that we want to read
echo "hello world" > files/hello.txt
# zip them together (may take a bit)
zip archive.zip files/*
```
Since we read from `/dev/urandom`, these files don't compress well, which is great for our test:

```
$ ls -lah archive.zip
-rw-r--r-- 1 robherley staff 9.9G Apr 25 12:21 archive.zip
```
This is uploaded to blob storage (with public access):
https://robstorage123.blob.core.windows.net/zip-test/archive.zip
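If you want to reproduce the upload in Go, here's a sketch using the azblob package's `UploadFile` helper. The connection-string environment variable is an assumption; any authenticated client works, as long as the container allows public blob access:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

func main() {
	// Assumes a storage connection string in the environment (hypothetical
	// variable name); any other credential type works too.
	client, err := azblob.NewClientFromConnectionString(os.Getenv("AZURE_STORAGE_CONNECTION_STRING"), nil)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("archive.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// UploadFile chunks and parallelizes the upload, which matters for a ~10GB file.
	if _, err := client.UploadFile(context.Background(), "zip-test", "archive.zip", f, nil); err != nil {
		log.Fatal(err)
	}
}
```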