Note: If you are not comfortable with tech / Python, please do not follow this guide or ask me for help. I am too busy immersing to teach someone to use Python, but if you are experienced you can ask :)

Why do this?

I work in tech and love tech.

When you love tech, sometimes you make something just because you can. There is no real reason to do this, but if you are not familiar with having fun, here are some possible reasons:

  1. You may be in the military and need to go 1+ years without internet.
  2. You are in a country with good internet and have to go back to a country with slow internet.
  3. You work as a researcher in Antarctica and only have access to the internet for a few minutes a day, and most of that time is needed for downloading research datasets.

Downloading the catalogue

  • wget is not multi-threaded, and because there's a bunch of absolute shit in the catalogue it would take months to download on its own.
  1. Multi-thread your downloads. I wrote my own program for this.
  2. I downloaded the directory's HTML and broke it up into folders, then downloaded them that way, one folder at a time.
  3. It will take at least 13 hours or so. The internet in SA is actually quite good ngl, it's just the load shedding that sucks :(
  4. Create an ignore list of shit to not download. Read below.

Here's a rough script:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_directory_list(url):
    """Fetch the directory index page and return every link it contains."""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find_all('a', href=True)
    print(f"Failed to retrieve directory list from {url}")
    return []

def download_files_in_directory(directory_url):
    """Mirror one sub-directory with wget, skipping files that already exist."""
    try:
        subprocess.check_call([
            "wget", "-r", "-nc", "--no-parent",
            "--reject", "db",               # skip the huge local-audio .db dumps
            "--reject-regex", ".*&quot.*",  # skip links with mangled quote entities
            directory_url,
        ])
        print(f"Download successful from {directory_url}")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading from {directory_url}: {e}")

if __name__ == "__main__":
    base_url = "https://mokuro.moe/manga/"
    anchor_tags = download_directory_list(base_url)

    if anchor_tags:
        anchor_tags.pop(0)  # drop the "Parent Directory" link
        with ThreadPoolExecutor() as executor:
            executor.map(
                download_files_in_directory,
                [urljoin(base_url, tag.get('href')) for tag in anchor_tags],
            )

This does not account for all the other junk files; please edit the script accordingly :) (a version with a fuller reject list is shown further down).

I did not know what to ignore, so I downloaded a lot of shit. Here's the stuff to get rid of (a cleanup sketch follows this list):

  • there is an /audio directory full of shit that isn't manga. delete this.
  • delete android.db, it's shit
  • there are multiple local yomitan audio dumps that are like 7gb big. delete that shit too. if you run a dirstat program they will stick out like a sore thumb
  • there's 500 billion .json files. because of how filesystems allocate space, they take up much more room than you'd expect: my collection was 600gb but occupied 1.5tb because of these shitty fucking files (size = 549gb, size on disk = 1.35tb). delete this shit too. (note: I ran out of storage on the SD card, so this is not the whole catalogue; later size calculations are done on the whole catalogue. do not @ me on discord about this)
  • some of the manga has been processed with a literal potato and mokuro reader barely works on it. some of it is such low quality you will actually have to go back to nyaa to find the originals and reprocess them. do not expect this catalogue to be perfect.
  • there's some random-ass scripts in there: batch scripts from the early 90s, shell scripts, a bunch of html files, even a fucking python server. all absolute shit. delete it all.
  • there's a folder z_bad_formatting of what appears to be badly formatted manga. at least they admit to it. delete this shit too.
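
Here's a rough cleanup sketch for the directory-level junk above (the /audio folder, z_bad_formatting, and stray android.db files; the small junk files are handled by extension with the find command further down). The root path is an assumption borrowed from that find command, and I'm assuming the junk folders sit directly under it, so check where they actually landed in your copy before running this:

import os
import shutil

# Assumption: adjust this to wherever your copy of the catalogue lives.
ROOT = "/mnt/d/Manga/mokuro.moe/manga"

# Non-manga directories called out above; assumed to sit directly under ROOT.
for junk_dir in ("audio", "z_bad_formatting"):
    path = os.path.join(ROOT, junk_dir)
    if os.path.isdir(path):
        shutil.rmtree(path)
        print(f"removed directory {path}")

# android.db files can show up at any depth, so walk the whole tree for them.
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        if name == "android.db":
            target = os.path.join(dirpath, name)
            os.remove(target)
            print(f"removed {target}")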

Basically, size on disk != size. Imagine you want to print the word "hello" on a piece of paper. You cannot print on just one inch of the paper, you have to use the whole sheet. File systems work the same way: every file takes up at least one whole allocation unit (cluster), so each of these tiny files burns a full cluster, which massively inflates the catalogue on disk.

So you want to delete all the tiny files that are not manga or useful to us.
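
If you want to see the overhead for yourself, here's a minimal sketch (Linux/WSL only, since it relies on st_blocks, which is counted in 512-byte units; the path is an assumption, and numbers on a WSL drvfs mount may only be approximate):

import os

def apparent_vs_allocated(root):
    """Sum up file sizes vs the blocks actually allocated for them on disk."""
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))
            apparent += st.st_size           # "size": the bytes in the file
            allocated += st.st_blocks * 512  # "size on disk": whole blocks used
    return apparent, allocated

apparent, allocated = apparent_vs_allocated("/mnt/d/Manga/mokuro.moe/manga")
print(f"size: {apparent / 1e9:.1f} GB, size on disk: {allocated / 1e9:.1f} GB")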

Here's a list of all file types in the Mokuro catalogue.

txt
jpg
html
mokuro
JPG
PNG
png
zip
jpeg
avif
url
bat
URL
json
gif
csv
info
ini
nomedia
webp
bmp
rar
VIX
ico
py
sha256
dat
db
torrent

You can safely delete (or choose to ignore when downloading):

  • txt
  • html (mokuro reader uses the .mokuro files, not the .html files; if you use Migaku, keep the .html since mokuro reader is broken for Migaku)
  • url
  • bat
  • URL
  • json
  • gif (it is an image, but mokuro does not work with gifs; it only supports png / jpg / jpeg / webp: https://github.com/kha-white/mokuro/blob/ad8af0e374361c1d56f50cc24af5cd6f1dba9328/mokuro/volume.py#L109 )
  • csv
  • info
  • bmp
  • rar (mokuro reader only uses zip folders, not rar)
  • nomedia
  • ini
  • db - all the local yomitan audio / android.db stuff.
  • vix
  • ico
  • py
  • sha256
  • dat
  • torrent (these do not have seeders. I checked loads of them)

The total catalogue size is 830gb, but size on disk is more like 2tb, spread across about 3.1 million files.

After deleting these extensions with the command below (using -iname so the uppercase .URL and .VIX variants are caught too; if you are still downloading, ignore these file types in your download instead):

find /mnt/d/Manga/mokuro.moe/manga -type f \( \
-iname "*.txt" -o \
-iname "*.url" -o \
-iname "*.bat" -o \
-iname "*.json" -o \
-iname "*.gif" -o \
-iname "*.csv" -o \
-iname "*.info" -o \
-iname "*.bmp" -o \
-iname "*.rar" -o \
-iname "*.nomedia" -o \
-iname "*.ini" -o \
-iname "*.db" -o \
-iname "*.vix" -o \
-iname "*.ico" -o \
-iname "*.py" -o \
-iname "*.sha256" -o \
-iname "*.dat" -o \
-iname "*.torrent" \
\) -delete
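
If you are still downloading rather than pruning afterwards, wget's --reject flag takes a comma-separated list of file name suffixes, so the same list can be baked into the download script above. Here is a sketch of a replacement download_files_in_directory (the uppercase duplicates are included in case wget's suffix matching is case-sensitive):

import subprocess

# Same junk list as the find command above, expressed as wget suffixes.
REJECT = ",".join([
    "txt", "url", "URL", "bat", "json", "gif", "csv", "info", "bmp", "rar",
    "nomedia", "ini", "db", "vix", "VIX", "ico", "py", "sha256", "dat", "torrent",
])

def download_files_in_directory(directory_url):
    """Same as the function in the script above, but skipping junk file types up front."""
    try:
        subprocess.check_call([
            "wget", "-r", "-nc", "--no-parent",
            "--reject", REJECT,             # skip the junk file types entirely
            "--reject-regex", ".*&quot.*",  # still skip links with mangled quote entities
            directory_url,
        ])
        print(f"Download successful from {directory_url}")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading from {directory_url}: {e}")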

After running the find command, the collection came out to 810gb in total, with only 1.7 million files. So although we only saved 20gb of raw size, we roughly halved the number of files, which really tells you they were all shit.

Left over: 1.4 million .jpg files, roughly 130k .png and 146k .jpeg, 177 .zip, 8k .webp, 37k .html, and 9k .mokuro files.

BUT! Our size on disk now equals our size (very roughly; there is about a 5gb difference). So by deleting only 20gb of files, we saved about 800gb of actual disk space by my calculations.

I put them all onto a 1.5tb SanDisk microSD card: https://www.amazon.co.uk/SanDisk-1-5TB-microSDXC-adapter-Performance/dp/B0CJMRW771 .

I used TeraCopy and told it not to run checksums or validate the files. Maybe stupid, but it only took around a day and a half to finish.

If I paused the transfer, size on disk looked like it started to spiral again. I am not sure why; maybe some low-level sector curse gets placed on the SD card when you pause. So I ended up deleting everything and doing the transfer one last time, this time without ever pausing it.
