@andyneff
Last active September 7, 2022 18:55
Jupyter notebook filter

Putting Jupyter notebooks in source control is always a good idea. However, notebooks can grow quite large when cell outputs are large or contain images. On top of that, re-running a notebook to regenerate its results creates a large diff for little or no change in the code, further increasing the repo size. Eventually the repo becomes large enough to be unwieldy, or even hits the maximum repository size that some git servers enforce (2GB, for example).

The solution is to strip the output of the notebooks automatically, so that output is never committed by accident. Git provides two different mechanisms to do this:

  • Pre-commit hooks: Pre-commit hooks must be set up after each clone, and can wipe the output of your notebooks right before a commit. However, this also wipes the output in your working copy, which is undesirable.
  • Clean/smudge filters: Custom filters must also be set up after each clone, but a "clean filter" is the better solution because it wipes the output in the staged copy only. It also has the advantage of being applied on diff and on any other git operation that would otherwise show a difference we don't want to see. A "smudge filter" is not needed, since we do not plan on undoing the wiping operation.
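
To see the "staged copy only" behavior in isolation, here is a minimal sketch that uses a throwaway repository and a stand-in filter. The filter name `demo-upper`, the `tr` command, and the function name `demo_clean_filter` are assumptions made for illustration; they are not the notebook filter described in this gist.

```shell
#!/usr/bin/env bash
# Demonstrates that a clean filter changes what git stages, not the file on disk.
# "demo-upper" is a hypothetical stand-in filter that uppercases text.
demo_clean_filter()
{
  local repo
  repo=$(mktemp -d) || return 1
  (
    cd "${repo}" || exit 1
    git init -q .
    # Register the stand-in clean filter for this repository only
    git config --local filter.demo-upper.clean "tr a-z A-Z"
    # Apply it to every .txt file
    echo "*.txt filter=demo-upper" > .gitattributes
    echo "hello" > note.txt
    git add .gitattributes note.txt
    # The file on disk is untouched; the blob in the index went through the filter
    echo "working copy: $(cat note.txt)"
    echo "staged copy:  $(git show :note.txt)"
  )
  rm -rf "${repo}"
}
```

Calling `demo_clean_filter` should show the working copy still in lowercase while the staged copy is uppercased, which is the same split the notebook filter relies on: outputs stay in your local file but never reach the index.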

Steps to set up the "clean filter"

  1. Initially, you need to run the following function every time you clone the repo. Make it part of your cloning process (for developers) in whatever way suits you best:
function setup-jupyter-filter()
{
  local git_dir=$(cd "$(git rev-parse --git-dir)"; pwd)
  local jq
  local url
  local JQ_VERSION=${JQ_VERSION-1.6}

  if [ "${OS-}" = "Windows_NT" ]; then
    url="https://github.com/stedolan/jq/releases/download/jq-${JQ_VERSION}/jq-win64.exe"
    jq="${git_dir}/bin/jq.exe"
  elif [[ ${OSTYPE-} = darwin* ]]; then
    url="https://github.com/stedolan/jq/releases/download/jq-${JQ_VERSION}/jq-osx-amd64"
    jq="${git_dir}/bin/jq"
  else
    url="https://github.com/stedolan/jq/releases/download/jq-${JQ_VERSION}/jq-linux64"
    jq="${git_dir}/bin/jq"
  fi
  if [ ! -f "${jq}" ]; then
    mkdir -p "${git_dir}/bin"
    # -o writes to the given path (-O would treat "${jq}" as a second URL)
    curl -L "${url}" -o "${jq}"
    chmod 755 "${jq}"
  fi

  if ! git config filter.strip-jupyter-notebook.clean &> /dev/null; then
    git config --local filter.strip-jupyter-notebook.clean "${git_dir}/bin/jq '.cells |= map(if .\"cell_type\" == \"code\" then .outputs = [] | .execution_count = null else . end | .metadata = {}) | .metadata = {}'"
  fi
}
  2. Second, declare which files the filter applies to in a git-tracked file called .gitattributes. For example, to apply it to all .ipynb files:
*.ipynb filter=strip-jupyter-notebook -text
  3. The optional third step is to "rewrite history" and strip all the notebooks already committed:
function rewrite-strip-notebooks()
{
  local git_dir=$(cd "$(git rev-parse --git-dir)"; pwd)
  export git_dir

  git filter-branch --prune-empty --tree-filter '
    git config --local filter.strip-jupyter-notebook.clean "${git_dir}/bin/jq '\''.cells |= map(if .\"cell_type\" == \"code\" then .outputs = [] | .execution_count = null else . end | .metadata = {}) | .metadata = {}'\''"
    echo "*.ipynb filter=strip-jupyter-notebook -text" >> .gitattributes
    git add ".gitattributes"

    # The filter name matched here must be the one configured above
    for file in $(git ls-files | xargs git check-attr filter | grep "filter: strip-jupyter-notebook$" | sed -E "s/(.*): filter: strip-jupyter-notebook/\1/"); do
      echo "Processing ${file}"
      git rm -f --cached "${file}"
      echo "Stripping and adding ${file} back"
      git add "${file}"
    done' --tag-name-filter cat -- --all
}
  4. Optionally, garbage collect locally to see the smaller size right away:
du -hs .git
rm -r .git/refs/original
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc
du -hs .git
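  5. Push the rewritten history. A force push is needed because the commits have changed:
# --all pushes only branches and --tags pushes only tags;
# git refuses to combine the two options, so push in two steps
git push origin --force --all
git push origin --force --tags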
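
After working through the steps above, a quick check can confirm that the pieces are in place. The following is only a sketch and not part of the original instructions: the function name `check_jupyter_filter` and the default path `example.ipynb` are hypothetical. It relies on `git config --get` and `git check-attr`, and assumes it is run from inside the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports whether the clean filter is configured and
# whether .gitattributes assigns it to the given path (default: example.ipynb).
check_jupyter_filter()
{
  local notebook="${1-example.ipynb}"
  local status=0
  local attr

  # Step 1: the filter command must exist in the git configuration
  if git config --get filter.strip-jupyter-notebook.clean > /dev/null; then
    echo "clean filter: configured"
  else
    echo "clean filter: missing"
    status=1
  fi

  # Step 2: git check-attr prints "<path>: filter: <value>"
  attr=$(git check-attr filter -- "${notebook}" | sed -E 's/^.*: filter: //')
  if [ "${attr}" = "strip-jupyter-notebook" ]; then
    echo "attribute: ${notebook} uses strip-jupyter-notebook"
  else
    echo "attribute: ${notebook} has filter '${attr}'"
    status=1
  fi

  return "${status}"
}
```

The path does not need to exist on disk, because `git check-attr` only evaluates the patterns in .gitattributes. A non-zero return value means that either the setup function has not been run in this clone or the .gitattributes entry is missing.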

* Based on "How to clear jupyter notebooks output in all cells from the linux terminal"
