Matthew Steffen msteffen

Activate Pachyderm enterprise and Pachyderm auth

pachctl enterprise activate <enterprise code>
pachctl auth activate --initial-admin=robot:abc

Write Pachyderm config

# Lookup current config version--pachyderm config has a barrier to prevent
# read-modify-write conflicts between admins
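
The version barrier described in the comment above is a form of optimistic concurrency control: a write is accepted only if the config has not changed since the writer read it. Below is a minimal Go sketch of that idea. `ConfigStore`, `Get`, `Set` and `ErrVersionConflict` are hypothetical names invented for illustration and are not Pachyderm's actual config API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrVersionConflict signals that the caller's copy of the config is stale.
// (Hypothetical name, for illustration only.)
var ErrVersionConflict = errors.New("config version conflict: re-read and retry")

// ConfigStore is a hypothetical in-memory stand-in for a versioned config.
type ConfigStore struct {
	mu      sync.Mutex
	version int64
	value   string
}

// Get returns the current value and the version it was read at.
func (s *ConfigStore) Get() (string, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.value, s.version
}

// Set stores value only if readVersion still matches the stored version,
// then bumps the version so that any concurrent writer's copy becomes stale.
func (s *ConfigStore) Set(value string, readVersion int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version {
		return ErrVersionConflict
	}
	s.value = value
	s.version++
	return nil
}

func main() {
	store := &ConfigStore{}
	_, versionA := store.Get() // admin A reads the config
	_, versionB := store.Get() // admin B reads the same version

	fmt.Println("A:", store.Set("config from A", versionA)) // accepted
	fmt.Println("B:", store.Set("config from B", versionB)) // rejected: stale version
}
```

In this sketch the second admin has to re-read the config and reapply the change, rather than silently overwriting the first admin's write.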

Background

In the course of working on pachyderm/pachyderm#2505, JD and I ran into a conflict between that design and our auth model:

If PipelineInfo documents are stored in output repos, then e.g. ListPipeline and InspectPipeline have no way to retrieve PipelineInfos for users who don't have access to the pipeline's output repo.

This means that ListPipeline no longer returns all of the PipelineInfos in a DAG (and may not even return most of them), which I believe breaks some of the assumptions in our dashboard rendering algorithm (see Alternatives Considered for some of the conceptual problems I ran into while trying to think of solutions).
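
To make the problem concrete, here is a minimal Go sketch of a listing call that can only return pipelines whose output repo the caller may read. The `PipelineInfo` struct, the `ACL` type and `listPipelines` are simplified, hypothetical stand-ins, not the real Pachyderm types or RPCs.

```go
package main

import "fmt"

// PipelineInfo is a simplified, hypothetical stand-in for the real type.
type PipelineInfo struct {
	Name       string
	OutputRepo string
}

// ACL maps a user to the set of repos that user can read (hypothetical).
type ACL map[string]map[string]bool

// listPipelines returns only the pipelines whose output repo the user can
// read, mimicking PipelineInfos that are stored inside output repos.
func listPipelines(all []PipelineInfo, acl ACL, user string) []PipelineInfo {
	var visible []PipelineInfo
	for _, p := range all {
		if acl[user][p.OutputRepo] {
			visible = append(visible, p)
		}
	}
	return visible
}

func main() {
	// A three-stage DAG: clean -> train -> report.
	dag := []PipelineInfo{
		{Name: "clean", OutputRepo: "clean"},
		{Name: "train", OutputRepo: "train"},
		{Name: "report", OutputRepo: "report"},
	}
	// This user can only read the final output repo.
	acl := ACL{"alice": {"report": true}}

	for _, p := range listPipelines(dag, acl, "alice") {
		fmt.Println(p.Name)
	}
}
```

Under these assumptions the user sees only the last pipeline, so a client that expects to reconstruct the whole DAG from the listing gets a fragment with missing upstream nodes.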

While we could hack around this issue (again, see Alternatives Considered below), I think this may be an opportunity to move our auth system in the direction of a role-based auth system similar to the one in GCP, AWS and etcd.
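
As a rough illustration of what "role-based" could mean here, the Go sketch below separates permissions from roles, so that a principal can be allowed to inspect a pipeline without being able to read its output repo. Every name in it (`pipelineViewer`, `repoReader`, `pipeline.inspect`, `repo.read`, `Binding`, `allowed`) is hypothetical and chosen only for the example; it is not a proposed or existing Pachyderm API.

```go
package main

import "fmt"

// Permission is a single action; Role is a named bundle of permissions.
type Permission string
type Role string

// Hypothetical permissions, for illustration only.
const (
	PermInspectPipeline Permission = "pipeline.inspect"
	PermReadRepo        Permission = "repo.read"
)

// rolePermissions lists what each hypothetical role grants.
var rolePermissions = map[Role][]Permission{
	"pipelineViewer": {PermInspectPipeline},
	"repoReader":     {PermInspectPipeline, PermReadRepo},
}

// Binding grants a role to a principal on one resource.
type Binding struct {
	Principal string
	Resource  string
	Role      Role
}

// allowed reports whether any binding gives principal the permission on resource.
func allowed(bindings []Binding, principal, resource string, perm Permission) bool {
	for _, b := range bindings {
		if b.Principal != principal || b.Resource != resource {
			continue
		}
		for _, p := range rolePermissions[b.Role] {
			if p == perm {
				return true
			}
		}
	}
	return false
}

func main() {
	bindings := []Binding{
		{Principal: "alice", Resource: "train", Role: "pipelineViewer"},
	}
	fmt.Println(allowed(bindings, "alice", "train", PermInspectPipeline))
	fmt.Println(allowed(bindings, "alice", "train", PermReadRepo))
}
```

With a split like this, a listing call could check an inspect-style permission instead of read access to the output repo, which is the kind of decoupling the conflict above seems to call for.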

Step 1: Create a GKE cluster

$ CLUSTER_NAME=msteffen-cluster-$(date +%Y%m%d)
$ GCP_ZONE=us-west1-a
$ STORAGE_NAME=pach-disk
$ STORAGE_SIZE=10
$ gcloud config set container/cluster ${CLUSTER_NAME}
$ gcloud config set compute/zone ${GCP_ZONE}
$ gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type n1-standard-4 --num-nodes=3

The high-level concept is basically this:

a) Versioning is obviously pretty useful. Developers have been versioning their code forever, but the reality is that right now most companies' data science teams version their data sets by naming them with today's date or something similar (if that).

b) Versioning data sets by dating them isn't actually terrible if each of your data pipelines only consumes one data set and produces one output. But if you have e.g. a pipeline P consuming data sets A and B, and you want to understand the output of P (e.g. to debug it), you need to know what the state of both A and B was at the time that P produced its output.

In other words, it's not enough to look at yesterday's version of A; you have to know what the version of B was at the same time, and if your implementation of P is changing (e.g. it's a model and you keep adjusting hyperparameters) then you need to know the version of P too. So PFS tracks versions of A and B, and PPS tracks the dependence of P on A and B, so t
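
The bookkeeping described above can be sketched in Go as a record that ties each output version to the exact input versions and pipeline version that produced it. The `Commit` and `OutputRecord` types and the `provenance` function are hypothetical simplifications made up for this example, not the actual PFS or PPS data model.

```go
package main

import "fmt"

// Commit identifies one version of a data set (hypothetical simplified type).
type Commit struct {
	Repo string
	ID   string
}

// OutputRecord ties one output commit to the input commits and the
// pipeline version that produced it.
type OutputRecord struct {
	Output          Commit
	Inputs          []Commit
	PipelineVersion int
}

// provenance finds the record for an output commit, answering "which
// versions of A, B and P produced this?".
func provenance(records []OutputRecord, out Commit) (OutputRecord, bool) {
	for _, r := range records {
		if r.Output == out {
			return r, true
		}
	}
	return OutputRecord{}, false
}

func main() {
	records := []OutputRecord{
		{
			Output:          Commit{Repo: "P", ID: "out1"},
			Inputs:          []Commit{{Repo: "A", ID: "a1"}, {Repo: "B", ID: "b1"}},
			PipelineVersion: 1,
		},
		{
			// A changed and P was retuned; B stayed at b1.
			Output:          Commit{Repo: "P", ID: "out2"},
			Inputs:          []Commit{{Repo: "A", ID: "a2"}, {Repo: "B", ID: "b1"}},
			PipelineVersion: 2,
		},
	}
	if r, ok := provenance(records, Commit{Repo: "P", ID: "out2"}); ok {
		fmt.Println(r.Inputs, "pipeline version", r.PipelineVersion)
	}
}
```

Naming data sets by date cannot answer this lookup once there is more than one input, because the date of the output says nothing about which version of each input was current when the pipeline ran.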

	                                                                                    | // The master process is responsible for creating/deleting workers as
	                                                                                    | // pipelines are created/removed.
	// step takes 'ptr', a newly-changed pipeline pointer in etcd, and                  | func (a *apiServer) master() {
	// 1. retrieves its full pipeline spec and RC                                       | masterLock := dlock.NewDLock(a.etcdClient, path.Join(a.etcdPrefix, masterLockPath))
	// 2. makes whatever changes are needed to bring the RC in line with the (new) spec | backoff.RetryNotify(func() error {
	// 3. updates 'ptr', if needed, to reflect the action it just took                  | ...
	fu

	#!/bin/bash

	export RUNNING_CLUSTERS="$(
	  gcloud container clusters list \
	    | tail -n+2 \
	    | awk '{print $1}'
	)"
	MACHINE_TYPE=n1-standard-4

	eval "set -- $(getopt -l "no-deploy-pachyderm,create,machine-type:,delete:,delete-all" "--" "${0}" "${@}")"

	package main

	import (
		"fmt"
		"os"
		"testing"
	)

	func BenchmarkWrite(b *testing.B) {
		for cnt, cntP := 10, 1; cntP <= 3; cnt, cntP = cnt*10, cntP+1 {