Matthew Steffen msteffen

// step takes 'ptr', a newly-changed pipeline pointer in etcd, and
// 1. retrieves its full pipeline spec and RC
// 2. makes whatever changes are needed to bring the RC in line with the (new) spec
// 3. updates 'ptr', if needed, to reflect the action it just took
fu

// The master process is responsible for creating/deleting workers as
// pipelines are created/removed.
func (a *apiServer) master() {
	masterLock := dlock.NewDLock(a.etcdClient, path.Join(a.etcdPrefix, masterLockPath))
	backoff.RetryNotify(func() error {
	...
msteffen / tiny-cluster.sh
Last active March 23, 2019 17:11
Quickly create 1-node Pachyderm cluster in GKE
#!/bin/bash
export RUNNING_CLUSTERS="$(
gcloud container clusters list \
| tail -n+2 \
| awk '{print $1}'
)"
MACHINE_TYPE=n1-standard-4
eval "set -- $(getopt -l "no-deploy-pachyderm,create,machine-type:,delete:,delete-all" "--" "${0}" "${@}")"
msteffen / Run experimental Pachyderm release.md
Last active August 23, 2018 00:32
Running experimental Pachyderm release

Activate Pachyderm enterprise and Pachyderm auth

pachctl enterprise activate <enterprise code>
pachctl auth activate --initial-admin=robot:abc

Write Pachyderm config

# Lookup current config version--pachyderm config has a barrier to prevent
# read-modify-write conflicts between admins
msteffen / Discarded Issue: Auth for Pachyderm Commits and Branches.md
Created December 21, 2017 18:25
Discarded issue: Auth for Pachyderm Commits and Branches

Background

In the course of working on pachyderm/pachyderm#2505, JD and I ran into a conflict between that design and our auth model:

If PipelineInfo documents are stored in output repos, then e.g. ListPipeline and InspectPipeline have no way to retrieve PipelineInfos for users who don't have access to the pipeline's output repo.

This means that ListPipeline no longer returns all PipelineInfos in a DAG (and may not even return most of them), which I believe breaks some of the assumptions in our dashboard rendering algorithm (see Alternatives Considered for some of the conceptual problems I ran into while trying to think of solutions).
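
To make the problem concrete, here is a minimal sketch in Go. Everything in it is hypothetical: the simplified `PipelineInfo` struct, the `ACL` map, and the `listPipelines` and `missingProducers` helpers are illustrations, not Pachyderm's actual types or API. It also assumes that each pipeline's output repo has the same name as the pipeline. The point is only that a caller who lacks access to one output repo gets back a list that no longer explains where every input comes from.

```go
package main

import "fmt"

// PipelineInfo is a simplified, hypothetical stand-in for the real document.
type PipelineInfo struct {
	Name       string
	Inputs     []string // names of the repos this pipeline reads
	OutputRepo string   // assumed to equal Name in this sketch
}

// ACL maps repo name -> set of users allowed to read it (hypothetical model).
type ACL map[string]map[string]bool

// listPipelines returns only the PipelineInfos whose output repo the user can
// read, which is what happens if PipelineInfos live inside output repos.
func listPipelines(all []PipelineInfo, acl ACL, user string) []PipelineInfo {
	var visible []PipelineInfo
	for _, p := range all {
		if acl[p.OutputRepo][user] {
			visible = append(visible, p)
		}
	}
	return visible
}

// missingProducers returns every input repo that is not the output of any
// pipeline in the given list. With the full list these are exactly the source
// repos; with a filtered list they also include outputs of hidden pipelines.
func missingProducers(pipelines []PipelineInfo) []string {
	produced := make(map[string]bool)
	for _, p := range pipelines {
		produced[p.OutputRepo] = true
	}
	seen := make(map[string]bool)
	var missing []string
	for _, p := range pipelines {
		for _, in := range p.Inputs {
			if !produced[in] && !seen[in] {
				seen[in] = true
				missing = append(missing, in)
			}
		}
	}
	return missing
}

func main() {
	// DAG: raw -> clean -> train -> report
	all := []PipelineInfo{
		{Name: "clean", Inputs: []string{"raw"}, OutputRepo: "clean"},
		{Name: "train", Inputs: []string{"clean"}, OutputRepo: "train"},
		{Name: "report", Inputs: []string{"train"}, OutputRepo: "report"},
	}
	// alice can read every repo except the output of "train".
	acl := ACL{
		"raw":    {"alice": true},
		"clean":  {"alice": true},
		"report": {"alice": true},
	}
	visible := listPipelines(all, acl, "alice")
	fmt.Println("inputs with no known producer (full DAG):   ", missingProducers(all))
	fmt.Println("inputs with no known producer (alice's view):", missingProducers(visible))
}
```

In alice's view, "train" looks just like a source repo such as "raw", so anything that renders the DAG from this list cannot tell a real source from the output of a pipeline it was not allowed to see.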

While we could hack around this issue (again, see Alternatives Considered below), I think this may be an opportunity to move our auth system in the direction of a role-based auth system similar to the ones in GCP, AWS, and etcd.

msteffen / write_test.go
Last active July 16, 2017 06:13
Filesystem benchmark in Go
package main
import (
"fmt"
"os"
"testing"
)
func BenchmarkWrite(b *testing.B) {
for cnt, cntP := 10, 1; cntP <= 3; cnt, cntP = cnt*10, cntP+1 {
msteffen / gke-pach-minio-20170425.md
Last active October 15, 2021 08:51
Deploy a 1-node minio cluster in a GKE cluster, and then run a Pachyderm cluster on top of it

Step 1: Create a GKE cluster

$ CLUSTER_NAME=msteffen-cluster-$(date +%Y%m%d)
$ GCP_ZONE=us-west1-a
$ STORAGE_NAME=pach-disk
$ STORAGE_SIZE=10
$ gcloud config set container/cluster ${CLUSTER_NAME}
$ gcloud config set compute/zone ${GCP_ZONE}
$ gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type n1-standard-4 --num-nodes=3
msteffen / pachyderm_description.md
Created February 25, 2017 22:51
Pachyderm description thing

The high-level concept is basically this:

a) Versioning is obviously pretty useful. Developers have been versioning their code forever, but the reality is that right now most companies' data science teams version their data sets by naming them with today's date or something (if that).

b) Versioning data sets by dating them isn't actually terrible if each of your data pipelines only consumes one data set and produces one output, but if you have e.g. a pipeline P consuming data sets A and B, and you want to understand the output of P (e.g. to debug it), you need to know what the states of both A and B were at the time that P produced its output.
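
A rough sketch of what "knowing the state of both A and B" could look like as data, in Go. The `Commit` and `OutputCommit` types and the `run` and `inputVersion` helpers are made up for illustration and are not Pachyderm's API; the idea is only that each output of P carries a record of the exact input versions (and the version of P) that produced it.

```go
package main

import "fmt"

// Commit identifies one version of a data set (hypothetical type).
type Commit struct {
	Repo string
	ID   string
}

// OutputCommit is one output of pipeline P, stored together with the exact
// input commits and pipeline version it was computed from.
type OutputCommit struct {
	Commit
	PipelineVersion int
	Provenance      []Commit
}

// run simulates P producing an output from specific versions of its inputs.
func run(outID string, pipelineVersion int, inputs ...Commit) OutputCommit {
	prov := make([]Commit, len(inputs))
	copy(prov, inputs) // keep our own copy so later changes can't rewrite history
	return OutputCommit{
		Commit:          Commit{Repo: "P", ID: outID},
		PipelineVersion: pipelineVersion,
		Provenance:      prov,
	}
}

// inputVersion reports which commit of repo the given output was built from.
func inputVersion(out OutputCommit, repo string) (string, bool) {
	for _, c := range out.Provenance {
		if c.Repo == repo {
			return c.ID, true
		}
	}
	return "", false
}

func main() {
	out := run("p1", 2, Commit{Repo: "A", ID: "a3"}, Commit{Repo: "B", ID: "b7"})
	a, _ := inputVersion(out, "A")
	b, _ := inputVersion(out, "B")
	fmt.Printf("output %s came from A@%s, B@%s, P v%d\n", out.ID, a, b, out.PipelineVersion)
}
```

With a record like this, debugging an output of P means looking up its provenance rather than guessing which dated copy of A and B happened to be current that day.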

In other words, it's not enough to look at yesterday's version of A; you have to know what the version of B was at the same time, and if your implementation of P is changing (e.g. it's a model and you keep adjusting hyperparameters) then you need to know the version of P too. So PFS tracks versions of A and B, and PPS tracks the dependence of P on A and B, so t