orthogonalization: know what to tune to achieve what effect; ideally we'd have orthogonal controls, each with a well-defined impact (like a car's steering wheel, accelerator, and brake); unfortunately, that's not usually the case in machine learning
the chain of assumptions we make in ML, and the knobs for each:
- fit the training set well on the cost function (roughly human-level): knobs would be: bigger network, better optimization algorithm (e.g. Adam)
- hope it does well on the dev set: knobs would be: bigger training set, regularization
- hope it does well on the test set: knob would be: bigger dev set
- hope it performs well in the real world: knobs would be: change the dev set or the cost function
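the chain above can be sketched as a diagnostic: check each assumption in order and turn the first knob that fails. a toy illustration (my own, not from the lecture; the `suggest_knob` function, its error arguments, and the `gap` threshold are all made-up assumptions):

```python
def suggest_knob(human_err, train_err, dev_err, test_err, gap=0.02):
    """Walk the chain of assumptions; return the knob for the first failing stage.

    The 0.02 gap threshold is an arbitrary illustrative choice.
    """
    if train_err - human_err > gap:
        # not fitting the training set well enough
        return "bigger network / better optimizer"
    if dev_err - train_err > gap:
        # fits training but not dev: a generalization gap
        return "more training data / regularization"
    if test_err - dev_err > gap:
        # overfit the dev set
        return "bigger dev set"
    # good on all metrics but still failing in the real world
    return "change dev set or cost function"

# e.g. large train-vs-human gap -> work on fitting the training set first
print(suggest_knob(human_err=0.01, train_err=0.10, dev_err=0.11, test_err=0.11))
```

the point of orthogonalization is exactly this one-knob-per-symptom mapping: each fix targets one link in the chain without (ideally) disturbing the others.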