CUDA provides a small, fast memory region for its threads called shared memory. As the name suggests, this memory is visible to all threads within a block. We want to use this property to have the threads of a block load data from global memory into shared memory, work on it together, and afterwards write the result back to global memory, avoiding repeated accesses to global memory. Nevertheless, there are some rules one needs to respect for high performance.
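A minimal sketch of this pattern is a block-wide sum reduction: each block reads a tile of the input into shared memory with a single coalesced load, the threads reduce the tile cooperatively, and only one value per block is written back to global memory. Kernel and variable names here are illustrative, not from any particular codebase.

```cuda
#include <cstdio>

// Block-wide sum reduction over a tile held in shared memory.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float tile[256];          // one element per thread

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // One coalesced read from global memory into shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                     // wait until the tile is fully populated

    // Tree reduction entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                 // all adds of this round must finish first
    }

    // A single write back to global memory per block.
    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}
```

Note the `__syncthreads()` barriers: without them, some threads could read tile entries that other threads have not yet written.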
At launch time we specify our grid dimensions. During execution, the threads of a block are executed in fixed-size groups of 32 called warps. Warps are never formed across block boundaries, and the threads of a warp always have consecutive threadIdx values. These threads all execute together in SIMD style as long as there is no branching. For example, an if statement whose condition differs between threads of the same warp leads to serialized execution: the warp runs each branch path in turn, with the threads not on that path masked off. This effect is known as warp divergence.
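The difference between a divergent and a warp-uniform branch can be sketched as follows; both kernels produce the same output, but only the first one forces the hardware to serialize the two paths (kernel names are placeholders for illustration).

```cuda
// Divergent branch: even and odd lanes of the SAME warp take different
// paths, so the warp executes both sides one after the other.
__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // differs between lanes of a warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Warp-uniform branch: the condition is the same for all 32 lanes of a
// warp, so each warp takes exactly one path and nothing is serialized.
__global__ void uniformBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // constant within each warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}
```

When branching is unavoidable, arranging the data or indexing so that the condition is uniform per warp, as in the second kernel, avoids the divergence penalty.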