edsu’s gists

edsu / swap

Last active July 30, 2024 18:08

See which SDR collection and crawl objects have a snapshot of a given URL.

	#!/usr/bin/env python3

	"""
	Look up a URL in swap.stanford.edu and print out the collections and crawl
	SDR object identifiers that contain a snapshot of the URL.
	"""

	import sys
	import json
	import collections

edsu / csv_to_parquet.py

Created July 19, 2024 12:47

	import csv
	import sys
	from itertools import batched

	import pyarrow
	from pyarrow.parquet import ParquetWriter

	csv.field_size_limit(sys.maxsize)

	def csv_to_parquet(csv_file, parquet_file, batch_size=10_000):
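The preview stops at the function signature. A minimal sketch of how a batched conversion like this could look, assuming the first CSV row is a header and that every column is stored as a string (both are assumptions, not taken from the gist). `iter_batches` is a hypothetical helper name, and the fallback `batched` is only there for Pythons older than 3.12:

```python
import csv
import sys

try:
    from itertools import batched  # available in Python 3.12+
except ImportError:
    from itertools import islice

    def batched(iterable, n):
        """Fallback for older Pythons: yield tuples of up to n items."""
        iterator = iter(iterable)
        while chunk := tuple(islice(iterator, n)):
            yield chunk

csv.field_size_limit(sys.maxsize)


def iter_batches(csv_file, batch_size=10_000):
    """Yield (fieldnames, rows) pairs, where rows is a list of dicts."""
    with open(csv_file, newline="") as handle:
        reader = csv.DictReader(handle)
        for rows in batched(reader, batch_size):
            yield reader.fieldnames, list(rows)


def csv_to_parquet(csv_file, parquet_file, batch_size=10_000):
    # Imported here so the batching helper works without pyarrow installed.
    import pyarrow
    from pyarrow.parquet import ParquetWriter

    writer = None
    for fieldnames, rows in iter_batches(csv_file, batch_size):
        if writer is None:
            # Assumption: treat every column as a string.
            schema = pyarrow.schema([(name, pyarrow.string()) for name in fieldnames])
            writer = ParquetWriter(parquet_file, schema)
        writer.write_table(pyarrow.Table.from_pylist(rows, schema=schema))
    # An empty CSV produces no batches, so no Parquet file is written.
    if writer is not None:
        writer.close()
```

Writing one table per batch keeps memory use bounded by `batch_size` rather than by the size of the CSV.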

edsu / .gitignore

Last active July 18, 2024 14:48

A sloppy prototype for moving browsertrix WACZs to AWS S3.

.env

edsu / cursor_test.py

Created July 10, 2024 19:16

	import requests

	author_id = 'https://openalex.org/A5067004024'

	url = 'https://api.openalex.org/works'

	params = {
	    'filter': f'author.id:{author_id}',
	    'cursor': '*'
	}
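The preview ends before the request loop. As a rough sketch of cursor paging against OpenAlex: start with `cursor=*`, then keep passing back `meta.next_cursor` from each response until it comes back null. The names `iter_works` and `fetch_from_openalex` are hypothetical, and the fetch step is passed in as a function so the loop can be tried without network access:

```python
def iter_works(fetch_page, params):
    """Yield every result, following the cursor until it runs out."""
    params = dict(params)  # copy so the caller's dict is not mutated
    while params.get("cursor"):
        page = fetch_page(params)
        yield from page.get("results", [])
        # meta.next_cursor is null once there are no more results.
        params["cursor"] = page.get("meta", {}).get("next_cursor")


def fetch_from_openalex(params, url="https://api.openalex.org/works"):
    """Fetch one page of results as parsed JSON."""
    import requests  # imported lazily; only needed for live requests

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()
```

With the `params` dict from the gist, usage would look something like `for work in iter_works(fetch_from_openalex, params): print(work["id"])`.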

edsu / openalex_counts.py

Last active July 8, 2024 22:42

	#!/usr/bin/env python3

	"""
	Run this program with an institution name and see the institutions and the count
	of publications in OpenAlex.

	$ ./openalex_counts "stanford"

	Stanford University (I97018004): 430550
	Stanford Medicine (I4210137306): 32576
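The preview shows only the docstring. One way output in that shape could be produced, assuming the program searches the OpenAlex `institutions` endpoint and reads each record's `display_name`, `id`, and `works_count` fields. The function names here are hypothetical and this is a sketch, not the gist's actual code:

```python
def format_counts(institutions):
    """Turn OpenAlex institution records into 'Name (ID): count' lines."""
    lines = []
    for inst in institutions:
        # OpenAlex IDs are URLs such as https://openalex.org/I97018004;
        # keep only the last path segment.
        short_id = inst["id"].rsplit("/", 1)[-1]
        lines.append(f'{inst["display_name"]} ({short_id}): {inst["works_count"]}')
    return lines


def search_institutions(name):
    """Search OpenAlex institutions by name and return the result records."""
    import requests  # imported lazily; only needed for live requests

    response = requests.get(
        "https://api.openalex.org/institutions",
        params={"search": name},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]
```

Something like `print("\n".join(format_counts(search_institutions("stanford"))))` would then print one line per matching institution.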

edsu / en.wav

Last active March 29, 2024 12:22

This seems to cause whisper to segfault on my MacBook Pro (2.4 GHz 8-Core Intel Core i9, macOS Sonoma 14.4.1, Python 3.12.0).

edsu / response.json

Last active March 20, 2024 16:54

Looking at the HTTP request that happens when you click on a citation link in a PDF when using Google Scholar's PDF extension for Chrome. You will need to be logged into Google to see the response, which comes back with the wrong Content-Type: https://scholar.google.com/scholar?oi=gsr-r&q=Ben-David%20A%20and%20Amram%20A%20(2018)%20The%20internet…

	{
	  "l": "1",
	  "p": "https://lh3.googleusercontent.com/-XdUIqdMkCWA/AAAAAAAAAAI/AAAAAAAAAAA/4252rscbv5M/s64-c-mo/photo.jpg",
	  "r": [
	    {
	      "t": "The Internet Archive and the socio-technical construction of historical facts",
	      "u": "https://scholar.google.com/scholar_url?url=https://www.tandfonline.com/doi/abs/10.1080/24701475.2018.1455412&hl=en&sa=T&oi=gsr-r&ct=res&cd=0&d=3272375975175528132&ei=YBH7ZeXNA4Cb6rQPmrOdoA8&scisig=AFWwaeb_dRhXurIfWX0NXA2y4G9I",
	      "x": "",
	      "m": "A Ben-David, A Amram - Internet Histories, 2018",
	      "s": "This article analyses the socio-technical epistemic processes behind the construction of historical facts by the Internet Archive Wayback Machine (IAWM). Grounded in theoretical debates in Science and Technology Studies about digital and algorithmic platforms as “black boxes”, this article uses provenance information and other data traces provided by the IAWM to uncover specific epistemic processes embedded at its back-end, through a case study on the archiv

edsu / gist:c683a99d51d4faa26b4e18a466ba1b13

Last active March 5, 2024 16:55

GIS filenames

    
filename	count
data.zip	22397
data_EPSG_4326.zip	22397
preview.jpg	22397
index_map.json	147
Beechey_WGS.tif.xml	1
Beechey_WGS-iso19139.xml	1
Beechey_WGS-fgdc.xml	1
bathy20.txt	1

edsu / lcauthority.py

Last active February 16, 2024 22:23

Get some usable JSON for a given LC name or subject authority string: e.g. `./lcauthority.py "Southampton (England)"`

	#!/usr/bin/env python3

	"""
	A small command line tool to get the JSON-LD for a Library of Congress authority
	record by first looking up the authority as a string using the label lookup
	service and then getting the JSON-LD for the authority and writing it out using
	a JSON-LD frame where the SKOS is the default vocabulary.
	"""

	import sys
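The docstring describes three steps: look up the label, fetch the JSON-LD, and frame it with SKOS as the default vocabulary. A loose sketch of those steps follows. The label service URL pattern, the `X-URI` response header, and the `.jsonld` suffix are assumptions about id.loc.gov rather than details taken from the gist, and all function names are hypothetical. Framing uses `pyld`'s `jsonld.frame`:

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"


def label_url(label):
    """Build a label lookup URL (assumed pattern for id.loc.gov)."""
    return "https://id.loc.gov/authorities/label/" + quote(label, safe="")


def skos_frame(uri):
    """A JSON-LD frame that selects one record and makes SKOS the default vocabulary."""
    return {"@context": {"@vocab": SKOS}, "@id": uri}


def get_authority(label):
    """Look up a label, fetch its JSON-LD, and return the framed record."""
    import requests          # imported lazily; only needed for live requests
    from pyld import jsonld  # third-party JSON-LD processor

    response = requests.get(label_url(label), timeout=30)
    response.raise_for_status()
    # Assumption: the lookup reports the authority URI in an X-URI header;
    # fall back to the final URL after redirects if it is absent.
    uri = response.headers.get("X-URI", response.url)
    doc = requests.get(uri + ".jsonld", timeout=30).json()
    return jsonld.frame(doc, skos_frame(uri))
```

Setting `@vocab` in the frame's context is what lets SKOS properties appear as short keys such as `prefLabel` instead of full URIs.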

edsu / guess_doi.py

Last active January 10, 2024 17:54

Use the CrossRef API to guess the DOI for a given title

	#!/usr/bin/env python3

	import sys
	import requests

	title = sys.argv[1]
	api_url = "https://api.crossref.org/works"

	response = requests.get(api_url, params={"query.title": title})
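The preview ends at the request. A hedged sketch of how the guess might be pulled out of the response, assuming the usual Crossref shape of `message.items`, where each item carries a `DOI`, a `title` list, and a relevance `score`. `best_match` and `guess_doi` are hypothetical names, and picking the highest score is one possible heuristic, not necessarily what the gist does:

```python
def best_match(payload):
    """Return (doi, title) for the highest-scoring item, or None if there are none."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    top = max(items, key=lambda item: item.get("score", 0))
    # Crossref returns titles as a list; some records have none.
    title = (top.get("title") or [""])[0]
    return top.get("DOI"), title


def guess_doi(title, api_url="https://api.crossref.org/works"):
    """Query Crossref by title and return the best (doi, title) guess."""
    import requests  # imported lazily; only needed for live requests

    response = requests.get(api_url, params={"query.title": title}, timeout=30)
    response.raise_for_status()
    return best_match(response.json())
```

Returning the matched title alongside the DOI makes it easy to eyeball whether the guess is plausible.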

Ed Summers edsu