Ed Summers edsu

@edsu
edsu / swap
Last active July 30, 2024 18:08
See which SDR collection and crawl objects have a snapshot of a given URL.
#!/usr/bin/env python3
"""
Look up a URL in swap.stanford.edu and print out the collections and crawl
SDR object identifiers that contain a snapshot of the URL.
"""
import sys
import json
import collections
import csv
import sys
from itertools import batched
import pyarrow
from pyarrow.parquet import ParquetWriter
csv.field_size_limit(sys.maxsize)
def csv_to_parquet(csv_file, parquet_file, batch_size=10_000):
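The signature above suggests the CSV is processed in fixed-size batches. Here is a minimal sketch of just that batching step, using only the standard library. The helper name `iter_csv_batches` is hypothetical, the Parquet writing is deliberately left out, and the fallback `batched` is only there for Python versions before 3.12.

```python
import csv
import io

try:
    from itertools import batched  # available in Python 3.12+
except ImportError:
    from itertools import islice

    def batched(iterable, n):
        """Fallback for older Pythons: yield tuples of up to n items."""
        it = iter(iterable)
        while chunk := tuple(islice(it, n)):
            yield chunk


def iter_csv_batches(lines, batch_size=10_000):
    """Hypothetical helper: yield (header, rows) for each batch of CSV rows."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:  # empty input, nothing to yield
        return
    for rows in batched(reader, batch_size):
        yield header, list(rows)


# Small in-memory sample standing in for a real CSV file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
for header, rows in iter_csv_batches(sample, batch_size=2):
    print(header, rows)
```

Each batch could then be turned into a table and handed to a writer, which keeps memory use bounded by `batch_size` rather than by the size of the file.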
edsu / .gitignore
Last active July 18, 2024 14:48
A sloppy prototype for moving browsertrix WACZs to AWS S3.
.env
import requests
author_id = 'https://openalex.org/A5067004024'
url = 'https://api.openalex.org/works'
params = {
    'filter': f'author.id:{author_id}',
    'cursor': '*'
}
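Starting with `'cursor': '*'` is how OpenAlex cursor paging begins: each response carries a `meta.next_cursor` value to send with the next request, and it comes back empty once there are no more pages. A rough sketch of that loop follows. The function name `iter_works` and the injected `fetch` callable are assumptions made so the example runs without network access, and the `W1`-style IDs are made-up placeholders.

```python
def iter_works(fetch, params):
    """Yield results page by page, following next_cursor until it is empty.

    `fetch` is a hypothetical callable that takes the query params and
    returns the decoded JSON response as a dict.
    """
    params = dict(params)  # copy so the caller's dict is not modified
    while params.get("cursor"):
        page = fetch(params)
        yield from page.get("results", [])
        params["cursor"] = page.get("meta", {}).get("next_cursor")


# Stand-in responses keyed by cursor, in place of real API calls.
fake_pages = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "abc": {"meta": {"next_cursor": None}, "results": [{"id": "W3"}]},
}


def fake_fetch(params):
    return fake_pages[params["cursor"]]


works = list(iter_works(fake_fetch, {"filter": "author.id:A5067004024", "cursor": "*"}))
print([w["id"] for w in works])
```

Against the real API, `fetch` could be something like `lambda p: requests.get(url, params=p).json()`.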
#!/usr/bin/env python3
"""
Run this program with an institution name to see the matching institutions and
their publication counts in OpenAlex.
$ ./openalex_counts "stanford"
Stanford University (I97018004): 430550
Stanford Medicine (I4210137306): 32576
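The output format shown above can be sketched from OpenAlex institution records, assuming each record carries the `id`, `display_name`, and `works_count` fields. The function name `format_counts` is hypothetical, and the sample records are hand-written from the two lines in the docstring rather than fetched from the API.

```python
def format_counts(institutions):
    """Return one 'Name (ID): count' line per institution record."""
    lines = []
    for inst in institutions:
        # OpenAlex IDs are URLs; keep only the trailing short identifier.
        short_id = inst["id"].rsplit("/", 1)[-1]
        lines.append(f'{inst["display_name"]} ({short_id}): {inst["works_count"]}')
    return lines


# Hand-written sample records mirroring the docstring output.
sample = [
    {"id": "https://openalex.org/I97018004",
     "display_name": "Stanford University", "works_count": 430550},
    {"id": "https://openalex.org/I4210137306",
     "display_name": "Stanford Medicine", "works_count": 32576},
]

for line in format_counts(sample):
    print(line)
```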
edsu / en.wav
Last active March 29, 2024 12:22
This seems to cause whisper to segfault on my MacBook Pro (2.4 GHz 8-Core Intel Core i9, macOS Sonoma 14.4.1, Python 3.12.0).
edsu / response.json
Last active March 20, 2024 16:54
Looking at the HTTP request that happens when you click on a citation link in a PDF when using Google Scholar's PDF extension for Chrome. You will need to be logged into Google to see the response, which comes back with the wrong Content-Type: https://scholar.google.com/scholar?oi=gsr-r&q=Ben-David%20A%20and%20Amram%20A%20(2018)%20The%20internet…
{
  "l": "1",
  "p": "https://lh3.googleusercontent.com/-XdUIqdMkCWA/AAAAAAAAAAI/AAAAAAAAAAA/4252rscbv5M/s64-c-mo/photo.jpg",
  "r": [
    {
      "t": "The Internet Archive and the socio-technical construction of historical facts",
      "u": "https://scholar.google.com/scholar_url?url=https://www.tandfonline.com/doi/abs/10.1080/24701475.2018.1455412&hl=en&sa=T&oi=gsr-r&ct=res&cd=0&d=3272375975175528132&ei=YBH7ZeXNA4Cb6rQPmrOdoA8&scisig=AFWwaeb_dRhXurIfWX0NXA2y4G9I",
      "x": "",
      "m": "A Ben-David, A Amram - Internet Histories, 2018",
      "s": "This article analyses the socio-technical epistemic processes behind the construction of historical facts by the Internet Archive Wayback Machine (IAWM). Grounded in theoretical debates in Science and Technology Studies about digital and algorithmic platforms as “black boxes”, this article uses provenance information and other data traces provided by the IAWM to uncover specific epistemic processes embedded at its back-end, through a case study on the archiv
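Since the response arrives with the wrong Content-Type, one option is to parse the body as JSON regardless of the header. In the record above, the `u` value is a `scholar_url` redirect that carries the publisher link in its `url` query parameter. Here is a small sketch of both steps, using a trimmed copy of that record; the function name `target_url` is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlsplit


def target_url(scholar_url):
    """Return the value of the `url` query parameter, or None if absent."""
    query = parse_qs(urlsplit(scholar_url).query)
    return query.get("url", [None])[0]


# Trimmed copy of the response body shown above (only "t" and "u" kept).
body = json.dumps({
    "l": "1",
    "r": [{
        "t": "The Internet Archive and the socio-technical construction of historical facts",
        "u": "https://scholar.google.com/scholar_url?url=https://www.tandfonline.com/doi/abs/10.1080/24701475.2018.1455412&hl=en&sa=T&oi=gsr-r&ct=res&cd=0&d=3272375975175528132&ei=YBH7ZeXNA4Cb6rQPmrOdoA8&scisig=AFWwaeb_dRhXurIfWX0NXA2y4G9I",
    }],
})

# Decode the text directly instead of trusting the Content-Type header.
data = json.loads(body)
for result in data["r"]:
    print(result["t"])
    print(target_url(result["u"]))
```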
filename                  count
data.zip                  22397
data_EPSG_4326.zip        22397
preview.jpg               22397
index_map.json              147
Beechey_WGS.tif.xml           1
Beechey_WGS-iso19139.xml      1
Beechey_WGS-fgdc.xml          1
bathy20.txt                   1
edsu / lcauthority.py
Last active February 16, 2024 22:23
Get some usable JSON for a given LC name or subject authority string: e.g. `./lcauthority.py "Southampton (England)"`
#!/usr/bin/env python3
"""
A small command line tool to get the JSON-LD for a Library of Congress authority
record by first looking up the authority as a string using the label lookup
service and then getting the JSON-LD for the authority and writing it out using
a JSON-LD frame where SKOS is the default vocabulary.
"""
import sys
edsu / guess_doi.py
Last active January 10, 2024 17:54
Use the CrossRef API to guess the DOI for a given title
#!/usr/bin/env python3
import sys
import requests
title = sys.argv[1]
api_url = "https://api.crossref.org/works"
response = requests.get(api_url, params={"query.title": title})
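The preview stops at the request, so what follows is only a guess at the next step. CrossRef's works endpoint returns its matches under `message.items`, each with a `DOI` field, so picking the first item is one plausible way to "guess" the DOI. The function name `first_doi` is hypothetical, and the sample payload is a hand-written stand-in rather than a real API response.

```python
def first_doi(payload):
    """Return the DOI of the first match in a CrossRef works response, if any."""
    items = payload.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


# Hand-written stand-in for response.json(); not a real API response.
sample_payload = {
    "message": {
        "items": [
            {"DOI": "10.1080/24701475.2018.1455412"},
        ]
    }
}

print(first_doi(sample_payload))
print(first_doi({"message": {"items": []}}))
```

With the request above, this would be called as `first_doi(response.json())`.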