@jb-adams
Last active October 9, 2019 19:14
Reference sequences on AWS Tutorial

INSDC Reference Sequence Public Dataset Tutorial

The INSDC Reference Sequence Public Dataset provides access to biological reference sequences submitted to the INSDC, where each sequence is identified by its checksum. The dataset includes both the raw sequence and its associated metadata.

Accessing Data

insdc-reference-sequences
|-- sequence
|   |-- 023e92ccde5f86f31ea0844a92dddb86
|   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50
|   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f
|   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a
|   |-- dbe6100b83178f3ac561d98c2dfc41a0
|   |-- ff734bf70e13affa85a272fda6659a5f
|   |__ *
|__ metadata
    |-- json
    |   |-- 023e92ccde5f86f31ea0844a92dddb86.json
    |   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50.json
    |   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f.json
    |   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a.json
    |   |-- dbe6100b83178f3ac561d98c2dfc41a0.json
    |   |-- ff734bf70e13affa85a272fda6659a5f.json
    |   |__ *.json
    |__ csv
        |-- AAIYXD01.full.csv
        |-- CABIKC01.full.csv
        |-- LVHX01.full.csv
        |__ *.full.csv
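
The layout above suggests that the location of a sequence and its JSON metadata can be derived from a checksum alone. The following is a minimal sketch, assuming the website endpoint used in the download examples of this tutorial, and assuming the identifier type can be inferred from its length (32 hex characters for MD5, 48 for TRUNC512). The names `BASE_URL`, `checksum_kind`, and `object_urls` are hypothetical and exist only for illustration.

```python
import string

# Assumption: the S3 website endpoint used by the download examples in this tutorial.
BASE_URL = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com"


def checksum_kind(checksum):
    """Classify a hex checksum by length: 32 chars -> md5, 48 chars -> trunc512."""
    if not checksum or any(c not in string.hexdigits for c in checksum):
        raise ValueError("checksum must be a non-empty hex string")
    if len(checksum) == 32:
        return "md5"
    if len(checksum) == 48:
        return "trunc512"
    raise ValueError("expected 32 (MD5) or 48 (TRUNC512) hex characters")


def object_urls(checksum):
    """Build the sequence and JSON metadata URLs for one checksum."""
    checksum_kind(checksum)  # validates the identifier, raises ValueError if malformed
    return {
        "sequence": f"{BASE_URL}/sequence/{checksum}",
        "metadata": f"{BASE_URL}/metadata/json/{checksum}.json",
    }


if __name__ == "__main__":
    print(object_urls("dbe6100b83178f3ac561d98c2dfc41a0"))
```

Rejecting malformed identifiers up front avoids sending a request that can only fail.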

CSV logs

Logs of upload processing events are available from s3://PDS/metadata/csv. Each file represents a single load attempt (generally containing all sequences in an assembly), and each line describes one loaded sequence. The following table outlines the data columns of each sequence record.

Loaded sequence columns in metadata CSV

# | Field Name | Description | Example
1 | trunc512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence, truncated to 48 characters | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c
2 | md5 | Message Digest (MD5) hex-string digest of sequence (32 characters) | cd8d02e2d8af721bed2ba9392a96da0e
3 | length | Sequence base pair length | 1470266
4 | sha512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence (128 characters) | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a
5 | trunc512_base64 | Base64 representation of trunc512 digest (32 characters) | uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8
6 | insdc | INSDC Versioned Accession Number | CABIKC010000001.1
7 | ena_type | Record type | expanded_con
8 | species | Human readable species taxonomic name (i.e. Genus species) | "Saccharomyces cerevisiae"
9 | biosample | BioSample Accession | SAMEA5816324
10 | taxon | NCBI Taxonomy species identifier | 4932
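
The relationship between the checksum columns can be sketched in Python. The example values above are consistent with trunc512 being the first 24 bytes (48 hex characters) of the SHA-512 digest, and with trunc512_base64 being the URL-safe Base64 encoding of those same 24 bytes. How the sequence is normalised before hashing is not stated here, so the uppercase ASCII encoding below is an assumption, and the function names are hypothetical.

```python
import base64
import hashlib


def sequence_digests(sequence):
    """Compute the checksum columns for one sequence string.

    Assumption: the sequence is hashed as uppercase ASCII with no whitespace.
    """
    data = sequence.upper().encode("ascii")
    sha512 = hashlib.sha512(data)
    trunc_bytes = sha512.digest()[:24]  # first 24 bytes = 48 hex characters
    return {
        "trunc512": trunc_bytes.hex(),
        "md5": hashlib.md5(data).hexdigest(),
        "length": len(data),
        "sha512": sha512.hexdigest(),
        "trunc512_base64": base64.urlsafe_b64encode(trunc_bytes).decode("ascii"),
    }


def trunc512_hex_to_base64(hex_digest):
    """Convert a 48-character trunc512 hex digest to its URL-safe Base64 form."""
    return base64.urlsafe_b64encode(bytes.fromhex(hex_digest)).decode("ascii")


if __name__ == "__main__":
    # Reproduces the trunc512_base64 example value from the trunc512 example value.
    print(trunc512_hex_to_base64("b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c"))
```

Because 24 bytes is a multiple of 3, the Base64 form is exactly 32 characters long and needs no padding.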

An example CSV of loaded sequences is displayed below as a table. The table shows a subset of the sequences from assembly GCA_902192315.1, a Saccharomyces cerevisiae genome assembly.

Example CSV of loaded sequences from assembly GCA_902192315.1

trunc512 md5 length sha512 trunc512_base64 insdc ena_type species biosample taxon
b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c cd8d02e2d8af721bed2ba9392a96da0e 1470266 b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8 CABIKC010000001.1 expanded_con "Saccharomyces cerevisiae" SAMEA5816324 4932
8decdfa7b43090448ae9411a77e2105390855bd0770e0ded fa6ea9d18d255f0586cf967071bacf8a 1062691 8decdfa7b43090448ae9411a77e2105390855bd0770e0ded4ec8d19bc17e0d2c2af4c9c38c694502d061bd310547020df5ded87641450a20a6e24985bef5904c jezfp7QwkESK6UEad-IQU5CFW9B3Dg3t CABIKC010000002.1 expanded_con "Saccharomyces cerevisiae" SAMEA5816324 4932
e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e14 e6d87174b53bc10a65e8b30363fe994f 1092091 e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e143e84b54652c583958f642604497bdfcf5491c9879da9b8d3e46b8a3f010859e0d27367ab82f7c808 5k3SNkLS9PzZZG6vhE-LG2bo3GynGZ4U CABIKC010000003.1 expanded_con "Saccharomyces cerevisiae" SAMEA5816324 4932
b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465 178f3cd414e0f23b97397f705fade52d 912642 b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465078fc5bbb321b09c4ad3f9717c52778d08ffb03673c317e3ab18836d110f1db86965c1840ee0bb66 ucwSoFk302K1pV3Eo4hQeCwDTN9oK9Rl CABIKC010000004.1 expanded_con "Saccharomyces cerevisiae" SAMEA5816324 4932
ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875 bba885139f4796326100ff9db55b9235 815240 ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875c6db3be3b313adb9f854b41d10faf4cc796621640158eef5b9f467dd4ff107b7bd9988f5be105b16 rmxnOhh4r8S_Si3x7ZZn0Rbh4pQBX4h1 CABIKC010000005.1 expanded_con "Saccharomyces cerevisiae" SAMEA5816324 4932
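
A CSV log like the one above can be turned into one record per sequence with Python's standard csv module. This is a sketch under stated assumptions: the file is comma-separated, the columns appear in the documented order, and a header row may or may not be present (a first field of `trunc512` is treated as a header and skipped). The names `COLUMNS` and `parse_load_log` are hypothetical.

```python
import csv
import io

# Column order as documented in the metadata CSV table.
COLUMNS = [
    "trunc512", "md5", "length", "sha512", "trunc512_base64",
    "insdc", "ena_type", "species", "biosample", "taxon",
]


def parse_load_log(text):
    """Parse the text of one CSV log into a list of dicts, one per loaded sequence."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and an optional header row (assumption: it may be present).
        if not row or row[0] == "trunc512":
            continue
        record = dict(zip(COLUMNS, row))
        record["length"] = int(record["length"])  # base pair length as an integer
        records.append(record)
    return records
```

The csv module strips the quotes around values such as "Saccharomyces cerevisiae", so the species field comes back as a plain string.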

How to Start Using the Data

Download Sequence and Metadata

curl

You can download sequences and metadata with the curl command-line tool. Be sure to include the -L flag: a request for an MD5 id file returns a redirect to the sequence content (stored under the TRUNC512 id file), and -L tells curl to follow that redirect.

curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json

Python

You can use the requests library in Python to download sequences and metadata.

import requests

url_sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response_sequence = requests.get(url_sequence)
print(response_sequence.content)

url_metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response_metadata = requests.get(url_metadata)
print(response_metadata.content)

R

You can use the httr library in R to download sequences and metadata.

library(httr)

url.sequence <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response.sequence <- GET(url.sequence)
content(response.sequence, "text")

url.metadata <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response.metadata <- GET(url.metadata)
content(response.metadata, "text")

Java

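You can use HttpURLConnection in Java to download sequences and metadata. The snippet below belongs inside a method that declares "throws IOException", and readAllBytes requires Java 9 or later.

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

String sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0";
URL urlSequence = new URL(sequence);
HttpURLConnection connectionSequence = (HttpURLConnection) urlSequence.openConnection();
connectionSequence.setRequestMethod("GET");
// Read the response body into a string
String sequenceBody = new String(connectionSequence.getInputStream().readAllBytes(), StandardCharsets.UTF_8);

String metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json";
URL urlMetadata = new URL(metadata);
HttpURLConnection connectionMetadata = (HttpURLConnection) urlMetadata.openConnection();
connectionMetadata.setRequestMethod("GET");
String metadataBody = new String(connectionMetadata.getInputStream().readAllBytes(), StandardCharsets.UTF_8);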

Locate sequence identifiers from CSV data

Given a genome assembly of interest, the CSV data can be used to look up the checksums, and therefore the raw sequence, of every sequence in the assembly. For example, to locate all sequences for assembly GCA_902192315.1, we can request its CSV file (the file name, CABIKC01, matches the prefix of the assembly's INSDC accessions, such as CABIKC010000001.1):

curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/csv/CABIKC01.full.csv

The first and second columns of the resulting CSV give the TRUNC512 and MD5 identifiers, respectively, of all sequences in the assembly. Either identifier can be used to download a sequence. Given that the first sequence has a TRUNC512 id of b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c, we can request:

curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c

The above process can be repeated for all sequences to collect and reconstruct the entire assembly.
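
The repeated process can be sketched as a short Python loop. This is an illustration under stated assumptions rather than a definitive implementation: it assumes the CSV is comma-separated with the documented column order (TRUNC512 first, INSDC accession sixth), skips a header row if one is present, and uses the standard-library urllib (which follows redirects by default) instead of requests to avoid an extra dependency. The names `fetch_text` and `download_assembly` are hypothetical.

```python
import csv
import io
import urllib.request

# Assumption: the website endpoint used throughout this tutorial.
BASE_URL = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com"


def fetch_text(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def download_assembly(csv_name, fetch=fetch_text):
    """Return {INSDC accession: sequence} for every row of one CSV log.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    log = fetch(f"{BASE_URL}/metadata/csv/{csv_name}")
    sequences = {}
    for row in csv.reader(io.StringIO(log)):
        # Skip blank lines and an optional header row.
        if not row or row[0] == "trunc512":
            continue
        trunc512, insdc = row[0], row[5]
        sequences[insdc] = fetch(f"{BASE_URL}/sequence/{trunc512}")
    return sequences


if __name__ == "__main__":
    assembly = download_assembly("CABIKC01.full.csv")
    for accession, sequence in assembly.items():
        print(accession, len(sequence))
```

Holding every sequence in memory is fine for a small genome such as yeast. For larger assemblies, writing each sequence to disk as it arrives would be the safer choice.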
