@huddlej
Last active February 27, 2024 18:34
Example subsampling configurations for public Nextstrain seasonal flu builds with different implementations. See the original configuration file for more context: https://github.com/nextstrain/seasonal-flu/blob/45bf4336d9485c1c9bfc44b09b384595d7685032/profiles/nextstrain-public.yaml#L78-L86
# Current implementation approach, where "subsamples" is defined in the build configuration YAML file.
# Build configuration parameters get passed to the optionally-templated subsample "filters" strings
# such that the same subsampling scheme can be shared across multiple builds by passing build-specific
# variables. For the full context of this subsampling scheme, see the original build configuration file:
# https://github.com/nextstrain/seasonal-flu/blob/45bf4336d9485c1c9bfc44b09b384595d7685032/profiles/nextstrain-public.yaml#L78-L86
subsamples: &subsampling-scheme
  regions_except_europe:
    filters: --query "(passage_category != 'egg') & (region != 'Europe') & (ha == True) & (na == True)" --group-by region year month --subsample-max-sequences 2700 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
    # Note that a priority of "titers" has a special meaning in the flu
    # workflow which is not portable to other pathogen workflows.
    priorities: "titers"
  europe:
    filters: --query "(passage_category != 'egg') & (region == 'Europe') & (ha == True) & (na == True)" --group-by country year month --subsample-max-sequences 300 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
    priorities: "titers"
  references:
    filters: --query "(is_reference == True)" --min-date {reference_min_date} --exclude {exclude} --exclude-where passage=egg
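
The templated "filters" strings above rely on build parameters being substituted for placeholders such as {min_date} and {exclude}. The sketch below shows one way that substitution could work, using Python's built-in `str.format`. It is an illustration under assumptions, not the seasonal-flu workflow's actual code: the function name `render_filters` and the parameter values (`2019-01-01`, `config/outliers.txt`) are hypothetical.

```python
# Hypothetical sketch: fill templated "filters" strings with build-specific values.
# Only the "references" group is shown to keep the example short.
scheme = {
    "references": {
        "filters": (
            '--query "(is_reference == True)" '
            "--min-date {reference_min_date} "
            "--exclude {exclude} "
            "--exclude-where passage=egg"
        ),
    },
}


def render_filters(subsamples, build_params):
    """Return a copy of the scheme with each "filters" template filled in.

    Placeholders missing from build_params raise KeyError; extra
    build parameters are ignored.
    """
    return {
        name: {**settings, "filters": settings["filters"].format(**build_params)}
        for name, settings in subsamples.items()
    }


# Hypothetical build parameters; a real build would supply its own values.
build_params = {
    "reference_min_date": "2019-01-01",
    "exclude": "config/outliers.txt",
    "min_date": "2021-01-01",  # unused by "references", ignored by str.format
}

rendered = render_filters(scheme, build_params)
print(rendered["references"]["filters"])
```

Because the scheme itself is never modified, the same dictionary can be rendered once per build with different parameters, which is the sharing the comments above describe.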
# Seasonal flu public builds with `augur subsample` YAML config.
regions_except_europe:
  filter: --query "(passage_category != 'egg') & (region != 'Europe') & (ha == True) & (na == True)" --group-by region year month --subsample-max-sequences 2700 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
  priorities:
    # Note that this type of priority requires some pathogen-specific and data-specific calculation per
    # lineage of flu. A more generic implementation might require a filename of priorities that the workflow
    # could generate per lineage on demand.
    type: titers
europe:
  filter: --query "(passage_category != 'egg') & (region == 'Europe') & (ha == True) & (na == True)" --group-by country year month --subsample-max-sequences 300 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
  priorities:
    type: titers
references:
  filter: --query "(is_reference == True)" --min-date {reference_min_date} --exclude {exclude} --exclude-where passage=egg
# Note that in this implementation "output" has to be treated as a
# keyword by the subsample command, so users cannot define a subsampling
# group with the name "output".
output:
  - regions_except_europe
  - europe
  - references
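
To make the note about the reserved "output" key concrete, here is a minimal sketch of how a command might separate subsampling groups from the output list once the YAML has been loaded into a dictionary. The function `split_subsample_config` and the shortened filter strings are hypothetical; this is not the behavior of any existing Augur command.

```python
# Hypothetical sketch: "output" is read as a keyword, so it can never be
# interpreted as the name of a subsampling group.
RESERVED_KEYS = {"output"}


def split_subsample_config(config):
    """Separate subsampling groups from the reserved "output" list."""
    if "output" not in config:
        raise ValueError('config must define an "output" list')

    output = list(config["output"])
    groups = {
        name: settings
        for name, settings in config.items()
        if name not in RESERVED_KEYS
    }

    # Every name listed under "output" must refer to a defined group.
    missing = [name for name in output if name not in groups]
    if missing:
        raise ValueError(f"output refers to undefined groups: {missing}")

    return groups, output


# Dictionary form of a config like the one above (filters shortened).
config = {
    "regions_except_europe": {"filter": "--group-by region year month"},
    "europe": {"filter": "--group-by country year month"},
    "references": {"filter": '--query "(is_reference == True)"'},
    "output": ["regions_except_europe", "europe", "references"],
}

groups, output = split_subsample_config(config)
print(sorted(groups))
print(output)
```

A group that a user tried to name "output" would be read as the output list instead, which is the limitation the comment describes.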
# augur subsample config approach but with direct one-to-one mapping of config
# key/value pairs to augur filter arguments/flags using hyphens in the key names.
#
# Note that seasonal flu builds can pass build parameters like the minimum
# date to a templated subsampling string to enable sharing of complex subsampling
# logic across multiple lineages, time resolutions, and segments. In
# this augur subsample implementation, we would need to create a separate
# subsample configuration file for each of the 4 public tree resolutions
# (e.g., "subsample_6m.yaml", "subsample_2y.yaml", etc.).
# I've left the variable notation with curly brackets here to indicate
# where in the configuration we would need to hardcode parameters.
regions_except_europe:
  query: "(passage_category != 'egg') & (region != 'Europe') & (ha == True) & (na == True)"
  group-by: region year month
  subsample-max-sequences: 2700
  min-date: {min_date}
  exclude: {exclude}
  exclude-where: passage=egg
  # Note: In this config, "priorities" conflicts directly with
  # the augur filter argument of the same name.
  priorities:
    type: titers
europe:
  query: "(passage_category != 'egg') & (region == 'Europe') & (ha == True) & (na == True)"
  group-by: country year month
  subsample-max-sequences: 300
  min-date: {min_date}
  exclude: {exclude}
  exclude-where: passage=egg
  priorities:
    type: titers
references:
  query: "(is_reference == True)"
  min-date: {reference_min_date}
  exclude: {exclude}
  exclude-where: passage=egg
output:
  - regions_except_europe
  - europe
  - references
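
The one-to-one mapping of hyphenated keys to augur filter arguments could be implemented roughly as follows. This is a sketch under assumptions: the helper `to_filter_args` is hypothetical, the set of keys treated as multi-valued reflects my reading of the augur filter options used above, and "priorities" is skipped because of the name conflict noted in the config. The date and file name are placeholder values standing in for {reference_min_date} and {exclude}.

```python
import shlex

# Assumption: these augur filter options accept several space-separated values.
MULTI_VALUE_KEYS = {"group-by", "exclude", "exclude-where"}

# "priorities" is a nested mapping in this config, so it cannot be passed
# straight through as the augur filter argument of the same name.
SKIPPED_KEYS = {"priorities"}


def to_filter_args(settings):
    """Translate config key/value pairs into a list of command-line arguments."""
    args = []
    for key, value in settings.items():
        if key in SKIPPED_KEYS:
            continue
        args.append(f"--{key}")
        if key in MULTI_VALUE_KEYS:
            # "region year month" becomes three separate arguments.
            args.extend(str(value).split())
        else:
            # Keep values such as the query expression as a single argument.
            args.append(str(value))
    return args


# Hypothetical hardcoded values in place of the curly-bracket placeholders.
references = {
    "query": "(is_reference == True)",
    "min-date": "2019-01-01",
    "exclude": "config/outliers.txt",
    "exclude-where": "passage=egg",
}

args = to_filter_args(references)
print(shlex.join(["augur", "filter", *args]))
```

Building a list of arguments rather than one long string avoids quoting problems with the query expression, since each list element reaches the command as a single argument.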