@wbazant
Last active October 16, 2018 09:52
ParaSite 11

Lab meeting 16.10.2018

After WBPS11

S. mansoni, F. hepatica, H. contortus
A large release that turned out to be quite disruptive. In response:

  • S. mansoni mapping made available as cross-references
  • Blog post telling people how they can cope

Comments in ParaSite

As of a week ago there were three comments on the live site, by:

  • Tuan (for test, and deleted)
  • me (later deleted)
  • Alan (our best comment)

RNASeq data

I have little data on usage. The people in our lab to whom I demonstrated the feature seem to use it via the Ensembl genome browser. I designed it for JBrowse; support for other browsers was added later, as a bit of a bolt-on.

WBPS12

Genomes

  • Acrobeloides nanus, a clade IV nematode
  • Ancylostoma ceylanicum: new annotation
  • Hymenolepis microstoma
  • Meloidogyne arenaria, alternative assembly
  • Meloidogyne graminicola, draft assembly
  • Taenia multiceps
  • S. mansoni annotation update
  • WormBase core species up to version WS267.

Better C. elegans references

I spent ~three weeks on the xref pipeline to incorporate a C. elegans protein mapping from WormBase.
Result: more accurate and complete UniProt, RefSeq mRNA, RefSeq protein entries.

Archiving IDs

I’ve adapted Ensembl’s ID mapping pipeline. It is based on exon-to-exon matching with exonerate, and the exon matches are then propagated to the transcript and gene level.
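To illustrate the idea of propagating exon matches upwards, here is a minimal sketch. It is not Ensembl's actual scoring scheme: the exon scores, all identifiers, the normalisation, and the 0.5 threshold are assumptions made up for the example.

```python
from itertools import product

# Hypothetical exon-level similarity scores in [0, 1], standing in for
# what an exonerate alignment step might produce.
exon_scores = {
    ("old_e1", "new_e1"): 1.0,
    ("old_e2", "new_e2"): 0.9,
    ("old_e3", "new_e4"): 0.4,
}

# Hypothetical transcript -> exons and gene -> transcripts memberships.
old_transcripts = {"old_t1": ["old_e1", "old_e2"], "old_t2": ["old_e3"]}
new_transcripts = {"new_t1": ["new_e1", "new_e2"], "new_t2": ["new_e4", "new_e5"]}
old_genes = {"old_g1": ["old_t1"], "old_g2": ["old_t2"]}
new_genes = {"new_g1": ["new_t1"], "new_g2": ["new_t2"]}


def transcript_score(old_exons, new_exons):
    """Sum each old exon's best match among the new exons,
    normalised by the larger of the two exon counts."""
    total = sum(
        max((exon_scores.get((o, n), 0.0) for n in new_exons), default=0.0)
        for o in old_exons
    )
    return total / max(len(old_exons), len(new_exons))


def gene_score(old_ts, new_ts):
    """A gene pair scores as well as its best transcript pair."""
    return max(
        transcript_score(old_transcripts[o], new_transcripts[n])
        for o, n in product(old_ts, new_ts)
    )


def map_genes(threshold=0.5):
    """Map each old gene to its best-scoring new gene, or None if below threshold."""
    mapping = {}
    for old_id, old_ts in old_genes.items():
        best_id, best = max(
            ((new_id, gene_score(old_ts, new_ts)) for new_id, new_ts in new_genes.items()),
            key=lambda pair: pair[1],
        )
        mapping[old_id] = best_id if best >= threshold else None
    return mapping


gene_mapping = map_genes()
```

In this toy data, `old_g1` maps to `new_g1`, while `old_g2` falls below the threshold and stays unmapped.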

Mapping success rate

| Genome | Genes in current assembly | WBPS release with previous assembly | Genes successfully mapped | Genes in previous assembly | Fraction successfully mapped |
|---|---|---|---|---|---|
| ancylostoma_ceylanicum_prjna72583 | 11783 | WBPS11 | 7564 | 15892 | 0.476 |
| ascaris_suum_prjna62057 | 17974 | WBPS9 | 9468 | 15260 | 0.620 |
| fasciola_hepatica_prjeb25283 | 16806 | WBPS10 | 7564 | 22676 | 0.334 |
| haemonchus_contortus_prjeb506 | 19430 | WBPS10 | 11439 | 21869 | 0.523 |
| meloidogyne_incognita_prjeb8714 | 45351 | WBPS10 | 11977 | 19212 | 0.623 |

These success rates are already 5 to 20% higher than those from running the pipeline with default parameter values.
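For reference, the fraction reported for each genome is the number of successfully mapped genes divided by the number of genes in the previous assembly. A small sketch that recomputes it from the counts above (the variable names are mine):

```python
# (genome, genes successfully mapped, genes in previous assembly), as listed above
rows = [
    ("ancylostoma_ceylanicum_prjna72583", 7564, 15892),
    ("ascaris_suum_prjna62057", 9468, 15260),
    ("fasciola_hepatica_prjeb25283", 7564, 22676),
    ("haemonchus_contortus_prjeb506", 11439, 21869),
    ("meloidogyne_incognita_prjeb8714", 11977, 19212),
]

# Fraction of previous-assembly genes that received a mapping, to 3 decimals.
fractions = {
    genome: round(mapped / previous, 3) for genome, mapped, previous in rows
}

for genome, fraction in fractions.items():
    print(f"{genome}\t{fraction:.3f}")
```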

Handling unmapped genes

Unmapped genes get killed, and:

  • user can search for them, and see what happened (i.e. they got killed)
  • user can get the protein sequence

RNASeq data: Expression querying

Where we are

  • Lots of S. mansoni data from our lab is published and available in ENA
  • Comprehensive treatment of all RNASeq data sets in ENA for our species: currently over 10k runs with metadata
  • No quantitative results: we only show alignments in the genome browser

Goal: selective treatment and deeper integration of some data sets

  • profiling baseline expression across sexes, life stages, and organism parts
  • differential expression across contrasts
  • basic presence/absence information (whether a gene has expression evidence or not) when this is all we can tell
  • aggregated information on gene pages
  • query interface

Required effort

  • shortlist valuable experiments
  • schema for metadata
  • curate data in that format
  • develop analyses (wrappers around existing tools like DESeq2)
  • pipelines to run above
  • design database schema
  • store analysis results in the database
  • query code for gene pages
  • UI design / implementation for gene pages
  • integrating queries - the hard part!
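As a rough illustration of the "design schema, store results, query for gene pages" steps, here is a minimal sketch using SQLite. The table names, column names, and example rows are all hypothetical; nothing here reflects a schema that has actually been decided on.

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contrast (
        contrast_id INTEGER PRIMARY KEY,
        study       TEXT NOT NULL,
        description TEXT NOT NULL
    );
    CREATE TABLE differential_expression (
        gene_id          TEXT NOT NULL,
        contrast_id      INTEGER NOT NULL REFERENCES contrast (contrast_id),
        log2_fold_change REAL NOT NULL,
        adjusted_p_value REAL NOT NULL,
        PRIMARY KEY (gene_id, contrast_id)
    );
    """
)

# Made-up analysis results, e.g. as a DESeq2 wrapper might emit them.
conn.execute(
    "INSERT INTO contrast VALUES (1, 'study_1', 'adult female vs adult male')"
)
conn.executemany(
    "INSERT INTO differential_expression VALUES (?, ?, ?, ?)",
    [("gene_A", 1, 2.5, 0.001), ("gene_B", 1, -0.2, 0.8)],
)


def results_for_gene(gene_id, max_p=0.05):
    """Return the significant contrasts for one gene, as a gene page might show them."""
    return conn.execute(
        """
        SELECT c.study, c.description, d.log2_fold_change, d.adjusted_p_value
        FROM differential_expression AS d
        JOIN contrast AS c USING (contrast_id)
        WHERE d.gene_id = ? AND d.adjusted_p_value <= ?
        ORDER BY d.adjusted_p_value
        """,
        (gene_id, max_p),
    ).fetchall()
```

A per-gene lookup like this is the easy part; it does not address bulk queries across many genes and data sets.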

Querying

We could use BioMart to integrate the data: it’s limited and hard to work with, but it’s our best bulk query tool.

Filters and attributes

Attributes: TPM and standard deviation for each group of samples, or fold change and p-value per contrast for differential expression.
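A minimal sketch of how the per-group TPM attributes could be derived, assuming the attribute is the mean TPM and sample standard deviation across the runs in each group. The run names, group labels, and TPM values are invented for the example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical run -> sample group assignment (from curated metadata).
sample_groups = {
    "run_1": "adult_male",
    "run_2": "adult_male",
    "run_3": "adult_female",
    "run_4": "adult_female",
}

# Hypothetical TPM values for one gene, per run.
gene_tpm = {"run_1": 10.0, "run_2": 14.0, "run_3": 2.0, "run_4": 4.0}


def summarise_by_group(tpm_by_run, groups):
    """Return {group: (mean TPM, sample standard deviation)} for one gene."""
    values = defaultdict(list)
    for run, tpm in tpm_by_run.items():
        values[groups[run]].append(tpm)
    return {
        group: (mean(tpms), stdev(tpms) if len(tpms) > 1 else 0.0)
        for group, tpms in values.items()
    }


summary = summarise_by_group(gene_tpm, sample_groups)
```

Each `(mean, standard deviation)` pair would then become two attributes per sample group.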

Filters are hard - I would have to split the above values into ranges and add a filter for each group of samples or contrast, which doesn’t scale well.
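To make the scaling problem concrete, here is a small sketch of range-based filters. The range boundaries and the filter naming scheme are assumptions for illustration, not BioMart configuration.

```python
from bisect import bisect_right

# Hypothetical TPM range boundaries; each range is [lower, upper).
TPM_BOUNDS = [1.0, 10.0, 100.0]
TPM_LABELS = ["below_1", "1_to_10", "10_to_100", "above_100"]


def tpm_range(value):
    """Return the label of the range a TPM value falls into."""
    return TPM_LABELS[bisect_right(TPM_BOUNDS, value)]


def filter_names(groups):
    """One boolean filter per (sample group, range) combination."""
    return [f"{group}__tpm_{label}" for group in groups for label in TPM_LABELS]


groups = ["adult_male", "adult_female", "schistosomula"]
filters = filter_names(groups)
print(len(filters))  # number of groups times number of ranges
```

The filter count is the number of groups (or contrasts) multiplied by the number of ranges, so every added experiment adds another block of filters to the interface.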
