<title>ParaSite 11</title>

Lab meeting 16.10.2018

After WBPS11

Updates of popular genomes

S. mansoni, F. hepatica, H. contortus
This was a large release, and it turned out to be quite disruptive. In response:

S. mansoni mapping made available as cross-references
Blog post telling people how they can cope

Comments in ParaSite

As of a week ago there were three comments on the live site, by:

Tuan (for test, and deleted)
me (later deleted)
Alan (our best comment)

RNASeq data

I have little data on usage. The people I demonstrated the feature to in our lab seem to interact with it via the Ensembl genome browser. I designed it for the JBrowse browser; support elsewhere was added later, as a bit of a bolt-on.

WBPS12

Genomes

Acrobeloides nanus, a clade IV nematode
Ancylostoma ceylanicum: new annotation
Hymenolepis microstoma
Meloidogyne arenaria, alternative assembly
Meloidogyne graminicola, draft assembly
Taenia multiceps
S. mansoni annotation update

WormBase core species up to version WS267.

Better C. elegans references

I spent ~three weeks on the xref pipeline to incorporate a C. elegans protein mapping from WormBase.
Result: more accurate and complete UniProt, RefSeq mRNA, RefSeq protein entries.

Archiving IDs

I’ve adapted Ensembl’s ID mapping pipeline, based on exon-on-exon matching with exonerate that gets propagated to transcript and gene level.
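
To illustrate the propagation idea, here is a minimal sketch. It is not the scoring used by the Ensembl pipeline: it assumes exon-level match scores between 0 and 1 are already available (for example derived from exonerate alignments), that each exon matches at most one counterpart, and that a parent's score is the sum of its children's scores divided by the larger child count. The function names, IDs and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def propagate_scores(child_scores, old_parent_of, new_parent_of):
    """Sum child-level match scores into parent-level scores.

    child_scores: {(old_child_id, new_child_id): score in [0, 1]}
    old_parent_of / new_parent_of: {child_id: parent_id}
    Each parent pair's score is normalised by the larger child count,
    so a pair only reaches 1.0 when all children on both sides match.
    """
    old_sizes = defaultdict(int)
    for parent in old_parent_of.values():
        old_sizes[parent] += 1
    new_sizes = defaultdict(int)
    for parent in new_parent_of.values():
        new_sizes[parent] += 1

    totals = defaultdict(float)
    for (old_child, new_child), score in child_scores.items():
        totals[(old_parent_of[old_child], new_parent_of[new_child])] += score

    return {
        (old_p, new_p): total / max(old_sizes[old_p], new_sizes[new_p])
        for (old_p, new_p), total in totals.items()
    }


def best_mappings(parent_scores, threshold=0.5):
    """Keep, for each old ID, the best-scoring new ID at or above the threshold."""
    best = {}
    for (old_p, new_p), score in parent_scores.items():
        if score >= threshold and score > best.get(old_p, (None, 0.0))[1]:
            best[old_p] = (new_p, score)
    return {old_p: new_p for old_p, (new_p, _) in best.items()}


# Toy data (hypothetical IDs): exons -> transcripts -> genes.
exon_scores = {("e1", "E1"): 1.0, ("e2", "E2"): 0.9, ("e3", "E9"): 0.2}
transcript_scores = propagate_scores(
    exon_scores,
    {"e1": "t1", "e2": "t1", "e3": "t2"},
    {"E1": "T1", "E2": "T1", "E9": "T2"},
)
gene_scores = propagate_scores(
    transcript_scores,
    {"t1": "g1", "t2": "g2"},
    {"T1": "G1", "T2": "G2"},
)
gene_mapping = best_mappings(gene_scores)
print(gene_mapping)  # {'g1': 'G1'}; g2 falls below the threshold and stays unmapped
```

The same function is applied twice, once per level, which is the point of the sketch: the only primary evidence is at exon level, and everything above it is derived.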

Mapping success rate

Genome	number of genes in current assembly	release of WBPS with previous assembly	genes successfully mapped	genes in previous assembly	fraction successfully mapped
ancylostoma_ceylanicum_prjna72583	11783	WBPS11	7564	15892	0.476
ascaris_suum_prjna62057	17974	WBPS9	9468	15260	0.620
fasciola_hepatica_prjeb25283	16806	WBPS10	7564	22676	0.334
haemonchus_contortus_prjeb506	19430	WBPS10	11439	21869	0.523
meloidogyne_incognita_prjeb8714	45351	WBPS10	11977	19212	0.623

This already maps 5 to 20% more genes than running the pipeline with default parameter values.
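
For reference, the last column of the table is the number of genes successfully mapped divided by the number of genes in the previous assembly. The short sketch below recomputes it from the table rows; the variable names are mine.

```python
# (genome, genes successfully mapped, genes in previous assembly), copied from the table
rows = [
    ("ancylostoma_ceylanicum_prjna72583", 7564, 15892),
    ("ascaris_suum_prjna62057", 9468, 15260),
    ("fasciola_hepatica_prjeb25283", 7564, 22676),
    ("haemonchus_contortus_prjeb506", 11439, 21869),
    ("meloidogyne_incognita_prjeb8714", 11977, 19212),
]

# Fraction of the previous assembly's genes that received a mapping.
fractions = {genome: mapped / previous for genome, mapped, previous in rows}

for genome, fraction in fractions.items():
    print(f"{genome}\t{fraction:.3f}")
```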

Handling unmapped genes

Unmapped genes get killed, and:

user can search for them, and see what happened (i.e. they got killed)
user can get the protein sequence

RNASeq data: Expression querying

Where we are

Lots of S. mansoni data from our lab published and available in ENA
Comprehensive treatment of all RNASeq data sets in ENA for our species: currently over 10k runs with metadata
No quantitative results - we only show alignments in the genome browser

Goal: selective treatment and deeper integration of some data sets

profiling baseline expression across sexes, life stages, and organism parts
differential expression across contrasts
minimally informative data - whether a gene has expression evidence or not - when this is all we can tell
aggregated information on gene pages
query interface

Required effort

shortlist valuable experiments
schema for metadata
curate data in that format
develop analyses (wrappers around existing tools like DESeq2)
pipelines to run the above analyses
design database schema
store analysis results in the database
query code for gene pages
UI design / implementation for gene pages
integrating queries - the hard part!
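
To make the schema, storage and gene-page query steps in the list above concrete, here is one possible shape for them, sketched with SQLite. Nothing here is a decided design: the table names, columns and example values are all hypothetical, and a real schema would also need tables for contrasts and differential expression results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per curated group of samples (hypothetical columns).
    CREATE TABLE sample_group (
        group_id    INTEGER PRIMARY KEY,
        study       TEXT NOT NULL,
        description TEXT NOT NULL
    );
    -- One row per gene and sample group: baseline expression summary.
    CREATE TABLE baseline_expression (
        gene_id  TEXT NOT NULL,
        group_id INTEGER NOT NULL REFERENCES sample_group(group_id),
        mean_tpm REAL NOT NULL,
        sd_tpm   REAL,
        PRIMARY KEY (gene_id, group_id)
    );
    """
)

# Made-up example rows, only to exercise the query.
conn.execute("INSERT INTO sample_group VALUES (1, 'study_A', 'adult female')")
conn.execute("INSERT INTO baseline_expression VALUES ('gene_1', 1, 12.0, 2.8)")


def expression_for_gene(conn, gene_id):
    """Return (study, description, mean_tpm, sd_tpm) rows for a gene page."""
    return conn.execute(
        """
        SELECT g.study, g.description, e.mean_tpm, e.sd_tpm
        FROM baseline_expression e
        JOIN sample_group g USING (group_id)
        WHERE e.gene_id = ?
        ORDER BY g.study, g.description
        """,
        (gene_id,),
    ).fetchall()


rows_for_gene = expression_for_gene(conn, "gene_1")
print(rows_for_gene)
```

A per-gene lookup like this is the easy case. It does not address the integrated, cross-gene queries flagged as the hard part.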

Querying

We could use BioMart for purposes of integrating the data: it’s limited and hard to work with, but it’s our best bulk query tool.

Filters and attributes

Attributes: TPM and standard deviation for each group of samples, or fold change and p-value per contrast for differential expression.
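
As a sketch of the baseline attributes, the per-group summary could be computed as below. It assumes TPM values per sample have already been produced by an upstream quantification step. The sample names, group labels and the choice of the sample standard deviation are my assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_by_group(tpm_by_sample, group_of_sample):
    """Return {group: (mean TPM, sample standard deviation or None)} for one gene.

    The standard deviation is None for groups with a single sample,
    where it is undefined.
    """
    grouped = defaultdict(list)
    for sample, tpm in tpm_by_sample.items():
        grouped[group_of_sample[sample]].append(tpm)
    return {
        group: (mean(values), stdev(values) if len(values) > 1 else None)
        for group, values in grouped.items()
    }


# Hypothetical TPM values for one gene across three samples.
tpm_by_sample = {"s1": 10.0, "s2": 14.0, "s3": 3.0}
group_of_sample = {"s1": "adult female", "s2": "adult female", "s3": "cercariae"}

summary = summarise_by_group(tpm_by_sample, group_of_sample)
for group, (mean_tpm, sd_tpm) in summary.items():
    print(group, mean_tpm, sd_tpm)
```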

Filters are hard: I would have to split the above values into ranges and add a filter for each group of samples or contrast, which doesn’t scale well.
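
A small sketch of why this grows quickly. The range boundaries, the function names and the example counts are hypothetical. The assumption is one ranged filter per group of samples, plus one per reported value (fold change and p-value) for each contrast.

```python
from bisect import bisect_right

# Hypothetical range boundaries for binning TPM values.
TPM_BOUNDS = [1.0, 10.0, 100.0]
TPM_LABELS = ["<1", "1-10", "10-100", ">=100"]


def tpm_range(tpm):
    """Map a TPM value to the label of the range it falls in."""
    return TPM_LABELS[bisect_right(TPM_BOUNDS, tpm)]


def filters_needed(n_groups, n_contrasts, values_per_contrast=2):
    """Count filters: one per sample group, plus one per value for each contrast."""
    return n_groups + n_contrasts * values_per_contrast


print(tpm_range(0.5), tpm_range(42.0), tpm_range(100.0))  # <1 10-100 >=100
print(filters_needed(n_groups=30, n_contrasts=12))  # 54
```

Every additional curated study adds its own groups and contrasts, so the number of filters grows with the amount of data rather than staying fixed.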

wbazant/Schisto_studies.md