This is an attempt at using a gist to facilitate liveblogging in a static site. Thanks for joining me for the ride…

The event programme is available online. I'll be co-presenting a talk about using the figshare API with figshare's own Megan Hardeman on the Tuesday at 09.40.

Well, I’ve arrived and obtained biscuits and tea.

Day 1

Martin gives us the now standard housekeeping slide
Overview of the programme (see the link above)
I’m interested to hear about what they’ve been up to at Lancaster with their institutional RDM reporting dashboard
There will also be breakout groups tomorrow — I’m sure suggestions for these on the #rdmf18 hashtag will be welcome too, even if you can’t make it!

Keynote: What are the challenges of Data Science?

Prof Magnus Rattray, Professor of Computational & Systems Biology/Director of the Data Science Institute, University of Manchester

An example: Physics

Large Synoptic Survey Telescope (LSST): 3.2 Gpixel camera -> 2,000 exposures (= 20TB) per night -> 10 year survey = 100PB data
Large Hadron Collider (LHC): theoretical output of 68TB/s (!!!) -> about 1.5GB/s to disk -> 200PB total
Square Kilometre Array will produce more data than can be processed today, but will be curated and analysed over years
But this isn’t unexpected for physics: it’s being dealt with
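A quick sanity check on those numbers (my own arithmetic, not from the slides). It assumes decimal units and one set of exposures every single night, both of which are my assumptions. The raw LSST total computed this way comes out below the quoted 100PB, so that figure is evidently not just nights multiplied by 20TB.

```python
# Back-of-envelope check of the data volumes quoted in the talk.
# Assumptions (mine, not the speaker's): decimal units, observing every night.
TB_PER_PB = 1_000
GB_PER_TB = 1_000

lsst_tb_per_night = 20
survey_years = 10
lsst_raw_pb = lsst_tb_per_night * 365 * survey_years / TB_PER_PB

lhc_theoretical_gb_per_s = 68 * GB_PER_TB  # 68 TB/s expressed in GB/s
lhc_to_disk_gb_per_s = 1.5
lhc_reduction_factor = lhc_theoretical_gb_per_s / lhc_to_disk_gb_per_s

print(f"LSST raw exposures over {survey_years} years: {lsst_raw_pb:.0f} PB")
print(f"LHC keeps roughly 1 part in {lhc_reduction_factor:,.0f} of its theoretical output")
```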

Another example: Geography

Network analysis of 26m commuter journeys from 2011 census data
Classify journeys into 9 super-groups and a total of 40 groups
Individual journeys not interesting, but emerging patterns are
The tricky stuff is not the machine learning or analysis, but bringing together data from different sources

Mental health

Use of wearable devices to track location of people with mental illnesses
Handle missing data (e.g. due to mobile/GPS blackspots)
Classify places and activities
Overlay health status to identify patterns
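The talk didn't go into how the missing fixes were actually handled, so purely as an illustration, here is a minimal sketch of the naive approach: linear interpolation between the last and next known positions. The `fill_gaps` name, the evenly spaced samples and the track data are all my own assumptions; a real study would likely need something more careful.

```python
from typing import List, Optional, Tuple

Point = Optional[Tuple[float, float]]  # (lat, lon), or None where the fix was lost


def fill_gaps(track: List[Point]) -> List[Point]:
    """Linearly interpolate missing fixes between known neighbours.

    Assumes evenly spaced samples. Gaps at the very start or end are left
    as None because there is nothing to interpolate from.
    """
    filled = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            t = (i - left) / gap
            filled[i] = tuple(
                a + t * (b - a) for a, b in zip(track[left], track[right])
            )
    return filled


# Hypothetical track with a two-sample blackspot and a lost final fix
track = [(53.0, -2.0), None, None, (53.3, -2.3), None]
print(fill_gaps(track))
```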

Research is increasingly data driven

Bottom-up modelling: based on assumptions about microscopic principles; develop simulation, run and then compare to reality; refine assumptions
Data-driven modelling: identify measurable variables; fit a statistical model to data; make inferences and learn about system by identifying hidden variables
Increasingly connected: mixing “mechanistic” prior knowledge into data-driven models
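To make "fit a statistical model to data" concrete, here is about the simplest possible case: an ordinary least-squares straight line. This is my own toy example with made-up measurements, not something shown in the keynote, and `fit_line` is just an illustrative name.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up measurements of some observable variable
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted parameters are the "inference" here; the models discussed in the talk go much further by inferring hidden variables as well.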

Challenges for data science

Scalability
Complexity
Cleaning messy data (missing data, noise, poor formatting, poor/absent experimental design)
Human data (privacy, ethics)
Accessibility/availability (openness, reproducibility; e.g. clinicians who protect “their” data to safeguard their future career)

Example: genomics

Massive drop in cost of genome sequencing over the last decade
“It costs more to analyse a genome than to sequence it.” David Haussler
100k Genome project now collecting a huge number of genomes
But once you can sequence genomes you can examine much more: transcriptomics, epigenetics, proteomics
So we can now use this technology to investigate layer-upon-layer of different interacting systems and subsystems
E.g. asthma
- Good for a cohort study because a lot of people have asthma
- Inconsistency and complexity indicate multiple (sub-)diseases
- E.g. 2 different versions of CD14 gene are associated with different risk levels in different parts of the world
- Commonly thought to be a progression: eczema -> asthma -> rhinitis
- Large-scale analysis shows this progression only presents in a small fraction of the population: i.e. the commonly assumed progression is false

Towards genomic medicine

100k Genomes project: 30PB data held securely, restricted access through secure virtual desktop (“Inuvika”)
Privacy of individuals’ genomes is important but difficult

Next revolution: scaling down to single cells

Existing methods effectively take an average of ~10k cells
As well as looking at large populations of people, we can also go down to individual cell level
Single-cell methods show e.g. diverse sub-populations in particular cell types
Each cell is now a high-dimensional data point
E.g. can trace different mutations through sub-populations of tumour cells
Profile individual tumour cells circulating in the blood: can diagnose and design a drug regime based on a blood sample instead of an invasive biopsy
Sophisticated modelling required to disambiguate features of interest from multiple confounding factors

Dealing with the challenges

Data volume: move compute to the data (e.g. cloud solutions); will analysis be reproducible in the future, or even across current platforms?
Data analysis: scale up algorithms (e.g. deep learning, TensorFlow); use approximate methods; streaming data processing; clever tricks to avoid computationally-intensive tasks
- Things that used to be considered “software engineering” (e.g. object orientation, testing) are now important for everything
Data quality: big data often not collected for a single purpose, so no experimental design
Robust & reproducible research: record arbitrary modelling choices and vary them to test for robustness; hypothesis selection & p-hacking; keep track of all hypotheses considered (e.g. electronic lab notebook)
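One way to read "record arbitrary modelling choices and vary them" is to write every choice down explicitly and rerun the analysis over all combinations. The sketch below is my own interpretation with made-up data; the choices (an outlier cutoff and a summary statistic) and the `run_analysis` helper are hypothetical.

```python
from itertools import product
from statistics import mean, median

# Made-up measurements containing one suspicious value
data = [4.8, 5.1, 5.0, 4.9, 5.2, 9.7]

# Every arbitrary modelling choice is named and recorded here
choices = {
    "outlier_cutoff": [None, 8.0],
    "summary": {"mean": mean, "median": median},
}


def run_analysis(values, outlier_cutoff, summary):
    """Apply one combination of modelling choices and return the estimate."""
    kept = [v for v in values if outlier_cutoff is None or v <= outlier_cutoff]
    return choices["summary"][summary](kept)


# Rerun the analysis for every combination of choices
results = {
    (cutoff, summary): run_analysis(data, cutoff, summary)
    for cutoff, summary in product(choices["outlier_cutoff"], choices["summary"])
}
for combo, estimate in results.items():
    print(combo, round(estimate, 2))

spread = max(results.values()) - min(results.values())
print("spread across choices:", round(spread, 2))
```

A large spread suggests the conclusion depends on a choice that ought to be justified rather than picked quietly.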

Conclusions

Research is increasingly data-driven; data science ubiquitous
Big & complex data: people (especially statisticians and computer scientists) are already motivated to solve these
How do we motivate people to confront problems of messiness, human data, openness (or lack of)?

Day 2

Aaaand we're back again for day 2: a full day of content after yesterday's afternoon session

Case study: CRIS, Research Data & Institutional Reporting

Becky Gordon, Lancaster University

Research services view on data about research
Work quite closely with library: overlap primarily centred around Pure CRIS
Systems:
- HR, student information, costing/pFact, finance → Pure
- Pure → Departmental webpages, research directory, repository, data management, equipment register
Reporting
- Financial reports: monthly (really valued by senior academic staff) & annual
- Organisational unit performance
- Individual performance: promotions etc.
- External requirements: OA, REF, HESA, ResearchFish
Current project: strategic research management tool
- Reduce time spent manually generating reports
- Single hub with live, up-to-date data
Business questions - want data on:
- Awards (number, value)
- Applications (inc. success rates)
- Impact (publications, OA compliance, …?)
Process overview:
- Define data and pull out into a data warehouse
- Build reports on top of this (using Tableau)
- Additional internal exception reports to track things that might go wrong
- Data audit & cleaning
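To illustrate the kind of thing an exception report and a success-rate figure might involve, here is a small sketch. It has nothing to do with Lancaster's actual warehouse or Tableau setup: the record layout, field names and both functions are invented for the example.

```python
# Hypothetical application records as they might sit in a reporting warehouse
applications = [
    {"id": "A1", "department": "Physics", "value": 120_000, "outcome": "awarded"},
    {"id": "A2", "department": "Physics", "value": None, "outcome": "rejected"},
    {"id": "A3", "department": None, "value": 80_000, "outcome": "awarded"},
    {"id": "A4", "department": "History", "value": 40_000, "outcome": "pending"},
]


def exception_report(records, required=("department", "value", "outcome")):
    """Return {record id: [missing fields]} for records that need cleaning."""
    problems = {}
    for rec in records:
        missing = [field for field in required if rec.get(field) is None]
        if missing:
            problems[rec["id"]] = missing
    return problems


def success_rate(records):
    """Share of decided applications that were awarded (pending ones excluded)."""
    decided = [r for r in records if r["outcome"] in ("awarded", "rejected")]
    if not decided:
        return None
    return sum(r["outcome"] == "awarded" for r in decided) / len(decided)


print(exception_report(applications))
print(success_rate(applications))
```

Whether pending applications count towards the rate is exactly the sort of reporting criterion that would need agreeing up front.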
Challenges
- Differences in reporting criteria
- Not enough good-quality data to work with
- Difficult to make historical comparisons with older reports
Next steps
- Continue to produce manual reports & develop tool & Tableau reports in parallel
- Agree reporting criteria with senior management
- Ongoing data cleaning

Case study: data repository APIs

No updates from me for a while because I’m part of this talk!

Our slides are available on figshare (of course!)

Managing research throughout its lifecycle

Prof Paul Jeffreys, Institute of Cancer Research

About the ICR
- 8 diverse research divisions
- Able to recharge infrastructure costs to research so can fund development
- Future plans: dynamic adaptive therapy
  - As you treat it in an individual, cancer mutates and evolves so you have to keep changing treatment to keep up
  - Data must be live and online
- Big data is a key pillar in current strategic plan
HPC infrastructure
- 1,800 cores × 12–16 GB, designed for parallel workload
- Dominated by next generation sequencing; approx 70% usage
- Jisc data centre in Slough
Architecture
- 6PiB provisioned (expandable to at least 20PiB)
- 2 tier: tier 1 is fast storage (2PiB); tier 2 an object store (4PiB)
- NAS layer on top so that storage tiers are a black box for users
Policy-based migration from tier 1 → 2
- Typically migrated if not used for 90 days, but other possibilities exist
- Migrated to long-term archive at some later date
- Most files mirrored across 3 sites; smaller (<10MB) files only 2 sites
- Object store cannot provide quotas, so charge based on actual usage
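The migration and mirroring rules are simple enough to sketch as code. This is only my reading of the slide, not the ICR's actual implementation: the `StoredFile` type, the function names, the exact boundary conditions and the use of binary megabytes are all assumptions.

```python
from dataclasses import dataclass

MIB = 1024 * 1024  # assumption: binary units, although the notes just say "10MB"


@dataclass
class StoredFile:
    path: str
    size_bytes: int
    days_since_last_access: int


def target_tier(f: StoredFile, idle_days: int = 90) -> int:
    """Tier 1 is fast storage; tier 2 is the object store."""
    return 2 if f.days_since_last_access >= idle_days else 1


def replica_sites(f: StoredFile, small_file_limit: int = 10 * MIB) -> int:
    """Smaller files are mirrored to 2 sites, everything else to 3."""
    return 2 if f.size_bytes < small_file_limit else 3


# Hypothetical files
recent_run = StoredFile("run42.bam", 50 * 1024 * MIB, days_since_last_access=3)
old_notes = StoredFile("notes.txt", 4 * 1024, days_since_last_access=200)

for f in (recent_run, old_notes):
    print(f.path, "-> tier", target_tier(f), "/", replica_sites(f), "sites")
```

The 90-day threshold is a parameter here because the talk mentioned that other policies are possible.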
Projects to develop 2 new components for sharing & syncing; also currently using a Dropbox Business service
Looking for a metadata catalogue solution
- Many solutions (e.g. iRODS, DSpace) aimed at facilities or libraries
- Need something easy to use for scientists, and off-the-shelf (able to deliver a proof of concept in one person-month)
- Open to suggestions!

Scaling and empowering cultural change

Shoaib Sufi, Community Lead, Software Sustainability Institute (SSI)

SSI: national facility since 2010 to "cultivate better, more sustainable research software to enable world-class research"
- Software development: to build and maintain expertise in software
- Training: essential software skills for researchers
- Policy: campaigning for research software support and career recognition/development for research software engineers
- Community: workshops & fellowship
- Outreach: website, blog, social media
Fellowship programme
- £3000 travel/event bursary for people who want to improve research software
- Funded by support grants from research councils
- Turns out that "SSI Fellow" is quite a sought-after badge of recognition
- Fellows = ambassadors
What makes a good fellow?
- Strong plan: novelty (for institution/domain); have the skills/experience to succeed; will make a difference
- Content: demonstrate ability to create impact
- Communications skills
Typical activities
- Workshops/conferences/training (including tailored Carpentries)
- Promote SSI and contribute to its success
- Contribute to SSI blog
Some amazing lasting outcomes from the fellowship programme
- Development of services (Melody Sandells)
- Contribution to RSE conference & organisation (Alys Brett)
- Library Carpentry (James Baker)
- recipy workflow management software (Robin Wilson)
- Open source versions of common commercial research software (Robin Grant)
- Data science for doctors training (Steve Harris)
- Establishing reproducible research as standard in a major research group (Stephen Eglen)
Conclusions
- The right people to effect change are in the research community
- Need support and community
- Cross-pollinate ideas across different domains
Collaborations Workshop 2018 focuses on the themes of Culture Change, Productivity, Sustainability

Lunchtime!

And now it's time for lunch, but after that there will be three parallel breakout groups:

Supporting resources for RDM: toolkits & workflows
Integrating data systems & catalogues
Impact & metrics: reporting & evidencing success

Breakout group feedback

1. Supporting resources for RDM: toolkits & workflows

This includes some information from surveys and interviews around the Jisc research data toolkit project.

Presenting content through journeys is a useful approach
If available, quite a lot of people would use resources in an RDM toolkit to augment their teaching
Preferred mechanism would be working group of HEI-based RDM professionals with Jisc support
Interesting possible features: institutional subdomains with customisable content; CC-BY license; funder policy summaries; regular newsletters

2. Integrating data systems & catalogues

Important themes: ownership, provenance, privacy
Audit trails important, but
