This is an attempt at using a gist to facilitate liveblogging in a static site. Thanks for joining me for the ride…
The event programme is available online. I'll be co-presenting a talk about using the figshare API with figshare's own Megan Hardeman on the Tuesday at 09.40.
Well, I’ve arrived and obtained biscuits and tea.
- Martin gives us the now-standard housekeeping slide
- Overview of the programme (see the link above)
- I’m interested to hear about what they’ve been up to at Lancaster with their institutional RDM reporting dashboard
- There will also be breakout groups tomorrow — I’m sure suggestions for these on the #rdmf18 hashtag will be welcome too, even if you can’t make it!
Prof Magnus Rattray, Professor of Computational & Systems Biology/Director of the Data Science Institute, University of Manchester
- Large Synoptic Survey Telescope (LSST): 3.2 Gpixel camera -> 2,000 exposures (= 20TB) per night -> 10 year survey = 100PB data
- Large Hadron Collider (LHC): theoretical output of 68TB/s (!!!) -> about 1.5GB/s to disk -> 200PB total
- Square Kilometre Array will produce more data than can be processed today, but will be curated and analysed over years
- But this isn’t unexpected for physics: it’s being dealt with
- Network analysis of 26m commuter journeys from 2011 census data
- Classify journeys into 9 super-groups and a total of 40 groups
- Individual journeys not interesting, but emerging patterns are
- The tricky stuff is not the machine learning or analysis, but bringing together data from different sources
- Use of wearable devices to track location of people with mental illnesses
- Handle missing data (e.g. due to mobile/GPS blackspots)
- Classify places and activities
- Overlay health status to identify patterns
- Bottom-up modelling: based on assumptions about microscopic principles; develop simulation, run and then compare to reality; refine assumptions
- Data-driven modelling: identify measurable variables; fit a statistical model to data; make inferences and learn about system by identifying hidden variables
- Increasingly connected: mixing “mechanistic” prior knowledge into data-driven models
- Scalability
- Complexity
- Cleaning messy data (missing data, noise, poor formatting, poor/absent experimental design)
- Human data (privacy, ethics)
- Accessibility/availability (openness, reproducibility; e.g. clinicians who protect “their” data to safeguard their future career)
- Massive drop in cost of genome sequencing over the last decade
- “It costs more to analyse a genome than to sequence it.” David Haussler
- 100k Genomes project now collecting a huge number of genomes
- But once you can sequence genomes you can examine much more: transcriptomics, epigenetics, proteomics
- So we can now use this technology to investigate layer-upon-layer of different interacting systems and subsystems
- E.g. asthma
- Good for a cohort study because a lot of people have asthma
- Inconsistency and complexity indicate multiple (sub-)diseases
- E.g. 2 different versions of CD14 gene are associated with different risk levels in different parts of the world
- Commonly thought to be a progression: eczema -> asthma -> rhinitis
- Large-scale analysis shows this progression presents in only a small fraction of the population, i.e. the common belief is false
- 100k Genomes project: 30PB data held securely, restricted access through secure virtual desktop (“Inuvika”)
- Privacy of individuals’ genomes is important but difficult
- Existing methods effectively take an average of ~10k cells
- As well as looking at large populations of people, we can also go down to individual cell level
- Single-cell methods show e.g. diverse sub-populations in particular cell types
- Each cell is now a high-dimensional data point
- E.g. can trace different mutations through sub-populations of tumour cells
- Profile individual tumour cells circulating in the blood: can diagnose and design a drug regime based on a blood sample instead of an invasive biopsy
- Sophisticated modelling required to disambiguate features of interest from multiple confounding factors
- Data volume: move compute to the data (e.g. cloud solutions); will analysis be reproducible in the future, or even across current platforms?
- Data analysis: scale up algorithms (e.g. deep learning, TensorFlow); use approximate methods; streaming data processing; clever tricks to avoid computationally-intensive tasks
- Things that used to be considered “software engineering” (e.g. object orientation, testing) are now important for everything
- Data quality: big data often not collected for a single purpose, so no experimental design
- Robust & reproducible research: record arbitrary modelling choices and vary them to test for robustness; hypothesis selection & p-hacking; keep track of all hypotheses considered (e.g. electronic lab notebook)
- Research is increasingly data-driven; data science ubiquitous
- Big & complex data: people (especially statisticians and computer scientists) are already motivated to solve these
- How do we motivate people to confront problems of messiness, human data, openness (or lack of)?
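
The point above about recording arbitrary modelling choices and varying them to test for robustness is easy to sketch. This is my own minimal illustration, not anything shown in the talk: the readings are made up, and the two "choices" (how much to trim, and whether to drop or zero-fill missing values) are just stand-ins for whatever decisions a real analysis involves.

```python
import itertools
import statistics


def trimmed_mean(values, trim_fraction):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


def robustness_sweep(values, trim_fractions, fill_values):
    """Run the same estimate under every combination of modelling choices.

    None entries stand in for missing readings; each entry of `fill_values`
    is either the string "drop" or a number to impute in their place.
    Every choice is logged next to its result so nothing is hidden.
    """
    log = []
    for trim, fill in itertools.product(trim_fractions, fill_values):
        if fill == "drop":
            cleaned = [v for v in values if v is not None]
        else:
            cleaned = [fill if v is None else v for v in values]
        log.append({"trim": trim, "fill": fill,
                    "estimate": trimmed_mean(cleaned, trim)})
    return log


# Invented readings with two gaps and one suspicious outlier.
readings = [4.1, 3.9, None, 4.3, 12.0, 4.0, None, 4.2]
results = robustness_sweep(readings,
                           trim_fractions=[0.0, 0.2],
                           fill_values=["drop", 0.0])
for row in results:
    print(row)

# How far the answer moves across the choices considered.
estimates = [row["estimate"] for row in results]
spread = max(estimates) - min(estimates)
print("spread:", spread)
```

If the spread is large compared with the effect you care about, the conclusion depends on the arbitrary choices rather than the data, and the log doubles as a record of every variant that was tried.
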
- Aaaand we're back again for day 2: a full day of content after yesterday's afternoon session
Becky Gordon, Lancaster University
- Research services view on data about research
- Work quite closely with library: overlap primarily centred around Pure CRIS
- Systems:
- HR, student information, costing/pFact, finance → Pure
- Pure → Departmental webpages, research directory, repository, data management, equipment register
- Reporting
- Financial reports: monthly (really valued by senior academic staff) & annual
- Organisational unit performance
- Individual performance: promotions etc.
- External requirements: OA, REF, HESA, ResearchFish
- Current project: strategic research management tool
- Reduce time spent manually generating reports
- Single hub with live, up-to-date data
- Business questions - want data on:
- Awards (number, value)
- Applications (inc. success rates)
- Impact (publications, OA compliance, …?)
- Process overview:
- Define data and pull out into a data warehouse
- Build reports on top of this (using Tableau)
- Additional internal exception reports to track things that might go wrong
- Data audit & cleaning
- Challenges
- Differences in reporting criteria
- Not enough good-quality data to work with
- Difficult to make historical comparisons with older reports
- Next steps
- Continue to produce manual reports while developing the tool and Tableau reports in parallel
- Agree reporting criteria with senior management
- Ongoing data cleaning
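
To make the business questions above a bit more concrete, here is a rough sketch of how award counts, award values and success rates could be derived from application records, together with a simple exception report. It is purely illustrative: the `Application` fields and the sample records are my own invention, and Lancaster's actual warehouse schema and Tableau reports were not shown in this detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Application:
    department: str
    value: Optional[float]   # value in GBP; None if not recorded
    awarded: Optional[bool]  # None while the outcome is still unknown


def summarise(applications):
    """Per-department application count, award count/value and success rate."""
    summary = {}
    for app in applications:
        row = summary.setdefault(
            app.department,
            {"applications": 0, "decided": 0, "awards": 0, "award_value": 0.0},
        )
        row["applications"] += 1
        if app.awarded is None:
            continue  # pending applications do not count towards the rate
        row["decided"] += 1
        if app.awarded:
            row["awards"] += 1
            row["award_value"] += app.value or 0.0
    for row in summary.values():
        decided = row["decided"]
        row["success_rate"] = row["awards"] / decided if decided else None
    return summary


def exception_report(applications):
    """Flag records likely to distort reports: awards with no value recorded."""
    return [a for a in applications if a.awarded and a.value is None]


# Invented sample data.
records = [
    Application("Physics", 100000.0, True),
    Application("Physics", 50000.0, False),
    Application("Physics", 20000.0, None),
    Application("History", None, True),
]
report = summarise(records)
flagged = exception_report(records)
print(report)
print(flagged)
```

Note how much hangs on definitions: whether pending applications count towards the success rate is exactly the kind of reporting criterion that has to be agreed with senior management.
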
No updates from me for a while because I’m part of this talk!
Our slides are available on figshare (of course!)
Prof Paul Jeffreys, Institute of Cancer Research
- About the ICR
- 8 diverse research divisions
- Able to recharge infrastructure costs to research so can fund development
- Future plans: dynamic adaptive therapy
- As you treat it in an individual, cancer mutates and evolves so you have to keep changing treatment to keep up
- Data must be live and online
- Big data is a key pillar in current strategic plan
- HPC infrastructure
- 1,800 cores × 12–16 GB, designed for parallel workload
- Dominated by next generation sequencing; approx 70% usage
- Jisc data centre in Slough
- Architecture
- 6PiB provisioned (expandable to at least 20PiB)
- 2 tier: tier 1 is fast storage (2PiB); tier 2 an object store (4PiB)
- NAS layer on top so that storage tiers are a black box for users
- Policy-based migration from tier 1 → 2
- Typically migrated if not used for 90 days, but other possibilities exist
- Migrated to long-term archive at some later date
- Most files mirrored across 3 sites; smaller (<10MB) files only 2 sites
- Object store cannot provide quotas, so charge based on actual usage
- Projects to develop 2 new components for sharing & syncing; also currently using a Dropbox Business service
- Looking for a metadata catalogue solution
- Many solutions (e.g. iRODS, DSpace) aimed at facilities or libraries
- Need something easy to use for scientists, and off-the-shelf (able to deliver a proof of concept in one person-month)
- Open to suggestions!
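
The storage rules described above (migrate from tier 1 to tier 2 after 90 days without use; mirror most files across 3 sites but small ones across only 2) boil down to a couple of simple policy functions. This is just my sketch of the logic as I understood it: the function names are hypothetical, I have assumed "10MB" means decimal megabytes, and the real system presumably expresses this as storage policy configuration rather than Python.

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(days=90)       # "not used for 90 days"
SMALL_FILE_BYTES = 10 * 1000 * 1000   # "<10MB", assumed decimal megabytes


def target_tier(last_access, now, idle_limit=IDLE_LIMIT):
    """Return 2 (object store) for idle files, otherwise 1 (fast storage)."""
    return 2 if now - last_access > idle_limit else 1


def replica_sites(size_bytes):
    """Small files are mirrored across 2 sites, everything else across 3."""
    return 2 if size_bytes < SMALL_FILE_BYTES else 3


now = datetime(2018, 3, 20)
print(target_tier(datetime(2017, 12, 1), now))  # idle for more than 90 days
print(target_tier(datetime(2018, 3, 1), now))   # used recently
print(replica_sites(5_000_000))
print(replica_sites(50_000_000))
```

The idle limit is passed as a parameter because, as noted above, 90 days is only the typical policy and other possibilities exist.
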
Shoaib Sufi, Community Lead, Software Sustainability Institute (SSI)
- SSI: national facility since 2010 to "cultivate better, more sustainable research software to enable world-class research"
- Software development: to build and maintain expertise in software
- Training: essential software skills for researchers
- Policy: campaigning for research software support and career recognition/development for research software engineers
- Community: workshops & fellowship
- Outreach: website, blog, social media
- Fellowship programme
- £3000 travel/event bursary for people who want to improve research software
- Funded by support grants from research councils
- Turns out that "SSI Fellow" is quite a sought-after badge of recognition
- Fellows = ambassadors
- What makes a good fellow?
- Strong plan: novelty (for institution/domain); have the skills/experience to succeed; will make a difference
- Content: demonstrate ability to create impact
- Communications skills
- Typical activities
- Workshops/conferences/training (including tailored Carpentries)
- Promote SSI and contribute to its success
- Contribute to SSI blog
- Some amazing lasting outcomes from the fellowship programme
- Development of services (Melody Sandells)
- Contribution to RSE conference & organisation (Alys Brett)
- Library Carpentry (James Baker)
- recipy workflow management software (Robin Wilson)
- Open source versions of common commercial research software (Robin Grant)
- Data science for doctors training (Steve Harris)
- Establishing reproducible research as standard in a major research group (Stephen Eglen)
- Conclusions
- The right people to effect change are in the research community
- Need support and community
- Cross-pollinate ideas across different domains
- Collaborations Workshop 2018 focus on themes of Culture Change, Productivity, Sustainability
And now it's time for lunch, but after that there will be three parallel breakout groups:
- Supporting resources for RDM: toolkits & workflows
- Integrating data systems & catalogues
- Impact & metrics: reporting & evidencing success
This includes some information from surveys and interviews around the Jisc research data toolkit project.
- Presenting content through journeys is a useful approach
- If available, quite a lot of people would use resources in an RDM toolkit to augment their teaching
- Preferred mechanism would be working group of HEI-based RDM professionals with Jisc support
- Interesting possible features: institutional subdomains with customisable content; CC-BY licence; funder policy summaries; regular newsletters
- Important themes: ownership, provenance, privacy
- Audit trails important, but