@caprenter
Last active August 29, 2015 14:21
Resource Projects Data Model
Requirements
ResourceProjects.org needs to be able to accommodate data coming from a wide variety of different sources, and to use this to:
Identify projects;
Link as much information as possible to those projects;
Key challenges to be addressed include:
The incoming data is often sparse - including only a few details about a license, a mining site, or a contract. We may be able to infer a project from this, but may have no direct details about it.
The same projects, companies, licenses, locations (etc.) may be identified with different names, in different languages, and with different variations of spelling;
Projects, their participants and other features of the data change over time;
Incoming data may not be updated very often;
The license of some datasets is unclear: publicly accessible data doesn't always mean public domain;
We want to capture detailed information where it is available, but not require users to understand the full complexity of resource projects in order to meaningfully query the data;
A linked data response
After investigating the available options, we have begun to pilot an approach built around a graph-based data model, building on a Linked Data stack. We have chosen this approach because:
Working with a triple store we are not constrained by a pre-defined database schema, and have the flexibility to re-shape data as the project develops;
Linked data is well suited to integrating heterogeneous data from multiple sources. Through use of the sameAs relationship, we can identify the same project found in multiple sources, whilst also maintaining a clear trail of where each assertion about the project came from;
It opens up the possibility of easily integrating third-party open data, including geodata, company information and contracts data;
A number of options are available for managing detailed provenance information;
Tools from the LOD2 Stack provide an increasingly useful suite for handling some of the key data management tasks for ResourceProjects.org.
However, in exploring a Linked Data approach we are clear that most of the time it should be entirely behind the scenes. The front-end that users deal with should be based on simple REST APIs that hide the Linked Data layer unless it is required.
Our initial prototype work has been built on Virtuoso and Ontowiki, providing an out-of-the-box interface for importing, editing, browsing and querying data.
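As a rough illustration of the sameAs idea described above, the sketch below keeps every assertion as a quad (subject, predicate, object, source) and merges identifiers that have been linked with sameAs, so the merged view still records where each statement came from. It uses plain Python rather than a triple store, and the identifiers, predicate names and dataset names are all made up for the example.

```python
from collections import defaultdict

# Each assertion is a quad: (subject, predicate, object, source).
# Identifiers, predicates and source names are hypothetical.
QUADS = [
    ("src_a:proj/12", "name", "Alpha Mine", "dataset-a"),
    ("src_b:p/alpha", "name", "Mine Alpha", "dataset-b"),
    ("src_b:p/alpha", "commodity", "Gold", "dataset-b"),
    ("src_a:proj/12", "sameAs", "src_b:p/alpha", "manual-review"),
]


def same_as_groups(quads):
    """Return a find() function that maps each identifier to its sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o, _ in quads:
        find(s)  # register the subject
        if p == "sameAs":
            parent[find(s)] = find(o)  # union the two groups
    return find


def merged_view(quads):
    """Collect assertions per merged entity, keeping the source of each one."""
    find = same_as_groups(quads)
    view = defaultdict(list)
    for s, p, o, source in quads:
        if p != "sameAs":
            view[find(s)].append((p, o, source))
    return dict(view)


view = merged_view(QUADS)
```

Here both source records end up under a single entity, while each name and commodity statement still carries the dataset it was taken from.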
Data Model
We have started working up a lightweight ontology for representing ResourceProjects.org data.
This is based around a top level class hierarchy as below (draft, only some SubClasses included):
Project
    ConfirmedProject
    PotentialProject
Agreement
    Concession
    Licence
    Contract
Site
    Mine
    Block
Participant
Organization
    Company
    GovernmentAgency
Location
Commodity
Documents
These are related through a number of key object properties:
hasParticipant (relating Agreements, Sites, and Projects to Participants)
hasLocation (relating Agreements, Sites & Projects to locations)
relatedProject (relating Agreements and Sites to Projects)
supportingDocument (relating Agreements, Sites & Projects to Documents)
commodity (relating Agreements, Sites and Projects to Commodities)
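The draft classes and object properties above could be written down as data and used to check incoming statements. The following is only a sketch: the parent/child nesting is one reading of the draft class list, the object class assigned to each property is inferred from its description, and the helper names (`ancestors`, `is_valid`) are invented for the example.

```python
# Draft hierarchy as child -> parent. The nesting is an assumption
# based on the draft class list, not a published ontology.
SUBCLASS_OF = {
    "ConfirmedProject": "Project",
    "PotentialProject": "Project",
    "Concession": "Agreement",
    "Licence": "Agreement",
    "Contract": "Agreement",
    "Mine": "Site",
    "Block": "Site",
    "Company": "Organization",
    "GovernmentAgency": "Organization",
}

# property -> (allowed subject classes, object class)
PROPERTIES = {
    "hasParticipant": ({"Agreement", "Site", "Project"}, "Participant"),
    "hasLocation": ({"Agreement", "Site", "Project"}, "Location"),
    "relatedProject": ({"Agreement", "Site"}, "Project"),
    "supportingDocument": ({"Agreement", "Site", "Project"}, "Documents"),
    "commodity": ({"Agreement", "Site", "Project"}, "Commodity"),
}


def ancestors(cls):
    """Yield cls followed by each of its superclasses."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def is_valid(subject_cls, prop, object_cls):
    """Check a statement against the draft subject and object classes."""
    domains, range_cls = PROPERTIES[prop]
    subject_ok = any(c in domains for c in ancestors(subject_cls))
    object_ok = range_cls in ancestors(object_cls)
    return subject_ok and object_ok
```

For example, `is_valid("Mine", "relatedProject", "PotentialProject")` holds because a Mine is a Site and a PotentialProject is a Project, whereas a Project cannot be the subject of `relatedProject`.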
This model builds on prior modelling work but makes a number of important distinctions:
It introduces 'Participant' as a layer between a project and an organization. This means that statements can be made about a given company's participation in a particular project, agreement or site, without making those statements about the company in general. For example, a Participant has a share in the particular project, and a boolean status indicating whether or not it is the operator.
It distinguishes between 'PotentialProject' and 'ConfirmedProject'. A potential project is one that can be inferred from other data, but which needs further research to check if it duplicates an existing confirmed project. ResourceProjects.org could distinguish between potential and confirmed projects, giving clear public identifiers to confirmed projects and temporary suggested identifiers to potential projects.
By using an ontology, we can subClass properties to provide more specific kinds of relationship. This plays the role of the statements table in the earlier model.
The diagram below represents the draft of this model.
[Image: diagram of the draft data model]
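To make the role of the Participant layer more concrete, here is a minimal sketch using Python dataclasses. The field names (`share`, `is_operator`, `confirmed`) and the sample organization and projects are assumptions chosen for illustration; the point is only that share and operator status live on the Participant, not on the Organization.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Organization:
    name: str  # facts about the company in general live here


@dataclass(frozen=True)
class Participant:
    organization: Organization
    share: Optional[float] = None  # fraction of the project; None when unknown
    is_operator: bool = False


@dataclass
class Project:
    name: str
    confirmed: bool = False  # False stands in for a PotentialProject
    participants: List[Participant] = field(default_factory=list)

    def operator(self) -> Optional[Organization]:
        """Return the operating organization, if one is recorded."""
        for participant in self.participants:
            if participant.is_operator:
                return participant.organization
        return None


# Hypothetical data: one company taking part in two projects on different terms.
company = Organization("Example Mining Ltd")
north = Project("North Block", confirmed=True,
                participants=[Participant(company, share=0.6, is_operator=True)])
south = Project("South Block",
                participants=[Participant(company, share=0.25)])
```

The same `Organization` appears in both projects, but its share and operator status differ per project because they are attached to separate `Participant` records.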
A simple input layer
We envisage that in most cases, data can be mapped to this model from simple tabular formats.
An example input format might work as follows:
ID  ProjectID  ProjectName  Commodity  Company  Share  DateKnown
1   2          3            4          5        6      7
In this case, a user with a list of projects related to companies would enter the project name (3), the commodity involved (4), the company name (5) and the company's share of the project (6).
This would be enough information to expand out a number of relationships from the model.
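A sketch of how one such row might be expanded is shown below. The column headers follow the example format above; everything else is an assumption made for the illustration, including the CSV encoding, the identifier scheme, the sample values, the predicates `type`, `name`, `organization`, `share` and `dateKnown`, and the rule that a row without a ProjectID yields a PotentialProject.

```python
import csv
import io

# Hypothetical sample row; ProjectID is left blank on purpose.
SAMPLE = """ID,ProjectID,ProjectName,Commodity,Company,Share,DateKnown
1,,Example Mine,Gold,Example Mining Ltd,0.6,2015-01-01
"""


def slug(text):
    """Make a simple identifier fragment from a name."""
    return "-".join(text.lower().split())


def expand_row(row):
    """Expand one flat row into (subject, predicate, object) triples."""
    project = "project/" + (row["ProjectID"] or slug(row["ProjectName"]))
    company = "company/" + slug(row["Company"])
    participant = f"{project}/participant/{slug(row['Company'])}"
    # Assumption: no known ProjectID means the project is only inferred.
    project_type = "Project" if row["ProjectID"] else "PotentialProject"
    triples = [
        (project, "type", project_type),
        (project, "name", row["ProjectName"]),
        (project, "commodity", "commodity/" + slug(row["Commodity"])),
        (project, "hasParticipant", participant),
        (participant, "type", "Participant"),
        (participant, "organization", company),
        (participant, "share", row["Share"]),
        (company, "type", "Company"),
        (company, "name", row["Company"]),
    ]
    if row["DateKnown"]:
        triples.append((participant, "dateKnown", row["DateKnown"]))
    return triples


def expand(text):
    """Expand every row of a CSV document into triples."""
    reader = csv.DictReader(io.StringIO(text))
    return [triple for row in reader for triple in expand_row(row)]


triples = expand(SAMPLE)
```

A single row with four filled-in columns produces statements about a project, a company, and the participant record that links them.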
We have more work to do on developing some simple flat formats that suit the variety of data we have encountered, but will be exploring how the ETL process could be developed so that:
Users convert data to a tabular format with common headers, with or without identifiers for projects, companies etc.;
An initial de-duplication process analyses the data, and checks for known project, organisation, site etc. identifiers;
Users are invited to use the results of this to improve their dataset;
The dataset with known identifiers included is uploaded, converted to RDF and stored in the data store;
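The de-duplication step in the process above could start from something as simple as normalised name matching. The sketch below is one possible approach under stated assumptions, not a description of an existing tool: the register of known identifiers, the list of company suffixes and the function names are all hypothetical.

```python
import unicodedata

# Hypothetical register of known identifiers, keyed by normalised name.
KNOWN = {
    "example mining": "org/0001",
    "mina ejemplo": "project/0042",
}

# Assumed list of suffixes to ignore when comparing company names.
SUFFIXES = {"ltd", "limited", "plc", "inc"}


def normalise(name):
    """Lower-case, strip accents and punctuation, drop common company suffixes."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in text if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(w for w in cleaned.split() if w not in SUFFIXES)


def suggest_identifiers(rows, column):
    """Pair each row with a known identifier, or None if no match is found."""
    return [(row, KNOWN.get(normalise(row[column]))) for row in rows]


rows = [{"Company": "EXAMPLE MINING LIMITED"}, {"Company": "Unknown Co"}]
suggestions = suggest_identifiers(rows, "Company")
```

Rows that come back with `None` are the ones a user would be asked to review before the dataset is uploaded. Exact matching on normalised names will miss spelling variants, so a real process would likely need fuzzy matching and manual confirmation as well.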