caprenter · August 29, 2015 14:21
diff --git a/gistfile1.txt b/gistfile1.txt
 Resource Projects Data Model

 Requirements

 ResourceProjects.org needs to be able to accommodate data coming from a wide variety of different sources, and to use this to:

 Identify projects;
 Link as much information as possible to those projects;
 Key challenges to be addressed include:

 The incoming data is often sparse, including only a few details about a licence, a mining site, or a contract. We may be able to infer a project from this, but may have no direct details about it;
 The same projects, companies, licences, locations (etc.) may be identified with different names, in different languages, and with different variations of spelling;
 Projects, their participants and other features of the data change over time;
 Incoming data may not be updated very often;
 The licence of some datasets is unclear: publicly accessible data doesn't always mean public domain;
 We want to capture detailed information where it is available, but not require users to understand the full complexity of resource projects in order to meaningfully query the data.
 A linked data response

 Based on investigating the available options, we have begun to pilot an approach based around using a graph-based data model, building on a Linked Data stack. We have chosen this approach because:

 Working with a triple store we are not constrained by a pre-defined database schema, and have the flexibility to re-shape data as the project develops;
 Linked data is well suited to integrating heterogeneous data from multiple sources. Through use of the sameAs relationship, we can identify the same project found in multiple sources, whilst also maintaining a clear trail of where each assertion about the project came from;
 It opens up the possibility of easily integrating third-party open data, including geodata, company information and contracts data;
 A number of options are available for managing detailed provenance information;
 Tools from the LOD2 Stack provide an increasingly useful suite for handling some of the key data management tasks for ResourceProjects.org.
 However, in exploring a Linked Data approach we are clear that most of the time it should be entirely behind the scenes. The front-end that users deal with should be based on simple REST APIs that hide the Linked Data layer unless it is required.

 Our initial prototype work has been built on Virtuoso and OntoWiki, providing an out-of-the-box interface for importing, editing, browsing and querying data.
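
 To illustrate the sameAs point above, here is a minimal sketch in plain Python (not the Virtuoso/OntoWiki prototype itself). It groups assertions from different sources under one representative identifier while keeping the source of every assertion. The identifiers, predicate names and file names are all made up for illustration.

```python
from collections import defaultdict

# Each assertion is (subject, predicate, value, source). All identifiers
# and values below are hypothetical.
ASSERTIONS = [
    ("srcA:proj-1", "name", "Example Mine", "source-a.csv"),
    ("srcB:p-77", "name", "Example Mine Project", "source-b.csv"),
    ("srcB:p-77", "commodity", "Gold", "source-b.csv"),
    ("srcC:x-9", "name", "Another Project", "source-c.csv"),
]

# sameAs links assert that two identifiers denote the same project.
SAME_AS = [("srcA:proj-1", "srcB:p-77")]


def find(parent, item):
    """Return the representative identifier for item's sameAs group."""
    parent.setdefault(item, item)
    while parent[item] != item:
        parent[item] = parent[parent[item]]  # path halving
        item = parent[item]
    return item


def merge_by_same_as(assertions, same_as):
    """Group assertions by sameAs cluster, keeping the source of each."""
    parent = {}
    for left, right in same_as:
        root_left, root_right = find(parent, left), find(parent, right)
        if root_left != root_right:
            # Use the lexicographically smaller id as the representative.
            parent[max(root_left, root_right)] = min(root_left, root_right)
    merged = defaultdict(list)
    for subject, predicate, value, source in assertions:
        merged[find(parent, subject)].append((predicate, value, source))
    return dict(merged)


merged = merge_by_same_as(ASSERTIONS, SAME_AS)
```

 Because the source travels with each assertion, conflicting names from two datasets can sit side by side under one project without losing track of who said what.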

 Data Model

 We have started working up a lightweight ontology for representing ResourceProjects.org data.

 This is based around a top-level class hierarchy as below (draft, only some subclasses included):

 Project
     ConfirmedProject
     PotentialProject
 Agreement
     Concession
     Licence
     Contract
 Site
     Mine
     Block
 Participant
 Organization
     Company
     GovernmentAgency
 Location
 Commodity
 Documents
 These are related through a number of key object properties:

 hasParticipant (relating Agreements, Sites, and Projects to Participants)
 hasLocation (relating Agreements, Sites and Projects to Locations)
 relatedProject (relating Agreements and Sites to Projects)
 supportingDocument (relating Agreements, Sites and Projects to Documents)
 commodity (relating Agreements, Sites and Projects to Commodities)
 This builds on prior modelling work but makes a number of important distinctions:

 It introduces 'Participant' as a layer in between a project and an organization. This means that statements can be made about a given company's participation in a particular project, agreement or site, without making those statements about the company in general. E.g. a Participant has a share in the particular project, and a boolean status recording whether or not it is the operator.
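
 As a rough sketch of this layering (in Python rather than in the ontology itself), the share and operator flag can hang off a Participant record instead of the company. The field names and the example company and project identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Organization:
    name: str


@dataclass(frozen=True)
class Participant:
    """One organization's role in one project, agreement or site."""
    organization: Organization
    participates_in: str           # identifier of the project, agreement or site
    share: Optional[float] = None  # fraction held, if known
    is_operator: bool = False


acme = Organization("Acme Mining Ltd")  # made-up company
stakes = [
    Participant(acme, "project/alpha", share=0.6, is_operator=True),
    Participant(acme, "project/beta", share=0.25),
]

# The same company is operator of one project but not the other.
operated = [p.participates_in for p in stakes if p.is_operator]
```

 Nothing about the share or operator status is stated on the Organization, so the same company can hold different roles in different projects.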

 It distinguishes between 'PotentialProject' and 'ConfirmedProject'. A potential project is one that can be inferred from other data, but which needs further research to check whether it duplicates an existing confirmed project. ResourceProjects.org could distinguish between potential and confirmed projects, giving clear public identifiers to confirmed projects, and giving temporary suggested identifiers to potential projects.
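
 One possible way to keep the two kinds of identifier apart is sketched below. The "project/" and "potential/" prefixes, the counter and the hashing scheme are all hypothetical choices, not a scheme the project has adopted.

```python
import hashlib
from itertools import count

# Hypothetical sequence for stable public identifiers.
_confirmed_counter = count(1)


def confirmed_identifier():
    """Mint the next stable public identifier for a confirmed project."""
    return f"project/{next(_confirmed_counter):06d}"


def potential_identifier(source, name):
    """Derive a temporary identifier from the evidence for a potential project."""
    key = f"{source}|{name.strip().lower()}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"potential/{digest[:10]}"
```

 Deriving the temporary identifier from the source and name means that re-importing the same sparse record suggests the same potential project rather than creating a new one each time.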

 By using an ontology, we can subclass properties to provide more specific kinds of relationship. This plays the role of the statements table in the earlier model.

 The diagram below represents the draft of this model.

 [Image: diagram of the draft data model]
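
 The class list and object properties above can also be written down as plain data, which makes it easy to check whether a given statement fits the draft model. This is only a sketch: the subclass nesting (for example Mine under Site, Licence under Agreement) is an assumption read off the ordering of the list, and the checking function is hypothetical.

```python
# Parent of each class; None marks a top-level class.
SUBCLASS_OF = {
    "Project": None, "ConfirmedProject": "Project", "PotentialProject": "Project",
    "Agreement": None, "Concession": "Agreement", "Licence": "Agreement",
    "Contract": "Agreement",
    "Site": None, "Mine": "Site", "Block": "Site",
    "Participant": None,
    "Organization": None, "Company": "Organization",
    "GovernmentAgency": "Organization",
    "Location": None, "Commodity": None, "Documents": None,
}

# property -> (allowed subject classes, object class)
PROPERTIES = {
    "hasParticipant": ({"Agreement", "Site", "Project"}, "Participant"),
    "hasLocation": ({"Agreement", "Site", "Project"}, "Location"),
    "relatedProject": ({"Agreement", "Site"}, "Project"),
    "supportingDocument": ({"Agreement", "Site", "Project"}, "Documents"),
    "commodity": ({"Agreement", "Site", "Project"}, "Commodity"),
}


def ancestors(cls):
    """Yield cls and each of its superclasses, most specific first."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF[cls]


def is_valid(subject_cls, prop, object_cls):
    """Check a statement against the property's subject and object classes."""
    domains, range_cls = PROPERTIES[prop]
    subject_ok = any(c in domains for c in ancestors(subject_cls))
    object_ok = range_cls in ancestors(object_cls)
    return subject_ok and object_ok
```

 For example, `is_valid("Mine", "relatedProject", "PotentialProject")` holds because a Mine is a Site and a PotentialProject is a Project, whereas relating a Project to another Project through relatedProject does not fit the draft.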

 A simple input layer

 We envisage that in most cases, data can be mapped to this model from simple tabular formats.

 An example input format might work as follows:

 ID	ProjectID	ProjectName	Commodity	Company	Share	DateKnown
 1	2	3	4	5	6	7
 In this case, a user with a list of projects related to companies would enter the project name (3), the company name (5), the share of the project (6) and the details of the commodity involved (4).

 This would be enough information to expand out a number of relationships from the model.
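
 A minimal sketch of that expansion, assuming a tab-separated file with the example headers. The sample row, the blank-node style identifiers and the predicate names other than hasParticipant and commodity are made up, and treating a row without a ProjectID as a PotentialProject is an assumption rather than a settled rule.

```python
import csv
import io

# A made-up row in the example column layout.
SAMPLE = (
    "ID\tProjectID\tProjectName\tCommodity\tCompany\tShare\tDateKnown\n"
    "1\t\tExample Mine\tGold\tAcme Mining Ltd\t60\t2015-08-29\n"
)


def expand_row(row):
    """Turn one tabular row into (subject, predicate, object) statements."""
    project = f"_:project{row['ID']}"
    participant = f"_:participant{row['ID']}"
    # Assumption: without a ProjectID the project is only inferred.
    project_class = "ConfirmedProject" if row["ProjectID"] else "PotentialProject"
    return [
        (project, "type", project_class),
        (project, "name", row["ProjectName"]),
        (project, "commodity", row["Commodity"]),
        (project, "hasParticipant", participant),
        (participant, "type", "Participant"),
        (participant, "organization", row["Company"]),
        (participant, "share", row["Share"]),
        (participant, "dateKnown", row["DateKnown"]),
    ]


rows = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
triples = [t for row in rows for t in expand_row(row)]
```

 A single flat row therefore yields a project, a participant and the links between them, with the share attached to the participant rather than to the company.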

 We have more work to do on developing some simple flat formats that suit the variety of data we have encountered, but will be exploring how the ETL process could be developed so that:

 Users convert data to a tabular format with common headers, with or without identifiers for projects, companies etc.;
 An initial de-duplication process analyses the data, and checks for known project, organisation, site etc. identifiers;
 Users are invited to use the results of this to improve their dataset;
 The dataset with known identifiers included is uploaded, converted to RDF and stored in the data store.
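
 The de-duplication check in the steps above could start from something as simple as the sketch below, which normalises company names before looking them up in a register of known identifiers. The register contents, the list of legal suffixes and the CompanyID column name are all hypothetical; a real process would need fuzzier matching and human review.

```python
import re
import unicodedata

# Hypothetical register of identifiers already held in the data store.
KNOWN_COMPANIES = {"acme mining": "company/0001"}

# A small, illustrative set of legal suffixes to ignore when comparing.
_SUFFIXES = re.compile(r"\b(ltd|limited|inc|plc|sa)\b")


def normalise(name):
    """Reduce a name to a comparison key: no accents, case, punctuation or legal suffix."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    text = _SUFFIXES.sub(" ", text)
    return " ".join(text.split())


def suggest_identifiers(rows, known=KNOWN_COMPANIES):
    """Annotate each row with a matching known company identifier, if any."""
    return [dict(row, CompanyID=known.get(normalise(row["Company"]))) for row in rows]


report = suggest_identifiers([
    {"Company": "ACME Mining Ltd."},
    {"Company": "Acmé Mining Limited"},
    {"Company": "Unknown Resources"},
])
```

 Rows that come back with an identifier can be confirmed by the user before upload; rows that come back with None are candidates for new organisations.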