Expressing Statistical Data in RDF with SDMX-RDF

Last update:
2010-05-02
Editors:
Richard Cyganiak (DERI, NUI Galway)
Chris Dollin (Epimorphics Ltd)
Dave Reynolds (Epimorphics Ltd)
@@@ add yourself!

Abstract

SDMX (Statistical Data and Metadata eXchange) is an ISO standard for exchanging and sharing statistical data and metadata among organizations. It consists of an abstract information model (SDMX-IM) and concrete XML- and UN/EDIFACT-based syntaxes.

RDF (Resource Description Framework) is a W3C specification for a general-purpose language for representing information on the World Wide Web. It consists of an abstract data model interpreted in terms of logical statements, together with concrete syntaxes such as RDF/XML and the more compact Turtle notation.

We describe how SDMX data may be represented in RDF, which we term SDMX-RDF, and so made available to RDF-aware applications and presented as linked data. This representation exploits existing RDF vocabularies such as SKOS for concept schemes, FOAF for organizational information, Dublin Core Terms for metadata and VoiD for dataset description. It builds on earlier work on representing statistical data in RDF using SCOVO.

SDMX-RDF provides a general means to publish statistical data in RDF (exploiting the SDMX information model). It also allows for RDF publication of data already in SDMX. It is not, at this stage, a complete implementation of the whole of the SDMX Information Model and is not intended as an alternative means for loss-less exchange of SDMX along existing statistical data flows. @@ Please check if this restated positioning is acceptable - Dave.

Status of this document

This is an editor's draft without any formal standing. It is not endorsed by any organisation. In particular, it has not been submitted for review to the SDMX sponsors, although the authors are planning to do so in the future.

Anything in this document is still subject to change at this point. The editors seek feedback on the document. Please send any comments to the project's Google Group.


Table of Contents


1. Introduction

Statistical data underpins many of the mash-ups and visualisations we see on the web, as well as being the foundation for policy prediction, planning and adjustments. The SDMX standard for exchanging statistical data is used by the U.S. Federal Reserve Board, the European Central Bank, Eurostat, the WHO, the IMF, and the World Bank; the Organisation for Economic Co-operation and Development (OECD) and the UN expect the publishers of national statistics to use SDMX to allow aggregation across national boundaries.

SDMX is not web-friendly. The concepts, code lists, datasets, and observations are not named with URIs or routinely exposed to browsers and other web-crawlers. This makes it more difficult for third parties to annotate, reference, and discover that data. Nor is SDMX the only shape in which statistical data is published; data is available in specialist XML formats such as LGDx, de-facto semi-standards like CSV, and proprietary application formats like Excel spreadsheets or PDF documents.

RDF provides a mechanism for data publishing on the web which, through the use of linked data principles, supports easy discovery and cross-linking of published data. RDF is a simple but flexible representation in which logical statements (binary predicates) are asserted about resources. The resources, classes of resources and predicates are identified by URIs, thus supporting web-based discovery of the associated information model.

There are a number of benefits to being able to publish statistical data using RDF:

SDMX-RDF is an RDF vocabulary designed to support this. It defines classes and predicates to represent statistical data within RDF, compatible with the SDMX information model.

SDMX-RDF makes use of the following existing RDF vocabularies:

1.1 Characterizing statistical data

A statistical data set comprises a collection of observations made at some points across a logical space. For example, statistics for monitoring government performance typically comprise some set of indicators (e.g. economic activity, health) measured at particular times, across some set of geographic regions and population samples. The data set can be characterized by a set of dimensions which define what the observation applies to (time, area, population), and the observations themselves form a hypercube or multi-dimensional space indexed by those dimensions. In addition, the data set needs to convey, directly or indirectly, how to interpret those observations - attributes defining units of measurement, scale, etc. - along with metadata to support discovery and provide context to the data. The defining feature of such statistical datasets is this regular structure of dimensions and attributes around which the observations are grouped.

Many approaches to representing statistical data follow this hypercube model, in which individual observations can only be located and interpreted via their address within a fixed surrounding cube. This approach is convenient for storage and manipulation of a single data set.

However, when we wish to arbitrarily combine datasets, to annotate and track provenance of individual observations or arbitrary subsets of observations, or to link observations to non-statistical datasets, we need a different approach. We need to represent each observation as a separate entity which has an observed value along with its attributes and location within the multi-dimensional space. This is the approach taken in SDMX-RDF, which in turn is based on the earlier [SCOVO] model.
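For illustration, a single observation might then be represented along the following lines. This is only a sketch: the eg: resources are hypothetical, and the sdmx: terms used here are introduced later in this document.

  eg:obs1 a sdmx:Observation ;
      sdmx:dataset   eg:dataset1 ;     # the data set the observation belongs to
      eg:refArea     eg:Wales ;        # dimension value: geographic area
      eg:refPeriod   eg:y2009q4 ;      # dimension value: time period
      eg:unitMeasure eg:persons ;      # attribute value: unit of measurement
      sdmx:obsValue  10600 .           # the observed value itself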

1.2 RDF and Linked Data

Linked data is an approach to publication of data on the web. It is a set of best practices to enable systems to use the Web to connect related data that wasn't previously linked, or to lower the barriers to linking data currently linked using other methods. The approach [@@ref] recommends use of HTTP URIs to name the entities and concepts so that consumers of the data can look up those URIs to get more information, including links to other related URIs. RDF [@@ref] provides a standard for the representation of the information that describes those entities and concepts, and is returned by dereferencing the URIs.

When applied to statistical data this linked data approach implies identifying datasets, time series and individual observations by means of HTTP URIs. These then enable both publishers and third parties to annotate and reference statistical data on the web, which helps to build trust with those engaging in conversations about the data. Using the RDF data model enables consumers to query statistical data in standard ways and to enhance statistical data by mixing it with other linked data.

1.3 About SDMX

The Statistical Data and Metadata Exchange (SDMX) Initiative was organised in 2001 by seven international organisations (BIS, ECB, Eurostat, IMF, OECD, World Bank and the UN) to realise greater efficiencies in statistical practice. These organisations all collect significant amounts of data, mostly from the national level, to support policy. They also disseminate data at the supra-national and international levels.

There have been several important results from this work: two versions of a set of technical specifications - ISO/TS 17369 (SDMX) - and the release of several recommendations for structuring and harmonising cross-domain statistics, the SDMX Content-Oriented Guidelines. All of these products are available at www.sdmx.org. The standards are now being widely adopted around the world for the collection, exchange, processing, and dissemination of aggregate statistics by official statistical organisations. The UN Statistical Commission recommended SDMX as the preferred standard for statistics in 2007.

The SDMX specification defines a core information model which is reflected in concrete form in two syntaxes - SDMX-ML (an XML syntax) and SDMX-EDI. SDMX-RDF builds on that same SDMX information model, showing how to express the same information in RDF form.

A key component of the SDMX standards package are the Content-Oriented Guidelines (COGs), a set of cross-domain concepts, code lists, and categories that support interoperability and comparability between datasets by providing a shared language between SDMX implementors. RDF versions of these artefacts are available as part of SDMX-RDF, and should be re-used whenever possible. Throughout the sections of this document, resources from the COGs will be mentioned when available.

1.4 Relationship to SCOVO

The Statistical Core Vocabulary (SCOVO) [@@ref] is a lightweight RDF vocabulary for expressing statistical data. Its relative simplicity allows easy adoption by data producers and consumers, and it can be combined with other RDF vocabularies for greater effect. The model is extensible both on the schema and the instance level for more specialized use cases.

While SCOVO addresses the basic use case of expressing statistical data in RDF, its minimalist design is limiting, and it does not support important scenarios that occur in statistical publishing, such as:

The design of SDMX-RDF is informed by SCOVO, and every SCOVO dataset can be re-expressed as an SDMX-RDF dataset.

1.5 Statistical data as a hypercube

A statistical data set comprises a collection of observations made at some points across a logical space. The set can be characterized by a set of dimensions that define what the observation applies to (time, area, population) along with metadata describing what has been measured (e.g. economic activity), how it was measured and how the observations are expressed (e.g. units, multipliers, status). We thus think of the statistical space as a hyper-cube or multi-dimensional space indexed by those dimensions. This concept of a cube of data is a common way to describe and think of such statistical datasets.

1.6 Audience and scope

This document describes the vocabulary for SDMX-RDF and how it relates to the SDMX model. It is aimed at people wishing to publish statistical data in RDF but does not assume that the data is already available in SDMX. Mechanics of cross-format translation from other SDMX formats to SDMX-RDF will be covered elsewhere.

The scope for SDMX-RDF itself is to enable publication of statistics as linked data using RDF. While we can regard it as a third syntax for the SDMX information model, it is not aimed at complete round-tripping to other SDMX formats, though it might be extended to support that in the future.

1.7 Document conventions

The names of RDF entities -- classes, predicates, instances, etc -- are URIs. These are usually expressed using a compact notation where the name is written prefix:localname, where the prefix identifies a namespace URI which is to be prepended to the localname to obtain the full URI.

In this document we shall use the conventional prefix names for the well-known namespaces:

We also introduce the prefix sdmx for the SDMX-RDF namespace (yet to be formally allocated). While the new terms required to express SDMX concepts in RDF could have been added to the SCOVO namespace, it seems more appropriate to emphasise their relationship with the standard they are taken from.
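For reference, the Turtle prefix declarations corresponding to these conventions might look as follows. This is only a sketch: the sdmx and eg namespace URIs shown are placeholders, since the SDMX-RDF namespace has not yet been formally allocated and eg is used purely for examples.

  @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
  @prefix foaf:  <http://xmlns.com/foaf/0.1/> .
  @prefix void:  <http://rdfs.org/ns/void#> .
  @prefix scovo: <http://purl.org/NET/scovo#> .
  @prefix dc:    <http://purl.org/dc/elements/1.1/> .
  @prefix dct:   <http://purl.org/dc/terms/> .
  @prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
  @prefix sdmx:  <http://example.org/sdmx-rdf#> .   # placeholder; namespace not yet allocated
  @prefix eg:    <http://example.org/ns#> .         # placeholder for example resources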

- [ ] Usage Notes explain typical practical usage
- [ ] Design Notes explain modelling decisions and explore alternatives

2. An overview of the SDMX model

- [ ] The big picture
    - [ ] Levels of maintenance and re-use
    - [ ] Further reading on SDMX
    - [ ] Scope of this doc -- we don't really talk about DataFlow, ProvisionAgreement etc

Mapping overview. Fig. 4 provides a high-level overview of the RDF model. At the core of SDMX is the data structure definition (DSD), which describes the structure, or metamodel, of one or more statistical datasets. Individual datasets must conform to a DSD, and are represented by instances of the sdmx:DataSet class. The sdmx:structure property connects a dataset and its DSD. The sdmx:DataSet class is defined as a subclass of SCOVO's scovo:Dataset class, and also as a subclass of void:Dataset, so VoiD properties can be used to describe access methods (SPARQL endpoint, RDF dump, etc.) to the data. VoiD covers much of the same ground as SDMX’s web service based registry module, which we therefore do not map to RDF.
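For example, a data set might be linked to its DSD and given VoiD access metadata roughly as follows (a sketch; the eg: resources and access URLs are hypothetical):

  eg:dataset1 a sdmx:DataSet ;
      sdmx:structure eg:exampleDSD ;                           # the DSD the data set conforms to
      void:sparqlEndpoint <http://example.org/sparql> ;        # hypothetical SPARQL endpoint
      void:dataDump <http://example.org/dumps/dataset1.ttl> .  # hypothetical RDF dump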

Data flows and provision agreements. Two important scenarios in official statistics are the periodical publishing of datasets according to a schedule, and the aggregation of datasets from different data providers (e.g., European Union national statistics offices) into a larger collection for central dissemination (e.g., Eurostat). These scenarios are addressed via sdmx:DataFlow. A data flow represents a "feed" of datasets that all conform to the same DSD. Data flows are associated with provision agreements, which can be understood as commitments from an organisation to publish datasets into a data flow.

2.1 Data structure definitions

- [ ] Data structure definitions
    - [ ] Dimensions, measures and attributes
    - [ ] Concepts

Concepts are about the meaning of the dataset. They are supposed to be widely shared.

Dimensions, attributes and measures are about the structure of a dataset. They are used to define a specific structure that can be re-used for identically-structured datasets. A DSD is essentially created by enumerating the concepts that are used in the dataset, and detailing the role they play in datasets that follow the DSD.

Data structure definition details. A DSD, also known as a key family in SDMX, describes the metamodel of one or more datasets (see Fig. 5). It defines attributes, measures, and dimensions, collectively called components. Measures name the observable phenomenon, such as income per household. Dimensions identify what is measured, such as a particular country at a particular time. Attributes define metadata about the observations, such as the method of data collection or the unit of measurement. Components are coded if possible values come from a pre-defined code list (such as country), or uncoded otherwise.

Dimensions, attributes and measures in SDMX take their semantics from concepts. Concepts are items in concept schemes. By using standard concepts and code lists, data becomes comparable across datasets, DSDs, and providers.

2.2 Datasets

@@@ About time series, cross sections, and groups

Data set details. SDMX offers two approaches to organising the data inside a dataset. Either the dataset is a collection of time series (a set of observations that share the same dimension values except for the time dimension), or it is a collection of cross-sections (a set of observations that share the same dimension values except for one or more non-time "wildcard dimensions"). In our RDF mapping, we unify both models into a simpler yet more verbose model that can be more easily interrogated with SPARQL queries (see Fig. 6). The observation values are modeled as instances of sdmx:Observation, a subclass of scovo:Item. Each observation instance is directly connected to the sdmx:DataSet via the sdmx:dataset property. An observation must have a value for each dimension property defined in the DSD. The actual observation value is recorded using rdf:value.

The time series and cross-sections found in SDMX data are still translated to RDF, in order to make any metadata attached to them available in the RDF view. The same applies to groups, which are another organisational tool that can be used to apply metadata to sections of a dataset, for example to monthly, quarterly and annual timelines of the same measure.

3. Creating data structure definitions

Data structure definitions are also named key families. Both terms are used synonymously in SDMX.

Figure 5: SDMX Data Structure Definition in RDF

Code lists are mapped to a subclass of skos:ConceptScheme.

We represent all components as instances of rdf:Property. We define subclasses of rdf:Property to indicate the particular kind of component, as well as whether it is coded, and the particular role it plays in the DSD (e.g., TimeDimension, PrimaryMeasure). Compared to SCOVO, the property-based modeling of dimensions allows for a more compact RDF representation of observations.

Concepts could be modeled as properties, and could be associated with components using rdfs:subPropertyOf. Instead, we model them as skos:Concepts, and introduce a new property for associating them with the component. This takes advantage of the easier management, wider reusability, and fine-grained mapping features of SKOS vocabularies compared to RDFS-defined properties.

The main parts of a key family are the dimensions, attributes, and measures. We model them as RDF properties. We define subclasses of rdf:Property that are used to map the components to RDF:

The defined properties are attached to the main resource (of type sdmx:DataStructureDefinition) via sdmx:component.

Each property must also have a sdmx:concept property that points to the concept that gives the semantics of the property (from an sdmx:ConceptScheme).

All defined properties have domain sdmx:Attachable. An appropriate range should also be declared, especially for uncoded properties that use some literal datatype.
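Putting these pieces together, a minimal DSD might be sketched as follows. The eg: resources are hypothetical, and sdmx:AttributeProperty is assumed here by analogy with the component property classes named elsewhere in this document.

  eg:exampleDSD a sdmx:DataStructureDefinition ;
      sdmx:component eg:refArea, eg:unitMeasure, sdmx:obsValue .

  eg:refArea a sdmx:DimensionProperty, sdmx:CodedProperty ;
      sdmx:concept  eg:refAreaConcept ;      # concept giving the semantics of the dimension
      sdmx:codeList eg:areaCodeList ;
      rdfs:range    eg:AreaCode .

  eg:unitMeasure a sdmx:AttributeProperty ;  # assumed class name
      sdmx:concept eg:unitMeasureConcept ;
      rdfs:range   xsd:string .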

3.1 The primary measure property

Every data structure definition must include the component sdmx:obsValue. This is neither an attribute nor a dimension, but a measure. In observations, this property will hold the actual observed (typically numeric) value.

Note: There are rare cases where a data structure definition will not include sdmx:obsValue. When expressing existing SDMX data structure definitions that use a different concept than OBS_VALUE in the primaryMeasure concept role, a corresponding instance of sdmx:PrimaryMeasureProperty has to be created and is used in place of sdmx:obsValue.
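In such a case the replacement property might be declared roughly as follows (a sketch; all eg: names are hypothetical):

  eg:averageIncome a sdmx:PrimaryMeasureProperty ;
      sdmx:concept eg:averageIncomeConcept ;  # the concept playing the primaryMeasure role
      rdfs:range   xsd:decimal .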

4. Expressing datasets

A dataset is a collection of statistical data that corresponds to a given data structure definition. The data in a dataset can be roughly described as belonging to one of the following kinds:

Observations
This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
Organizational structure
To locate an observation within the hypercube, one must at least know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of time series and groups. Both are slices through the cube along certain dimensions and are used for attaching metadata to areas of the cube.
Internal metadata
Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels (time series, groups, entire datasets), which makes them apply to all observations in the region.
External metadata
This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed. External metadata is described in Section 5.

4.1 The dataset instance

A resource representing the entire dataset is created and typed as sdmx:DataSet.

Pitfall: Note the capitalization of sdmx:DataSet, which differs from the capitalization in other vocabularies, such as dct:Dataset, void:Dataset, dcat:Dataset.

The dataset resource is connected to the defining data structure definition via the sdmx:structure property.

Following the example of SCOVO, the RDF mapping does not distinguish between datasets modelled as time series and cross-sectional datasets. TimeSeries and Sections are supported as additional grouping constructs within the cube.

@@@ We still completely ignore group keys.

4.2 Observations

The measured value is provided as the value of the primary measure property (typically sdmx:obsValue).

In the basic representation, an RDF resource is created for each observation and typed as sdmx:Observation. It is connected to the sdmx:DataSet via the sdmx:dataset property. Optionally, instances of sdmx:TimeSeries, sdmx:Section, and sdmx:Group can be created. The dataset resource connects to each of those via sdmx:slice. Each of them connects to the observations contained within the time series/section/group via sdmx:observation.

Values for the attributes, dimensions and measures are attached directly to the observation. Remember that attributes, dimensions and measures are all RDF properties, so we use them as the predicate and the respective value as the object of RDF statements. Instead of attaching statements directly to the Observations, they can also be "pulled up" to any of the groupings or even up to the dataset if they are always identical within the group. (@@@ but attachment levels are not visible at all within the @@@)
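A sketch of this structure, using hypothetical eg: resources, might look like:

  eg:dataset1 a sdmx:DataSet ;
      sdmx:slice eg:walesUnemploymentSeries .

  eg:walesUnemploymentSeries a sdmx:TimeSeries ;
      eg:refArea eg:Wales ;          # dimension value "pulled up" to the whole series
      sdmx:observation eg:obs1 .

  eg:obs1 a sdmx:Observation ;
      sdmx:dataset   eg:dataset1 ;
      eg:refPeriod   eg:y2009q4 ;    # remaining dimension value
      eg:unitMeasure eg:persons ;    # attribute attached directly to the observation
      sdmx:obsValue  10600 .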

5. Expressing dataset metadata

DataSets should be marked up with metadata to support discovery, presentation and processing. Metadata such as a display label (rdfs:label), descriptive comment (rdfs:comment) and creation date (dc:date) are common to most resources. We recommend use of Dublin Core Terms for representing the key metadata annotations commonly needed for DataSets.

5.1 Categorizing a dataset

Publishers of statistics often categorize their data sets into different statistical domains, such as Education, Labour, or Transportation. SDMX-RDF supports the annotation of data sets (or data flows) with one or more classification terms using the dct:subject property. The classification terms can include coarse grained classifications, such as the List of Subject-matter Domains from the SDMX Content-oriented Guidelines [SDMX COG SMD], and fine grained classifications to support discovery of data sets.

The classification schemes are represented using the SKOS vocabulary, which is designed for encoding thesauri and other knowledge organization schemes [SKOS]. For convenience the SDMX Subject-matter Domains have been encoded as a SKOS concept scheme at http://purl.org/linked-data/sdmx/2009/subject#.

Thus a dataset about tourism in Wales might be marked up by:

eg:dataset1 a sdmx:DataSet;
    dct:subject <http://purl.org/linked-data/sdmx/2009/subject#2.4.5>, eg:Wales .

where eg:Wales is a skos:Concept drawn from an appropriate controlled vocabulary for places.

5.2 Describing publishers and maintenance agencies

The organization that publishes a dataset should be recorded as part of the dataset metadata. SDMX-RDF recommends reuse of the Dublin Core term dc:publisher for this. The organization should be represented as an instance of foaf:Agent. For example:

eg:dataset1 a sdmx:DataSet;
    dc:publisher eg:organization ;   # eg:organization stands in for the publisher's URI
    dc:date "2010-04-30"^^xsd:date .

eg:organization a foaf:Agent;
    rdfs:label "Epimorphics Ltd" .

@@@ Feel free to switch to another organization for the example

Organizations can also play the role of maintenance agency for various SDMX artifacts, such as DSDs, code lists, and category schemes. This is indicated using the sdmx:maintainer property.
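For example, a code list and its maintenance agency might be linked roughly as follows (a sketch; the eg: resources are hypothetical):

  eg:regionCodeList a sdmx:CodeList ;
      sdmx:maintainer eg:statisticsAgency .

  eg:statisticsAgency a foaf:Agent ;
      rdfs:label "Example statistics agency" .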

6. Designing code lists

The value for each dimension and attribute within a dataset should be indicated by a code drawn from a code list. In SDMX-RDF, such codes are denoted by URI resources (so they can be dereferenced and further annotated) and are normally of type skos:Concept. The set of codes which make up a code list is represented using skos:ConceptScheme.

For example:

sdmx-code:sex a skos:ConceptScheme, sdmx:CodeList;
    skos:prefLabel "Code list for Sex (SEX) - codelist scheme"@en;
    rdfs:label "Code list for Sex (SEX) - codelist scheme"@en;
    skos:notation "CL_SEX";
    skos:note "This  code list provides the gender."@en;
    skos:definition <http://sdmx.org/wp-content/uploads/2009/01/02_sdmx_cog_annex_2_cl_2009.pdf> ;
    rdfs:seeAlso sdmx-code:Sex ;
    skos:hasTopConcept sdmx-code:sex-F, sdmx-code:sex-M .

sdmx-code:Sex a rdfs:Class, owl:Class;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "Code list for Sex (SEX) - codelist class"@en;
    rdfs:comment "This  code list provides the gender."@en;
    rdfs:seeAlso sdmx-code:sex .

sdmx-code:sex-F a skos:Concept, sdmx:Concept, sdmx-code:Sex;
    skos:topConceptOf sdmx-code:sex;
    skos:prefLabel "Female"@en ;
    skos:notation "F" ;
    skos:inScheme sdmx-code:sex .

sdmx-code:sex-M a skos:Concept, sdmx:Concept, sdmx-code:Sex;
    skos:topConceptOf sdmx-code:sex;
    skos:prefLabel "Male"@en ;
    skos:notation "M" ;	
    skos:inScheme sdmx-code:sex .

skos:prefLabel is used to give a name to the code, skos:note gives a description, and skos:notation can be used to record a short-form code which might appear in other serializations. The SKOS specification [SKOS] recommends the generation of a custom datatype for each use of skos:notation, but here the notation is not intended for use within RDF encodings; it merely documents the notation used in other representations (which do not use such a datatype).

The skos:ConceptScheme derived from the ItemScheme is also typed as an sdmx:CodeList.

It is convenient and good practice when developing a code list to also create an owl:Class to denote all the codes within the code list, irrespective of hierarchical structure. This allows the range of an sdmx:componentProperty to be defined by using rdfs:range which then permits standard RDF closed-world checkers to validate use of the code list without requiring custom SDMX-RDF-aware tooling. We do that in the above example by using the common convention that the class name is the same as that of the concept scheme but with leading upper case.

The above example is based on the SDMX Content Oriented Guidelines [SDMX COG CL], though simplified by omitting the other codes T, U and N. For convenience, each of the SDMX COG code lists has been translated to this format at http://purl.org/linked-data/sdmx/2009/code# to facilitate reuse.

This code list can then be associated with a coded property, such as a dimension:

  eg:sex a sdmx:DimensionProperty, sdmx:CodedProperty;
      sdmx:codeList sdmx-code:sex ;
      rdfs:range sdmx-code:Sex .

For those SDMX COG Code Lists which have corresponding SDMX COG dimensions or attributes (including sdmx-dimension:sex), this binding has already been provided in: http://purl.org/linked-data/sdmx/2009/attribute#, http://purl.org/linked-data/sdmx/2009/dimension#, and http://purl.org/linked-data/sdmx/2009/measure#.

In some cases a controlled set of URI resources might already exist but not as a SKOS concept scheme; for example, identifiers already exist for things like geographic entities and time periods. It is not necessary to duplicate such resources as skos:Concepts within a skos:ConceptScheme; the resources can be used directly. In that case the OWL (or RDFS) class which denotes the set of resources can be used in the definition of the corresponding dimension or attribute property.

  eg:refArea a sdmx:DimensionProperty, sdmx:CodedProperty;
      sdmx:codeList eg:GeographicAreaClass ;
      rdfs:range eg:GeographicAreaClass .

In some cases code lists have a hierarchical structure. In particular, this is used in SDMX when the data cube includes aggregations of data values (e.g. aggregating a measure across geographic regions). Hierarchical code lists should be represented using the skos:narrower relationship to link from the skos:hasTopConcept codes down through the tree or lattice of child codes. In some publishing tool chains the corresponding transitive closure skos:narrowerTransitive will be automatically inferred. The use of skos:narrower makes it possible to declare new concept schemes which extend an existing scheme by adding additional aggregation layers on top. All items are linked to the scheme via skos:inScheme.
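A hierarchical code list might therefore be sketched as follows, using hypothetical eg: codes for aggregated geographic areas:

  eg:regionCodeList a skos:ConceptScheme, sdmx:CodeList ;
      skos:hasTopConcept eg:region-UK .

  eg:region-UK a skos:Concept ;
      skos:topConceptOf eg:regionCodeList ;
      skos:inScheme     eg:regionCodeList ;
      skos:narrower     eg:region-Wales, eg:region-Scotland .  # child codes aggregated by eg:region-UK

  eg:region-Wales a skos:Concept ;
      skos:inScheme eg:regionCodeList .

  eg:region-Scotland a skos:Concept ;
      skos:inScheme eg:regionCodeList .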

7. Designing concept schemes

The resource derived from a ConceptScheme should be typed as skos:ConceptScheme and sdmx:ConceptScheme. Each concept is an sdmx:Concept. If there is a datatype associated with a Concept in the ConceptScheme, then the corresponding XSD datatype (such as xsd:string, xsd:integer) is attached to the Concept using sdmx:coreType.
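For example, a concept scheme might be sketched as follows (hypothetical eg: resources):

  eg:demographyConcepts a skos:ConceptScheme, sdmx:ConceptScheme ;
      skos:prefLabel "Demography concepts"@en ;
      skos:hasTopConcept eg:age .

  eg:age a skos:Concept, sdmx:Concept ;
      skos:inScheme  eg:demographyConcepts ;
      skos:prefLabel "Age"@en ;
      sdmx:coreType  xsd:integer .   # datatype associated with the concept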

8. Annotations

Most annotations of the data should be handled via attributes if possible. Quote from the Implementor's Guide:

It is also possible to associate annotations (Annotation) with both the structures described in key families and the observations contained in the data set. These annotations are a slightly atypical form of documentation, in that they are used to describe both the data itself - like other attributes - but also may be used to describe other metadata. An example of this is methodological information about some particular dimension in a data structure definition structure, attached as an annotation to the description of that dimension. Regular "footnotes" attached to the data as documentation should be declared as attributes in the appropriate places in a data structure definition – annotations are irregular documentation which may need to be attached at many points in the data structure definition or data set.

Annotations in the sense of the text above are handled using SKOS. Any resource in a data structure definition, dataset, or anywhere else can be annotated using this mechanism.

To annotate a resource, a skos:note property is attached to it. The value of the property is a new resource (not a literal). The actual text of the annotation is attached to this resource as a literal via rdfs:label. Other RDF properties from well-known vocabularies can be used on this annotation resource to provide additional information. The following properties are especially noteworthy, because they have counterparts in the SDMX information model:

Property     | Use
-------------+---------------------------------------------------------------
rdfs:label   | name or label for the annotation
rdfs:seeAlso | link to external web document with descriptive text
rdf:type     | extension point for annotations that are to be processed in a
             | particular way

@@@ Example
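A possible sketch, annotating a hypothetical dimension property with a hypothetical annotation type:

  eg:refArea skos:note [
      rdfs:label   "Methodological note on the area classification"@en ;
      rdfs:seeAlso <http://example.org/docs/area-methodology> ;  # external descriptive document
      rdf:type     eg:MethodologyNote ;                          # extension point for typed annotations
      rdfs:comment "Areas follow the 2009 administrative geography."@en
  ] .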

9. Collections of DataSets

SDMX-RDF provides two methods for grouping DataSets into aggregate structures - DataFlows (periodic sequences of DataSets with a common structure) and Reports (arbitrary collections of DataSets, DataFlows and nested reports).

9.1 DataFlows

SDMX defines the notion of a DataFlow to represent a regular sequence of DataSet publications. This is used to support publication and notification of DataSets within some series and often there will be a provision agreement between a data provider and data consumers concerning the structure (DataStructureDefinition) and frequency of sets within the flow.

In SDMX-RDF, a DataFlow is represented by an instance of the class sdmx:DataFlow. Like a DataSet, a DataFlow can be classified using dct:subject to reference a concept within some concept scheme, and is linked to a DataStructureDefinition via sdmx:structure. The individual data sets within a flow are linked to the flow using sdmx:dataFlow. For example:

  eg:unemploymentDataFlow a sdmx:DataFlow ;
      dct:subject sdmx-subject:1.2 ;   # Labour market
      rdfs:label "Unemployment data flow"@en ;
      rdfs:comment "fictitious set of quarterly data unemployment statistics"@en ;
      sdmx-attribute:freq sdmx-code:freq-Q ;
      sdmx:structure eg:unemploymentDSD ;
      .
      
  eg:unemployment2009Q4 a sdmx:DataSet ;
      rdfs:label "unemployment 2009 Q4"@en ;
      rdfs:comment "unemployment statistics for 2009 quarter 4"@en ;
      # ... other metadata omitted
      sdmx:structure eg:unemploymentDSD ;
      dct:subject sdmx-subject:1.2 ;   # Labour market
      sdmx:dataFlow   eg:unemploymentDataFlow ;
      .

9.2 Reports

DataFlows are one way of relating DataSets together, but they are specific to regular publication work flows and only group DataSets with the same logical structure. In some situations an agency publishes a collection of statistics as a bundle: the statistics cover different topics and have different DataStructureDefinitions, but are related together as some form of coherent report.

SDMX-RDF provides a class sdmx:Report to represent such collections. Reports can be used to group DataSets, DataFlows or other Reports together into arbitrary groupings. The individual components of a report are linked to the sdmx:Report through use of the sdmx:reportComponent property. A Report can be annotated with metadata using the same Dublin Core terms and conventions described above for DataSets and DataFlows.
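For example (a sketch; the eg: resources are hypothetical):

  eg:annualLabourMarketReport a sdmx:Report ;
      rdfs:label "Annual labour market report"@en ;
      dct:subject sdmx-subject:1.2 ;                   # Labour market
      sdmx:reportComponent eg:unemploymentDataFlow ,   # a data flow
                           eg:earnings2009 .           # a data set with a different DSD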

10. URIs, resolvability and publishing

URIs should be, if possible:

- globally unique
- resolvable
- "dual-use" (for people and machines)
- allow metadata ("Cool URIs" compatible)

Good practices for versioning:

- don't put version numbers into URIs of skos:Concepts etc
- can be ok to put version numbers into URIs of ConceptSchemes, CodeLists
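As an illustration of the versioning guideline, a versioned code list URI might reference unversioned code URIs (all URIs here are hypothetical):

  <http://example.org/def/codelist/2009/sex> a skos:ConceptScheme, sdmx:CodeList ;
      skos:hasTopConcept <http://example.org/def/code/sex-F> .   # the code URI carries no version number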

Acknowledgements

@@@

This paper is based on the collaboration that was initiated in a workshop Publishing statistical datasets in SDMX and the semantic web hosted by ONS in Sunningdale, United Kingdom in February 2010. The completion of a draft reference model was one of several recommendations made by the participants, and this ongoing work continues in an open collaborative environment. Taken together with the proposed collaboration to create a recommended style for URI design for use in APIs to find, obtain and query statistical data, we believe this work represents a key step towards bringing the worlds of linked data and official statistics together through the wider adoption of open standards. The authors would like to thank all the participants at that workshop for their input into this work.

The authors would also like to thank John Sheridan for his comments and suggestions on an earlier draft of this paper.

@@@ These are work-in-progress notes on mapping the SDMX standard to RDF. These notes are based on initial work by Wolfgang Halb, Jeni Tennison, Arofan Gregory and me, done at the Workshop on Publishing Statistical Data with SDMX and the Semantic Web in February 2010.

References

@@@ [SCOVO] http://sw.joanneum.at/scovo/schema.html, The Statistical Core Vocabulary
@@@ [SCOVO] http://sw-app.org/pub/eswc09-inuse-scovo.pdf, SCOVO: Using Statistics on the Web of data
@@@ [SDMX] http://www.sdmx.org/docs/2_0/SDMX_2_0%20SECTION_02_InformationModel.pdf, SDMX Information Model
@@@ [RDF] http://www.w3.org/standards/techs/rdf#w3c_all, RDF Current Status
@@@ [SCOVO, SDMX] http://events.linkeddata.org/ldow2010/papers/ldow2010_paper03.pdf, Semantic Statistics: Bringing Together SDMX and SCOVO
@@@ [SDMX COG SMD] http://sdmx.org/wp-content/uploads/2009/01/03_sdmx_cog_annex_3_smd_2009.pdf
@@@ [SDMX COG CL] http://sdmx.org/wp-content/uploads/2009/01/02_sdmx_cog_annex_2_cl_2009.pdf

Appendix 1: From SDMX-IM to SDMX-RDF

This appendix contains a reference of concepts from the SDMX Information Model (SDMX-IM) and their translations to SDMX-RDF. When completed, this will contain an entry for every class that can be found in the UML diagrams of SDMX-IM. This might eventually become a separate document.

The following list enumerates mappings that have to be prepared to achieve a full translation of an SDMX-IM instance to SDMX-RDF. These mappings have to be created manually and are required as input to the translation process.

AnnotableArtefact
Translated to an RDF resource. For each associated Annotation, a skos:note property is attached whose value is the translation of the annotation.
Annotation
Translated to a blank node. The name field, if present, is translated to an rdfs:label literal. If the language of the annotation is known, an appropriate language tag should be used for the literal. The url field, if present, is translated to an rdfs:seeAlso value, with a URI object (not a literal object). The type field is translated to an rdf:type value. The object is a URI that is obtained from the type string through the annotation type mapping. If no mapping is defined for the string value, then no rdf:type triple is generated and the type value is lost. If a text is associated, then the text's LocalisedString members are attached to the annotation via rdfs:comment.
InternationalString
Translated to a set of RDF literals. Each of its LocalisedString members becomes one such literal.
LocalisedString
Translated to a language-tagged RDF literal. The literal's lexical value is the label field. The locale field is translated to a language tag using the locale to language tag mapping. If no mapping is defined for a locale, then the locale is checked for conformance to the RDF language tag syntax; if it matches, then the locale is used directly as a language tag. Otherwise, a plain literal is generated from this LocalisedString.
@@@ Cross-sectional observations

@@@ This currently just discusses how to interpret some of the XS stuff and convert it to time series style.

Each XSObservation has exactly one "number" in it, attached to property "value".

Each XSObservation has a reference to exactly one "XSMeasure".

The XSMeasures are defined in the DSD. They could be "Weight", "Volume" and "Price". Since XSMeasure inherits from Measure, each of these XSMeasures is associated with a concept. In a simple design, this would be everything that XSMeasure does: It would merely form a connection from a concept to the DSD.

To support the transformation of a cross-sectional dataset to a time series dataset, the following trick is used: The DSD contains one or more fake dimensions, called MeasureTypeDimensions. The code list for this dimension could be "w", "v", "p". Each XSMeasure is associated with one MeasureTypeDimension, and with one code from the MeasureTypeDimension's code list. For example, the "Weight" XSMeasure could be associated with the "w" code.

When the cross-sectional dataset is transformed to a time series dataset, each XSObservation is turned into one normal Observation associated to a TimeSeries. These normal observations have no association with an XSMeasure. In order not to lose this association, the Observation's time series will have an additional dimension -- the MeasureTypeDimension. The Observation will be attached to a TimeSeries where the value of the MeasureTypeDimension matches the XSMeasure. For example, an XSObservation attached to the "Weight" XSMeasure would end up on a time series whose MeasureTypeDimension value is "w".
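A sketch of the result of such a conversion (all eg: names, the measure-type dimension property and its codes are hypothetical):

  # time series carrying the measure-type dimension value for "Weight"
  eg:weightSeries a sdmx:TimeSeries ;
      eg:measureType   eg:measureType-w ;
      sdmx:observation eg:obs42 .

  # the former XSObservation, now an ordinary observation
  eg:obs42 a sdmx:Observation ;
      sdmx:dataset  eg:dataset1 ;
      sdmx:obsValue 12.5 .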

Appendix 2: namespaces used in this document

prefix | namespace URI                               | vocabulary
-------+---------------------------------------------+--------------------------------------
rdf    | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF core
rdfs   | http://www.w3.org/2000/01/rdf-schema#       | RDF Schema
skos   | http://www.w3.org/2004/02/skos/core#        | Simple Knowledge Organization System
foaf   | http://xmlns.com/foaf/0.1/                  | Friend Of A Friend
void   | http://rdfs.org/ns/void#                    | Vocabulary of Interlinked Datasets
scovo  | http://purl.org/NET/scovo#                  | Statistical Core Vocabulary
dc     | http://purl.org/dc/elements/1.1/            | Dublin Core

Questions