
ENABLING DATA COMPILATION FROM COUNTRIES USING SDMX

Prepared by

Edgardo GREISING
Database Manager, Department of Statistics
International Labour Office
Route de Morillons 4, CH-1211 Geneva 22, Switzerland
E-mail: greising@ilo.org


I. Rationale

1. Every international organization collecting statistical indicators from countries aims to obtain the best possible response rate from the countries, to reduce the delay between the production and publication of the information received, and to improve the overall quality of the data published, while trying to avoid duplication of effort and to reduce the burden on countries.


2. Among the different data channels provided, electronic data interchange through the use of the SDMX standard seems to be a good solution to achieve these objectives.


3. "Doing" SDMX requires good conceptual knowledge of the standard and the implementation (or adaptation to one's own environment) of appropriate software tools to generate the SDMX data flows defined and to pull the data from the corresponding databases.


4. Nevertheless, and beyond the costs and effort demanded by this implementation, many developing countries do not have a repository of indicators. In those countries the indicators are calculated as part of a publication plan and then tabulated, and the results are published on the website of the NSI, often in PDF or Excel. However, the data is not preserved in a central repository, which makes the generation of SDMX files (or even CSV) very difficult, because there is no database from which to take this information.


5. Moreover, there are other institutions that do have an indicators database, but for which the resources required to disseminate the data in a particular format (such as SDMX) are not available. And SDMX is not easy to understand, and harder still to implement, even for simple data reporting.


6. Many people tend to think that "SDMX is too complex, and a simpler format like CSV should be used". This argument is fallacious, since SDMX is a standard protocol for exchanging statistical data and metadata that comprises its own information model and can be implemented on top of XML or UN/EDIFACT syntax, while CSV is just another file format syntax, comparable to XML or UN/EDIFACT. If somebody wants to interchange data using CSV files, a protocol will have to be defined; and if it includes an information model to support both data and metadata, and is generic enough to become an ISO standard, it will probably be as complex as SDMX is.
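To make the point concrete, the sketch below (in Python, with entirely hypothetical dimension names, attribute names and codes) shows the minimum that two parties would still have to agree on before a CSV file could be interpreted unambiguously: which columns exist, which of them are dimensions or attributes, and which codes are allowed. That agreement is precisely the information model that SDMX already standardizes; the CSV syntax itself is the easy part.

    # Illustrative sketch only: a minimal ad-hoc "protocol" that a CSV exchange
    # would still have to agree on before any file can be interpreted.
    # All names (columns, dimensions, codes) are hypothetical examples.

    CSV_EXCHANGE_SPEC = {
        # column order and meaning (the "structure definition")
        "columns": ["REF_AREA", "INDICATOR", "SEX", "TIME_PERIOD", "OBS_VALUE", "OBS_STATUS"],
        # which columns identify a series (dimensions) vs. qualify an observation (attributes)
        "dimensions": ["REF_AREA", "INDICATOR", "SEX", "TIME_PERIOD"],
        "attributes": ["OBS_STATUS"],
        # allowed codes per coded column (the "code lists")
        "codelists": {
            "SEX": {"T": "Total", "M": "Male", "F": "Female"},
            "OBS_STATUS": {"A": "Normal", "E": "Estimated", "P": "Provisional"},
        },
    }

    def validate_row(row: dict) -> list[str]:
        """Return a list of problems found in one CSV row under the agreed spec."""
        problems = []
        for col in CSV_EXCHANGE_SPEC["columns"]:
            if col not in row:
                problems.append(f"missing column {col}")
        for col, codes in CSV_EXCHANGE_SPEC["codelists"].items():
            if row.get(col) not in codes:
                problems.append(f"invalid code {row.get(col)!r} for {col}")
        return problems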


7. Even though it is possible to define a protocol for data interchange using CSV files, it is not easy to create these files without such a repository, especially if the files have to include descriptive metadata.


8. This paper elaborates on the idea of implementing an "out-of-the-box" software package, easy to deploy and maintain by the IT specialists and statisticians (not SDMX experts) of any NSI or any reporting agency, that could act as a central repository for the statistical indicators administered and provided by the agency and be able to generate SDMX data flows to disseminate the data, including transmission to international organizations.


II. Requirements

9. The tool should be very easy to install and maintain, be supported on free software platforms (DBMS, application server, etc.) and come with good documentation, tutorials and training.


10. To standardize the structure of the tables (multi-dimensional cubes) to be collected, the compiler (international organization) should be able to define it. For this purpose, the tool should be able to take a definition file (in SDMX, a DSD and a dataflow definition), define the table structure to hold the data, and import code lists and descriptive metadata from the attributes. The DSD could be taken from an SDMX registry or, without connecting to a registry, by importing the XML file.
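As an illustration of paragraph 10, the following sketch (a minimal example only, not a complete SDMX parser; the file name, table name and column types are assumptions) reads the dimensions and attributes declared in an SDMX-ML 2.1 DSD file and derives a relational table definition from them:

    # A minimal sketch: derive a flat table definition from the dimensions and
    # attributes declared in an SDMX-ML 2.1 DSD file. Real DSDs may need fuller
    # namespace and representation handling; names here are illustrative.
    import xml.etree.ElementTree as ET

    def table_ddl_from_dsd(dsd_path: str, table_name: str = "indicator_data") -> str:
        tree = ET.parse(dsd_path)
        dims, attrs = [], []
        for elem in tree.iter():
            tag = elem.tag.split("}")[-1]          # ignore namespaces for brevity
            if tag in ("Dimension", "TimeDimension") and elem.get("id"):
                dims.append(elem.get("id"))
            elif tag == "Attribute" and elem.get("id"):
                attrs.append(elem.get("id"))
        cols = [f"{d} VARCHAR(50) NOT NULL" for d in dims]
        cols += [f"{a} VARCHAR(50)" for a in attrs]
        cols.append("OBS_VALUE NUMERIC")
        key = ", ".join(dims)
        return ("CREATE TABLE " + table_name + " (\n  " + ",\n  ".join(cols) +
                f",\n  PRIMARY KEY ({key})\n);")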




11. The tool must have an integrated data discovery and data visualization tool, and a full-screen editor to enter/modify data and descriptive metadata (including footnotes), based on the structural metadata that defines the tables. An interactive tool for defining the tables (a structural metadata editor) would be desirable as well.


12. It should provide different formats for downloading/uploading tables: Excel, CSV, PDF (reports) and, of course, SDMX. Dissemination of data should be supported in the form of web services and/or downloadable files.


III. Possible solutions

13. At the time of writing this paper, we are still looking for existing solutions that could cover most of the requirements. Several alternatives have been identified so far, each one with its strengths and weaknesses.


14. FAO's CountryStat project [1] is a web-based information technology system for food and agriculture statistics at the national and subnational levels. In practice, it acts as a one-stop centre which centralizes and integrates the data coming from various sources and allows it to be harmonized according to international standards while ensuring data quality and reliability.


15. The application can be deployed as a stand-alone solution on a server located in the country or can be used in a Software-as-a-Service mode hosted centrally by FAO. In either case, database administration remains the country's responsibility.


16. Among the features of CountryStat it is worth highlighting a great flexibility for uploading data into the local repository, although there is no editor for entering or modifying the data. The data can be disseminated locally and is transmitted to the central database using a proprietary protocol (SDMX is in the pipeline). The time series are one-dimensional (not cubes) and descriptive metadata management is limited.


17. .Stat [2], the data warehouse from OECD and the Statistical Information System Collaboration Community (SIS-CC), is a very good statistical data warehouse with a well-organized support and collaboration model behind it.


18. The product is basically oriented towards dissemination, and includes the possibility of displaying tables as well as publishing SDMX data. Data entry is based on the Data Provider Interface (DPI), which allows the user to define the structural metadata and the mappings to the production database, from where data is then processed by the EntryGate through XML files. Its online data collection and editing capabilities should be improved.


19. Fusion Toolsuite [3] from Metadata Technology Ltd., composed of several tools to work around the SDMX paradigm (some of them free, others with a licence cost), can achieve most of the required functionality. It should be complemented with some additional interfaces for data collection and editing.


20. Since 2009 Eurostat has been developing its SDMX Reference Infrastructure (SDMX-RI) [4] to help any system that is not "SDMX enabled" make use of the standard for data collection and dissemination. It allows for defining the mapping between the SDMX concepts and the production database and provides the interfaces through web services. Nevertheless, SDMX-RI does not include a data repository; one is assumed to already exist in the agency that wants to collect/disseminate the data.


21. Data Loader [5] is a standalone tool created by Istat (Italy's NSI) to enable decentralized data collection at the national level, which probably could be extrapolated to the international level.


22. The idea was to extend SDMX-RI by creating and populating a dissemination database interacting with the Mapping Store database, from which it retrieves the structural metadata information necessary to upload the data. Dissemination is then assured by the SDMX-RI components.


23. For data entry it is able to ingest CSV or fixed-length record (FLR) files by defining the mapping between the different concepts in a DSD and the columns of the file. Its online data collection and editing capabilities should be improved.
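The kind of column-to-concept mapping referred to in paragraph 23 could look like the following sketch, in which a CSV file with agency-specific column names is translated into observations keyed by DSD concepts. The column and concept names are hypothetical, and the code illustrates the idea only, not Data Loader's actual interface:

    # Illustrative only: a column-to-concept mapping of the kind the user is
    # asked to define before a csv file can be loaded. Names are hypothetical.
    import csv

    COLUMN_TO_CONCEPT = {          # csv column -> DSD concept
        "country":   "REF_AREA",
        "indicator": "INDICATOR",
        "year":      "TIME_PERIOD",
        "value":     "OBS_VALUE",
        "flag":      "OBS_STATUS",
    }

    def read_observations(csv_path: str):
        """Yield observations keyed by DSD concepts instead of file-specific columns."""
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield {concept: row.get(col, "") for col, concept in COLUMN_TO_CONCEPT.items()}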





Footnotes:

[1] http://www.fao.org/economic/ess/countrystat/en/
[2] See Appendix I - What is .Stat
[3] See Appendix II - Fusion Toolsuite
[4] https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php/SDMX_Reference_Infrastructure_SDMX-RI
[5] See Appendix III - Data Loader



24. Another alternative to analyse is the possibility of adding to the common reporting tools used to compute and disseminate the indicators (Stata, SPSS, R, SAS, etc.) the ability to write SDMX datasets. In that case, the user would be able to generate an SDMX-compliant file in the same way as the Excel or PDF reports used for dissemination are generated.


25. It is important to take into account that, for this alternative, the software should be able not only to write SDMX-compliant output but also to import a DSD containing the definition of the concepts and code lists to be used.
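A much simplified sketch of what such an export routine could do is shown below: it takes tabulated results, as they might come out of a statistical package, and writes them as a flat XML data set whose attributes follow the DSD concepts. It is not a schema-complete SDMX-ML message (header, namespaces and dataset-level attributes are omitted), and all concept names are illustrative:

    # Simplified sketch, not a schema-complete SDMX-ML writer: each observation
    # becomes an element whose attributes are the DSD concepts.
    import xml.etree.ElementTree as ET

    def write_dataset(observations, out_path: str):
        """observations: iterable of dicts keyed by DSD concepts, e.g.
        {"REF_AREA": "CHE", "INDICATOR": "UNE_RATE", "TIME_PERIOD": "2012",
         "OBS_VALUE": "4.2", "OBS_STATUS": "A"}  (hypothetical concept names)"""
        dataset = ET.Element("DataSet")
        for obs in observations:
            ET.SubElement(dataset, "Obs", {k: str(v) for k, v in obs.items()})
        ET.ElementTree(dataset).write(out_path, encoding="utf-8", xml_declaration=True)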


26. On the other hand, the handling of descriptive metadata (footnotes, flags) to be attached to the observation values and to the table's components may be complex when we think of producing the SDMX data file directly from the statistical processor.




IV. At a glance

Requirement                           | CountryStat | .Stat             | Fusion Toolsuite    | SDMX-RI   | SDMX-RI + Dataloader
Developer                             | FAO         | OECD / SIS-CC     | Metadata Technology | Eurostat  | Istat
Governance/deployment experience      | Yes         | Yes               | No                  | Yes       | No
Documentation, tutorials, training    | Yes         | ?                 | ?                   | Yes       | ?
Multidimensional cubes                | No          | Yes               | Yes                 | Yes       | Yes
Interactive cube structure definition | ?           | Yes               | Yes                 | Yes       | Yes
Define cube structure by DSD          | No          | No                | Yes                 | Yes       | Yes
Data discovery & visualization        | Yes         | Yes               | Yes                 | No        | No
Online data & metadata editor         | No          | No                | No                  | No        | No
Descriptive metadata management       | Limited     | Yes               | ?                   | Yes       | ?
SDMX exports/web services             | No          | Yes               | Yes                 | Yes       | Yes
Other format exports                  | Yes         | Yes               | Yes                 | Converter | Converter
Commercial status                     | Free        | Collab. Community | $                   | Free      | ?


APPENDIX I - What is .Stat

.Stat is the central repository ("warehouse") where validated statistics and related metadata are stored. It provides the sole and coherent source of statistical data and related metadata for an organisation's statistical data sharing, publication and electronic dissemination processes.

.Stat enables users to easily locate required data from a single online source, rather than having to navigate multiple databases and data formats, each with its own query/manipulation systems. And access to systematic metadata in .Stat helps ensure appropriate selection and use of statistical information.

Main features

- a rich data visualisation and exploration interface, allowing dimension selection, query saving and data extraction in multiple formats
- data providers are autonomous for most data loads and management features, with an overall reduction in data administration and a centralised system for access rights management by data providers
- separation of the production and dissemination environments for improved performance and data integrity
- a single repository that can integrate with analytical tools and publication processes through various interfaces
- web services in support of open data access, built on internationally recognised standards through the use of the Statistical Data and Metadata Exchange (SDMX) web service


.Stat current evolutions

Search API: connect .Stat to any third-party search engine using a dedicated web service.

Open data APIs (BETA):
- SDMX-JSON WS
- OData WS



.Stat Web Services and interaction with Web browser (WBOS)

[Figure: web service components - Data Extractor (hide empty rows), Dataset Browser, Bulk Extract, SDMX, Search, Authentication, Table Generator, Query Manager]


.Stat Architecture

[Figure: .Stat architecture diagram]

Data Flow


Data Entry

The Data Provider Interface (DPI) is a web-based application and the interface between data production and data publishing that enables data providers to manage their data in the Data Warehouse. The DPI creates work packages for the EntryGate to process and load data into the Data Warehouse.

Data Provider Interface (DPI)
- a web-based application and the interface between data production and data publishing,
- enables data providers to manage their data in the data warehouse, including:
  - dataset and dimension structure,
  - data content,
  - data presentation,
  - metadata,
  - core data view,
  - and related files.
- The DPI creates work packages for the entry gate to process and load data into the data warehouse.

EntryGate
- a data transfer "protocol" that uses a well-defined XML stream and hot folders to manage the loading of data into .Stat.
- The permanently running Entry-Gate-Service watches and processes all "incoming" XML files one at a time; all other requests are queued.
- The data provider is informed (as specified through the DPI) by email about the outcome (success, failure, errors, warnings). If any error occurred and the processing failed, .Stat performs a complete rollback to the situation before the submission of the XML stream.
- The data provider can automate the update of data by creating batch jobs that regularly extract the relevant data from the production database, create the necessary XML file and put a copy of this file at the .Stat Entry-Gate (see the sketch after this list).
- XML format was chosen for its:
  - flexibility,
  - platform and software independence,
  - readability,
  - and well-structured format.
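The batch-job automation mentioned in the list above could, for example, follow the pattern sketched below. The connection, query, XML layout and hot-folder path are all assumptions made for illustration; the real EntryGate expects its own well-defined XML stream as specified through the DPI:

    # Hedged sketch of the automated update described above: extract rows from a
    # production database, write them to an XML file and drop a copy in the
    # Entry-Gate hot folder. Paths, query and XML layout are hypothetical.
    import shutil
    import sqlite3                       # stand-in for the real production DBMS
    import xml.etree.ElementTree as ET

    HOT_FOLDER = r"\\stat-server\entrygate\incoming"   # hypothetical path

    def nightly_update(db_path: str, workdir: str = "."):
        con = sqlite3.connect(db_path)
        rows = con.execute(
            "SELECT ref_area, indicator, time_period, obs_value FROM indicators"
        ).fetchall()
        root = ET.Element("DataSet")
        for ref_area, indicator, period, value in rows:
            ET.SubElement(root, "Obs", REF_AREA=str(ref_area), INDICATOR=str(indicator),
                          TIME_PERIOD=str(period), OBS_VALUE=str(value))
        xml_path = f"{workdir}/update.xml"
        ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
        shutil.copy(xml_path, HOT_FOLDER)   # the Entry-Gate-Service picks it up from here
        con.close()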

Data Storage

Based on standard star schema data warehouse technology, the Data Warehouse is able to process and handle multiple updates per hour in a synchronised way.

Data warehouse
- Microsoft SQL Server 2005 or above,
- standard star schema data warehouse technology,
- OECD has over 800 datasets (tables),
- the database is approximately 330 GB in size but built to be scalable,
- able to process and handle multiple updates per hour, with all updates synchronised,
- a maximum of 16 dimensions, with a current maximum of 7,000 items per dimension for optimal performance,
- snapshots and CSV exports provide archive functionality.





Data Exit

A single exit point serves all outputs from the Data Warehouse. This consists of a set of web services, which are exploited by the various dissemination tools such as the .Stat Web Browser.

Services

A number of services complete the .Stat solution. These include:
- internationally recognised standards through the Statistical Data and Metadata Exchange (SDMX) web service,
- Entry Gate, which enables the processing of XML command files to load data from production data systems through the DPI,
- Dataset Browser, which serves up data to the OECD.Stat browser interface,
- and Authentication, which allows for single sign-on (SSO) management of groups and dataset access permissions.



Dissemination

The .Stat browser allows for:
- tabular views of datasets,
- pivoting functionality,
- export to CSV, Excel, SDMX,
- saving queries,
- and merging queries to compare datasets.

The eXplorer component viewer is a graphical component that provides mapping and charting of data.






APPENDIX II - Fusion Toolsuite

All the following solutions are products developed by Metadata Technology which have been built with SdmxSource at the core. Each product plays a specific role, to support data, structural metadata and reference metadata. Each can operate on its own, or they can work together to provide an integrated solution.

Fusion Matrix

Functional Role

[Figure 1: static mock-up of the Fusion Matrix user interface]


The Fusion Matrix is a dissemination database. The database tables are generated automatically based on the Data Structure Definition. The tables are indexed and optimised for data retrieval. On data import, the data are validated against the DSD and any Constraints defined (i.e. a subset of valid codes or keys). The resulting table structure is very simple, with only two tables for each Data Structure.

The Fusion Matrix offers SDMX-compliant web services for external clients to directly query the database. The Fusion Matrix provides a user interface to view the data and related metadata. The Fusion Browser Excel plugin tool integrates with the Fusion Matrix, allowing users to directly query the database from within Excel.

The Fusion Matrix currently does not have a back-office user interface. All data is imported/deleted through the command-line tool. Data can be imported in any SDMX format, as well as in CSV format.

The Fusion Matrix can make use of the Fusion Registry to obtain the structural metadata (codelists, concepts, data structures).

The Fusion Matrix can make use of Fusion Scribe to obtain the reference metadata.

The development plan for Fusion Matrix includes the addition of a back-office GUI which allows users to insert/amend/delete data directly in the database. The plan also includes allowing the user to identify observation(s) or keys that require metadata using the Fusion Matrix UI. In addition, the generation of PDF output is in the plan schedule.

Technical Requirements

There is no installation required. Fusion Matrix consists of a command-line tool to perform imports and a web application for dissemination. Fusion Matrix requires the following software to be available (all of which is freely available):
- MySQL database
- Apache Tomcat web server

Fusion Registry

Functional Role

The Fusion Registry is a structural metadata maintenance and dissemination application. It provides web services for external applications to obtain structures. Fusion Matrix does not require a Fusion Registry to be present; it just requires structural metadata. Structural metadata can be provided via any SDMX-conformant web service, including the Fusion Registry, or an SDMX file.

Technical Requirements

There is no installation required. The Fusion Registry is a single web application, which is deployed to an Apache Tomcat. Fusion Registry requires the following software to be available (all of which is freely available):
- MySQL/Oracle/SQL Server database
- Apache Tomcat web server

Fusion Scribe

Functional Role

The Fusion Scribe is a reference metadata repository, which allows the authoring and dissemination of reference metadata. Fusion Matrix does not require Fusion Scribe; however, if it has access to the Fusion Scribe, it will integrate with it. This means the Fusion Matrix can enrich the data responses with additional metadata from Fusion Scribe.

Technical Requirements

There is no installation required. The Fusion Scribe consists of a web application for metadata maintenance and one for dissemination, both of which are deployed to an Apache Tomcat. Fusion Scribe requires the following software to be available (all of which is freely available):
- MongoDB database
- Apache Tomcat web server

Fusion Browser

Functional Role

[Figure 2: Fusion Browser - navigating the Dataflows (called topics)]

[Figure 3: On selecting a Dataflow (topic), selecting values per dimension. Available choices are updated on click to prevent the user from making an invalid combination of selections]

[Figure 4: The results of a data query. The user is presented with additional tools to chart, pivot and perform arithmetic on the data]

Fusion Browser is an Excel plugin for Excel 2007 and upwards. It adds a new option to the Excel Ribbon, which allows users to browse data from configurable data sources. Fusion Matrix has been enabled to support the Fusion Browser. The Fusion Browser is an open-source product from Metadata Technology, and the beta release is imminent.

Technical Requirements

The Fusion Browser does not require installation and, other than Excel, has no other software requirements.

Benefits of the Solution to the Reporting Agency

The use of the Fusion Matrix integrated solution provides the reporting agency with:
- an easy-to-use and performant database for permanent storage that can be used for data reporting and data visualisation, supporting multiple statistical domains
- the ability to report data in a standardised format
- the ability to query and visualise data using an Excel add-in
- an easy-to-use metadata authoring and reporting tool that integrates metadata automatically with the data that is queried

The benefits to the reporting agency of using the Fusion Matrix integrated solution are:

1. The need for technical resources is minimal, as the integrated toolset has minimal installation requirements.
2. The solution can be used for statistical data storage, reporting and visualisation of any type of data, as the toolset is not specific to any particular statistical domain.
3. Interfacing to the toolset is via software already installed on the user desktop (web browser, Excel) with which the user is well acquainted.
4. It enables the organisation to better manage its statistical resources by combining the major aspects of statistical data and metadata storage, visualisation and dissemination.

Additional Tools

Introduction

There will be reporting organisations that wish to take advantage of plugging into the SDMX world (and therefore be able to take advantage of other SDMX tools) using an existing database, but do not have the resources to learn the SDMX standards. The way this can be achieved is covered in this section.

A possible way for a data collection organisation to collect data is to implement a web service which allows users to upload datasets. It is probable that such a service will need to have sufficient security to ensure the data it collects comes from validated data reporters and is compliant in terms of the data itself. This topic is also covered in brief in this section.

Organisations with an Existing Database

This covers organisations that have an existing database and wish to use an easy way to retrieve and format data for reporting or dissemination.

Solution

The SdmxSource has a variety of components that can be packaged into a "data access" utility that comprises two major components:

Query reader - this makes it easy for an organisation to process an SDMX query without knowing the syntax of SDMX: it just needs to understand that there are dimensions, each of which has one or more query values, and the time range queried. The organisation then needs to write the SQL code that will make the query on its database.

Data writer - this makes it easy to write the data returned as a result of the SQL query without knowing the syntax of SDMX: it just needs to understand that there are series keys, observations, time periods and attributes. There are many implementations of the data writer in SdmxSource, so a variety of output formats are available, and it is very easy to write an output for a different format: it does not have to be an SDMX format (indeed a CSV format is already available, and formats for packages such as R and RDF (using the Data Cube Vocabulary) are under development).
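SdmxSource itself is a Java/.NET library, so the following Python sketch is only a conceptual analogue of the division of labour just described, not its API: the query reader hands the organisation plain dimension filters and a time range, the organisation supplies the SQL, and a data writer turns the resulting rows back into series and observations without the caller touching SDMX syntax. Table and column names are hypothetical:

    # Conceptual sketch only; not the SdmxSource API.

    def build_sql(dimension_filters: dict[str, list[str]], start: str, end: str):
        """Translate the parsed query (dimension -> allowed values, time range) into SQL."""
        clauses, params = [], []
        for dim, values in dimension_filters.items():
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{dim} IN ({placeholders})")
            params.extend(values)
        clauses.append("TIME_PERIOD BETWEEN ? AND ?")
        params.extend([start, end])
        sql = "SELECT * FROM indicator_data WHERE " + " AND ".join(clauses)
        return sql, params

    class CsvDataWriter:
        """Stand-in for one of the many writer implementations (SDMX, CSV, ...)."""
        def __init__(self, out):
            self.out = out
        def write_series(self, series_key: dict, observations: list):
            # observations: list of (time_period, value) pairs for this series
            key = ".".join(series_key.values())
            for period, value in observations:
                self.out.write(f"{key},{period},{value}\n")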

Benefits to the Reporting Agency

- the ability to report data in a standardised format with minimal effort and no need to learn SDMX
- when coupled with the Fusion Registry, the Excel plugin can be used to query and visualise the data

Data Upload to a Central Location

Data collection can be streamlined by providing a data upload facility to reporting organisations. Such facilities are easy to build using the SdmxSource and are simple to use. Note that such files can be zipped automatically to minimise size in the transfer. Validation and visualisation components can be integrated easily into such a collection system.
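A hypothetical client-side counterpart of such an upload facility is sketched below: the SDMX data file is zipped (to minimise transfer size, as noted above) and posted to a collection endpoint over HTTPS with a token for authentication. The URL, the token header and the server behaviour are assumptions; none of this is part of SdmxSource:

    # Hypothetical upload client; endpoint, authentication scheme and server
    # behaviour are assumptions made for illustration.
    import urllib.request
    import zipfile

    def upload_dataset(sdmx_file: str, url: str, token: str) -> int:
        zip_path = sdmx_file + ".zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(sdmx_file)                       # zipped to minimise transfer size
        with open(zip_path, "rb") as f:
            req = urllib.request.Request(
                url, data=f.read(), method="POST",
                headers={"Content-Type": "application/zip",
                         "Authorization": f"Bearer {token}"})
            with urllib.request.urlopen(req) as resp:
                return resp.status                    # e.g. 200 on success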





APPENDIX III - Data Loader






Scenario

The Data Loader is an application for loading data into a repository that interacts with the SDMX-RI for dissemination through SDMX web services.


The data flow in the Data Loader application

[Figure: data flow in the Data Loader application]

As shown in the picture above, the Loader application creates and populates a dissemination database interacting with the Mapping Store database, from which it retrieves the structural metadata information necessary to upload the data.

Through this interaction the user loads the DSDs and dataflows that are needed into the Loader database. This choice, although it creates a duplication of structural metadata (the Structural Tables in the figure), is due to two different reasons:
- better performance during the loading of data,
- better completeness of the information stored in the Loader database.

After the loading, the user can create the data tables (the Data Tables in the figure). These data tables depend on the DSD to which they refer.

The next step for the user is to load a data file. A recommendation is that the data in the file must be described using the structural metadata previously loaded. The data file to be loaded can have the following formats: SDMX (compact, generic and cross-sectional), fixed-length record (FLR), GESMES/TS and character-separated values (CSV).

The loading of a data file in fixed-length record, GESMES/TS or character-separated values format requires an internal operation that tells the application the exact way in which the data should be read (a mapping). This mapping is not necessary for SDMX data files. After the loading of the data it is still necessary (in this version) to perform a mapping of the data using the "Mapping Assistant" to make the data available through the SDMX-RI web service.

With the same application it is also possible to create static data files in SDMX format and related RSS files for the organizations that do not want to use a web service for the dissemination of data.

The application also provides the possibility to define different users with different levels of security.