Prepare data for OpenSpending
OpenSpending loads and stores data in common tabular data formats such as CSV and Excel. Thanks to the flexible modeling scheme of Fiscal Data Package, the system can work with a range of data structures in these files, but a minimum set of quality requirements must still be met.
Minimum quality requirements
Essentially, the minimum quality requirements are as follows:
- The file must have headers on the first row.
- There must not be any blank rows.
- There must not be any mismatch between the length of a row and the length of the headers.
- Each column must have a consistent "data type" (date columns should contain dates, amount columns should contain numbers without currency signs or names).
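The four requirements above can be approximated with a few lines of standard-library Python. This is a minimal sketch rather than a replacement for GoodTables: the function name `check_structure`, the sample column names, and the assumption that dates are written as YYYY-MM-DD are all illustrative choices, not part of OpenSpending.

```python
import csv
import io
from datetime import datetime


def check_structure(text, date_columns=(), amount_columns=()):
    """Return a list of problems found in CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    headers = rows[0]
    # Requirement 1: the first row must contain headers.
    if any(not header.strip() for header in headers):
        problems.append("row 1: blank header cell")

    # Only check typed columns that actually exist in the header row.
    date_columns = [name for name in date_columns if name in headers]
    amount_columns = [name for name in amount_columns if name in headers]

    for number, row in enumerate(rows[1:], start=2):
        # Requirement 2: no blank rows.
        if not any(cell.strip() for cell in row):
            problems.append(f"row {number}: blank row")
            continue
        # Requirement 3: row length must match header length.
        if len(row) != len(headers):
            problems.append(
                f"row {number}: {len(row)} cells, expected {len(headers)}"
            )
            continue
        record = dict(zip(headers, row))
        # Requirement 4: consistent data types per column.
        for name in date_columns:
            try:
                # Assumption: dates use the YYYY-MM-DD format.
                datetime.strptime(record[name], "%Y-%m-%d")
            except ValueError:
                problems.append(f"row {number}: {name!r} is not a YYYY-MM-DD date")
        for name in amount_columns:
            try:
                float(record[name])  # rejects currency signs such as "$300"
            except ValueError:
                problems.append(f"row {number}: {name!r} is not a plain number")
    return problems


SAMPLE = (
    "date,payee,amount\n"
    "2016-01-15,Acme Ltd,1200.50\n"
    "\n"
    "2016-02-01,Globex,$300\n"
    "2016-03-01,Initech\n"
)

if __name__ == "__main__":
    for problem in check_structure(SAMPLE, ["date"], ["amount"]):
        print(problem)
```

Run against the sample data, the sketch reports the blank row, the amount with a currency sign, and the short row, which are the same kinds of issues the requirements are meant to rule out.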
Ensuring quality
Files added to OpenSpending need to meet a certain quality level in both the structure of the file and its schema.
If you use the OpenSpending Packager to upload files via a UI, or the OpenSpending CLI to upload files via the command line, the data sources will be checked using the GoodTables data validator.
With these tools, you'll not only be told whether the data sources are valid, you'll also get hints on how to address the issues in any invalid files.
If you have a custom data processing pipeline, or would simply like to validate your files without using the Packager or the CLI, you can do so by using GoodTables directly in your own setup.
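As one possible way to wire GoodTables into a custom pipeline, the sketch below shells out to the goodtables command line tool from Python. It is only an illustration under stated assumptions: the helper names `build_command` and `run_structure_check` are hypothetical, the `structure` subcommand is the one used in the CLI walkthrough on this page and may differ in other goodtables versions, and the meaning of the exit code is deliberately left uninterpreted.

```python
import shutil
import subprocess


def build_command(path):
    """Build the goodtables invocation for one file (hypothetical helper)."""
    # Assumption: the "structure" subcommand from the CLI walkthrough is
    # available in the installed goodtables version.
    return ["goodtables", "structure", path]


def run_structure_check(path):
    """Return (exit_code, output), or None if goodtables is not on PATH."""
    if shutil.which("goodtables") is None:
        return None
    result = subprocess.run(
        build_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # The output is returned as-is; inspect it (or the exit code) according
    # to the behavior of the goodtables version you have installed.
    return result.returncode, result.stdout


if __name__ == "__main__":
    outcome = run_structure_check("budget.csv")  # hypothetical file name
    if outcome is None:
        print("goodtables is not installed; try: pip install goodtables")
    else:
        print(outcome[1])
```

Keeping the command construction in its own function makes it easy to adjust if your installed version expects different arguments.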
Why is this important?
OpenSpending uses a flat file datastore to store the raw data provided by users, with additional information on how to understand that raw data via the Fiscal Data Package descriptor. From the datastore, other databases are derived, to provide the OpenSpending APIs and other related data services. The data quality checks ensure that the ecosystem that reads data out of the datastore can expect the data to be of a reliable quality.
Walkthroughs
Checking data quality with the Packager
- Access OS Packager: https://openspending.org/packager/provide-data.
- Authenticate to OpenSpending by clicking “Login/Register” in the upper-right corner.
- Select a file/data source from your computer or insert a URL.
- If your data source passes the “validation step,” you will get a message saying “Resource is valid. Now you can continue.”
- If the data source contains errors, you will get the following message: “There are some errors. Click here to view details.”
- Click to see the “Data Validation Results.”
- Correct the errors in your data source and try again.
Checking data quality with the CLI
- Install the goodtables library, a Python package that can also be used as a command line tool. It runs on Python 2 or 3:
  pip install goodtables
- Ensure goodtables is installed correctly by typing
  goodtables
  in your shell. You should see something like the following:
  (Screenshot: output of the goodtables command.)
- We want to check the structure of our CSV file. This is done with the following command:
  goodtables structure {PATH_TO_FILE}
  (Screenshots: one check that returned structural errors, and one check that found the file valid.)