Data Quality Initiative:About

The Data Quality Initiative (DQI) was started at the Botanic Garden and Botanical Museum Berlin-Dahlem (BGBM) by the projects BioVeL, OpenUP!, BiNHum and reBiND. All of these projects have to deal with data quality, so the goal was to join efforts and avoid duplicate work by coordinating the data quality approaches of those projects.

The Data Quality Initiative was presented at TDWG 2013 in Florence in the symposium "Biodiversity Data Quality – issues, methods and tools". The slides of this talk are available on the TDWG website.

Goals of the Data Quality Initiative

  • to share knowledge about data quality issues
  • to develop better tools to detect and fix data quality issues
  • to make the knowledge and tools publicly available, e.g. by publishing them in this wiki under an open source license


Data Quality Tools

A key focus of this wiki is to document newly developed and existing data quality tools.

In this context, a Data Quality Tool is any kind of software or data set that helps improve data quality by detecting and/or fixing issues within research data.

The target audience of this wiki is software developers who deal with data quality in the software products they are developing. It is therefore important that all tools documented here are suitable for automated processing. Tools that can only be used through their own graphical user interface (GUI), and web-based tools that only accept input through HTML forms and return the data in a result HTML page, are therefore not suited for this wiki.

For a complete list of documented data quality tools in this wiki, take a look at the list of tools.

Within this wiki, tools are categorized as Data Set, Library, or Web Service.

Data Sets

In this context, a Data Set is not the scientific data set that is to be corrected, but a data set that can be used for data correction. An example would be a data set of country names in various languages, used to detect which country a given string might refer to, even if the language is unknown or unsupported (see Country Name Data Set).

Data sets are usually provided in a specific format (e.g. XML, JSON, tab-separated values (TSV), or comma-separated values (CSV)) and are independent of the programming language of the software in which they are used.
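The country name example above can be sketched in a few lines. This is only an illustration: the column layout (iso_code, language, name), the sample rows, and the function names load_country_index and detect_country are assumptions made for this sketch and do not describe the actual Country Name Data Set.

```python
import csv
import io

# Hypothetical excerpt of a country name data set in TSV form.
# Assumed columns: ISO 3166-1 alpha-2 code, language code, country name.
SAMPLE_TSV = (
    "iso_code\tlanguage\tname\n"
    "DE\ten\tGermany\n"
    "DE\tde\tDeutschland\n"
    "DE\tfr\tAllemagne\n"
    "IT\ten\tItaly\n"
    "IT\tit\tItalia\n"
    "IT\tde\tItalien\n"
)


def load_country_index(tsv_text):
    """Build a lookup from case-folded country name to ISO code."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    index = {}
    for row in reader:
        index[row["name"].strip().casefold()] = row["iso_code"]
    return index


def detect_country(value, index):
    """Return the ISO code for a name in any listed language, or None."""
    return index.get(value.strip().casefold())


index = load_country_index(SAMPLE_TSV)
print(detect_country("Deutschland", index))  # DE
print(detect_country(" italien ", index))    # IT
print(detect_country("Atlantis", index))     # None
```

Because the lookup ignores the language column, the input string does not need to come with a language tag. The same TSV file could be read equally well from Java or any other language, which is the sense in which a data set is language independent.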

Library

A library is a piece of software that other software can use to access certain functionality.

An example of a library would be a bundled JAR file for Java programs.

A library is usually dependent on the programming language in which it will be used.

Web Service

A web service offers certain functionality via defined protocols (e.g. REST or SOAP).

Web services are independent of the programming language of the software that calls them.
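As a rough sketch of what automated use of a REST-style web service could look like from client code: the endpoint URL, the name query parameter, and the JSON response fields below are all hypothetical and do not refer to any service documented in this wiki. The sketch only builds the request URL and parses a sample response, so it runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://example.org/dq/api/country"

# Hypothetical JSON body that such a service might return.
SAMPLE_RESPONSE = '{"query": "Deutschland", "iso_code": "DE"}'


def build_request_url(name):
    """Encode the country name as a query parameter of the request URL."""
    return BASE_URL + "?" + urlencode({"name": name})


def parse_response(body):
    """Extract the ISO code from a JSON response body, or None if absent."""
    payload = json.loads(body)
    return payload.get("iso_code")


print(build_request_url("Deutschland"))
print(parse_response(SAMPLE_RESPONSE))  # DE
```

In a real client the URL would be fetched, for example with urllib.request.urlopen, and the returned body passed to parse_response. Since only HTTP and JSON are involved, the same exchange could be implemented in any programming language.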


Data Quality Efforts by other institutions

Many other institutions also deal with data quality aspects.