Date Formats

From Data Quality Initiative

Description

There are many valid ways of writing the same date.

Datestring      Format          ISO 8601
21.05.1905      DD.MM.YYYY      1905-05-21
24.VII.1890                     1890-07-24
25.Aug.1794     DD.mmm.YYYY     1794-08-25
25 Aug 1808     DD mmm YYYY     1808-08-25
2. April 1993                   1993-04-02
Oct. 31, 1967   mmm. DD, YYYY   1967-10-31

For database use, a standardized, unambiguous, machine-readable date format is preferable. One established format is the ISO 8601 standard, which writes a calendar date as YYYY-MM-DD, for example 2013-12-19. N.b.: ISO 8601 specifies several ways of writing dates. The week notation 2013-W51 (51st week) and the ordinal date 2013-353 (353rd day of the year) also conform to ISO 8601. Date parsing is the process of extracting the date, i.e. year, month and day, from strings such as those listed above.
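As an illustration of the three ISO 8601 notations mentioned above, the following minimal Python sketch formats the example date with the standard library only. The variable names are chosen for this example; the week notation is written out in full, including the weekday digit.

```python
from datetime import date

d = date(2013, 12, 19)

# Calendar date: YYYY-MM-DD
calendar_date = d.isoformat()

# Week date: ISO year, ISO week number, ISO weekday (1 = Monday)
iso_year, iso_week, iso_weekday = d.isocalendar()
week_date = f"{iso_year}-W{iso_week:02d}-{iso_weekday}"

# Ordinal date: year and day of the year
ordinal_date = d.strftime("%Y-%j")

print(calendar_date)  # 2013-12-19
print(week_date)      # 2013-W51-4
print(ordinal_date)   # 2013-353
```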

Potential problems

All examples mentioned here are dates that were actually encountered.

Month names in different languages

Month names are often spelled out or abbreviated, in different languages. In that case it is possible, albeit difficult, to extract an unambiguous date. Some (common) examples: 3. Oktober 1848, 20. Oct. 1902, 24 Jun 1799, Jan. 21, 1966. In some cases it is not feasible to extract a date automatically, for example when a regional variant of a month name is used: 3. Jänner 1904 (Jänner is Austrian German for January). Spelling mistakes may also occur.
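One way to handle month names is a lookup table that maps every known spelling to a month number. The sketch below is only an assumption about how such a table could look: the table MONTHS, the function parse_named_month and the selection of English and German names are hypothetical and far from complete. A name that is missing from the table, such as Jänner, is not parsed.

```python
import re
from datetime import date

# Hypothetical, incomplete table of English and German month names.
_NAMES = [
    ("january", "januar", "jan"),
    ("february", "februar", "feb"),
    ("march", "märz", "mar"),
    ("april", "apr"),
    ("may", "mai"),
    ("june", "juni", "jun"),
    ("july", "juli", "jul"),
    ("august", "aug"),
    ("september", "sept", "sep"),
    ("october", "oktober", "oct", "okt"),
    ("november", "nov"),
    ("december", "dezember", "dec", "dez"),
]
MONTHS = {name: number
          for number, names in enumerate(_NAMES, start=1)
          for name in names}

def parse_named_month(text):
    """Parse 'DD month YYYY' or 'month DD, YYYY'; return None otherwise."""
    # Split into runs of letters and runs of digits, dropping punctuation.
    tokens = re.findall(r"[^\W\d_]+|\d+", text.lower())
    words = [t for t in tokens if not t.isdigit()]
    numbers = [int(t) for t in tokens if t.isdigit()]
    if len(words) != 1 or len(numbers) != 2 or words[0] not in MONTHS:
        return None
    day, year = numbers  # the day precedes the year in both layouts
    try:
        return date(year, MONTHS[words[0]], day)
    except ValueError:  # e.g. day 31 in a 30-day month
        return None

for s in ["3. Oktober 1848", "20. Oct. 1902", "24 Jun 1799",
          "Jan. 21, 1966", "3. Jänner 1904"]:
    print(s, "->", parse_named_month(s))
```

Adding "jänner" to the first tuple would make the last example parse as well; spelling mistakes would need fuzzy matching, which this sketch does not attempt.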

Incomplete dates

Some entries do not contain a complete date, but only the year and month, or the year only.

  • Mar. 1874
  • 00.10.1926
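Incomplete dates can still be stored in ISO 8601, which allows reduced precision (YYYY-MM or YYYY). The following sketch covers only the two layouts shown above. It assumes that 00 stands for an unknown day or month, which the source does not state explicitly; the function name and the English-only abbreviation table are hypothetical.

```python
import re

# Hypothetical, English-only abbreviation table; extend as needed.
MONTH_ABBREV = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

def to_reduced_precision(text):
    """Return YYYY, YYYY-MM or YYYY-MM-DD, or None if nothing matches."""
    text = text.strip()

    # Layout DD.MM.YYYY, where 00 is assumed to mean "unknown".
    numeric = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4})", text)
    if numeric:
        day, month, year = (int(g) for g in numeric.groups())
        if month > 12 or day > 31:  # rough range check only
            return None
        if month == 0:
            return f"{year:04d}"
        if day == 0:
            return f"{year:04d}-{month:02d}"
        return f"{year:04d}-{month:02d}-{day:02d}"

    # Layout "mmm. YYYY": month name or abbreviation, then the year.
    named = re.fullmatch(r"([A-Za-z]{3})[a-z]*\.?\s+(\d{4})", text)
    if named:
        month = MONTH_ABBREV.get(named.group(1).lower())
        if month is not None:
            return f"{named.group(2)}-{month:02d}"
    return None

print(to_reduced_precision("Mar. 1874"))   # 1874-03
print(to_reduced_precision("00.10.1926"))  # 1926-10
```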

Vague dates

In some cases the exact date is not known, but the date range is narrowed in a way that makes sense to humans (and humans only). Some examples:

  • Ende Sept. 1956
  • m. Aug 1854
  • Herbst 1855
  • Aestate 1836
  • 2. Mayhälfte 1934

It is not feasible to extract this information in full. It is possible, however, to extract parts of it, such as the year in the cases mentioned above.
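Extracting only the year from such strings can be sketched with a regular expression. The function name extract_year is hypothetical, and the rule of returning nothing when more than one four-digit number is present is a design assumption made here to avoid guessing.

```python
import re

# A run of exactly four digits, not part of a longer number.
FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")

def extract_year(text):
    """Return the year if exactly one four-digit number occurs, else None."""
    candidates = FOUR_DIGITS.findall(text)
    if len(candidates) != 1:
        return None
    return int(candidates[0])

for s in ["Ende Sept. 1956", "m. Aug 1854", "Herbst 1855",
          "Aestate 1836", "2. Mayhälfte 1934"]:
    print(s, "->", extract_year(s))
```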

Date periods

  • May 7-8/1933
  • March-July, 1933
  • zwischen 1814-1830

Missing century

A date formatted as 15.06.74 only makes sense if you have additional information about the source. It is hard to extract the century automatically, and it should not have to be guessed. In other cases, the century has already been added manually by a human and put in brackets, as in 15.06.[19]74. That information should be used.
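The rule described above (use a bracketed century, never guess a missing one) could be sketched as follows. The pattern and the function name are assumptions for illustration and cover only the DD.MM.YY layout.

```python
import re
from datetime import date

# DD.MM.YY with an optional bracketed century, e.g. 15.06.[19]74
TWO_DIGIT_YEAR = re.compile(r"(\d{1,2})\.(\d{1,2})\.(?:\[(\d{2})\])?(\d{2})")

def parse_two_digit_year(text):
    """Return a date only if the century is given in brackets."""
    match = TWO_DIGIT_YEAR.fullmatch(text.strip())
    if match is None:
        return None
    day, month, century, year = match.groups()
    if century is None:
        return None  # century missing: do not guess
    try:
        return date(int(century) * 100 + int(year), int(month), int(day))
    except ValueError:  # impossible day or month
        return None

print(parse_two_digit_year("15.06.[19]74"))  # 1974-06-15
print(parse_two_digit_year("15.06.74"))      # None
```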

OCR mistakes

Sometimes OCR mistakes handwritten letters for other characters. Common mistakes: 3o.01.1907 (the letter o instead of the digit 0), 4. !947 (! instead of 1).
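The two mistakes named above can be repaired with context-sensitive substitutions before parsing. This is a minimal sketch with a hypothetical function name; it only replaces a character that stands directly next to a digit, so that month names such as Oct. are left alone. Other confusions would need their own rules.

```python
import re

def fix_ocr_digits(text):
    """Repair 'o' read for '0' and '!' read for '1' next to digits."""
    # Letter o or O directly after or directly before a digit -> 0
    text = re.sub(r"(?<=\d)[oO]|[oO](?=\d)", "0", text)
    # Exclamation mark directly before a digit -> 1
    text = re.sub(r"!(?=\d)", "1", text)
    return text

print(fix_ocr_digits("3o.01.1907"))     # 30.01.1907
print(fix_ocr_digits("4. !947"))        # 4. 1947
print(fix_ocr_digits("20. Oct. 1902"))  # 20. Oct. 1902
```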

Non-date information

Information that is not part of the date has been entered into the date field.

  • unleserlich
  • ex BHU
  • Utrecht 23
  • s.l.
  • Emerich-Rambo
  • rec. 1864
  • eingelegt am:8.8.196
  • 18.631.865

Web services

Possibility of own tool

Parsing dates written in the common formats is not a difficult task. But this task can be made arbitrarily complex by including as many of the cases mentioned above as possible. The Canadensys date parser is open source; instead of calling the web service, the code can be run locally to save time.

Links