There exist many valid ways of referring to a certain date.
|25 Aug 1808||DD mmm YYYY||1808-08-25|
|2. April 1993||D. mmmm YYYY||1993-04-02|
|Oct. 31, 1967||mmm. DD, YYYY||1967-10-31|
For database use, a standardized, unambiguous, machine-readable date format is preferable. One established format is the ISO 8601 standard, which writes a date as YYYY-MM-DD, for example 2013-12-19. N.b.: ISO 8601 specifies several ways of writing dates. 2013-W51 (51st week) and 2013-353 (353rd day) also conform to ISO 8601. Date parsing is the process of extracting the date, i.e. year, month and day, from strings such as those listed above.
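As a small illustration of the three ISO 8601 notations mentioned above, the following Python sketch derives them from a single date using only the standard library. The function name `iso_forms` is made up for this example.

```python
from datetime import date

def iso_forms(d):
    """Return the calendar, week and ordinal ISO 8601 notations for d."""
    iso_year, iso_week, _weekday = d.isocalendar()
    calendar_form = d.isoformat()                                 # YYYY-MM-DD
    week_form = f"{iso_year:04d}-W{iso_week:02d}"                 # YYYY-Www
    ordinal_form = f"{d.year:04d}-{d.timetuple().tm_yday:03d}"    # YYYY-DDD
    return calendar_form, week_form, ordinal_form

print(iso_forms(date(2013, 12, 19)))
# ('2013-12-19', '2013-W51', '2013-353')
```

Note that the week notation uses the ISO week-numbering year, which can differ from the calendar year in the first and last days of a year.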
All examples mentioned here are dates that were actually encountered.
Month names in different languages
Month names are often spelled out or abbreviated, and they appear in different languages. In that case it is possible, albeit difficult, to extract an unambiguous date. Some (common) examples: 3. Oktober 1848, 20. Oct. 1902, 24 Jun 1799, Jan. 21, 1966. In some cases it is not feasible to extract a date, for example 3. Jänner 1904. Spelling mistakes may also occur.
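One possible way to handle spelled-out and abbreviated month names is a lookup table keyed on the first three letters of the name. The sketch below is only an illustration: the table `MONTH_PREFIXES`, the two patterns and the function `parse_named_month` are assumptions made for this example, and the table covers only a few English and German spellings. Anything it does not recognize (such as 3. Jänner 1904) is returned as `None` rather than guessed.

```python
import re
from datetime import date

# Hypothetical lookup table: first three letters (lower case) -> month number.
# Prefix matching is deliberately crude and would also accept misspellings
# or unrelated words that happen to share a prefix.
MONTH_PREFIXES = {
    "jan": 1, "feb": 2, "mar": 3, "mär": 3, "apr": 4, "may": 5, "mai": 5,
    "jun": 6, "jul": 7, "aug": 8, "sep": 9, "oct": 10, "okt": 10,
    "nov": 11, "dec": 12, "dez": 12,
}

# "3. Oktober 1848", "20. Oct. 1902", "24 Jun 1799"
DAY_FIRST = re.compile(r"^(\d{1,2})\.?\s+([^\W\d_]+)\.?\s+(\d{4})$")
# "Jan. 21, 1966"
MONTH_FIRST = re.compile(r"^([^\W\d_]+)\.?\s+(\d{1,2}),?\s+(\d{4})$")

def parse_named_month(text):
    """Return an ISO 8601 date string, or None if the text is not understood."""
    text = text.strip()
    m = DAY_FIRST.match(text)
    if m:
        day, name, year = m.groups()
    else:
        m = MONTH_FIRST.match(text)
        if not m:
            return None
        name, day, year = m.groups()
    month = MONTH_PREFIXES.get(name[:3].lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:  # e.g. day 31 in a 30-day month
        return None

for s in ["3. Oktober 1848", "20. Oct. 1902", "24 Jun 1799", "Jan. 21, 1966"]:
    print(s, "->", parse_named_month(s))
```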
Some entries do not contain a complete date, but only the year and month, or only the year.
- Mar. 1874
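ISO 8601 also allows reduced precision (YYYY-MM or YYYY), so an incomplete date does not have to be padded with an invented day or month. The following sketch shows one way to format such values. The function name `to_iso_partial` is hypothetical, and it assumes that year, month and day have already been extracted as integers.

```python
from datetime import date

def to_iso_partial(year, month=None, day=None):
    """Format a possibly incomplete date without inventing missing parts."""
    if month is None:
        if day is not None:
            raise ValueError("a day without a month is not representable")
        return f"{year:04d}"
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if day is None:
        return f"{year:04d}-{month:02d}"
    return date(year, month, day).isoformat()  # validates the day as well

print(to_iso_partial(1874, 3))   # 1874-03
print(to_iso_partial(1874))      # 1874
```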
In some cases the exact date is not known, but the date range is narrowed down in a way that makes sense to humans (and to humans only). Some examples:
- Ende Sept. 1956
- m. Aug 1854
- Herbst 1855
- Aestate 1836
- 2. Mayhälfte 1934
It is not feasible to extract this information in full. It is, however, possible to extract parts of it, such as the year in the cases mentioned above.
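Extracting only the year could look like the sketch below. The name `extract_year` and the accepted range of 1500 to 2099 are assumptions for this example. If the text contains no plausible year, or more than one distinct year, the function returns `None` instead of picking one.

```python
import re

# Assumption: plausible years lie between 1500 and 2099.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")

def extract_year(text):
    """Return the single four-digit year found in text, else None."""
    years = set(YEAR.findall(text))
    return int(years.pop()) if len(years) == 1 else None

for s in ["Ende Sept. 1956", "Herbst 1855", "2. Mayhälfte 1934"]:
    print(s, "->", extract_year(s))
```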
- May 7-8/1933
- March-July, 1933
- zwischen 1814-1830
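Ranges like these do not fit into a single date value. As a rough sketch, the simplest case (a range of two four-digit years) could be captured as a pair of integers. The name `parse_year_range` is hypothetical, and the day and month ranges in the first two examples would need patterns of their own.

```python
import re

YEAR_RANGE = re.compile(r"(?<!\d)(\d{4})\s*-\s*(\d{4})(?!\d)")

def parse_year_range(text):
    """Return (start_year, end_year) for texts like '1814-1830', else None."""
    m = YEAR_RANGE.search(text)
    if not m:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    # A range that runs backwards is more likely a mistake than a range.
    return (start, end) if start <= end else None

print(parse_year_range("zwischen 1814-1830"))  # (1814, 1830)
print(parse_year_range("May 7-8/1933"))        # None
```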
A date formatted as 15.06.74 only makes sense if you have additional information about the source. It is hard to extract this information automatically, and it should not have to be guessed. In other cases, the century has already been added manually by a human and put in brackets like 15.06.74. That information should be used.
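If the century is known from the source, it can be passed in explicitly instead of being guessed. The following is a minimal sketch under two assumptions: the field is written as DD.MM.YY, and the caller supplies the century (for example 1900) from metadata about the collection. The name `expand_short_date` is made up for this example.

```python
import re
from datetime import date

SHORT_DATE = re.compile(r"^(\d{1,2})\.(\d{1,2})\.(\d{2})$")

def expand_short_date(text, century):
    """Expand DD.MM.YY using a century given by the caller, e.g. 1900."""
    if century % 100 != 0:
        raise ValueError("century must be a multiple of 100, e.g. 1800")
    m = SHORT_DATE.match(text.strip())
    if not m:
        return None
    day, month, yy = (int(g) for g in m.groups())
    try:
        return date(century + yy, month, day).isoformat()
    except ValueError:  # impossible day or month
        return None

print(expand_short_date("15.06.74", century=1900))  # 1974-06-15
print(expand_short_date("15.06.74", century=1800))  # 1874-06-15
```

Making `century` a required argument keeps the decision with whoever knows the source, which is the point made above.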
Sometimes OCR mistakes handwritten letters for other characters. Common mistakes: 3o.01.1907 (the letter o instead of the digit 0), 4. !947 (! instead of 1).
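A cautious way to repair such confusions is to replace a look-alike character only when it stands directly next to a digit, so that month names such as Oct. stay intact. The sketch below illustrates the idea. The function name `fix_ocr_digits` and the chosen set of look-alikes (o, O for 0 and l, I, ! for 1) are assumptions, and the rule can still misfire, for example on an exclamation mark that directly follows a number.

```python
import re

# A look-alike counts as a digit only if a digit is directly before or after it.
OCR_ZERO = re.compile(r"(?<=\d)[oO]|[oO](?=\d)")
OCR_ONE = re.compile(r"(?<=\d)[lI!]|[lI!](?=\d)")

def fix_ocr_digits(text):
    """Replace o/O by 0 and l/I/! by 1 where they touch a digit."""
    text = OCR_ZERO.sub("0", text)
    return OCR_ONE.sub("1", text)

print(fix_ocr_digits("3o.01.1907"))  # 30.01.1907
print(fix_ocr_digits("4. !947"))     # 4. 1947
```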
Information not part of the date has been entered into the date field.
- ex BHU
- Utrecht 23
- rec. 1864
- eingelegt am:8.8.196
Feasibility of building our own tool
Parsing dates written in the common formats is not a difficult task. But this task can be made arbitrarily complex by including as many of the cases mentioned above as possible. The Canadensys date parser is open source; instead of using the web service, the code should be run locally to save time.
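To give an idea of how little is needed for the common formats, here is a minimal sketch that tries a short list of `datetime.strptime` patterns in turn and returns `None` for everything else, so that unparsed entries can be reviewed by hand. This is not the Canadensys parser. The list `COMMON_FORMATS` and the function `parse_common` are made up for this example, and `%b` and `%B` match English month names only under an English or C locale.

```python
from datetime import datetime

# Hypothetical list of accepted formats, tried in order.
COMMON_FORMATS = [
    "%Y-%m-%d",     # 1808-08-25
    "%d %b %Y",     # 25 Aug 1808
    "%d. %B %Y",    # 2. April 1993
    "%b. %d, %Y",   # Oct. 31, 1967
    "%d.%m.%Y",     # 30.01.1907
]

def parse_common(text):
    """Return an ISO 8601 date string, or None if no format matches."""
    text = text.strip()
    for fmt in COMMON_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

for s in ["25 Aug 1808", "2. April 1993", "Oct. 31, 1967", "ex BHU"]:
    print(s, "->", parse_common(s))
```

Each additional case from the sections above (other languages, partial dates, ranges, OCR errors) would add another branch, which is where the complexity grows.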