Country Name Parser

From Data Quality Wiki
Country Name Parser
Type: Library
Version: 0.8.2 (2013-10-27)
Author: David Fichtmueller (BGBM)
Status: Beta
Programming Language: Java
License: Mozilla Public License 2.0 (MPL-2)
Download Library
Download Source
Related Tools:

Description

The Country Name Parser is a library that parses country names in various languages and returns the corresponding ISO 3166-1 alpha-2 country code (e.g. DE for Germany).

The data set of country names is generated from the Unicode Common Locale Data Repository (CLDR). Only the relevant parts of the CLDR were condensed into the Country Name Data Set. The CLDR and the Country Name Data Set are both copyrighted by Unicode Inc. (Copyright © 1991-2013 Unicode, Inc. All rights reserved. See Copyright Notice for details.)

The Country Name Parser has an index of over 25,000 different country name strings from over 700 different languages, together with their corresponding ISO 3166-1 alpha-2 country codes.

Usage: Simple Matching

The simplest way to use the CountryNameParser is the function getISO3166Code. Its input parameter is the string to be matched. This function only checks for identical (case-sensitive) matches against the country names in the index. For more complex checks with case insensitivity, partial matching and fuzzy search (Levenshtein distance), see the section Usage: Complex Matching.

The first step is to create an instance of the Country Name Parser: CountryNameParser parser = new CountryNameParser(); This step may take a few seconds, since the entire data set of country names is loaded.

Now you can match a name using the function Map<String,List<String>> CountryNameParser.getISO3166Code(String countryName). The function returns a Map with the country code as the key and, as the value, a List of the languages (as abbreviations) in which the provided name matches that code. Ideally only one country code is returned. On a few occasions the same country name string refers to different countries in different languages, but this is so rare that you probably don't have to worry about it (unless you are parsing references to North or South Korea in Central African languages).
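Because the returned Map can occasionally hold more than one country code, calling code may want to make that case explicit. The following is a minimal sketch that works on any Map of the shape described above. The class CountryCodeResults and its method uniqueCode are hypothetical helpers and not part of the library; whether the library returns null or an empty Map for an unknown name is not stated here, so the sketch treats both as "no result".

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the Country Name Parser library.
final class CountryCodeResults {

    private CountryCodeResults() {}

    /**
     * Returns the country code if the result map holds exactly one entry.
     * Returns an empty Optional if the map is null, empty (no match)
     * or holds several codes (ambiguous name).
     */
    static Optional<String> uniqueCode(Map<String, List<String>> results) {
        if (results == null || results.size() != 1) {
            return Optional.empty();
        }
        return Optional.of(results.keySet().iterator().next());
    }
}
```

A caller could pass the result of parser.getISO3166Code(name) straight into CountryCodeResults.uniqueCode and handle the empty case however suits the application, for example by logging the name for manual review.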

Code Example

import java.util.List;
import java.util.Map;

// Loading the parser reads the whole data set, so create it once and reuse it.
CountryNameParser parser = new CountryNameParser();
String[] names = new String[]{"Schweden","United States","Germany"};
for(String name:names){
	System.out.println(name);
	// Key: ISO 3166-1 alpha-2 code, value: languages in which the name matches.
	Map<String,List<String>> results = parser.getISO3166Code(name);
	for(Map.Entry<String,List<String>> entry:results.entrySet()){
		System.out.println("\t"+entry.getKey());
		for(String language:entry.getValue()){
			System.out.println("\t\t"+language);
		}
	}
}


Output

Schweden
	SE
		de
United States
	US
		en
		om
Germany
	DE
		en
		fil
		luo
		nd
		om
		sn

If you don't need to know which country code a name refers to, but only whether the name exists in the data set, you can use the function boolean CountryNameParser.containsName(String countryName).

CountryNameParser parser = new CountryNameParser();
String name1 = "Italy";
String name2 = "Shire";
boolean result1 = parser.containsName(name1);
boolean result2 = parser.containsName(name2);
System.out.println(result1);
System.out.println(result2);

The above code prints:

true
false

Usage: Complex Matching

  • TODO: matchISO3166Code(String)
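Until this section is written, the case-insensitive and partial matching ideas mentioned under Simple Matching can be illustrated on a tiny hand-made index. This is only a sketch of the general technique: the class LenientNameIndex and its methods are hypothetical, the two entries are sample data, and nothing here describes how matchISO3166Code actually behaves.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustration only: a tiny hand-made index, not the library's data set or API.
final class LenientNameIndex {

    // Lower-cased country name -> ISO 3166-1 alpha-2 code, in insertion order.
    private final Map<String, String> index = new LinkedHashMap<>();

    void add(String name, String code) {
        index.put(name.toLowerCase(Locale.ROOT), code);
    }

    /** Case-insensitive exact match. */
    Optional<String> matchIgnoreCase(String name) {
        return Optional.ofNullable(index.get(name.trim().toLowerCase(Locale.ROOT)));
    }

    /**
     * Partial match: returns the code of the first indexed name that
     * contains the input or is contained in the input.
     */
    Optional<String> matchPartial(String text) {
        String needle = text.trim().toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return Optional.empty();
        }
        for (Map.Entry<String, String> entry : index.entrySet()) {
            String name = entry.getKey();
            if (name.contains(needle) || needle.contains(name)) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

With "Germany" mapped to DE, matchIgnoreCase("GERMANY") and matchPartial("Federal Republic of Germany") both resolve to DE in this sketch. Returning the first hit keeps the example short; a real implementation would probably collect all candidates.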

Usage: Convert Country Codes

Though the Country Name Parser only returns ISO 3166-1 alpha-2 codes, you can use it to convert a code to the alpha-3 or numeric code, as well as to the common name of the country.

  • TODO: convertCode(String, ISO3166Type)
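While convertCode is still undocumented, the kind of conversion meant here can be shown with the JDK alone. The sketch below uses only java.util.Locale, not the library; the class IsoCodeConversion and its method names are made up for this example. Locale does not expose the ISO 3166-1 numeric code, so only the alpha-3 code and the display name are covered.

```java
import java.util.Locale;

// Uses only the JDK's java.util.Locale; this is not the library's convertCode method.
final class IsoCodeConversion {

    private IsoCodeConversion() {}

    /** Builds a Locale that carries only a region (country) code. */
    private static Locale regionOnly(String alpha2) {
        return new Locale.Builder().setRegion(alpha2).build();
    }

    /**
     * Converts an ISO 3166-1 alpha-2 code to its alpha-3 code.
     * Throws MissingResourceException if the JDK has no alpha-3 code for the input.
     */
    static String toAlpha3(String alpha2) {
        return regionOnly(alpha2).getISO3Country();
    }

    /** Returns the English display name of the country for an alpha-2 code. */
    static String toEnglishName(String alpha2) {
        return regionOnly(alpha2).getDisplayCountry(Locale.ENGLISH);
    }
}
```

For example, IsoCodeConversion.toAlpha3("DE") gives "DEU" and IsoCodeConversion.toEnglishName("DE") gives "Germany".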

Usage: Customization

Loading custom country name mappings

  • TODO: loadNames(File)
  • TODO: loadNames(URL)
  • TODO: restNames()

Adjust Levenshtein threshold

  • TODO: getLevenshteinThreashold()
  • TODO: setLevenshteinThreashold(double)
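How the double threshold is interpreted is not documented yet. One plausible reading, assumed here purely for illustration, is a normalized score: the Levenshtein distance divided by the length of the longer string must not exceed the threshold. The class FuzzyNameMatch and both of its methods are hypothetical, and the library's actual scoring may differ.

```java
// Illustration only; the library's actual scoring and threshold semantics may differ.
final class FuzzyNameMatch {

    private FuzzyNameMatch() {}

    /** Levenshtein distance via dynamic programming, keeping only two rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed semantics: distance / length of the longer string <= threshold. */
    static boolean matches(String input, String candidate, double threshold) {
        int longer = Math.max(input.length(), candidate.length());
        if (longer == 0) {
            return true; // two empty strings are identical
        }
        return (double) distance(input, candidate) / longer <= threshold;
    }
}
```

Under this assumption, the misspelling "Germny" is one edit away from "Germany" (a score of 1/7), so it would match with a threshold of 0.2, while "Shire" against "Italy" would not.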

Disable Levenshtein matching

  • TODO: isCheckLevenShtein()
  • TODO: setCheckLevenShtein(boolean)

Display Version Info

  • TODO: getVersion()


Known Issues

  • Results from Levenshtein matching are not ordered by their Levenshtein score.
  • Error handling and signaling need to be improved (currently mostly done via exception.printStackTrace()).