[OSM-dev] Querying for non-native characters in name field

Tue Jan 31 16:39:43 UTC 2017

> I want to be able to do an overpass query for Iceland where name= field
> contains non-Icelandic characters. These could be for example Chinese,
> Cyrillic or even other European characters (such as âà for example). I'm
> guessing it could be difficult for the latin characters but hopeful it
> would be easier for non-latin alphabets.
>
> Is there a magic formula for achieving this?

I suggest, as a refinement of Ilya's query, this one:
http://overpass-turbo.eu/s/lCk

As it may help with other languages, here is how I got there:

1. Start with

area["name:en"="Iceland"];
node(area)[name];
out count;

This selects basically all nodes in Iceland that have a name. The 
important part is the "out count". This ensures that you are not 
flooded with results. For the same reason it is enough to start with 
nodes: we do not want a final result yet, we want to develop a suitable 
search term. For this reason, we will cut this down to just a subset of 
all nodes in a moment.

2. Clamp down to

area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z]"];
out count;

These are all nodes whose name contains at least one character other 
than an unaccented Latin letter. There are still many of these. 
Therefore:

3. Get examples with

area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z]"];
out 100;

This prints 100 sample results (in fact, the 100 matches with the 
lowest node ids). Now we can look at the name fields and get an idea of 
what else we would like to exclude.

4. Start to narrow down with

area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z0-9 ]"];
out 100;

Spaces and digits are fine in a name, so we allow them even before we 
start to accept all the special characters of Icelandic.
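
To see offline what this negated character class does, here is a small 
Python sketch. It assumes that Python's re module behaves like the 
regular expression engine on the Overpass server for such a simple 
character class, and the sample names are made up for illustration, not 
taken from the database.

```python
import re

# Same character class as in step 4: match any character that is
# not an ASCII letter, a digit, or a space.
pattern = re.compile(r"[^a-zA-Z0-9 ]")

# Made-up sample names (assumption: purely illustrative).
names = ["Bonus", "Route 1", "Reykjavík", "Москва", "Café"]

# A name is flagged as soon as it contains one character outside the class.
flagged = [name for name in names if pattern.search(name)]
print(flagged)
```

Note that "Reykjavík" is still flagged because of the "í". This is the 
kind of false positive that the following refinement steps remove.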

This process is now repeated until the sample contains no more false 
positives. Finally, we expand this to all three types of OSM elements, 
in the expectation that not many false positives will appear.
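
One way to speed up the repetition is to tally which characters in the 
sample are not yet covered by the character class. The following Python 
sketch does that. The helper name offending_characters and the sample 
names are hypothetical, and the set of Icelandic letters is my own 
list, not necessarily the one used in the linked query.

```python
from collections import Counter

def offending_characters(names, allowed):
    """Count every character in the names that is not in the allowed set."""
    counts = Counter()
    for name in names:
        counts.update(ch for ch in name if ch not in allowed)
    return counts

# Characters accepted so far: ASCII letters, digits, and the space.
ascii_part = set("abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "0123456789 ")
# Special letters of the Icelandic alphabet (assumed list).
icelandic = set("áðéíóúýþæöÁÐÉÍÓÚÝÞÆÖ")

# Made-up sample names standing in for the "out 100" output.
names = ["Reykjavík", "Þingvellir", "Kaffi-Bar", "北京"]

counts = offending_characters(names, ascii_part | icelandic)
for char, count in counts.most_common():
    print(repr(char), count)
```

Characters that turn out to be harmless, such as the hyphen here, can 
then be added to the character class, while the remaining ones are the 
matches we are actually looking for.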

Cheers,

Roland