[OSM-dev] Querying for non-native characters in name field
Roland Olbricht
roland.olbricht at gmx.de
Tue Jan 31 16:39:43 UTC 2017
> I want to be able to do an overpass query for Iceland where name= field
> contains non-Icelandic characters. These could be for example Chinese,
> Cyrillic or even other European characters (such as âà for example). I'm
> guessing it could be difficult for the latin characters but hopeful it
> would be easier for non-latin alphabets.
>
> Is there a magic formula for achieving this?
I suggest, as a refinement of Ilya's query, this one:
http://overpass-turbo.eu/s/lCk
As it may help for other languages, I will explain how I got there:
1. Start with
area["name:en"="Iceland"];
node(area)[name];
out count;
This basically selects all nodes in Iceland that have a name. The
important part is the "out count": it ensures that you are not flooded
with results. For the same reason it is enough to start with nodes: we
do not want a final result yet, we want to build a sensitive search
term. For this reason, we will even get down to just a subset of all
nodes in a second.
2. Clamp down to
area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z]"];
out count;
These are all nodes whose name contains at least one character that is
not an unaccented Latin letter (this includes spaces and digits). There
are still many of these. Therefore:
3. Get examples with
area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z]"];
out 100;
This prints a sample of 100 results (in fact: the 100 matches with the
lowest node ids). Now we can look at the name fields and get an idea of
what we would like to exclude in addition.
4. Start to narrow down with
area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z0-9 ]"];
out 100;
Spaces and digits are OK even before we start to accept all the special
characters from Icelandic.
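A possible next iteration, as a sketch only: add the accented and
special letters of the Icelandic alphabet to the excluded set. The
letter list below is my own addition, not taken from the linked query,
and it assumes the server's regular expression engine treats each of
these UTF-8 letters as a single character. Punctuation that shows up in
the sample would be added in the same way.

```
area["name:en"="Iceland"];
node(area)[name~"[^a-zA-Z0-9 áéíóúýþæöðÁÉÍÓÚÝÞÆÖÐ]"];
out 100;
```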
This process is now repeated until the sample contains no more false
positives. Finally, we expand this to all three types of OSM elements,
in the expectation that not many false positives will appear.
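The expansion to all three element types could look like the following
sketch. The set name .iceland is just an illustrative choice, and the
character class from step 4 stands in as a placeholder for whatever
class the iteration ended with. The area is kept in a named set because
each statement inside the union overwrites the default set, so the
later statements would no longer find the area there.

```
area["name:en"="Iceland"]->.iceland;
(
  node(area.iceland)[name~"[^a-zA-Z0-9 ]"];
  way(area.iceland)[name~"[^a-zA-Z0-9 ]"];
  rel(area.iceland)[name~"[^a-zA-Z0-9 ]"];
);
out count;
```

Once the count looks reasonable, "out count" can be replaced by a
regular output statement.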
Cheers,
Roland