[OSM-talk] Looking for "primary language" map
moltonel at gmail.com
Fri Apr 14 13:26:44 UTC 2017
On 11 April 2017 08:26:14 IST, Rory McCann <rory at technomancy.org> wrote:
>You could try to run the "name" tag through a language detection
>algorithm and see what comes out. I think Google released one a few
>years ago: cf. https://github.com/Mimino666/langdetect
>
>Ethnologue (https://www.ethnologue.com/) has some, but I think it
>would cost a lot to licence, and it is probably much more precise than
>you need.
KDE's Sonnet is another library that springs to mind.
Another approach that might be interesting is to look at nearby objects in OSM. Look for objects with a clearly identifiable language (i.e. the name tag has the same value as exactly one of the object's name:xx tags). If 90% of those identify as 'English', for example, then other objects in the same area whose language could not be identified are probably English too.
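A rough Python sketch of that idea, in case it helps. It assumes each object is just a dict of its OSM tags; the function names, the 90% default and the regex that limits name:xx to 2-3 letter language codes (so that keys like name:left are skipped) are my own choices for illustration, not anything standard.

```python
import re
from collections import Counter

# Assumption: only keys like name:en, name:ga or name:zh-Hant count as languages.
LANG_KEY = re.compile(r"name:([a-z]{2,3}(?:-[A-Za-z]+)?)")

def clear_language(tags):
    """Return the language code if `name` equals exactly one name:xx value, else None."""
    name = tags.get("name")
    if not name:
        return None
    matches = []
    for key, value in tags.items():
        m = LANG_KEY.fullmatch(key)
        if m and value == name:
            matches.append(m.group(1))
    # Zero matches or several matches are both treated as "not clearly identifiable".
    return matches[0] if len(matches) == 1 else None

def guess_area_language(objects, threshold=0.9):
    """Return the dominant language among clearly-tagged objects, or None if below threshold."""
    counts = Counter(lang for lang in map(clear_language, objects) if lang)
    total = sum(counts.values())
    if total == 0:
        return None
    lang, n = counts.most_common(1)[0]
    return lang if n / total >= threshold else None
```

Objects whose name matches two or more name:xx tags are deliberately ignored, since they tell you nothing about which language the plain name is in.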
To get decent performance, split the world into tiles and work out the dominant clearly-tagged language for each tile. Use that preprocessed data as your language-guessing "shapefile".
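The tiling step could look something like the sketch below. It assumes you have already extracted (lat, lon, language) samples from clearly-tagged objects; the one-degree grid and all the names here are hypothetical, and a real setup would probably want a proper tiling scheme rather than plain lat/lon squares.

```python
import math
from collections import Counter, defaultdict

def tile_of(lat, lon, size_deg=1.0):
    """Map a coordinate to a (row, col) tile on a simple lat/lon grid."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))

def build_tile_index(samples, size_deg=1.0):
    """Preprocess (lat, lon, lang) samples into {tile: dominant language}."""
    counts = defaultdict(Counter)
    for lat, lon, lang in samples:
        counts[tile_of(lat, lon, size_deg)][lang] += 1
    return {tile: c.most_common(1)[0][0] for tile, c in counts.items()}

def guess_language(index, lat, lon, size_deg=1.0):
    """Look up the dominant language for a coordinate, or None if the tile has no data."""
    return index.get(tile_of(lat, lon, size_deg))
```

Building the index is a one-off pass over the data; after that, each guess is a single dictionary lookup.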
--
Vdp
Sent from a phone.