[Talk-in] Automating OSM translation into Indic languages

Srikanth Lakshmanan srik.lak at gmail.com
Mon Apr 6 09:10:59 UTC 2015


Great work, I have been thinking about this for some time. I am of the opinion
that place names (towns, villages, etc.) should be translated and not
transliterated. Arun has a point about locality address as people might be
so used to English, that they find translations in their own language

For place names, would it be a good idea to run a script that looks up
Wikidata, extracts names in multiple languages and updates OSM? Below is a
sample query for 'Bangalore' in multiple languages.
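
A minimal sketch of such a lookup, assuming the Wikidata wbgetentities API
and assuming Q1355 is the item ID for Bangalore (worth verifying before use).
The function names are illustrative only, and nothing here writes to OSM:

```python
import json
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_labels_url(item_id, languages):
    """Build a wbgetentities URL asking only for labels in the given languages."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "labels",
        "languages": "|".join(languages),
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def labels_to_osm_tags(response_text, item_id):
    """Turn the labels of one entity into OSM-style name:<lang> tags."""
    entity = json.loads(response_text)["entities"][item_id]
    return {
        "name:" + lang: label["value"]
        for lang, label in entity.get("labels", {}).items()
    }


# Assumed item ID for Bangalore; check it on wikidata.org first.
url = build_labels_url("Q1355", ["kn", "ta", "hi", "ml"])
```

The response body can be fetched with urllib.request.urlopen(url) and passed
to labels_to_osm_tags. The resulting tags should still be reviewed by someone
who reads the language before anything is uploaded.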


On Sat, Apr 4, 2015 at 11:54 PM, Aruna S <safincrazy at gmail.com> wrote:

> Hello!
> Long email warning.
> I've been thinking a little bit about automating the translation of maps
> into multiple Indic languages ever since I saw the Kannada map at geoBLR in
> March.
> I started some work on it today, and I have lots of interesting things to
> report. Right now I am mostly transliterating as opposed to translating but
> if a dictionary of common words/tags can be compiled, upgrading the script
> to translate instead of transliterating should be doable.
> Here's the algorithm I followed:
>    1. Get the nodes within a bounding box from OSM using the python
>    wrapper for Overpass - overpy
>    <http://python-overpy.readthedocs.org/en/latest/example.html> - This
>    returns a collection of nodes and associated ID, tags, lat, lon and other
>    attributes. This can also be repeated for ways by using the corresponding
>    overpy query.
>    2. Keep only the nodes that have tags
>    3. From the result of the filter, identify nodes with Indic language
>    tags - e.g. ["name:kn"]
>    4. Transliterate the string value for tag["name:kn"] to another
>    language - I used Tamil - and store it within tag["name:ta"] - I used the Indic
>    transliterator <http://silpa.org.in/Transliteration> APIs from SILPA
>    <http://transliteration.readthedocs.org/en/latest/> for this
>    5. Create a new changeset and upload the result (node with
>    tag["name:ta"]) to OSM using osmapi
>    <http://osmapi.divshot.io/#OsmApi.OsmApi.NodeUpdate>
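
Steps 2 to 4 above can be sketched without any of the libraries, assuming
each node is a plain dict with a "tags" mapping and that the transliterator
is passed in as a function. Both function names are hypothetical. Skipping
nodes that already have the target tag is an extra guard, not part of the
steps as described:

```python
def nodes_missing_target(nodes, source_key="name:kn", target_key="name:ta"):
    """Steps 2-3: keep nodes that have tags, carry the source-language name,
    and do not already have a name in the target language (extra guard)."""
    return [
        node for node in nodes
        if node.get("tags")
        and source_key in node["tags"]
        and target_key not in node["tags"]
    ]


def add_transliterated_names(nodes, transliterate,
                             source_key="name:kn", target_key="name:ta"):
    """Step 4: return copies of the nodes with the target-language tag filled in.

    `transliterate` is any callable taking the source string and returning
    the transliterated string; the real transliterator is not called here.
    """
    updated = []
    for node in nodes_missing_target(nodes, source_key, target_key):
        tags = dict(node["tags"])  # copy, so the input nodes stay untouched
        tags[target_key] = transliterate(tags[source_key])
        updated.append({**node, "tags": tags})
    return updated
```

Keeping the transliterator as a parameter makes it easy to try the same
filter with a different source or target language.
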
> I did it only for one node:
> https://www.openstreetmap.org/edit?node=1118255762#map=19/12.99451/77.55430
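
For a single-node update like this, the payload has to carry the node's
current version. A sketch, assuming the node dict has the shape returned by
osmapi's NodeGet (keys "id", "lat", "lon", "version", "tag"); the helper
name is hypothetical:

```python
def build_node_update(current_node, new_tags):
    """Merge new tags into a node dict while keeping its id, position and version."""
    merged = dict(current_node.get("tag", {}))  # copy the existing tags
    merged.update(new_tags)
    return {
        "id": current_node["id"],
        "lat": current_node["lat"],
        "lon": current_node["lon"],
        # Version as currently stored on the server, not a hand-typed number.
        "version": current_node["version"],
        "tag": merged,
    }
```

The returned dict is what would be handed to NodeUpdate inside an open
changeset. Fetching the node right before the update keeps the version
in sync with the server.
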
> *Advantages*
>    - *Indic to Indic transliterations - ✓*: The Indic transliterator APIs
>    seem to convert quite effortlessly from one Indic language to another.
>    Right now, support is available for Hindi, Tamil, Punjabi, Gujarati,
>    Malayalam, Oriya, Bengali and Kannada. So, if a Kannada tag exists in OSM,
>    the same text can be transliterated into multiple Indic languages using the
>    naive algorithm I described above.
> *Limitations*
>    - *English to Indic transliterations - X*: Though the Indic
>    Transliterator works for English to Indic transliterations as well, it is
>    not very useful. This is because only English words that are in the CMU
>    dictionary can be transliterated - which means that we can't
>    transliterate "Raajaajeenagar", even if we had a custom tag for
>    transliteration on OSM. On emailing the developer
>    <http://thottingal.in/blog/about/> of the transliterator about
>    extending the capabilities of English transliteration, I was told that
>    extending the dictionary by adding additional words is one option. I am not
>    sure how feasible this is, or whether it is any better than
>    translating to one Indic language and transliterating+translating to the
>    rest.
>    - *Translations of English Words - X*: Right now, I am only able to
>    transliterate words, but if a list of common words (I am guessing all the
>    OSM tags, and other common words) could be compiled and translated into
>    all the Indic languages, the translation process could be automated quite
>    easily. This would require the algorithm to have 2 additional steps:
>       1. From an Indic tag (i.e., an already translated tag), we would have to
>       identify portions that are in the translations list, and leave them out of
>       the transliteration process.
>       2. For the word(s) identified in step 1, we must find a translation
>       in the translations list for the language we are translating into. This
>       must then be suffixed or prefixed with the transliterated portion. I am
>       guessing suffix will be the norm, while prefixes might occasionally be
>       necessary.
>    - *Tracking node version numbers - X*: Right now, I am unable to
>    read the version attribute of a node using the overpy API. I entered
>    the version number manually. Not sure if I am missing something. This is
>    just a "need-to-figure-out" issue more than anything. It is very
>    important for automatically uploading a node to the server because if
>    there's a mismatch between the version number being passed to the API and
>    the version number on the server, the API will reject the update.
>    - *Which Indic language to begin transliterating from*: Issues might
>    arise if a language like Tamil - where the letter for ka, kha, ga, gha etc.
>    is the same - is, say, used as the source for transliterating to Hindi. But
>    if we start from a language like Kannada or Hindi, this issue can probably
>    be resolved easily.
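
The two extra translation steps described above could look roughly like
this. It is only a sketch: it assumes names can be split on whitespace
(compound names written as a single word would need a smarter splitter),
and the shape of the translations list and the function name are
assumptions for illustration:

```python
def convert_name(name, source_lang, target_lang, translations, transliterate):
    """Translate words found in the translations list; transliterate the rest.

    `translations` maps a concept key to its word in each language, e.g.
    {"road": {"kn": "...", "ta": "..."}}. `transliterate` is any callable
    that converts one word from the source script to the target script.
    """
    # Reverse lookup: source-language word -> concept key.
    source_to_concept = {
        words[source_lang]: concept
        for concept, words in translations.items()
        if source_lang in words
    }
    converted = []
    for word in name.split():
        concept = source_to_concept.get(word)
        if concept is not None and target_lang in translations[concept]:
            # Step 2: use the listed translation for the target language.
            converted.append(translations[concept][target_lang])
        else:
            # Not in the translations list: fall back to transliteration.
            converted.append(transliterate(word))
    return " ".join(converted)
```

Word order is kept as in the source name, so a translated suffix stays a
suffix. Languages that need the translated word as a prefix would need an
extra reordering rule.
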
> The script is on GitHub
> <https://github.com/anura28/Automate-Translations-OSM/blob/master/automateIndicTranslation.py>.
> Feel free to fork it, use it, work on it, edit it and suggest changes,
> different languages, other possibilities, alternatives etc. Pull Requests
> very welcome. :)
> This is my first time writing code in Python, so advice on improving code
> would be very welcome. Also, let me know if I'm missing something else,
> obvious or subtle.
> Thanks!
> Warmly,
> Aruna
> _______________________________________________
> Talk-in mailing list
> Talk-in at openstreetmap.org
> https://lists.openstreetmap.org/listinfo/talk-in
