[Talk-in] Automating OSM translation into Indic languages

Sat Apr 4 18:24:16 UTC 2015

Hello!

Long email warning.

I've been thinking a little bit about automating the translation of maps
into multiple Indic languages ever since I saw the Kannada map at geoBLR in
March.

I started some work on it today, and I have lots of interesting things to
report. Right now I am mostly transliterating as opposed to translating but
if a dictionary of common words/tags can be compiled, upgrading the script
to translate instead of transliterating should be doable.

Here's the algorithm I followed:

   1. Get the nodes within a bounding box from OSM using the python wrapper
   for Overpass - overpy
   <http://python-overpy.readthedocs.org/en/latest/example.html> - This
   returns a collection of nodes and associated ID, tags, lat, lon and other
   attributes. This can also be repeated for ways by using the corresponding
   overpy query.
   2. Filter nodes that have tags
   3. From the result of the filter, identify nodes with Indic language
   tags - e.g. tag["name:kn"]
   4. Transliterate the string value for tag["name:kn"] to another language
   - I used Tamil - and store it within tag["name:ta"] - I used the Indic
   transliterator <http://silpa.org.in/Transliteration> APIs from SILPA
   <http://transliteration.readthedocs.org/en/latest/> for this
   5. Create a new changeset and upload the result (the node with
   tag["name:ta"]) to OSM using osmapi
   <http://osmapi.divshot.io/#OsmApi.OsmApi.NodeUpdate>

I did it only for one node:
https://www.openstreetmap.org/edit?node=1118255762#map=19/12.99451/77.55430
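
A minimal sketch of steps 2 to 4 above, written against plain dicts so it
runs without network access. The function names, the dict layout and the
`transliterate` callable are assumptions for illustration, not the code in
the linked script; in practice the callable would wrap the SILPA
transliterator. Skipping nodes that already have the target tag is also an
added assumption, so that existing hand-made names are not overwritten.

```python
def select_candidates(nodes, src_key="name:kn", dst_key="name:ta"):
    """Steps 2-3: keep nodes that have tags, carry the source-language
    name, and do not yet carry the target-language name."""
    return [
        node for node in nodes
        if node.get("tags")
        and src_key in node["tags"]
        and dst_key not in node["tags"]
    ]


def add_transliterated_tag(node, transliterate,
                           src_key="name:kn", dst_key="name:ta"):
    """Step 4: return a copy of the node with dst_key filled in.

    `transliterate` is a hypothetical callable taking the source string
    and returning the transliterated string.
    """
    tags = dict(node["tags"])  # copy, so the input node is left untouched
    tags[dst_key] = transliterate(tags[src_key])
    return {**node, "tags": tags}
```

The resulting dicts would then feed step 5, the upload through osmapi.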

*Advantages*

   - *Indic to Indic transliterations - ✓* The Indic transliterator APIs
   seem to convert quite effortlessly from one Indic language to another.
   Right now, support is available for Hindi, Tamil, Punjabi, Gujarati,
   Malayalam, Oriya, Bengali and Kannada. So, if a Kannada tag exists in OSM,
   the same text can be transliterated into multiple Indic languages using the
   naive algorithm I described above.
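
As a sketch of that fan-out: given one existing name tag, fill in the
others. The language codes are the ISO 639-1 codes used in OSM `name:<code>`
keys; the function name and the two-argument `transliterate(text, code)`
callable are assumptions for illustration and stand in for the real
transliterator.

```python
# ISO 639-1 codes for the languages listed above.
INDIC_CODES = {
    "Hindi": "hi", "Tamil": "ta", "Punjabi": "pa", "Gujarati": "gu",
    "Malayalam": "ml", "Oriya": "or", "Bengali": "bn", "Kannada": "kn",
}


def fan_out(tags, src_code, transliterate, codes=tuple(INDIC_CODES.values())):
    """Return the new name:<code> tags derivable from name:<src_code>.

    Languages that already have a name tag are left alone.
    `transliterate(text, code)` is a hypothetical callable.
    """
    src_key = "name:" + src_code
    if src_key not in tags:
        return {}
    new_tags = {}
    for code in codes:
        key = "name:" + code
        if code == src_code or key in tags:
            continue
        new_tags[key] = transliterate(tags[src_key], code)
    return new_tags
```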

*Limitations*

   - *English to Indic transliterations - X*: Though the Indic
   Transliterator works for English to Indic transliterations as well, it is
   not very useful. This is because only English words that are in the CMU
   dictionary can be transliterated - which means that we can't
   transliterate "Raajaajeenagar", even if we had a custom tag for
   transliteration on OSM. When I emailed the developer
   <http://thottingal.in/blog/about/> of the transliterator about extending
   the capabilities of English transliteration, I was told that extending the
   dictionary by adding additional words is one option. I am not sure how
   feasible this is, or whether it is any better than translating to one
   Indic language and transliterating+translating to the rest.
   - *Translations of English Words - X* - Right now, I am only able to
   transliterate words, but if a list of common words (I am guessing all the
   OSM tags, and other common words) could be compiled and translated into
   all the Indic languages, the translation process could be automated quite
   easily. This would require the algorithm to have 2 additional steps:

      1. From an Indic tag (i.e., an already translated tag), we would have
      to identify the portions that are in the translations list, and leave
      them out of the transliteration process.
      2. For the word(s) identified in step 1, we must find a translation
      in the translations list for the language we are translating into. This
      must then be suffixed or prefixed to the transliterated portion. I am
      guessing suffix will be the norm, while prefixes might occasionally be
      necessary.
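
A rough sketch of these two steps, assuming names can be split on
whitespace and that word order carries over between languages (so a
trailing "road" stays a suffix; prefix cases are not handled). The
`TRANSLATIONS` table, its single entry and the `transliterate(word, code)`
callable are illustrative assumptions; a real list would have to be
compiled as described above.

```python
# Illustrative entry only: concept -> {language code: word}.
TRANSLATIONS = {
    "road": {"kn": "ರಸ್ತೆ", "ta": "சாலை"},
}


def convert_name(name, src, dst, transliterate, translations=TRANSLATIONS):
    """Translate words found in the translations list, transliterate the rest."""
    # Reverse index: source-language word -> concept (step 1).
    known = {
        forms[src]: concept
        for concept, forms in translations.items()
        if src in forms
    }
    out = []
    for word in name.split():
        concept = known.get(word)
        if concept is not None and dst in translations[concept]:
            out.append(translations[concept][dst])  # step 2: look up translation
        else:
            out.append(transliterate(word, dst))    # everything else
    return " ".join(out)
```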

   - *Tracking node version numbers - X* - Right now, I am unable to read
   the version attribute of a node using the overpy API. I entered the
   version number manually. Not sure if I am missing something. This is just a
   "need-to-figure-out" issue more than anything. It is very important for
   automatically uploading a node to the server, because if there's a mismatch
   between the version number being passed to the API and the version number
   on the server, the server rejects the update.
   - *Which Indic language to begin transliterating from* - Issues might
   arise if a language like Tamil - where the letter for ka, kha, ga, gha etc
   is the same - is used as the source for transliterating to, say, Hindi.
   But if we start from a language like Kannada or Hindi, which keeps these
   sounds distinct, this issue can probably be avoided.
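
On the version-number point above, one possible workaround is to fetch the
node through osmapi just before updating it, since the dict returned by
`NodeGet` includes the current `version`. The helper name below is an
assumption for illustration, and the commented usage has not been run
against the live server.

```python
def build_node_update(current, extra_tags):
    """Build the dict passed to osmapi's NodeUpdate.

    `current` is a dict shaped like the result of osmapi's NodeGet
    (keys include id, lat, lon, tag, version). The version is copied
    from the server's copy instead of being typed in by hand.
    """
    tags = dict(current["tag"])  # copy, so `current` is not modified
    tags.update(extra_tags)
    return {
        "id": current["id"],
        "lat": current["lat"],
        "lon": current["lon"],
        "tag": tags,
        "version": current["version"],
    }


# Intended usage (needs credentials and network access, so left commented):
#   api = osmapi.OsmApi(username="...", password="...")
#   api.ChangesetCreate({"comment": "Add transliterated name tags"})
#   api.NodeUpdate(build_node_update(api.NodeGet(node_id), {"name:ta": value}))
#   api.ChangesetClose()
```

Overpass can also return version metadata when the query ends in
`out meta;`, which may be another route worth checking from the overpy side.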

The script is on Github
<https://github.com/anura28/Automate-Translations-OSM/blob/master/automateIndicTranslation.py>.
Feel free to fork it, use it, work on it, edit it and suggest changes,
different languages, other possibilities, alternatives etc. Pull requests
are very welcome. :)

This is my first time writing code in Python, so advice on improving code
would be very welcome. Also, let me know if I'm missing something else,
obvious or subtle.

Thanks!

Warmly,

Aruna