[OSM-dev] Need advise for localization of Mapnik/Osmarender/Data search

Milo van der Linden mlinden at zeelandnet.nl
Thu Jul 10 12:42:48 BST 2008


Arne Goetje wrote:
> Hi list,
> 
> I need some advice on the best way to localize our own OSM server.
> Goal: to have one layer for each language for Mapnik and Osmarender,
> and an improved search engine for the planet.osm data.
> 
> The facts:
> We have already
>  * a dedicated server to host a planet.osm excerpt of Taiwan with Mapnik
> set up.
>  * localized name tags in the planet.osm data. Almost all data for
> Taiwan uses English in the 'name' and 'ref' tags and Chinese in
> 'name:zh' and 'ref:zh' as well as some other local languages with the
> appropriate language tags.
Nice! I assume that the main data is stored on the central OSM server,
and that you get a recurring update and use that for your database? Or
is localization done entirely on your server?
> 
> Now for Mapnik and Osmarender: we want to add a layer for each language
> code on top of the default rendering layer (which uses English and the
> normal 'name' and 'ref' tags), which uses the localized 'name:$lang' and
> 'ref:$lang' tags and the 'name' and 'ref' tags as fallback.
> 
> How do we do that?
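
A minimal sketch of the fallback rule described above, assuming each
object's tags are available as a plain dictionary. The function name
localized_tag and the sample tags are made up for illustration; this is
not Mapnik or Osmarender configuration, only the selection logic a
per-language layer would need.

```python
def localized_tag(tags, lang, key="name"):
    """Return the value of '<key>:<lang>' if set, else fall back to '<key>'.

    An empty string is treated the same as a missing tag.
    Returns None when neither tag is present.
    """
    return tags.get(f"{key}:{lang}") or tags.get(key)


# Hypothetical tags for one road, following the tagging scheme in the mail.
road = {"name": "Zhongzheng Rd.", "name:zh": "中正路"}

label_zh = localized_tag(road, "zh")  # localized name exists
label_ko = localized_tag(road, "ko")  # no 'name:ko', falls back to 'name'
```

The same call with key="ref" covers the 'ref' / 'ref:$lang' pair.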
> 
> Then the search engine: currently it only works well for addresses which
> use spaces, but not for Han and Hangul scripts. The reason is that in
> Chinese, Japanese and Korean, addresses don't have spaces. Instead the
> levels (Street, Number, Village, City, County, etc.) are distinguished
> by a specific character. For example (Chinese as used in Taiwan): the
> address Zhongzheng Rd. in Taipei would be written 台北市中正路 in one
> string without spaces, where 台北市 stands for "Taipei City" (市 being
> the character for City) and 中正路 stands for Zhongzheng Rd. (路 being
> the character for Road).
> In the planet.osm data we have 'is_in' and 'is_in:zh' tags, where the
> Chinese version uses the same way to write the address:
> The road Zhongzheng Rd. in Taipei has the 'is_in:zh' value 台灣台北市
> (means Taiwan, Taipei City).
> Another more complex (but not the most complex) example:
> A search for 桃園縣八德市介壽路二段325巷1弄1衖 should find the alley
> '介壽路二段325巷1弄1衖' (Alley 1-1, Ln. 325, Jieshou Rd. Sec. 2), which has
> the 'is_in:zh' tag value 台灣桃園縣八德市 (Bade City, Taoyuan County,
> Taiwan).
> 
> So, we need to enhance the search engine code to
>  a) not rely on spaces as delimiters
>  b) for Han and Hangul scripts, know the correct and possible alternate
> address schemes
>  c) handle every possible English transliteration (for example 'Jieshou
> Rd. Sec. 2' could also be written 'Sec. 2 Jieshou Rd.' and in multiple
> other ways (Sec. 2, Jie-Shou Rd., etc.))
>  d) handle spelling variants in English transliteration (for example the
> road name Zhongzheng Rd. (中正路) can also be written ZhongZheng Rd.,
> Zhong-Zheng Rd., Jhongjheng Rd., Jungjeng Rd., Chung-cheng Rd., and many
> more). Many municipalities in Taiwan use alternate spelling systems, as
> there exists no standard but many different ways to transliterate
> Chinese characters into English. And people's name cards can also
> contain spelling systems which you won't find on street signs anymore
> (in my old company, my name card had the street name written as Tzu-You
> Rd., although the City administration changed the spellings on the road
> signs to be Ziyou Rd.).

Is your local database PostgreSQL/PostGIS?
In that case I would strongly advise upgrading to the latest
PostgreSQL: 8.3. It comes with full text search out of the box.

<http://www.postgresql.org/docs/current/static/textsearch.html>

Full text search makes transliteration a piece of cake and makes
matching case-insensitive. You would need to write your own "sounds-like"
script on the database for cases like Zhong-Zheng Rd., Jhongjheng Rd.,
Jungjeng Rd. and Chung-cheng Rd., for which I would advise you to
maintain a separate data table.
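
A rough sketch of what such a separate alias table could look like,
written in Python rather than as a database script so it can be tried
on its own. The names ALIASES, normalize and canonical_name are
hypothetical, and the table only holds the spellings mentioned in this
thread; a real table would be built from a proper list of aliases.

```python
import re

# Hypothetical alias table: normalized spelling -> value of the 'name' tag.
# In the database this would be a separate table with the same two columns.
ALIASES = {
    "zhongzheng": "Zhongzheng Rd.",
    "jhongjheng": "Zhongzheng Rd.",
    "jungjeng": "Zhongzheng Rd.",
    "chungcheng": "Zhongzheng Rd.",
}


def normalize(name):
    """Lowercase, drop the word 'Rd'/'Rd.', and keep only the letters a-z."""
    name = name.lower()
    name = re.sub(r"\brd\b\.?", "", name)
    return re.sub(r"[^a-z]", "", name)


def canonical_name(query):
    """Look up a user-typed road name; returns None if no alias is known."""
    return ALIASES.get(normalize(query))
```

With this, 'ZhongZheng Rd.', 'Zhong-Zheng Rd.' and 'Chung-cheng Rd' all
resolve to the same 'name' value, while an unlisted spelling such as
'Tzu-You Rd.' returns None until it is added to the table.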

Regarding the lack of spaces in addresses: a minimum amount of spaces is
needed for full text search to work. So I would suggest writing a script
that will strip out house numbers, city names and street names and store
them in separate tables. Or you can put spaces in front of and behind
every single character...
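
Both options can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the set of level-marker
characters is taken from the examples in this thread (縣 county, 市
city, 路 road, 段 section, 巷 lane, 弄 and 衖 alley) and is not a
complete list, and the function names are made up.

```python
import re

# Assumed level markers, taken only from the examples in this thread.
LEVEL_MARKERS = "縣市路段巷弄衖"

# One segment = a run of non-marker characters plus the marker that ends it.
# A trailing run without a marker is kept as its own segment.
_SEGMENT = re.compile(f"[^{LEVEL_MARKERS}]+[{LEVEL_MARKERS}]?")


def split_levels(address):
    """Split an unspaced address into its levels (county, city, road, ...)."""
    return _SEGMENT.findall(address)


def space_out(text):
    """Put a space between every character so each one becomes a token."""
    return " ".join(text)


parts = split_levels("桃園縣八德市介壽路二段325巷1弄1衖")
# parts == ['桃園縣', '八德市', '介壽路', '二段', '325巷', '1弄', '1衖']
```

The split is naive: a place name that itself contains one of the marker
characters would be cut in the wrong place, so a real script would need
the list of known names and address patterns as well.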

Read the documentation on text search with PostgreSQL; it will probably
be a good help and starting point!



> 
> It would be great if we could feed those improvements back into the main
> OSM project, so that the search on the main OSM website also delivers
> the correct results. So the question is: where is the code we need to
> enhance and how to coordinate it? I could provide a list of aliases for
> road names and Chinese characters to classify address patterns (County,
> City, Village, etc.) and explain the possible address patterns.
> 
> Cheers
> Arne
> - --
> Arne Götje (高盛華) <arne at linux.org.tw>
> PGP/GnuPG key: 1024D/685D1E8C
> Fingerprint: 2056 F6B7 DEA8 B478 311F  1C34 6E9F D06E 685D 1E8C
> Key available at wwwkeys.pgp.net.   Encrypted e-mail preferred.
> 
