[OSM-talk] Japan KSJ2 Import

Frederik Ramm frederik at remote.org
Mon Jun 20 22:05:37 BST 2011


Hi,

    is someone on this list involved in OSM in Japan? I'll go to talk-jp 
with the issue if not, but maybe the right people are reading this here 
also.

I noticed that a lot of data has been imported from a "KSJ2" data set, 
and this data has many tags that I consider unnecessary.

The whole import seems to comprise about 3.5 million nodes, 680k ways, 
and 9000 relations.

3.3 million nodes are tagged with something like

     <tag k="KSJ2:coordinate" v="32.787857 130.687672"/>
     <tag k="KSJ2:lat" v="32.787857"/>
     <tag k="KSJ2:long" v="130.687672"/>

which means that the node coordinates are stored three times - once in 
the node itself and twice in the tags.
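
As an illustration of that redundancy, here is a minimal Python sketch that reports which KSJ2 coordinate tags on a node merely repeat the node's own lat/lon attributes. The sample node (including its id), the helper names and the tolerance are assumptions made for the example; only the tag keys and the coordinate values come from the tags quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the node id is made up, the tag values are the ones quoted above.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="32.787857" lon="130.687672">
    <tag k="KSJ2:coordinate" v="32.787857 130.687672"/>
    <tag k="KSJ2:lat" v="32.787857"/>
    <tag k="KSJ2:long" v="130.687672"/>
  </node>
</osm>"""


def _matches(values, expected, tolerance):
    """True if `values` parse as floats equal to `expected` within `tolerance`."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return False
    return len(numbers) == len(expected) and all(
        abs(n - e) <= tolerance for n, e in zip(numbers, expected)
    )


def redundant_coordinate_tags(node, tolerance=1e-7):
    """Return, sorted, the KSJ2 coordinate tag keys that repeat the node's lat/lon."""
    lat, lon = float(node.get("lat")), float(node.get("lon"))
    tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
    candidates = {
        "KSJ2:coordinate": (tags.get("KSJ2:coordinate", "").split(), [lat, lon]),
        "KSJ2:lat": ([tags.get("KSJ2:lat", "")], [lat]),
        "KSJ2:long": ([tags.get("KSJ2:long", "")], [lon]),
    }
    return sorted(
        key for key, (values, expected) in candidates.items()
        if _matches(values, expected, tolerance)
    )


node = ET.fromstring(SAMPLE).find("node")
print(redundant_coordinate_tags(node))
```

A tag whose value no longer agrees with the node position (for example after the node was moved) is not reported, which is arguably a further reason not to keep coordinates in tags.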

About 3.3 million objects are tagged with something like

     <tag k="note" v="National-Land Numerical Information (Railway) 2007, MLIT Japan"/>
     <tag k="note:ja" v="国土数値情報(鉄道データ)平成19年国土交通省"/>
     <tag k="source" v="KSJ2"/>
     <tag k="source_ref" v="http://nlftp.mlit.go.jp/ksj/jpgis/datalist/KsjTmplt-N02-v1_1.html"/>

which is a lot of text where in my opinion a simple source tag on the 
changeset would have been sufficient. (The overwhelming majority of 
source_ref tags, 2.9 million, point to "KsjTmplt-N03.html", but another 
17 are in use; the distribution for note:ja is similar, with two 
messages being used 1.8 and 1.0 million times respectively, and a 
handful of others in use.)

3.1 million nodes used by ways are tagged with something like

     <tag k="KSJ2:curve_id" v="c00100298"/>
     <tag k="KSJ2:filename" v="N03-090320_40_new.xml"/>

which strikes me as a bit unnecessary as well; if really required, then 
that could go on the way using the nodes and not on every single node!

In addition to that, we have 1.1 million objects tagged with

     <tag k="created_by" v="National-Land-Numerical-Information_MLIT_Japan"/>

- also something that we usually put on changesets, and that seems to 
duplicate information already in the source and note tags.

There are also about 360k occurrences, on nodes used by ways, of the 
tags KSJ2:INT, KSJ2:INT_label, KSJ2:LIN, KSJ2:OPC, KSJ2:RAC; I have no 
idea what these are for, but do they really have to go on the nodes?

I would like to see this (in my opinion) superfluous information 
removed. We would get rid of about 30 million tags. The size of the 
Japan dataset (in XML form) would shrink by 13% from 13.1 to 11.5 GB, 
the .osm.pbf would shrink by 14% from 585 to 501 MB. About 1 GB of 
database storage would be saved on the central OSM database server.

Needless to say, any software that processes the Japan dataset would 
also run faster and consume fewer resources.
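
To make the proposed cleanup concrete, here is a rough Python sketch that drops a fixed set of tag keys from an OSM XML document and counts how many tags went away. The key set is an assumption that covers only the per-node KSJ2 tags quoted above; whether note, source_ref, created_by and the KSJ2:INT family should go as well is exactly what is up for discussion. The function name and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed key set: only the per-node KSJ2 tags quoted in this message.
DROP_KEYS = {
    "KSJ2:coordinate",
    "KSJ2:lat",
    "KSJ2:long",
    "KSJ2:curve_id",
    "KSJ2:filename",
}

# Hypothetical sample; ids and the railway tag are invented for the example.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="32.787857" lon="130.687672">
    <tag k="KSJ2:lat" v="32.787857"/>
    <tag k="KSJ2:long" v="130.687672"/>
    <tag k="KSJ2:curve_id" v="c00100298"/>
    <tag k="railway" v="station"/>
  </node>
</osm>"""


def strip_tags(root, drop_keys=DROP_KEYS):
    """Remove <tag> children whose key is in `drop_keys`; return how many were removed."""
    removed = 0
    for obj in root:
        if obj.tag not in ("node", "way", "relation"):
            continue
        # Copy the list first so removing children does not disturb the iteration.
        for tag in list(obj.findall("tag")):
            if tag.get("k") in drop_keys:
                obj.remove(tag)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
print(strip_tags(root), "tags removed")
```

This loads the whole document into memory, so it only suits small extracts; a file the size of the Japan dataset would need a streaming approach instead.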

Can anybody comment on this? Are any of the tags that I mentioned above 
actually used by anyone for anything?

In addition, there are 22 multipolygons from the same import, with more 
than 1000 members each (the top three being #1337942 with 10865 members, 
#1060553 with 5637, and #1069424 with 4518). While it is not wrong for a 
multipolygon to have so many members, this makes the affected areas very 
difficult to render and edit, and has the potential to bring 
unsuspecting relation processing software to a halt. Most of these 
multipolygons cannot even be downloaded via the API because it takes so 
long. I would like these multipolygons (all natural=wood I believe) 
split up into smaller entities.

It would be great if someone involved with the Japan community could 
deal with these issues; but I would also be willing to do it myself if 
that's ok with the community in Japan.

Finally, I am unsure if the KSJ2 import is even complete; if it is not, 
and still ongoing, then the numbers reported above might not even be the 
last word. In that case I would like to ask whoever is masterminding the 
import to maybe modify their scripts to include fewer superfluous tags. 
(Objects in question seem to be uploaded by a variety of users so I 
cannot detect from the object history alone who runs the import.)

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"


