[OSM-ja] About the source tags of the KSJ2 import (Was: [OSM-talk] Japan KSJ2 Import)

Shu Higashi s_higash @ mua.biglobe.ne.jp
Tue Jun 21 05:04:51 BST 2011


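This is Higashi.

Frederik Ramm has sent a question to the Japanese community.
My rough summary of it is below.
(Please refer to the original message for the exact wording.)

In short, it concerns the source-related tags of the KSJ2 import:
there appears to be a very large amount of redundant data, so is all of it really necessary,
and could the amount of tagging be reduced?
Cleaning it up would shrink the Japan dataset, which he thinks would make everyone happy.

If anyone knows the background of the current tagging scheme, please let us know.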

<Summary follows>
1. The coordinates are written three times on 3.3 million nodes.
(Wouldn't once be enough?)
<tag k="KSJ2:coordinate" v="32.787857 130.687672"/>
<tag k="KSJ2:lat" v="32.787857"/>
<tag k="KSJ2:long" v="130.687672"/>

2. There are 3.3 million objects tagged as follows;
wouldn't a single simple source tag be enough?
<tag k="note" v="National-Land Numerical Information (Railway) 2007,
MLIT Japan"/>
<tag k="note:ja" v="??????(?????)??19??????"/>
<tag k="source" v="KSJ2"/>
<tag k="source_ref"
v="http://nlftp.mlit.go.jp/ksj/jpgis/datalist/KsjTmplt-N02-v1_1.html"/>

3. Is there a reason the following tags are needed on the 3.1 million nodes
that belong to ways?
Even if they are needed, couldn't they be put on the way rather than on every node?
<tag k="KSJ2:curve_id" v="c00100298"/>
<tag k="KSJ2:filename" v="N03-090320_40_new.xml"/>

4. There are 1.1 million objects carrying the following tag;
doesn't it duplicate the source and note tags?
<tag k="created_by" v="National-Land-Numerical-Information_MLIT_Japan"/>

5. The following tags appear on 360,000 nodes that belong to ways;
their meaning, and why they need to be on each individual node, is unclear.
KSJ2:INT, KSJ2:INT_label, KSJ2:LIN, KSJ2:OPC, KSJ2:RAC;

Cleaning up (deleting) these duplicates would reduce the Japan data by 13%
(from 13.1 GB to 11.5 GB as an XML file).


---------- Forwarded message ----------
From: Frederik Ramm <frederik @ remote.org>
Date: Mon, 20 Jun 2011 23:05:37 +0200
Subject: [OSM-talk] Japan KSJ2 Import
To: Talk Openstreetmap <talk @ openstreetmap.org>

Hi,

    is someone on this list involved in OSM in Japan? I'll go to talk-jp
with the issue if not, but maybe the right people are reading this here
also.

I noticed that a lot of data has been imported from a "KSJ2" data set,
and this data has many tags that I consider unnecessary.

The whole import seems to comprise about 3.5 million nodes, 680k ways,
and 9000 relations.

3.3 million nodes are tagged with something like

     <tag k="KSJ2:coordinate" v="32.787857 130.687672"/>
     <tag k="KSJ2:lat" v="32.787857"/>
     <tag k="KSJ2:long" v="130.687672"/>

which means that the node coordinates are stored three times - once in
the node itself and twice in the tags.
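
As a rough illustration of the redundancy described above, the following
Python sketch removes the three coordinate tags from a node, but only when
they merely repeat the node's own lat/lon attributes. The function name
strip_redundant_coordinate_tags, the tolerance, and the sample node id are
assumptions made for this example; they are not taken from any import script.

```python
import xml.etree.ElementTree as ET

# Tag keys that repeat the node position (taken from the example above).
COORD_KEYS = {"KSJ2:coordinate", "KSJ2:lat", "KSJ2:long"}


def strip_redundant_coordinate_tags(node, tolerance=1e-7):
    """Remove KSJ2 coordinate tags that only repeat the node's lat/lon.

    Hypothetical helper for illustration. Returns the number of tags
    removed. Tags whose values disagree with the node position, or that
    cannot be parsed, are left in place for manual review.
    """
    lat = float(node.get("lat"))
    lon = float(node.get("lon"))
    expected_by_key = {
        "KSJ2:coordinate": [lat, lon],
        "KSJ2:lat": [lat],
        "KSJ2:long": [lon],
    }
    removed = 0
    for tag in list(node.findall("tag")):
        key = tag.get("k")
        if key not in COORD_KEYS:
            continue
        try:
            numbers = [float(part) for part in tag.get("v", "").split()]
        except ValueError:
            continue  # unparsable value: leave it alone
        expected = expected_by_key[key]
        if len(numbers) == len(expected) and all(
            abs(a - b) <= tolerance for a, b in zip(numbers, expected)
        ):
            node.remove(tag)
            removed += 1
    return removed


# Sample node built from the values quoted above; the id is made up.
SAMPLE = """<node id="1" lat="32.787857" lon="130.687672">
  <tag k="KSJ2:coordinate" v="32.787857 130.687672"/>
  <tag k="KSJ2:lat" v="32.787857"/>
  <tag k="KSJ2:long" v="130.687672"/>
  <tag k="source" v="KSJ2"/>
</node>"""

node = ET.fromstring(SAMPLE)
removed_count = strip_redundant_coordinate_tags(node)
remaining_keys = [t.get("k") for t in node.findall("tag")]
print(removed_count, remaining_keys)
```

Comparing against the node position before deleting is a deliberately
cautious choice: a tag that differs from the stored coordinates would carry
information and is kept.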

About 3.3 million objects are tagged with something like

     <tag k="note" v="National-Land Numerical Information (Railway)
2007, MLIT Japan"/>
     <tag k="note:ja" v="??????(?????)??19??????"/>
     <tag k="source" v="KSJ2"/>
     <tag k="source_ref"
v="http://nlftp.mlit.go.jp/ksj/jpgis/datalist/KsjTmplt-N02-v1_1.html"/>

which is a lot of text where in my opinion a simple source tag on the
changeset would have been sufficient. (The overwhelming majority of
source_ref tags, 2.9 million, point to "KsjTmplt-N03.html", but another
17 are in use; the distribution for note:ja is similar, with two
messages being used 1.8 and 1.0 million times respectively, and a
handful of others in use.)

3.1 million nodes used by ways are tagged with something like

     <tag k="KSJ2:curve_id" v="c00100298"/>
     <tag k="KSJ2:filename" v="N03-090320_40_new.xml"/>

which strikes me as a bit unnecessary as well; if really required, then
that could go on the way using the nodes and not on every single node!
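
A minimal sketch of what "putting it on the way" could look like: if every
node of a way carries the same value for a key, the tag is added once to the
way and dropped from the nodes. The helper names find_tag and
hoist_shared_tag and the sample ids are hypothetical, and the sketch assumes
the nodes are not shared with another way that still needs the tag.

```python
import xml.etree.ElementTree as ET


def find_tag(element, key):
    """Return the <tag> child with the given key, or None."""
    for tag in element.findall("tag"):
        if tag.get("k") == key:
            return tag
    return None


def hoist_shared_tag(root, way, key):
    """Move `key` from the nodes of `way` onto the way itself.

    Hypothetical helper: acts only when every node of the way carries the
    same value for `key`. Returns True if the tag was moved.
    """
    nodes = {n.get("id"): n for n in root.findall("node")}
    # dict.fromkeys keeps order and drops the repeated node of a closed way.
    refs = list(dict.fromkeys(nd.get("ref") for nd in way.findall("nd")))
    members = [nodes[ref] for ref in refs]
    tags = [find_tag(n, key) for n in members]
    if not tags or any(t is None for t in tags):
        return False
    values = {t.get("v") for t in tags}
    if len(values) != 1:
        return False  # nodes disagree, so nothing is moved
    ET.SubElement(way, "tag", {"k": key, "v": values.pop()})
    for node, tag in zip(members, tags):
        node.remove(tag)
    return True


# Tiny made-up extract using the tag values quoted above.
SAMPLE = """<osm>
  <node id="1" lat="33.0" lon="130.0">
    <tag k="KSJ2:curve_id" v="c00100298"/>
    <tag k="KSJ2:filename" v="N03-090320_40_new.xml"/>
  </node>
  <node id="2" lat="33.1" lon="130.1">
    <tag k="KSJ2:curve_id" v="c00100298"/>
    <tag k="KSJ2:filename" v="N03-090320_40_new.xml"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE)
way = root.find("way")
moved = [hoist_shared_tag(root, way, k)
         for k in ("KSJ2:curve_id", "KSJ2:filename")]
way_tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
node_tag_count = sum(len(n.findall("tag")) for n in root.findall("node"))
print(moved, way_tags, node_tag_count)
```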

In addition to that, we have 1.1 million objects tagged with

     <tag k="created_by"
v="National-Land-Numerical-Information_MLIT_Japan"/>

- also something that we usually put on changesets, and that seems to
duplicate information already in the source and note tags.

There are also about 360k occurrences, on nodes used by ways, of the
tags KSJ2:INT, KSJ2:INT_label, KSJ2:LIN, KSJ2:OPC, KSJ2:RAC; I have no
idea what these are for but do they have to go on the nodes really?

I would like to see this (in my opinion) superfluous information
removed. We would get rid of about 30 million tags. The size of the
Japan dataset (in XML form) would shrink by 13% from 13.1 to 11.5 GB,
the .osm.pbf would shrink by 14% from 585 to 501 MB. About 1 GB of
database storage would be saved on the central OSM database server.

Needless to say, any software that processes the Japan dataset would
also run faster and consume fewer resources.

Can anybody comment on this? Are any of the tags that I mentioned above
actually used by anyone for anything?

In addition, there are 22 multipolygons from the same import, with more
than 1000 members each (the top three being #1337942 with 10865 members,
#1060553 with 5637, and #1069424 with 4518). While it is not wrong for a
multipolygon to have so many members, this makes the affected areas very
difficult to render and edit, and has the potential to bring
unsuspecting relation processing software to a halt. Most of these
multipolygons cannot even be downloaded via the API because it takes so
long. I would like these multipolygons (all natural=wood I believe)
split up into smaller entities.
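
For anyone who wants to locate such relations in their own extract, here is
a small sketch that lists multipolygon relations above a member threshold.
The function name large_multipolygons and the sample data are assumptions
for illustration. It loads the whole document into memory, so a
country-sized file would need a streaming parser instead.

```python
import xml.etree.ElementTree as ET


def large_multipolygons(root, threshold=1000):
    """Return (relation id, member count) pairs, largest first.

    Hypothetical helper: counts the <member> children of every relation
    tagged type=multipolygon and keeps those above `threshold`.
    """
    result = []
    for rel in root.findall("relation"):
        is_multipolygon = any(
            t.get("k") == "type" and t.get("v") == "multipolygon"
            for t in rel.findall("tag")
        )
        count = len(rel.findall("member"))
        if is_multipolygon and count > threshold:
            result.append((rel.get("id"), count))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Made-up miniature extract; a low threshold stands in for 1000.
SAMPLE = """<osm>
  <relation id="100">
    <member type="way" ref="1" role="outer"/>
    <member type="way" ref="2" role="inner"/>
    <member type="way" ref="3" role="inner"/>
    <tag k="type" v="multipolygon"/>
    <tag k="natural" v="wood"/>
  </relation>
  <relation id="101">
    <member type="way" ref="4" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
  <relation id="102">
    <member type="way" ref="5" role=""/>
    <member type="way" ref="6" role=""/>
    <member type="way" ref="7" role=""/>
    <member type="way" ref="8" role=""/>
    <tag k="type" v="route"/>
  </relation>
</osm>"""

root = ET.fromstring(SAMPLE)
found = large_multipolygons(root, threshold=2)
print(found)
```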

It would be great if someone involved with the Japan community could
deal with these issues; but I would also be willing to do it myself if
that's ok with the community in Japan.

Finally, I am unsure if the KSJ2 import is even complete; if it is not,
and still ongoing, then the numbers reported above might not even be the
last word. In that case I would like to ask whoever is masterminding the
import to maybe modify their scripts to include less superfluous tags.
(Objects in question seem to be uploaded by a variety of users so I
cannot detect from the object history alone who runs the import.)

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik @ remote.org  ##  N49°00'09" E008°23'33"

_______________________________________________
talk mailing list
talk @ openstreetmap.org
http://lists.openstreetmap.org/listinfo/talk


