[talk-ph] bulk editing address info in POIs

Mike Collinson mike at ayeltd.biz
Sun Aug 16 17:02:08 BST 2009


I think Eugene and I have reasonably covered both sides of the argument.  I think what I am trying to say is that de-normalization (putting some redundancy back) is good for speeding  things up both for getting data out AND for putting it in. I may not have a copyright-free boundary (or just be too lazy to get it) but I do know that such and such a POI is within such and such a place, perhaps down to barangay level perhaps not. If everyone does the same thing, we rapidly establish de-facto boundaries.
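To make the trade-off concrete, here is a minimal sketch comparing the two ways of getting data out: a plain tag comparison versus a point-in-polygon test against a boundary. Everything in it is an assumption for illustration: the POI names, the coordinates, the square "boundary" and the is_in:city key are invented, and the ray-casting routine is a generic textbook method, not anything taken from OSM tooling.

```python
# Hypothetical POIs; names, coordinates and tags are invented for illustration.
POIS = [
    {"name": "Cafe", "lat": 1.0, "lon": 1.0, "is_in:city": "Sample City"},
    {"name": "Shop", "lat": 5.0, "lon": 5.0, "is_in:city": "Other City"},
]

# Hypothetical boundary of "Sample City" as a list of (lat, lon) vertices.
SAMPLE_CITY_BOUNDARY = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.0)]


def point_in_polygon(lat, lon, polygon):
    """Generic ray-casting test: count boundary edges crossed by a ray."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's longitude can be crossed.
        if (lon1 > lon) != (lon2 > lon):
            crossing_lat = lat1 + (lon - lon1) * (lat2 - lat1) / (lon2 - lon1)
            if lat < crossing_lat:
                inside = not inside
    return inside


def pois_by_tag(pois, city):
    """De-normalized look-up: one string comparison per POI."""
    return [p["name"] for p in pois if p.get("is_in:city") == city]


def pois_by_boundary(pois, polygon):
    """Normalized look-up: one polygon test per POI, needs boundary data."""
    return [p["name"] for p in pois if point_in_polygon(p["lat"], p["lon"], polygon)]


print(pois_by_tag(POIS, "Sample City"))
print(pois_by_boundary(POIS, SAMPLE_CITY_BOUNDARY))
```

Both calls select the same POI here, but the tag version works even when no boundary polygon is available, while the boundary version stays correct even when a tag is missing or misspelt.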

So, where I think we come together is:

- Normalization is a good goal for the long term as our data is more complete and our software tools more sophisticated.

- FULLY redundant data is not necessary.  I've experimented by simply putting the next tier up in an is_in:* tag. For example, if it is a barangay, just put the municipality. It is an extra step, but not too difficult a step, for software to then look up the municipality, see what is_in:* tags it has, and repeat the process - effectively the normalization as effected in a relational database.  I am slowly writing search/gazetteer software to do that and put the results in a separate database which can be regenerated on each new planet download.
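The "next tier up" look-up described in the second point can be sketched roughly as follows. This is only an illustration under stated assumptions, not the search/gazetteer software mentioned above: the PLACES dictionary, the place names "Barangay A" and "Town B", and the exact is_in:* key names are all hypothetical.

```python
# Hypothetical in-memory gazetteer: each place stores only its next tier up.
# Key names such as "is_in:municipality" are illustrative, not a fixed schema.
PLACES = {
    "Barangay A": {"place": "barangay", "is_in:municipality": "Town B"},
    "Town B": {"place": "municipality", "is_in:province": "Sorsogon"},
    "Sorsogon": {"place": "province", "is_in:country": "Philippines"},
    "Philippines": {"place": "country"},
}


def resolve_hierarchy(name, places):
    """Follow is_in:* tags upward and return the chain of containing places."""
    chain = []
    seen = {name}
    current = name
    while current in places:
        parents = [v for k, v in places[current].items() if k.startswith("is_in:")]
        if not parents:
            break  # top of the hierarchy reached
        parent = parents[0]  # assumption: one parent tier per place
        if parent in seen:
            break  # guard against circular tagging
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


print(resolve_hierarchy("Barangay A", PLACES))
```

The walk stops as soon as a parent is not found in the gazetteer, which is exactly the failure case that extra, partly redundant tags can help with.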

In our crowd-sourcing environment there is always a danger that the municipality tag is missing, deleted, or spelt differently.  So I think there is still value in what I call "seed points", i.e. randomly putting more information than strictly necessary on some tags.  So, for example, if the municipality tag is missing but a barangay mentions that it is in Sorsogon Province, then I've found it is possible to generate a tag for the municipality and even locate it by creating a simple rectangle around the barangay tags.
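A rough sketch of the seed-point idea, under stated assumptions: the barangay records, their coordinates and the is_in:* key names below are invented, and infer_municipality is a hypothetical helper, not the actual software. It rebuilds a missing municipality from the barangays that name it, using a simple bounding rectangle and whatever province a seed point supplies.

```python
# Hypothetical barangay nodes; names and coordinates are invented.
BARANGAYS = [
    {"name": "Barangay A", "lat": 12.90, "lon": 123.95,
     "is_in:municipality": "Town B"},
    {"name": "Barangay C", "lat": 12.98, "lon": 124.02,
     "is_in:municipality": "Town B",
     "is_in:province": "Sorsogon"},  # the "seed point" with extra information
    {"name": "Barangay D", "lat": 12.94, "lon": 124.05,
     "is_in:municipality": "Town B"},
]


def infer_municipality(name, barangays):
    """Build a stand-in municipality record from barangays that refer to it."""
    members = [b for b in barangays if b.get("is_in:municipality") == name]
    if not members:
        return None
    lats = [b["lat"] for b in members]
    lons = [b["lon"] for b in members]
    inferred = {
        "name": name,
        "place": "municipality",
        # Simple rectangle: (min_lat, min_lon, max_lat, max_lon).
        "bbox": (min(lats), min(lons), max(lats), max(lons)),
    }
    # Copy the province only if the seed points agree on a single value.
    provinces = {b["is_in:province"] for b in members if "is_in:province" in b}
    if len(provinces) == 1:
        inferred["is_in:province"] = provinces.pop()
    return inferred


print(infer_municipality("Town B", BARANGAYS))
```

Only one of the three barangays carries the province, yet that is enough to attach it to the generated municipality; conflicting seed values are deliberately left unresolved rather than guessed.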

Sorry if I am waffling on a bit but this is an interesting subject for me and I value the chance for discussion!

Mike

PS I am not sure exactly how this fits into this discussion but we also have to remember the psycho-geographic. Many cities like Sydney and, formerly, London don't actually officially exist. When people search for something "in Manila" they often will not mean Mayor Lim's kingdom but the built-up place that vaguely corresponds to Metro Manila or the NCR.


At 04:26 PM 16/08/2009, Eugene Alvin Villar wrote:
>Well, after thinking about it, maybe using only addr:city (for both cities and municipalities) is a good compromise.
>
>Some Q&As on my point of view:
>
>Q. Why is duplication bad?
>A. Well, I come from a software engineering background and in designing database systems, redundancies are not good as Mike has stated (there's actually a whole academic subject on the topic of database normalization just to remove every redundancy in the data). The trade-off, however, is that look-up performance goes down as a result (e.g., finding all the POIs in Makati is not as fast to do unless you did pre-processing). So sometimes, if you know what you are doing, de-normalization (putting some redundancy back) can speed things up.
>
>Q. So can we add addr:city, etc.?
>A. While adding these makes me cringe due to redundancy, I see the merit for a compromise. My proposal is to only add addr:city and not addr:village, addr:state, addr:country.
>
>Q. Why not add also addr:state (for provinces) and addr:country?
>A. Because I don't think making the data FULLY redundant is worth it, considering the trade-offs (see the pros and cons in my previous e-mail on this topic). If a POI is tagged as addr:city=Makati, then it already implies that addr:country=Philippines. It's possible that there is another Makati city elsewhere in the world such that addr:country is needed for disambiguation of a POI, but the POI's lat-long already does the disambiguation.
>
>Q. Why not add also addr:village (for barangays)?
>A. My thinking is that addr:city is enough to reduce the look-up cost. It is certainly computationally intensive to determine the barangay, city/municipality, and province of a POI by checking whether the POI lies within a barangay/city/municipality/province's boundary polygon (though there are plenty of ways to optimize this). But by specifying the addr:city, the search space is now reduced by two orders of magnitude. (Besides, at least for Metro Manila, barangays are really not used for addressing information.)
>
>Q. Why not tag POIs within municipalities using addr:town or addr:municipality; the Karlsruhe schema allows for arbitrary addr:* tags.
>A. I suggest using addr:city for both cities and municipalities only as a convention. That way, when a municipality later becomes a city, there is no need to change addr:municipality keys to addr:city.
>
>
>Now here's a question: the is_in:* tags and addr:* tags overlap each other in function. We should stick to one. The Karlsruhe schema (http://wiki.openstreetmap.org/wiki/Proposed_features/House_numbers/Karlsruhe_Schema) is silent on this but the Key:addr page (http://wiki.openstreetmap.org/wiki/Key:addr) actually suggests using is_in:*. I favor using is_in
>
>
>Eugene / seav
>
>
>On Sun, Aug 16, 2009 at 8:55 PM, Mike Collinson <mike at ayeltd.biz> wrote:
>At 03:55 PM 13/08/2009, Eugene Alvin Villar wrote:
>>Here's my two cents regarding this:
>>
>>I don't favor using addr:city, addr:village, is_in to specify where a POI is. Here are the cons:
>>
>>1. Duplication of info with admin borders (and potential mismatch issues)
>>2. Increased data size with respect to tags (which makes planet dumps larger)
>>
>>On the other hand, here are the pros:
>>
>>1. POIs are easier to filter by place than the alternative which is to do bounding polygon calculation, which is more computationally intensive. This calculation can be mitigated somewhat by doing pre-processing of the data just before the data will be used (e.g., as an additional step to making Garmin maps.)
>>2. Identifies where a POI is in the (hopefully temporary) lack of boundary data.
>>
>>Regardless, addr:street is essential since this is very hard to infer from the data without it.
>>
>>
>>Anybody else have other thoughts?
>
>In my own mapping, and having an interest in preparing OSM data for first-generation gazetteer and search software, I generally go for "the more the better", broadly for the reasons Eugene outlines.  Redundancy is heresy in database programming courses but I think there is an assumption that data is put in under strict rules and in a controlled environment. For us, I think redundancy (partial duplication but from different sources and methodologies) is actually a good thing ... later pruning is not impossible.  Perhaps in two or three years' time, boundary data and the software to easily process it will be widely available but for now, I say leave 'em in!
>
>Size of planet dumps. Yes, a concern, especially when you are trying to do a dial-up download, something the Europeans forget.  But POIs may number thousands in an area while the ways in the same area may have hundreds of thousands of nodes, especially if over-digitised. Taking into account all the XML tagging wrapping a node, the size of a POI is not that much bigger than a raw lat,lon node.  The size of planet dumps is going to get too big anyway, so I kind of see value in forcing the issue sooner rather than later.
>
>I have, by the way, now switched to using explicitly identified is_in:* tags, using the place= values where possible and user-defined values where they give some local benefit.
>
>is_in:country, is_in:state, is_in:city,  is_in:town ...
>is_in:island, is_in:sea
>is_in:valley, is_in:barangay, ...
>
>I am interested to see whether we can collect enough points to generate reasonable boundaries from points rather than the other way around.
>
>Just my thoughts!
>
>Mike
>
>
>
>_______________________________________________
>talk-ph mailing list
>talk-ph at openstreetmap.org
>http://lists.openstreetmap.org/listinfo/talk-ph
>
>
>
>
>-- 
>http://vaes9.codedgraphic.com