[talk-ph] bulk editing address info in POIs

maning sambale emmanuel.sambale at gmail.com
Mon Aug 17 00:13:38 BST 2009


To put it simply for addressing:

addr:housenumber
addr:street
addr:city

and don't delete any existing is_in tags (just leave em there)

Did I get it right?
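
A minimal sketch of that rule, assuming a POI's tags are held in a plain
Python dict (the function name and the sample POI are made up for
illustration, not taken from any OSM editor or library): fill in the three
addr:* keys and leave whatever is_in tags are already there.

```python
def add_address_tags(tags, housenumber=None, street=None, city=None):
    """Return a copy of a POI's tags with the addr:* keys filled in.

    Existing keys, including any is_in or is_in:* tags, are left untouched.
    """
    updated = dict(tags)  # work on a copy so the original tags survive
    for key, value in (
        ("addr:housenumber", housenumber),
        ("addr:street", street),
        ("addr:city", city),
    ):
        if value:  # only add what we actually know
            updated[key] = value
    return updated


# Hypothetical POI, for illustration only.
poi = {"amenity": "cafe", "name": "Example Cafe", "is_in": "Makati, Metro Manila"}
edited = add_address_tags(poi, housenumber="123", street="Example Street", city="Makati")
```

Unknown values are simply skipped, so a POI with no house number still gets
addr:street and addr:city.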

> (there's actually a whole academic subject on the topic of database normalization
> just to remove every redundancy in the data)
You are correct, I also deal with these problems every day.  BUT, I
must say most of what openstreetmap is doing defies what we usually
call normal or standard.  For one, if OSM followed some ISO/OGC/GIS
standard thingie, we might still be discussing specifications these days
:)

Happy Monday! A lot of work this week!



> I think Eugene and I have reasonably covered both sides of the argument.  I
> think what I am trying to say is that de-normalization (putting some
> redundancy back) is good for speeding things up both for getting data out
> AND for putting it in. I may not have a copyright-free boundary (or just be
> too lazy to get it) but I do know that such and such a POI is within such
> and such a place, perhaps down to barangay level, perhaps not. If everyone
> does the same thing, we rapidly establish de-facto boundaries.
>
> So, where I think we come together is:
>
> - Normalization is a good goal for the long term as our data is more
> complete and our software tools more sophisticated.
>
> - FULLY redundant data is not necessary.  I've experimented by simply
> putting the next tier up in an is_in:* tag. For example, if it is a
> barangay, just put the municipality. It is an extra step, but not too
> difficult a step for software to then look for the municipality and see what
> is_in:* tags it has and then repeat the process - effectively the
> normalization as effected in a relational database.  I am slowly writing
> search/gazetteer software to do that and put the results in a separate
> database which can be regenerated on each new planet download.
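
The "next tier up" lookup described above can be sketched as follows. This is
only an illustration under stated assumptions, not Mike's actual software:
each place is keyed by name and maps to the single parent named in its
is_in:* tag, and the barangay and municipality names are placeholders.

```python
def resolve_chain(place, parents, max_depth=10):
    """Follow 'next tier up' links from a place to the top of the hierarchy.

    `parents` maps a place name to the one parent named in its is_in:* tag.
    Stops at a place with no recorded parent, on a cycle, or at max_depth.
    """
    chain = [place]
    seen = {place}
    current = place
    for _ in range(max_depth):
        parent = parents.get(current)
        if parent is None or parent in seen:
            break  # top of the hierarchy reached, or a tagging loop
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


# Placeholder names; only the province and country are real places.
parents = {
    "Barangay X": "Municipality Y",
    "Municipality Y": "Sorsogon",
    "Sorsogon": "Philippines",
}
chain = resolve_chain("Barangay X", parents)
```

Keying by name alone is a simplification: two barangays with the same name
in different municipalities would collide, so real software would need a
better key.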
>
> In our crowd sourcing environment there is always a danger that the
> municipality tag is missing, deleted, or spelt differently.  So I think there
> is still value in what I call "seed points", i.e. randomly putting more
> than necessary information on some tags.  So, for example if the
> municipality tag is missing but a barangay mentions that it is in Sorsogon
> Province, then I've found it is possible to generate a tag for the
> municipality and even locate it by creating a simple rectangle around the
> barangay tags.
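
A rough sketch of the "simple rectangle around the barangay tags" idea,
assuming each barangay is a point with a lat, a lon and a tag naming its
municipality. The key name is_in:municipality, the place name and the
coordinates are all invented for the example.

```python
from collections import defaultdict


def rectangles_by_municipality(barangays):
    """Group barangay points by municipality tag and box each group.

    Returns {municipality: (min_lat, min_lon, max_lat, max_lon)}.
    """
    groups = defaultdict(list)
    for b in barangays:
        municipality = b["tags"].get("is_in:municipality")
        if municipality:  # untagged barangays cannot contribute
            groups[municipality].append((b["lat"], b["lon"]))

    boxes = {}
    for municipality, points in groups.items():
        lats = [lat for lat, _ in points]
        lons = [lon for _, lon in points]
        boxes[municipality] = (min(lats), min(lons), max(lats), max(lons))
    return boxes


# Made-up coordinates and names, for illustration only.
barangays = [
    {"lat": 12.60, "lon": 123.90, "tags": {"is_in:municipality": "Municipality Y"}},
    {"lat": 12.70, "lon": 124.00, "tags": {"is_in:municipality": "Municipality Y"}},
    {"lat": 12.65, "lon": 123.95, "tags": {}},
]
boxes = rectangles_by_municipality(barangays)
```

The rectangle is only as good as the points behind it: a municipality with a
single tagged barangay collapses to a point.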
>
> Sorry if I am waffling on a bit but this is an interesting subject for me
> and I value the chance for discussion!
>
> Mike
>
> PS I am not sure exactly how this fits into this discussion but we also have
> to remember the psycho-geographic. Many cities like Sydney and, formerly,
> London don't actually officially exist. When people search for something "in
> Manila" they often will not mean Mayor Lim's kingdom but the built up place
> that vaguely corresponds to Metro Manila or the NCR.
>
>
> At 04:26 PM 16/08/2009, Eugene Alvin Villar wrote:
>
> Well, after thinking about it, maybe using only addr:city (for both cities
> and municipalities) is a good compromise.
>
> Some Q&As on my point of view:
>
> Q. Why is duplication bad?
> A. Well, I come from a software engineering background and in designing
> database systems, redundancies are not good as Mike has stated (there's
> actually a whole academic subject on the topic of database normalization
> just to remove every redundancy in the data). The trade-off, however, is
> that look-up performance goes down as a result (e.g., finding all the POIs
> in Makati is not as fast to do unless you did pre-processing). So sometimes,
> if you know what you are doing, de-normalization (putting some redundancy
> back) can speed things up.
>
> Q. So can we add addr:city, etc.?
> A. While adding these makes me cringe due to redundancy, I see the merit for
> a compromise. My proposal is to only add addr:city and not addr:village,
> addr:state, addr:country.
>
> Q. Why not add also addr:state (for provinces) and addr:country?
> A. Because I think making the data FULLY redundant ignores the
> trade-offs (see the pros and cons in my previous e-mail on this topic).
> If a POI is tagged as addr:city=Makati, then it already implies that
> addr:country=Philippines. It's possible that there is another Makati city
> elsewhere in the world such that addr:country is needed for disambiguation
> of a POI, but the POI's lat-long already does the disambiguation.
>
> Q. Why not add also addr:village (for barangays)?
> A. My thinking is that addr:city is enough to reduce the look-up
> cost. It is certainly computationally intensive to determine the
> barangay, city/municipality, province of a POI by determining whether the
> POI lies within a barangay/city/municipality/province's boundary polygon
> (though there are plenty of ways to optimize this). But by specifying the
> addr:city, the search space is now reduced by two orders of magnitude.
> (Besides, at least for Metro Manila, barangays are really not used for
> addressing information.)
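
To make the trade-off concrete, here is a small sketch comparing the two
look-ups. The boundary test uses ray casting, which is one common
point-in-polygon method (the thread does not say which method anyone has in
mind), and the square "boundary" and the POIs are toy data in arbitrary
coordinates.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon?

    `polygon` is a list of (lon, lat) vertices; the last vertex is joined
    back to the first. Points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:  # crossing lies to the east of the point
                inside = not inside
    return inside


def pois_in_city_by_tag(pois, city):
    """Cheap look-up: one dict access per POI."""
    return [p for p in pois if p["tags"].get("addr:city") == city]


def pois_in_city_by_boundary(pois, boundary):
    """Expensive look-up: every POI is tested against every boundary edge."""
    return [p for p in pois if point_in_polygon(p["lon"], p["lat"], boundary)]


# Toy square boundary and POIs, for illustration only.
boundary = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.0)]
pois = [
    {"name": "A", "lon": 1.0, "lat": 1.0, "tags": {"addr:city": "Makati"}},
    {"name": "B", "lon": 3.0, "lat": 1.0, "tags": {"addr:city": "Pasig"}},
    {"name": "C", "lon": 0.5, "lat": 1.5, "tags": {}},
]
by_tag = [p["name"] for p in pois_in_city_by_tag(pois, "Makati")]
by_boundary = [p["name"] for p in pois_in_city_by_boundary(pois, boundary)]
```

POI "C" shows the flip side: it sits inside the boundary but has no
addr:city, so only the boundary test finds it.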
>
> Q. Why not tag POIs within municipalities using addr:town or
> addr:municipality? The Karlsruhe schema allows for arbitrary addr:* tags.
> A. I suggest using addr:city for both cities and municipalities only as a
> convention. That way, when a municipality later becomes a city, there is no
> need to change addr:municipality keys to addr:city.
>
>
> Now here's a question: the is_in:* tags and addr:* tags overlap in
> function. We should stick to one. The Karlsruhe schema (
> http://wiki.openstreetmap.org/wiki/Proposed_features/House_numbers/Karlsruhe_Schema
> ) is silent on this but the Key:addr page (
> http://wiki.openstreetmap.org/wiki/Key:addr) actually suggests using
> is_in:*. I favor using is_in
>
>
> Eugene / seav
>
>
> On Sun, Aug 16, 2009 at 8:55 PM, Mike Collinson <mike at ayeltd.biz> wrote:
> At 03:55 PM 13/08/2009, Eugene Alvin Villar wrote:
>>Here's my two cents regarding this:
>>
>>I don't favor using addr:city, addr:village, is_in to specify where a POI
>> is. Here are the cons:
>>
>>1. Duplication of info with admin borders (and potential mismatch issues)
>>2. Increased data size with respect to tags (which makes planet dumps
>> larger)
>>
>>On the other hand, here are the pros:
>>
>>1. POIs are easier to filter by place than the alternative which is to do
>> bounding polygon calculation, which is more computationally intensive. This
>> calculation can be mitigated somewhat by doing pre-processing of the data
>> just before the data will be used (e.g., as an additional step to making
>> Garmin maps.)
>>2. Identifies where a POI is in the (hopefully temporary) absence of boundary
>> data.
>>
>>Regardless, addr:street is essential since this is very hard to infer from
>> the data without it.
>>
>>
>>Anybody else have other thoughts?
>
> In my own mapping and having an interest in preparing OSM data for first
> generation gazetteer and search software, I generally go for "the more the
> better" broadly for the reasons Eugene outlines.  Redundancy is heresy in
> database programming courses but I think there is an assumption that data is
> put in under strict rules and in a controlled environment. For us, I think
> redundancy (partial duplication but from different sources and
> methodologies) is actually a good thing ... later pruning is not
> impossible.  Perhaps in two or three years' time, boundary data and the
> software to easily process it will be highly available but for now, I say
> leave 'em in!
>
> Size of planet dumps. Yes, a concern, especially when you are trying to do a
> dial-up download, something the Europeans forget.  But POIs may number
> thousands in an area but the ways in the same area may have hundreds of
> thousands of nodes, especially if over-digitised. Taking into account all
> the XML tagging wrapping a node, the size of a POI is not that much bigger
> than a raw lat,lon node.  The size of planet dumps is going to get too big
> anyway; I kind of see value in forcing the issue sooner rather than later.
>
> I have, by the way, now switched to using explicitly identified is_in:* tags
> using the place= values where possible and user-defined values where they give
> some local benefit.
>
> is_in:country, is_in:state, is_in:city,  is_in:town ...
> is_in:island, is_in:sea
> is_in:valley, is_in:barangay, ...
>
> I am interested to see whether we can collect enough points to generate
> reasonable boundaries from points rather than the other way around.
>
> Just my thoughts!
>
> Mike
>
>
>
> _______________________________________________
> talk-ph mailing list
> talk-ph at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/talk-ph
>
>
>
>
> --
> http://vaes9.codedgraphic.com
>



-- 
cheers,
maning
------------------------------------------------------
"Freedom is still the most radical idea of all" -N.Branden
wiki: http://esambale.wikispaces.com/
blog: http://epsg4253.wordpress.com/
------------------------------------------------------



