[talk-ph] bulk editing address info in POIs

Eugene Alvin Villar seav80 at gmail.com
Sun Aug 16 15:26:31 BST 2009


Well, after thinking about it, maybe using only addr:city (for both cities
and municipalities) is a good compromise.

Some Q&As on my point of view:

Q. Why is duplication bad?
A. Well, I come from a software engineering background and in designing
database systems, redundancies are not good as Mike has stated (there's
actually a whole academic subject on the topic of database normalization
just to remove every redundancy in the data). The trade-off, however, is
that look-up performance goes down as a result (e.g., finding all the POIs
in Makati is not as fast to do unless you did pre-processing). So sometimes,
if you know what you are doing, de-normalization (putting some redundancy
back) can speed things up.

Q. So can we add addr:city, etc.?
A. While adding these makes me cringe due to redundancy, I see the merit for
a compromise. My proposal is to only add addr:city and not addr:village,
addr:state, addr:country.

Q. Why not also add addr:state (for provinces) and addr:country?
A. Because I don't think making the data FULLY redundant is worth it considering
the trade-offs (see the pros and cons in my previous e-mail on this topic).
If a POI is tagged as addr:city=Makati, then it already implies that
addr:country=Philippines. It's possible that there is another Makati city
elsewhere in the world such that addr:country is needed for disambiguation
of a POI, but the POI's lat-long already does the disambiguation.

Q. Why not also add addr:village (for barangays)?
A. My thinking is that addr:city is enough to improve the look-up
performance. It is certainly computationally intensive to determine the
barangay, city/municipality, province of a POI by determining whether the
POI lies within a barangay/city/municipality/province's boundary polygon
(though there are plenty of ways to optimize this). But by specifying the
addr:city, the search space is now reduced by two orders of magnitude.
(Besides, at least for Metro Manila, barangays are really not used for
addressing information.)
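
To make the trade-off concrete, here is a minimal sketch in Python comparing
the two look-ups: filtering on the addr:city tag versus testing each POI
against a boundary polygon. Everything in it is an assumption for illustration
only: the POI dictionaries, the function names, and the rectangular "boundary"
are made up and are not real OSM data or a real Makati outline. The polygon
test is the standard ray-casting method, treating lon/lat as planar
coordinates, which is only a rough approximation that is acceptable for small
areas.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat), treated as planar coordinates


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pois_in_city_by_tag(pois: List[Dict], city: str) -> List[Dict]:
    """Cheap look-up: one string comparison per POI, no geometry needed."""
    return [p for p in pois if p["tags"].get("addr:city") == city]


def pois_in_city_by_geometry(pois: List[Dict], boundary: List[Point]) -> List[Dict]:
    """Costlier look-up: one polygon test per POI (cost grows with vertex count)."""
    return [p for p in pois if point_in_polygon((p["lon"], p["lat"]), boundary)]


# Hypothetical rectangular boundary and POIs (not real data).
BOUNDARY = [(121.00, 14.53), (121.06, 14.53), (121.06, 14.58), (121.00, 14.58)]
POIS = [
    {"name": "A", "lon": 121.02, "lat": 14.55, "tags": {"addr:city": "Makati"}},
    {"name": "B", "lon": 121.10, "lat": 14.60, "tags": {"addr:city": "Pasig"}},
    {"name": "C", "lon": 121.03, "lat": 14.56, "tags": {}},  # untagged POI
]

by_tag = [p["name"] for p in pois_in_city_by_tag(POIS, "Makati")]
by_geometry = [p["name"] for p in pois_in_city_by_geometry(POIS, BOUNDARY)]
print(by_tag, by_geometry)
```

In this made-up data set, POI "C" lies inside the boundary but carries no
addr:city tag, so only the geometric look-up finds it. That is the flip side
of the redundancy: the tag look-up is cheaper, but it is only as complete as
the tagging.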

Q. Why not tag POIs within municipalities using addr:town or
addr:municipality? The Karlsruhe schema allows for arbitrary addr:* tags.
A. I suggest using addr:city for both cities and municipalities only as a
convention. That way, when a municipality later becomes a city, there is no
need to change addr:municipality keys to addr:city.


Now here's an open issue: the is_in:* tags and addr:* tags overlap in
function. We should stick to one. The Karlsruhe schema (
http://wiki.openstreetmap.org/wiki/Proposed_features/House_numbers/Karlsruhe_Schema)
is silent on this but the Key:addr page (
http://wiki.openstreetmap.org/wiki/Key:addr) actually suggests using
is_in:*. I favor using is_in


Eugene / seav


On Sun, Aug 16, 2009 at 8:55 PM, Mike Collinson <mike at ayeltd.biz> wrote:

> At 03:55 PM 13/08/2009, Eugene Alvin Villar wrote:
> >Here's my two cents regarding this:
> >
> >I don't favor using addr:city, addr:village, is_in to specify where a POI
> is. Here are the cons:
> >
> >1. Duplication of info with admin borders (and potential mismatch issues)
> >2. Increased data size with respect to tags (which makes planet dumps
> larger)
> >
> >On the other hand, here are the pros:
> >
> >1. POIs are easier to filter by place than the alternative which is to do
> bounding polygon calculation, which is more computationally intensive. This
> calculation can be mitigated somewhat by doing pre-processing of the data
> just before the data will be used (e.g., as an additional step to making
> Garmin maps.)
> >2. Identifies where a POI is in the (hopefully temporary) lack of boundary
> data.
> >
> >Regardless, addr:street is essential since this is very hard to infer from
> the data without it.
> >
> >
> >Anybody else have other thoughts?
>
> In my own mapping and having an interest in preparing OSM data for first
> generation gazetteer and search software, I generally go for "the more the
> better" broadly for the reasons Eugene outlines. Redundancy is heresy in
> database programming courses but I think there is an assumption that data is
> put in under strict rules and in a controlled environment. For us, I think
> redundancy (partial duplication but from different sources and
> methodologies) is actually a good thing ... later pruning is not
> impossible. Perhaps in two or three years' time, boundary data and the
> software to easily process it will be highly available but for now, I say
> leave 'em in!
>
> Size of planet dumps. Yes, a concern, especially when you are trying to do
> a dial-up download, something the Europeans forget.  But POIs may number
> thousands in an area but the ways in the same area may have hundreds of
> thousands of nodes, especially if over-digitised. Taking into account all
> the XML tagging wrapping a node, the size of a POI is not that much bigger
> than a raw lat,lon node. The size of planet dumps is going to get too big
> anyway, so I kind of see value in forcing the issue sooner, not later.
>
> I have, by the way, now switched to using explicitly identified is_in:*
> tags using the place= values where possible and user defined value where it
> gives some local benefit.
>
> is_in:country, is_in:state, is_in:city,  is_in:town ...
> is_in:island, is_in:sea
> is_in:valley, is_in:barangay, ...
>
> I am interested to see whether we can collect enough points to generate
> reasonable boundaries from points rather than the other way around.
>
> Just my thoughts!
>
> Mike
>
>
>



-- 
http://vaes9.codedgraphic.com