<html>

<body>

I think Eugene and I have reasonably covered both sides of the

argument.  I think what I am trying to say is that de-normalization

(putting some redundancy back) is good for speeding  things up both

for getting data out AND for putting it in. I may not have a

copyright-free boundary (or just be too lazy to get it) but I do know

that such and such a POI is within such and such a place, perhaps down to

barangay level, perhaps not. If everyone does the same thing, we rapidly

establish de facto boundaries.<br><br>

So, where I think we come together is:<br><br>

- Normalization is a good goal for the long term as our data becomes more

complete and our software tools more sophisticated.<br><br>

- FULLY redundant data is not necessary.  I've experimented by

simply putting the next tier up in an is_in:* tag. For example, if it is

a barangay, just put the municipality. It is an extra step, but not too

difficult a step for software to then look for the municipality and see

what is_in:* tags it has and then repeat the process - effectively the

same normalization as in a relational database.  I am slowly

writing search/gazetteer software to do that and put the results in a

separate database which can be regenerated on each new planet

download.  <br><br>
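As a rough sketch only (this is not the actual gazetteer software), the tier-by-tier lookup described above could look something like this in Python. The place names, the PLACES dictionary and the resolve_hierarchy function are all made up for illustration, and it assumes each place carries a single is_in:* tag naming its next tier up:<br><br>

<pre>
```python
# Hypothetical sample data: each place stores only its next tier up.
PLACES = {
    "Barangay A": {"place": "village", "is_in:municipality": "Municipality B"},
    "Municipality B": {"place": "town", "is_in:province": "Sorsogon"},
    "Sorsogon": {"place": "state", "is_in:country": "Philippines"},
}


def resolve_hierarchy(name, places):
    """Follow is_in:* tags upward and return a list of (tier, name) pairs."""
    chain = []
    seen = {name}
    current = name
    while current in places:
        # Collect the is_in:* tags of the current place as (tier, parent) pairs.
        parents = [
            (key.split(":", 1)[1], value)
            for key, value in places[current].items()
            if key.startswith("is_in:")
        ]
        if not parents:
            break
        tier, parent = parents[0]
        if parent in seen:
            # Guard against circular tagging in crowd-sourced data.
            break
        chain.append((tier, parent))
        seen.add(parent)
        current = parent
    return chain


if __name__ == "__main__":
    for tier, parent in resolve_hierarchy("Barangay A", PLACES):
        print(tier, parent)
```
</pre>

For the sample data, resolve_hierarchy("Barangay A", PLACES) walks up through the municipality and the province to the country. The seen set is only there so that a circular is_in:* chain cannot loop forever.<br><br>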

In our crowd sourcing environment there is always a danger that the

municipality tag is missing, deleted, or spelt differently.  So I

think there is still value in what I call "seed points", i.e.

randomly putting more information than necessary on some tags.  So,

for example if the municipality tag is missing but a barangay mentions

that it is in Sorsogon Province, then I've found it is possible to

generate a tag for the municipality and even locate it by creating a

simple rectangle around the barangay tags.<br><br>
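A minimal sketch of that rectangle idea in Python, assuming each barangay is a point with a lat, a lon and a dictionary of tags. The coordinates, names and helper functions below are invented for illustration:<br><br>

<pre>
```python
# Hypothetical barangay points; only one of them is a "seed point"
# that also names the province.
BARANGAYS = [
    {"lat": 12.95, "lon": 123.98,
     "tags": {"is_in:municipality": "Municipality B"}},
    {"lat": 12.99, "lon": 124.02,
     "tags": {"is_in:municipality": "Municipality B",
              "is_in:province": "Sorsogon"}},
    {"lat": 12.97, "lon": 124.05,
     "tags": {"is_in:municipality": "Municipality B"}},
]


def bounding_rectangle(points):
    """Return (min_lat, min_lon, max_lat, max_lon) for (lat, lon) points."""
    if not points:
        raise ValueError("need at least one point")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats), min(lons), max(lats), max(lons))


def infer_missing_municipalities(barangays):
    """Group barangays by is_in:municipality and box each group.

    The province is taken from the first barangay in the group that
    mentions one, and stays None if no barangay does.
    """
    groups = {}
    for barangay in barangays:
        tags = barangay["tags"]
        name = tags.get("is_in:municipality")
        if name is None:
            continue
        entry = groups.setdefault(name, {"points": [], "province": None})
        entry["points"].append((barangay["lat"], barangay["lon"]))
        if entry["province"] is None:
            entry["province"] = tags.get("is_in:province")
    return {
        name: {
            "bbox": bounding_rectangle(entry["points"]),
            "is_in:province": entry["province"],
        }
        for name, entry in groups.items()
    }


if __name__ == "__main__":
    print(infer_missing_municipalities(BARANGAYS))
```
</pre>

A rectangle is of course only a crude stand-in for a real boundary, but it is enough to place the generated municipality on the map.<br><br>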

Sorry if I am waffling on a bit but this is an interesting subject for me

and I value the chance for discussion!<br><br>

Mike<br><br>

PS I am not sure exactly how this fits into this discussion but we also

have to remember the psycho-geographic. Many cities like Sydney and,

formerly, London don't actually officially exist. When people search for

something "in Manila" they often will not mean Mayor Lim's

kingdom but the built up place that vaguely corresponds to Metro Manila

or the NCR.<br><br>

<br>

At 04:26 PM 16/08/2009, Eugene Alvin Villar wrote:<br>

<blockquote type=cite class=cite cite="">Well, after thinking about it,

maybe using only addr:city (for both cities and municipalities) is a good

compromise.<br><br>

Some Q&As on my point of view:<br><br>

Q. Why is duplication bad?<br>

A. Well, I come from a software engineering background and in designing

database systems, redundancies are not good as Mike has stated (there's

actually a whole academic subject on the topic of database normalization

just to remove every redundancy in the data). The trade-off, however, is

that look-up performance goes down as a result (e.g., finding all the

POIs in Makati is not as fast to do unless you did pre-processing). So

sometimes, if you know what you are doing, de-normalization (putting some

redundancy back) can speed things up.<br>

<br>

Q. So can we add addr:city, etc.?<br>

A. While adding these makes me cringe due to redundancy, I see the merit

for a compromise. My proposal is to only add addr:city and not

addr:village, addr:state, addr:country.<br><br>

Q. Why not add also addr:state (for provinces) and addr:country?<br>

A. Because I don't think making the data FULLY redundant is worthwhile

considering the trade-offs (see the pros and cons in my previous e-mail

on this topic). If a POI is tagged as addr:city=Makati, then it already

implies that addr:country=Philippines. It's possible that there is

another Makati city elsewhere in the world such that addr:country is

needed for disambiguation of a POI, but the POI's lat-long already does

the disambiguation.<br><br>

Q. Why not add also addr:village (for barangays)?<br>

A. My thinking is that addr:city is enough to improve the look-up

performance. It is certainly computationally intensive to determine the

barangay, city/municipality, province of a POI by determining whether the

POI lies within a barangay/city/municipality/province's boundary polygon

(though there are plenty of ways to optimize this). But by specifying the

addr:city, the search space is now reduced by two orders of magnitude.

(Besides, at least for Metro Manila, barangays are really not used for

addressing information.)<br><br>

Q. Why not tag POIs within municipalities using addr:town or

addr:municipality? The Karlsruhe schema allows for arbitrary addr:*

tags.<br>

A. I suggest using addr:city for both cities and municipalities only as a

convention. That way, when a municipality later becomes a city, there is

no need to change addr:municipality keys to addr:city.<br><br>
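To illustrate the addr:city point from the barangay answer above, here is a rough Python sketch, not anyone's actual implementation. It assumes POIs are points with a tags dictionary and that a barangay boundary is a simple list of (lat, lon) vertices; the coordinates and function names are made up, and ray casting is just one common point-in-polygon test:<br><br>

<pre>
```python
# Hypothetical square boundary and POIs, for illustration only.
SQUARE = [(14.55, 121.02), (14.55, 121.03), (14.56, 121.03), (14.56, 121.02)]
POIS = [
    {"name": "A", "lat": 14.555, "lon": 121.025, "tags": {"addr:city": "Makati"}},
    {"name": "B", "lat": 14.57, "lon": 121.025, "tags": {"addr:city": "Makati"}},
    {"name": "C", "lat": 14.555, "lon": 121.025, "tags": {"addr:city": "Pasig"}},
]


def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    count = len(polygon)
    for i in range(count):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % count]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if cross_lon > lon:
                inside = not inside
    return inside


def pois_in_barangay(pois, city, barangay_polygon):
    """Cheap tag filter first, then the expensive polygon test on what is left."""
    candidates = [p for p in pois if p["tags"].get("addr:city") == city]
    return [
        p for p in candidates
        if point_in_polygon(p["lat"], p["lon"], barangay_polygon)
    ]


if __name__ == "__main__":
    print([p["name"] for p in pois_in_barangay(POIS, "Makati", SQUARE)])
```
</pre>

Only POIs already tagged addr:city=Makati reach the polygon test, which is where the saving comes from.<br><br>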

<br>

Now here's a question: the is_in:* tags and addr:* tags overlap

in function. We should stick to one. The Karlsruhe schema

(<a href="http://wiki.openstreetmap.org/wiki/Proposed_features/House_numbers/Karlsruhe_Schema">

http://wiki.openstreetmap.org/wiki/Proposed_features/House_numbers/Karlsruhe_Schema</a>

 ) is silent on this but the Key:addr page

(<a href="http://wiki.openstreetmap.org/wiki/Key:addr">

http://wiki.openstreetmap.org/wiki/Key:addr</a>) actually suggests using

is_in:*. I favor using is_in.<br><br>

<br>

Eugene / seav<br><br>

<br>

On Sun, Aug 16, 2009 at 8:55 PM, Mike Collinson

&lt;<a href="mailto:mike@ayeltd.biz">mike@ayeltd.biz</a>&gt; wrote:<br>

<dl>

<dd>At 03:55 PM 13/08/2009, Eugene Alvin Villar wrote:<br>

<dd>>Here's my two cents regarding this:<br>

<dd>><br>

<dd>>I don't favor using addr:city, addr:village, is_in to specify

where a POI is. Here are the cons:<br>

<dd>><br>

<dd>>1. Duplication of info with admin borders (and potential mismatch

issues)<br>

<dd>>2. Increased data size with respect to tags (which makes planet

dumps larger)<br>

<dd>><br>

<dd>>On the other hand, here are the pros:<br>

<dd>><br>

<dd>>1. POIs are easier to filter by place than the alternative which

is to do bounding polygon calculation, which is more computationally

intensive. This calculation can be mitigated somewhat by doing

pre-processing of the data just before the data will be used (e.g., as an

additional step to making Garmin maps.)<br>

<dd>>2. Identifies where a POI is in the (hopefully temporary) lack of

boundary data.<br>

<dd>><br>

<dd>>Regardless, addr:street is essential since this is very hard to

infer from the data without it.<br>

<dd>><br>

<dd>><br>

<dd>>Anybody else have other thoughts?<br><br>

<dd>In my own mapping and having an interest in preparing OSM data for

first-generation gazetteer and search software, I generally go for

"the more the better" broadly for the reasons Eugene

outlines.  Redundancy is heresy in database programming courses but

I think there is an assumption that data is put in under strict rules and

in a  controlled environment. For us, I think redundancy (partial

duplication but from different sources and methodologies) is actually a

good thing ... later pruning is not impossible.  Perhaps in two or

three years' time, boundary data and the software to easily process it

will be highly available but for now, I say leave 'em in!<br><br>

<dd>Size of planet dumps. Yes, a concern, especially when you are trying

to do a dial-up download, something the Europeans forget.  But POIs

may number thousands in an area but the ways in the same area may have

hundreds of thousands of nodes, especially if over-digitised. Taking into

account all the XML tagging wrapping a node, the size of a POI is not

that much bigger than a  raw lat,lon node.  The size of planet

dumps is going to get too big anyway; I kind of see value in forcing the

issue sooner rather than later.<br><br>

<dd>I have, by the way, now switched to using explicitly identified

is_in:* tags using the place= values where possible and user-defined

values where they give some local benefit.<br><br>

<dd>is_in:country, is_in:state, is_in:city,  is_in:town ...<br>

<dd>is_in:island, is_in:sea<br>

<dd>is_in:valley, is_in:barangay, ...<br><br>

<dd>I am interested to see whether we can collect enough points to

generate reasonable boundaries from points rather than the other way

around.<br><br>

<dd>Just my thoughts!<br>

<font color="#888888"><br>

<dd>Mike<br>

</font><br><br>

<br>


</dl><br><br>

<br>

-- <br>

<a href="http://vaes9.codedgraphic.com">http://vaes9.codedgraphic.com</a>

</blockquote></body>

</html>