[Imports] Proposal for proper OSM import solution (OpenMetaMap)

Fri Aug 19 09:51:39 UTC 2011

On 18.08.2011, at 19:22, Serge Wroclawski wrote:

> On Thu, Aug 18, 2011 at 11:31 AM, Jaak Laineste <jaak at nutiteq.com> wrote:
> 
>>> 2. This approach implies that external data sets are correct.
>>> 
>>> Underlying this approach is an assumption that we can rely on other
>>> datasets accuracy. Sadly this is not the case. As I work with more
>>> datasets and compare them to on the ground surveying, I find that many
>>> government datasets are either wrong, or out of date.
> 
>> I do not really agree with this implication, it does not assume that external dataset is correct. The process of linking (resolving conflations) would be actually same as with normal import: somebody has to review all data overlappings/conflicts/duplicates and solve them.
> 
>> It assumes that using external data is better than having nothing - just the same assumption you have with any external data usage and import.
> 
> For those of us who've been in the project for more than a year or two
> the jury is still out on this.
> 
> It's easy to say "Well isn't some data better than no data?" but then
> we see places where imports have taken place also have low community
> uptake. These are of course correlations and we cannot automatically
> assume causation, but we can certainly raise these as concerns.

I already list the reasons why imports are generally bad on the OMM wiki page; OMM improves some of them, but not all. I know the general issues with imports, and generally agree with you. Whether to use a specific external dataset is a topic for another discussion. My starting point here comes after the decision that a dataset should be used in the OSM map, so the question is how to use it.

My friends have told me that they don't dare update my home town in OSM because there is already a detailed import there (done by me). I see three solutions here:
a) ban external datasets
b) use them in a smart way, like MetaMapping
c) just import in the old way

I believe that a total ban is not realistic, and in some cases it is unreasonable and even technically impossible (say, admin borders). Therefore I propose to do it in a smarter way than just importing. Currently there is no technical solution for the smart way. Let's create it.

I can also see some advantages of a dumb import compared to linking:
- If the import is small, you can fool the community into thinking the data was crowd-sourced. Maybe motivation drops a bit less, as the community embraces the data more.
- Updating is a bit simpler. It is just like ignoring relations: it makes editing easier, but can also harm the database. OMM links are really Relations in essence, just external relations.

> 
>> So after first linking round you would have correction of external data, at least as much as it is possible then.
> 
> If you have the corrections of the external data in OSM, you might as
> well have the data in OSM in the first place.

There would be several advantages to linking, even if you have to correct the data.

> 
>> But then in later days it could happen that the data what was ok in initial linking will be changed to something else (worse). Here you are right - external data provider can do harm to our data. I would say we assume that the external data provider works in the direction of making data better, not worse. In other words: it is ok to have bad data in the beginning, but it is not ok if the data modifications are in wrong direction.
> 
> The problem isn't that external datasets get worse, the problem is
> that external datasets make it hard to see what's missing, or worse,
> what's wrong.

A little, but not a lot harder. From the editor's point of view you now have two choices in JOSM: (1) download data and (2) GPS tracks for reference. You will have one more option (perhaps selected by default): (3) external data. With (1) and (3) selected you have just as good an overview of all the data as after a dumb import. You see, for example, that the external dataset has 3 nodes (POIs) in the wrong place and you just drag them to the correct location. On saving, JOSM asks where to save the changes (now you have (1) file and (2) OSM; two would be added: (3) OMM and (4) the specific external dataset). Maybe the OMM option should be turned on automatically when you save to OSM, to make things simpler. If someone does not want to contribute to OMM, he/she can save to OSM only, which creates an unlinked copy of the data. The link will eventually be created later by another user (or an OMM validator bot) who discovers the duplicate (the JOSM Validator already warns about duplicates), but it will be of a different type than for normally linked data.

The data validator will need to handle broken-link cases, e.g. when the linked object is removed in the external database, re-defined, given a new ID, or just fixed. The validator then discovers dead links and asks the user to do something about them (remove the object from OSM as well, remove the link, or re-link to another object). The validator also finds editing/merge conflicts. Nothing radically new: you already have similar cases with Relations, and when two people edit the same node in the OSM database.
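
To make the dead-link check concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the Link record, its field names and the function find_broken_links are hypothetical and not part of any existing OMM or JOSM code. A Link counts as broken when either end no longer exists in its database.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical OMM link between one OSM object and one external object."""
    osm_id: str       # e.g. "node/123" (assumed ID format)
    external_id: str  # ID of the object in the external dataset


def find_broken_links(
    links: Iterable[Link],
    osm_ids: Set[str],
    external_ids: Set[str],
) -> List[Tuple[Link, str]]:
    """Return every link whose OSM end or external end no longer exists."""
    problems = []
    for link in links:
        osm_ok = link.osm_id in osm_ids
        ext_ok = link.external_id in external_ids
        if osm_ok and ext_ok:
            continue  # both ends still exist, nothing to report
        if not osm_ok and not ext_ok:
            reason = "both ends missing"
        elif not ext_ok:
            reason = "external object missing"
        else:
            reason = "OSM object missing"
        problems.append((link, reason))
    return problems
```

A real validator would then ask the user what to do with each reported link (remove the OSM object, remove the link, or re-link), which this sketch leaves out.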

>> Also there will be always problem of added new data - maintainer of database links has to do occasional reviews and correct this also. So with usual import you have to fix the data once. There is no bulk update possible so you do not need to worry about later updates. Now when we have later updates, maintainer has to start taking care about it also. More gain, more pain.
> 
> I'm sorry, while your English is far better than my Estonian, I do not
> understand this paragraph. Can you rephrase?

I mean that with a dumb import you import and fix the data once, and you are done. With data linking you do basically the same. But because you will have a continuous stream of new data (later changes), you are not done after one round of fixing; you have to take care of your data all the time. That is something you are expected to do anyway, since in OSM you always have the community-created data stream.

> 
>> Actually I'm afraid that most external datasources will be rather static (just OSM files). This way there is no risk that external dataset will be suddenly damaged. There would be no benefit of later updates, but even then there is advantage of MetaMap database - you keep the datasets clean and separated.
> 
> The key value proposition of external datasets is that they could be
> updated by external entities (think distributed version control). If
> you think this isn't the case, or is not the case you're designing
> around, then I see no benefit of using this technique vs improving our
> conflation tools inside OSM, which is something we need today!

The key benefit is to have a clear border between different data sources, especially between community-generated original data and all the external data sources. Why this is good:
1. If for any reason you need to remove the external data, you can. With dumb imports this is very complicated.
2. You can then have alternative sources for the same objects. With the current OSM model you could in principle use special Relations (if there were this kind of relation). The world is a complicated place and our map database is one subjective model of it; other people may have slightly different alternative models. An example would be a different scale/abstraction level: in a middle-scale map (like OSM) you have lines for road centerlines, but in really detailed city plans you use lines for street borders.

>>> 
> 
>> I assume here that often usage of external datasets is good and reasonable, and in many cases unavoidable (admin borders, shoreline and other samples).
> 
> This sentence has two statements:
> 
> 1. You assume that imports are often unavoidable.
> 
> 2. You assume that often the imports are good and reasonable.
> 
> 1 isn't true. We see lots of imports of data that could be collected
> manually. TIGER could have been done manually given time. GNIS could
> have likely been done manually, and even Corine could have been done
> manually. OSM took shortcuts. That doesn't mean they were bad, but
> they weren't unavoidable. And if you look at the datasets users plop
> in most often, without discussing with the community, that could have
> been collected manually. Again, doesn't mean it's bad, but it's
> certainly avoidable.
> 
> 2 isn't true at all. In fact, we have tons of problems due to imports.
> Imports are hard to get right (I'll address more technical issues
> later on in this mail). We have had to revert changesets, we've had to
> fix problems. I spent a lot of time fixing TIGER data, as do many US
> mappers. That's time we could be spending mapping, we spend fixing.

You are right in your arguments; the bad term here is "often", which is very subjective. Please read it as "sometimes". I assume that using external data is sometimes good and reasonable, even unavoidable. I do not know the details of TIGER and GNIS, so I cannot comment. I have spent several long weekend days and nights fixing Corine data and matching it with the shoreline (and we have quite a lot of shoreline here in Estonia, per capita one of the top countries in the world). Later I tried to maintain it and match it with other data, and I discovered that this is really impossible. When others have asked me how to do a Corine import properly, I have tried to persuade them not to import it at all, and just to handpick some polygons for a specific area if they fit. I have also done an import of our official administrative areas, which is now outdated, and I have no good idea how to find out what has changed and how to fix it properly. The key issue with Corine, at least here, was that its scale is too coarse, so it is really hard to match it with other data of higher accuracy. Perhaps the proper approach for it would have been cherry-picking, and MetaMapping does not help you there. But for admin areas OMM would be a perfect tool: the data cannot be crowd-sourced and it is well maintained by external sources. It also has good official external object IDs, and it is not too much conflated with other objects in the database.

> 
>> And there is always risk that a mapper finds from Internet site called Google Maps and discovers that "the map" is already there and complete :)
> 
> Is it? If that were true, Google wouldn't have accidentally used OSM
> on at least one (but I think I remember two) occasions. There are
> places where OSM is of higher quality than Google. We just aren't as
> good consistently across the globe.

I'm not sure the key point of OSM is really to have a more detailed map than the "competitors", but that is another discussion which is not really relevant here.

> 
>>> 4. It assumes OSM object IDs remain constant.
>>> 
>>> OSM object IDs change. They don't change a lot, but they do change,
>>> and you can't force users to jump through hoops to preserve them (as
>>> we've seen people propose).
>> 
>> Yes, it assumes that IDs do not change. This is most important. Can you explain more why and how OSM object IDs change? I've heard it too, but to analyze cases in more details I'd need to know the details.
> 
> They change because people delete things, and add things, and move
> things around.
> 
> A simple example is that often I'll see a POI node, and I'll go ahead
> and draw the building outline and put the data on the building. I draw
> the building and delete the node.

Good case. Do you just delete the node without checking whether it has useful tags and relations which need to be copied to the polygon? If so, that would be a bit stupid, and you are right: you would create an orphaned Link in OMM. I would check the tags of the node, and if I found something there I would use the JOSM "copy tags" feature, which would also copy Relations, including the OMM Link, to the new object. So changing an ID this way is solvable. If someone really just deletes the object without carrying the Link over, then on saving in JOSM the Validator will give the warning "you try to save broken data: orphaned Link", and you can fix it. The options would be relinking to the new object, or changing the Link type to a special "deleted in OSM" value (the default). If you just save, there will be a broken link which will be handled by other mappers (or by a bot if possible). Even if you simply delete the node (e.g. the POI amenity has closed), you should not delete the Link; otherwise the renderer would assume that the object exists only in the external database (and not yet in OSM) and would show it.
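
The two options for an orphaned Link, and the reason for keeping the Link after a deletion, can be sketched as follows. This is a minimal illustration, assuming a hypothetical Link record with a "kind" field; the names resolve_orphan, should_render_external and the value "deleted_in_osm" are made up for this example.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical OMM link; 'kind' is an assumed field for the link type."""
    osm_id: Optional[str]
    external_id: str
    kind: str = "same_object"


def resolve_orphan(link: Link, replacement_osm_id: Optional[str] = None) -> Link:
    """Fix a Link whose OSM object was deleted.

    With a replacement ID the Link is re-pointed to the new object.
    Without one, the Link is kept and marked as deleted in OSM (the default).
    """
    if replacement_osm_id is not None:
        return replace(link, osm_id=replacement_osm_id)
    return replace(link, osm_id=None, kind="deleted_in_osm")


def should_render_external(external_id: str, links: Iterable[Link]) -> bool:
    """Show an external object on its own only if no Link mentions it at all.

    A 'deleted_in_osm' Link still counts, so an object that mappers removed
    on purpose does not reappear from the external dataset.
    """
    return all(link.external_id != external_id for link in links)
```

For example, replacing a POI node with a building outline would call resolve_orphan(link, "way/77"), while closing the amenity would call resolve_orphan(link) and keep the external object hidden.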

> 
> Another example would be that I might delete a road segment and redraw
> it, if it's easier to do that than to move every single node around.

It would be handled the same way as in the previous case.

I see many harder cases here, like splitting a way into several segments. We probably cannot avoid the complexity of one-to-many Link types here. When you split a way in JOSM, the tags are replicated to all segments anyway (which is a bit stupid, as a Relation should be used instead, but there is no established Relation for it), so you would do the same with Links: change the 1-1 Link to a 1-N Link.
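
As an illustration of the 1-1 to 1-N change, here is a short Python sketch. It assumes a hypothetical Link record that holds a tuple of OSM IDs for one external object; neither the record nor the function split_way_in_links exists in OMM or JOSM.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical OMM link from one external object to one or more OSM objects."""
    osm_ids: Tuple[str, ...]
    external_id: str


def split_way_in_links(
    links: Sequence[Link], old_way_id: str, new_way_ids: Sequence[str]
) -> List[Link]:
    """Replace old_way_id with the new segment IDs in every Link that uses it."""
    updated = []
    for link in links:
        if old_way_id not in link.osm_ids:
            updated.append(link)  # untouched link
            continue
        ids: List[str] = []
        for osm_id in link.osm_ids:
            if osm_id == old_way_id:
                ids.extend(new_way_ids)  # 1-1 becomes 1-N here
            else:
                ids.append(osm_id)
        updated.append(Link(tuple(ids), link.external_id))
    return updated
```

Whether the original way ID survives a split as one of the segments depends on the editor; in this sketch the caller decides by including it in new_way_ids or not.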

> And by the way, since we're on the topic of object IDs, your proposal
> only addresses one end product: rendering.
> 
> How do you propose to handle routing?

Rendering and routing would actually be similar cases: you make a specific extract from all the linked databases and, based on the Links, combine it into a single coherent dataset. From this you can produce nice renderings, generate route graphs, generate .IMG files for Garmin or shapefiles for GIS, or build an index for text searches.
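
A minimal sketch of such a combining step, in Python, might look like this. Everything here is an assumption made for illustration: objects are reduced to plain tag dictionaries, Links are (OSM ID, external ID) pairs, and the rule that OSM tags win over external tags on conflict is just one possible policy.

```python
from typing import Dict, Iterable, List, Tuple

Tags = Dict[str, str]


def merge_extract(
    osm_objects: Dict[str, Tags],
    external_objects: Dict[str, Tags],
    links: Iterable[Tuple[str, str]],
) -> List[dict]:
    """Combine OSM and external objects into one dataset using the Links."""
    link_list = list(links)
    linked_external = {ext_id for _, ext_id in link_list}
    ext_by_osm: Dict[str, List[str]] = {}
    for osm_id, ext_id in link_list:
        ext_by_osm.setdefault(osm_id, []).append(ext_id)

    merged = []
    for osm_id, tags in osm_objects.items():
        combined: Tags = {}
        for ext_id in ext_by_osm.get(osm_id, []):
            combined.update(external_objects.get(ext_id, {}))
        combined.update(tags)  # assumed policy: OSM tags win on conflict
        merged.append({"id": osm_id, "tags": combined})

    # External objects without any Link are passed through unchanged.
    for ext_id, tags in external_objects.items():
        if ext_id not in linked_external:
            merged.append({"id": ext_id, "tags": dict(tags)})
    return merged
```

The merged list would then be the input for a renderer, a routing graph builder or a search indexer; geometry handling is left out entirely.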

> And what about layers?

What do you mean by layers here?

> And what about about objects which contain other objects. Even if you
> ignore ways, you still have relations.

Where possible you would need to link to the same object, which can be a node, way or relation in OSM. This assumes that the data models are more or less compatible, so that object B can be transformed into A with a number of simple steps (a changeset). In many cases it is not easy: in some cities I can see that buildings have both a node (with address tags) and a polygon (without many tags), with no proper link between them. The proper way to solve this would be to first fix OSM so that there is a relation between the node and the polygon, and then create a Link to the relation.

I'm not sure there is one general way to solve all the different cases. Well, the general solution is to use your brain and analyze every case carefully: what the objects really are, and what the real data model behind them is (often implied and not documented). Sometimes the data model is tricky to link for historical reasons. I'm afraid the missing "Street Relation" in OSM is one potentially major issue for linking. You can link apples with apples but not with oranges, and in some cases it is not easy to see which is which. This is especially true for natural objects like forests and rivers: there are too many good ways to model them on a map, and everyone calls it a "river" but actually means something quite different.

>>  I would put to API specification that object ID must not change by definition.
> 
> And how do you propose to enforce that for every object in every
> dataset for every organization? Our import page mentions at least 30
> datasets, but with the floodgates open, how many more would you have
> to deal with, and then enforce these rules on?

If you link a dataset, then the dataset should comply with a specific API, and that's it. If it doesn't, for example if its IDs change all the time, then its Links will often be broken. Sooner or later the community sees this, and the dataset is marked as "broken external API" and ignored by users (tools). There are many other ways an API can be broken, but changing IDs is a harder case because, unlike most other issues, it cannot be detected automatically. Still, the community will discover it quite fast.

The hardest part here would be to get external organizations to provide a live OMM/OSM API. Today they typically send their Shapefiles and say "deal with it, don't bother us any more". But I'm an optimist: if they see an additional benefit in it, some of them should be able to do it.

Here in the EU, state mapping agencies are wrestling right now with the Inspire requirements, which are in essence also rules for how governments should provide a public API for geodata, with a central metadata repository. So it goes in a similar direction as OpenMetaMap, just without object-level data linking; it enables only dataset-level overlays. Their technical solution is from the last century, so it basically means WMS and maybe some WFS (plus metadata). They are spending millions and several years to get Inspire done. OMM should reuse it somehow, perhaps by supporting WFS for external data sources, or even Shapefiles. In the Inspire case there is less need for object-level linking, as different state agencies often deal with completely different layers: one with the cadastre, another with buildings, a third with addresses, another with roads, another with pipes and power. And many of them do not care much about topology.

In the first iteration OMM would also have only an on-line metadatabase which enables dataset-level overlays. Object-level micro-linking would be a second level.

> 
>>> 6. License nightmare
>>> 
>>> This is a powder-keg ready to explode, but I'll just say this:
>>> Incompatible licenses will not allow this.
>> 
>> Yes, by using OMM, OSM and DBX data then you would create derivate of all of them and they must be compatible. But here again - this is general issue what I do not solve nor create there. I'm comparing OMM solution with usual import, and license issues are there basically the same. Maybe the problem happens just later - with imports the importer has to check it over once, with OMM-linking the user has to be sure that he merges appropriate databases.
>> 
>> Actually it would reduce nightmare a lot in some cases - if someone has imported data what was OK in 2010, but is not OK  in 2012 anymore.
> 
> That problem is solved with the CT.
> 
> And we solve the general issue by /generally/ discouraging imports,
> especially those where a strict process hasn't been followed.
> 
> I'm under the assumption that in your system, any user will be able to
> add a dataset.

This is a valid point: OMM would be a pretty powerful tool, and theoretically it can be misused (overused). The complex import process itself keeps some bad imports away.

This would be solvable with OMM's own CT, and if that is not enough, then with some review/approval process (community voting or something else; I have no idea now what might work). Actually, linking via OMM would not be much easier, or even very different:
a) With a dumb import you technically 1. prepare an OSM file 2. load it into JOSM and merge the data 3. click upload. Damage done.
b) With OMM you 1. prepare an OSM file or a live API (big work) 2. load it into JOSM and merge the data 3. register/publish your dataset in the OMM registry 4. upload the Links to OMM.

Actually the OMM route would be even a bit harder, and therefore safer?

Discussion with the community is also suggested and needed; this is the same in both cases.

>>  In principle I do not see significantly more work as you need to do with imports now. Extra work comes only from extra data updates - instead of data bursts you will have continuous stream to take care of - with all the gains and pains. You can use very similar tools (scripts and JOSM) as now. I hope that if external data providers can quite easily get back also community edits, then actually they should be much more motivated to look after their OSM/OMM derivate than now.
> 
> I think that you've touched on an important bit here. What you propose
> is not OpenStreetMap, and you couldn't call it OpenStreetMap.

Right, and I'm calling it OpenMetaMap as the working title. It is not something like OSM; really it is not a map or a map database at all. I would like to see it as a very thin layer on top of OSM which enables a new kind of "external link" relation, and which allows a clean OSM database while keeping the current nice visual map and the applications which also need other data sources. This assumes that the OSM community will accept it. Frankly, I do not see, or want to create, any other community to replace it.

I think we agree here on several points:
1. What is the essence of OSM? A community-created map.
2. Are imports to OSM bad? Yes. Why? Basically, they are not community-created.

Now, how to deal with imports. I understand that you propose to ban them, or to let them in but as little as possible. I propose a more radical approach: keep imports completely out of OSM. To make this possible, OSM (or maybe someone else, if the OSM community does not pick up the idea) should provide an alternative tool that keeps useful external datasets completely separate while still making them usable. So in an ideal world OSM could be a pure community-created map, not the kind of mixture it has become by now.

> When the OSM wanted to split the project, I stayed here. Many of them
> will be encouraging of your work. Some of these people who are
> supporting you are folks who are banned from editing in OpenStreetMap.
> That, I think is why they're so encouraging of your idea, because it
> may be something they feel could give them the advantages of OSM
> without being OSM.

I've discovered it too, which was actually surprising for me.

> I think that's sad. But no, despite its faults, I like OpenStreetMap
> and will stay with the project for the foreseeable future.

No, I do not want you or anyone else to come over. As I said, it is not an alternative to OSM by any means. It is just an alternative to the technical way external data is currently brought into OSM (I don't blame anyone; there just has not been any better way). All I would like to ask for is some of your time and brains; only in this sense is it "a competitor".

Having said that, I realize that, like any advanced tool, it can be used with different intentions, just as Potlatch, RailsPort and other originally OSM-only tools are now used by both OSM and OSM forks. I generally do not like the idea of forking OSM or splitting communities, but the people doing it probably have their good reasons (if not, their projects will just die soon). Can OMM be used to link an OSM fork with OSM? Technically maybe it can, but by using OSM data you create a derivative work, and I do not see how the OMM approach would make things legally any different here.

Jaak