[Imports] Proposal for proper OSM import solution (OpenMetaMap)

Jaak Laineste jaak at nutiteq.com
Thu Aug 18 15:31:48 UTC 2011


Hello,
 
> On Thu, Aug 18, 2011 at 6:23 AM, Jaak Laineste <jaak at nutiteq.com> wrote:
>> Hello,
>> 
>> Based on my own long-time thinking and small talk at WhereCamp Berlin,
>> I created a request for comments on a rather different approach to
>> imports called meta-mapping.
> 
> Since this proposal is nearly (exactly) identical to a thought I had
> about a year ago, I feel pretty qualified to speak about it.

Great!

> The objective of a tool like this would be to allow someone to run a
> database of geographic data and isolate it from other datasets - that
> is, by keeping the databases separate, one may allow for more
> flexibility in changing the data in one of the non-OSM datasets.
> 
> An example would be if a city government's dataset were to add/remove
> listings of libraries, the conflation process in OSM would be harder
> than it would be if there were simply a database where the information
> existed in isolation and was then linked to OSM. Simple, right?

Exactly. But I never thought it would be simple.

> 1. By moving objects out of the OSM database, you move the complexity
> out of the OSM database and into the conflation database
> 
> Moving the problem doesn't solve it. It just hides it (and you'll see
> why in the next few points).

Well, I'm not sure it is an OSM problem. In fact, my initial implementation idea was to create special relations in the OSM database which would act as external links. Others then suggested that it would be easier and cleaner to do it completely outside, only loosely dependent on any specific geodatabase. OSM would be the best reference database, as sooner or later it will have every object on earth (I hope).
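
To make that first idea more concrete, here is a rough Python sketch of what such a link relation could look like as OSM XML. All the tag names (type=omm:link, omm:dataset, omm:ext_id) are just made up for illustration; nothing like this is specified anywhere yet:

    import xml.etree.ElementTree as ET

    def make_link_relation(osm_node_id, dataset, ext_id):
        # A hypothetical "external link" relation tying one OSM object
        # to one object in an external dataset (negative id = new object).
        rel = ET.Element("relation", id="-1")
        ET.SubElement(rel, "member", type="node",
                      ref=str(osm_node_id), role="osm")
        ET.SubElement(rel, "tag", k="type", v="omm:link")         # invented
        ET.SubElement(rel, "tag", k="omm:dataset", v=dataset)     # invented
        ET.SubElement(rel, "tag", k="omm:ext_id", v=str(ext_id))  # invented
        return ET.tostring(rel, encoding="unicode")

    print(make_link_relation(123456, "ee-topo-registry", "ETAK-784"))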

> 2. This approach implies that external data sets are correct.
> 
> Underlying this approach is an assumption that we can rely on other
> datasets' accuracy. Sadly, this is not the case. As I work with more
> datasets and compare them to on the ground surveying, I find that many
> government datasets are either wrong, or out of date.
> 
> Take TIGER as an example. I'm going through TIGER 2010 as we speak.
> Most of what I've found indicates that when OSM is active in an area,
> our maps are more accurate than TIGER, even TIGER 2010, which is more
> accurate than TIGER 2005 (what was imported in the US).
> 
> We need therefore to encourage more mappers to map and not to rely on
> these external datasets. This project would do the opposite.

I do not really agree with this implication; the approach does not assume that the external dataset is correct. The process of linking (resolving conflations) would actually be the same as with a normal import: somebody has to review all the data overlaps, conflicts and duplicates and resolve them. It assumes only that using external data is better than having nothing, which is the same assumption behind any external data usage or import. So after the first linking round you would have a corrected view of the external data, at least as far as is possible at that point.
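
To illustrate what one reviewed "link" could record, a small Python sketch; the structure and the status values are my own invention, just to show that every overlap gets an explicit, human-reviewed resolution, exactly as in an import review:

    from dataclasses import dataclass

    @dataclass
    class Link:
        osm_id: int    # id of the matching OSM object (0 = none yet)
        dataset: str   # which external dataset
        ext_id: str    # persistent id inside that dataset
        status: str    # "linked", "osm_wins", "ext_wins", "unresolved"

    # Possible outcome of a first review round over a made-up dataset:
    links = [
        Link(123456, "city-libraries", "LIB-17", "linked"),
        Link(123457, "city-libraries", "LIB-18", "osm_wins"),    # survey beat the dataset
        Link(0,      "city-libraries", "LIB-19", "unresolved"),  # still in the review queue
    ]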

But later it could happen that data which was OK at initial linking time gets changed to something worse. Here you are right: an external data provider can do harm to our data. I would say we assume that the external data provider works in the direction of making the data better, not worse. In other words: it is OK to have bad data in the beginning, but it is not OK if the data modifications go in the wrong direction.

There will also always be the problem of newly added data: the maintainer of the database links has to do occasional reviews and correct this as well. With a usual import you have to fix the data once; no bulk update is possible, so you do not need to worry about later updates. Once we do have later updates, the maintainer has to start taking care of those too. More gain, more pain.

Actually, I'm afraid most external data sources will be rather static (just OSM files). That way there is no risk that an external dataset suddenly gets damaged. There would be no benefit from later updates, but even then the MetaMap database has an advantage: you keep the datasets clean and separated.

> 3. Data in the aggregated map won't be collected by on the ground mappers.
> 
> Some data, like the road data, will appear in both OSM and external
> datasets, but there's other data which may just never get collected by
> the community, if the map appears to already be complete.
> 
> And then since there's less on the ground mapping, the problems I
> mentioned earlier regarding flawed external datasets don't get noticed
> and corrected.

This is a valid point. It is a very general problem: data that "is already there" (from imports, or even just from other mappers before you) is quite likely to be left alone, trusted and never reviewed. This is a separate issue which I do not try to solve here.

I assume here that using external datasets is often good and reasonable, and in many cases unavoidable (admin borders, shorelines and other examples). What I propose is that MetaMapping is a better way of using other datasets than importing them. There are several cases (possibly roads) where other data sources should be avoided altogether.

In fact, with OpenMetaMap you would always have two views and maps: one is pure OSM, made entirely by our mappers, and the other is the complete map (the OMM map) with all the external sources. This is something you cannot get with the current imports approach. So if you wish, you can ignore the complete map and work on OSM only.
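
A rough Python sketch of how the two views could be produced: the pure OSM view is simply the OSM data untouched, while the OMM view is assembled on the fly from the links. All the structures here are invented placeholders, of course:

    def omm_view(osm_objs, ext_objs, links):
        # Start from the pure OSM view; the OSM database itself is never
        # modified, so that view always stays available as-is.
        merged = {oid: dict(tags) for oid, tags in osm_objs.items()}
        linked = set()
        for osm_id, dataset, ext_id in links:
            merged[osm_id].update(ext_objs[(dataset, ext_id)])  # overlay tags
            linked.add((dataset, ext_id))
        # External objects with no OSM counterpart exist only in the OMM view.
        for key, tags in ext_objs.items():
            if key not in linked:
                merged["%s/%s" % key] = dict(tags)
        return merged

    osm = {1: {"amenity": "library"}}
    ext = {("libs", "LIB-17"): {"opening_hours": "Mo-Fr 10-18"},
           ("libs", "LIB-18"): {"amenity": "library"}}
    print(omm_view(osm, ext, [(1, "libs", "LIB-17")]))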

And there is always the risk that a mapper finds an Internet site called Google Maps and discovers that "the map" is already there and complete :)


> 4. It assumes OSM object IDs remain constant.
> 
> OSM object IDs change. They don't change a lot, but they do change,
> and you can't force users to jump through hoops to preserve them (as
> we've seen people propose).

Yes, it assumes that IDs do not change. This is most important. Can you explain in more detail why and how OSM object IDs change? I've heard it too, but to analyze the cases properly I'd need to know the details.

> 5. It assumes external datasets' IDs remain constant
> 
> One of the whole points of this project seems to be to keep up to date
> with external datasets, such as those put out by local governments
> every quarter.
> Since most of these external datasets will be given in Shapefile
> format, there will need to be a conversion process.
> 
> You can't be assured that the ID numbers on objects will remain
> constant from Q1 to Q2. Heck, I bet you'd find that even their own
> internal IDs won't remain constant, at least not for every single ID
> on every single object on every single external database, of which
> there may be dozens or more.
> 
> So you're constantly in a race to conflate changing object IDs.

I would put into the API specification that, by definition, object IDs must not change. Of course we could not use ad hoc IDs there (row numbers etc.), only official IDs. Here in Estonia we have a statewide registry of topological objects where every object has its own key, and the IDs must not change without good reason. I assume every admin (NUTS) area has a unique official ID assigned to it, and that is what must be used. In OMM it has to go one level deeper: even node IDs must be persistent (perhaps; the detailed analysis is yet to be done), so this is a challenge.
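
To show the kind of guard such an API contract would need, here is a Python sketch that diffs two releases of an external dataset keyed by the official IDs; anything that vanished between releases breaks links and has to be re-reviewed (the data is invented):

    def diff_ids(old_release, new_release):
        # Both releases are dicts keyed by the dataset's official object id.
        vanished = set(old_release) - set(new_release)  # these break links
        added = set(new_release) - set(old_release)     # these need new links
        return vanished, added

    q1 = {"ETAK-784": "county A", "ETAK-785": "county B"}
    q2 = {"ETAK-784": "county A", "ETAK-901": "county C"}
    print(diff_ids(q1, q2))  # ({'ETAK-785'}, {'ETAK-901'})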

> 6. License nightmare
> 
> This is a powder-keg ready to explode, but I'll just say this:
> Incompatible licenses will not allow this.

Yes, by using OMM, OSM and DBX data together you would create a derivative of all of them, so the licenses must be compatible. But here again, this is a general issue which I neither solve nor create here. I'm comparing the OMM solution with a usual import, and the license issues are basically the same. Maybe the problem just surfaces later: with imports the importer has to check it once; with OMM linking the user has to make sure he merges appropriate databases.

Actually, in some cases it would reduce the nightmare a lot, e.g. if someone imported data that was OK in 2010 but is not OK anymore in 2012. With imports you have to pick the data (and all its derivations) out of OSM somehow, which is a nightmare. With the OMM approach you just disable the dataset (change its license field in the data directory, or simply remove it) and it is done.
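
As a sketch of that "disable the dataset" step, in Python and with invented field names for the data directory entries:

    # Hypothetical data-directory entries; all field names are made up.
    datasets = [
        {"id": "city-libraries", "license": "CC0", "enabled": True},
        {"id": "pois-2010", "license": "incompatible", "enabled": False},
    ]

    def active(datasets, acceptable=("CC0", "PD", "ODbL")):
        # Only enabled datasets with an acceptable license enter the merged
        # view; withdrawing a whole dataset is just flipping one flag.
        return [d for d in datasets
                if d["enabled"] and d["license"] in acceptable]

    print([d["id"] for d in active(datasets)])  # ['city-libraries']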

> 
> 7. Tremendous work.
> 
> The conflation process would be very hard to do, and frankly, not a
> lot of fun. You'll end up writing programs to do most of it I'm sure,
> but no programs will be perfect.
> 
> So people have to do it, and, frankly, it's not fun work.

In principle I do not see significantly more work than is needed with imports now. The extra work comes only from the extra data updates: instead of data bursts you will have a continuous stream to take care of, with all the gains and pains. You can use very similar tools (scripts and JOSM) as now. And I hope that if external data providers can easily get the community edits back, they should be much more motivated to look after their OSM/OMM derivative than they are now.
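
The "continuous stream" part could be as mechanical as the sketch below: the same ID diff as before, applied on every new release, with broken links queued for human review (structures invented again):

    def refresh(links, new_release_ids):
        # links: (osm_id, dataset, ext_id) tuples from earlier review rounds.
        # Any link whose external object vanished goes back to a human.
        return [l for l in links if l[2] not in new_release_ids]

    links = [(123456, "libs", "LIB-17"), (123457, "libs", "LIB-18")]
    print(refresh(links, {"LIB-17"}))  # LIB-18 gone -> back to review queue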

> These are the reasons I never went forward with this project.

I really hope you are open to reconsidering :)


Jaak

