[Strategic] Fwd: Subject: Forks and such

Jim Brown jim at cloudmade.com
Mon Aug 30 21:51:29 BST 2010


Hi Tim,

First off... I am surprised that you chose to assert "Cloudmade wants a global and comprehensive database, fair enough but there are other users in the world." When I interact in the community, I am thinking of, and representing, the community... You seem to imply that I am not considering other users and that all I care about is a global and comprehensive database, which is patently untrue. I came very close to withdrawing from the conversation when I read that.

By the way: I am assuming that you are thinking of and representing what you think is best for the community and not a personal agenda, and I would be hesitant to imply otherwise.

Now, back to the discussion...

I may be wrong, but you still seem to misunderstand the point I am making about usable data. Let me try a final time to state my basic assumptions, my observations, your Australian use case, and then, from these, what I think the key question is ;) .

ASSUMPTIONS:
If you had multiple forks, any particular use case could only use the data from one. They have forked and, once forked, cannot be recombined (unless they are under compatible licenses, in which case, why fork?). This inability to recombine is essentially the difference between forking and offering the data under dual licenses (mentioned here as well).

So when I say usable data, I am not making a statement that other (theoretical) forks are not in some way useful to someone, I am just asserting that they cannot be combined in any use case and so for any use case, the amount of usable data is that data in a single set.

It is also important to keep in mind that the data is neither static nor independent in a data set. Any data item can be edited over time, and data items can be combined into new objects over time. These edits and combinations can only occur in the domain of the data set that the objects come from... so making a way out of a set of nodes means that the nodes must all be combinable and so must come from the same data set (or compatible data sets). And if you edit something, your edit can only go back into the data set the object came from (so only new contributions can be applied across data sets; edits cannot). It is tempting to consider tools that could apply edits to multiple data sets, but the complexities of checking surrounding data, looking for duplicates of new data that already exist in one data set but not the other, looking at the licensing of data objects that are made out of the items being edited, and looking at the user preferences for licensing make this hard at best.

OBSERVATIONS:
Finally, it was these observations that prompted my two main comments:

1. that the amount of usable data is the amount in the data set you choose to use
   (it is never the sum of all data, because the sum of all data cannot be used for
   any single purpose).

2. that objects that exist in different forks will continue to diverge over time
   because they cannot be practically edited/used across data sets.  

AUS USE CASE
With regard to your question about an Australian use case, let's consider the case where there is a forked DB with mostly Australian data in it under CC BY SA, and the general OSM database.

With a finite number of mappers, and the inability to edit data across data sets, you would soon have some objects that are:
 a. only in one set or the other,
 b. in both sets and are the same in both,
 c. in both sets and have been separately edited in one or both versions and are different.

The class of objects that are the same (b) would degrade over time as edits are applied (given that edits cannot be applied to multiple forks).

As time goes on, the two would be increasingly different from each other, each one likely to be better in some places and worse in others than the other dataset. 

In this case, I would say that we end up with a lower quality map in terms of accuracy and completeness in either data set than we would have had with a single data set.  This seems inevitable if mappers are made to choose to put edits into one or the other.   

Finally, you seem to say that there are other important attributes of a geo data set in addition to its accuracy and completeness, and you cite "license, format, availability, richness, etc".

'License' is of course what we are discussing, 
'format' seems irrelevant, as it is a function of extract tools and not related to license, 
'availability' (I am not sure what you mean by this?) 
'richness' is part of what I mean by completeness and accuracy 
 
THE KEY QUESTION:
While I agree these are important attributes, we are really discussing the question:

Is the loss of accuracy and completeness associated with a fork (along with the other costs) offset by the benefit of having an additional license?

I would say no, but that is the question we are discussing...  unless you disagree with the assertion that we would have a loss of accuracy and completeness (in which case we are back at square one and I'll ask that someone else try and argue this point as I have certainly failed...)



j



From: TimSC [mailto:mappinglists at sheerman-chase.org.uk] 
Sent: 30 August 2010 19:27
To: strategic at openstreetmap.org
Cc: Jim Brown
Subject: Re: [Strategic] Fwd: Subject: Forks and such


On 30/08/10 14:54, Mikel Maron wrote: 
Forking has been well explored on the lists, and here. If someone could give a neutral account in the wiki, that would be a good contribution.

I think this is a good idea but based on the discussion, we have some fundamental issues to cover. Once the discussion begins to slow, I suggest we can document some of the ideas on the wiki.

On 30/08/10 13:44, Frederik Ramm wrote: 

If it is OSM administered then maybe we're not talking about a fork, but about dual licensing? 
Yes, that could be worth considering. I did not intend to include this possibility in my original idea. I imagined several independent databases under different licenses (and yes, there would generally be divergence). I am occasionally pro-PD-like licensing but there are several to choose from and a multiple PD-like license would seem to be a thorough solution. I can't see much call for dual licensing CT/ODbL with a 2nd license at the moment - unless it is CC-BY-SA (but the LWG can worry about that).

On 30/08/10 10:22, Oliver wrote: 

I think it is clear that with the effort put in the license change the idea to handle a fork under the umbrella of the OSMF is not capable of winning a majority within the OSMF. Otherwise it would make more sense to establish an ODbL fork rather than changing the license of the primary database.
Interesting point, but I don't see the need for a vote of OSMF membership (yet or possibly at all). We are getting ahead of ourselves... (Side note to my thread on "consensus": since when do OSMF membership votes determine the direction of OSM?)

On 30/08/10 17:28, Jim Brown wrote:

At any given time, the total data you can use for anything is limited to a single database.  Having multiple data sets is a binary condition where choosing one excludes using the others.  Let's say we have two data sets A and B, A has 1m POI, B has 750k POI and between them they have 1.25m distinct POI.  The 1.25m number is irrelevant as no one can use it, they can use 1m or 750k. And the management of the 500k overlap data is totally wasted effort detracting from mapping and editing.
I feel like we are both repeating ourselves, but this won't go into an infinite loop... I hope. In what follows, my tone attempts to be more concise than my previous email. It might come off as rather argumentative, but this is not the intent - sorry in advance basically! For other readers, basically I attempt to pick apart Jim's points but I don't advance anything new.

So, for a single user, yes. But for different individual users, the quantity of data is not the only consideration to make a database the preferable or usable one. For example, the license is different. Users want or need different licenses. Therefore both databases are utilized. From my previous example [1], are you saying only one database fork of Australia is "useful"? Specifically answering this point might provide me with some insight into your thinking.

Your argument seems to have the conclusion that only one GIS database is ever needed in the whole world for any purpose ("USABLE data is the amount in a single database"), which is clearly absurd. (I am taking a literal reading, as you suggested). If you admit other databases have their uses, for whatever reason, then forks could in principle be useful.

Also, I am an existentialist. This means I think something is valued if (and only if) we think it so. Some people think forks are valuable. Therefore forks are valuable (at least to those people). You are of course entitled to your opinion that they are not, but don't assume everyone is like you. Cloudmade wants a global and comprehensive database, fair enough but there are other users in the world.


As soon as someone edits only one, then that is game over for that entry.
That is far from certain to occur.


The difference is in the scope of impact, not the quality of the impact.
[snip]

the decision to use a tool, write a tool, fork a tool, change the source code license of a tool or discontinue the use of a tool has no lasting impact outside the authors/users of the tool.   

And the decision to change or fork a database has no lasting impact outside its authors/users! I don't see a difference between diversity in tools and databases, apart from the number of users. Ok, so the OSM database has more users than a single tool. So what? (I am not calling for reckless action, I am just pointing out I don't agree with your/Jim's point.)


Having a forked database impacts EVERY mapper who wants to map an area by constraining the set of data they see as what is already there (they must choose a data set to map against). It also impacts every user of the map data who has to choose between datasets to work with.
Ok, so a choice of databases would exist. So what?


Consequently, I think that any forks are permanent divisions of the project and do not add any value to the project.
That doesn't follow from the fact that the database has many users (large scope), or a choice of databases exists.


  If the goal is to "create and provide free geographic data" as you say then we do that less well in a forked world as the data is both less complete and less accurate in any single fork than it would be in a unified database.
I don't agree with your premise. You need to establish that forking results in less completeness (which is far from certain) and less accuracy (ditto). And even then, your conclusion doesn't follow from that premise, either! (Accuracy and completeness are not the only attributes of databases. What about license, format, availability, richness, etc?)


The costs of a fork are pretty extreme.  
I don't agree that "costs of a fork are pretty extreme". Perhaps you can back that up with a concrete example? (You probably think you did, but I don't see it.)


A fork completely divides the project.  This is because the data is no longer common across the forks and I would question the capacity of the community (mappers, coders, admins and everyone) to support multiple distinct projects.
This is an exaggeration. A PD-like fork would not be completely independent, as data would flow from PD to the other datasets. There would be some areas of commonality (and some areas that diverge). The tools are the same, too. And many individual mappers are shared. Therefore it does not "completely divide the project".


  It would be like the Wikipedia foundation deciding to host a second Wikipedia site under a different license.
And if there were some advantages in doing so, they might consider it. Your point?

As you probably can guess from above, I don't think there are any viable arguments to be made, using these abstract principles, against forking (such as I feel Jim has attempted). I have attempted to address every one of Jim's points. However, I can think of many practical problems that are far more worrying. Perhaps we should move on to those?

I don't know if anyone else cares to wade into this discussion and say whether forks could, in principle or in practice, provide value, or whether they are always a waste of time. We (or I) probably could do with some perspective...

TimSC

[1] http://lists.openstreetmap.org/pipermail/strategic/2010-August/000138.html


