[Tagging] Data redundancy with "ref" tag on ways vs relations

Mon Jul 30 19:37:04 BST 2012

On 30.07.2012 20:08, Paweł Paprota wrote:
>> the more redundancy the more
>> automated checks can be done to find errors.
> 
> Sorry if I am being too harsh, I am not trying to be mean or anything
> but... I don't understand how this sentence would be true in any
> context. More redundancy, especially redundancy in data entered by
> humans, simply invites more opportunity for errors. So of course QA
> tools will find more errors - simply because there is more data to
> maintain!

The reasoning is as follows:

If only one instance of the data is being created, then there is a
certain probability that this data is wrong.

If two instances are created at least somewhat independently*, then
there is a very small probability that both end up wrong, and a much
larger probability that one of them ends up wrong. The probability that
everything is correct is now smaller than before.
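
To put rough numbers on this, here is a small Python sketch. The 5%
per-instance error rate is purely an assumed figure for illustration,
the function name is made up, and the calculation assumes the two
instances are fully independent, which is an idealisation.

```python
def outcome_probabilities(p):
    """Return (both_wrong, exactly_one_wrong, all_correct) for two
    independent instances, each wrong with probability p."""
    both_wrong = p * p
    exactly_one_wrong = 2 * p * (1 - p)
    all_correct = (1 - p) * (1 - p)
    return both_wrong, exactly_one_wrong, all_correct


p = 0.05  # assumed per-instance error rate, for illustration only
both, one, ok = outcome_probabilities(p)
print(f"one instance:  correct {1 - p:.4f}, wrong {p:.4f}")
print(f"two instances: correct {ok:.4f}, "
      f"exactly one wrong {one:.4f}, both wrong {both:.4f}")
```

Under these assumptions the chance that everything is correct drops
from 95% to about 90%, while the chance that both instances are wrong
is only 0.25%.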

However, at this point we can begin to use automated error checking. The
idea is that errors that can be found automatically are much more
acceptable than those that cannot.

With only one instance of the data, none of the errors can be found
automatically.

With two instances, most errors can be found automatically; only the
(very rare) case where both instances are wrong cannot.

Therefore, according to this line of reasoning, redundancy will increase
the number of errors initially, but reduce the number of "bad" errors
that cannot be spotted by automated checks.
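
The trade-off can also be sketched as a small simulation. Everything in
it is an assumption made for illustration: the 5% error rate, the
number of routes, the names, and the simplification that a mismatch is
flagged whenever exactly one of the two tags is wrong. Counting every
"both wrong" case as undetected is pessimistic, since two different
wrong values would still disagree with each other.

```python
import random


def simulate(n_routes, p, seed=0):
    """Count errors for n_routes routes, each tag wrong with
    probability p, independently of the other tag."""
    rng = random.Random(seed)
    single_undetected = 0  # one copy only: every error goes unnoticed
    double_detected = 0    # two copies: exactly one is wrong, mismatch
    double_undetected = 0  # two copies: both wrong (assumed unnoticed)
    for _ in range(n_routes):
        way_ref_wrong = rng.random() < p
        relation_ref_wrong = rng.random() < p
        if way_ref_wrong:
            single_undetected += 1
        if way_ref_wrong and relation_ref_wrong:
            double_undetected += 1
        elif way_ref_wrong or relation_ref_wrong:
            double_detected += 1
    return single_undetected, double_detected, double_undetected


single_bad, double_found, double_bad = simulate(100_000, 0.05)
print("one copy:   undetectable errors:", single_bad)
print("two copies: detectable errors:  ", double_found)
print("two copies: undetectable errors:", double_bad)
```

With these assumptions the two-copy case has more errors in total, but
far fewer of the kind that no automated check can find.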

* Of course this reasoning depends on the assumption that these two
instances are created independently. If a small number of mappers trace
an entire route network mostly from scratch (e.g. during initial
remapping) using aerial imagery or other non-local sources, this
assumption is probably not justified. However, it might be justified
for a scenario where many contributors each add only small sections of
a route over a longer period of time.

Tobias