[OSM-dev] Osmosis, Changesets, Diffs (replicate) and general questions

Sat Oct 31 01:49:13 GMT 2009

Lars Francke wrote:
>>> I'd like to include full changeset information in diffs but it's not
>>> trivial.  I'm not sure if I'll ever get to this personally.  I'd love to
>>> see somebody take it on though.
>>>       
>> I'll have a look at it but I don't want to get your hopes up :)
>>     
>
> I had a look and my initial enthusiasm has been dampened a little.....a lot :)
>   
Hehe, welcome to my world :-)
> I think I understand the PostgreSQL/Skytools/xmin/transaction stuff
> now but I have to admit that I have problems with all the
> indirections/layers/redirections in EntityDao and some other classes.
> But it seems to me that it wouldn't be the best idea to make Changeset
> an Entity subclass as there are just too many differences.
>   
This is one of the things I'm struggling with as well.  The Bound data 
type has similar issues because it doesn't even have an id, but it's 
fudged in order to pass it through the same pipeline.  I'm torn between 
passing changesets through as another Entity type and having some unused 
fields, and introducing more specific object hierarchies and making the 
pipeline more generic/complex.

Honestly, I don't have any strong opinions on this one.  It's hard to 
work up enough enthusiasm to tackle it.

Where it starts to get tricky is in detecting which changesets need to 
be sent through the pipeline, and in particular how to transfer the 
bounding box information.  From memory you need to send it through when 
it's created (to avoid foreign key problems), when it's used by entities 
(because the bounding box might have been updated), and when it is 
closed.  I'm not sure how that relates to the new replication code though; 
the algorithm for that one would still need to be figured out.
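
The three conditions above can be sketched roughly as follows.  This is 
only an illustration under stated assumptions: the class, field and 
method names are hypothetical and do not exist in Osmosis, and it uses a 
time window for simplicity, whereas the new transaction-based 
replication would need a different test.

```java
import java.time.Instant;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Osmosis.
final class ChangesetSelectionSketch {

    // Minimal stand-in for a changeset record; the fields are assumptions.
    static final class ChangesetRow {
        final long id;
        final Instant createdAt;
        final Instant closedAt; // null while the changeset is still open

        ChangesetRow(long id, Instant createdAt, Instant closedAt) {
            this.id = id;
            this.createdAt = createdAt;
            this.closedAt = closedAt;
        }
    }

    // True when t is non-null and falls inside the half-open interval [start, end).
    static boolean inInterval(Instant t, Instant start, Instant end) {
        return t != null && !t.isBefore(start) && t.isBefore(end);
    }

    // A changeset is sent when it was created in the interval (so foreign
    // keys resolve), when an entity in the interval refers to it (its
    // bounding box may have changed), or when it was closed in the interval.
    static boolean needsSending(ChangesetRow c, Instant start, Instant end,
            Set<Long> changesetIdsUsedByEntities) {
        boolean created = inInterval(c.createdAt, start, end);
        boolean used = changesetIdsUsedByEntities.contains(c.id);
        boolean closed = inInterval(c.closedAt, start, end);
        return created || used || closed;
    }
}
```

The set of changeset ids used by entities is assumed to be collected 
while reading the entity changes for the same interval.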
> ReplicationDestination on the other hand needs ChangeContainers which
> require EntityContainers, which require Entity objects. Some of the
> functions in EntityDao don't apply either (changeset has no version or
> "timestamp", ...). In addition the OSMWriter would have to be extended
> and a ChangesetWriter needs to be written. I could certainly try to
> hack something together but I'm afraid that it'd fall short of your
> code standards :)
>   
If the main issue is that timestamp and version are not used in a 
Changeset then I'd lean towards just sub-classing Entity.  It's kinda 
messy but not atrocious.  The alternatives add a lot of complexity.
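
To make the sub-classing option concrete, here is a minimal sketch.  
SketchEntity is a deliberately simplified stand-in and not the real 
Osmosis Entity class; the placeholder values chosen for the unused 
version and timestamp fields are assumptions for illustration only.

```java
import java.time.Instant;

// Hypothetical sketch: these are simplified stand-ins, not Osmosis classes.
final class ChangesetAsEntitySketch {

    // Simplified entity base class carrying the fields every entity shares.
    abstract static class SketchEntity {
        private final long id;
        private final int version;
        private final Instant timestamp;

        SketchEntity(long id, int version, Instant timestamp) {
            this.id = id;
            this.version = version;
            this.timestamp = timestamp;
        }

        long getId() { return id; }
        int getVersion() { return version; }
        Instant getTimestamp() { return timestamp; }
    }

    // A changeset has no version or timestamp, so the inherited fields are
    // filled with placeholders (an assumption; other conventions would work).
    static final class SketchChangeset extends SketchEntity {
        static final int UNUSED_VERSION = 0;

        private final boolean open;

        SketchChangeset(long id, boolean open) {
            super(id, UNUSED_VERSION, null);
            this.open = open;
        }

        boolean isOpen() { return open; }
    }
}
```

The cost is visible in the constructor: every consumer of the pipeline 
has to know that version and timestamp carry no meaning for this type.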

My main issue with accepting patches is that I'm usually the one who has 
to maintain them so I do tend to be a little fussy ;-)  But so long as 
it has reasonable test coverage, all tasks are updated (i.e. 
including the existing timestamp-based replication tasks, and all 
downstream tasks such as XML writers), and the various code checks (e.g. 
checkstyle) pass, then I shouldn't have a problem.  If you're keen to 
have a go we can create a branch and do some experiments to see what works.
> * I _only_ looked at the replication task, so it is _very_ possible
> that I overlooked something and my changes would break compatibility.
> I'll have a second look at this
> * The xmin-index for the changeset-table would have to be created but
> I suppose that wouldn't be a big problem
>   
Yep, should be fine.  The changeset table is relatively small compared 
to the node table for example so we shouldn't have an issue getting the 
index created if it is necessary.
> But I'd be glad if you could give me any pointers. I still won't
> promise anything but I'm still reading the code...so who knows.
>   
You seem to have a reasonable handle on it.  To be honest I'm not too 
sure where to begin :-)  There's a lot in there and I struggle to 
remember how it all works.  About all you can do is focus on one task at 
a time.  One thing that is quite confusing is that there are several 
different database access methods in use.  There's the original code 
used by the old --read-apidb-change type tasks, then there's improved 
code in the pgsql tasks, and then there are the new replication tasks which 
are Spring Framework based.  As a result there are some redundant classes 
in there that could be eliminated with a good refactor and rewrite of 
the apidb tasks.  The new Spring Framework stuff is the direction I'm 
heading in as it requires far less code, is cleaner, and is less error 
prone.

Don't be scared to have a go at it, and feel free to ask me to take a 
look at some code before you spend too much time on it.  So long as it's 
done in a branch to keep the trunk relatively stable.
> On another note: Has anyone ever had a look at alternative database
> systems for OSM? No, I don't propose a change! I'd just be interested
> if anyone had a look at systems like HBase, MonetDB (Stefan de Konink
> does a lot of stuff with MonetDB if I remember correctly), MongoDB,
> CouchDB, Cassandra, ... and their possible use cases for OSM.
>   
The only ones I've played with are MySQL (obsolete), PostgreSQL/PostGIS 
(basis of apidb and pgsql tasks), and Berkeley DB Java Edition (deleted 
because I couldn't get it to scale).

Cheers,
Brett