[Tilesathome] Two new ROMA...

Milenko milenko at king-nerd.com
Thu Dec 4 19:13:33 GMT 2008


> -----Original Message-----
> From: Kai Krueger [mailto:kakrueger at gmail.com]
> Sent: Thursday, December 04, 2008 1:17 PM
> To: Mathieu Arnold
> Cc: 'TilesAtHome'; milenko at king-nerd.com
> Subject: Re: [Tilesathome] Two new ROMA...
> 
> On 04/12/08 17:45, Mathieu Arnold wrote:
> > +-On 04.12.2008 12:42:09 -0500, Milenko said:
> > |>  -----Original Message-----
> > |>  From: tilesathome-bounces at openstreetmap.org [mailto:tilesathome-
> > |>  bounces at openstreetmap.org] On Behalf Of Mathieu Arnold
> > |>  Sent: Thursday, December 04, 2008 10:24 AM
> > |>  To: 'TilesAtHome'
> > |>  Subject: Re: [Tilesathome] Two new ROMA...
> > |>
> > |>  +-On 04.12.2008 10:12:09 -0500, Milenko said:
> > |>  | OK - the map.fcgi that I have I just downloaded from Florian's
> > |>  | server, so that version does need to be updated.
> > |>
> > |>  Yes, that should end up somewhere in the svn :-)
> > |>
> > |>  | Yes I see that.  It's returning results in under a second or two
> > |>  | at the moment.  Most of the current tiles look empty or pretty
> > |>  | sparse though.  We'll see what happens when missingtiles runs next.
> > |>
> > |>  The thing is that, if possible, the clients that were hitting your
> > |>  server directly would really be better served by the load balancer,
> > |>  so that the load balancer does not mark your server as down because
> > |>  the concurrency limit is reached on your end. (You could also do as
> > |>  I did: set the concurrency around 20 and let the LB do the job.)
> > |>
> > |>  --
> > |>  Mathieu Arnold
> > |
> > | Could you drop the LB to something more like 8 instead of 10?  Those
> > | extra two requests really slow things down.
> >
> > It was at 9, it's at 8 now.
> >
> > | Do you have any idea what causes the large groups of requests all at
> > | one time?  I'm seeing a pattern of no new requests for a minute or so
> > | and then 4-7 new requests all within a second or two.  Is this by
> > | design on the LB?  If so, spreading these requests out would probably
> > | make all of the ROMA servers more efficient.
> >
> > Well, it's a bit hard to debug things without much information :-)
> > It may be because your server is marked down, thus gets no hits, and
> > then comes back up and gets assigned whatever is left in the queue
> > that it can take :-)
> >
> 
> I think that is exactly what is happening.  As soon as the server
> comes back up, the LB will assign it the full 8 requests up to maxconn
> if there are still requests left in the queue.  I think there is an
> option called slowstart / warmup or something like that, which allows
> you to slowly ramp up the requests once the server comes back online.
>
> The main problem, however, is that your server keeps on getting marked
> as down.  In 20 minutes it has been down 19 times, i.e. basically
> whenever it was assigned 9 requests at least one of them failed with
> 503, triggering the server to be marked down.  Can you tell how many
> requests hit your server that do not come through the load balancer?
>
> Another possibility is that there are a few stale pid files left in
> the directory that map.fcgi uses to determine the concurrency.  It
> would then think there are more requests currently ongoing than there
> actually are.  Could you check to see if there are any stale pids left
> in your $stampdir?  One reason there are stale pid files is that fcgi
> scripts can get killed if they run too long, not leaving the script
> time to clean up after itself.  On my Ubuntu system this timeout is
> only 40 seconds, which is far too short for large requests.
> 
> Also, your check for the stale db does not seem ideally placed.  As
> the check is rather late, the load balancer health check requests come
> back claiming everything is fine, but all requests fail with 503.
> Would it be possible to move the db check before the healthcheck bbox
> check?
> 
> 
> Kai
> 

The issue was probably that there were two of my clients still directly
hitting my server and that I had lowered maxinstances to 8 before I emailed
Mathieu to ask him to lower it on the LB.  This would create the exact
scenario that you described.  Both of these issues should be resolved now,
so we'll see if the problem continues.
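
On the slowstart idea: if the LB is haproxy (maxconn and slowstart both
sound like its per-server options), then something along these lines on
the server entry might help with the bursts.  The names, address, path
and timings below are only made up for illustration, not the real
config:

  backend roma
      # tiny-bbox request as the health check (path is just a guess at
      # the bbox healthcheck)
      option httpchk GET /api/0.5/map?bbox=0,0,0.001,0.001
      # cap each backend at 8 concurrent requests and ramp back up over
      # a minute after it has been marked down, instead of handing it
      # the full 8 in the same second
      server kingnerd 192.0.2.10:80 check inter 10s rise 2 fall 3 maxconn 8 slowstart 60s

That would not stop the flapping by itself, but it should at least
avoid the 0-then-8 pattern right after the server comes back up.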

I've moved the call to onstaledb() to right after the check for
maxinstances.
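
In other words, the checks now run in roughly this order.  This is only
a sketch, not the real map.fcgi; every name below except onstaledb() is
made up for illustration:

  use strict;
  use warnings;

  # Stubbed-out checks so the sketch runs standalone.
  sub too_many_instances  { return 0 }   # would count pid files in $stampdir
  sub onstaledb           { return 0 }   # would test how far behind the db is
  sub is_healthcheck_bbox { my ($bbox) = @_; return $bbox eq '0,0,0.001,0.001' }
  sub render_map          { my ($bbox) = @_; return "200 map data for $bbox" }

  sub handle_request {
      my ($bbox) = @_;

      # 1. concurrency cap: 503 once maxinstances requests are running
      return '503 too many instances' if too_many_instances();

      # 2. stale-db check, moved up right behind the maxinstances check
      return '503 database is stale'  if onstaledb();

      # 3. the LB's tiny-bbox healthcheck only passes once both checks pass
      return '200 OK (healthcheck)'   if is_healthcheck_bbox($bbox);

      # 4. normal map request
      return render_map($bbox);
  }

  print handle_request('0,0,0.001,0.001'), "\n";
  print handle_request('-71.2,42.3,-71.1,42.4'), "\n";

The point of the ordering is that a stale db now also makes the LB's
healthcheck request fail, so the server gets pulled out of rotation
instead of answering the healthcheck with 200 while every real request
gets a 503.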

Server was throwing 500s for a couple minutes there when I got interrupted
in the middle of moving the db check, but should be good now.
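
I'll also go through $stampdir for the stale pids you asked about.
Assuming each stamp file is named after the worker's pid (that's a
guess at the layout; the glob and regex would need adjusting to
whatever map.fcgi actually writes), something like this should list
entries whose process is gone:

  use strict;
  use warnings;

  # Hypothetical layout: one <pid>.pid file per in-flight request in the
  # directory map.fcgi uses to count concurrency.
  my $stampdir = '/var/run/roma';

  for my $file (glob "$stampdir/*.pid") {
      my ($pid) = $file =~ /(\d+)\.pid$/ or next;
      # kill 0 only tests whether the process still exists
      next if kill 0, $pid;
      print "stale: $file (pid $pid is gone)\n";
      # unlink $file;   # uncomment once sure nothing live matches
  }

And if the 40-second kill on your Ubuntu box is mod_fcgid's
IPCCommTimeout (also just a guess), raising that should make stale
files less likely in the first place.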

-Jeremy