
bug in /etc/init.d/ceph debian

Message ID alpine.DEB.2.00.1308070822180.3006@cobra.newdream.net (mailing list archive)
State New, archived

Commit Message

Sage Weil Aug. 7, 2013, 3:23 p.m. UTC
Hi James,

Here is a somewhat simpler patch; does this work for you?  Note that if 
you run something like /etc/init.d/ceph status osd.123 where osd.123 isn't 
in ceph.conf then you get a status 1 instead of 3.  But for the 
/etc/init.d/ceph status mds (or osd or mon) case where there are no 
daemons of a particular type it works.

Perhaps the "does not exist" check should also be modified to return 3?

sage




On Wed, 7 Aug 2013, James Harper wrote:

> > 
> > I'm running ceph 0.61.7-1~bpo70+1 and I think there is a bug in
> > /etc/init.d/ceph
> > 
> > The heartbeat RA expects that the init.d script will return 3 for "not running",
> > but if there is no agent (eg mds) defined for that host it will return 0 instead,
> > so pacemaker thinks the agent is running on a node where it isn't even
> > defined and presumably would then start doing stonith when it finds it
> > remains running after a stop command.
> > 
> > Or maybe that is the correct behaviour of the init.d script and the RA needs
> > to be modified?
> > 
> 
> Nobody interested in this?
> 
> My proposed fix follows this email. Return status is:
> 0 - everything tested is running
> 1 - something wrong
> 3 - something tested is stopped
> 
> Without this patch, the resource agents report that the service is running if the service is not defined on the host.
> 
> I'm not sure though if this is the right approach. Maybe the /etc/init.d/ceph should return 0 when checking the status of (say) mon, when there are no mons defined on this host?
> 
> James
> 
> --- ceph.orig   2013-08-07 13:28:25.000000000 +1000
> +++ ceph        2013-08-07 13:32:37.000000000 +1000
> @@ -170,6 +170,9 @@
>  get_local_name_list
>  get_name_list "$@"
> 
> +running=0
> +dead=0
> +stopped=0
>  for name in $what; do
>      type=`echo $name | cut -c 1-3`   # e.g. 'mon', if $item is 'mon1'
>      id=`echo $name | cut -c 4- | sed 's/^\\.//'`
> @@ -375,14 +378,15 @@
>             if daemon_is_running $name ceph-$type $id $pid_file; then
>                 echo -n "$name: running "
>                 do_cmd "$BINDIR/ceph --admin-daemon $asok version 2>/dev/null" || echo unknown
> +                running=1
>              elif [ -e "$pid_file" ]; then
>                  # daemon is dead, but pid file still exists
>                  echo "$name: dead."
> -                EXIT_STATUS=1
> +                dead=1
>              else
>                  # daemon is dead, and pid file is gone
>                  echo "$name: not running."
> -                EXIT_STATUS=3
> +                stopped=1
>              fi
>             ;;
> 
> @@ -430,6 +434,16 @@
>      esac
>  done
> 
> +if [ "$command" = "status" ]; then
> +    if [ "$dead" = "1" ]; then
> +      EXIT_STATUS=1
> +    elif [ "$running" = "1" ]; then
> +      EXIT_STATUS=0
> +    else
> +      EXIT_STATUS=3
> +    fi
> +fi
> +
>  # activate latent osds?
>  if [ "$command" = "start" ]; then
>      if [ "$*" = "" ] || echo $* | grep -q ^osd\$ ; then
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
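The 0/1/3 policy James proposes above can be sketched as a standalone shell function. This is a hypothetical helper for illustration only (`aggregate_status` is not a name from the patch; the real patch tracks `running`/`dead`/`stopped` flags inside the per-daemon loop):

```shell
#!/bin/sh
# Sketch of the proposed status aggregation: given one word per daemon
# ("running", "dead", or "stopped"), fold them into an LSB-style exit code.
# Hypothetical helper; the actual init-ceph sets flags inside its daemon loop.
aggregate_status() {
    running=0 dead=0 stopped=0
    for state in "$@"; do
        case $state in
            running) running=1 ;;
            dead)    dead=1 ;;      # pid file exists but process is gone
            stopped) stopped=1 ;;   # no pid file; daemon cleanly stopped
        esac
    done
    if [ "$dead" = 1 ]; then
        return 1        # something wrong
    elif [ "$running" = 1 ]; then
        return 0        # at least one tested daemon is running
    else
        return 3        # nothing running ("program is not running" in LSB terms)
    fi
}

# Example: one daemon running, one cleanly stopped => "running" overall.
aggregate_status running stopped
echo "exit status: $?"
```

Note that with no daemons configured at all the loop body never runs, so the function falls through to 3, which is exactly the case the RA cares about on hosts where ceph isn't configured.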

Comments

James Harper Aug. 7, 2013, 11:20 p.m. UTC | #1
> Hi James,
> 
> Here is a somewhat simpler patch; does this work for you?  Note that if
> you run something like /etc/init.d/ceph status osd.123 where osd.123 isn't in
> ceph.conf then you get a status 1 instead of 3.  But for the
> /etc/init.d/ceph status mds (or osd or mon) case where there are no
> daemons of a particular type it works.
> 
> Perhaps the "does not exist" check should also be modified to return 3?
> 

Pacemaker will call the RA on every node to see what is running. On a node in an asymmetric cluster where ceph isn't configured, the RA just wants to know that it isn't running - it won't like an error being returned. For a node without even the RA script installed it would return not-installed, but I think that's okay too.

Do you think maybe the 'ceph status' check and the RA check have conflicting requirements here? Maybe it would be better to leave the init.d script as-is and build the smarts into the RA script instead.
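Building the smarts into the RA would mean translating the init script's LSB status codes into the OCF codes pacemaker's monitor action expects (0 for success, 7 for not running, 1 for a generic error). A minimal sketch of that translation, assuming a hypothetical wrapper rather than the actual ceph RA:

```shell
#!/bin/sh
# Hypothetical monitor-action wrapper: map LSB "status" exit codes to the
# OCF return codes pacemaker expects. A sketch only, not the real RA.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

monitor_from_lsb() {
    case $1 in
        0) return $OCF_SUCCESS ;;      # daemon(s) running
        3) return $OCF_NOT_RUNNING ;;  # cleanly stopped / nothing configured here
        *) return $OCF_ERR_GENERIC ;;  # dead pid file, unknown status, etc.
    esac
}

# Example: LSB 3 ("not running") becomes OCF_NOT_RUNNING (7).
if monitor_from_lsb 3; then rc=0; else rc=$?; fi
echo "ocf status: $rc"
```

The design question in the thread is exactly where this mapping should live: in the init script (by returning 3 instead of 0 when nothing is configured) or in the RA (by translating whatever the init script returns).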

Do idle mds's add any load to the system? Would it be useful to be able to have pacemaker bring up mds's on any two nodes so you always have exactly two running, without actually tying them to specific nodes?

James

Sage Weil Aug. 8, 2013, 3:46 p.m. UTC | #2
On Wed, 7 Aug 2013, James Harper wrote:
> > Hi James,
> > 
> > Here is a somewhat simpler patch; does this work for you?  Note that if
> > you run something like /etc/init.d/ceph status osd.123 where osd.123 isn't in
> > ceph.conf then you get a status 1 instead of 3.  But for the
> > /etc/init.d/ceph status mds (or osd or mon) case where there are no
> > daemons of a particular type it works.
> > 
> > Perhaps the "does not exist" check should also be modified to return 3?
> > 
> 
> Pacemaker will call the RA on every node to see what is running. On a node in an asymmetric cluster where ceph isn't configured, the RA just wants to know that it isn't running - it won't like an error being returned. For a node without even the RA script installed it would return not-installed, but I think that's okay too.
> 
> Do you think maybe the 'ceph status' check and the RA check have conflicting requirements here? Maybe it would be better to leave the init.d script as-is and build the smarts into the RA script instead.

Maybe, but I'm not completely following what the RA's requirements are 
here.  If it's just a matter of the init script returning a different 
error code, though (as we've done so far), I don't see any problem.

> Do idle mds's add any load to the system? Would it be useful to be able to have pacemaker bring up mds's on any two nodes so you always have exactly two running, without actually tying them to specific nodes?

They don't add much load when they are standby, but they will if they end 
up taking over.  There is also no reason to say 'exactly two' IMO.  If you 
have a symmetric cluster I would be more inclined to run one on every node 
for simplicity, recognizing that the active one will use some resources.

sage
James Harper Aug. 9, 2013, 12:04 a.m. UTC | #3
> On Wed, 7 Aug 2013, James Harper wrote:
> > > Hi James,
> > >
> > > Here is a somewhat simpler patch; does this work for you?  Note that if
> > > you run something like /etc/init.d/ceph status osd.123 where osd.123 isn't in
> > > ceph.conf then you get a status 1 instead of 3.  But for the
> > > /etc/init.d/ceph status mds (or osd or mon) case where there are no
> > > daemons of a particular type it works.
> > >
> > > Perhaps the "does not exist" check should also be modified to return 3?
> > >
> >
> > Pacemaker will call the RA on every node to see what is running. On a node
> in an asymmetric cluster where ceph isn't configured, the RA just wants to
> know that it isn't running - it won't like an error being returned. For a node
> without even the RA script installed it would return not-installed, but I think
> that's okay too.
> >
> > Do you think maybe the 'ceph status' check and the RA check have
> conflicting requirements here? Maybe it would be better to leave the init.d
> script as-is and build the smarts into the RA script instead.
> 
> Maybe, but I'm not completely following what the RA's requirements are
> here.  If it's just a matter of the init script returning a different
> error code, though (as we've done so far), I don't see any problem.
> 

I haven't tried your patch yet, but can it ever return 0? It seems to set it to 3 initially, and then change it to 1 if it finds an error. I can't see that it ever sets it to 0 to indicate that daemons are running. That's easy enough to fix by setting EXIT_STATUS=0 after the daemon_is_running check, I think, but it still doesn't allow for the case where there are three OSDs, one running, one stopped, and one failed. The EXIT_STATUS in that case appears to be based on the last daemon checked, i.e. essentially random.
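James's reading of the patched loop can be checked with a toy reduction. This compresses the real per-daemon status branches into one hypothetical function (`status_as_patched` is not a name from the script):

```shell
#!/bin/sh
# Toy model of the status loop as patched: EXIT_STATUS starts at 3, the
# "dead" branch sets 1, the "stopped" branch sets 3, and the "running"
# branch sets nothing -- so 0 is never returned, and with mixed daemon
# states the last daemon checked determines the result.
status_as_patched() {
    EXIT_STATUS=3
    for state in "$@"; do
        case $state in
            dead)    EXIT_STATUS=1 ;;
            stopped) EXIT_STATUS=3 ;;
            running) : ;;   # no assignment in the patched script
        esac
    done
    return $EXIT_STATUS
}

# Example: a dead daemon followed by a stopped one -- the error is masked.
if status_as_patched dead stopped; then rc=0; else rc=$?; fi
echo "exit status: $rc"
```

Reversing the order (stopped, then dead) yields 1 instead, which is the "based on the last daemon checked" behavior described above.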

> > Do idle mds's add any load to the system? Would it be useful to be able to
> > have pacemaker bring up mds's on any two nodes so you always have exactly
> > two running, without actually tying them to specific nodes?
> 
> They don't add much load when they are standby, but they will if they end
> up taking over.  There is also no reason to say 'exactly two' IMO.  If you
> have a symmetric cluster I would be more inclined to run one on every node
> for simplicity, recognizing that the active one will use some resources.
> 

Thanks for that clarification

James
Sage Weil Aug. 9, 2013, 1:10 a.m. UTC | #4
On Fri, 9 Aug 2013, James Harper wrote:
> > On Wed, 7 Aug 2013, James Harper wrote:
> > > > Hi James,
> > > >
> > > > Here is a somewhat simpler patch; does this work for you?  Note that if
> > > > you run something like /etc/init.d/ceph status osd.123 where osd.123 isn't in
> > > > ceph.conf then you get a status 1 instead of 3.  But for the
> > > > /etc/init.d/ceph status mds (or osd or mon) case where there are no
> > > > daemons of a particular type it works.
> > > >
> > > > Perhaps the "does not exist" check should also be modified to return 3?
> > > >
> > >
> > > Pacemaker will call the RA on every node to see what is running. On a node
> > in an asymmetric cluster where ceph isn't configured, the RA just wants to
> > know that it isn't running - it won't like an error being returned. For a node
> > without even the RA script installed it would return not-installed, but I think
> > that's okay too.
> > >
> > > Do you think maybe the 'ceph status' check and the RA check have
> > conflicting requirements here? Maybe it would be better to leave the init.d
> > script as-is and build the smarts into the RA script instead.
> > 
> > Maybe, but I'm not completely following what the RA's requirements are
> > here.  If it's just a matter of the init script returning a different
> > error code, though (as we've done so far), I don't see any problem.
> > 
> 
> I haven't tried your patch yet, but can it ever return 0? It seems to 
> set it to 3 initially, and then change it to 1 if it finds an error. I 
> can't see that it ever sets it to 0 indicating that daemons are running. 
> Easy enough to fix by setting the EXIT_STATUS=0 after the check of 
> daemon_is_running, I think, but it still doesn't allow for the case 
> where there are three OSD's, one is running, one is stopped, and one is 
> failed. The EXIT_STATUS in that case appears to be based on the last 
> daemon checked, eg basically random.

What should it return in that case?

sage

> 
> > > Do idle mds's add any load to the system? Would it be useful to be 
> > > able to have pacemaker bring up mds's on any two nodes so you always 
> > > have exactly two running, without actually tying them to specific 
> > > nodes?
> > 
> > They don't add much load when they are standby, but they will if they end
> > up taking over.  There is also no reason to say 'exactly two' IMO.  If you
> > have a symmetric cluster I would be more inclined to run one on every node
> > for simplicity, recognizing that the active one will use some resources.
> > 
> 
> Thanks for that clarification
> 
> James
> 
> 
James Harper Aug. 9, 2013, 7:09 a.m. UTC | #5
> > I haven't tried your patch yet, but can it ever return 0? It seems to
> > set it to 3 initially, and then change it to 1 if it finds an error. I
> > can't see that it ever sets it to 0 indicating that daemons are running.
> > Easy enough to fix by setting the EXIT_STATUS=0 after the check of
> > daemon_is_running, I think, but it still doesn't allow for the case
> > where there are three OSD's, one is running, one is stopped, and one is
> > failed. The EXIT_STATUS in that case appears to be based on the last
> > daemon checked, eg basically random.
> 
> What should it return in that case?
> 

I've been thinking about this some more and I'm still not sure. I think my patch says:
if _any_ are in error then return 1
else if any are running return 0
else if all are stopped return 3

But I think this still won't have the desired outcome if you have two OSDs. The possible situations if the resource is supposed to be running are:
. Both running => all good, pacemaker will do nothing
. Both stopped => all good, pacemaker will start the services
. One stopped one running => not good, pacemaker won't make any effort to start services
. One in error, one running => not good. I'm not sure exactly what will happen but it won't be what you expect.

The only solution I can see is to manage the services individually, in which case the init.d script with your patch + setting to 0 if running does the right thing anyway.

James
Sage Weil Aug. 9, 2013, 3:19 p.m. UTC | #6
On Fri, 9 Aug 2013, James Harper wrote:
> > > I haven't tried your patch yet, but can it ever return 0? It seems to
> > > set it to 3 initially, and then change it to 1 if it finds an error. I
> > > can't see that it ever sets it to 0 indicating that daemons are running.
> > > Easy enough to fix by setting the EXIT_STATUS=0 after the check of
> > > daemon_is_running, I think, but it still doesn't allow for the case
> > > where there are three OSD's, one is running, one is stopped, and one is
> > > failed. The EXIT_STATUS in that case appears to be based on the last
> > > daemon checked, eg basically random.
> > 
> > What should it return in that case?
> > 
> 
> I've been thinking about this some more and I'm still not sure. I think my patch says:
> if _any_ are in error then return 1
> else if any are running return 0
> else if all are stopped return 3
> 
> But I think this still won't have the desired outcome if you have 2 OSD's. The possible situations if the resource is supposed to be running are:
> . Both running => all good, pacemaker will do nothing
> . Both stopped => all good, pacemaker will start the services
> . One stopped one running => not good, pacemaker won't make any effort to start services

If one daemon is stopped and one is running, returning 'not running' seems 
ok to me, since 'start' at that point will do the right thing.

> . One in error, one running => not good. I'm not sure exactly what will happen but it won't be what you expect.

I think it's fine for this to be an error condition.

> 
> The only solution I can see is to manage the services individually, in 
> which case the init.d script with your patch + setting to 0 if running 
> does the right thing anyway.

Yeah, managing individually is probably the most robust, but if it works 
well enough in the generic configuration with no customization that is 
good.

Anyway, I'm fine with whatever variation of your original or my patch you 
think addresses this.  A comment block in the init-ceph script documenting 
what the return codes mean (similar to the above) would be nice so that 
it is clear to the next person who comes along.

Thanks!
sage
James Harper Aug. 9, 2013, 11:52 p.m. UTC | #7
> > But I think this still won't have the desired outcome if you have 2 OSD's.
> > The possible situations if the resource is supposed to be running are:
> > . Both running => all good, pacemaker will do nothing
> > . Both stopped => all good, pacemaker will start the services
> > . One stopped one running => not good, pacemaker won't make any effort
> > to start services
> 
> If one daemon is stopped and one is running, returning 'not running' seems
> ok to me, since 'start' at that point will do the right thing.

Maybe. If the stopped daemon is stopped because it fails to start then pacemaker might get unhappy when subsequent starts also fail, and might even get STONITHy.

> > . One in error, one running => not good. I'm not sure exactly what will
> > happen but it won't be what you expect.
> 
> I think it's fine for this to be an error condition.

Again, if pacemaker sees the error it might start doing things you don't want.

Technically, for actual clustered resources, returning "not running" when something is running is about the worst thing you can do, because pacemaker might then start the resource on another node (e.g. start a VM on two nodes at once, corrupting the fs). The way you'd set this up for ceph, though, is just a cloned resource on each node, so it wouldn't matter anyway.

> >
> > The only solution I can see is to manage the services individually, in
> > which case the init.d script with your patch + setting to 0 if running
> > does the right thing anyway.
> 
> Yeah, managing individually is probably the most robust, but if it works
> well enough in the generic configuration with no customization that is
> good.

Actually it subsequently occurred to me that if I set them up individually then my dependencies will break (eg start ceph before mounting ceph-fs) because there are now different ceph instances per node.

> 
> Anyway, I'm fine with whatever variation of your original or my patch you
> think addresses this.  A comment block in the init-ceph script documenting
> what the return codes mean (similar to the above) would be nice so that
> it is clear to the next person who comes along.
> 

I might post on the pacemaker list and see what the thoughts are there.

Maybe it would be better for me to just re-order the init.d scripts so ceph starts in init.d and leave it at that...

James
Sage Weil Aug. 10, 2013, 12:35 a.m. UTC | #8
On Fri, 9 Aug 2013, James Harper wrote:
> > > But I think this still won't have the desired outcome if you have 2 OSD's.
> > > The possible situations if the resource is supposed to be running are:
> > > . Both running => all good, pacemaker will do nothing
> > > . Both stopped => all good, pacemaker will start the services
> > > . One stopped one running => not good, pacemaker won't make any effort
> > > to start services
> > 
> > If one daemon is stopped and one is running, returning 'not running' seems
> > ok to me, since 'start' at that point will do the right thing.
> 
> Maybe. If the stopped daemon is stopped because it fails to start then pacemaker might get unhappy when subsequent starts also fail, and might even get STONITHy.

This is sounding more like we're trying to fit a square peg in a round 
hole.  Generally speaking there is *never* any need for anything that 
resembles STONITH with Ceph; all of that is handled internally by Ceph 
itself.

I think the only real reason why you would want to use pacemaker here is 
if you just like it better than the normal startup scripts, or perhaps 
because you are using it to control where the standby mdss run.  So maybe 
we are barking up the wrong tree...

sage


> 
> > > . One in error, one running => not good. I'm not sure exactly what will
> > > happen but it won't be what you expect.
> > 
> > I think it's fine for this to be an error condition.
> 
> Again, if pacemaker sees the error it might start doing things you don't want.
> 
> Technically, for actual clustered resources, returning "not running" when something is running is about the worst thing you can do because pacemaker might then start up the resource on another node (eg start a VM on two nodes at once, corrupting the fs). The way you'd set this up for ceph though is just a cloned resource on each node so it wouldn't matter anyway.
> 
> > >
> > > The only solution I can see is to manage the services individually, in
> > > which case the init.d script with your patch + setting to 0 if running
> > > does the right thing anyway.
> > 
> > Yeah, managing individually is probably the most robust, but if it works
> > well enough in the generic configuration with no customization that is
> > good.
> 
> Actually it subsequently occurred to me that if I set them up individually then my dependencies will break (eg start ceph before mounting ceph-fs) because there are now different ceph instances per node.
> 
> > 
> > Anyway, I'm fine with whatever variation of your original or my patch you
> > think addresses this.  A comment block in the init-ceph script documenting
> > what the return codes mean (similar to the above) would be nice so that
> > it is clear to the next person who comes along.
> > 
> 
> I might post on the pacemaker list and see what the thoughts are there.
> 
> Maybe it would be better for me to just re-order the init.d scripts so ceph starts in init.d and leave it at that...
> 
> James
> 
> 

Patch

diff --git a/src/init-ceph.in b/src/init-ceph.in
index 8eb02f8..be5565c 100644
--- a/src/init-ceph.in
+++ b/src/init-ceph.in
@@ -165,6 +165,12 @@  verify_conf
 command=$1
 [ -n "$*" ] && shift
 
+if [ "$command" = "status" ]; then
+    # nothing defined for this host => not running; we'll use this if we
+    # don't check anything below.
+    EXIT_STATUS=3
+fi
+
 get_local_name_list
 get_name_list "$@"