[2/3] ceph: add method that forces client to reconnect using new entity addr

Message ID 20190531122802.12814-2-zyan@redhat.com (mailing list archive)
State New, archived
Series [1/3] libceph: add function that reset client's entity addr

Commit Message

Yan, Zheng May 31, 2019, 12:28 p.m. UTC
echo force_reconnect > /sys/kernel/debug/ceph/xxx/control

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/debugfs.c    | 34 +++++++++++++++++++++++++++++++++-
 fs/ceph/mds_client.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 fs/ceph/mds_client.h |  1 +
 fs/ceph/super.h      |  1 +
 4 files changed, 75 insertions(+), 2 deletions(-)

Comments

Ilya Dryomov May 31, 2019, 2:18 p.m. UTC | #1
On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote:
>
> echo force_reconnect > /sys/kernel/debug/ceph/xxx/control
>
> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>

Hi Zheng,

There should be an explanation in the commit message of what this is
and why it is needed.

I'm assuming the use case is recovering a blacklisted mount, but what
is the intended semantics?  What happens to in-flight OSD requests,
MDS requests, open files, etc?  These are things that should really be
written down.

Looking at the previous patch, it appears that in-flight OSD requests
are simply retried, as they would be on a regular connection fault.  Is
that safe?

Thanks,

                Ilya
Yan, Zheng June 3, 2019, 1:49 p.m. UTC | #2
On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote:
> >
> > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control
> >
> > Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>
> Hi Zheng,
>
> There should be an explanation in the commit message of what this is
> and why it is needed.
>
> I'm assuming the use case is recovering a blacklisted mount, but what
> is the intended semantics?  What happens to in-flight OSD requests,
> MDS requests, open files, etc?  These are things that should really be
> written down.
>
got it

> Looking at the previous patch, it appears that in-flight OSD requests
> are simply retried, as they would be on a regular connection fault.  Is
> that safe?
>

It's not safe. I'm still thinking about how to handle dirty data and
in-flight osd requests in this case.

Regards
Yan, Zheng

> Thanks,
>
>                 Ilya
Gregory Farnum June 3, 2019, 5:54 p.m. UTC | #3
On Mon, Jun 3, 2019 at 6:51 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote:
> > >
> > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control
> > >
> > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >
> > Hi Zheng,
> >
> > There should be an explanation in the commit message of what this is
> > and why it is needed.
> >
> > I'm assuming the use case is recovering a blacklisted mount, but what
> > is the intended semantics?  What happens to in-flight OSD requests,
> > MDS requests, open files, etc?  These are things that should really be
> > written down.
> >
> got it
>
> > Looking at the previous patch, it appears that in-flight OSD requests
> > are simply retried, as they would be on a regular connection fault.  Is
> > that safe?
> >
>
> It's not safe. I still thinking about how to handle dirty data and
> in-flight osd requests in the this case.

Can we figure out the consistency-handling story before we start
adding interfaces for people to mis-use then please?

It's not pleasant but if the client gets disconnected I'd assume we
have to just return EIO or something on all outstanding writes and
toss away our dirty data. There's not really another option that makes
any sense, is there?
-Greg

>
> Regards
> Yan, Zheng
>
> > Thanks,
> >
> >                 Ilya
Ilya Dryomov June 3, 2019, 8:07 p.m. UTC | #4
On Mon, Jun 3, 2019 at 7:54 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Mon, Jun 3, 2019 at 6:51 AM Yan, Zheng <ukernel@gmail.com> wrote:
> >
> > On Fri, May 31, 2019 at 10:20 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > >
> > > On Fri, May 31, 2019 at 2:30 PM Yan, Zheng <zyan@redhat.com> wrote:
> > > >
> > > > echo force_reconnect > /sys/kernel/debug/ceph/xxx/control
> > > >
> > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > >
> > > Hi Zheng,
> > >
> > > There should be an explanation in the commit message of what this is
> > > and why it is needed.
> > >
> > > I'm assuming the use case is recovering a blacklisted mount, but what
> > > is the intended semantics?  What happens to in-flight OSD requests,
> > > MDS requests, open files, etc?  These are things that should really be
> > > written down.
> > >
> > got it
> >
> > > Looking at the previous patch, it appears that in-flight OSD requests
> > > are simply retried, as they would be on a regular connection fault.  Is
> > > that safe?
> > >
> >
> > It's not safe. I still thinking about how to handle dirty data and
> > in-flight osd requests in the this case.
>
> Can we figure out the consistency-handling story before we start
> adding interfaces for people to mis-use then please?
>
> It's not pleasant but if the client gets disconnected I'd assume we
> have to just return EIO or something on all outstanding writes and
> toss away our dirty data. There's not really another option that makes
> any sense, is there?

Can we also discuss how useful it is to allow recovering a mount after
it has been blacklisted?  After we fail everything with EIO and throw out
all dirty state, how many applications would continue working without
some kind of restart?  And if you are restarting your application, why
not get a new mount?

IOW what is the use case for introducing a new debugfs knob that isn't
that much different from umount+mount?

Thanks,

                Ilya
Gregory Farnum June 3, 2019, 8:22 p.m. UTC | #5
On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> Can we also discuss how useful is allowing to recover a mount after it
> has been blacklisted?  After we fail everything with EIO and throw out
> all dirty state, how many applications would continue working without
> some kind of restart?  And if you are restarting your application, why
> not get a new mount?
>
> IOW what is the use case for introducing a new debugfs knob that isn't
> that much different from umount+mount?

People don't like it when their filesystem refuses to umount, which is
what happens when the kernel client can't reconnect to the MDS right
now. I'm not sure there's a practical way to deal with that besides
some kind of computer admin intervention. (Even if you umount -l, that
by design doesn't reply to syscalls and let the applications exit.)
So it's not that we expect most applications to work, but we need to
give them *something* that isn't a successful return, and we don't
currently do that automatically on a disconnect. (And probably don't
want to?)
Patrick Donnelly June 3, 2019, 9:05 p.m. UTC | #6
On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > Can we also discuss how useful is allowing to recover a mount after it
> > has been blacklisted?  After we fail everything with EIO and throw out
> > all dirty state, how many applications would continue working without
> > some kind of restart?  And if you are restarting your application, why
> > not get a new mount?
> >
> > IOW what is the use case for introducing a new debugfs knob that isn't
> > that much different from umount+mount?
>
> People don't like it when their filesystem refuses to umount, which is
> what happens when the kernel client can't reconnect to the MDS right
> now. I'm not sure there's a practical way to deal with that besides
> some kind of computer admin intervention.

Furthermore, there are often many applications using the mount (even
with containers) and it's not a sustainable position that any
network/client/cephfs hiccup requires a remount. Also, an application
that fails because of EIO is easy to deal with at a layer above, but a
remount usually requires grumpy admin intervention.
Ilya Dryomov June 3, 2019, 9:18 p.m. UTC | #7
On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > Can we also discuss how useful is allowing to recover a mount after it
> > has been blacklisted?  After we fail everything with EIO and throw out
> > all dirty state, how many applications would continue working without
> > some kind of restart?  And if you are restarting your application, why
> > not get a new mount?
> >
> > IOW what is the use case for introducing a new debugfs knob that isn't
> > that much different from umount+mount?
>
> People don't like it when their filesystem refuses to umount, which is
> what happens when the kernel client can't reconnect to the MDS right
> now. I'm not sure there's a practical way to deal with that besides
> some kind of computer admin intervention. (Even if you umount -l, that
> by design doesn't reply to syscalls and let the applications exit.)

Well, that is what I'm saying: if an admin intervention is required
anyway, then why not make it be umount+mount?  That is certainly more
intuitive than an obscure write-only file in debugfs...

We have umount -f, which is there for tearing down a mount that is
unresponsive.  It should be able to deal with a blacklisted mount; if
it can't, it's probably a bug.

Thanks,

                Ilya
Yan, Zheng June 4, 2019, 2:10 a.m. UTC | #8
On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> >
> > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > Can we also discuss how useful is allowing to recover a mount after it
> > > has been blacklisted?  After we fail everything with EIO and throw out
> > > all dirty state, how many applications would continue working without
> > > some kind of restart?  And if you are restarting your application, why
> > > not get a new mount?
> > >
> > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > that much different from umount+mount?
> >
> > People don't like it when their filesystem refuses to umount, which is
> > what happens when the kernel client can't reconnect to the MDS right
> > now. I'm not sure there's a practical way to deal with that besides
> > some kind of computer admin intervention. (Even if you umount -l, that
> > by design doesn't reply to syscalls and let the applications exit.)
>
> Well, that is what I'm saying: if an admin intervention is required
> anyway, then why not make it be umount+mount?  That is certainly more
> intuitive than an obscure write-only file in debugfs...
>

I think  'umount -f' + 'mount -o remount' is better than the debugfs file


> We have umount -f, which is there for tearing down a mount that is
> unresponsive.  It should be able to deal with a blacklisted mount, if
> it can't it's probably a bug.
>
> Thanks,
>
>                 Ilya
Ilya Dryomov June 4, 2019, 9:25 a.m. UTC | #9
On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > >
> > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > all dirty state, how many applications would continue working without
> > > > some kind of restart?  And if you are restarting your application, why
> > > > not get a new mount?
> > > >
> > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > that much different from umount+mount?
> > >
> > > People don't like it when their filesystem refuses to umount, which is
> > > what happens when the kernel client can't reconnect to the MDS right
> > > now. I'm not sure there's a practical way to deal with that besides
> > > some kind of computer admin intervention. (Even if you umount -l, that
> > > by design doesn't reply to syscalls and let the applications exit.)
> >
> > Well, that is what I'm saying: if an admin intervention is required
> > anyway, then why not make it be umount+mount?  That is certainly more
> > intuitive than an obscure write-only file in debugfs...
> >
>
> I think  'umount -f' + 'mount -o remount' is better than the debugfs file

Why '-o remount'?  I wouldn't expect 'umount -f' to leave behind any
actionable state; it should tear down all data structures, mount point,
etc.  What would '-o remount' act on?

Thanks,

                Ilya
Ilya Dryomov June 4, 2019, 9:37 a.m. UTC | #10
On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
>
> On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> >
> > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > Can we also discuss how useful is allowing to recover a mount after it
> > > has been blacklisted?  After we fail everything with EIO and throw out
> > > all dirty state, how many applications would continue working without
> > > some kind of restart?  And if you are restarting your application, why
> > > not get a new mount?
> > >
> > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > that much different from umount+mount?
> >
> > People don't like it when their filesystem refuses to umount, which is
> > what happens when the kernel client can't reconnect to the MDS right
> > now. I'm not sure there's a practical way to deal with that besides
> > some kind of computer admin intervention.
>
> Furthermore, there are often many applications using the mount (even
> with containers) and it's not a sustainable position that any
> network/client/cephfs hiccup requires a remount. Also, an application

Well, it's not just any hiccup.  It's one that led to blacklisting...

> that fails because of EIO is easy to deal with a layer above but a
> remount usually requires grump admin intervention.

I feel like I'm missing something here.  Would figuring out $ID,
obtaining root and echoing to /sys/kernel/debug/$ID/control make the
admin less grumpy, especially when containers are involved?

Doing the force_reconnect thing would retain the mount point, but how
much use would it be?  Would using existing (i.e. pre-blacklist) file
descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
something of that sort), so maybe that is the piece I'm missing...

Thanks,

                Ilya
Jeff Layton June 4, 2019, 10:50 a.m. UTC | #11
On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > all dirty state, how many applications would continue working without
> > > > some kind of restart?  And if you are restarting your application, why
> > > > not get a new mount?
> > > > 
> > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > that much different from umount+mount?
> > > 
> > > People don't like it when their filesystem refuses to umount, which is
> > > what happens when the kernel client can't reconnect to the MDS right
> > > now. I'm not sure there's a practical way to deal with that besides
> > > some kind of computer admin intervention.
> > 
> > Furthermore, there are often many applications using the mount (even
> > with containers) and it's not a sustainable position that any
> > network/client/cephfs hiccup requires a remount. Also, an application
> 
> Well, it's not just any hiccup.  It's one that lead to blacklisting...
> 
> > that fails because of EIO is easy to deal with a layer above but a
> > remount usually requires grump admin intervention.
> 
> I feel like I'm missing something here.  Would figuring out $ID,
> obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> admin less grumpy, especially when containers are involved?
> 
> Doing the force_reconnect thing would retain the mount point, but how
> much use would it be?  Would using existing (i.e. pre-blacklist) file
> descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> something of that sort), so maybe that is the piece I'm missing...
> 

I agree with Ilya here. I don't see how applications can just pick up
where they left off after being blacklisted. Remounting in some fashion
is really the only recourse here.

To be clear, what happens to stateful objects (open files, byte-range
locks, etc.) in this scenario? Were you planning to just re-open files
and re-request locks that you held before being blacklisted? If so, that
sounds like a great way to cause some silent data corruption...
Yan, Zheng June 4, 2019, 12:01 p.m. UTC | #12
On Tue, Jun 4, 2019 at 5:25 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote:
> >
> > On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > >
> > > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > all dirty state, how many applications would continue working without
> > > > > some kind of restart?  And if you are restarting your application, why
> > > > > not get a new mount?
> > > > >
> > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > that much different from umount+mount?
> > > >
> > > > People don't like it when their filesystem refuses to umount, which is
> > > > what happens when the kernel client can't reconnect to the MDS right
> > > > now. I'm not sure there's a practical way to deal with that besides
> > > > some kind of computer admin intervention. (Even if you umount -l, that
> > > > by design doesn't reply to syscalls and let the applications exit.)
> > >
> > > Well, that is what I'm saying: if an admin intervention is required
> > > anyway, then why not make it be umount+mount?  That is certainly more
> > > intuitive than an obscure write-only file in debugfs...
> > >
> >
> > I think  'umount -f' + 'mount -o remount' is better than the debugfs file
>
> Why '-o remount'?  I wouldn't expect 'umount -f' to leave behind any
> actionable state, it should tear down all data structures, mount point,
> etc.  What would '-o remount' act on?
>

If the mount point is in use, 'umount -f' only closes mds sessions and
aborts osd requests. The mount point is still there, and any operation
on it will return -EIO. The remount changes the mount point back to the
normal state.

> Thanks,
>
>                 Ilya
Patrick Donnelly June 5, 2019, 9:57 p.m. UTC | #13
On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > all dirty state, how many applications would continue working without
> > > > > some kind of restart?  And if you are restarting your application, why
> > > > > not get a new mount?
> > > > >
> > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > that much different from umount+mount?
> > > >
> > > > People don't like it when their filesystem refuses to umount, which is
> > > > what happens when the kernel client can't reconnect to the MDS right
> > > > now. I'm not sure there's a practical way to deal with that besides
> > > > some kind of computer admin intervention.
> > >
> > > Furthermore, there are often many applications using the mount (even
> > > with containers) and it's not a sustainable position that any
> > > network/client/cephfs hiccup requires a remount. Also, an application
> >
> > Well, it's not just any hiccup.  It's one that lead to blacklisting...
> >
> > > that fails because of EIO is easy to deal with a layer above but a
> > > remount usually requires grump admin intervention.
> >
> > I feel like I'm missing something here.  Would figuring out $ID,
> > obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> > admin less grumpy, especially when containers are involved?
> >
> > Doing the force_reconnect thing would retain the mount point, but how
> > much use would it be?  Would using existing (i.e. pre-blacklist) file
> > descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> > something of that sort), so maybe that is the piece I'm missing...
> >
>
> I agree with Ilya here. I don't see how applications can just pick up
> where they left off after being blacklisted. Remounting in some fashion
> is really the only recourse here.
>
> To be clear, what happens to stateful objects (open files, byte-range
> locks, etc.) in this scenario? Were you planning to just re-open files
> and re-request locks that you held before being blacklisted? If so, that
> sounds like a great way to cause some silent data corruption...

The plan is:

- files open for reading re-obtain caps and may continue to be used
- files open for writing discard all dirty file blocks and return -EIO
on further use (this could be configurable via a mount_option like
with the ceph-fuse client)

Not sure how best to handle locks and I'm open to suggestions. We
could raise SIGLOST on those processes?
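
To make the read/write split above concrete, a rough sketch (not code
from this series; CEPH_F_WRITE_LOST is an assumed new ceph_file_info
flag, and the walk over open files at reconnect time is hand-waved):

static void ceph_fence_open_file(struct ceph_file_info *fi)
{
	/*
	 * Writable fds had their dirty pages dropped at reconnect time,
	 * so any further I/O on them is a hard error; read-only fds just
	 * re-obtain caps and keep working.
	 */
	if (fi->fmode & CEPH_FILE_MODE_WR)
		fi->flags |= CEPH_F_WRITE_LOST;
}

/* and at the top of the write path, something along these lines: */
static ssize_t ceph_checked_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct ceph_file_info *fi = iocb->ki_filp->private_data;

	if (fi->flags & CEPH_F_WRITE_LOST)
		return -EIO;
	return ceph_write_iter(iocb, from);	/* the existing write path */
}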
Jeff Layton June 5, 2019, 10:26 p.m. UTC | #14
On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote:
> On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote:
> > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > > all dirty state, how many applications would continue working without
> > > > > > some kind of restart?  And if you are restarting your application, why
> > > > > > not get a new mount?
> > > > > > 
> > > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > > that much different from umount+mount?
> > > > > 
> > > > > People don't like it when their filesystem refuses to umount, which is
> > > > > what happens when the kernel client can't reconnect to the MDS right
> > > > > now. I'm not sure there's a practical way to deal with that besides
> > > > > some kind of computer admin intervention.
> > > > 
> > > > Furthermore, there are often many applications using the mount (even
> > > > with containers) and it's not a sustainable position that any
> > > > network/client/cephfs hiccup requires a remount. Also, an application
> > > 
> > > Well, it's not just any hiccup.  It's one that lead to blacklisting...
> > > 
> > > > that fails because of EIO is easy to deal with a layer above but a
> > > > remount usually requires grump admin intervention.
> > > 
> > > I feel like I'm missing something here.  Would figuring out $ID,
> > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> > > admin less grumpy, especially when containers are involved?
> > > 
> > > Doing the force_reconnect thing would retain the mount point, but how
> > > much use would it be?  Would using existing (i.e. pre-blacklist) file
> > > descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> > > something of that sort), so maybe that is the piece I'm missing...
> > > 
> > 
> > I agree with Ilya here. I don't see how applications can just pick up
> > where they left off after being blacklisted. Remounting in some fashion
> > is really the only recourse here.
> > 
> > To be clear, what happens to stateful objects (open files, byte-range
> > locks, etc.) in this scenario? Were you planning to just re-open files
> > and re-request locks that you held before being blacklisted? If so, that
> > sounds like a great way to cause some silent data corruption...
> 
> The plan is:
> 
> - files open for reading re-obtain caps and may continue to be used
> - files open for writing discard all dirty file blocks and return -EIO
> on further use (this could be configurable via a mount_option like
> with the ceph-fuse client)
> 

That sounds fairly reasonable.

> Not sure how best to handle locks and I'm open to suggestions. We
> could raise SIGLOST on those processes?
> 

Unfortunately, SIGLOST has never really been a thing on Linux. There was
an attempt by Anna Schumaker a few years ago to implement it for use
with NFS, but it never went in.

We ended up with this patch, IIRC:

    https://patchwork.kernel.org/patch/10108419/

"The current practice is to set NFS_LOCK_LOST so that read/write returns
 EIO when a lock is lost. So, change these comments to code when sets
NFS_LOCK_LOST."

Maybe we should aim for similar behavior in this situation. It's a
little trickier here since we don't really have an analogue to a lock
stateid in ceph, so we'd need to implement this in some other way.
Patrick Donnelly June 5, 2019, 11:18 p.m. UTC | #15
Apologies for having this discussion in two threads...

On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote:
> > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > > > all dirty state, how many applications would continue working without
> > > > > > > some kind of restart?  And if you are restarting your application, why
> > > > > > > not get a new mount?
> > > > > > >
> > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > > > that much different from umount+mount?
> > > > > >
> > > > > > People don't like it when their filesystem refuses to umount, which is
> > > > > > what happens when the kernel client can't reconnect to the MDS right
> > > > > > now. I'm not sure there's a practical way to deal with that besides
> > > > > > some kind of computer admin intervention.
> > > > >
> > > > > Furthermore, there are often many applications using the mount (even
> > > > > with containers) and it's not a sustainable position that any
> > > > > network/client/cephfs hiccup requires a remount. Also, an application
> > > >
> > > > Well, it's not just any hiccup.  It's one that lead to blacklisting...
> > > >
> > > > > that fails because of EIO is easy to deal with a layer above but a
> > > > > remount usually requires grump admin intervention.
> > > >
> > > > I feel like I'm missing something here.  Would figuring out $ID,
> > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> > > > admin less grumpy, especially when containers are involved?
> > > >
> > > > Doing the force_reconnect thing would retain the mount point, but how
> > > > much use would it be?  Would using existing (i.e. pre-blacklist) file
> > > > descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> > > > something of that sort), so maybe that is the piece I'm missing...
> > > >
> > >
> > > I agree with Ilya here. I don't see how applications can just pick up
> > > where they left off after being blacklisted. Remounting in some fashion
> > > is really the only recourse here.
> > >
> > > To be clear, what happens to stateful objects (open files, byte-range
> > > locks, etc.) in this scenario? Were you planning to just re-open files
> > > and re-request locks that you held before being blacklisted? If so, that
> > > sounds like a great way to cause some silent data corruption...
> >
> > The plan is:
> >
> > - files open for reading re-obtain caps and may continue to be used
> > - files open for writing discard all dirty file blocks and return -EIO
> > on further use (this could be configurable via a mount_option like
> > with the ceph-fuse client)
> >
>
> That sounds fairly reasonable.
>
> > Not sure how best to handle locks and I'm open to suggestions. We
> > could raise SIGLOST on those processes?
> >
>
> Unfortunately, SIGLOST has never really been a thing on Linux. There was
> an attempt by Anna Schumaker a few years ago to implement it for use
> with NFS, but it never went in.

Is there another signal we could reasonably use?

> We ended up with this patch, IIRC:
>
>     https://patchwork.kernel.org/patch/10108419/
>
> "The current practice is to set NFS_LOCK_LOST so that read/write returns
>  EIO when a lock is lost. So, change these comments to code when sets
> NFS_LOCK_LOST."
>
> Maybe we should aim for similar behavior in this situation. It's a
> little tricker here since we don't really have an analogue to a lock
> stateid in ceph, so we'd need to implement this in some other way.

So effectively blacklist the process so all I/O is blocked on the
mount? Do I understand correctly?
Jeff Layton June 5, 2019, 11:43 p.m. UTC | #16
On Wed, 2019-06-05 at 16:18 -0700, Patrick Donnelly wrote:
> Apologies for having this discussion in two threads...
> 
> On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@redhat.com> wrote:
> > On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote:
> > > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > > > > all dirty state, how many applications would continue working without
> > > > > > > > some kind of restart?  And if you are restarting your application, why
> > > > > > > > not get a new mount?
> > > > > > > > 
> > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > > > > that much different from umount+mount?
> > > > > > > 
> > > > > > > People don't like it when their filesystem refuses to umount, which is
> > > > > > > what happens when the kernel client can't reconnect to the MDS right
> > > > > > > now. I'm not sure there's a practical way to deal with that besides
> > > > > > > some kind of computer admin intervention.
> > > > > > 
> > > > > > Furthermore, there are often many applications using the mount (even
> > > > > > with containers) and it's not a sustainable position that any
> > > > > > network/client/cephfs hiccup requires a remount. Also, an application
> > > > > 
> > > > > Well, it's not just any hiccup.  It's one that lead to blacklisting...
> > > > > 
> > > > > > that fails because of EIO is easy to deal with a layer above but a
> > > > > > remount usually requires grump admin intervention.
> > > > > 
> > > > > I feel like I'm missing something here.  Would figuring out $ID,
> > > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> > > > > admin less grumpy, especially when containers are involved?
> > > > > 
> > > > > Doing the force_reconnect thing would retain the mount point, but how
> > > > > much use would it be?  Would using existing (i.e. pre-blacklist) file
> > > > > descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> > > > > something of that sort), so maybe that is the piece I'm missing...
> > > > > 
> > > > 
> > > > I agree with Ilya here. I don't see how applications can just pick up
> > > > where they left off after being blacklisted. Remounting in some fashion
> > > > is really the only recourse here.
> > > > 
> > > > To be clear, what happens to stateful objects (open files, byte-range
> > > > locks, etc.) in this scenario? Were you planning to just re-open files
> > > > and re-request locks that you held before being blacklisted? If so, that
> > > > sounds like a great way to cause some silent data corruption...
> > > 
> > > The plan is:
> > > 
> > > - files open for reading re-obtain caps and may continue to be used
> > > - files open for writing discard all dirty file blocks and return -EIO
> > > on further use (this could be configurable via a mount_option like
> > > with the ceph-fuse client)
> > > 
> > 
> > That sounds fairly reasonable.
> > 
> > > Not sure how best to handle locks and I'm open to suggestions. We
> > > could raise SIGLOST on those processes?
> > > 
> > 
> > Unfortunately, SIGLOST has never really been a thing on Linux. There was
> > an attempt by Anna Schumaker a few years ago to implement it for use
> > with NFS, but it never went in.
> 
> Is there another signal we could reasonably use?
> 

Not really. The problem is really that SIGLOST is not even defined. In
fact, if you look at the asm-generic/signal.h header:

#define SIGIO           29                                                      
#define SIGPOLL         SIGIO                                                   
/*                                                                              
#define SIGLOST         29                                                      
*/                                                                              

So, there it is, commented out, and it shares a value with SIGIO. We
could pick another value for it, of course, but then you'd have to get
it into userland headers too. All of that sounds like a giant PITA.

> > We ended up with this patch, IIRC:
> > 
> >     https://patchwork.kernel.org/patch/10108419/
> > 
> > "The current practice is to set NFS_LOCK_LOST so that read/write returns
> >  EIO when a lock is lost. So, change these comments to code when sets
> > NFS_LOCK_LOST."
> > 
> > Maybe we should aim for similar behavior in this situation. It's a
> > little tricker here since we don't really have an analogue to a lock
> > stateid in ceph, so we'd need to implement this in some other way.
> 
> So effectively blacklist the process so all I/O is blocked on the
> mount? Do I understand correctly?
> 

No. I think in practice what we'd want to do is "invalidate" any file
descriptions that were open before the blacklisting where locks were
lost. Attempts to do reads or writes against those fd's would get back
an error (EIO, most likely).

File descriptions that didn't have any lost locks could carry on working
as normal after reacquiring caps. We could also consider a module
parameter or something to allow reclaim of lost locks too (in violation
of continuity rules), like the recover_lost_locks parameter in nfs.ko.
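
Something along these lines is what I have in mind, as a sketch only
(CEPH_F_LOCK_LOST and the helper are made up for illustration; today
ceph_file_info only carries CEPH_F_SYNC/CEPH_F_ATEND, and finding the
affected struct files is exactly the part we'd have to invent):

static void ceph_mark_lock_lost(struct inode *inode, struct file *file)
{
	struct ceph_file_info *fi = file->private_data;
	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);

	/* this fd held posix locks when we got blacklisted: fence it off */
	if (ctx && !list_empty_careful(&ctx->flc_posix))
		fi->flags |= CEPH_F_LOCK_LOST;
}

Reads and writes would then fail with -EIO while CEPH_F_LOCK_LOST is
set, much like the NFS_LOCK_LOST behavior above.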
Dan van der Ster June 6, 2019, 9:30 a.m. UTC | #17
On Tue, Jun 4, 2019 at 4:10 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Tue, Jun 4, 2019 at 5:18 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Mon, Jun 3, 2019 at 10:23 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > >
> > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > Can we also discuss how useful is allowing to recover a mount after it
> > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > all dirty state, how many applications would continue working without
> > > > some kind of restart?  And if you are restarting your application, why
> > > > not get a new mount?
> > > >
> > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > that much different from umount+mount?
> > >
> > > People don't like it when their filesystem refuses to umount, which is
> > > what happens when the kernel client can't reconnect to the MDS right
> > > now. I'm not sure there's a practical way to deal with that besides
> > > some kind of computer admin intervention. (Even if you umount -l, that
> > > by design doesn't reply to syscalls and let the applications exit.)
> >
> > Well, that is what I'm saying: if an admin intervention is required
> > anyway, then why not make it be umount+mount?  That is certainly more
> > intuitive than an obscure write-only file in debugfs...
> >
>
> I think  'umount -f' + 'mount -o remount' is better than the debugfs file

A small bit of user input: for some of the places we'd like to use
CephFS we value availability over consistency.
For example, in a large batch processing farm, it is really
inconvenient (and expensive in lost CPU-hours) if an operator needs to
repair thousands of mounts when cephfs breaks (e.g. an mds crash or
whatever). It is preferable to let the apps crash, drop caches,
fh's, whatever else is necessary, and create a new session to the
cluster with the same mount. In this use-case, it doesn't matter if
the files were inconsistent, because a higher-level job scheduler will
retry the job from scratch somewhere else with new output files.
It would be nice if there was a mount option to allow users to choose
this mode (-o soft, for example). Without a mount option, we're forced
to run ugly cron jobs which look for hung mounts and do the necessary.

My 2c,

dan


>
>
> > We have umount -f, which is there for tearing down a mount that is
> > unresponsive.  It should be able to deal with a blacklisted mount, if
> > it can't it's probably a bug.
> >
> > Thanks,
> >
> >                 Ilya

Patch

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index a14d64664878..d65da57406bd 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -210,6 +210,31 @@  CEPH_DEFINE_SHOW_FUNC(mdsc_show)
 CEPH_DEFINE_SHOW_FUNC(caps_show)
 CEPH_DEFINE_SHOW_FUNC(mds_sessions_show)
 
+static ssize_t control_file_write(struct file *file,
+				  const char __user *ubuf,
+				  size_t count, loff_t *ppos)
+{
+	struct ceph_fs_client *fsc = file_inode(file)->i_private;
+	char buf[16];
+	ssize_t len;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, ubuf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (!strcmp(buf, "force_reconnect")) {
+		ceph_mdsc_force_reconnect(fsc->mdsc);
+	} else {
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static const struct file_operations control_file_fops = {
+	.write = control_file_write,
+};
 
 /*
  * debugfs
@@ -233,7 +258,6 @@  static int congestion_kb_get(void *data, u64 *val)
 DEFINE_SIMPLE_ATTRIBUTE(congestion_kb_fops, congestion_kb_get,
 			congestion_kb_set, "%llu\n");
 
-
 void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 {
 	dout("ceph_fs_debugfs_cleanup\n");
@@ -243,6 +267,7 @@  void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_mds_sessions);
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove(fsc->debugfs_control);
 }
 
 int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
@@ -302,6 +327,13 @@  int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 	if (!fsc->debugfs_caps)
 		goto out;
 
+	fsc->debugfs_control = debugfs_create_file("control",
+						   0200,
+						   fsc->client->debugfs_dir,
+						   fsc,
+						   &control_file_fops);
+	if (!fsc->debugfs_control)
+		goto out;
 	return 0;
 
 out:
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f5c3499fdec6..95ee893205c5 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2631,7 +2631,7 @@  static void kick_requests(struct ceph_mds_client *mdsc, int mds)
 		if (req->r_attempts > 0)
 			continue; /* only new requests */
 		if (req->r_session &&
-		    req->r_session->s_mds == mds) {
+		    (mds == -1 || req->r_session->s_mds == mds)) {
 			dout(" kicking tid %llu\n", req->r_tid);
 			list_del_init(&req->r_wait);
 			__do_request(mdsc, req);
@@ -4371,6 +4371,45 @@  void ceph_mdsc_force_umount(struct ceph_mds_client *mdsc)
 	mutex_unlock(&mdsc->mutex);
 }
 
+void ceph_mdsc_force_reconnect(struct ceph_mds_client *mdsc)
+{
+	struct ceph_mds_session *session;
+	int mds;
+	LIST_HEAD(to_wake);
+
+	pr_info("force reconnect\n");
+
+	/* this also resets all mon/osd connections */
+	ceph_reset_client_addr(mdsc->fsc->client);
+
+	mutex_lock(&mdsc->mutex);
+
+	/* reset mds connections */
+	for (mds = 0; mds < mdsc->max_sessions; mds++) {
+		session = __ceph_lookup_mds_session(mdsc, mds);
+		if (!session)
+			continue;
+
+		__unregister_session(mdsc, session);
+		list_splice_init(&session->s_waiting, &to_wake);
+		mutex_unlock(&mdsc->mutex);
+
+		mutex_lock(&session->s_mutex);
+		cleanup_session_requests(mdsc, session);
+		remove_session_caps(session);
+		mutex_unlock(&session->s_mutex);
+
+		ceph_put_mds_session(session);
+		mutex_lock(&mdsc->mutex);
+	}
+
+	list_splice_init(&mdsc->waiting_for_map, &to_wake);
+	__wake_requests(mdsc, &to_wake);
+	kick_requests(mdsc, -1);
+
+	mutex_unlock(&mdsc->mutex);
+}
+
 static void ceph_mdsc_stop(struct ceph_mds_client *mdsc)
 {
 	dout("stop\n");
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 330769ecb601..125e26895f14 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -457,6 +457,7 @@  extern int ceph_send_msg_mds(struct ceph_mds_client *mdsc,
 extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
 extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
 extern void ceph_mdsc_force_umount(struct ceph_mds_client *mdsc);
+extern void ceph_mdsc_force_reconnect(struct ceph_mds_client *mdsc);
 extern void ceph_mdsc_destroy(struct ceph_fs_client *fsc);
 
 extern void ceph_mdsc_sync(struct ceph_mds_client *mdsc);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9c82d213a5ab..9ccb6e031988 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -118,6 +118,7 @@  struct ceph_fs_client {
 	struct dentry *debugfs_bdi;
 	struct dentry *debugfs_mdsc, *debugfs_mdsmap;
 	struct dentry *debugfs_mds_sessions;
+	struct dentry *debugfs_control;
 #endif
 
 #ifdef CONFIG_CEPH_FSCACHE