[8/8] ceph: return -EIO if read/write against filp that lost file locks

Message ID 20190617125529.6230-9-zyan@redhat.com (mailing list archive)
State New, archived
Series ceph: remount aborted mount

Commit Message

Yan, Zheng June 17, 2019, 12:55 p.m. UTC
After the MDS evicts a session, file locks are lost silently. It's not safe to
let programs continue to do read/write.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/caps.c  | 28 ++++++++++++++++++++++------
 fs/ceph/locks.c |  8 ++++++--
 fs/ceph/super.h |  1 +
 3 files changed, 29 insertions(+), 8 deletions(-)

Comments

Patrick Donnelly June 17, 2019, 8:32 p.m. UTC | #1
On Mon, Jun 17, 2019 at 5:56 AM Yan, Zheng <zyan@redhat.com> wrote:
>
> After mds evicts session, file locks get lost sliently. It's not safe to
> let programs continue to do read/write.

I think one issue with returning EIO on a file handle that did hold a
lock is that the application may be programmed to write to other files
assuming that lock is never lost, yes? In that case, it may not ever
write to the locked file in any case.

Again, I'd like to see SIGLOST sent to the application here. Are there
any objections to reviving whatever patchset was in flight to add
that? Or just write up a new one?
Jeff Layton June 17, 2019, 8:45 p.m. UTC | #2
On Mon, 2019-06-17 at 13:32 -0700, Patrick Donnelly wrote:
> On Mon, Jun 17, 2019 at 5:56 AM Yan, Zheng <zyan@redhat.com> wrote:
> > After mds evicts session, file locks get lost sliently. It's not safe to
> > let programs continue to do read/write.
> 
> I think one issue with returning EIO on a file handle that did hold a
> lock is that the application may be programmed to write to other files
> assuming that lock is never lost, yes? In that case, it may not ever
> write to the locked file in any case.
> 

Sure, applications do all sorts of crazy things. We can only cater to so
much craziness. I'll assert that most applications don't have these
sorts of weirdo usage patterns, and an error on read/write is more
reasonable.

Note that this behavior is already documented in fcntl(2) as well, as is
SIGLOST not being implemented.

> Again, I'd like to see SIGLOST sent to the application here. Are there
> any objections to reviving whatever patchset was in flight to add
> that? Or just writeup a new one?
> 

I think SIGLOST's utility is somewhat questionable. Applications will
need to be custom-written to handle it. If you're going to do that, then
it may be better to consider other async notification mechanisms.
inotify or fanotify, perhaps? Those may be simpler to implement and get
merged.
Gregory Farnum June 20, 2019, 4:59 p.m. UTC | #3
On Mon, Jun 17, 2019 at 1:46 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Mon, 2019-06-17 at 13:32 -0700, Patrick Donnelly wrote:
> > On Mon, Jun 17, 2019 at 5:56 AM Yan, Zheng <zyan@redhat.com> wrote:
> > > After mds evicts session, file locks get lost sliently. It's not safe to
> > > let programs continue to do read/write.
> >
> > I think one issue with returning EIO on a file handle that did hold a
> > lock is that the application may be programmed to write to other files
> > assuming that lock is never lost, yes? In that case, it may not ever
> > write to the locked file in any case.
> >
>
> Sure, applications do all sorts of crazy things. We can only cater to so
> much craziness. I'll assert that most applications don't have these
> sorts of weirdo usage patterns, and an error on read/write is more
> reasonable.

It wouldn't surprise me if it's a niche use case, but I hear about a
*lot* of applications which know they're running on a distributed fs
using file locks as a primitive leader election to select amongst
multiple processes (this happens a lot not only in HPC, but also in
distributed database and/or compute projects). That tends to involve
precisely what Patrick is describing.

Given that, as you've said, SIGLOST doesn't actually exist in Linux, I
don't have a good alternative, but if there was some more proactive
way we could tell the application (or let applications enable a more
proactive way, like getting EIO on any file access once a lock is
lost?) it would probably be good.

>
> Note that this behavior is already documented in fcntl(2) as well, as is
> SIGLOST not being implemented.
>
> > Again, I'd like to see SIGLOST sent to the application here. Are there
> > any objections to reviving whatever patchset was in flight to add
> > that? Or just writeup a new one?
> >
>
> I think SIGLOST's utility is somewhat questionable. Applications will
> need to be custom-written to handle it. If you're going to do that, then
> it may be better to consider other async notification mechanisms.
> inotify or fanotify, perhaps? Those may be simpler to implement and get
> merged.
> --
> Jeff Layton <jlayton@redhat.com>
>
Patrick Donnelly June 20, 2019, 5:19 p.m. UTC | #4
On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > Again, I'd like to see SIGLOST sent to the application here. Are there
> > any objections to reviving whatever patchset was in flight to add
> > that? Or just writeup a new one?
> >
>
> I think SIGLOST's utility is somewhat questionable. Applications will
> need to be custom-written to handle it. If you're going to do that, then
> it may be better to consider other async notification mechanisms.
> inotify or fanotify, perhaps? Those may be simpler to implement and get
> merged.

The utility of SIGLOST is not well understood from the viewpoint of a
local file system. The problem uniquely applies to distributed file
systems. There simply is no way to recover from a lost lock for an
application through POSIX mechanisms. We really need a new signal to
just kill the application (by default) because recovery cannot be
automatically performed even through system call errors. I don't see
how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
the application will even use those system calls to monitor for lost
locks when POSIX has no provision for that to happen.

It's worth noting as well that the current behavior of the mount
freezing on blacklist is not an acceptable status quo. The application
will just silently stall the next time it tries to access the mount,
if it ever does.
Jeff Layton June 20, 2019, 6:18 p.m. UTC | #5
On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > any objections to reviving whatever patchset was in flight to add
> > > that? Or just writeup a new one?
> > > 
> > 
> > I think SIGLOST's utility is somewhat questionable. Applications will
> > need to be custom-written to handle it. If you're going to do that, then
> > it may be better to consider other async notification mechanisms.
> > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > merged.
> 
> The utility of SIGLOST is not well understood from the viewpoint of a
> local file system. The problem uniquely applies to distributed file
> systems. There simply is no way to recover from a lost lock for an
> application through POSIX mechanisms. We really need a new signal to
> just kill the application (by default) because recovery cannot be
> automatically performed even through system call errors. I don't see
> how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> the application will even use those system calls to monitor for lost
> locks when POSIX has no provision for that to happen.
> 

(cc'ing Anna in case she has an opinion)

SIGLOST isn't defined in POSIX either, so I'm not sure that argument
carries much weight. The _only_ way you'd be able to add SIGLOST is if
it defaults to SIG_IGN, such that only applications that are watching
for it will react to it. Given that, you're already looking at code
modifications.

So, the real question is: what's the best method to watch for lost
locks? I don't have a terribly strong opinion about any of these
notification methods, tbh. I only suggested inotify/fanotify because
they'd likely be much simpler to implement.

Adding signals is a non-trivial affair as we're running out of bits in
that space. The lower 32 bits are all taken and the upper ones are
reserved for realtime signals. My signal.h has a commented out SIGLOST:

#define SIGIO           29
#define SIGPOLL         SIGIO
/*
#define SIGLOST         29
*/

Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
think it'd probably need its own value. Maybe there is some way to have
the application ask that one of the realtime signals be set up for this?

> It's worth noting as well that the current behavior of the mount
> freezing on blacklist is not an acceptable status quo. The application
> will just silently stall the next time it tries to access the mount,
> if it ever does.

Fair enough.
Patrick Donnelly June 20, 2019, 6:51 p.m. UTC | #6
On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > any objections to reviving whatever patchset was in flight to add
> > > > that? Or just writeup a new one?
> > > >
> > >
> > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > need to be custom-written to handle it. If you're going to do that, then
> > > it may be better to consider other async notification mechanisms.
> > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > merged.
> >
> > The utility of SIGLOST is not well understood from the viewpoint of a
> > local file system. The problem uniquely applies to distributed file
> > systems. There simply is no way to recover from a lost lock for an
> > application through POSIX mechanisms. We really need a new signal to
> > just kill the application (by default) because recovery cannot be
> > automatically performed even through system call errors. I don't see
> > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > the application will even use those system calls to monitor for lost
> > locks when POSIX has no provision for that to happen.
> >
>
> (cc'ing Anna in case she has an opinion)
>
> SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> carries much weight. The _only_ way you'd be able to add SIGLOST is if
> it defaults to SIG_IGN, such that only applications that are watching
> for it will react to it. Given that, you're already looking at code
> modifications.

Why does the default need to be SIG_IGN? Is that existing convention
for new signals in Linux?

> So, the real question is: what's the best method to watch for lost
> locks? I don't have a terribly strong opinion about any of these
> notification methods, tbh. I only suggested inotify/fanotify because
> they'd likely be much simpler to implement.
>
> Adding signals is a non-trivial affair as we're running out of bits in
> that space. The lower 32 bits are all taken and the upper ones are
> reserved for realtime signals. My signal.h has a commented out SIGLOST:
>
> #define SIGIO           29
> #define SIGPOLL         SIGIO
> /*
> #define SIGLOST         29
> */
>
> Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> think it'd probably need its own value. Maybe there is some way to have
> the application ask that one of the realtime signals be set up for this?

Well, SIGPOLL is on its way out according to the standards. So SIGIO
looks like what Linux uses instead. Looking at the man page for
signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
new code); si_band==POLL_MSG; and si_fd==<locked fd>. The inotify
interface could then be used to process the event?

The one nit here is that we would be generating SIGIO for regular
files (and directories?) which would be new. It makes sense with what
we want to do though. Also, SIGIO default behavior is to terminate the
process.
Jeff Layton June 20, 2019, 7 p.m. UTC | #7
On Thu, 2019-06-20 at 11:51 -0700, Patrick Donnelly wrote:
> On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > > any objections to reviving whatever patchset was in flight to add
> > > > > that? Or just writeup a new one?
> > > > > 
> > > > 
> > > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > > need to be custom-written to handle it. If you're going to do that, then
> > > > it may be better to consider other async notification mechanisms.
> > > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > > merged.
> > > 
> > > The utility of SIGLOST is not well understood from the viewpoint of a
> > > local file system. The problem uniquely applies to distributed file
> > > systems. There simply is no way to recover from a lost lock for an
> > > application through POSIX mechanisms. We really need a new signal to
> > > just kill the application (by default) because recovery cannot be
> > > automatically performed even through system call errors. I don't see
> > > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > > the application will even use those system calls to monitor for lost
> > > locks when POSIX has no provision for that to happen.
> > > 
> > 
> > (cc'ing Anna in case she has an opinion)
> > 
> > SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> > carries much weight. The _only_ way you'd be able to add SIGLOST is if
> > it defaults to SIG_IGN, such that only applications that are watching
> > for it will react to it. Given that, you're already looking at code
> > modifications.
> 
> Why does the default need to be SIG_IGN? Is that existing convention
> for new signals in Linux?
> 

No, it's just that if you don't do that, and locks are lost, then you'll
have a bunch of applications suddenly crash. That sounds scary.

> > So, the real question is: what's the best method to watch for lost
> > locks? I don't have a terribly strong opinion about any of these
> > notification methods, tbh. I only suggested inotify/fanotify because
> > they'd likely be much simpler to implement.
> > 
> > Adding signals is a non-trivial affair as we're running out of bits in
> > that space. The lower 32 bits are all taken and the upper ones are
> > reserved for realtime signals. My signal.h has a commented out SIGLOST:
> > 
> > #define SIGIO           29
> > #define SIGPOLL         SIGIO
> > /*
> > #define SIGLOST         29
> > */
> > 
> > Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> > think it'd probably need its own value. Maybe there is some way to have
> > the application ask that one of the realtime signals be set up for this?
> 
> Well, SIGPOLL is on its way out according to the standards. So SIGIO
> looks like what Linux uses instead. Looking at the man page for
> signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
> new code); si_band==POLL_MSG; and si_fd==<locked fd>. Then the inotify
> interface could then be used to process the event?
> 
> The one nit here is that we would be generating SIGIO for regular
> files (and directories?) which would be new. It makes sense with what
> we want to do though. Also, SIGIO default behavior is to terminate the
> process.
> 

That sounds like it could have unintended side-effects. A systemwide
event that causes a ton of userland processes to suddenly catch a fatal
signal seems rather nasty.

It's also not clear to me how you'll identify recipients for this
signal. What tasks will get a SIGLOST when this occurs? Going from file
descriptors or inodes to tasks that are associated with them is not
straightforward.
Patrick Donnelly June 20, 2019, 7:50 p.m. UTC | #8
On Thu, Jun 20, 2019 at 12:00 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Thu, 2019-06-20 at 11:51 -0700, Patrick Donnelly wrote:
> > On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > > > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > > > any objections to reviving whatever patchset was in flight to add
> > > > > > that? Or just writeup a new one?
> > > > > >
> > > > >
> > > > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > > > need to be custom-written to handle it. If you're going to do that, then
> > > > > it may be better to consider other async notification mechanisms.
> > > > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > > > merged.
> > > >
> > > > The utility of SIGLOST is not well understood from the viewpoint of a
> > > > local file system. The problem uniquely applies to distributed file
> > > > systems. There simply is no way to recover from a lost lock for an
> > > > application through POSIX mechanisms. We really need a new signal to
> > > > just kill the application (by default) because recovery cannot be
> > > > automatically performed even through system call errors. I don't see
> > > > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > > > the application will even use those system calls to monitor for lost
> > > > locks when POSIX has no provision for that to happen.
> > > >
> > >
> > > (cc'ing Anna in case she has an opinion)
> > >
> > > SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> > > carries much weight. The _only_ way you'd be able to add SIGLOST is if
> > > it defaults to SIG_IGN, such that only applications that are watching
> > > for it will react to it. Given that, you're already looking at code
> > > modifications.
> >
> > Why does the default need to be SIG_IGN? Is that existing convention
> > for new signals in Linux?
> >
>
> No, it's just that if you don't do that, and locks are lost, then you'll
> have a bunch of applications suddenly crash. That sounds scary.
>
> > > So, the real question is: what's the best method to watch for lost
> > > locks? I don't have a terribly strong opinion about any of these
> > > notification methods, tbh. I only suggested inotify/fanotify because
> > > they'd likely be much simpler to implement.
> > >
> > > Adding signals is a non-trivial affair as we're running out of bits in
> > > that space. The lower 32 bits are all taken and the upper ones are
> > > reserved for realtime signals. My signal.h has a commented out SIGLOST:
> > >
> > > #define SIGIO           29
> > > #define SIGPOLL         SIGIO
> > > /*
> > > #define SIGLOST         29
> > > */
> > >
> > > Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> > > think it'd probably need its own value. Maybe there is some way to have
> > > the application ask that one of the realtime signals be set up for this?
> >
> > Well, SIGPOLL is on its way out according to the standards. So SIGIO
> > looks like what Linux uses instead. Looking at the man page for
> > signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
> > new code); si_band==POLL_MSG; and si_fd==<locked fd>. Then the inotify
> > interface could then be used to process the event?
> >
> > The one nit here is that we would be generating SIGIO for regular
> > files (and directories?) which would be new. It makes sense with what
> > we want to do though. Also, SIGIO default behavior is to terminate the
> > process.
> >
>
> That sounds like it could have unintended side-effects. A systemwide
> event that causes a ton of userland processes to suddenly catch a fatal
> signal seems rather nasty.

To be clear: that's only if the mount is configured in the most
conservative way. Killing only userland processes with file locks
would be an intermediate option (and maybe a default).

> It's also not clear to me how you'll identify recipients for this
> signal. What tasks will get a SIGLOST when this occurs? Going from file
> descriptors or inodes to tasks that are associated with them is not
> straightforward.

We could start with file locks (which do have owners?) and table the
idea of killing all tasks that have any kind of file descriptor open.
Jeff Layton June 20, 2019, 8:06 p.m. UTC | #9
On Thu, 2019-06-20 at 12:50 -0700, Patrick Donnelly wrote:
> On Thu, Jun 20, 2019 at 12:00 PM Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 2019-06-20 at 11:51 -0700, Patrick Donnelly wrote:
> > > On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > > > > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > > > > any objections to reviving whatever patchset was in flight to add
> > > > > > > that? Or just writeup a new one?
> > > > > > > 
> > > > > > 
> > > > > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > > > > need to be custom-written to handle it. If you're going to do that, then
> > > > > > it may be better to consider other async notification mechanisms.
> > > > > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > > > > merged.
> > > > > 
> > > > > The utility of SIGLOST is not well understood from the viewpoint of a
> > > > > local file system. The problem uniquely applies to distributed file
> > > > > systems. There simply is no way to recover from a lost lock for an
> > > > > application through POSIX mechanisms. We really need a new signal to
> > > > > just kill the application (by default) because recovery cannot be
> > > > > automatically performed even through system call errors. I don't see
> > > > > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > > > > the application will even use those system calls to monitor for lost
> > > > > locks when POSIX has no provision for that to happen.
> > > > > 
> > > > 
> > > > (cc'ing Anna in case she has an opinion)
> > > > 
> > > > SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> > > > carries much weight. The _only_ way you'd be able to add SIGLOST is if
> > > > it defaults to SIG_IGN, such that only applications that are watching
> > > > for it will react to it. Given that, you're already looking at code
> > > > modifications.
> > > 
> > > Why does the default need to be SIG_IGN? Is that existing convention
> > > for new signals in Linux?
> > > 
> > 
> > No, it's just that if you don't do that, and locks are lost, then you'll
> > have a bunch of applications suddenly crash. That sounds scary.
> > 
> > > > So, the real question is: what's the best method to watch for lost
> > > > locks? I don't have a terribly strong opinion about any of these
> > > > notification methods, tbh. I only suggested inotify/fanotify because
> > > > they'd likely be much simpler to implement.
> > > > 
> > > > Adding signals is a non-trivial affair as we're running out of bits in
> > > > that space. The lower 32 bits are all taken and the upper ones are
> > > > reserved for realtime signals. My signal.h has a commented out SIGLOST:
> > > > 
> > > > #define SIGIO           29
> > > > #define SIGPOLL         SIGIO
> > > > /*
> > > > #define SIGLOST         29
> > > > */
> > > > 
> > > > Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> > > > think it'd probably need its own value. Maybe there is some way to have
> > > > the application ask that one of the realtime signals be set up for this?
> > > 
> > > Well, SIGPOLL is on its way out according to the standards. So SIGIO
> > > looks like what Linux uses instead. Looking at the man page for
> > > signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
> > > new code); si_band==POLL_MSG; and si_fd==<locked fd>. Then the inotify
> > > interface could then be used to process the event?
> > > 
> > > The one nit here is that we would be generating SIGIO for regular
> > > files (and directories?) which would be new. It makes sense with what
> > > we want to do though. Also, SIGIO default behavior is to terminate the
> > > process.
> > > 
> > 
> > That sounds like it could have unintended side-effects. A systemwide
> > event that causes a ton of userland processes to suddenly catch a fatal
> > signal seems rather nasty.
> 
> To be clear: that's only if the mount is configured in the most
> conservative way. Killing only userland processes with file locks
> would be an intermediate option (and maybe a default).
> 

A disable switch for this behavior would be a minimum requirement, I
think.

> > It's also not clear to me how you'll identify recipients for this
> > signal. What tasks will get a SIGLOST when this occurs? Going from file
> > descriptors or inodes to tasks that are associated with them is not
> > straightforward.
> 
> We could start with file locks (which do have owners?) and table the
> idea of killing all tasks that have any kind of file descriptor open.

Well...we do track current->tgid for l_pid, so you could probably go by
that for traditional POSIX locks.

For flock() and OFD locks though, the locks are owned by the file
description and those can be shared between processes. So, even if you
kill the tgid that acquired the lock, there's no guarantee other
processes that might be using that lock will get the signal. Not that
that's a real argument against doing this, but this approach could have
significant gaps. 

OTOH, reporting a lost lock via fanotify would be quite straightforward
(and not even that difficult). Then, any process that really cares could
watch for these events.

Again, I really think I'm missing the big picture on the problem you're
attempting to solve with this.

For instance, I was operating under the assumption that you wanted this
to be an opt-in thing, but it sounds like you're aiming for something
that is more draconian. I'm not convinced that that's a good idea --
especially not initially. Enabling this by default could be a very
unwelcome surprise in some environments.
Patrick Donnelly June 20, 2019, 10:25 p.m. UTC | #10
On Thu, Jun 20, 2019 at 1:06 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Thu, 2019-06-20 at 12:50 -0700, Patrick Donnelly wrote:
> > On Thu, Jun 20, 2019 at 12:00 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > On Thu, 2019-06-20 at 11:51 -0700, Patrick Donnelly wrote:
> > > > On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > > > > > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > > > > > any objections to reviving whatever patchset was in flight to add
> > > > > > > > that? Or just writeup a new one?
> > > > > > > >
> > > > > > >
> > > > > > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > > > > > need to be custom-written to handle it. If you're going to do that, then
> > > > > > > it may be better to consider other async notification mechanisms.
> > > > > > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > > > > > merged.
> > > > > >
> > > > > > The utility of SIGLOST is not well understood from the viewpoint of a
> > > > > > local file system. The problem uniquely applies to distributed file
> > > > > > systems. There simply is no way to recover from a lost lock for an
> > > > > > application through POSIX mechanisms. We really need a new signal to
> > > > > > just kill the application (by default) because recovery cannot be
> > > > > > automatically performed even through system call errors. I don't see
> > > > > > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > > > > > the application will even use those system calls to monitor for lost
> > > > > > locks when POSIX has no provision for that to happen.
> > > > > >
> > > > >
> > > > > (cc'ing Anna in case she has an opinion)
> > > > >
> > > > > SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> > > > > carries much weight. The _only_ way you'd be able to add SIGLOST is if
> > > > > it defaults to SIG_IGN, such that only applications that are watching
> > > > > for it will react to it. Given that, you're already looking at code
> > > > > modifications.
> > > >
> > > > Why does the default need to be SIG_IGN? Is that existing convention
> > > > for new signals in Linux?
> > > >
> > >
> > > No, it's just that if you don't do that, and locks are lost, then you'll
> > > have a bunch of applications suddenly crash. That sounds scary.
> > >
> > > > > So, the real question is: what's the best method to watch for lost
> > > > > locks? I don't have a terribly strong opinion about any of these
> > > > > notification methods, tbh. I only suggested inotify/fanotify because
> > > > > they'd likely be much simpler to implement.
> > > > >
> > > > > Adding signals is a non-trivial affair as we're running out of bits in
> > > > > that space. The lower 32 bits are all taken and the upper ones are
> > > > > reserved for realtime signals. My signal.h has a commented out SIGLOST:
> > > > >
> > > > > #define SIGIO           29
> > > > > #define SIGPOLL         SIGIO
> > > > > /*
> > > > > #define SIGLOST         29
> > > > > */
> > > > >
> > > > > Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> > > > > think it'd probably need its own value. Maybe there is some way to have
> > > > > the application ask that one of the realtime signals be set up for this?
> > > >
> > > > Well, SIGPOLL is on its way out according to the standards. So SIGIO
> > > > looks like what Linux uses instead. Looking at the man page for
> > > > signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
> > > > new code); si_band==POLL_MSG; and si_fd==<locked fd>. Then the inotify
> > > > interface could then be used to process the event?
> > > >
> > > > The one nit here is that we would be generating SIGIO for regular
> > > > files (and directories?) which would be new. It makes sense with what
> > > > we want to do though. Also, SIGIO default behavior is to terminate the
> > > > process.
> > > >
> > >
> > > That sounds like it could have unintended side-effects. A systemwide
> > > event that causes a ton of userland processes to suddenly catch a fatal
> > > signal seems rather nasty.
> >
> > To be clear: that's only if the mount is configured in the most
> > conservative way. Killing only userland processes with file locks
> > would be an intermediate option (and maybe a default).
> >
>
> A disable switch for this behavior would be a minimum requirement, I
> think.
>
> > > It's also not clear to me how you'll identify recipients for this
> > > signal. What tasks will get a SIGLOST when this occurs? Going from file
> > > descriptors or inodes to tasks that are associated with them is not
> > > straightforward.
> >
> > We could start with file locks (which do have owners?) and table the
> > idea of killing all tasks that have any kind of file descriptor open.
>
> Well...we do track current->tgid for l_pid, so you could probably go by
> that for traditional POSIX locks.
>
> For flock() and OFD locks though, the tasks are owned by the file
> description and those can be shared between processes. So, even if you
> kill the tgid that acquired the lock, there's no guarantee other
> processes that might be using that lock will get the signal. Not that
> that's a real argument against doing this, but this approach could have
> significant gaps.

I wonder if it's actually common for a process to share locked file
descriptors? I'm not even sure what that should mean.

> OTOH, reporting a lost lock via fanotify would be quite straightforward
> (and not even that difficult). Then, any process that really cares could
> watch for these events.
>
> Again, I really think I'm missing the big picture on the problem you're
> attempting to solve with this.

I may be zooming farther than you want, but here's "The Big Picture":
a kernel cephfs mount should recover after blacklist and continue to
be usable at least for new processes/applications. Recovery should be
automatic without admin intervention.

> For instance, I was operating under the assumption that you wanted this
> to be an opt-in thing, but it sounds like you're aiming for something
> that is more draconian. I'm not convinced that that's a good idea --
> especially not initially. Enabling this by default could be a very
> unwelcome surprise in some environments.

Losing file locks is a serious issue that is grounds for terminating
applications. Reminder once again: the status quo is that the application
freezes without _any way to recover_. Almost any change is an
improvement over that behavior, including termination because then
monitor processes/init may recover.

I'm not looking to build "opt-in" mechanisms (because the alternative
is what? hang forever?) but I am happy to make the behavior
configurable via mount options.
Jeff Layton June 24, 2019, 6:59 p.m. UTC | #11
On Thu, 2019-06-20 at 15:25 -0700, Patrick Donnelly wrote:
> On Thu, Jun 20, 2019 at 1:06 PM Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 2019-06-20 at 12:50 -0700, Patrick Donnelly wrote:
> > > On Thu, Jun 20, 2019 at 12:00 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Thu, 2019-06-20 at 11:51 -0700, Patrick Donnelly wrote:
> > > > > On Thu, Jun 20, 2019 at 11:18 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > On Thu, 2019-06-20 at 10:19 -0700, Patrick Donnelly wrote:
> > > > > > > On Mon, Jun 17, 2019 at 1:45 PM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > > > > Again, I'd like to see SIGLOST sent to the application here. Are there
> > > > > > > > > any objections to reviving whatever patchset was in flight to add
> > > > > > > > > > that? Or just write up a new one?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I think SIGLOST's utility is somewhat questionable. Applications will
> > > > > > > > need to be custom-written to handle it. If you're going to do that, then
> > > > > > > > it may be better to consider other async notification mechanisms.
> > > > > > > > inotify or fanotify, perhaps? Those may be simpler to implement and get
> > > > > > > > merged.
> > > > > > > 
> > > > > > > The utility of SIGLOST is not well understood from the viewpoint of a
> > > > > > > local file system. The problem uniquely applies to distributed file
> > > > > > > systems. There simply is no way to recover from a lost lock for an
> > > > > > > application through POSIX mechanisms. We really need a new signal to
> > > > > > > just kill the application (by default) because recovery cannot be
> > > > > > > automatically performed even through system call errors. I don't see
> > > > > > > how inotify/fanotify (not POSIX interfaces!) helps here as it assumes
> > > > > > > the application will even use those system calls to monitor for lost
> > > > > > > locks when POSIX has no provision for that to happen.
> > > > > > > 
> > > > > > 
> > > > > > (cc'ing Anna in case she has an opinion)
> > > > > > 
> > > > > > SIGLOST isn't defined in POSIX either, so I'm not sure that argument
> > > > > > carries much weight. The _only_ way you'd be able to add SIGLOST is if
> > > > > > it defaults to SIG_IGN, such that only applications that are watching
> > > > > > for it will react to it. Given that, you're already looking at code
> > > > > > modifications.
> > > > > 
> > > > > Why does the default need to be SIG_IGN? Is that existing convention
> > > > > for new signals in Linux?
> > > > > 
> > > > 
> > > > No, it's just that if you don't do that, and locks are lost, then you'll
> > > > have a bunch of applications suddenly crash. That sounds scary.
> > > > 
> > > > > > So, the real question is: what's the best method to watch for lost
> > > > > > locks? I don't have a terribly strong opinion about any of these
> > > > > > notification methods, tbh. I only suggested inotify/fanotify because
> > > > > > they'd likely be much simpler to implement.
> > > > > > 
> > > > > > Adding signals is a non-trivial affair as we're running out of bits in
> > > > > > that space. The lower 32 bits are all taken and the upper ones are
> > > > > > reserved for realtime signals. My signal.h has a commented out SIGLOST:
> > > > > > 
> > > > > > #define SIGIO           29
> > > > > > #define SIGPOLL         SIGIO
> > > > > > /*
> > > > > > #define SIGLOST         29
> > > > > > */
> > > > > > 
> > > > > > Sharing the value with SIGIO/SIGPOLL makes it distinctly less useful. I
> > > > > > think it'd probably need its own value. Maybe there is some way to have
> > > > > > the application ask that one of the realtime signals be set up for this?
> > > > > 
> > > > > Well, SIGPOLL is on its way out according to the standards. So SIGIO
> > > > > looks like what Linux uses instead. Looking at the man page for
> > > > > signal.h, I wonder if we could use SIGIO with si_code==POLL_LOST (a
> > > > > new code); si_band==POLL_MSG; and si_fd==<locked fd>. Then the inotify
> > > > > interface could be used to process the event?
> > > > > 
> > > > > The one nit here is that we would be generating SIGIO for regular
> > > > > files (and directories?) which would be new. It makes sense with what
> > > > > we want to do though. Also, SIGIO default behavior is to terminate the
> > > > > process.
> > > > > 
> > > > 
> > > > That sounds like it could have unintended side-effects. A systemwide
> > > > event that causes a ton of userland processes to suddenly catch a fatal
> > > > signal seems rather nasty.
> > > 
> > > To be clear: that's only if the mount is configured in the most
> > > conservative way. Killing only userland processes with file locks
> > > would be an intermediate option (and maybe a default).
> > > 
> > 
> > A disable switch for this behavior would be a minimum requirement, I
> > think.
> > 
> > > > It's also not clear to me how you'll identify recipients for this
> > > > signal. What tasks will get a SIGLOST when this occurs? Going from file
> > > > descriptors or inodes to tasks that are associated with them is not
> > > > straightforward.
> > > 
> > > We could start with file locks (which do have owners?) and table the
> > > idea of killing all tasks that have any kind of file descriptor open.
> > 
> > Well...we do track current->tgid for l_pid, so you could probably go by
> > that for traditional POSIX locks.
> > 
> > For flock() and OFD locks though, the locks are owned by the file
> > description and those can be shared between processes. So, even if you
> > kill the tgid that acquired the lock, there's no guarantee other
> > processes that might be using that lock will get the signal. Not that
> > that's a real argument against doing this, but this approach could have
> > significant gaps.
> 
> I wonder if it's actually common for a process to share locked file
> descriptors? I'm not even sure what that should mean.
> 
> > OTOH, reporting a lost lock via fanotify would be quite straightforward
> > (and not even that difficult). Then, any process that really cares could
> > watch for these events.
> > 
> > Again, I really think I'm missing the big picture on the problem you're
> > attempting to solve with this.
> 
> I may be zooming farther than you want, but here's "The Big Picture":
> a kernel cephfs mount should recover after blacklist and continue to
> be usable at least for new processes/applications. Recovery should be
> automatic without admin intervention.
> 
> > For instance, I was operating under the assumption that you wanted this
> > to be an opt-in thing, but it sounds like you're aiming for something
> > that is more draconian. I'm not convinced that that's a good idea --
> > especially not initially. Enabling this by default could be a very
> > unwelcome surprise in some environments.
> 
> Losing file locks is a serious issue that is grounds for terminating
> applications. Reminder once again: the status quo is that the application
> freezes without _any way to recover_. Almost any change is an
> improvement over that behavior, including termination because then
> monitor processes/init may recover.
> 
> I'm not looking to build "opt-in" mechanisms (because the alternative
> is what? hang forever?) but I am happy to make the behavior
> configurable via mount options.
> 

FWIW, here are some links to old linux-nfs threads on SIGLOST/SIGIO
handling (from 2011 and 2013):

    https://marc.info/?l=linux-nfs&m=137114937610166&w=4

    https://marc.info/?l=linux-nfs&m=131101162420308&w=4

The original patchset used SIGIO (since that's what SIGLOST is aliased
to on most arches), with plans to use siginfo to distinguish a SIGLOST
signal vs SIGIO.
Patch

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index f07767d3864c..d8efacd874b7 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2543,8 +2543,13 @@  static void __take_cap_refs(struct ceph_inode_info *ci, int got,
  *
  * FIXME: how does a 0 return differ from -EAGAIN?
  */
+enum {
+	NON_BLOCKING	= 1,
+	CHECK_FILELOCK	= 2,
+};
+
 static int try_get_cap_refs(struct inode *inode, int need, int want,
-			    loff_t endoff, bool nonblock, int *got)
+			    loff_t endoff, int flags, int *got)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
@@ -2559,6 +2564,13 @@  static int try_get_cap_refs(struct inode *inode, int need, int want,
 again:
 	spin_lock(&ci->i_ceph_lock);
 
+	if ((flags & CHECK_FILELOCK) &&
+	    (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+		dout("try_get_cap_refs %p error filelock\n", inode);
+		ret = -EIO;
+		goto out_unlock;
+	}
+
 	/* make sure file is actually open */
 	file_wanted = __ceph_caps_file_wanted(ci);
 	if ((file_wanted & need) != need) {
@@ -2620,7 +2632,7 @@  static int try_get_cap_refs(struct inode *inode, int need, int want,
 					 * we can not call down_read() when
 					 * task isn't in TASK_RUNNING state
 					 */
-					if (nonblock) {
+					if (flags & NON_BLOCKING) {
 						ret = -EAGAIN;
 						goto out_unlock;
 					}
@@ -2725,7 +2737,8 @@  int ceph_try_get_caps(struct inode *inode, int need, int want,
 	if (ret < 0)
 		return ret;
 
-	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
+	ret = try_get_cap_refs(inode, need, want, 0,
+			       (nonblock ? NON_BLOCKING : 0), got);
 	return ret == -EAGAIN ? 0 : ret;
 }
 
@@ -2737,9 +2750,10 @@  int ceph_try_get_caps(struct inode *inode, int need, int want,
 int ceph_get_caps(struct file *filp, int need, int want,
 		  loff_t endoff, int *got, struct page **pinned_page)
 {
+	struct ceph_file_info *fi = filp->private_data;
 	struct inode *inode = file_inode(filp);
 	struct ceph_inode_info *ci = ceph_inode(inode);
-	int _got, ret;
+	int ret, _got, flags;
 
 	ret = ceph_pool_perm_check(inode, need);
 	if (ret < 0)
@@ -2749,17 +2763,19 @@  int ceph_get_caps(struct file *filp, int need, int want,
 		if (endoff > 0)
 			check_max_size(inode, endoff);
 
+		flags = atomic_read(&fi->num_locks) ? CHECK_FILELOCK : 0;
 		_got = 0;
 		ret = try_get_cap_refs(inode, need, want, endoff,
-				       false, &_got);
+				       flags, &_got);
 		if (ret == -EAGAIN)
 			continue;
 		if (!ret) {
 			DEFINE_WAIT_FUNC(wait, woken_wake_function);
 			add_wait_queue(&ci->i_cap_wq, &wait);
 
+			flags |= NON_BLOCKING;
 			while (!(ret = try_get_cap_refs(inode, need, want,
-							endoff, true, &_got))) {
+							endoff, flags, &_got))) {
 				if (signal_pending(current)) {
 					ret = -ERESTARTSYS;
 					break;
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index ac9b53b89365..cb216501c959 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -32,14 +32,18 @@  void __init ceph_flock_init(void)
 
 static void ceph_fl_copy_lock(struct file_lock *dst, struct file_lock *src)
 {
-	struct inode *inode = file_inode(src->fl_file);
+	struct ceph_file_info *fi = dst->fl_file->private_data;
+	struct inode *inode = file_inode(dst->fl_file);
 	atomic_inc(&ceph_inode(inode)->i_filelock_ref);
+	atomic_inc(&fi->num_locks);
 }
 
 static void ceph_fl_release_lock(struct file_lock *fl)
 {
+	struct ceph_file_info *fi = fl->fl_file->private_data;
 	struct inode *inode = file_inode(fl->fl_file);
 	struct ceph_inode_info *ci = ceph_inode(inode);
+	atomic_dec(&fi->num_locks);
 	if (atomic_dec_and_test(&ci->i_filelock_ref)) {
 		/* clear error when all locks are released */
 		spin_lock(&ci->i_ceph_lock);
@@ -73,7 +77,7 @@  static int ceph_lock_message(u8 lock_type, u16 operation, struct inode *inode,
 		 * window. Caller function will decrease the counter.
 		 */
 		fl->fl_ops = &ceph_fl_lock_ops;
-		atomic_inc(&ceph_inode(inode)->i_filelock_ref);
+		fl->fl_ops->fl_copy_lock(fl, NULL);
 	}
 
 	if (operation != CEPH_MDS_OP_SETFILELOCK || cmd == CEPH_LOCK_UNLOCK)
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index f45a06475f4f..999fd3244907 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -705,6 +705,7 @@  struct ceph_file_info {
 	struct list_head rw_contexts;
 
 	errseq_t meta_err;
+	atomic_t num_locks;
 };
 
 struct ceph_dir_file_info {