mbox series

[RFC,v2,00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

Message ID 20190809225833.6657-1-ira.weiny@intel.com (mailing list archive)
Headers show
Series RDMA/FS DAX truncate proposal V1,000,002 ;-) | expand

Message

Ira Weiny Aug. 9, 2019, 10:58 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

Pre-requisites
==============
	Based on mmotm tree.

Based on the feedback from LSFmm, the LWN article, the RFC series since
then, and a ton of scenarios I've worked in my mind and/or tested...[1]

Solution summary
================

The real issue is that there is no use case for a user to have RDMA pinn'ed
memory which is then truncated.  So really any solution we present which:

A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong

Should be an acceptable solution.

Because this is slightly new behavior.  And because this is going to be
specific to DAX (because of the lack of a page cache) we have made the user
"opt in" to this behavior.

The following patches implement the following solution.

0) Registrations to Device DAX char devs are not affected

1) The user has to opt in to allowing page pins on a file with an exclusive
   layout lease.  Both exclusive and layout lease flags are user visible now.

2) page pins will fail if the lease is not active when the file back page is
   encountered.

3) Any truncate or hole punch operation on a pinned DAX page will fail.

4) The user has the option of holding the lease or releasing it.  If they
   release it no other pin calls will work on the file.

5) Closing the file is ok.

6) Unmapping the file is ok

7) Pins against the files are tracked back to an owning file or an owning mm
   depending on the internal subsystem needs.  With RDMA there is an owning
   file which is related to the pined file.

8) Only RDMA is currently supported

9) Truncation of pages which are not actively pinned nor covered by a lease
   will succeed.


Reporting of pinned files in procfs
===================================

A number of alternatives were explored for how to report the file pins within
procfs.  The following incorporates ideas from Jan Kara, Jason Gunthorpe, Dave
Chinner, Dan Williams and myself.

A new entry is added to procfs

/proc/<pid>/file_pins

For processes which have pinned DAX file memory file_pins reference come in 2
flavors.  Those which are attached to another open file descriptor (For example
what is done in the RDMA subsytem) and those which are attached to a process
mm.

For those which are attached to another open file descriptor (such as RDMA)
the file pin references go through the 'struct file' associated with that pin.
In RDMA this is the RDMA context struct file.

The resulting output from proc fs is something like.

$ cat /proc/<pid>/file_pins
3: /dev/infiniband/uverbs0
	/mnt/pmem/foo

Where '3' is the file descriptor (and file path) of the rdma context within the
process.  The paths of the files pinned using that context are then listed.

RDMA contexts may have multiple MR each of which may have multiple files pinned
within them.  So an output like the following is possible.

$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
	/mnt/pmem/foo
	/mnt/pmem/bar
	/mnt/pmem/another
	/mnt/pmem/one

The actual memory regions associated with the file pins are not reported.

For processes which are pinning memory which is not associated with a specific
file descriptor memory pins are reported directly as paths to the file.

$ cat /proc/<pid>/file_pins
/mnt/pmem/foo

Putting the above together if a process was using RDMA and another subsystem
the output could be something like:


$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
	/mnt/pmem/foo
	/mnt/pmem/bar
	/mnt/pmem/another
	/mnt/pmem/one
/mnt/pmem/foo
/mnt/pmem/another
/mnt/pmem/mm_mapped_file


[1] https://lkml.org/lkml/2019/6/5/1046


Background
==========

It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP).  There are 2 big reasons this may not work.

	1) The hardware being used for RDMA may not support ODP
	2) ODP may be detrimental to the over all network (cluster or cloud)
	   performance

Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.

1) "longterm" GUP users need to inform other subsystems that they have taken a
   pin on a page which may remain pinned for a very "long time".  The
   definition of long time is debatable but it has been established that RDMAs
   use of pages for, minutes, hours, or even days after the pin is the extreme
   case which makes this problem most severe.

2) Any page which is "controlled" by a file system needs to have special
   handling.  The details of the handling depends on if the page is page cache
   fronted or not.

   2a) A page cache fronted page which has been pinned by GUP long term can use a
   bounce buffer to allow the file system to write back snap shots of the page.
   This is handled by the FS recognizing the GUP long term pin and making a copy
   of the page to be written back.
	NOTE: this patch set does not address this path.

   2b) A FS "controlled" page which is not page cache fronted is either easier
   to deal with or harder depending on the operation the filesystem is trying
   to do.

	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
	FS can no longer use the pages in question until the pin has been
	removed.  This patch set presents a solution to this by introducing
	some reasonable restrictions on user space applications.

	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
	then there is nothing which need be done.  Data is Read or Written
	directly to the page.  This is an easy case which would currently work
	if not for GUP long term pins being disabled.  Therefore this patch set
	need not change access to the file data but does allow for GUP pins
	after 2ba above is dealt with.


This patch series and presents a solution for problem 2ba)

Ira Weiny (19):
  fs/locks: Export F_LAYOUT lease to user space
  fs/locks: Add Exclusive flag to user Layout lease
  mm/gup: Pass flags down to __gup_device_huge* calls
  mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
  fs/ext4: Teach ext4 to break layout leases
  fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  fs/xfs: Teach xfs to use new dax_layout_busy_page()
  fs/xfs: Fail truncate if page lease can't be broken
  mm/gup: Introduce vaddr_pin structure
  mm/gup: Pass a NULL vaddr_pin through GUP fast
  mm/gup: Pass follow_page_context further down the call stack
  mm/gup: Prep put_user_pages() to take an vaddr_pin struct
  {mm,file}: Add file_pins objects
  fs/locks: Associate file pins while performing GUP
  mm/gup: Introduce vaddr_pin_pages()
  RDMA/uverbs: Add back pointer to system file object
  RDMA/umem: Convert to vaddr_[pin|unpin]* operations.
  {mm,procfs}: Add display file_pins proc
  mm/gup: Remove FOLL_LONGTERM DAX exclusion

 drivers/infiniband/core/umem.c        |  26 +-
 drivers/infiniband/core/umem_odp.c    |  16 +-
 drivers/infiniband/core/uverbs.h      |   1 +
 drivers/infiniband/core/uverbs_main.c |   1 +
 fs/Kconfig                            |   1 +
 fs/dax.c                              |  38 ++-
 fs/ext4/ext4.h                        |   2 +-
 fs/ext4/extents.c                     |   6 +-
 fs/ext4/inode.c                       |  26 +-
 fs/file_table.c                       |   4 +
 fs/locks.c                            | 291 +++++++++++++++++-
 fs/proc/base.c                        | 214 +++++++++++++
 fs/xfs/xfs_file.c                     |  21 +-
 fs/xfs/xfs_inode.h                    |   5 +-
 fs/xfs/xfs_ioctl.c                    |  15 +-
 fs/xfs/xfs_iops.c                     |  14 +-
 include/linux/dax.h                   |  12 +-
 include/linux/file.h                  |  49 +++
 include/linux/fs.h                    |   5 +-
 include/linux/huge_mm.h               |  17 --
 include/linux/mm.h                    |  69 +++--
 include/linux/mm_types.h              |   2 +
 include/rdma/ib_umem.h                |   2 +-
 include/uapi/asm-generic/fcntl.h      |   5 +
 kernel/fork.c                         |   3 +
 mm/gup.c                              | 418 ++++++++++++++++----------
 mm/huge_memory.c                      |  18 +-
 mm/internal.h                         |  28 ++
 28 files changed, 1048 insertions(+), 261 deletions(-)

Comments

Jan Kara Aug. 14, 2019, 10:17 a.m. UTC | #1
Hello!

On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> Pre-requisites
> ==============
> 	Based on mmotm tree.
> 
> Based on the feedback from LSFmm, the LWN article, the RFC series since
> then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> 
> Solution summary
> ================
> 
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated.  So really any solution we present which:
> 
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
> 
> Should be an acceptable solution.
> 
> Because this is slightly new behavior.  And because this is going to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
> 
> The following patches implement the following solution.
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>    layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>    encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.

So I didn't fully grok the patch set yet but by "pinned DAX page" do you
mean a page which has corresponding file_pin covering it? Or do you mean a
page which has pincount increased? If the first then I'd rephrase this to
be less ambiguous, if the second then I think it is wrong. 

> 4) The user has the option of holding the lease or releasing it.  If they
>    release it no other pin calls will work on the file.

Last time we spoke the plan was that the lease is kept while the pages are
pinned (and an attempt to release the lease would block until the pages are
unpinned). That also makes it clear that the *lease* is what is making
truncate and hole punch fail with ETXTBUSY and the file_pin structure is
just an implementation detail how the existence is efficiently tracked (and
what keeps the backing file for the pages open so that the lease does not
get auto-destroyed). Why did you change this?

> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>    depending on the internal subsystem needs.  With RDMA there is an owning
>    file which is related to the pined file.
> 
> 8) Only RDMA is currently supported

If you currently only need "owning file" variant in your patch set, then
I'd just implement that and leave "owning mm" variant for later if it
proves to be necessary. The things are complex enough as is...

> 9) Truncation of pages which are not actively pinned nor covered by a lease
>    will succeed.

Otherwise I like the design.

								Honza
Ira Weiny Aug. 14, 2019, 6:08 p.m. UTC | #2
On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> Hello!
> 
> On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > Pre-requisites
> > ==============
> > 	Based on mmotm tree.
> > 
> > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > 
> > Solution summary
> > ================
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is going to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 0) Registrations to Device DAX char devs are not affected
> > 
> > 1) The user has to opt in to allowing page pins on a file with an exclusive
> >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > 
> > 2) page pins will fail if the lease is not active when the file back page is
> >    encountered.
> > 
> > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> mean a page which has corresponding file_pin covering it? Or do you mean a
> page which has pincount increased? If the first then I'd rephrase this to
> be less ambiguous, if the second then I think it is wrong. 

I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
pincount processing will hang the truncate.  Given the discussion with John H
we can make this a bit better if we use something like FOLL_PIN and the page
count bias to indicate this type of pin.  Then I could fail the truncate
outright.  but that is not done yet.

so... I used the word "fail" to be a bit more vague as the final implementation
may return ETXTBUSY or hang as noted.

> 
> > 4) The user has the option of holding the lease or releasing it.  If they
> >    release it no other pin calls will work on the file.
> 
> Last time we spoke the plan was that the lease is kept while the pages are
> pinned (and an attempt to release the lease would block until the pages are
> unpinned). That also makes it clear that the *lease* is what is making
> truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> just an implementation detail how the existence is efficiently tracked (and
> what keeps the backing file for the pages open so that the lease does not
> get auto-destroyed). Why did you change this?

closing the file _and_ unmaping it will cause the lease to be released
regardless of if we allow this or not.

As we discussed preventing the close seemed intractable.

I thought about failing the munmap but that seemed wrong as well.  But more
importantly AFAIK RDMA can pass its memory pins to other processes via FD
passing...  This means that one could pin this memory, pass it to another
process and exit.  The file lease on the pin'ed file is lost.

The file lease is just a key to get the memory pin.  Once unlocked the procfs
tracking keeps track of where that pin goes and which processes need to be
killed to get rid of it.

> 
> > 5) Closing the file is ok.
> > 
> > 6) Unmapping the file is ok
> > 
> > 7) Pins against the files are tracked back to an owning file or an owning mm
> >    depending on the internal subsystem needs.  With RDMA there is an owning
> >    file which is related to the pined file.
> > 
> > 8) Only RDMA is currently supported
> 
> If you currently only need "owning file" variant in your patch set, then
> I'd just implement that and leave "owning mm" variant for later if it
> proves to be necessary. The things are complex enough as is...

I can do that...  I was trying to get io_uring working as well with the
owning_mm but I should save that for later.

> 
> > 9) Truncation of pages which are not actively pinned nor covered by a lease
> >    will succeed.
> 
> Otherwise I like the design.

Thanks,
Ira

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Aug. 15, 2019, 1:05 p.m. UTC | #3
On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > Pre-requisites
> > > ==============
> > > 	Based on mmotm tree.
> > > 
> > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > 
> > > Solution summary
> > > ================
> > > 
> > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is going to be
> > > specific to DAX (because of the lack of a page cache) we have made the user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 0) Registrations to Device DAX char devs are not affected
> > > 
> > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > 
> > > 2) page pins will fail if the lease is not active when the file back page is
> > >    encountered.
> > > 
> > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > 
> > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > mean a page which has corresponding file_pin covering it? Or do you mean a
> > page which has pincount increased? If the first then I'd rephrase this to
> > be less ambiguous, if the second then I think it is wrong. 
> 
> I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> pincount processing will hang the truncate.  Given the discussion with John H
> we can make this a bit better if we use something like FOLL_PIN and the page
> count bias to indicate this type of pin.  Then I could fail the truncate
> outright.  but that is not done yet.
> 
> so... I used the word "fail" to be a bit more vague as the final implementation
> may return ETXTBUSY or hang as noted.

Ah, OK. Hanging is fine in principle but with longterm pins, your work
makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
e.g. DIO will use page pins as well for its buffers and we must wait there
until the pin is released. So please just clarify your 'fail' here a bit
:).

> > > 4) The user has the option of holding the lease or releasing it.  If they
> > >    release it no other pin calls will work on the file.
> > 
> > Last time we spoke the plan was that the lease is kept while the pages are
> > pinned (and an attempt to release the lease would block until the pages are
> > unpinned). That also makes it clear that the *lease* is what is making
> > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > just an implementation detail how the existence is efficiently tracked (and
> > what keeps the backing file for the pages open so that the lease does not
> > get auto-destroyed). Why did you change this?
> 
> closing the file _and_ unmaping it will cause the lease to be released
> regardless of if we allow this or not.
> 
> As we discussed preventing the close seemed intractable.

Yes, preventing the application from closing the file is difficult. But
from a quick look at your patches it seemed to me that you actually hold a
backing file reference from the file_pin structure thus even though the
application closes its file descriptor, the struct file (and thus the
lease) lives further until the file_pin gets released. And that should last
as long as the pages are pinned. Am I missing something?

> I thought about failing the munmap but that seemed wrong as well.  But more
> importantly AFAIK RDMA can pass its memory pins to other processes via FD
> passing...  This means that one could pin this memory, pass it to another
> process and exit.  The file lease on the pin'ed file is lost.

Not if file_pin grabs struct file reference as I mentioned above...
 
> The file lease is just a key to get the memory pin.  Once unlocked the procfs
> tracking keeps track of where that pin goes and which processes need to be
> killed to get rid of it.

I think having file lease being just a key to get the pin is conceptually
wrong. The lease is what expresses: "I'm accessing these blocks directly,
don't touch them without coordinating with me." So it would be only natural
if we maintained the lease while we are accessing blocks instead of
transferring this protection responsibility to another structure - namely
file_pin - and letting the lease go. But maybe I miss some technical reason
why maintaining file lease is difficult. If that's the case, I'd like to hear
what...
 
> > > 5) Closing the file is ok.
> > > 
> > > 6) Unmapping the file is ok
> > > 
> > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > >    file which is related to the pined file.
> > > 
> > > 8) Only RDMA is currently supported
> > 
> > If you currently only need "owning file" variant in your patch set, then
> > I'd just implement that and leave "owning mm" variant for later if it
> > proves to be necessary. The things are complex enough as is...
> 
> I can do that...  I was trying to get io_uring working as well with the
> owning_mm but I should save that for later.

Ah, OK. Yes, I guess io_uring can be next step.

								Honza
Ira Weiny Aug. 16, 2019, 7:05 p.m. UTC | #4
On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > Pre-requisites
> > > > ==============
> > > > 	Based on mmotm tree.
> > > > 
> > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > 
> > > > Solution summary
> > > > ================
> > > > 
> > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > memory which is then truncated.  So really any solution we present which:
> > > > 
> > > > A) Prevents file system corruption or data leaks
> > > > ...and...
> > > > B) Informs the user that they did something wrong
> > > > 
> > > > Should be an acceptable solution.
> > > > 
> > > > Because this is slightly new behavior.  And because this is going to be
> > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > "opt in" to this behavior.
> > > > 
> > > > The following patches implement the following solution.
> > > > 
> > > > 0) Registrations to Device DAX char devs are not affected
> > > > 
> > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > 
> > > > 2) page pins will fail if the lease is not active when the file back page is
> > > >    encountered.
> > > > 
> > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > 
> > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > page which has pincount increased? If the first then I'd rephrase this to
> > > be less ambiguous, if the second then I think it is wrong. 
> > 
> > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > pincount processing will hang the truncate.  Given the discussion with John H
> > we can make this a bit better if we use something like FOLL_PIN and the page
> > count bias to indicate this type of pin.  Then I could fail the truncate
> > outright.  but that is not done yet.
> > 
> > so... I used the word "fail" to be a bit more vague as the final implementation
> > may return ETXTBUSY or hang as noted.
> 
> Ah, OK. Hanging is fine in principle but with longterm pins, your work
> makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> e.g. DIO will use page pins as well for its buffers and we must wait there
> until the pin is released. So please just clarify your 'fail' here a bit
> :).

It will fail with ETXTBSY.  I've fixed a bug...  See below.

> 
> > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > >    release it no other pin calls will work on the file.
> > > 
> > > Last time we spoke the plan was that the lease is kept while the pages are
> > > pinned (and an attempt to release the lease would block until the pages are
> > > unpinned). That also makes it clear that the *lease* is what is making
> > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > just an implementation detail how the existence is efficiently tracked (and
> > > what keeps the backing file for the pages open so that the lease does not
> > > get auto-destroyed). Why did you change this?
> > 
> > closing the file _and_ unmaping it will cause the lease to be released
> > regardless of if we allow this or not.
> > 
> > As we discussed preventing the close seemed intractable.
> 
> Yes, preventing the application from closing the file is difficult. But
> from a quick look at your patches it seemed to me that you actually hold a
> backing file reference from the file_pin structure thus even though the
> application closes its file descriptor, the struct file (and thus the
> lease) lives further until the file_pin gets released. And that should last
> as long as the pages are pinned. Am I missing something?
> 
> > I thought about failing the munmap but that seemed wrong as well.  But more
> > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > passing...  This means that one could pin this memory, pass it to another
> > process and exit.  The file lease on the pin'ed file is lost.
> 
> Not if file_pin grabs struct file reference as I mentioned above...
>  
> > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > tracking keeps track of where that pin goes and which processes need to be
> > killed to get rid of it.
> 
> I think having file lease being just a key to get the pin is conceptually
> wrong. The lease is what expresses: "I'm accessing these blocks directly,
> don't touch them without coordinating with me." So it would be only natural
> if we maintained the lease while we are accessing blocks instead of
> transferring this protection responsibility to another structure - namely
> file_pin - and letting the lease go.

We do transfer that protection to the file_pin but we don't have to "let the
lease" go.  We just keep the lease with the file_pin as you said.  See below...

> But maybe I miss some technical reason
> why maintaining file lease is difficult. If that's the case, I'd like to hear
> what...

Ok, I've thought a bit about what you said and indeed it should work that way.
The reason I had to think a bit is that I was not sure why I thought we needed
to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
but still not good enough to allow this...

1) I had a bug in the XFS code which should have failed rather than hanging...
   So this was not a good reason...  And I was able to find/fix it...  Thanks!

2) Second reason is that I thought I did not have a good way to tell if the
   lease was actually in use.  What I mean is that letting the lease go should
   be ok IFF we don't have any pins...  I was thinking that without John's code
   we don't have a way to know if there are any pins...  But that is wrong...
   All we have to do is check

	!list_empty(file->file_pins)

So now with this detail I think you are right, we should be able to hold the
lease through the struct file even if the process no longer has any
"references" to it (ie closes and munmaps the file).

I'm going to add a patch to fail releasing the lease and remove this (item 4)
as part of the overall solution.

>  
> > > > 5) Closing the file is ok.
> > > > 
> > > > 6) Unmapping the file is ok
> > > > 
> > > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > > >    file which is related to the pined file.
> > > > 
> > > > 8) Only RDMA is currently supported
> > > 
> > > If you currently only need "owning file" variant in your patch set, then
> > > I'd just implement that and leave "owning mm" variant for later if it
> > > proves to be necessary. The things are complex enough as is...
> > 
> > I can do that...  I was trying to get io_uring working as well with the
> > owning_mm but I should save that for later.
> 
> Ah, OK. Yes, I guess io_uring can be next step.

FWIW I have split the mm_struct stuff out.  I can keep it as a follow on series
for other users later.  At this point I have to solve the issue Jason brought
up WRT the RDMA file reference counting.

Thanks!
Ira
Ira Weiny Aug. 16, 2019, 11:20 p.m. UTC | #5
On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > Hello!
> > > > 
> > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > > Pre-requisites
> > > > > ==============
> > > > > 	Based on mmotm tree.
> > > > > 
> > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > > 
> > > > > Solution summary
> > > > > ================
> > > > > 
> > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > > memory which is then truncated.  So really any solution we present which:
> > > > > 
> > > > > A) Prevents file system corruption or data leaks
> > > > > ...and...
> > > > > B) Informs the user that they did something wrong
> > > > > 
> > > > > Should be an acceptable solution.
> > > > > 
> > > > > Because this is slightly new behavior.  And because this is going to be
> > > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > > "opt in" to this behavior.
> > > > > 
> > > > > The following patches implement the following solution.
> > > > > 
> > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > 
> > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > > 
> > > > > 2) page pins will fail if the lease is not active when the file back page is
> > > > >    encountered.
> > > > > 
> > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > > 
> > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > > page which has pincount increased? If the first then I'd rephrase this to
> > > > be less ambiguous, if the second then I think it is wrong. 
> > > 
> > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > > pincount processing will hang the truncate.  Given the discussion with John H
> > > we can make this a bit better if we use something like FOLL_PIN and the page
> > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > outright.  but that is not done yet.
> > > 
> > > so... I used the word "fail" to be a bit more vague as the final implementation
> > > may return ETXTBUSY or hang as noted.
> > 
> > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > e.g. DIO will use page pins as well for its buffers and we must wait there
> > until the pin is released. So please just clarify your 'fail' here a bit
> > :).
> 
> It will fail with ETXTBSY.  I've fixed a bug...  See below.
> 
> > 
> > > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > > >    release it no other pin calls will work on the file.
> > > > 
> > > > Last time we spoke the plan was that the lease is kept while the pages are
> > > > pinned (and an attempt to release the lease would block until the pages are
> > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > > just an implementation detail how the existence is efficiently tracked (and
> > > > what keeps the backing file for the pages open so that the lease does not
> > > > get auto-destroyed). Why did you change this?
> > > 
> > > closing the file _and_ unmaping it will cause the lease to be released
> > > regardless of if we allow this or not.
> > > 
> > > As we discussed preventing the close seemed intractable.
> > 
> > Yes, preventing the application from closing the file is difficult. But
> > from a quick look at your patches it seemed to me that you actually hold a
> > backing file reference from the file_pin structure thus even though the
> > application closes its file descriptor, the struct file (and thus the
> > lease) lives further until the file_pin gets released. And that should last
> > as long as the pages are pinned. Am I missing something?
> > 
> > > I thought about failing the munmap but that seemed wrong as well.  But more
> > > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > > passing...  This means that one could pin this memory, pass it to another
> > > process and exit.  The file lease on the pin'ed file is lost.
> > 
> > Not if file_pin grabs struct file reference as I mentioned above...
> >  
> > > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > > tracking keeps track of where that pin goes and which processes need to be
> > > killed to get rid of it.
> > 
> > I think having file lease being just a key to get the pin is conceptually
> > wrong. The lease is what expresses: "I'm accessing these blocks directly,
> > don't touch them without coordinating with me." So it would be only natural
> > if we maintained the lease while we are accessing blocks instead of
> > transferring this protection responsibility to another structure - namely
> > file_pin - and letting the lease go.
> 
> We do transfer that protection to the file_pin but we don't have to "let the
> lease" go.  We just keep the lease with the file_pin as you said.  See below...
> 
> > But maybe I miss some technical reason
> > why maintaining file lease is difficult. If that's the case, I'd like to hear
> > what...
> 
> Ok, I've thought a bit about what you said and indeed it should work that way.
> The reason I had to think a bit is that I was not sure why I thought we needed
> to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
> but still not good enough to allow this...
> 
> 1) I had a bug in the XFS code which should have failed rather than hanging...
>    So this was not a good reason...  And I was able to find/fix it...  Thanks!
> 
> 2) Second reason is that I thought I did not have a good way to tell if the
>    lease was actually in use.  What I mean is that letting the lease go should
>    be ok IFF we don't have any pins...  I was thinking that without John's code
>    we don't have a way to know if there are any pins...  But that is wrong...
>    All we have to do is check
> 
> 	!list_empty(file->file_pins)

Oops...  I got my "struct files" mixed up...  The RDMA struct file has the
file_pins hanging off it...  This will not work.

I'll have to try something else to prevent this.  However, I don't want to walk
all the pages of the inode.

Also I'm concerned about just failing if they happen to be pinned.  They need
to be LONGTERM pinned...  Otherwise we might have a transient failure of an
unlock based on some internal kernel transient pin...  :-/

Ira

> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).
> 
> I'm going to add a patch to fail releasing the lease and remove this (item 4)
> as part of the overall solution.
> 
> >  
> > > > > 5) Closing the file is ok.
> > > > > 
> > > > > 6) Unmapping the file is ok
> > > > > 
> > > > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > > > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > > > >    file which is related to the pined file.
> > > > > 
> > > > > 8) Only RDMA is currently supported
> > > > 
> > > > If you currently only need "owning file" variant in your patch set, then
> > > > I'd just implement that and leave "owning mm" variant for later if it
> > > > proves to be necessary. The things are complex enough as is...
> > > 
> > > I can do that...  I was trying to get io_uring working as well with the
> > > owning_mm but I should save that for later.
> > 
> > Ah, OK. Yes, I guess io_uring can be next step.
> 
> FWIW I have split the mm_struct stuff out.  I can keep it as a follow on series
> for other users later.  At this point I have to solve the issue Jason brought
> up WRT the RDMA file reference counting.
> 
> Thanks!
> Ira
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Dave Chinner Aug. 17, 2019, 2:26 a.m. UTC | #6
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> 2) Second reason is that I thought I did not have a good way to tell if the
>    lease was actually in use.  What I mean is that letting the lease go should
>    be ok IFF we don't have any pins...  I was thinking that without John's code
>    we don't have a way to know if there are any pins...  But that is wrong...
>    All we have to do is check
> 
> 	!list_empty(file->file_pins)
> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).

I really, really dislike the idea of zombie layout leases. It's a
nasty hack for poor application behaviour. This is a "we allow use
after layout lease release" API, and I think encoding largely
untraceable zombie objects into an API is very poor design.

From the fcntl man page:

LEASES
	Leases are associated with an open file description (see
	open(2)).  This means that duplicate file descriptors
	(created by, for example, fork(2) or dup(2))  re‐ fer  to
	the  same  lease,  and this lease may be modified or
	released using any of these descriptors.  Furthermore, the
	lease is released by either an explicit F_UNLCK operation on
	any of these duplicate file descriptors, or when all such
	file descriptors have been closed.

Leases are associated with *open* file descriptors, not the
lifetime of the struct file in the kernel. If the application closes
the open fds that refer to the lease, then the kernel does not
guarantee, and the application has no right to expect, that the
lease remains active in any way once the application closes all
direct references to the lease.

IOWs, applications using layout leases need to hold the lease fd
open for as long as the want access to the physical file layout. It
is a also a requirement of the layout lease that the holder releases
the resources it holds on the layout before it releases the layout
lease, exclusive lease or not. Closing the fd indicates they do not
need access to the file any more, and so the lease should be
reclaimed at that point.

I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released.

Cheers,

Dave.
Jan Kara Aug. 19, 2019, 6:34 a.m. UTC | #7
On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >    lease was actually in use.  What I mean is that letting the lease go should
> >    be ok IFF we don't have any pins...  I was thinking that without John's code
> >    we don't have a way to know if there are any pins...  But that is wrong...
> >    All we have to do is check
> > 
> > 	!list_empty(file->file_pins)
> > 
> > So now with this detail I think you are right, we should be able to hold the
> > lease through the struct file even if the process no longer has any
> > "references" to it (ie closes and munmaps the file).
> 
> I really, really dislike the idea of zombie layout leases. It's a
> nasty hack for poor application behaviour. This is a "we allow use
> after layout lease release" API, and I think encoding largely
> untraceable zombie objects into an API is very poor design.
> 
> From the fcntl man page:
> 
> LEASES
> 	Leases are associated with an open file description (see
> 	open(2)).  This means that duplicate file descriptors
> 	(created by, for example, fork(2) or dup(2))  re‐ fer  to
> 	the  same  lease,  and this lease may be modified or
> 	released using any of these descriptors.  Furthermore, the
> 	lease is released by either an explicit F_UNLCK operation on
> 	any of these duplicate file descriptors, or when all such
> 	file descriptors have been closed.
> 
> Leases are associated with *open* file descriptors, not the
> lifetime of the struct file in the kernel. If the application closes
> the open fds that refer to the lease, then the kernel does not
> guarantee, and the application has no right to expect, that the
> lease remains active in any way once the application closes all
> direct references to the lease.
> 
> IOWs, applications using layout leases need to hold the lease fd
> open for as long as the want access to the physical file layout. It
> is a also a requirement of the layout lease that the holder releases
> the resources it holds on the layout before it releases the layout
> lease, exclusive lease or not. Closing the fd indicates they do not
> need access to the file any more, and so the lease should be
> reclaimed at that point.
> 
> I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released.

Yeah, so this was my initial though as well [1]. But as the discussion in
that thread revealed, the problem with blocking last close is that kernel
does not really expect close to block. You could easily deadlock e.g. if
the process gets SIGKILL, file with lease has fd 10, and the RDMA context
holding pages pinned has fd 15. Or you could wait for another process to
release page pins and blocking SIGKILL on that is also bad. So in the end
the least bad solution we've come up with were these "zombie" leases as you
call them and tracking them in /proc so that userspace at least has a way
of seeing them. But if you can come up with a different solution, I'm
certainly not attached to the current one...

								Honza

[1] https://lore.kernel.org/lkml/20190606104203.GF7433@quack2.suse.cz
Jan Kara Aug. 19, 2019, 6:36 a.m. UTC | #8
On Fri 16-08-19 16:20:07, Ira Weiny wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > Hello!
> > > > > 
> > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > > > Pre-requisites
> > > > > > ==============
> > > > > > 	Based on mmotm tree.
> > > > > > 
> > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > > > 
> > > > > > Solution summary
> > > > > > ================
> > > > > > 
> > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > > > memory which is then truncated.  So really any solution we present which:
> > > > > > 
> > > > > > A) Prevents file system corruption or data leaks
> > > > > > ...and...
> > > > > > B) Informs the user that they did something wrong
> > > > > > 
> > > > > > Should be an acceptable solution.
> > > > > > 
> > > > > > Because this is slightly new behavior.  And because this is going to be
> > > > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > > > "opt in" to this behavior.
> > > > > > 
> > > > > > The following patches implement the following solution.
> > > > > > 
> > > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > > 
> > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > > > 
> > > > > > 2) page pins will fail if the lease is not active when the file back page is
> > > > > >    encountered.
> > > > > > 
> > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > > > 
> > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > > > page which has pincount increased? If the first then I'd rephrase this to
> > > > > be less ambiguous, if the second then I think it is wrong. 
> > > > 
> > > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > > > pincount processing will hang the truncate.  Given the discussion with John H
> > > > we can make this a bit better if we use something like FOLL_PIN and the page
> > > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > > outright.  but that is not done yet.
> > > > 
> > > > so... I used the word "fail" to be a bit more vague as the final implementation
> > > > may return ETXTBUSY or hang as noted.
> > > 
> > > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > > e.g. DIO will use page pins as well for its buffers and we must wait there
> > > until the pin is released. So please just clarify your 'fail' here a bit
> > > :).
> > 
> > It will fail with ETXTBSY.  I've fixed a bug...  See below.
> > 
> > > 
> > > > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > > > >    release it no other pin calls will work on the file.
> > > > > 
> > > > > Last time we spoke the plan was that the lease is kept while the pages are
> > > > > pinned (and an attempt to release the lease would block until the pages are
> > > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > > > just an implementation detail how the existence is efficiently tracked (and
> > > > > what keeps the backing file for the pages open so that the lease does not
> > > > > get auto-destroyed). Why did you change this?
> > > > 
> > > > closing the file _and_ unmaping it will cause the lease to be released
> > > > regardless of if we allow this or not.
> > > > 
> > > > As we discussed preventing the close seemed intractable.
> > > 
> > > Yes, preventing the application from closing the file is difficult. But
> > > from a quick look at your patches it seemed to me that you actually hold a
> > > backing file reference from the file_pin structure thus even though the
> > > application closes its file descriptor, the struct file (and thus the
> > > lease) lives further until the file_pin gets released. And that should last
> > > as long as the pages are pinned. Am I missing something?
> > > 
> > > > I thought about failing the munmap but that seemed wrong as well.  But more
> > > > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > > > passing...  This means that one could pin this memory, pass it to another
> > > > process and exit.  The file lease on the pin'ed file is lost.
> > > 
> > > Not if file_pin grabs struct file reference as I mentioned above...
> > >  
> > > > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > > > tracking keeps track of where that pin goes and which processes need to be
> > > > killed to get rid of it.
> > > 
> > > I think having file lease being just a key to get the pin is conceptually
> > > wrong. The lease is what expresses: "I'm accessing these blocks directly,
> > > don't touch them without coordinating with me." So it would be only natural
> > > if we maintained the lease while we are accessing blocks instead of
> > > transferring this protection responsibility to another structure - namely
> > > file_pin - and letting the lease go.
> > 
> > We do transfer that protection to the file_pin but we don't have to "let the
> > lease" go.  We just keep the lease with the file_pin as you said.  See below...
> > 
> > > But maybe I miss some technical reason
> > > why maintaining file lease is difficult. If that's the case, I'd like to hear
> > > what...
> > 
> > Ok, I've thought a bit about what you said and indeed it should work that way.
> > The reason I had to think a bit is that I was not sure why I thought we needed
> > to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
> > but still not good enough to allow this...
> > 
> > 1) I had a bug in the XFS code which should have failed rather than hanging...
> >    So this was not a good reason...  And I was able to find/fix it...  Thanks!
> > 
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >    lease was actually in use.  What I mean is that letting the lease go should
> >    be ok IFF we don't have any pins...  I was thinking that without John's code
> >    we don't have a way to know if there are any pins...  But that is wrong...
> >    All we have to do is check
> > 
> > 	!list_empty(file->file_pins)
> 
> Oops...  I got my "struct files" mixed up...  The RDMA struct file has the
> file_pins hanging off it...  This will not work.
> 
> I'll have to try something else to prevent this.  However, I don't want to walk
> all the pages of the inode.
> 
> Also I'm concerned about just failing if they happen to be pinned.  They need
> to be LONGTERM pinned...  Otherwise we might have a transient failure of an
> unlock based on some internal kernel transient pin...  :-/

My solution for this was that file_pin would contain counter of pinned
pages which vaddr_pin_pages() would increment and vaddr_unpin_pages() would
decrement. Checking whether there's any outstanding page pinned attached to
the file_pin is then trivial...

								Honza
Dave Chinner Aug. 19, 2019, 9:24 a.m. UTC | #9
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > 2) Second reason is that I thought I did not have a good way to tell if the
> > >    lease was actually in use.  What I mean is that letting the lease go should
> > >    be ok IFF we don't have any pins...  I was thinking that without John's code
> > >    we don't have a way to know if there are any pins...  But that is wrong...
> > >    All we have to do is check
> > > 
> > > 	!list_empty(file->file_pins)
> > > 
> > > So now with this detail I think you are right, we should be able to hold the
> > > lease through the struct file even if the process no longer has any
> > > "references" to it (ie closes and munmaps the file).
> > 
> > I really, really dislike the idea of zombie layout leases. It's a
> > nasty hack for poor application behaviour. This is a "we allow use
> > after layout lease release" API, and I think encoding largely
> > untraceable zombie objects into an API is very poor design.
> > 
> > From the fcntl man page:
> > 
> > LEASES
> > 	Leases are associated with an open file description (see
> > 	open(2)).  This means that duplicate file descriptors
> > 	(created by, for example, fork(2) or dup(2))  re‐ fer  to
> > 	the  same  lease,  and this lease may be modified or
> > 	released using any of these descriptors.  Furthermore, the
> > 	lease is released by either an explicit F_UNLCK operation on
> > 	any of these duplicate file descriptors, or when all such
> > 	file descriptors have been closed.
> > 
> > Leases are associated with *open* file descriptors, not the
> > lifetime of the struct file in the kernel. If the application closes
> > the open fds that refer to the lease, then the kernel does not
> > guarantee, and the application has no right to expect, that the
> > lease remains active in any way once the application closes all
> > direct references to the lease.
> > 
> > IOWs, applications using layout leases need to hold the lease fd
> > open for as long as the want access to the physical file layout. It
> > is a also a requirement of the layout lease that the holder releases
> > the resources it holds on the layout before it releases the layout
> > lease, exclusive lease or not. Closing the fd indicates they do not
> > need access to the file any more, and so the lease should be
> > reclaimed at that point.
> > 
> > I'm of a mind to make the last close() on a file block if there's an
> > active layout lease to prevent processes from zombie-ing layout
> > leases like this. i.e. you can't close the fd until resources that
> > pin the lease have been released.
> 
> Yeah, so this was my initial though as well [1]. But as the discussion in
> that thread revealed, the problem with blocking last close is that kernel
> does not really expect close to block. You could easily deadlock e.g. if
> the process gets SIGKILL, file with lease has fd 10, and the RDMA context
> holding pages pinned has fd 15.

Sure, I did think about this a bit about it before suggesting it :)

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_add_work() and
task_work_run().

task_add_work() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
    task_work_run()
      __fput()
	locks_remove_file()
	  locks_remove_lease()
	    ....

It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases....

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
    get_signal()
      task_work_run()
        __fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal....

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
  exit_task_work()
    task_work_run()
      __fput()

and so it's easy to avoid blocking in this case, too.

So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this
context.  Blocking in an interruptible state will allow fatal signal
delivery to wake us, and then we fall into the
fatal_signal_pending() case if we get a SIGKILL while blocking.

Hence I think blocking in this case would be OK - it indicates an
application bug (releasing a lease before releasing the resources)
but leaves SIGKILL available to administrators to resolve situations
involving buggy applications.

This requires applications to follow the rules: any process
that pins physical resources must have an active reference to a
layout lease, either via a duplicated fd or it's own private lease.
If the app doesn't play by the rules, it hangs in close() until it
is killed.

> Or you could wait for another process to
> release page pins and blocking SIGKILL on that is also bad.

Again, each individual process that pins pages from the layout must
have it's own active layout lease reference.

> So in the end
> the least bad solution we've come up with were these "zombie" leases as you
> call them and tracking them in /proc so that userspace at least has a way
> of seeing them. But if you can come up with a different solution, I'm
> certainly not attached to the current one...

It might be the "least bad" solution, but it's still a pretty bad
one. And one that I don't think is necessary if we simply enforce
the "process must have active references for the entire time the
process uses the resource" rule. That's the way file access has
always worked, I don't see why we should be doing anything different
for access to the physical layout of files...

Cheers,

Dave.
Jason Gunthorpe Aug. 19, 2019, 12:38 p.m. UTC | #10
On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:

> So that leaves just the normal close() syscall exit case, where the
> application has full control of the order in which resources are
> released. We've already established that we can block in this
> context.  Blocking in an interruptible state will allow fatal signal
> delivery to wake us, and then we fall into the
> fatal_signal_pending() case if we get a SIGKILL while blocking.

The major problem with RDMA is that it doesn't always wait on close() for the
MR holding the page pins to be destoyed. This is done to avoid a
deadlock of the form:

   uverbs_destroy_ufile_hw()
      mutex_lock()
       [..]
        mmput()
         exit_mmap()
          remove_vma()
           fput();
            file_operations->release()
             ib_uverbs_close()
              uverbs_destroy_ufile_hw()
               mutex_lock()   <-- Deadlock

But, as I said to Ira earlier, I wonder if this is now impossible on
modern kernels and we can switch to making the whole thing
synchronous. That would resolve RDMA's main problem with this.

Jason
Ira Weiny Aug. 19, 2019, 9:53 p.m. UTC | #11
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
>    uverbs_destroy_ufile_hw()
>       mutex_lock()
>        [..]
>         mmput()
>          exit_mmap()
>           remove_vma()
>            fput();
>             file_operations->release()
>              ib_uverbs_close()
>               uverbs_destroy_ufile_hw()
>                mutex_lock()   <-- Deadlock
> 
> But, as I said to Ira earlier, I wonder if this is now impossible on
> modern kernels and we can switch to making the whole thing
> synchronous. That would resolve RDMA's main problem with this.

I'm still looking into this...  but my bigger concern is that the RDMA FD can
be passed to other processes via SCM_RIGHTS.  Which means the process holding
the pin may _not_ be the one with the open file and layout lease...

Ira
John Hubbard Aug. 20, 2019, 12:05 a.m. UTC | #12
On 8/19/19 2:24 AM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
...
> The last close is an interesting case because the __fput() call
> actually runs from task_work() context, not where the last reference
> is actually dropped. So it already has certain specific interactions
> with signals and task exit processing via task_add_work() and
> task_work_run().
> 
> task_add_work() calls set_notify_resume(task), so if nothing else
> triggers when returning to userspace we run this path:
> 
> exit_to_usermode_loop()
>    tracehook_notify_resume()
>      task_work_run()
>        __fput()
> 	locks_remove_file()
> 	  locks_remove_lease()
> 	    ....
> 
> It's worth noting that locks_remove_lease() does a
> percpu_down_read() which means we can already block in this context
> removing leases....
> 
> If there is a signal pending, the task work is run this way (before
> the above notify path):
> 
> exit_to_usermode_loop()
>    do_signal()
>      get_signal()
>        task_work_run()
>          __fput()
> 
> We can detect this case via signal_pending() and even SIGKILL via
> fatal_signal_pending(), and so we can decide not to block based on
> the fact the process is about to be reaped and so the lease largely
> doesn't matter anymore. I'd argue that it is close and we can't
> easily back out, so we'd only break the block on a fatal signal....
> 
> And then, of course, is the call path through do_exit(), which has
> the PF_EXITING task flag set:
> 
> do_exit()
>    exit_task_work()
>      task_work_run()
>        __fput()
> 
> and so it's easy to avoid blocking in this case, too.

Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
memory with FOLL_LONGTERM, and wondering how to make that work here.

These are close to files, in how they're handled, but just different
enough that it's not clear to me how to make work with this system.


> 
> So that leaves just the normal close() syscall exit case, where the
> application has full control of the order in which resources are
> released. We've already established that we can block in this
> context.  Blocking in an interruptible state will allow fatal signal
> delivery to wake us, and then we fall into the
> fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> Hence I think blocking in this case would be OK - it indicates an
> application bug (releasing a lease before releasing the resources)
> but leaves SIGKILL available to administrators to resolve situations
> involving buggy applications.
> 
> This requires applications to follow the rules: any process
> that pins physical resources must have an active reference to a
> layout lease, either via a duplicated fd or it's own private lease.
> If the app doesn't play by the rules, it hangs in close() until it
> is killed.

+1 for these rules, assuming that we can make them work. They are
easy to explain and intuitive.


thanks,
Dave Chinner Aug. 20, 2019, 1:12 a.m. UTC | #13
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
>    uverbs_destroy_ufile_hw()
>       mutex_lock()
>        [..]
>         mmput()
>          exit_mmap()
>           remove_vma()
>            fput();
>             file_operations->release()

I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.

		fput()
		  fput_many()
		    task_add_work(f, __fput())

and the call chain ends there.

Before the syscall returns to userspace, it then runs the __fput()
call through the task_work_run() interfaces, and hence the call
chain is just:

	task_work_run
	  __fput
>             file_operations->release()
>              ib_uverbs_close()
>               uverbs_destroy_ufile_hw()
>                mutex_lock()   <-- Deadlock

And there is no deadlock because nothing holds the mutex at this
point.

Cheers,

Dave.
Dave Chinner Aug. 20, 2019, 1:20 a.m. UTC | #14
On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> On 8/19/19 2:24 AM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> ...
> > The last close is an interesting case because the __fput() call
> > actually runs from task_work() context, not where the last reference
> > is actually dropped. So it already has certain specific interactions
> > with signals and task exit processing via task_add_work() and
> > task_work_run().
> > 
> > task_add_work() calls set_notify_resume(task), so if nothing else
> > triggers when returning to userspace we run this path:
> > 
> > exit_to_usermode_loop()
> >    tracehook_notify_resume()
> >      task_work_run()
> >        __fput()
> > 	locks_remove_file()
> > 	  locks_remove_lease()
> > 	    ....
> > 
> > It's worth noting that locks_remove_lease() does a
> > percpu_down_read() which means we can already block in this context
> > removing leases....
> > 
> > If there is a signal pending, the task work is run this way (before
> > the above notify path):
> > 
> > exit_to_usermode_loop()
> >    do_signal()
> >      get_signal()
> >        task_work_run()
> >          __fput()
> > 
> > We can detect this case via signal_pending() and even SIGKILL via
> > fatal_signal_pending(), and so we can decide not to block based on
> > the fact the process is about to be reaped and so the lease largely
> > doesn't matter anymore. I'd argue that it is close and we can't
> > easily back out, so we'd only break the block on a fatal signal....
> > 
> > And then, of course, is the call path through do_exit(), which has
> > the PF_EXITING task flag set:
> > 
> > do_exit()
> >    exit_task_work()
> >      task_work_run()
> >        __fput()
> > 
> > and so it's easy to avoid blocking in this case, too.
> 
> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> memory with FOLL_LONGTERM, and wondering how to make that work here.

I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?

> These are close to files, in how they're handled, but just different
> enough that it's not clear to me how to make work with this system.

I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too....

Cheers,

Dave.
John Hubbard Aug. 20, 2019, 3:09 a.m. UTC | #15
On 8/19/19 6:20 PM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
>> On 8/19/19 2:24 AM, Dave Chinner wrote:
>>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
>> ...
>>
>> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
>> memory with FOLL_LONGTERM, and wondering how to make that work here.
> 
> I'm not sure how this interacts with file mappings? I mean, this
> is just pinning anonymous pages for direct data placement into
> userspace, right?
> 
> Are you asking "what if this pinned memory was a file mapping?",
> or something else?

Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
already there in xdp_umem_pin_pages(), unconditionally. So the
simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
not set) are not going to work here.


> 
>> These are close to files, in how they're handled, but just different
>> enough that it's not clear to me how to make work with this system.
> 
> I'm guessing that if they are pinning a file backed mapping, they
> are trying to dma direct to the file (zero copy into page cache?)
> and so they'll need to either play by ODP rules or take layout
> leases, too....
> 

OK. I was just wondering if there was some simple way to dig up a
struct file associated with a socket (I don't think so), but it sounds
like this is an exercise that's potentially different for each subsystem.

thanks,
Dave Chinner Aug. 20, 2019, 3:36 a.m. UTC | #16
On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> On 8/19/19 6:20 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > ...
> > > 
> > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> > > memory with FOLL_LONGTERM, and wondering how to make that work here.
> > 
> > I'm not sure how this interacts with file mappings? I mean, this
> > is just pinning anonymous pages for direct data placement into
> > userspace, right?
> > 
> > Are you asking "what if this pinned memory was a file mapping?",
> > or something else?
> 
> Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
> already there in xdp_umem_pin_pages(), unconditionally. So the
> simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
> not set) are not going to work here.
> 
> 
> > 
> > > These are close to files, in how they're handled, but just different
> > > enough that it's not clear to me how to make work with this system.
> > 
> > I'm guessing that if they are pinning a file backed mapping, they
> > are trying to dma direct to the file (zero copy into page cache?)
> > and so they'll need to either play by ODP rules or take layout
> > leases, too....
> > 
> 
> OK. I was just wondering if there was some simple way to dig up a
> struct file associated with a socket (I don't think so), but it sounds
> like this is an exercise that's potentially different for each subsystem.

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is equivalent of the RDMA handle
that that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of waht the mapping represents, hence need to
hold a layout lease if it's a file mapping.

SO from the filesystem persepctive, there's no difference between
XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.

Cheers,

Dave.
Jason Gunthorpe Aug. 20, 2019, 11:55 a.m. UTC | #17
On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > 
> > > So that leaves just the normal close() syscall exit case, where the
> > > application has full control of the order in which resources are
> > > released. We've already established that we can block in this
> > > context.  Blocking in an interruptible state will allow fatal signal
> > > delivery to wake us, and then we fall into the
> > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > 
> > The major problem with RDMA is that it doesn't always wait on close() for the
> > MR holding the page pins to be destoyed. This is done to avoid a
> > deadlock of the form:
> > 
> >    uverbs_destroy_ufile_hw()
> >       mutex_lock()
> >        [..]
> >         mmput()
> >          exit_mmap()
> >           remove_vma()
> >            fput();
> >             file_operations->release()
> 
> I think this is wrong, and I'm pretty sure it's an example of why
> the final __fput() call is moved out of line.

Yes, I think so too, all I can say is this *used* to happen, as we
have special code avoiding it, which is the code that is messing up
Ira's lifetime model.

Ira, you could try unraveling the special locking, that solves your
lifetime issues?

Jason
Ira Weiny Aug. 21, 2019, 6:02 p.m. UTC | #18
On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > 
> > > > So that leaves just the normal close() syscall exit case, where the
> > > > application has full control of the order in which resources are
> > > > released. We've already established that we can block in this
> > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > delivery to wake us, and then we fall into the
> > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > 
> > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > MR holding the page pins to be destoyed. This is done to avoid a
> > > deadlock of the form:
> > > 
> > >    uverbs_destroy_ufile_hw()
> > >       mutex_lock()
> > >        [..]
> > >         mmput()
> > >          exit_mmap()
> > >           remove_vma()
> > >            fput();
> > >             file_operations->release()
> > 
> > I think this is wrong, and I'm pretty sure it's an example of why
> > the final __fput() call is moved out of line.
> 
> Yes, I think so too, all I can say is this *used* to happen, as we
> have special code avoiding it, which is the code that is messing up
> Ira's lifetime model.
> 
> Ira, you could try unraveling the special locking, that solves your
> lifetime issues?

Yes I will try to prove this out...  But I'm still not sure this fully solves
the problem.

This only ensures that the process which has the RDMA context (RDMA FD) is safe
with regard to hanging the close for the "data file FD" (the file which has
pinned pages) in that _same_ process.  But what about the scenario.

Process A has the RDMA context FD and data file FD (with lease) open.

Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Process A attempts to exit (hangs because data file FD is pinned).

Admin kills process A.  kill works because we have allowed for it...

Process B _still_ has the RDMA context FD open _and_ therefore still holds the
file pins.

Truncation still fails.

Admin does not know which process is holding the pin.

What am I missing?

Ira
Jason Gunthorpe Aug. 21, 2019, 6:13 p.m. UTC | #19
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >    uverbs_destroy_ufile_hw()
> > > >       mutex_lock()
> > > >        [..]
> > > >         mmput()
> > > >          exit_mmap()
> > > >           remove_vma()
> > > >            fput();
> > > >             file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.

Oh, I didn't think we were talking about that. Hanging the close of
the datafile fd contingent on some other FD's closure is a recipe for
deadlock..

IMHO the pin refcnt is held by the driver char dev FD, that is the
object you need to make it visible against.

Why not just have a single table someplace of all the layout leases
with the file they are held on and the FD/socket/etc that is holding
the pin? Make it independent of processes and FDs?

Jason
John Hubbard Aug. 21, 2019, 6:22 p.m. UTC | #20
On 8/21/19 11:13 AM, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
>> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
>>> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
>>>>>
>>>>>> So that leaves just the normal close() syscall exit case, where the
>>>>>> application has full control of the order in which resources are
>>>>>> released. We've already established that we can block in this
>>>>>> context.  Blocking in an interruptible state will allow fatal signal
>>>>>> delivery to wake us, and then we fall into the
>>>>>> fatal_signal_pending() case if we get a SIGKILL while blocking.
>>>>>
>>>>> The major problem with RDMA is that it doesn't always wait on close() for the
>>>>> MR holding the page pins to be destoyed. This is done to avoid a
>>>>> deadlock of the form:
>>>>>
>>>>>     uverbs_destroy_ufile_hw()
>>>>>        mutex_lock()
>>>>>         [..]
>>>>>          mmput()
>>>>>           exit_mmap()
>>>>>            remove_vma()
>>>>>             fput();
>>>>>              file_operations->release()
>>>>
>>>> I think this is wrong, and I'm pretty sure it's an example of why
>>>> the final __fput() call is moved out of line.
>>>
>>> Yes, I think so too, all I can say is this *used* to happen, as we
>>> have special code avoiding it, which is the code that is messing up
>>> Ira's lifetime model.
>>>
>>> Ira, you could try unraveling the special locking, that solves your
>>> lifetime issues?
>>
>> Yes I will try to prove this out...  But I'm still not sure this fully solves
>> the problem.
>>
>> This only ensures that the process which has the RDMA context (RDMA FD) is safe
>> with regard to hanging the close for the "data file FD" (the file which has
>> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..
> 
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.


If you do that, it might make it a lot simpler to add lease support
to drivers like XDP, which is otherwise looking pretty difficult to
set up with an fd. (It's socket-based, and not immediately clear where
to connect up the fd.)


thanks,
John Hubbard Aug. 21, 2019, 6:43 p.m. UTC | #21
On 8/19/19 8:36 PM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
>> On 8/19/19 6:20 PM, Dave Chinner wrote:
>>> On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
>>>> On 8/19/19 2:24 AM, Dave Chinner wrote:
>>>>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>>>>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>>>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
>>>> ...
> AFAIA, there is no struct file here - the memory that has been pinned
> is just something mapped into the application's address space.
> 
> It seems to me that the socket here is equivalent of the RDMA handle
> that that owns the hardware that pins the pages. Again, that RDMA
> handle is not aware of waht the mapping represents, hence need to
> hold a layout lease if it's a file mapping.
> 
> SO from the filesystem persepctive, there's no difference between
> XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
> into the filesystem's backing store and that will require use of
> layout leases to perform safely.
> 

OK, got it! Makes perfect sense.

thanks,
Ira Weiny Aug. 21, 2019, 6:57 p.m. UTC | #22
On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >    uverbs_destroy_ufile_hw()
> > > > >       mutex_lock()
> > > > >        [..]
> > > > >         mmput()
> > > > >          exit_mmap()
> > > > >           remove_vma()
> > > > >            fput();
> > > > >             file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> 
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..

The discussion between Jan and Dave was concerning what happens when a user
calls

fd = open()
fnctl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)

Dave suggested:

"I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released."

	-- Dave https://lkml.org/lkml/2019/8/16/994

> 
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.

I'm sorry but what do you mean by "driver char dev FD"?

> 
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?

If it is independent of processes how will we know which process is blocking
the truncate?  Using a global table is an interesting idea but I still believe
the users are going to want to track this to specific processes.  It's not
clear to me how that would be done with a global table.

I agree the XDP/socket case is bothersome...  I was thinking that somewhere the
fd of the socket could be hooked up in this case.  But taking a look at it
reveals that is not going to be easy.  And I assume XDP has the same issue WRT
SCM_RIGHTS and the ability to share the xdp context?

Ira

> 
> Jason
Ira Weiny Aug. 21, 2019, 7:06 p.m. UTC | #23
On Wed, Aug 21, 2019 at 11:57:03AM -0700, 'Ira Weiny' wrote:
> On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >    uverbs_destroy_ufile_hw()
> > > > > >       mutex_lock()
> > > > > >        [..]
> > > > > >         mmput()
> > > > > >          exit_mmap()
> > > > > >           remove_vma()
> > > > > >            fput();
> > > > > >             file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > > with regard to hanging the close for the "data file FD" (the file which has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fnctl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)
> 
> Dave suggested:
> 
> "I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released."
> 
> 	-- Dave https://lkml.org/lkml/2019/8/16/994

I think this may all be easier if there was a way to block a dup() if it comes
from an SCM_RIGHTS.  Does anyone know if that is easy?  I assume it would just
mean passing some flag through the dup() call chain.

Jason, if we did that would it break RDMA use cases?  I personally don't know
of any.  We could pass data back from vaddr_pin indicating that a file pin has
been taken and predicate the blocking of SCM_RIGHTS on that? 

Of course if the user called:

fd = open()
fnctl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)
fork() <=== in another thread because close is hanging

Would that dup() "fd" above into the child?  Or maybe that would be part of the
work to make close() hang?  Ensure the fd/file is still in the FD table so it
gets dupped???

Ira


> 
> > 
> > IMHO the pin refcnt is held by the driver char dev FD, that is the
> > object you need to make it visible against.
> 
> I'm sorry but what do you mean by "driver char dev FD"?
> 
> > 
> > Why not just have a single table someplace of all the layout leases
> > with the file they are held on and the FD/socket/etc that is holding
> > the pin? Make it independent of processes and FDs?
> 
> If it is independent of processes how will we know which process is blocking
> the truncate?  Using a global table is an interesting idea but I still believe
> the users are going to want to track this to specific processes.  It's not
> clear to me how that would be done with a global table.
> 
> I agree the XDP/socket case is bothersome...  I was thinking that somewhere the
> fd of the socket could be hooked up in this case.  But taking a look at it
> reveals that is not going to be easy.  And I assume XDP has the same issue WRT
> SCM_RIGHTS and the ability to share the xdp context?
> 
> Ira
> 
> > 
> > Jason
Ira Weiny Aug. 21, 2019, 7:09 p.m. UTC | #24
On Wed, Aug 21, 2019 at 11:43:30AM -0700, John Hubbard wrote:
> On 8/19/19 8:36 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> > > On 8/19/19 6:20 PM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > ...
> > AFAIA, there is no struct file here - the memory that has been pinned
> > is just something mapped into the application's address space.
> > 
> > It seems to me that the socket here is equivalent of the RDMA handle
> > that that owns the hardware that pins the pages. Again, that RDMA
> > handle is not aware of waht the mapping represents, hence need to
> > hold a layout lease if it's a file mapping.
> > 
> > SO from the filesystem persepctive, there's no difference between
> > XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
> > into the filesystem's backing store and that will require use of
> > layout leases to perform safely.
> > 
> 
> OK, got it! Makes perfect sense.

Just to chime in here... Yea from the FS perspective it is the same.

But on the driver side it is more complicated because of how the references to
the pins can be shared among other processes.

See the other branch of this thread

https://lkml.org/lkml/2019/8/21/828

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
Jason Gunthorpe Aug. 21, 2019, 7:48 p.m. UTC | #25
On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:

> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fnctl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)

I don't see how blocking close(fd) could work. Write it like this:

 fd = open()
 uverbs = open(/dev/uverbs)
 fnctl(...getlease...)
 addr = mmap(fd...)
 ib_reg_mr() <pin>
 munmap(addr...)
  <sigkill>

The order FD's are closed during sigkill is not deterministic, so when
all the fputs happen during a kill'd exit we could end up blocking in
close(fd) as close(uverbs) will come after in the close
list. close(uverbs) is the thing that does the dereg_mr and releases
the pin.

We don't need complexity with dup to create problems.

Jason
Ira Weiny Aug. 21, 2019, 8:44 p.m. UTC | #26
On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> 
> > > Oh, I didn't think we were talking about that. Hanging the close of
> > > the datafile fd contingent on some other FD's closure is a recipe for
> > > deadlock..
> > 
> > The discussion between Jan and Dave was concerning what happens when a user
> > calls
> > 
> > fd = open()
> > fnctl(...getlease...)
> > addr = mmap(fd...)
> > ib_reg_mr() <pin>
> > munmap(addr...)
> > close(fd)
> 
> I don't see how blocking close(fd) could work.

Well Dave was saying this _could_ work.  FWIW I'm not 100% sure it will but I
can't prove it won't..  Maybe we are all just touching a different part of this
elephant[1] but the above scenario or one without munmap is very reasonably
something a user would do.  So we can either allow the close to complete (my
current patches) or try to make it block like Dave is suggesting.

I don't disagree with Dave with the semantics being nice and clean for the
filesystem.  But the fact that RDMA, and potentially others, can "pass the
pins" to other processes is something I spent a lot of time trying to work out.

>
> Write it like this:
> 
>  fd = open()
>  uverbs = open(/dev/uverbs)
>  fnctl(...getlease...)
>  addr = mmap(fd...)
>  ib_reg_mr() <pin>
>  munmap(addr...)
>   <sigkill>
> 
> The order FD's are closed during sigkill is not deterministic, so when
> all the fputs happen during a kill'd exit we could end up blocking in
> close(fd) as close(uverbs) will come after in the close
> list. close(uverbs) is the thing that does the dereg_mr and releases
> the pin.

Of course, that is a different scenario which needs to be fixed in my patch
set.  Now that my servers are back up I can hopefully make progress.  (Power
was down for them yesterday).

> 
> We don't need complexity with dup to create problems.

No but that complexity _will_ come unless we "zombie" layout leases.

Ira

[1] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

> 
> Jason
>
Jason Gunthorpe Aug. 21, 2019, 11:49 p.m. UTC | #27
On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:

> > The order FD's are closed during sigkill is not deterministic, so when
> > all the fputs happen during a kill'd exit we could end up blocking in
> > close(fd) as close(uverbs) will come after in the close
> > list. close(uverbs) is the thing that does the dereg_mr and releases
> > the pin.
> 
> Of course, that is a different scenario which needs to be fixed in my patch
> set.  Now that my servers are back up I can hopefully make progress.  (Power
> was down for them yesterday).

It isn't really a different scenario, the problem is that the
filesystem fd must be closable independenly of fencing the MR to avoid
deadlock cycles. Once you resolve that the issue of the uverbs FD out
living it won't matter one bit if it is in the same process or
another.

Jason
Dave Chinner Aug. 23, 2019, 12:59 a.m. UTC | #28
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >    uverbs_destroy_ufile_hw()
> > > >       mutex_lock()
> > > >        [..]
> > > >         mmput()
> > > >          exit_mmap()
> > > >           remove_vma()
> > > >            fput();
> > > >             file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Process A has the RDMA context FD and data file FD (with lease) open.
> 
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
> 
> Admin kills process A.  kill works because we have allowed for it...
> 
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
> 
> Truncation still fails.
> 
> Admin does not know which process is holding the pin.
> 
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
Dave Chinner Aug. 23, 2019, 3:23 a.m. UTC | #29
On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > 
> > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > deadlock..
> > > 
> > > The discussion between Jan and Dave was concerning what happens when a user
> > > calls
> > > 
> > > fd = open()
> > > fnctl(...getlease...)
> > > addr = mmap(fd...)
> > > ib_reg_mr() <pin>
> > > munmap(addr...)
> > > close(fd)
> > 
> > I don't see how blocking close(fd) could work.
> 
> Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> can't prove it won't..

Right, I proposed it as a possible way of making sure application
developers don't do this. It _could_ be made to work (e.g. recording
longterm page pins on the vma->file), but this is tangential to 
the discussion of requiring active references to all resources
covered by the layout lease.

I think allowing applications to behave like the above is simply
poor system level design, regardless of the interaction with
filesystems and layout leases.

> Maybe we are all just touching a different part of this
> elephant[1] but the above scenario or one without munmap is very reasonably
> something a user would do.  So we can either allow the close to complete (my
> current patches) or try to make it block like Dave is suggesting.
> 
> I don't disagree with Dave with the semantics being nice and clean for the
> filesystem.

I'm not trying to make it "nice and clean for the filesystem".

The problem is not just RDMA/DAX - anything that is directly
accessing the block device under the filesystem has the same set of
issues. That is, the filesystem controls the life cycle of the
blocks in the block device, so direct access to the blocks by any
means needs to be co-ordinated with the filesystem. Pinning direct
access to a file via page pins attached to a hardware context that
the filesystem knows nothing about is not an access model that the
filesystems can support.

IOWs, anyone looking at this problem just from the RDMA POV of page
pins is not seeing all the other direct storage access mechainsms
that we need to support in the filesystems. RDMA on DAX is just one
of them.  pNFS is another. Remote acces via NVMeOF is another. XDP
-> DAX (direct file data placement from the network hardware) is
another. There are /lots/ of different direct storage access
mechanisms that filesystems need to support and we sure as hell do
not want to have to support special case semantics for every single
one of them.

Hence if we don't start with a sane model for arbitrating direct
access to the storage at the filesystem level we'll never get this
stuff to work reliably, let alone work together coherently.  An
application that wants a direct data path to storage should have a
single API that enables then to safely access the storage,
regardless of how they are accessing the storage.

From that perspective, what we are talking about here with RDMA
doing "mmap, page pin, unmap, close" and "pass page pins via
SCM_RIGHTS" are fundamentally unworkable from the filesystem
perspective. They are use-after-free situations from the filesystem
perspective - they do not hold direct references to anything in the
filesystem, and so the filesytem is completely unaware of them.

The filesystem needs to be aware of /all users/ of it's resources if
it's going to manage them sanely.  It needs to be able to corectly
coordinate modifications to ownership of the underlying storage with
all the users directly accessing that physical storage regardless of
the mechanism being used to access the storage.  IOWs, access
control must be independent of the mechanism used to gain access to
the storage hardware.

That's what file layout leases are defining - the filesystem support
model for allowing direct storage access from userspace. It's not
defining "RDMA/FSDAX" access rules, it's defining a generic direct
access model. And one of the rules in this model is "if you don't
have an active reference to the file layout, you are not allowed to
directly access the layout.".

Anything else is unsupportable from the filesystem perspective -
designing an access mechanism that allows userspace to retain access
indefinitely by relying on hidden details of kernel subsystem
implementations is a terrible architecture.  Not only does it bleed
kernel implementation into the API and the behavioural model, it
means we can't ever change that internal kernel behaviour because
userspace may be dependent on it. I shouldn't be having to point out
how bad this is from a system design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on our radar?" and working from
there.  So I'm not just looking at what we are doing right now, I'm
looking at 15 years down the track when we still have to support
layout leases and we've got hardware we haven't dreamed of yet.  If
the model is clean, simple, robust, implementation independent and
has well defined semantics, then it should stand the test of time.
i.e. the "nice and clean" semantics have nothign to do with the
filesystem per se, but everything to do with ensuring the mechanism
is generic and workable for direct storage access for a long time
into the future.

We can't force people to use layout leases - at all, let alone
correctly - but if you want filesystems and enterprise distros to
support direct access to filesystem controlled storage, then direct
access applications need to follow a sane set of rules that are
supportable at the filesystem level.

> But the fact that RDMA, and potentially others, can "pass the
> pins" to other processes is something I spent a lot of time trying to work out.

There's nothing in file layout lease architecture that says you
can't "pass the pins" to another process.  All the file layout lease
requirements say is that if you are going to pass a resource for
which the layout lease guarantees access for to another process,
then the destination process already have a valid, active layout
lease that covers the range of the pins being passed to it via the
RDMA handle.

i.e. as the pins pass from one process to another, they pass from
the protection of the lease process A holds to the protection that
the lease process B holds. This can probably even be done by
duplicating the lease fd and passing it by SCM_RIGHTS first.....

Cheers,

Dave.
Jason Gunthorpe Aug. 23, 2019, 12:04 p.m. UTC | #30
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:

> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work out.
> 
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process.  All the file layout lease
> requirements say is that if you are going to pass a resource for
> which the layout lease guarantees access for to another process,
> then the destination process already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.

How would the kernel detect and enforce this? There are many ways to
pass a FD.

IMHO it is wrong to try and create a model where the file lease exists
independently from the kernel object relying on it. In other words the
IB MR object itself should hold a reference to the lease it relies
upon to function properly.

Then we don't have to wreck the unix FD model to fit this in.

Jason
Ira Weiny Aug. 23, 2019, 5:15 p.m. UTC | #31
On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >    uverbs_destroy_ufile_hw()
> > > > >       mutex_lock()
> > > > >        [..]
> > > > >         mmput()
> > > > >          exit_mmap()
> > > > >           remove_vma()
> > > > >            fput();
> > > > >             file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Process A has the RDMA context FD and data file FD (with lease) open.
> > 
> > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> 
> Passing the RDMA context dependent on a file layout lease to another
> process that doesn't have a file layout lease or a reference to the
> original lease should be considered a violation of the layout lease.
> Process B does not have an active layout lease, and so by the rules
> of layout leases, it is not allowed to pin the layout of the file.
> 

I don't disagree with the semantics of this.  I just don't know how to enforce
it.

> > Process A attempts to exit (hangs because data file FD is pinned).
> > 
> > Admin kills process A.  kill works because we have allowed for it...
> > 
> > Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> > file pins.
> > 
> > Truncation still fails.
> > 
> > Admin does not know which process is holding the pin.
> > 
> > What am I missing?
> 
> Application does not hold the correct file layout lease references.
> Passing the fd via SCM_RIGHTS to a process without a layout lease
> is equivalent to not using layout leases in the first place.

Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
in this case?  I'm ok with that but not sure how to implement it right now.

To that end, I would like to simplify this slightly because I'm not convinced
that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
user who wants to do this.

Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
definition leases) exist underneath the "RDMA FD" (or other direct access FD,
like XDP etc) being duplicated.  Later, if this becomes a use case we will need
to code up the proper checks, potentially within each of the subsystems.  This
is because, with RDMA at least, there are potentially large numbers of MR's and
file leases which may have to be checked.

Ira
Dave Chinner Aug. 24, 2019, 12:11 a.m. UTC | #32
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> 
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to work out.
> > 
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process.  All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
> 
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and it's
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference counted objects objects - they are attached to a struct
file bUt, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications.  i.e. the layout break signal gets delivered to
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with a
the new fd in the new process and so layout break signals can't be
directed at the lease fd....

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range....

Cheers,

Dave.
Dave Chinner Aug. 24, 2019, 12:18 a.m. UTC | #33
On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote:
> On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >    uverbs_destroy_ufile_hw()
> > > > > >       mutex_lock()
> > > > > >        [..]
> > > > > >         mmput()
> > > > > >          exit_mmap()
> > > > > >           remove_vma()
> > > > > >            fput();
> > > > > >             file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > > with regard to hanging the close for the "data file FD" (the file which has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > > 
> > > Process A has the RDMA context FD and data file FD (with lease) open.
> > > 
> > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> > 
> > Passing the RDMA context dependent on a file layout lease to another
> > process that doesn't have a file layout lease or a reference to the
> > original lease should be considered a violation of the layout lease.
> > Process B does not have an active layout lease, and so by the rules
> > of layout leases, it is not allowed to pin the layout of the file.
> > 
> 
> I don't disagree with the semantics of this.  I just don't know how to enforce
> it.
> 
> > > Process A attempts to exit (hangs because data file FD is pinned).
> > > 
> > > Admin kills process A.  kill works because we have allowed for it...
> > > 
> > > Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> > > file pins.
> > > 
> > > Truncation still fails.
> > > 
> > > Admin does not know which process is holding the pin.
> > > 
> > > What am I missing?
> > 
> > Application does not hold the correct file layout lease references.
> > Passing the fd via SCM_RIGHTS to a process without a layout lease
> > is equivalent to not using layout leases in the first place.
> 
> Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
> in this case?  I'm ok with that but not sure how to implement it right now.
> 
> To that end, I would like to simplify this slightly because I'm not convinced
> that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
> user who wants to do this.

I don't think we can support it, let alone want to. SCM_RIGHTS was a
mistake made years ago that has been causing bugs and complexity to
try and avoid those bugs ever since.  I'm only taking about it
because someone else raised it and I asummed they raised it because
they want it to "work".

> Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
> definition leases) exist underneath the "RDMA FD" (or other direct access FD,
> like XDP etc) being duplicated.

Sounds like a fine idea to me.

Cheers,

Dave.
Ira Weiny Aug. 24, 2019, 4:49 a.m. UTC | #34
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > > 
> > > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > > deadlock..
> > > > 
> > > > The discussion between Jan and Dave was concerning what happens when a user
> > > > calls
> > > > 
> > > > fd = open()
> > > > fnctl(...getlease...)
> > > > addr = mmap(fd...)
> > > > ib_reg_mr() <pin>
> > > > munmap(addr...)
> > > > close(fd)
> > > 
> > > I don't see how blocking close(fd) could work.
> > 
> > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> > can't prove it won't..
> 
> Right, I proposed it as a possible way of making sure application
> developers don't do this. It _could_ be made to work (e.g. recording
> longterm page pins on the vma->file), but this is tangential to 
> the discussion of requiring active references to all resources
> covered by the layout lease.
> 
> I think allowing applications to behave like the above is simply
> poor system level design, regardless of the interaction with
> filesystems and layout leases.
> 
> > Maybe we are all just touching a different part of this
> > elephant[1] but the above scenario or one without munmap is very reasonably
> > something a user would do.  So we can either allow the close to complete (my
> > current patches) or try to make it block like Dave is suggesting.

My belief when writing the current series was that hanging the close would
cause deadlock.  But it seems I was wrong because of the delayed __fput().

So far, I have not been able to get RDMA to have an issue like Jason suggested
would happen (or used to happen).  So from that perspective it may be ok to
hang the close.

> > 
> > I don't disagree with Dave with the semantics being nice and clean for the
> > filesystem.
> 
> I'm not trying to make it "nice and clean for the filesystem".
> 
> The problem is not just RDMA/DAX - anything that is directly
> accessing the block device under the filesystem has the same set of
> issues. That is, the filesystem controls the life cycle of the
> blocks in the block device, so direct access to the blocks by any
> means needs to be co-ordinated with the filesystem. Pinning direct
> access to a file via page pins attached to a hardware context that
> the filesystem knows nothing about is not an access model that the
> filesystems can support.
> 
> IOWs, anyone looking at this problem just from the RDMA POV of page
> pins is not seeing all the other direct storage access mechainsms
> that we need to support in the filesystems. RDMA on DAX is just one
> of them.  pNFS is another. Remote acces via NVMeOF is another. XDP
> -> DAX (direct file data placement from the network hardware) is
> another. There are /lots/ of different direct storage access
> mechanisms that filesystems need to support and we sure as hell do
> not want to have to support special case semantics for every single
> one of them.

My use of struct file was based on the fact that FDs are a primary interface
for linux and my thought was that they would be more universal than having file
pin information stored in an RDMA specific structure.

XDP is not as direct; it uses sockets.  But sockets also have a struct file
which I believe could be used in a similar manner.  I'm not 100% sure of the
xdp_umem lifetime yet but it seems that my choice of using struct file was a
good one in this respect.

> 
> Hence if we don't start with a sane model for arbitrating direct
> access to the storage at the filesystem level we'll never get this
> stuff to work reliably, let alone work together coherently.  An
> application that wants a direct data path to storage should have a
> single API that enables then to safely access the storage,
> regardless of how they are accessing the storage.
> 
> From that perspective, what we are talking about here with RDMA
> doing "mmap, page pin, unmap, close" and "pass page pins via
> SCM_RIGHTS" are fundamentally unworkable from the filesystem
> perspective. They are use-after-free situations from the filesystem
> perspective - they do not hold direct references to anything in the
> filesystem, and so the filesytem is completely unaware of them.

I see your point of view but looking at it from a different point of view I
don't see this as a "use after free".

The user has explicitly registered this memory (and layout) with another direct
access subsystem (RDMA for example) so why do they need to keep the FD around?

> 
> The filesystem needs to be aware of /all users/ of it's resources if
> it's going to manage them sanely.

From the way I look at it the underlying filesystem _is_ aware of the leases
with my patch set.  And so to is the user.  It is just not through the original
"data file fd".

And the owner of the lease becomes the subsystem object ("RDMA FD" in this
case) which is holding the pins.  Furthermore, the lease is maintained and
transferred automatically through the normal FD processing.

(Furthermore, tracking of these pins is available for whatever subsystem by
tracking them with struct file; _not_ just RDMA).  When those subsystem objects
are released the "data file lease" will be released as well.  That was the
design.

> 
> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work out.
> 
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process.  All the file layout lease
> requirements say is that if you are going to pass a resource for
> which the layout lease guarantees access for to another process,
> then the destination process already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.
> 
> i.e. as the pins pass from one process to another, they pass from
> the protection of the lease process A holds to the protection that
> the lease process B holds. This can probably even be done by
> duplicating the lease fd and passing it by SCM_RIGHTS first.....

My worry with this is how to enforce it.  As I said in the other thread I think
we could potentially block SCM_RIGHTS use in the short term.  But I'm not sure
about blocking every call which may "dup()" an FD to random processes.

Ira
Ira Weiny Aug. 24, 2019, 5:08 a.m. UTC | #35
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.
> 
> > IMHO it is wrong to try and create a model where the file lease exists
> > independently from the kernel object relying on it. In other words the
> > IB MR object itself should hold a reference to the lease it relies
> > upon to function properly.
> 
> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.
> 
> That is, a lease specifically tracks the pid and the _open fd_ it
> was obtained for, so it is essentially owned by a specific process
> context.  Hence a lease is not able to be passed to a separate
> process context and have it still work correctly for lease break
> notifications.  i.e. the layout break signal gets delivered to
> original process that created the struct file, if it still exists
> and has the original fd still open. It does not get sent to the
> process that currently holds a reference to the IB context.
>

The fcntl man page says:

"Leases are associated with an open file description (see open(2)).  This means
that duplicate file descriptors (created by, for example, fork(2) or dup(2))
refer to the same lease, and this lease may be modified or released using any
of these descriptors.  Furthermore,  the lease is released by either an
explicit F_UNLCK operation on any of these duplicate file descriptors, or when
all such file descriptors have been closed."

From this I took it that the child process FD would have the lease as well
_and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
seem to work the same way as dup() so I'm not so sure.

Ira

> 
> So while a struct file passed to another process might still have
> an active lease, and you can change the owner of the struct file
> via fcntl(F_SETOWN), you can't associate the existing lease with a
> the new fd in the new process and so layout break signals can't be
> directed at the lease fd....
> 
> This really means that a lease can only be owned by a single process
> context - it can't be shared across multiple processes (so I was
> wrong about dup/pass as being a possible way of passing them)
> because there's only one process that can "own" a struct file, and
> that where signals are sent when the lease needs to be broken.
> 
> So, fundamentally, if you want to pass a resource that pins a file
> layout between processes, both processes need to hold a layout lease
> on that file range. And that means exclusive leases and passing
> layouts between processes are fundamentally incompatible because you
> can't hold two exclusive leases on the same file range....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Jason Gunthorpe Aug. 25, 2019, 7:39 p.m. UTC | #36
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.

It is a kernel problem, the MR exists and is doing DMA. That relies on
the lease to prevent data corruption.

The sanest outcome I could suggest is that when the kernel detects the
MR has outlived the lease it needs then we forcibly abort the entire
RDMA state. Ie the application has malfunctioned and gets wacked with
a very big hammer.

> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.

This is the problem. How to link something that is not refcounted to
the refcounted world of file descriptors does not seem very obvious.

There are too many places where struct file relies on its refcounting
to try to and plug them.

Jason
Jason Gunthorpe Aug. 25, 2019, 7:40 p.m. UTC | #37
On Fri, Aug 23, 2019 at 09:49:12PM -0700, Ira Weiny wrote:

> So far, I have not been able to get RDMA to have an issue like Jason suggested
> would happen (or used to happen).  So from that perspective it may be ok to
> hang the close.

No, it is not OK to hang the close. You will deadlock on process
destruction when the 'lease fd' hangs waiting for the 'uverbs fd'
which is later in the single threaded destruction sequence.

This is different from the uverbs deadlock I outlined

Jason
Dave Chinner Aug. 26, 2019, 5:55 a.m. UTC | #38
On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > > 
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > > 
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process.  All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access for to another process,
> > > > then the destination process already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > > 
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> > 
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and it's
> > resources.
> > 
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> > 
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects objects - they are attached to a struct
> > file bUt, in reality, they are far more restricted than a struct
> > file.
> > 
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context.  Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications.  i.e. the layout break signal gets delivered to
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
> 
> The fcntl man page says:
> 
> "Leases are associated with an open file description (see open(2)).  This means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors.  Furthermore,  the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
where-ever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

	send_sigio(fown, fa->fa_fd, band);

And it does for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.

Yes, you can change the owning process via F_SETOWNER, but that's
still only a single process context, and you can't change the fd in
the fasync list. You can add new fd to an existing lease by calling
F_SETLEASE on the new fd, but you still only have a single process
owner context for signal delivery.

As such, leases that require callbacks to userspace are currently
only valid within the process context the lease was taken in.
Indeed, even closing the fd the lease was taken on without
F_UNLCKing it first doesn't mean the lease has been torn down if
there is some other reference to the struct file. That means the
original lease owner will still get SIGIO delivered to that fd on a
lease break regardless of whether it is open or not. ANd if we
implement "layout lease not released within SIGIO response timeout"
then that process will get killed, despite the fact it may not even
have a reference to that file anymore.

So, AFAICT, leases that require userspace callbacks only work within
their original process context while they original fd is still open.

Cheers,

Dave.
Ira Weiny Aug. 29, 2019, 2:02 a.m. UTC | #39
On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > >
> > > > IMHO it is wrong to try and create a model where the file lease exists
> > > > independently from the kernel object relying on it. In other words the
> > > > IB MR object itself should hold a reference to the lease it relies
> > > > upon to function properly.
> > > 
> > > That still doesn't work. Leases are not individually trackable or
> > > reference counted objects objects - they are attached to a struct
> > > file bUt, in reality, they are far more restricted than a struct
> > > file.
> > > 
> > > That is, a lease specifically tracks the pid and the _open fd_ it
> > > was obtained for, so it is essentially owned by a specific process
> > > context.  Hence a lease is not able to be passed to a separate
> > > process context and have it still work correctly for lease break
> > > notifications.  i.e. the layout break signal gets delivered to
> > > original process that created the struct file, if it still exists
> > > and has the original fd still open. It does not get sent to the
> > > process that currently holds a reference to the IB context.

But this is an exclusive layout lease which does not send a signal.  There is
no way to break it.

> > >
> > 
> > The fcntl man page says:
> > 
> > "Leases are associated with an open file description (see open(2)).  This means
> > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > refer to the same lease, and this lease may be modified or released using any
> > of these descriptors.  Furthermore,  the lease is released by either an
> > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > all such file descriptors have been closed."
> 
> Right, the lease is attached to the struct file, so it follows
> where-ever the struct file goes. That doesn't mean it's actually
> useful when the struct file is duplicated and/or passed to another
> process. :/
> 
> AFAICT, the problem is that when we take another reference to the
> struct file, or when the struct file is passed to a different
> process, nothing updates the lease or lease state attached to that
> struct file.

Ok, I probably should have made this more clear in the cover letter but _only_
the process which took the lease can actually pin memory.

That pinned memory _can_ be passed to another process but those sub-process' can
_not_ use the original lease to pin _more_ of the file.  They would need to
take their own lease to do that.

Sorry for not being clear on that.

> 
> > From this I took it that the child process FD would have the lease as well
> > _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
> > seem to work the same way as dup() so I'm not so sure.
> 
> Sure, that part works because the struct file is passed. It doesn't
> end up with the same fd number in the other process, though.
> 
> The issue is that layout leases need to notify userspace when they
> are broken by the kernel, so a lease stores the owner pid/tid in the
> file->f_owner field via __f_setown(). It also keeps a struct fasync
> attached to the file_lock that records the fd that the lease was
> created on.  When a signal needs to be sent to userspace for that
> lease, we call kill_fasync() and that walks the list of fasync
> structures on the lease and calls:
> 
> 	send_sigio(fown, fa->fa_fd, band);
> 
> And it does for every fasync struct attached to a lease. Yes, a
> lease can track multiple fds, but it can only track them in a single
> process context. The moment the struct file is shared with another
> process, the lease is no longer capable of sending notifications to
> all the lease holders.
> 
> Yes, you can change the owning process via F_SETOWNER, but that's
> still only a single process context, and you can't change the fd in
> the fasync list. You can add new fd to an existing lease by calling
> F_SETLEASE on the new fd, but you still only have a single process
> owner context for signal delivery.
> 
> As such, leases that require callbacks to userspace are currently
> only valid within the process context the lease was taken in.

But for long term pins we are not requiring callbacks.

> Indeed, even closing the fd the lease was taken on without
> F_UNLCKing it first doesn't mean the lease has been torn down if
> there is some other reference to the struct file. That means the
> original lease owner will still get SIGIO delivered to that fd on a
> lease break regardless of whether it is open or not. ANd if we
> implement "layout lease not released within SIGIO response timeout"
> then that process will get killed, despite the fact it may not even
> have a reference to that file anymore.

I'm not seeing that as a problem.  This is all a result of the application
failing to do the right thing.  The code here is simply keeping the kernel
consistent and safe so that an admin or the user themselves can unwind the
badness without damage to the file system.

> 
> So, AFAICT, leases that require userspace callbacks only work within
> their original process context while they original fd is still open.

But they _work_ IFF the application actually expects to do something with the
SIGIO.  The application could just as well chose to ignore the SIGIO without
closing the FD which would do the same thing.

If the application expected to do something with the SIGIO but closed the FD
then it's really just the applications fault.

So after thinking on this for a day I don't think we have a serious issue.

Even the "zombie" lease is just an application error and it is already possible
to get something like this.  If the application passes the FD to another
process and closes their FD then SIGIO's don't get delivered but there is a
lease hanging off the struct file until it is destroyed.  No harm, no foul.

In the case of close it is _not_ true that users don't have a way to release
the lease.  It is just that they can't call F_UNLCK to do so.  Once they have
"zombie'ed" the lease (again an application error) the only recourse is to
unpin the file through the subsystem which pinned the page.  Probably through
killing the process.

Ira
John Hubbard Aug. 29, 2019, 3:27 a.m. UTC | #40
On 8/28/19 7:02 PM, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
>> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
>>> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
>>>> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
...
>>
>> Sure, that part works because the struct file is passed. It doesn't
>> end up with the same fd number in the other process, though.
>>
>> The issue is that layout leases need to notify userspace when they
>> are broken by the kernel, so a lease stores the owner pid/tid in the
>> file->f_owner field via __f_setown(). It also keeps a struct fasync
>> attached to the file_lock that records the fd that the lease was
>> created on.  When a signal needs to be sent to userspace for that
>> lease, we call kill_fasync() and that walks the list of fasync
>> structures on the lease and calls:
>>
>> 	send_sigio(fown, fa->fa_fd, band);
>>
>> And it does for every fasync struct attached to a lease. Yes, a
>> lease can track multiple fds, but it can only track them in a single
>> process context. The moment the struct file is shared with another
>> process, the lease is no longer capable of sending notifications to
>> all the lease holders.
>>
>> Yes, you can change the owning process via F_SETOWNER, but that's
>> still only a single process context, and you can't change the fd in
>> the fasync list. You can add new fd to an existing lease by calling
>> F_SETLEASE on the new fd, but you still only have a single process
>> owner context for signal delivery.
>>
>> As such, leases that require callbacks to userspace are currently
>> only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.
> 

Hi Ira,

If "require callbacks to userspace" means sending SIGIO, then actually
FOLL_LONGTERM *does* require those callbacks. Because we've been, so
far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.

What am I missing here?

thanks,
Ira Weiny Aug. 29, 2019, 4:16 p.m. UTC | #41
On Wed, Aug 28, 2019 at 08:27:23PM -0700, John Hubbard wrote:
> On 8/28/19 7:02 PM, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> ...
> > > 
> > > Sure, that part works because the struct file is passed. It doesn't
> > > end up with the same fd number in the other process, though.
> > > 
> > > The issue is that layout leases need to notify userspace when they
> > > are broken by the kernel, so a lease stores the owner pid/tid in the
> > > file->f_owner field via __f_setown(). It also keeps a struct fasync
> > > attached to the file_lock that records the fd that the lease was
> > > created on.  When a signal needs to be sent to userspace for that
> > > lease, we call kill_fasync() and that walks the list of fasync
> > > structures on the lease and calls:
> > > 
> > > 	send_sigio(fown, fa->fa_fd, band);
> > > 
> > > And it does for every fasync struct attached to a lease. Yes, a
> > > lease can track multiple fds, but it can only track them in a single
> > > process context. The moment the struct file is shared with another
> > > process, the lease is no longer capable of sending notifications to
> > > all the lease holders.
> > > 
> > > Yes, you can change the owning process via F_SETOWNER, but that's
> > > still only a single process context, and you can't change the fd in
> > > the fasync list. You can add new fd to an existing lease by calling
> > > F_SETLEASE on the new fd, but you still only have a single process
> > > owner context for signal delivery.
> > > 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> > 
> 
> Hi Ira,
> 
> If "require callbacks to userspace" means sending SIGIO, then actually
> FOLL_LONGTERM *does* require those callbacks. Because we've been, so
> far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.
> 
> What am I missing here?

We agreed back in June that the layout lease would have 2 "levels".  The
"normal" layout lease would cause SIGIO and could be broken and another
"exclusive" level which could _not_ be broken.

Because we _can't_ _trust_ user space to react to the SIGIO properly the
"exclusive" lease is required to take the longterm pins.  Also this is the
lease which causes the truncate to fail (return ETXTBSY) because the kernel
can't break the lease.

The vaddr_pin struct in the current RFC is there for a couple of reasons.

1) To ensure that we have a way to correlate the long term pin user with the
   file if the data file FD's are closed.  (ie the application has zombie'd the
   lease).

2) And more importantly as a token the vaddr_pin*() callers use to be able to
   properly ref count the file itself while in use.

Ira
Dave Chinner Sept. 2, 2019, 10:26 p.m. UTC | #42
On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > "Leases are associated with an open file description (see open(2)).  This means
> > > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > > refer to the same lease, and this lease may be modified or released using any
> > > of these descriptors.  Furthermore,  the lease is released by either an
> > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > > all such file descriptors have been closed."
> > 
> > Right, the lease is attached to the struct file, so it follows
> > where-ever the struct file goes. That doesn't mean it's actually
> > useful when the struct file is duplicated and/or passed to another
> > process. :/
> > 
> > AFAICT, the problem is that when we take another reference to the
> > struct file, or when the struct file is passed to a different
> > process, nothing updates the lease or lease state attached to that
> > struct file.
> 
> Ok, I probably should have made this more clear in the cover letter but _only_
> the process which took the lease can actually pin memory.

Sure, no question about that.

> That pinned memory _can_ be passed to another process but those sub-process' can
> _not_ use the original lease to pin _more_ of the file.  They would need to
> take their own lease to do that.

Yes, they would need a new lease to extend it. But that ignores the
fact they don't have a lease on the existing pins they are using and
have no control over the lease those pins originated under.  e.g.
the originating process dies (for whatever reason) and now we have
pins without a valid lease holder.

If something else now takes an exclusive lease on the file (because
the original exclusive lease no longer exists), it's not going to
work correctly because of the zombied page pins caused by closing
the exclusive lease they were gained under. IOWs, pages pinned under
an exclusive lease are no longer "exclusive" the moment the original
exclusive lease is dropped, and pins passed to another process are
no longer covered by the original lease they were created under.

> Sorry for not being clear on that.

I know exactly what you are saying. What I'm failing to get across
is that file layout leases don't actually allow the behaviour you
want to have.

> > As such, leases that require callbacks to userspace are currently
> > only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.

Regardless, we still require an active lease for long term pins so
that other lease holders fail operations appropriately. And that
exclusive lease must follow the process that pins the pages so that
the life cycle is the same...

> > Indeed, even closing the fd the lease was taken on without
> > F_UNLCKing it first doesn't mean the lease has been torn down if
> > there is some other reference to the struct file. That means the
> > original lease owner will still get SIGIO delivered to that fd on a
> > lease break regardless of whether it is open or not. ANd if we
> > implement "layout lease not released within SIGIO response timeout"
> > then that process will get killed, despite the fact it may not even
> > have a reference to that file anymore.
> 
> I'm not seeing that as a problem.  This is all a result of the application
> failing to do the right thing.

How is that not a problem?

Cheers,

Dave.
Ira Weiny Sept. 4, 2019, 4:54 p.m. UTC | #43
On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > > "Leases are associated with an open file description (see open(2)).  This means
> > > > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > > > refer to the same lease, and this lease may be modified or released using any
> > > > of these descriptors.  Furthermore,  the lease is released by either an
> > > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > > > all such file descriptors have been closed."
> > > 
> > > Right, the lease is attached to the struct file, so it follows
> > > where-ever the struct file goes. That doesn't mean it's actually
> > > useful when the struct file is duplicated and/or passed to another
> > > process. :/
> > > 
> > > AFAICT, the problem is that when we take another reference to the
> > > struct file, or when the struct file is passed to a different
> > > process, nothing updates the lease or lease state attached to that
> > > struct file.
> > 
> > Ok, I probably should have made this more clear in the cover letter but _only_
> > the process which took the lease can actually pin memory.
> 
> Sure, no question about that.
> 
> > That pinned memory _can_ be passed to another process but those sub-process' can
> > _not_ use the original lease to pin _more_ of the file.  They would need to
> > take their own lease to do that.
> 
> Yes, they would need a new lease to extend it. But that ignores the
> fact they don't have a lease on the existing pins they are using and
> have no control over the lease those pins originated under.  e.g.
> the originating process dies (for whatever reason) and now we have
> pins without a valid lease holder.

Define "valid lease holder"?

> 
> If something else now takes an exclusive lease on the file (because
> the original exclusive lease no longer exists), it's not going to
> work correctly because of the zombied page pins caused by closing
> the exclusive lease they were gained under. IOWs, pages pinned under
> an exclusive lease are no longer "exclusive" the moment the original
> exclusive lease is dropped, and pins passed to another process are
> no longer covered by the original lease they were created under.

The page pins are not zombied the lease is.  The lease still exists, it can't
be dropped while the pins are in place.  I need to double check the
implementation but that was the intent.

Yep just did a quick check, I have a test for that.  If the page pins exist
then the lease can _not_ be released.  Closing the FD will "zombie" the lease
but it and the struct file will still exist until the pins go away.

Furthermore, a "zombie" lease is _not_ sufficient to pin more pages.  (I have a
test for this too.)  I apologize that I don't have something to submit to
xfstests.  I'm new to that code base.

I'm happy to share the code I have which I've been using to test...  But it is
pretty rough as it has undergone a number of changes.  I think it would be
better to convert my test series to xfstests.

However, I don't know if it is ok to require RDMA within those tests.  Right
now that is the only sub-system I have allowed to create these page pins.  So
I'm not sure what to do at this time.  I'm open to suggestions.

> 
> > Sorry for not being clear on that.
> 
> I know exactly what you are saying. What I'm failing to get across
> is that file layout leases don't actually allow the behaviour you
> want to have.

Not currently, no.  But we are discussing the semantics to allow them _to_ have
the behavior needed.

> 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> 
> Regardless, we still require an active lease for long term pins so
> that other lease holders fail operations appropriately. And that
> exclusive lease must follow the process that pins the pages so that
> the life cycle is the same...

I disagree.  See below.

> 
> > > Indeed, even closing the fd the lease was taken on without
> > > F_UNLCKing it first doesn't mean the lease has been torn down if
> > > there is some other reference to the struct file. That means the
> > > original lease owner will still get SIGIO delivered to that fd on a
> > > lease break regardless of whether it is open or not. ANd if we
> > > implement "layout lease not released within SIGIO response timeout"
> > > then that process will get killed, despite the fact it may not even
> > > have a reference to that file anymore.
> > 
> > I'm not seeing that as a problem.  This is all a result of the application
> > failing to do the right thing.
> 
> How is that not a problem?

The application has taken an exclusive lease and they don't have to let it go.

IOW, there is little difference between the application closing the FD and
creating a zombie lease vs keeping the FD open with a real lease.  Because no
SIGIO is sent and there is no need to react to it anyway as the intention is to
keep the lease active and the layout pinned "indefinitely".

Furthermore, in both cases the admin must kill the application to change the
layout forcibly.  Basically applications don't _have_ to do the right thing but
the kernel and the filesystem is still protected while the admin has a way to
correct the situation given a bad application.

Therefore, from the POV of the kernel and file system I don't see a problem.

Ira