Message ID | 20190606014544.8339-1-ira.weiny@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | RDMA/FS DAX truncate proposal | expand |
On 6/5/19 6:45 PM, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > ... V1,000,000 ;-) > > Pre-requisites: > John Hubbard's put_user_pages() patch series.[1] > Jan Kara's ext4_break_layouts() fixes[2] > > Based on the feedback from LSFmm and the LWN article which resulted. I've > decided to take a slightly different tack on this problem. > > The real issue is that there is no use case for a user to have RDMA pinn'ed > memory which is then truncated. So really any solution we present which: > > A) Prevents file system corruption or data leaks > ...and... > B) Informs the user that they did something wrong > > Should be an acceptable solution. > > Because this is slightly new behavior. And because this is gonig to be > specific to DAX (because of the lack of a page cache) we have made the user > "opt in" to this behavior. > > The following patches implement the following solution. > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > (now made visible). > 2) GUP will fail (EPERM) if a layout lease is not taken > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > 4) The user has the option of holding the layout lease to receive a SIGIO for > notification to the original thread that another thread has tried to delete > their data. Furthermore this indicates that if the user needs to GUP the > file again they will need to retake the Layout lease before doing so. > > > NOTE: If the user releases the layout lease or if it has been broken by another > operation further GUP operations on the file will fail without re-taking the > lease. This means that if a user would like to register pieces of a file and > continue to register other pieces later they would be advised to keep the > layout lease, get a SIGIO notification, and retake the lease. > > NOTE2: Truncation of pages which are not actively pinned will succeed. Similar > to accessing an mmap to this area GUP pins of that memory may fail. > Hi Ira, Wow, great to see this. This looks like basically the right behavior, IMHO. 1. We'll need man page additions, to explain it. In fact, even after a quick first pass through, I'm vague on two points: a) I'm not sure how this actually provides "opt-in to new behavior", because I don't see any CONFIG_* or boot time choices, and it looks like the new behavior just is there. That is, if user space doesn't set F_LAYOUT on a range, GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?) b) Truncate and hole punch behavior, with and without user space having a SIGIO handler. (I'm sure this is obvious after another look through, but it might go nicely in a man page.) 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case, but for general RDMA on them? Or is there more that must be done? 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will conflict violently, as I'm sure you noticed. :) thanks,
On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > ... V1,000,000 ;-) > > Pre-requisites: > John Hubbard's put_user_pages() patch series.[1] > Jan Kara's ext4_break_layouts() fixes[2] > > Based on the feedback from LSFmm and the LWN article which resulted. I've > decided to take a slightly different tack on this problem. > > The real issue is that there is no use case for a user to have RDMA pinn'ed > memory which is then truncated. So really any solution we present which: > > A) Prevents file system corruption or data leaks > ...and... > B) Informs the user that they did something wrong > > Should be an acceptable solution. > > Because this is slightly new behavior. And because this is gonig to be > specific to DAX (because of the lack of a page cache) we have made the user > "opt in" to this behavior. > > The following patches implement the following solution. > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > (now made visible). > 2) GUP will fail (EPERM) if a layout lease is not taken > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > 4) The user has the option of holding the layout lease to receive a SIGIO for > notification to the original thread that another thread has tried to delete > their data. Furthermore this indicates that if the user needs to GUP the > file again they will need to retake the Layout lease before doing so. > > > NOTE: If the user releases the layout lease or if it has been broken by > another operation further GUP operations on the file will fail without > re-taking the lease. This means that if a user would like to register > pieces of a file and continue to register other pieces later they would > be advised to keep the layout lease, get a SIGIO notification, and retake > the lease. > > NOTE2: Truncation of pages which are not actively pinned will succeed. > Similar to accessing an mmap to this area GUP pins of that memory may > fail. So after some through I'm willing accept the fact that pinned DAX pages will just make truncate / hole punch fail and shove it into a same bucket of situations like "user can open a file and unlink won't delete it" or "ETXTBUSY when user is executing a file being truncated". The problem I have with this proposal is a lack of visibility from sysadmin POV. For ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the problematic process and kill it. There's nothing like that with your proposal since currently once you hold page reference, you can unmap the file, drop layout lease, close the file, and there's no trace that you're responsible for the pinned page anymore. So I'd like to actually mandate that you *must* hold the file lease until you unpin all pages in the given range (not just that you have an option to hold a lease). And I believe the kernel should actually enforce this. That way we maintain a sane state that if someone uses a physical location of logical file offset on disk, he has a layout lease. Also once this is done, sysadmin has a reasonably easy way to discover run-away RDMA application and kill it if he wishes so. The question is on how to exactly enforce that lease is taken until all pages are unpinned. I belive it could be done by tracking number of long-term pinned pages within a lease. Gup_longterm could easily increment the count when verifying the lease exists, gup_longterm users will somehow need to propagate corresponding 'filp' (struct file pointer) to put_user_pages_longterm() callsites so that they can look up appropriate lease to drop reference - probably I'd just transition all gup_longterm() users to a saner API similar to the one we have in mm/frame_vector.c where we don't hand out page pointers but an encapsulating structure that does all the necessary tracking. Removing a lease would need to block until all pins are released - this is probably the most hairy part since we need to handle a case if application just closes the file descriptor which would release the lease but OTOH we need to make sure task exit does not deadlock. Maybe we could block only on explicit lease unlock and just drop the layout lease on file close and if there are still pinned pages, send SIGKILL to an application as a reminder it did something stupid... What do people think about this? Honza > > > A general overview follows for background. > > It should be noted that one solution for this problem is to use RDMA's On > Demand Paging (ODP). There are 2 big reasons this may not work. > > 1) The hardware being used for RDMA may not support ODP > 2) ODP may be detrimental to the over all network (cluster or cloud) > performance > > Therefore, in order to support RDMA to File system pages without On Demand > Paging (ODP) a number of things need to be done. > > 1) GUP "longterm" users need to inform the other subsystems that they have > taken a pin on a page which may remain pinned for a very "long time".[3] > > 2) Any page which is "controlled" by a file system needs to have special > handling. The details of the handling depends on if the page is page cache > fronted or not. > > 2a) A page cache fronted page which has been pinned by GUP long term can use a > bounce buffer to allow the file system to write back snap shots of the page. > This is handled by the FS recognizing the GUP long term pin and making a copy > of the page to be written back. > NOTE: this patch set does not address this path. > > 2b) A FS "controlled" page which is not page cache fronted is either easier > to deal with or harder depending on the operation the filesystem is trying > to do. > > 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the > FS can no longer use the pages in question until the pin has been > removed. This patch set presents a solution to this by introducing > some reasonable restrictions on user space applications. > > 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch > then there is nothing which need be done. Data is Read or Written > directly to the page. This is an easy case which would currently work > if not for GUP long term pins being disabled. Therefore this patch set > need not change access to the file data but does allow for GUP pins > after 2ba above is dealt with. > > > This patch series and presents a solution for problem 2ba) > > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core > > [2] ext4/dev branch: > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev > > Specific patches: > > [2a] ext4: wait for outstanding dio during truncate in nojournal mode > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8 > > [2b] ext4: do not delete unlinked inode from orphan list on failed truncate > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14 > > [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c > > [3] The definition of long time is debatable but it has been established > that RDMAs use of pages, minutes or hours after the pin is the extreme case > which makes this problem most severe. > > > Ira Weiny (10): > fs/locks: Add trace_leases_conflict > fs/locks: Export F_LAYOUT lease to user space > mm/gup: Pass flags down to __gup_device_huge* calls > mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages > fs/ext4: Teach ext4 to break layout leases > fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range > fs/ext4: Fail truncate if pages are GUP pinned > fs/xfs: Teach xfs to use new dax_layout_busy_page() > fs/xfs: Fail truncate if pages are GUP pinned > mm/gup: Remove FOLL_LONGTERM DAX exclusion > > fs/Kconfig | 1 + > fs/dax.c | 38 ++++++--- > fs/ext4/ext4.h | 2 +- > fs/ext4/extents.c | 6 +- > fs/ext4/inode.c | 26 +++++-- > fs/locks.c | 97 ++++++++++++++++++++--- > fs/xfs/xfs_file.c | 24 ++++-- > fs/xfs/xfs_inode.h | 5 +- > fs/xfs/xfs_ioctl.c | 15 +++- > fs/xfs/xfs_iops.c | 14 +++- > fs/xfs/xfs_pnfs.c | 14 ++-- > include/linux/dax.h | 9 ++- > include/linux/fs.h | 2 +- > include/linux/mm.h | 2 + > include/trace/events/filelock.h | 35 +++++++++ > include/uapi/asm-generic/fcntl.h | 3 + > mm/gup.c | 129 ++++++++++++------------------- > mm/huge_memory.c | 12 +++ > 18 files changed, 299 insertions(+), 135 deletions(-) > > -- > 2.20.1 >
On Thu, Jun 6, 2019 at 3:42 AM Jan Kara <jack@suse.cz> wrote: > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > ... V1,000,000 ;-) > > > > Pre-requisites: > > John Hubbard's put_user_pages() patch series.[1] > > Jan Kara's ext4_break_layouts() fixes[2] > > > > Based on the feedback from LSFmm and the LWN article which resulted. I've > > decided to take a slightly different tack on this problem. > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > memory which is then truncated. So really any solution we present which: > > > > A) Prevents file system corruption or data leaks > > ...and... > > B) Informs the user that they did something wrong > > > > Should be an acceptable solution. > > > > Because this is slightly new behavior. And because this is gonig to be > > specific to DAX (because of the lack of a page cache) we have made the user > > "opt in" to this behavior. > > > > The following patches implement the following solution. > > > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > > (now made visible). > > 2) GUP will fail (EPERM) if a layout lease is not taken > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > > 4) The user has the option of holding the layout lease to receive a SIGIO for > > notification to the original thread that another thread has tried to delete > > their data. Furthermore this indicates that if the user needs to GUP the > > file again they will need to retake the Layout lease before doing so. > > > > > > NOTE: If the user releases the layout lease or if it has been broken by > > another operation further GUP operations on the file will fail without > > re-taking the lease. This means that if a user would like to register > > pieces of a file and continue to register other pieces later they would > > be advised to keep the layout lease, get a SIGIO notification, and retake > > the lease. > > > > NOTE2: Truncation of pages which are not actively pinned will succeed. > > Similar to accessing an mmap to this area GUP pins of that memory may > > fail. > > So after some through I'm willing accept the fact that pinned DAX pages > will just make truncate / hole punch fail and shove it into a same bucket > of situations like "user can open a file and unlink won't delete it" or > "ETXTBUSY when user is executing a file being truncated". The problem I > have with this proposal is a lack of visibility from sysadmin POV. For > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the > problematic process and kill it. There's nothing like that with your > proposal since currently once you hold page reference, you can unmap the > file, drop layout lease, close the file, and there's no trace that you're > responsible for the pinned page anymore. > > So I'd like to actually mandate that you *must* hold the file lease until > you unpin all pages in the given range (not just that you have an option to > hold a lease). And I believe the kernel should actually enforce this. That > way we maintain a sane state that if someone uses a physical location of > logical file offset on disk, he has a layout lease. Also once this is done, > sysadmin has a reasonably easy way to discover run-away RDMA application > and kill it if he wishes so. Yes, this satisfies the primary concern that made me oppose failing truncate. If the administrator determines that reclaiming capacity is more important than maintaining active RDMA mappings "lsof + kill" is a reasonable way to recover. I'd go so far as to say that anything less is an abdication of the kernel's responsibility as an arbiter of platform resources. > The question is on how to exactly enforce that lease is taken until all > pages are unpinned. I belive it could be done by tracking number of > long-term pinned pages within a lease. Gup_longterm could easily increment > the count when verifying the lease exists, gup_longterm users will somehow > need to propagate corresponding 'filp' (struct file pointer) to > put_user_pages_longterm() callsites so that they can look up appropriate > lease to drop reference - probably I'd just transition all gup_longterm() > users to a saner API similar to the one we have in mm/frame_vector.c where > we don't hand out page pointers but an encapsulating structure that does > all the necessary tracking. Removing a lease would need to block until all > pins are released - this is probably the most hairy part since we need to > handle a case if application just closes the file descriptor which would > release the lease but OTOH we need to make sure task exit does not deadlock. > Maybe we could block only on explicit lease unlock and just drop the layout > lease on file close and if there are still pinned pages, send SIGKILL to an > application as a reminder it did something stupid... > > What do people think about this? SIGKILL on close() without explicit unlock and wait-on-last-pin with explicit unlock sounds reasonable to me.
On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote: > On 6/5/19 6:45 PM, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > ... V1,000,000 ;-) > > > > Pre-requisites: > > John Hubbard's put_user_pages() patch series.[1] > > Jan Kara's ext4_break_layouts() fixes[2] > > > > Based on the feedback from LSFmm and the LWN article which resulted. I've > > decided to take a slightly different tack on this problem. > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > memory which is then truncated. So really any solution we present which: > > > > A) Prevents file system corruption or data leaks > > ...and... > > B) Informs the user that they did something wrong > > > > Should be an acceptable solution. > > > > Because this is slightly new behavior. And because this is gonig to be > > specific to DAX (because of the lack of a page cache) we have made the user > > "opt in" to this behavior. > > > > The following patches implement the following solution. > > > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > > (now made visible). > > 2) GUP will fail (EPERM) if a layout lease is not taken > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > > 4) The user has the option of holding the layout lease to receive a SIGIO for > > notification to the original thread that another thread has tried to delete > > their data. Furthermore this indicates that if the user needs to GUP the > > file again they will need to retake the Layout lease before doing so. > > > > > > NOTE: If the user releases the layout lease or if it has been broken by another > > operation further GUP operations on the file will fail without re-taking the > > lease. This means that if a user would like to register pieces of a file and > > continue to register other pieces later they would be advised to keep the > > layout lease, get a SIGIO notification, and retake the lease. > > > > NOTE2: Truncation of pages which are not actively pinned will succeed. Similar > > to accessing an mmap to this area GUP pins of that memory may fail. > > > > Hi Ira, > > Wow, great to see this. This looks like basically the right behavior, IMHO. > > 1. We'll need man page additions, to explain it. In fact, even after a quick first > pass through, I'm vague on two points: Of course. But I was not going to go through and attempt to write man pages and other docs without some agreement on the final mechanisms. This works which was the basic requirement I had to send an RFC. :-D But yes man pages and updates to headers etc all have to be done. > > a) I'm not sure how this actually provides "opt-in to new behavior", because I > don't see any CONFIG_* or boot time choices, and it looks like the new behavior > just is there. That is, if user space doesn't set F_LAYOUT on a range, > GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?) The opt in is at run time. Currently GUP FOLL_LONGTERM is _not_ _allowed_ on the FS DAX pages at all. So the default behavior is the same, GUP fails. (Or specifically ibv_reg_mr() fails. This fails as before, not change there. The Opt in is that if a user knows what is involved they can take the lease and the GUP will not fail. This comes with the price of knowing that other processes can't truncate those pages in use. > > b) Truncate and hole punch behavior, with and without user space having a SIGIO > handler. (I'm sure this is obvious after another look through, but it might go > nicely in a man page.) Sorry this was not clear. There are 2 points for this patch set which requires the use of catching SIGIO. 1) If an application _actually_ does (somehow, somewhere, in some unforseen use case) want to allow a truncate to happen. They can catch the SIGIO, finish their use of the pages, and release them. As long as they can do this within the <sysfs>/lease-time-break time they are ok and the truncate can proceed. 2) This is a bit more subtle and something I almost delayed sending these out for. Currently the implementation of a lease break actually removes the lease from the file. I did not want this to happen and I was thinking of delaying this patch set to implement something which keeps the lease around but I figured I should get something out for comments. Jan has proposed something along these lines and I agree with him so I'm going to ask you to read my response to him about the details. Anyway so the key here is that currently an app needs the SIGIO to retake the lease if they want to map the file again or in parts based on usage. For example, they may only want to map some of the file for when they are using it and then map another part later. Without the SIGIO they would lose their lease or would have to just take the lease for each GUP pin (which adds overhead). Like I said I did not like this but I left it to get something which works out. > > 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case, > but for general RDMA on them? Or is there more that must be done? This is limited to DAX. All the functionality is limited to *_devmap or "is DAX" cases. I'm still thinking that page cache backed files can have a better solution for the user. > > 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will > conflict violently, as I'm sure you noticed. :) Yep... But I needed to get the conversation started on this idea. Thanks for the feedback! Ira > > > thanks, > -- > John Hubbard > NVIDIA >
On Thu, Jun 06, 2019 at 10:11:58AM -0700, Ira Weiny wrote: > 2) This is a bit more subtle and something I almost delayed sending these out > for. Currently the implementation of a lease break actually removes the > lease from the file. I did not want this to happen and I was thinking of > delaying this patch set to implement something which keeps the lease around > but I figured I should get something out for comments. Jan has proposed > something along these lines and I agree with him so I'm going to ask you to > read my response to him about the details. > > > Anyway so the key here is that currently an app needs the SIGIO to retake > the lease if they want to map the file again or in parts based on usage. > For example, they may only want to map some of the file for when they are > using it and then map another part later. Without the SIGIO they would lose > their lease or would have to just take the lease for each GUP pin (which > adds overhead). Like I said I did not like this but I left it to get > something which works out. So to be clear.. Even though the lease is broken the GUP remains, the pages remain pined, and truncate/etc continues to fail? I like Jan's take on this actually.. see other email. Jason
On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > So I'd like to actually mandate that you *must* hold the file lease until > you unpin all pages in the given range (not just that you have an option to > hold a lease). And I believe the kernel should actually enforce this. That > way we maintain a sane state that if someone uses a physical location of > logical file offset on disk, he has a layout lease. Also once this is done, > sysadmin has a reasonably easy way to discover run-away RDMA application > and kill it if he wishes so. > > The question is on how to exactly enforce that lease is taken until all > pages are unpinned. I belive it could be done by tracking number of > long-term pinned pages within a lease. Gup_longterm could easily increment > the count when verifying the lease exists, gup_longterm users will somehow > need to propagate corresponding 'filp' (struct file pointer) to > put_user_pages_longterm() callsites so that they can look up appropriate > lease to drop reference - probably I'd just transition all gup_longterm() > users to a saner API similar to the one we have in mm/frame_vector.c where > we don't hand out page pointers but an encapsulating structure that does > all the necessary tracking. Removing a lease would need to block until all > pins are released - this is probably the most hairy part since we need to > handle a case if application just closes the file descriptor which > would I think if you are going to do this then the 'struct filp' that represents the lease should be held in the kernel (ie inside the RDMA umem) until the kernel is done with it. Actually does someone have a pointer to this userspace lease API, I'm not at all familiar with it, thanks And yes, a better output format from GUP would be great.. > Maybe we could block only on explicit lease unlock and just drop the layout > lease on file close and if there are still pinned pages, send SIGKILL to an > application as a reminder it did something stupid... Which process would you SIGKILL? At least for the rdma case a FD is holding the GUP, so to do the put_user_pages() the kernel needs to close the FD. I guess it would have to kill every process that has the FD open? Seems complicated... Regards, Jason
On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > ... V1,000,000 ;-) > > > > Pre-requisites: > > John Hubbard's put_user_pages() patch series.[1] > > Jan Kara's ext4_break_layouts() fixes[2] > > > > Based on the feedback from LSFmm and the LWN article which resulted. I've > > decided to take a slightly different tack on this problem. > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > memory which is then truncated. So really any solution we present which: > > > > A) Prevents file system corruption or data leaks > > ...and... > > B) Informs the user that they did something wrong > > > > Should be an acceptable solution. > > > > Because this is slightly new behavior. And because this is gonig to be > > specific to DAX (because of the lack of a page cache) we have made the user > > "opt in" to this behavior. > > > > The following patches implement the following solution. > > > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > > (now made visible). > > 2) GUP will fail (EPERM) if a layout lease is not taken > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > > 4) The user has the option of holding the layout lease to receive a SIGIO for > > notification to the original thread that another thread has tried to delete > > their data. Furthermore this indicates that if the user needs to GUP the > > file again they will need to retake the Layout lease before doing so. > > > > > > NOTE: If the user releases the layout lease or if it has been broken by > > another operation further GUP operations on the file will fail without > > re-taking the lease. This means that if a user would like to register > > pieces of a file and continue to register other pieces later they would > > be advised to keep the layout lease, get a SIGIO notification, and retake > > the lease. > > > > NOTE2: Truncation of pages which are not actively pinned will succeed. > > Similar to accessing an mmap to this area GUP pins of that memory may > > fail. > > So after some through I'm willing accept the fact that pinned DAX pages > will just make truncate / hole punch fail and shove it into a same bucket > of situations like "user can open a file and unlink won't delete it" or > "ETXTBUSY when user is executing a file being truncated". The problem I > have with this proposal is a lack of visibility from sysadmin POV. For > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the > problematic process and kill it. There's nothing like that with your > proposal since currently once you hold page reference, you can unmap the > file, drop layout lease, close the file, and there's no trace that you're > responsible for the pinned page anymore. Agreed. For some "GUP interfaces" one may be able to figure this out but I'm not familiar with any. For RDMA there has been some additions for tracking resources but I don't think any of that is useful here. Regardless from a FS POV this is awkward to have to understand all the independent interfaces, so I agree. > > So I'd like to actually mandate that you *must* hold the file lease until > you unpin all pages in the given range (not just that you have an option to > hold a lease). And I believe the kernel should actually enforce this. That > way we maintain a sane state that if someone uses a physical location of > logical file offset on disk, he has a layout lease. Also once this is done, > sysadmin has a reasonably easy way to discover run-away RDMA application > and kill it if he wishes so. Fair enough. I was kind of heading that direction but had not thought this far forward. I was exploring how to have a lease remain on the file even after a "lease break". But that is incompatible with the current semantics of a "layout" lease (as currently defined in the kernel). [In the end I wanted to get an RFC out to see what people think of this idea so I did not look at keeping the lease.] Also hitch is that currently a lease is forcefully broken after <sysfs>/lease-break-time. To do what you suggest I think we would need a new lease type with the semantics you describe. Previously I had thought this would be a good idea (for other reasons). But what does everyone think about using a "longterm lease" similar to [1] which has the semantics you proppose? In [1] I was not sure "longterm" was a good name but with your proposal I think it makes more sense. > > The question is on how to exactly enforce that lease is taken until all > pages are unpinned. I belive it could be done by tracking number of > long-term pinned pages within a lease. Gup_longterm could easily increment > the count when verifying the lease exists, gup_longterm users will somehow > need to propagate corresponding 'filp' (struct file pointer) to > put_user_pages_longterm() callsites so that they can look up appropriate > lease to drop reference I actually think that might be pretty easy. I actually added a ref count to the longterm lease before.[2] This was done to be able to take the lease within the GUP code. We don't need that functionality exactly but that patch implements some of what you propose. With a ref count on the lease we can refuse to release it until all GUP users have released it. > > - probably I'd just transition all gup_longterm() > users to a saner API similar to the one we have in mm/frame_vector.c where > we don't hand out page pointers but an encapsulating structure that does > all the necessary tracking. I'll take a look at that code. But that seems like a pretty big change. > > Removing a lease would need to block until all > pins are released - this is probably the most hairy part since we need to > handle a case if application just closes the file descriptor which would > release the lease but OTOH we need to make sure task exit does not deadlock. > Maybe we could block only on explicit lease unlock and just drop the layout > lease on file close and if there are still pinned pages, send SIGKILL to an > application as a reminder it did something stupid... As presented at LSFmm I'm not opposed to killing a process which does not "follow the rules". But I'm concerned about how to handle this across a fork. Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins to a child through the RDMA context. This was the major issue Jason had with the SIGBUS proposal. Always sending a SIGKILL would prevent an RDMA process from doing something like system("ls") (would kill the child unnecessarily). Are we ok with that? > > What do people think about this? But generally I like the idea of the leases being sticky. Not sure about the SIGKILL. Thanks for the review, Ira [1] https://patchwork.kernel.org/patch/10921171/ [2] https://patchwork.kernel.org/patch/10921177/ > > Honza > > > > > > A general overview follows for background. > > > > It should be noted that one solution for this problem is to use RDMA's On > > Demand Paging (ODP). There are 2 big reasons this may not work. > > > > 1) The hardware being used for RDMA may not support ODP > > 2) ODP may be detrimental to the over all network (cluster or cloud) > > performance > > > > Therefore, in order to support RDMA to File system pages without On Demand > > Paging (ODP) a number of things need to be done. > > > > 1) GUP "longterm" users need to inform the other subsystems that they have > > taken a pin on a page which may remain pinned for a very "long time".[3] > > > > 2) Any page which is "controlled" by a file system needs to have special > > handling. The details of the handling depends on if the page is page cache > > fronted or not. > > > > 2a) A page cache fronted page which has been pinned by GUP long term can use a > > bounce buffer to allow the file system to write back snap shots of the page. > > This is handled by the FS recognizing the GUP long term pin and making a copy > > of the page to be written back. > > NOTE: this patch set does not address this path. > > > > 2b) A FS "controlled" page which is not page cache fronted is either easier > > to deal with or harder depending on the operation the filesystem is trying > > to do. > > > > 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the > > FS can no longer use the pages in question until the pin has been > > removed. This patch set presents a solution to this by introducing > > some reasonable restrictions on user space applications. > > > > 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch > > then there is nothing which need be done. Data is Read or Written > > directly to the page. This is an easy case which would currently work > > if not for GUP long term pins being disabled. Therefore this patch set > > need not change access to the file data but does allow for GUP pins > > after 2ba above is dealt with. > > > > > > This patch series and presents a solution for problem 2ba) > > > > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core > > > > [2] ext4/dev branch: > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev > > > > Specific patches: > > > > [2a] ext4: wait for outstanding dio during truncate in nojournal mode > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8 > > > > [2b] ext4: do not delete unlinked inode from orphan list on failed truncate > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14 > > > > [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c > > > > [3] The definition of long time is debatable but it has been established > > that RDMAs use of pages, minutes or hours after the pin is the extreme case > > which makes this problem most severe. > > > > > > Ira Weiny (10): > > fs/locks: Add trace_leases_conflict > > fs/locks: Export F_LAYOUT lease to user space > > mm/gup: Pass flags down to __gup_device_huge* calls > > mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages > > fs/ext4: Teach ext4 to break layout leases > > fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range > > fs/ext4: Fail truncate if pages are GUP pinned > > fs/xfs: Teach xfs to use new dax_layout_busy_page() > > fs/xfs: Fail truncate if pages are GUP pinned > > mm/gup: Remove FOLL_LONGTERM DAX exclusion > > > > fs/Kconfig | 1 + > > fs/dax.c | 38 ++++++--- > > fs/ext4/ext4.h | 2 +- > > fs/ext4/extents.c | 6 +- > > fs/ext4/inode.c | 26 +++++-- > > fs/locks.c | 97 ++++++++++++++++++++--- > > fs/xfs/xfs_file.c | 24 ++++-- > > fs/xfs/xfs_inode.h | 5 +- > > fs/xfs/xfs_ioctl.c | 15 +++- > > fs/xfs/xfs_iops.c | 14 +++- > > fs/xfs/xfs_pnfs.c | 14 ++-- > > include/linux/dax.h | 9 ++- > > include/linux/fs.h | 2 +- > > include/linux/mm.h | 2 + > > include/trace/events/filelock.h | 35 +++++++++ > > include/uapi/asm-generic/fcntl.h | 3 + > > mm/gup.c | 129 ++++++++++++------------------- > > mm/huge_memory.c | 12 +++ > > 18 files changed, 299 insertions(+), 135 deletions(-) > > > > -- > > 2.20.1 > > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote: > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > > So I'd like to actually mandate that you *must* hold the file lease until > > you unpin all pages in the given range (not just that you have an option to > > hold a lease). And I believe the kernel should actually enforce this. That > > way we maintain a sane state that if someone uses a physical location of > > logical file offset on disk, he has a layout lease. Also once this is done, > > sysadmin has a reasonably easy way to discover run-away RDMA application > > and kill it if he wishes so. > > > > The question is on how to exactly enforce that lease is taken until all > > pages are unpinned. I belive it could be done by tracking number of > > long-term pinned pages within a lease. Gup_longterm could easily increment > > the count when verifying the lease exists, gup_longterm users will somehow > > need to propagate corresponding 'filp' (struct file pointer) to > > put_user_pages_longterm() callsites so that they can look up appropriate > > lease to drop reference - probably I'd just transition all gup_longterm() > > users to a saner API similar to the one we have in mm/frame_vector.c where > > we don't hand out page pointers but an encapsulating structure that does > > all the necessary tracking. Removing a lease would need to block until all > > pins are released - this is probably the most hairy part since we need to > > handle a case if application just closes the file descriptor which > > would > > I think if you are going to do this then the 'struct filp' that > represents the lease should be held in the kernel (ie inside the RDMA > umem) until the kernel is done with it. Yea there seems merit to this. I'm still not resolving how this helps track who has the pin across a fork. > > Actually does someone have a pointer to this userspace lease API, I'm > not at all familiar with it, thanks man fcntl search for SETLEASE But I had to add the F_LAYOUT lease type. (Personally I'm for calling it F_LONGTERM at this point. I don't think LAYOUT is compatible with what we are proposing here.) Anyway, yea would be a libc change at lease for man page etc... But again I want to get some buy in before going through all that. > > And yes, a better output format from GUP would be great.. > > > Maybe we could block only on explicit lease unlock and just drop the layout > > lease on file close and if there are still pinned pages, send SIGKILL to an > > application as a reminder it did something stupid... > > Which process would you SIGKILL? At least for the rdma case a FD is > holding the GUP, so to do the put_user_pages() the kernel needs to > close the FD. I guess it would have to kill every process that has the > FD open? Seems complicated... Tending to agree... But I'm still not opposed to killing bad actors... ;-) NOTE: Jason I think you need to be more clear about the FD you are speaking of. I believe you mean the FD which refers to the RMDA context. That is what I called it in my other email. Ira > > Regards, > Jason
On Thu, Jun 06, 2019 at 03:03:30PM -0700, 'Ira Weiny' wrote: > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > ... V1,000,000 ;-) > > > > > > Pre-requisites: > > > John Hubbard's put_user_pages() patch series.[1] > > > Jan Kara's ext4_break_layouts() fixes[2] > > > > > > Based on the feedback from LSFmm and the LWN article which resulted. I've > > > decided to take a slightly different tack on this problem. > > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > > memory which is then truncated. So really any solution we present which: > > > > > > A) Prevents file system corruption or data leaks > > > ...and... > > > B) Informs the user that they did something wrong > > > > > > Should be an acceptable solution. > > > > > > Because this is slightly new behavior. And because this is gonig to be > > > specific to DAX (because of the lack of a page cache) we have made the user > > > "opt in" to this behavior. > > > > > > The following patches implement the following solution. > > > > > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > > > (now made visible). > > > 2) GUP will fail (EPERM) if a layout lease is not taken > > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > > > 4) The user has the option of holding the layout lease to receive a SIGIO for > > > notification to the original thread that another thread has tried to delete > > > their data. Furthermore this indicates that if the user needs to GUP the > > > file again they will need to retake the Layout lease before doing so. > > > > > > > > > NOTE: If the user releases the layout lease or if it has been broken by > > > another operation further GUP operations on the file will fail without > > > re-taking the lease. This means that if a user would like to register > > > pieces of a file and continue to register other pieces later they would > > > be advised to keep the layout lease, get a SIGIO notification, and retake > > > the lease. > > > > > > NOTE2: Truncation of pages which are not actively pinned will succeed. > > > Similar to accessing an mmap to this area GUP pins of that memory may > > > fail. > > > > So after some through I'm willing accept the fact that pinned DAX pages > > will just make truncate / hole punch fail and shove it into a same bucket > > of situations like "user can open a file and unlink won't delete it" or > > "ETXTBUSY when user is executing a file being truncated". The problem I > > have with this proposal is a lack of visibility from sysadmin POV. For > > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the > > problematic process and kill it. There's nothing like that with your > > proposal since currently once you hold page reference, you can unmap the > > file, drop layout lease, close the file, and there's no trace that you're > > responsible for the pinned page anymore. > > Agreed. For some "GUP interfaces" one may be able to figure this out but I'm > not familiar with any. For RDMA there has been some additions for tracking > resources but I don't think any of that is useful here. Regardless from a FS > POV this is awkward to have to understand all the independent interfaces, so I > agree. > > > > > So I'd like to actually mandate that you *must* hold the file lease until > > you unpin all pages in the given range (not just that you have an option to > > hold a lease). And I believe the kernel should actually enforce this. That > > way we maintain a sane state that if someone uses a physical location of > > logical file offset on disk, he has a layout lease. Also once this is done, > > sysadmin has a reasonably easy way to discover run-away RDMA application > > and kill it if he wishes so. > > Fair enough. > > I was kind of heading that direction but had not thought this far forward. I > was exploring how to have a lease remain on the file even after a "lease > break". But that is incompatible with the current semantics of a "layout" > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > out to see what people think of this idea so I did not look at keeping the > lease.] > > Also hitch is that currently a lease is forcefully broken after > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > lease type with the semantics you describe. > > Previously I had thought this would be a good idea (for other reasons). But > what does everyone think about using a "longterm lease" similar to [1] which > has the semantics you proppose? In [1] I was not sure "longterm" was a good > name but with your proposal I think it makes more sense. > > > > > The question is on how to exactly enforce that lease is taken until all > > pages are unpinned. I belive it could be done by tracking number of > > long-term pinned pages within a lease. Gup_longterm could easily increment > > the count when verifying the lease exists, gup_longterm users will somehow > > need to propagate corresponding 'filp' (struct file pointer) to > > put_user_pages_longterm() callsites so that they can look up appropriate > > lease to drop reference > > I actually think that might be pretty easy. I actually added a ref count to > the longterm lease before.[2] This was done to be able to take the lease > within the GUP code. We don't need that functionality exactly but that patch > implements some of what you propose. With a ref count on the lease we can > refuse to release it until all GUP users have released it. > > > > > - probably I'd just transition all gup_longterm() > > users to a saner API similar to the one we have in mm/frame_vector.c where > > we don't hand out page pointers but an encapsulating structure that does > > all the necessary tracking. > > I'll take a look at that code. But that seems like a pretty big change. > > > > > Removing a lease would need to block until all > > pins are released - this is probably the most hairy part since we need to > > handle a case if application just closes the file descriptor which would > > release the lease but OTOH we need to make sure task exit does not deadlock. > > Maybe we could block only on explicit lease unlock and just drop the layout > > lease on file close and if there are still pinned pages, send SIGKILL to an > > application as a reminder it did something stupid... > > As presented at LSFmm I'm not opposed to killing a process which does not > "follow the rules". But I'm concerned about how to handle this across a fork. > > Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins > to a child through the RDMA context. This was the major issue Jason had with > the SIGBUS proposal. > > Always sending a SIGKILL would prevent an RDMA process from doing something > like system("ls") (would kill the child unnecessarily). Are we ok with that? I might be wrong here. My memory said it closed all fd's but I'm not finding any documentation of that. Perhaps we could say that the child would be required to keep the fd open as well? Ira > > > > > What do people think about this? > > But generally I like the idea of the leases being sticky. Not sure about the > SIGKILL. > > Thanks for the review, > Ira > > [1] https://patchwork.kernel.org/patch/10921171/ > [2] https://patchwork.kernel.org/patch/10921177/ > > > > > Honza > > > > > > > > > A general overview follows for background. > > > > > > It should be noted that one solution for this problem is to use RDMA's On > > > Demand Paging (ODP). There are 2 big reasons this may not work. > > > > > > 1) The hardware being used for RDMA may not support ODP > > > 2) ODP may be detrimental to the over all network (cluster or cloud) > > > performance > > > > > > Therefore, in order to support RDMA to File system pages without On Demand > > > Paging (ODP) a number of things need to be done. > > > > > > 1) GUP "longterm" users need to inform the other subsystems that they have > > > taken a pin on a page which may remain pinned for a very "long time".[3] > > > > > > 2) Any page which is "controlled" by a file system needs to have special > > > handling. The details of the handling depends on if the page is page cache > > > fronted or not. > > > > > > 2a) A page cache fronted page which has been pinned by GUP long term can use a > > > bounce buffer to allow the file system to write back snap shots of the page. > > > This is handled by the FS recognizing the GUP long term pin and making a copy > > > of the page to be written back. > > > NOTE: this patch set does not address this path. > > > > > > 2b) A FS "controlled" page which is not page cache fronted is either easier > > > to deal with or harder depending on the operation the filesystem is trying > > > to do. > > > > > > 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the > > > FS can no longer use the pages in question until the pin has been > > > removed. This patch set presents a solution to this by introducing > > > some reasonable restrictions on user space applications. > > > > > > 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch > > > then there is nothing which need be done. Data is Read or Written > > > directly to the page. This is an easy case which would currently work > > > if not for GUP long term pins being disabled. Therefore this patch set > > > need not change access to the file data but does allow for GUP pins > > > after 2ba above is dealt with. > > > > > > > > > This patch series and presents a solution for problem 2ba) > > > > > > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core > > > > > > [2] ext4/dev branch: > > > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev > > > > > > Specific patches: > > > > > > [2a] ext4: wait for outstanding dio during truncate in nojournal mode > > > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8 > > > > > > [2b] ext4: do not delete unlinked inode from orphan list on failed truncate > > > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14 > > > > > > [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate > > > > > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c > > > > > > [3] The definition of long time is debatable but it has been established > > > that RDMAs use of pages, minutes or hours after the pin is the extreme case > > > which makes this problem most severe. > > > > > > > > > Ira Weiny (10): > > > fs/locks: Add trace_leases_conflict > > > fs/locks: Export F_LAYOUT lease to user space > > > mm/gup: Pass flags down to __gup_device_huge* calls > > > mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages > > > fs/ext4: Teach ext4 to break layout leases > > > fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range > > > fs/ext4: Fail truncate if pages are GUP pinned > > > fs/xfs: Teach xfs to use new dax_layout_busy_page() > > > fs/xfs: Fail truncate if pages are GUP pinned > > > mm/gup: Remove FOLL_LONGTERM DAX exclusion > > > > > > fs/Kconfig | 1 + > > > fs/dax.c | 38 ++++++--- > > > fs/ext4/ext4.h | 2 +- > > > fs/ext4/extents.c | 6 +- > > > fs/ext4/inode.c | 26 +++++-- > > > fs/locks.c | 97 ++++++++++++++++++++--- > > > fs/xfs/xfs_file.c | 24 ++++-- > > > fs/xfs/xfs_inode.h | 5 +- > > > fs/xfs/xfs_ioctl.c | 15 +++- > > > fs/xfs/xfs_iops.c | 14 +++- > > > fs/xfs/xfs_pnfs.c | 14 ++-- > > > include/linux/dax.h | 9 ++- > > > include/linux/fs.h | 2 +- > > > include/linux/mm.h | 2 + > > > include/trace/events/filelock.h | 35 +++++++++ > > > include/uapi/asm-generic/fcntl.h | 3 + > > > mm/gup.c | 129 ++++++++++++------------------- > > > mm/huge_memory.c | 12 +++ > > > 18 files changed, 299 insertions(+), 135 deletions(-) > > > > > > -- > > > 2.20.1 > > > > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR >
On Thu, Jun 06, 2019 at 03:03:30PM -0700, Ira Weiny wrote: > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > So I'd like to actually mandate that you *must* hold the file lease until > > you unpin all pages in the given range (not just that you have an option to > > hold a lease). And I believe the kernel should actually enforce this. That > > way we maintain a sane state that if someone uses a physical location of > > logical file offset on disk, he has a layout lease. Also once this is done, > > sysadmin has a reasonably easy way to discover run-away RDMA application > > and kill it if he wishes so. > > Fair enough. > > I was kind of heading that direction but had not thought this far forward. I > was exploring how to have a lease remain on the file even after a "lease > break". But that is incompatible with the current semantics of a "layout" > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > out to see what people think of this idea so I did not look at keeping the > lease.] > > Also hitch is that currently a lease is forcefully broken after > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > lease type with the semantics you describe. That just requires a flag when gaining the layout lease to say it is an "unbreakable layout lease". That gives the kernel the information needed to determine whether it should attempt to break the lease on truncate or just return ETXTBSY.... i.e. it allows gup-pinning applications that want to behave nicely with other users to drop their gup pins and release the lease when something else wants to truncate/hole punch the file rather than have truncate return an error. e.g. to allow apps to cleanly interop with other breakable layout leases (e.g. pNFS) on the same filesystem. FWIW, I'd also like to see the "truncate fails when unbreakable layout lease is held" behaviour to be common across all filesystem/storage types, not be confined to DAX only. i.e. truncate should return ETXTBSY when an unbreakable layout lease is held by an application, not just when "DAX+gup-pinned" is triggered.... Whatever we decide, the behaviour of truncate et al needs to be predictable, consistent and easily discoverable... Cheers, Dave.
On Thu 06-06-19 15:22:28, Ira Weiny wrote: > On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote: > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > > > > So I'd like to actually mandate that you *must* hold the file lease until > > > you unpin all pages in the given range (not just that you have an option to > > > hold a lease). And I believe the kernel should actually enforce this. That > > > way we maintain a sane state that if someone uses a physical location of > > > logical file offset on disk, he has a layout lease. Also once this is done, > > > sysadmin has a reasonably easy way to discover run-away RDMA application > > > and kill it if he wishes so. > > > > > > The question is on how to exactly enforce that lease is taken until all > > > pages are unpinned. I belive it could be done by tracking number of > > > long-term pinned pages within a lease. Gup_longterm could easily increment > > > the count when verifying the lease exists, gup_longterm users will somehow > > > need to propagate corresponding 'filp' (struct file pointer) to > > > put_user_pages_longterm() callsites so that they can look up appropriate > > > lease to drop reference - probably I'd just transition all gup_longterm() > > > users to a saner API similar to the one we have in mm/frame_vector.c where > > > we don't hand out page pointers but an encapsulating structure that does > > > all the necessary tracking. Removing a lease would need to block until all > > > pins are released - this is probably the most hairy part since we need to > > > handle a case if application just closes the file descriptor which > > > would > > > > I think if you are going to do this then the 'struct filp' that > > represents the lease should be held in the kernel (ie inside the RDMA > > umem) until the kernel is done with it. > > Yea there seems merit to this. I'm still not resolving how this helps track > who has the pin across a fork. Yes, my thought was that gup_longterm() would return a structure that would be tracking filp (or whatever is needed) and that would be embedded inside RDMA umem. > > Actually does someone have a pointer to this userspace lease API, I'm > > not at all familiar with it, thanks > > man fcntl > search for SETLEASE > > But I had to add the F_LAYOUT lease type. (Personally I'm for calling it > F_LONGTERM at this point. I don't think LAYOUT is compatible with what we are > proposing here.) I think F_LAYOUT still expresses it pretty well. The lease is pinning logical->physical file offset mapping, i.e. the file layout. > > > > And yes, a better output format from GUP would be great.. > > > > > Maybe we could block only on explicit lease unlock and just drop the layout > > > lease on file close and if there are still pinned pages, send SIGKILL to an > > > application as a reminder it did something stupid... > > > > Which process would you SIGKILL? At least for the rdma case a FD is > > holding the GUP, so to do the put_user_pages() the kernel needs to > > close the FD. I guess it would have to kill every process that has the > > FD open? Seems complicated... > > Tending to agree... But I'm still not opposed to killing bad actors... ;-) > > NOTE: Jason I think you need to be more clear about the FD you are speaking of. > I believe you mean the FD which refers to the RMDA context. That is what I > called it in my other email. I keep forgetting that the file with RDMA context may be held by multiple processes so thanks for correcting me. My proposal with SIGKILL was jumping to conclusion too quickly :) We have two struct files here: A file with RDMA context that effectively is the owner of the page pins (let's call it "context file") and a file which is mapped and on which we hold the lease and whose blocks (pages) we are pinning (let's call it "buffer file"). Now once buffer file is closed (and this means that all file descriptors pointing to this struct file are closed - so just one child closing the file descriptor won't trigger this) we need to release the lease and I want to have a way of safely releasing remaining pins associated with this lease as well. Because the pins would be invisible to sysadmin from that point on. Now if the context file would be open only by the process closing the buffer file, SIGKILL would work as that would close the buffer file as a side effect. But as you properly pointed out, that's not necessarily the case. Walking processes that have the context file open is technically complex and too ugly to live so we have to come up with something better. The best I can currently come up with is to have a method associated with the lease that would invalidate the RDMA context that holds the pins in the same way that a file close would do it. Honza
On Thu 06-06-19 15:03:30, Ira Weiny wrote: > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > So I'd like to actually mandate that you *must* hold the file lease until > > you unpin all pages in the given range (not just that you have an option to > > hold a lease). And I believe the kernel should actually enforce this. That > > way we maintain a sane state that if someone uses a physical location of > > logical file offset on disk, he has a layout lease. Also once this is done, > > sysadmin has a reasonably easy way to discover run-away RDMA application > > and kill it if he wishes so. > > Fair enough. > > I was kind of heading that direction but had not thought this far forward. I > was exploring how to have a lease remain on the file even after a "lease > break". But that is incompatible with the current semantics of a "layout" > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > out to see what people think of this idea so I did not look at keeping the > lease.] > > Also hitch is that currently a lease is forcefully broken after > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > lease type with the semantics you describe. I'd do what Dave suggested - add flag to mark lease as unbreakable by truncate and teach file locking core to handle that. There actually is support for locks that are not broken after given timeout so there shouldn't be too many changes need. > Previously I had thought this would be a good idea (for other reasons). But > what does everyone think about using a "longterm lease" similar to [1] which > has the semantics you proppose? In [1] I was not sure "longterm" was a good > name but with your proposal I think it makes more sense. As I wrote elsewhere in this thread I think FL_LAYOUT name still makes sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with truncate. > > - probably I'd just transition all gup_longterm() > > users to a saner API similar to the one we have in mm/frame_vector.c where > > we don't hand out page pointers but an encapsulating structure that does > > all the necessary tracking. > > I'll take a look at that code. But that seems like a pretty big change. I was looking into that yesterday before proposing this and there aren't than many gup_longterm() users and most of them anyway just stick pages array into their tracking structure and then release them once done. So it shouldn't be that complex to convert to a new convention (and you have to touch all gup_longterm() users anyway to teach them track leases etc.). > > Removing a lease would need to block until all > > pins are released - this is probably the most hairy part since we need to > > handle a case if application just closes the file descriptor which would > > release the lease but OTOH we need to make sure task exit does not deadlock. > > Maybe we could block only on explicit lease unlock and just drop the layout > > lease on file close and if there are still pinned pages, send SIGKILL to an > > application as a reminder it did something stupid... > > As presented at LSFmm I'm not opposed to killing a process which does not > "follow the rules". But I'm concerned about how to handle this across a fork. > > Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins > to a child through the RDMA context. This was the major issue Jason had with > the SIGBUS proposal. > > Always sending a SIGKILL would prevent an RDMA process from doing something > like system("ls") (would kill the child unnecessarily). Are we ok with that? I answered this in another email but system("ls") won't kill anybody. fork(2) just creates new file descriptor for the same file and possibly then closes it but since there is still another file descriptor for the same struct file, the "close" code won't trigger. Honza
On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > Because the pins would be invisible to sysadmin from that point on. It is not invisible, it just shows up in a rdma specific kernel interface. You have to use rdma netlink to see the kernel object holding this pin. If this visibility is the main sticking point I suggest just enhancing the existing MR reporting to include the file info for current GUP pins and teaching lsof to collect information from there as well so it is easy to use. If the ownership of the lease transfers to the MR, and we report that ownership to userspace in a way lsof can find, then I think all the concerns that have been raised are met, right? > ugly to live so we have to come up with something better. The best I can > currently come up with is to have a method associated with the lease that > would invalidate the RDMA context that holds the pins in the same way that > a file close would do it. This is back to requiring all RDMA HW to have some new behavior they currently don't have.. The main objection to the current ODP & DAX solution is that very little HW can actually implement it, having the alternative still require HW support doesn't seem like progress. I think we will eventually start seein some HW be able to do this invalidation, but it won't be universal, and I'd rather leave it optional, for recovery from truely catastrophic errors (ie my DAX is on fire, I need to unplug it). Jason
On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote: > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > > > Because the pins would be invisible to sysadmin from that point on. > > It is not invisible, it just shows up in a rdma specific kernel > interface. You have to use rdma netlink to see the kernel object > holding this pin. > > If this visibility is the main sticking point I suggest just enhancing > the existing MR reporting to include the file info for current GUP > pins and teaching lsof to collect information from there as well so it > is easy to use. > > If the ownership of the lease transfers to the MR, and we report that > ownership to userspace in a way lsof can find, then I think all the > concerns that have been raised are met, right? I was contemplating some new lsof feature yesterday. But what I don't think we want is sysadmins to have multiple tools for multiple subsystems. Or even have to teach lsof something new for every potential new subsystem user of GUP pins. I was thinking more along the lines of reporting files which have GUP pins on them directly somewhere (dare I say procfs?) and teaching lsof to report that information. That would cover any subsystem which does a longterm pin. > > > ugly to live so we have to come up with something better. The best I can > > currently come up with is to have a method associated with the lease that > > would invalidate the RDMA context that holds the pins in the same way that > > a file close would do it. > > This is back to requiring all RDMA HW to have some new behavior they > currently don't have.. > > The main objection to the current ODP & DAX solution is that very > little HW can actually implement it, having the alternative still > require HW support doesn't seem like progress. > > I think we will eventually start seein some HW be able to do this > invalidation, but it won't be universal, and I'd rather leave it > optional, for recovery from truely catastrophic errors (ie my DAX is > on fire, I need to unplug it). Agreed. I think software wise there is not much some of the devices can do with such an "invalidate". Ira
On Fri, Jun 07, 2019 at 07:52:13AM -0700, Ira Weiny wrote: > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote: > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > > > > > Because the pins would be invisible to sysadmin from that point on. > > > > It is not invisible, it just shows up in a rdma specific kernel > > interface. You have to use rdma netlink to see the kernel object > > holding this pin. > > > > If this visibility is the main sticking point I suggest just enhancing > > the existing MR reporting to include the file info for current GUP > > pins and teaching lsof to collect information from there as well so it > > is easy to use. > > > > If the ownership of the lease transfers to the MR, and we report that > > ownership to userspace in a way lsof can find, then I think all the > > concerns that have been raised are met, right? > > I was contemplating some new lsof feature yesterday. But what I don't think we > want is sysadmins to have multiple tools for multiple subsystems. Or even have > to teach lsof something new for every potential new subsystem user of GUP pins. Well.. it is a bit tricky, but you'd have to arrange for the lease object to have a list of 'struct files' that are holding the lease open. The first would be the file that did the fcntl, the next would be all the files that did longterm GUP - which means longterm GUP has to have a chardev file/etc as well (seems OK) Then lsof could query the list of lease objects for each file it encounters and print them out too. Jason
On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote: > On Thu 06-06-19 15:03:30, Ira Weiny wrote: > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > So I'd like to actually mandate that you *must* hold the file lease until > > > you unpin all pages in the given range (not just that you have an option to > > > hold a lease). And I believe the kernel should actually enforce this. That > > > way we maintain a sane state that if someone uses a physical location of > > > logical file offset on disk, he has a layout lease. Also once this is done, > > > sysadmin has a reasonably easy way to discover run-away RDMA application > > > and kill it if he wishes so. > > > > Fair enough. > > > > I was kind of heading that direction but had not thought this far forward. I > > was exploring how to have a lease remain on the file even after a "lease > > break". But that is incompatible with the current semantics of a "layout" > > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > > out to see what people think of this idea so I did not look at keeping the > > lease.] > > > > Also hitch is that currently a lease is forcefully broken after > > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > > lease type with the semantics you describe. > > I'd do what Dave suggested - add flag to mark lease as unbreakable by > truncate and teach file locking core to handle that. There actually is > support for locks that are not broken after given timeout so there > shouldn't be too many changes need. > > > Previously I had thought this would be a good idea (for other reasons). But > > what does everyone think about using a "longterm lease" similar to [1] which > > has the semantics you proppose? In [1] I was not sure "longterm" was a good > > name but with your proposal I think it makes more sense. > > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with > truncate. Ok I want to make sure I understand what you and Dave are suggesting. Are you suggesting that we have something like this from user space? fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > - probably I'd just transition all gup_longterm() > > > users to a saner API similar to the one we have in mm/frame_vector.c where > > > we don't hand out page pointers but an encapsulating structure that does > > > all the necessary tracking. > > > > I'll take a look at that code. But that seems like a pretty big change. > > I was looking into that yesterday before proposing this and there aren't > than many gup_longterm() users and most of them anyway just stick pages > array into their tracking structure and then release them once done. So it > shouldn't be that complex to convert to a new convention (and you have to > touch all gup_longterm() users anyway to teach them track leases etc.). I think in the direction we are heading this becomes more attractive for sure. For me though it will take some time. Should we convert the frame_vector over to this new mechanism? (Or more accurately perhaps, add to frame_vector and use it?) It seems bad to have "yet another object" returned from the pin pages interface... And I think this is related to what Christoph Hellwig is doing with bio_vec and dma. Really we want drivers out of the page processing business. So for now I'm going to move forward with the idea of handing "some object" to the GUP callers and figure out the lsof stuff, and let bigger questions like this play out a bit more before I try and work with that code. Fair? > > > > Removing a lease would need to block until all > > > pins are released - this is probably the most hairy part since we need to > > > handle a case if application just closes the file descriptor which would > > > release the lease but OTOH we need to make sure task exit does not deadlock. > > > Maybe we could block only on explicit lease unlock and just drop the layout > > > lease on file close and if there are still pinned pages, send SIGKILL to an > > > application as a reminder it did something stupid... > > > > As presented at LSFmm I'm not opposed to killing a process which does not > > "follow the rules". But I'm concerned about how to handle this across a fork. > > > > Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins > > to a child through the RDMA context. This was the major issue Jason had with > > the SIGBUS proposal. > > > > Always sending a SIGKILL would prevent an RDMA process from doing something > > like system("ls") (would kill the child unnecessarily). Are we ok with that? > > I answered this in another email but system("ls") won't kill anybody. > fork(2) just creates new file descriptor for the same file and possibly > then closes it but since there is still another file descriptor for the > same struct file, the "close" code won't trigger. Agreed. I was wrong. Sorry. But if we can keep track of who has the pins in lsof can we agree no process needs to be SIGKILL'ed? Admins can do this on their own "killing" if they really need to stop the use of these files, right? Ira
On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > And I think this is related to what Christoph Hellwig is doing with bio_vec and > dma. Really we want drivers out of the page processing business. At least for RDMA, and a few other places I've noticed, I'd really like to get totally out of the handling struct pages game. We are DMA based and really only want DMA addresses for the target device. I know other places need CPU pages or more complicated things.. But I also know there are other drivers like RDMA.. So I think it would be very helpful to have a driver API something like: int get_user_mem_for_dma(struct device *dma_device, void __user *mem, size_t length, struct gup_handle *res, struct 'bio dma list' *dma_list, const struct dma_params *params); void put_user_mem_for_dma(struct gup_handle *res, struct 'bio dma list' *dma_list); And we could hope to put in there all the specialty logic we want to have for this flow: - The weird HMM stuff in hmm_range_dma_map() - Interaction with DAX - Interaction with DMA BUF - Holding file leases - PCI peer 2 peer features - Optimizations for huge pages - Handling page dirtying from DMA - etc I think Matthew was suggesting something like this at LS/MM, so +1 from here.. When Christoph sends his BIO dma work I was thinking of investigating this avenue, as we already have something quite similiar in RDMA that could perhaps be hoisted out for re-use into mm/ Jason
On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote: > > On Thu 06-06-19 15:03:30, Ira Weiny wrote: > > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > So I'd like to actually mandate that you *must* hold the file lease until > > > > you unpin all pages in the given range (not just that you have an option to > > > > hold a lease). And I believe the kernel should actually enforce this. That > > > > way we maintain a sane state that if someone uses a physical location of > > > > logical file offset on disk, he has a layout lease. Also once this is done, > > > > sysadmin has a reasonably easy way to discover run-away RDMA application > > > > and kill it if he wishes so. > > > > > > Fair enough. > > > > > > I was kind of heading that direction but had not thought this far forward. I > > > was exploring how to have a lease remain on the file even after a "lease > > > break". But that is incompatible with the current semantics of a "layout" > > > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > > > out to see what people think of this idea so I did not look at keeping the > > > lease.] > > > > > > Also hitch is that currently a lease is forcefully broken after > > > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > > > lease type with the semantics you describe. > > > > I'd do what Dave suggested - add flag to mark lease as unbreakable by > > truncate and teach file locking core to handle that. There actually is > > support for locks that are not broken after given timeout so there > > shouldn't be too many changes need. > > > > > Previously I had thought this would be a good idea (for other reasons). But > > > what does everyone think about using a "longterm lease" similar to [1] which > > > has the semantics you proppose? In [1] I was not sure "longterm" was a good > > > name but with your proposal I think it makes more sense. > > > > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes > > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with > > truncate. > > Ok I want to make sure I understand what you and Dave are suggesting. > > Are you suggesting that we have something like this from user space? > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); Rather than "unbreakable", perhaps a clearer description of the policy it entails is "exclusive"? i.e. what we are talking about here is an exclusive lease that prevents other processes from changing the layout. i.e. the mechanism used to guarantee a lease is exclusive is that the layout becomes "unbreakable" at the filesystem level, but the policy we are actually presenting to uses is "exclusive access"... Cheers, Dave.
On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote: > > > On Thu 06-06-19 15:03:30, Ira Weiny wrote: > > > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote: > > > > > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote: > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > > > So I'd like to actually mandate that you *must* hold the file lease until > > > > > you unpin all pages in the given range (not just that you have an option to > > > > > hold a lease). And I believe the kernel should actually enforce this. That > > > > > way we maintain a sane state that if someone uses a physical location of > > > > > logical file offset on disk, he has a layout lease. Also once this is done, > > > > > sysadmin has a reasonably easy way to discover run-away RDMA application > > > > > and kill it if he wishes so. > > > > > > > > Fair enough. > > > > > > > > I was kind of heading that direction but had not thought this far forward. I > > > > was exploring how to have a lease remain on the file even after a "lease > > > > break". But that is incompatible with the current semantics of a "layout" > > > > lease (as currently defined in the kernel). [In the end I wanted to get an RFC > > > > out to see what people think of this idea so I did not look at keeping the > > > > lease.] > > > > > > > > Also hitch is that currently a lease is forcefully broken after > > > > <sysfs>/lease-break-time. To do what you suggest I think we would need a new > > > > lease type with the semantics you describe. > > > > > > I'd do what Dave suggested - add flag to mark lease as unbreakable by > > > truncate and teach file locking core to handle that. There actually is > > > support for locks that are not broken after given timeout so there > > > shouldn't be too many changes need. > > > > > > > Previously I had thought this would be a good idea (for other reasons). But > > > > what does everyone think about using a "longterm lease" similar to [1] which > > > > has the semantics you proppose? In [1] I was not sure "longterm" was a good > > > > name but with your proposal I think it makes more sense. > > > > > > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes > > > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with > > > truncate. > > > > Ok I want to make sure I understand what you and Dave are suggesting. > > > > Are you suggesting that we have something like this from user space? > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > Rather than "unbreakable", perhaps a clearer description of the > policy it entails is "exclusive"? > > i.e. what we are talking about here is an exclusive lease that > prevents other processes from changing the layout. i.e. the > mechanism used to guarantee a lease is exclusive is that the layout > becomes "unbreakable" at the filesystem level, but the policy we are > actually presenting to uses is "exclusive access"... That sounds good. Ira > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri 07-06-19 07:52:13, Ira Weiny wrote: > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote: > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > > > > > Because the pins would be invisible to sysadmin from that point on. > > > > It is not invisible, it just shows up in a rdma specific kernel > > interface. You have to use rdma netlink to see the kernel object > > holding this pin. > > > > If this visibility is the main sticking point I suggest just enhancing > > the existing MR reporting to include the file info for current GUP > > pins and teaching lsof to collect information from there as well so it > > is easy to use. > > > > If the ownership of the lease transfers to the MR, and we report that > > ownership to userspace in a way lsof can find, then I think all the > > concerns that have been raised are met, right? > > I was contemplating some new lsof feature yesterday. But what I don't > think we want is sysadmins to have multiple tools for multiple > subsystems. Or even have to teach lsof something new for every potential > new subsystem user of GUP pins. Agreed. > I was thinking more along the lines of reporting files which have GUP > pins on them directly somewhere (dare I say procfs?) and teaching lsof to > report that information. That would cover any subsystem which does a > longterm pin. So lsof already parses /proc/<pid>/maps to learn about files held open by memory mappings. It could parse some other file as well I guess. The good thing about that would be that then "longterm pin" structure would just hold struct file reference. That would avoid any needs of special behavior on file close (the file reference in the "longterm pin" structure would make sure struct file and thus the lease stays around, we'd just need to make explicit lease unlock block until the "longterm pin" structure is freed). The bad thing is that it requires us to come up with a sane new proc interface for reporting "longterm pins" and associated struct file. Also we need to define what this interface shows if the pinned pages are in DRAM (either page cache or anon) and not on NVDIMM. > > > ugly to live so we have to come up with something better. The best I can > > > currently come up with is to have a method associated with the lease that > > > would invalidate the RDMA context that holds the pins in the same way that > > > a file close would do it. > > > > This is back to requiring all RDMA HW to have some new behavior they > > currently don't have.. > > > > The main objection to the current ODP & DAX solution is that very > > little HW can actually implement it, having the alternative still > > require HW support doesn't seem like progress. > > > > I think we will eventually start seein some HW be able to do this > > invalidation, but it won't be universal, and I'd rather leave it > > optional, for recovery from truely catastrophic errors (ie my DAX is > > on fire, I need to unplug it). > > Agreed. I think software wise there is not much some of the devices can do > with such an "invalidate". So out of curiosity: What does RDMA driver do when userspace just closes the file pointing to RDMA object? It has to handle that somehow by aborting everything that's going on... And I wanted similar behavior here. Honza
On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > The main objection to the current ODP & DAX solution is that very > > > little HW can actually implement it, having the alternative still > > > require HW support doesn't seem like progress. > > > > > > I think we will eventually start seein some HW be able to do this > > > invalidation, but it won't be universal, and I'd rather leave it > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > on fire, I need to unplug it). > > > > Agreed. I think software wise there is not much some of the devices can do > > with such an "invalidate". > > So out of curiosity: What does RDMA driver do when userspace just closes > the file pointing to RDMA object? It has to handle that somehow by aborting > everything that's going on... And I wanted similar behavior here. It aborts *everything* connected to that file descriptor. Destroying everything avoids creating inconsistencies that destroying a subset would create. What has been talked about for lease break is not destroying anything but very selectively saying that one memory region linked to the GUP is no longer functional. Jason
On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > The main objection to the current ODP & DAX solution is that very > > > > little HW can actually implement it, having the alternative still > > > > require HW support doesn't seem like progress. > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > on fire, I need to unplug it). > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > with such an "invalidate". > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > the file pointing to RDMA object? It has to handle that somehow by aborting > > everything that's going on... And I wanted similar behavior here. > > It aborts *everything* connected to that file descriptor. Destroying > everything avoids creating inconsistencies that destroying a subset > would create. > > What has been talked about for lease break is not destroying anything > but very selectively saying that one memory region linked to the GUP > is no longer functional. OK, so what I had in mind was that if RDMA app doesn't play by the rules and closes the file with existing pins (and thus layout lease) we would force it to abort everything. Yes, it is disruptive but then the app didn't obey the rule that it has to maintain file lease while holding pins. Thus such situation should never happen unless the app is malicious / buggy. Honza
On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > Are you suggesting that we have something like this from user space? > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > Rather than "unbreakable", perhaps a clearer description of the > policy it entails is "exclusive"? > > i.e. what we are talking about here is an exclusive lease that > prevents other processes from changing the layout. i.e. the > mechanism used to guarantee a lease is exclusive is that the layout > becomes "unbreakable" at the filesystem level, but the policy we are > actually presenting to uses is "exclusive access"... That's rather different from the normal meaning of 'exclusive' in the context of locks, which is "only one user can have access to this at a time". As I understand it, this is rather more like a 'shared' or 'read' lock. The filesystem would be the one which wants an exclusive lock, so it can modify the mapping of logical to physical blocks. The complication being that by default the filesystem has an exclusive lock on the mapping, and what we're trying to add is the ability for readers to ask the filesystem to give up its exclusive lock.
On Wed, Jun 12, 2019 at 5:09 AM Jan Kara <jack@suse.cz> wrote: > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > little HW can actually implement it, having the alternative still > > > > > require HW support doesn't seem like progress. > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > on fire, I need to unplug it). > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > with such an "invalidate". > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > everything that's going on... And I wanted similar behavior here. > > > > It aborts *everything* connected to that file descriptor. Destroying > > everything avoids creating inconsistencies that destroying a subset > > would create. > > > > What has been talked about for lease break is not destroying anything > > but very selectively saying that one memory region linked to the GUP > > is no longer functional. > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > and closes the file with existing pins (and thus layout lease) we would > force it to abort everything. Yes, it is disruptive but then the app didn't > obey the rule that it has to maintain file lease while holding pins. Thus > such situation should never happen unless the app is malicious / buggy. When you say 'close' do you mean the final release of the fd? The vma keeps a reference to a 'struct file' live even after the fd is closed.
On Wed, Jun 12, 2019 at 3:29 AM Jan Kara <jack@suse.cz> wrote: > > On Fri 07-06-19 07:52:13, Ira Weiny wrote: > > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote: > > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > > > > > > > Because the pins would be invisible to sysadmin from that point on. > > > > > > It is not invisible, it just shows up in a rdma specific kernel > > > interface. You have to use rdma netlink to see the kernel object > > > holding this pin. > > > > > > If this visibility is the main sticking point I suggest just enhancing > > > the existing MR reporting to include the file info for current GUP > > > pins and teaching lsof to collect information from there as well so it > > > is easy to use. > > > > > > If the ownership of the lease transfers to the MR, and we report that > > > ownership to userspace in a way lsof can find, then I think all the > > > concerns that have been raised are met, right? > > > > I was contemplating some new lsof feature yesterday. But what I don't > > think we want is sysadmins to have multiple tools for multiple > > subsystems. Or even have to teach lsof something new for every potential > > new subsystem user of GUP pins. > > Agreed. > > > I was thinking more along the lines of reporting files which have GUP > > pins on them directly somewhere (dare I say procfs?) and teaching lsof to > > report that information. That would cover any subsystem which does a > > longterm pin. > > So lsof already parses /proc/<pid>/maps to learn about files held open by > memory mappings. It could parse some other file as well I guess. The good > thing about that would be that then "longterm pin" structure would just hold > struct file reference. That would avoid any needs of special behavior on > file close (the file reference in the "longterm pin" structure would make > sure struct file and thus the lease stays around, we'd just need to make > explicit lease unlock block until the "longterm pin" structure is freed). > The bad thing is that it requires us to come up with a sane new proc > interface for reporting "longterm pins" and associated struct file. Also we > need to define what this interface shows if the pinned pages are in DRAM > (either page cache or anon) and not on NVDIMM. The anon vs shared detection case is important because a longterm pin might be blocking a memory-hot-unplug operation if it is pinning ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an explicit concern of the interface. For the anon / cached case I expect it might be useful to put that communication under the memory-blocks sysfs interface. I.e. a list of pids that are pinning that memory-block from being hot-unplugged.
On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > little HW can actually implement it, having the alternative still > > > > > require HW support doesn't seem like progress. > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > on fire, I need to unplug it). > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > with such an "invalidate". > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > everything that's going on... And I wanted similar behavior here. > > > > It aborts *everything* connected to that file descriptor. Destroying > > everything avoids creating inconsistencies that destroying a subset > > would create. > > > > What has been talked about for lease break is not destroying anything > > but very selectively saying that one memory region linked to the GUP > > is no longer functional. > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > and closes the file with existing pins (and thus layout lease) we would > force it to abort everything. Yes, it is disruptive but then the app didn't > obey the rule that it has to maintain file lease while holding pins. Thus > such situation should never happen unless the app is malicious / buggy. We do have the infrastructure to completely revoke the entire *content* of a FD (this is called device disassociate). It is basically close without the app doing close. But again it only works with some drivers. However, this is more likely something a driver could support without a HW change though. It is quite destructive as it forcibly kills everything RDMA related the process(es) are doing, but it is less violent than SIGKILL, and there is perhaps a way for the app to recover from this, if it is coded for it. My preference would be to avoid this scenario, but if it is really necessary, we could probably build it with some work. The only case we use it today is forced HW hot unplug, so it is rarely used and only for an 'emergency' like use case. Jason
On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > little HW can actually implement it, having the alternative still > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > on fire, I need to unplug it). > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > with such an "invalidate". > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > everything that's going on... And I wanted similar behavior here. > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > everything avoids creating inconsistencies that destroying a subset > > > would create. > > > > > > What has been talked about for lease break is not destroying anything > > > but very selectively saying that one memory region linked to the GUP > > > is no longer functional. > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > and closes the file with existing pins (and thus layout lease) we would > > force it to abort everything. Yes, it is disruptive but then the app didn't > > obey the rule that it has to maintain file lease while holding pins. Thus > > such situation should never happen unless the app is malicious / buggy. > > We do have the infrastructure to completely revoke the entire > *content* of a FD (this is called device disassociate). It is > basically close without the app doing close. But again it only works > with some drivers. However, this is more likely something a driver > could support without a HW change though. > > It is quite destructive as it forcibly kills everything RDMA related > the process(es) are doing, but it is less violent than SIGKILL, and > there is perhaps a way for the app to recover from this, if it is > coded for it. I don't think many are... I think most would effectively be "killed" if this happened to them. > > My preference would be to avoid this scenario, but if it is really > necessary, we could probably build it with some work. > > The only case we use it today is forced HW hot unplug, so it is rarely > used and only for an 'emergency' like use case. I'd really like to avoid this as well. I think it will be very confusing for RDMA apps to have their context suddenly be invalid. I think if we have a way for admins to ID who is pinning a file the admin can take more appropriate action on those processes. Up to and including killing the process. Ira
On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <ira.weiny@intel.com> wrote: > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > > with such an "invalidate". > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > everything avoids creating inconsistencies that destroying a subset > > > > would create. > > > > > > > > What has been talked about for lease break is not destroying anything > > > > but very selectively saying that one memory region linked to the GUP > > > > is no longer functional. > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > and closes the file with existing pins (and thus layout lease) we would > > > force it to abort everything. Yes, it is disruptive but then the app didn't > > > obey the rule that it has to maintain file lease while holding pins. Thus > > > such situation should never happen unless the app is malicious / buggy. > > > > We do have the infrastructure to completely revoke the entire > > *content* of a FD (this is called device disassociate). It is > > basically close without the app doing close. But again it only works > > with some drivers. However, this is more likely something a driver > > could support without a HW change though. > > > > It is quite destructive as it forcibly kills everything RDMA related > > the process(es) are doing, but it is less violent than SIGKILL, and > > there is perhaps a way for the app to recover from this, if it is > > coded for it. > > I don't think many are... I think most would effectively be "killed" if this > happened to them. > > > > > My preference would be to avoid this scenario, but if it is really > > necessary, we could probably build it with some work. > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > used and only for an 'emergency' like use case. > > I'd really like to avoid this as well. I think it will be very confusing for > RDMA apps to have their context suddenly be invalid. I think if we have a way > for admins to ID who is pinning a file the admin can take more appropriate > action on those processes. Up to and including killing the process. Can RDMA context invalidation, "device disassociate", be inflicted on a process from the outside? Identifying the pid of a pin holder only leaves SIGKILL of the entire process as the remediation for revoking a pin, and I assume admins would use the finer grained invalidation where it was available.
On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > Are you suggesting that we have something like this from user space? > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > Rather than "unbreakable", perhaps a clearer description of the > > policy it entails is "exclusive"? > > > > i.e. what we are talking about here is an exclusive lease that > > prevents other processes from changing the layout. i.e. the > > mechanism used to guarantee a lease is exclusive is that the layout > > becomes "unbreakable" at the filesystem level, but the policy we are > > actually presenting to uses is "exclusive access"... > > That's rather different from the normal meaning of 'exclusive' in the > context of locks, which is "only one user can have access to this at > a time". As I understand it, this is rather more like a 'shared' or > 'read' lock. The filesystem would be the one which wants an exclusive > lock, so it can modify the mapping of logical to physical blocks. > > The complication being that by default the filesystem has an exclusive > lock on the mapping, and what we're trying to add is the ability for > readers to ask the filesystem to give up its exclusive lock. This is an interesting view... And after some more thought, exclusive does not seem like a good name for this because technically F_WRLCK _is_ an exclusive lease... In addition, the user does not need to take the "exclusive" write lease to be notified of (broken by) an unexpected truncate. A "read" lease is broken by truncate. (And "write" leases really don't do anything different WRT the interaction of the FS and the user app. Write leases control "exclusive" access between other file descriptors.) Another thing to consider is that this patch set _allows_ a truncate/hole punch to proceed _if_ the pages being affected are not actually pinned. So the unbreakable/exclusive nature of the lease is not absolute. Personally I like this functionality. I'm not quite sure I can make it work with what Jan is suggesting. But I like it. Given the muddied water of "exclusive" and "write" lease I'm now feeling like Jeff has a point WRT the conflation of F_RDLCK/F_WRLCK/F_UNLCK and this new functionality. Should we use his suggested F_SETLAYOUT/F_GETLAYOUT cmd type?[1] Ira [1] https://lkml.org/lkml/2019/6/9/117
On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote: > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <ira.weiny@intel.com> wrote: > > > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > > > with such an "invalidate". > > > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > > everything avoids creating inconsistencies that destroying a subset > > > > > would create. > > > > > > > > > > What has been talked about for lease break is not destroying anything > > > > > but very selectively saying that one memory region linked to the GUP > > > > > is no longer functional. > > > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > > and closes the file with existing pins (and thus layout lease) we would > > > > force it to abort everything. Yes, it is disruptive but then the app didn't > > > > obey the rule that it has to maintain file lease while holding pins. Thus > > > > such situation should never happen unless the app is malicious / buggy. > > > > > > We do have the infrastructure to completely revoke the entire > > > *content* of a FD (this is called device disassociate). It is > > > basically close without the app doing close. But again it only works > > > with some drivers. However, this is more likely something a driver > > > could support without a HW change though. > > > > > > It is quite destructive as it forcibly kills everything RDMA related > > > the process(es) are doing, but it is less violent than SIGKILL, and > > > there is perhaps a way for the app to recover from this, if it is > > > coded for it. > > > > I don't think many are... I think most would effectively be "killed" if this > > happened to them. > > > > > > > > My preference would be to avoid this scenario, but if it is really > > > necessary, we could probably build it with some work. > > > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > > used and only for an 'emergency' like use case. > > > > I'd really like to avoid this as well. I think it will be very confusing for > > RDMA apps to have their context suddenly be invalid. I think if we have a way > > for admins to ID who is pinning a file the admin can take more appropriate > > action on those processes. Up to and including killing the process. > > Can RDMA context invalidation, "device disassociate", be inflicted on > a process from the outside? Identifying the pid of a pin holder only > leaves SIGKILL of the entire process as the remediation for revoking a > pin, and I assume admins would use the finer grained invalidation > where it was available. No not in the way you are describing it. As Jason said you can hotplug the device which is "from the outside" but this would affect all users of that device. Effectively, we would need a way for an admin to close a specific file descriptor (or set of fds) which point to that file. AFAIK there is no way to do that at all, is there? Ira
On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > Are you suggesting that we have something like this from user space? > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > Rather than "unbreakable", perhaps a clearer description of the > > policy it entails is "exclusive"? > > > > i.e. what we are talking about here is an exclusive lease that > > prevents other processes from changing the layout. i.e. the > > mechanism used to guarantee a lease is exclusive is that the layout > > becomes "unbreakable" at the filesystem level, but the policy we are > > actually presenting to uses is "exclusive access"... > > That's rather different from the normal meaning of 'exclusive' in the > context of locks, which is "only one user can have access to this at > a time". Layout leases are not locks, they are a user access policy object. It is the process/fd which holds the lease and it's the process/fd that is granted exclusive access. This is exactly the same semantic as O_EXCL provides for granting exclusive access to a block device via open(), yes? > As I understand it, this is rather more like a 'shared' or > 'read' lock. The filesystem would be the one which wants an exclusive > lock, so it can modify the mapping of logical to physical blocks. ISTM that you're conflating internal filesystem implementation with application visible semantics. Yes, the filesystem uses internal locks to serialise the modification of the things the lease manages access too, but that has nothing to do with the access policy the lease provides to users. e.g. Process A has an exclusive layout lease on file F. It does an IO to file F. The filesystem IO path checks that Process A owns the lease on the file and so skips straight through layout breaking because it owns the lease and is allowed to modify the layout. It then takes the inode metadata locks to allocate new space and write new data. Process B now tries to write to file F. The FS checks whether Process B owns a layout lease on file F. It doesn't, so then it tries to break the layout lease so the IO can proceed. The layout breaking code sees that process A has an exclusive layout lease granted, and so returns -ETXTBSY to process B - it is not allowed to break the lease and so the IO fails with -ETXTBSY. i.e. the exclusive layout lease prevents other processes from performing operations that may need to modify the layout from performing those operations. It does not "lock" the file/inode in any way, it just changes how the layout lease breaking behaves. Further, the "exclusiveness" of a layout lease is completely irrelevant to the filesystem that is indicating that an operation that may need to modify the layout is about to be performed. All the filesystem has to do is handle failures to break the lease appropriately. Yes, XFS serialises the layout lease validation against other IO to the same file via it's IO locks, but that's an internal data IO coherency requirement, not anything to do with layout lease management. Note that I talk about /writes/ here. This is interchangable with any other operation that may need to modify the extent layout of the file, be it truncate, fallocate, etc: the attempt to break the layout lease by a non-owner should fail if the lease is "exclusive" to the owner. > The complication being that by default the filesystem has an exclusive > lock on the mapping, and what we're trying to add is the ability for > readers to ask the filesystem to give up its exclusive lock. The filesystem doesn't even lock the "mapping" until after the layout lease has been validated or broken. Cheers, Dave.
On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote: > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > > Are you suggesting that we have something like this from user space? > > > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > > > Rather than "unbreakable", perhaps a clearer description of the > > > policy it entails is "exclusive"? > > > > > > i.e. what we are talking about here is an exclusive lease that > > > prevents other processes from changing the layout. i.e. the > > > mechanism used to guarantee a lease is exclusive is that the layout > > > becomes "unbreakable" at the filesystem level, but the policy we are > > > actually presenting to uses is "exclusive access"... > > > > That's rather different from the normal meaning of 'exclusive' in the > > context of locks, which is "only one user can have access to this at > > a time". As I understand it, this is rather more like a 'shared' or > > 'read' lock. The filesystem would be the one which wants an exclusive > > lock, so it can modify the mapping of logical to physical blocks. > > > > The complication being that by default the filesystem has an exclusive > > lock on the mapping, and what we're trying to add is the ability for > > readers to ask the filesystem to give up its exclusive lock. > > This is an interesting view... > > And after some more thought, exclusive does not seem like a good name for this > because technically F_WRLCK _is_ an exclusive lease... > > In addition, the user does not need to take the "exclusive" write lease to be > notified of (broken by) an unexpected truncate. A "read" lease is broken by > truncate. (And "write" leases really don't do anything different WRT the > interaction of the FS and the user app. Write leases control "exclusive" > access between other file descriptors.) I've been assuming that there is only one type of layout lease - there is no use case I've heard of for read/write layout leases, and like you say there is zero difference in behaviour at the filesystem level - they all have to be broken to allow a non-lease truncate to proceed. IMO, taking a "read lease" to be able to modify and write to the underlying mapping of a file makes absolutely no sense at all. IOWs, we're talking exaclty about a revokable layout lease vs an exclusive layout lease here, and so read/write really doesn't match the policy or semantics we are trying to provide. > Another thing to consider is that this patch set _allows_ a truncate/hole punch > to proceed _if_ the pages being affected are not actually pinned. So the > unbreakable/exclusive nature of the lease is not absolute. If you're talking about the process that owns the layout lease running the truncate, then that is fine. However, if you are talking about a process that does not own the layout lease being allowed to truncate a file without first breaking the layout lease, then that is fundamentally broken. i.e. If you don't own a layout lease, the layout leases must be broken before the truncate can proceed. If it's an exclusive lease, then you cannot break the lease and the truncate *must fail before it is started*. i.e. the layout lease state must be correctly resolved before we start an operation that may modify a file layout. Determining if we can actually do the truncate based on page state occurs /after/ the lease says the truncate can proceed.... Cheers, Dave.
On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny <ira.weiny@intel.com> wrote: > > On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > > > > with such an "invalidate". > > > > > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > > > everything avoids creating inconsistencies that destroying a subset > > > > > > would create. > > > > > > > > > > > > What has been talked about for lease break is not destroying anything > > > > > > but very selectively saying that one memory region linked to the GUP > > > > > > is no longer functional. > > > > > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > > > and closes the file with existing pins (and thus layout lease) we would > > > > > force it to abort everything. Yes, it is disruptive but then the app didn't > > > > > obey the rule that it has to maintain file lease while holding pins. Thus > > > > > such situation should never happen unless the app is malicious / buggy. > > > > > > > > We do have the infrastructure to completely revoke the entire > > > > *content* of a FD (this is called device disassociate). It is > > > > basically close without the app doing close. But again it only works > > > > with some drivers. However, this is more likely something a driver > > > > could support without a HW change though. > > > > > > > > It is quite destructive as it forcibly kills everything RDMA related > > > > the process(es) are doing, but it is less violent than SIGKILL, and > > > > there is perhaps a way for the app to recover from this, if it is > > > > coded for it. > > > > > > I don't think many are... I think most would effectively be "killed" if this > > > happened to them. > > > > > > > > > > > My preference would be to avoid this scenario, but if it is really > > > > necessary, we could probably build it with some work. > > > > > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > > > used and only for an 'emergency' like use case. > > > > > > I'd really like to avoid this as well. I think it will be very confusing for > > > RDMA apps to have their context suddenly be invalid. I think if we have a way > > > for admins to ID who is pinning a file the admin can take more appropriate > > > action on those processes. Up to and including killing the process. > > > > Can RDMA context invalidation, "device disassociate", be inflicted on > > a process from the outside? Identifying the pid of a pin holder only > > leaves SIGKILL of the entire process as the remediation for revoking a > > pin, and I assume admins would use the finer grained invalidation > > where it was available. > > No not in the way you are describing it. As Jason said you can hotplug the > device which is "from the outside" but this would affect all users of that > device. > > Effectively, we would need a way for an admin to close a specific file > descriptor (or set of fds) which point to that file. AFAIK there is no way to > do that at all, is there? Even if there were that gets back to my other question, does RDMA teardown happen at close(fd), or at final fput() of the 'struct file'? I.e. does it also need munmap() to get the vma to drop its reference? Perhaps a pointer to the relevant code would help me wrap my head around this mechanism.
On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > That's rather different from the normal meaning of 'exclusive' in the > > context of locks, which is "only one user can have access to this at > > a time". > > Layout leases are not locks, they are a user access policy object. > It is the process/fd which holds the lease and it's the process/fd > that is granted exclusive access. This is exactly the same semantic > as O_EXCL provides for granting exclusive access to a block device > via open(), yes? This isn't my understanding of how RDMA wants this to work, so we should probably clear that up before we get too far down deciding what name to give it. For the RDMA usage case, it is entirely possible that both process A and process B which don't know about each other want to perform RDMA to file F. So there will be two layout leases active on this file at the same time. It's fine for IOs to simultaneously be active to both leases. But if the filesystem wants to move blocks around, it has to break both leases. If Process C tries to do a write to file F without a lease, there's no problem, unless a side-effect of the write would be to change the block mapping, in which case either the leases must break first, or the write must be denied. Jason, please correct me if I've misunderstood the RDMA needs here.
On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote: > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > That's rather different from the normal meaning of 'exclusive' in the > > > context of locks, which is "only one user can have access to this at > > > a time". > > > > Layout leases are not locks, they are a user access policy object. > > It is the process/fd which holds the lease and it's the process/fd > > that is granted exclusive access. This is exactly the same semantic > > as O_EXCL provides for granting exclusive access to a block device > > via open(), yes? > > This isn't my understanding of how RDMA wants this to work, so we should > probably clear that up before we get too far down deciding what name to > give it. > > For the RDMA usage case, it is entirely possible that both process A > and process B which don't know about each other want to perform RDMA to > file F. So there will be two layout leases active on this file at the > same time. It's fine for IOs to simultaneously be active to both leases. Yes, it is. > But if the filesystem wants to move blocks around, it has to break > both leases. No, the _lease layer_ needs to break both leases when the filesystem calls break_layout(). The filesystem is /completely unaware/ of whether a lease is held, how many leases are held, what is involved in revoking leases or whether they are exclusive or not. The filesystem only knows that it is about to perform an operation that may require a layout lease to be broken, so it's _asking permission_ from the layout lease layer whether it is OK to go ahead with the operation. See what I mean about the layout lease being an /access arbitration/ layer? It's the layer that decides whether a modification can be made or not, not the filesystem. The layout lease layer tells the filesystem what it should do, the filesystem just has to ensure it adds layout breaking callouts in places that can block safely and are serialised to ensure operations from new layouts can't race with the operation that broke the existing layouts. > If Process C tries to do a write to file F without a lease, there's no > problem, unless a side-effect of the write would be to change the block > mapping, That's a side effect we cannot predict ahead of time. But it's also _completely irrelevant_ to the layout lease layer API and implementation.(*) > in which case either the leases must break first, or the write > must be denied. Which is exactly how I'm saying layout leases already interact with the filesystem: that if the lease cannot be broken, break_layout() returns -ETXTBSY to the filesystem, and the filesystem returns that to the application having made no changes at all. Layout leases are the policy engine, the filesystem just has to implement the break_layout() callouts such that layout breaking is consistent, correct, and robust.... Cheers, Dave. (*) In the case of XFS, we don't know if a layout change will be necessary until we are deep inside the actual IO path and hold inode metadata locks. We can't block here to break the layout because IO completion and metadata commits need to occur to allow the application to release it's lease and IO completion requires that same inode metadata lock. i.e. if we block once we know a layout change needs to occur, we will deadlock the filesystem on the inode metadata lock. Hence the filesystem implementation dictates when the filesystem issues layout lease break notifications. However, these filesystem implementation issues do not dictate how applications interact with layout leases, how layout leases are managed, whether concurrent leases are allowed, whether leases can be broken, etc. That's all managed by the layout lease layer and that's where the go/no go decision is made and communicated to the filesystem as the return value from the break_layout() call.
On Wed 12-06-19 11:41:53, Dan Williams wrote: > On Wed, Jun 12, 2019 at 5:09 AM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > little HW can actually implement it, having the alternative still > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > on fire, I need to unplug it). > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > with such an "invalidate". > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > everything that's going on... And I wanted similar behavior here. > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > everything avoids creating inconsistencies that destroying a subset > > > would create. > > > > > > What has been talked about for lease break is not destroying anything > > > but very selectively saying that one memory region linked to the GUP > > > is no longer functional. > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > and closes the file with existing pins (and thus layout lease) we would > > force it to abort everything. Yes, it is disruptive but then the app didn't > > obey the rule that it has to maintain file lease while holding pins. Thus > > such situation should never happen unless the app is malicious / buggy. > > When you say 'close' do you mean the final release of the fd? The vma > keeps a reference to a 'struct file' live even after the fd is closed. When I say 'close', I mean a call to ->release file operation which happens when the last reference to struct file is dropped. I.e., when all file descriptors and vmas (and possibly other places holding struct file reference) are gone. Honza
On Wed 12-06-19 11:49:52, Dan Williams wrote: > On Wed, Jun 12, 2019 at 3:29 AM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 07-06-19 07:52:13, Ira Weiny wrote: > > > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote: > > > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote: > > > > > > > > > Because the pins would be invisible to sysadmin from that point on. > > > > > > > > It is not invisible, it just shows up in a rdma specific kernel > > > > interface. You have to use rdma netlink to see the kernel object > > > > holding this pin. > > > > > > > > If this visibility is the main sticking point I suggest just enhancing > > > > the existing MR reporting to include the file info for current GUP > > > > pins and teaching lsof to collect information from there as well so it > > > > is easy to use. > > > > > > > > If the ownership of the lease transfers to the MR, and we report that > > > > ownership to userspace in a way lsof can find, then I think all the > > > > concerns that have been raised are met, right? > > > > > > I was contemplating some new lsof feature yesterday. But what I don't > > > think we want is sysadmins to have multiple tools for multiple > > > subsystems. Or even have to teach lsof something new for every potential > > > new subsystem user of GUP pins. > > > > Agreed. > > > > > I was thinking more along the lines of reporting files which have GUP > > > pins on them directly somewhere (dare I say procfs?) and teaching lsof to > > > report that information. That would cover any subsystem which does a > > > longterm pin. > > > > So lsof already parses /proc/<pid>/maps to learn about files held open by > > memory mappings. It could parse some other file as well I guess. The good > > thing about that would be that then "longterm pin" structure would just hold > > struct file reference. That would avoid any needs of special behavior on > > file close (the file reference in the "longterm pin" structure would make > > sure struct file and thus the lease stays around, we'd just need to make > > explicit lease unlock block until the "longterm pin" structure is freed). > > The bad thing is that it requires us to come up with a sane new proc > > interface for reporting "longterm pins" and associated struct file. Also we > > need to define what this interface shows if the pinned pages are in DRAM > > (either page cache or anon) and not on NVDIMM. > > The anon vs shared detection case is important because a longterm pin > might be blocking a memory-hot-unplug operation if it is pinning > ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an > explicit concern of the interface. For the anon / cached case I expect > it might be useful to put that communication under the memory-blocks > sysfs interface. I.e. a list of pids that are pinning that > memory-block from being hot-unplugged. Yes, I was thinking of memory hotplug as well. But I don't think the distinction is really shared vs anon - a pinned page cache page can be blocking your memory unplug / migration the same way as a pinned anon page. So the information for a pin we need to convey is the "location of resources" being pinned - that is pfn (both for DRAM and NVDIMM) - but then also additional mapping information (which is filename for DAX page, not sure about DRAM). Also a separate question is how to expose this information so that it is efficiently usable by userspace. For lsof, a file in /proc/<pid>/xxx with information would be probably the easiest to use plus all the issues with file access permissions and visibility among different user namespaces is solved out of the box. And I believe it would be reasonably usable for memory hotplug usecase as well. A file in sysfs would be OK for memory hotplug I guess, but not really usable for lsof and so I'm not sure we really need it when we are going to have one in procfs. Honza
On Wed 12-06-19 15:13:36, Ira Weiny wrote: > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > > with such an "invalidate". > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > everything avoids creating inconsistencies that destroying a subset > > > > would create. > > > > > > > > What has been talked about for lease break is not destroying anything > > > > but very selectively saying that one memory region linked to the GUP > > > > is no longer functional. > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > and closes the file with existing pins (and thus layout lease) we would > > > force it to abort everything. Yes, it is disruptive but then the app didn't > > > obey the rule that it has to maintain file lease while holding pins. Thus > > > such situation should never happen unless the app is malicious / buggy. > > > > We do have the infrastructure to completely revoke the entire > > *content* of a FD (this is called device disassociate). It is > > basically close without the app doing close. But again it only works > > with some drivers. However, this is more likely something a driver > > could support without a HW change though. > > > > It is quite destructive as it forcibly kills everything RDMA related > > the process(es) are doing, but it is less violent than SIGKILL, and > > there is perhaps a way for the app to recover from this, if it is > > coded for it. > > I don't think many are... I think most would effectively be "killed" if this > happened to them. Yes, I repeat we are in a situation when the application has a bug and didn't propely manage its long term pins which are fully under its control. So in my mind a situation similar to application using memory it has already freed. The kernel has to manage that but we don't really care what's left from the application when this happens. That being said I'm not insisting this has to happen - tracking associated "RDMA file" with a layout lease and somehow invalidating it on close of a leased file is somewhat ugly anyway. But it is still an option if exposing pins to userspace for lsof to consume proves even worse... Honza
On Thu, Jun 13, 2019 at 02:36:49PM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote: > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > > That's rather different from the normal meaning of 'exclusive' in the > > > > context of locks, which is "only one user can have access to this at > > > > a time". > > > > > > Layout leases are not locks, they are a user access policy object. > > > It is the process/fd which holds the lease and it's the process/fd > > > that is granted exclusive access. This is exactly the same semantic > > > as O_EXCL provides for granting exclusive access to a block device > > > via open(), yes? > > > > This isn't my understanding of how RDMA wants this to work, so we should > > probably clear that up before we get too far down deciding what name to > > give it. > > > > For the RDMA usage case, it is entirely possible that both process A > > and process B which don't know about each other want to perform RDMA to > > file F. So there will be two layout leases active on this file at the > > same time. It's fine for IOs to simultaneously be active to both leases. > > Yes, it is. > > > But if the filesystem wants to move blocks around, it has to break > > both leases. > > No, the _lease layer_ needs to break both leases when the filesystem > calls break_layout(). That's a distinction without a difference as far as userspace is concerned. If process A asks for an exclusive lease (and gets it), then process B asks for an exclusive lease (and gets it), that lease isn't exclusive! It's shared. I think the example you give of O_EXCL is more of a historical accident. It's a relatively recent Linuxism that O_EXCL on a block device means "this block device is not part of a filesystem", and I don't think most userspace programmers are aware of what it means when not paired with O_CREAT. > > If Process C tries to do a write to file F without a lease, there's no > > problem, unless a side-effect of the write would be to change the block > > mapping, > > That's a side effect we cannot predict ahead of time. But it's > also _completely irrelevant_ to the layout lease layer API and > implementation.(*) It's irrelevant to the naming, but you brought it up as part of the semantics.
On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote: > > > My preference would be to avoid this scenario, but if it is really > > > necessary, we could probably build it with some work. > > > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > > used and only for an 'emergency' like use case. > > > > I'd really like to avoid this as well. I think it will be very confusing for > > RDMA apps to have their context suddenly be invalid. I think if we have a way > > for admins to ID who is pinning a file the admin can take more appropriate > > action on those processes. Up to and including killing the process. > > Can RDMA context invalidation, "device disassociate", be inflicted on > a process from the outside? Yes, but it is currently only applied to the entire device - ie you do 'rmmod mlx5_ib' and all the running user space process see that their FD has moved to some error and the device is broken. Targetting the disassociate of only a single FD would be a new thing. Jason
On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote: > > Effectively, we would need a way for an admin to close a specific file > > descriptor (or set of fds) which point to that file. AFAIK there is no way to > > do that at all, is there? > > Even if there were that gets back to my other question, does RDMA > teardown happen at close(fd), or at final fput() of the 'struct > file'? AFAIK there is no kernel side driver hook for close(fd). rdma uses a normal chardev so it's lifetime is linked to the file_ops release, which is called on last fput. So all the mmaps, all the dups, everything must go before it releases its resources. Jason
On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > e.g. Process A has an exclusive layout lease on file F. It does an > IO to file F. The filesystem IO path checks that Process A owns the > lease on the file and so skips straight through layout breaking > because it owns the lease and is allowed to modify the layout. It > then takes the inode metadata locks to allocate new space and write > new data. > > Process B now tries to write to file F. The FS checks whether > Process B owns a layout lease on file F. It doesn't, so then it > tries to break the layout lease so the IO can proceed. The layout > breaking code sees that process A has an exclusive layout lease > granted, and so returns -ETXTBSY to process B - it is not allowed to > break the lease and so the IO fails with -ETXTBSY. This description doesn't match the behaviour that RDMA wants either. Even if Process A has a lease on the file, an IO from Process A which results in blocks being freed from the file is going to result in the RDMA device being able to write to blocks which are now freed (and potentially reallocated to another file).
On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote: > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > That's rather different from the normal meaning of 'exclusive' in the > > > context of locks, which is "only one user can have access to this at > > > a time". > > > > Layout leases are not locks, they are a user access policy object. > > It is the process/fd which holds the lease and it's the process/fd > > that is granted exclusive access. This is exactly the same semantic > > as O_EXCL provides for granting exclusive access to a block device > > via open(), yes? > > This isn't my understanding of how RDMA wants this to work, so we should > probably clear that up before we get too far down deciding what name to > give it. > > For the RDMA usage case, it is entirely possible that both process A > and process B which don't know about each other want to perform RDMA to > file F. So there will be two layout leases active on this file at the > same time. It's fine for IOs to simultaneously be active to both leases. > But if the filesystem wants to move blocks around, it has to break > both leases. > > If Process C tries to do a write to file F without a lease, there's no > problem, unless a side-effect of the write would be to change the block > mapping, in which case either the leases must break first, or the write > must be denied. > > Jason, please correct me if I've misunderstood the RDMA needs here. Yes, I think you've captured it Based on Dave's remarks how frequently a filesystem needs to break the lease will determine if it is usuable to be combined with RDMA or not... Jason
On Thu, Jun 13, 2019 at 8:14 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote: > > > Effectively, we would need a way for an admin to close a specific file > > > descriptor (or set of fds) which point to that file. AFAIK there is no way to > > > do that at all, is there? > > > > Even if there were that gets back to my other question, does RDMA > > teardown happen at close(fd), or at final fput() of the 'struct > > file'? > > AFAIK there is no kernel side driver hook for close(fd). > > rdma uses a normal chardev so it's lifetime is linked to the file_ops > release, which is called on last fput. So all the mmaps, all the dups, > everything must go before it releases its resources. Oh, I must have missed where this conversation started talking about the driver-device fd. I thought we were talking about the close / release of the target file that is MAP_SHARED for the memory registration. A release of the driver fd is orthogonal to coordinating / signalling actions relative to the leased file.
On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny <ira.weiny@intel.com> wrote: > > On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is > > > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > > > > > Agreed. I think software wise there is not much some of the devices can do > > > > > > > > with such an "invalidate". > > > > > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes > > > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting > > > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > > > everything avoids creating inconsistencies that destroying a subset > > > > > > would create. > > > > > > > > > > > > What has been talked about for lease break is not destroying anything > > > > > > but very selectively saying that one memory region linked to the GUP > > > > > > is no longer functional. > > > > > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > > > and closes the file with existing pins (and thus layout lease) we would > > > > > force it to abort everything. Yes, it is disruptive but then the app didn't > > > > > obey the rule that it has to maintain file lease while holding pins. Thus > > > > > such situation should never happen unless the app is malicious / buggy. > > > > > > > > We do have the infrastructure to completely revoke the entire > > > > *content* of a FD (this is called device disassociate). It is > > > > basically close without the app doing close. But again it only works > > > > with some drivers. However, this is more likely something a driver > > > > could support without a HW change though. > > > > > > > > It is quite destructive as it forcibly kills everything RDMA related > > > > the process(es) are doing, but it is less violent than SIGKILL, and > > > > there is perhaps a way for the app to recover from this, if it is > > > > coded for it. > > > > > > I don't think many are... I think most would effectively be "killed" if this > > > happened to them. > > > > > > > > > > > My preference would be to avoid this scenario, but if it is really > > > > necessary, we could probably build it with some work. > > > > > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > > > used and only for an 'emergency' like use case. > > > > > > I'd really like to avoid this as well. I think it will be very confusing for > > > RDMA apps to have their context suddenly be invalid. I think if we have a way > > > for admins to ID who is pinning a file the admin can take more appropriate > > > action on those processes. Up to and including killing the process. > > > > Can RDMA context invalidation, "device disassociate", be inflicted on > > a process from the outside? Identifying the pid of a pin holder only > > leaves SIGKILL of the entire process as the remediation for revoking a > > pin, and I assume admins would use the finer grained invalidation > > where it was available. > > No not in the way you are describing it. As Jason said you can hotplug the > device which is "from the outside" but this would affect all users of that > device. > > Effectively, we would need a way for an admin to close a specific file > descriptor (or set of fds) which point to that file. AFAIK there is no way to > do that at all, is there? You can certainly give the lease holder the option to close the file voluntarily via the siginfo_t that can be attached to a lease break signal. But it's not really "close" you want as much as a finer grained disassociate. All that said you could require the lease taker opt-in to SIGKILL via F_SETSIG before marking the lease "exclusive". That effectively precludes failing truncate, but it's something we can enforce today and work on finer grained / less drastic escalations over time for something that should "never" happen.
On Thu, Jun 13, 2019 at 09:25:54AM -0700, Dan Williams wrote: > On Thu, Jun 13, 2019 at 8:14 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote: > > > > Effectively, we would need a way for an admin to close a specific file > > > > descriptor (or set of fds) which point to that file. AFAIK there is no way to > > > > do that at all, is there? > > > > > > Even if there were that gets back to my other question, does RDMA > > > teardown happen at close(fd), or at final fput() of the 'struct > > > file'? > > > > AFAIK there is no kernel side driver hook for close(fd). > > > > rdma uses a normal chardev so it's lifetime is linked to the file_ops > > release, which is called on last fput. So all the mmaps, all the dups, > > everything must go before it releases its resources. > > Oh, I must have missed where this conversation started talking about > the driver-device fd. In the first paragraph above where Ira is musing about 'close a specific file', he is talking about the driver-device fd. Ie unilaterally closing /dev/uverbs as a punishment for an application that used leases wrong: ie that released its lease with the RDMA is still ongoing. Jason
On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > > Are you suggesting that we have something like this from user space? > > > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > > > Rather than "unbreakable", perhaps a clearer description of the > > > policy it entails is "exclusive"? > > > > > > i.e. what we are talking about here is an exclusive lease that > > > prevents other processes from changing the layout. i.e. the > > > mechanism used to guarantee a lease is exclusive is that the layout > > > becomes "unbreakable" at the filesystem level, but the policy we are > > > actually presenting to uses is "exclusive access"... > > > > That's rather different from the normal meaning of 'exclusive' in the > > context of locks, which is "only one user can have access to this at > > a time". > > > Layout leases are not locks, they are a user access policy object. > It is the process/fd which holds the lease and it's the process/fd > that is granted exclusive access. This is exactly the same semantic > as O_EXCL provides for granting exclusive access to a block device > via open(), yes? > > > As I understand it, this is rather more like a 'shared' or > > 'read' lock. The filesystem would be the one which wants an exclusive > > lock, so it can modify the mapping of logical to physical blocks. > > ISTM that you're conflating internal filesystem implementation with > application visible semantics. Yes, the filesystem uses internal > locks to serialise the modification of the things the lease manages > access too, but that has nothing to do with the access policy the > lease provides to users. > > e.g. Process A has an exclusive layout lease on file F. It does an > IO to file F. The filesystem IO path checks that Process A owns the > lease on the file and so skips straight through layout breaking > because it owns the lease and is allowed to modify the layout. It > then takes the inode metadata locks to allocate new space and write > new data. > > Process B now tries to write to file F. The FS checks whether > Process B owns a layout lease on file F. It doesn't, so then it > tries to break the layout lease so the IO can proceed. The layout > breaking code sees that process A has an exclusive layout lease > granted, and so returns -ETXTBSY to process B - it is not allowed to > break the lease and so the IO fails with -ETXTBSY. > > i.e. the exclusive layout lease prevents other processes from > performing operations that may need to modify the layout from > performing those operations. It does not "lock" the file/inode in > any way, it just changes how the layout lease breaking behaves. Question: Do we expect Process A to get notified that Process B was attempting to change the layout? This changes the exclusivity semantics. While Process A has an exclusive lease it could release it if notified to allow process B temporary exclusivity. Question 2: Do we expect other process' (say Process C) to also be able to map and pin the file? I believe users will need this and for layout purposes it is ok to do so. But this means that Process A does not have "exclusive" access to the lease. So given Process C has also placed a layout lease on the file. Indicating that it does not want the layout to change. Both A and C need to be "broken" by Process B to change the layout. If there is no Process B; A and C can run just fine with a "locked" layout. Ira > > Further, the "exclusiveness" of a layout lease is completely > irrelevant to the filesystem that is indicating that an operation > that may need to modify the layout is about to be performed. All the > filesystem has to do is handle failures to break the lease > appropriately. Yes, XFS serialises the layout lease validation > against other IO to the same file via it's IO locks, but that's an > internal data IO coherency requirement, not anything to do with > layout lease management. > > Note that I talk about /writes/ here. This is interchangable with > any other operation that may need to modify the extent layout of the > file, be it truncate, fallocate, etc: the attempt to break the > layout lease by a non-owner should fail if the lease is "exclusive" > to the owner. > > > The complication being that by default the filesystem has an exclusive > > lock on the mapping, and what we're trying to add is the ability for > > readers to ask the filesystem to give up its exclusive lock. > > The filesystem doesn't even lock the "mapping" until after the > layout lease has been validated or broken. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote: > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > > > Are you suggesting that we have something like this from user space? > > > > > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > > > > > Rather than "unbreakable", perhaps a clearer description of the > > > > policy it entails is "exclusive"? > > > > > > > > i.e. what we are talking about here is an exclusive lease that > > > > prevents other processes from changing the layout. i.e. the > > > > mechanism used to guarantee a lease is exclusive is that the layout > > > > becomes "unbreakable" at the filesystem level, but the policy we are > > > > actually presenting to uses is "exclusive access"... > > > > > > That's rather different from the normal meaning of 'exclusive' in the > > > context of locks, which is "only one user can have access to this at > > > a time". As I understand it, this is rather more like a 'shared' or > > > 'read' lock. The filesystem would be the one which wants an exclusive > > > lock, so it can modify the mapping of logical to physical blocks. > > > > > > The complication being that by default the filesystem has an exclusive > > > lock on the mapping, and what we're trying to add is the ability for > > > readers to ask the filesystem to give up its exclusive lock. > > > > This is an interesting view... > > > > And after some more thought, exclusive does not seem like a good name for this > > because technically F_WRLCK _is_ an exclusive lease... > > > > In addition, the user does not need to take the "exclusive" write lease to be > > notified of (broken by) an unexpected truncate. A "read" lease is broken by > > truncate. (And "write" leases really don't do anything different WRT the > > interaction of the FS and the user app. Write leases control "exclusive" > > access between other file descriptors.) > > I've been assuming that there is only one type of layout lease - > there is no use case I've heard of for read/write layout leases, and > like you say there is zero difference in behaviour at the filesystem > level - they all have to be broken to allow a non-lease truncate to > proceed. > > IMO, taking a "read lease" to be able to modify and write to the > underlying mapping of a file makes absolutely no sense at all. > IOWs, we're talking exaclty about a revokable layout lease vs an > exclusive layout lease here, and so read/write really doesn't match > the policy or semantics we are trying to provide. I humbly disagree, at least depending on how you look at it... :-D The patches as they stand expect the user to take a "read" layout lease which indicates they are currently using "reading" the layout as is. They are not changing ("writing" to) the layout. They then pin pages which locks parts of the layout and therefore they expect no "writers" to change the layout. The "write" layout lease breaks the "read" layout lease indicating that the layout is being written to. Should the layout be pinned in such a way that the layout can't be changed the "layout writer" (truncate) fails. In fact, this is what NFS does right now. The lease it puts on the file is of "read" type. nfs4layouts.c: static int nfsd4_layout_setlease(struct nfs4_layout_stateid *ls) { ... fl->fl_flags = FL_LAYOUT; fl->fl_type = F_RDLCK; ... } I was not changing that much from the NFS patter which meant the break lease code worked. Jans proposal is solid but it means that there is no breaking of the lease. I tried to add an "exclusive" flag to the "write" lease but the __break_lease() code gets weird. I'm not saying it is not possible. Just that I have not seen a good way to do it. > > > Another thing to consider is that this patch set _allows_ a truncate/hole punch > > to proceed _if_ the pages being affected are not actually pinned. So the > > unbreakable/exclusive nature of the lease is not absolute. > > If you're talking about the process that owns the layout lease > running the truncate, then that is fine. > > However, if you are talking about a process that does not own the > layout lease being allowed to truncate a file without first breaking > the layout lease, then that is fundamentally broken. In both cases (local or remote process) the lease is broken prior to the attempt to truncate. > > i.e. If you don't own a layout lease, the layout leases must be > broken before the truncate can proceed. Agreed. > > If it's an exclusive lease, > then you cannot break the lease and the truncate *must fail before > it is started*. i.e. the layout lease state must be correctly > resolved before we start an operation that may modify a file layout. > > Determining if we can actually do the truncate based on page state > occurs /after/ the lease says the truncate can proceed.... That makes a lot of sense and that is the way the patch currently works. I need to think on this some more. Keeping the lease may not be critical. As discussed with Jan; dealing with close() is best dealt with by tracking the actual pins on the file. If that works then we could potentially keep the lease semantics closer to what you and I are talking about here. Ira > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote: > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > e.g. Process A has an exclusive layout lease on file F. It does an > > IO to file F. The filesystem IO path checks that Process A owns the > > lease on the file and so skips straight through layout breaking > > because it owns the lease and is allowed to modify the layout. It > > then takes the inode metadata locks to allocate new space and write > > new data. > > > > Process B now tries to write to file F. The FS checks whether > > Process B owns a layout lease on file F. It doesn't, so then it > > tries to break the layout lease so the IO can proceed. The layout > > breaking code sees that process A has an exclusive layout lease > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > break the lease and so the IO fails with -ETXTBSY. > > This description doesn't match the behaviour that RDMA wants either. > Even if Process A has a lease on the file, an IO from Process A which > results in blocks being freed from the file is going to result in the > RDMA device being able to write to blocks which are now freed (and > potentially reallocated to another file). I don't understand why this would not work for RDMA? As long as the layout does not change the page pins can remain in place. Ira
On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote: > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote: > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > > e.g. Process A has an exclusive layout lease on file F. It does an > > > IO to file F. The filesystem IO path checks that Process A owns the > > > lease on the file and so skips straight through layout breaking > > > because it owns the lease and is allowed to modify the layout. It > > > then takes the inode metadata locks to allocate new space and write > > > new data. > > > > > > Process B now tries to write to file F. The FS checks whether > > > Process B owns a layout lease on file F. It doesn't, so then it > > > tries to break the layout lease so the IO can proceed. The layout > > > breaking code sees that process A has an exclusive layout lease > > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > > break the lease and so the IO fails with -ETXTBSY. > > > > This description doesn't match the behaviour that RDMA wants either. > > Even if Process A has a lease on the file, an IO from Process A which > > results in blocks being freed from the file is going to result in the > > RDMA device being able to write to blocks which are now freed (and > > potentially reallocated to another file). > > I don't understand why this would not work for RDMA? As long as the layout > does not change the page pins can remain in place. Because process A had a layout lease (and presumably a MR) and the layout was still modified in way that invalidates the RDMA MR. Jason
On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote: > On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote: > > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote: > > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > > > e.g. Process A has an exclusive layout lease on file F. It does an > > > > IO to file F. The filesystem IO path checks that Process A owns the > > > > lease on the file and so skips straight through layout breaking > > > > because it owns the lease and is allowed to modify the layout. It > > > > then takes the inode metadata locks to allocate new space and write > > > > new data. > > > > > > > > Process B now tries to write to file F. The FS checks whether > > > > Process B owns a layout lease on file F. It doesn't, so then it > > > > tries to break the layout lease so the IO can proceed. The layout > > > > breaking code sees that process A has an exclusive layout lease > > > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > > > break the lease and so the IO fails with -ETXTBSY. > > > > > > This description doesn't match the behaviour that RDMA wants either. > > > Even if Process A has a lease on the file, an IO from Process A which > > > results in blocks being freed from the file is going to result in the > > > RDMA device being able to write to blocks which are now freed (and > > > potentially reallocated to another file). > > > > I don't understand why this would not work for RDMA? As long as the layout > > does not change the page pins can remain in place. > > Because process A had a layout lease (and presumably a MR) and the > layout was still modified in way that invalidates the RDMA MR. Oh sorry I miss read the above... (got Process A and B mixed up...) Right, but Process A still can't free those blocks because the gup pin exists on them... So yea it can't _just_ be a layout lease which controls this on the "file fd". Ira
On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote: > On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote: > > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote: > > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > > > e.g. Process A has an exclusive layout lease on file F. It does an > > > > IO to file F. The filesystem IO path checks that Process A owns the > > > > lease on the file and so skips straight through layout breaking > > > > because it owns the lease and is allowed to modify the layout. It > > > > then takes the inode metadata locks to allocate new space and write > > > > new data. > > > > > > > > Process B now tries to write to file F. The FS checks whether > > > > Process B owns a layout lease on file F. It doesn't, so then it > > > > tries to break the layout lease so the IO can proceed. The layout > > > > breaking code sees that process A has an exclusive layout lease > > > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > > > break the lease and so the IO fails with -ETXTBSY. > > > > > > This description doesn't match the behaviour that RDMA wants either. > > > Even if Process A has a lease on the file, an IO from Process A which > > > results in blocks being freed from the file is going to result in the > > > RDMA device being able to write to blocks which are now freed (and > > > potentially reallocated to another file). > > > > I don't understand why this would not work for RDMA? As long as the layout > > does not change the page pins can remain in place. > > Because process A had a layout lease (and presumably a MR) and the > layout was still modified in way that invalidates the RDMA MR. The lease holder is allowed to modify the mapping it has a lease over. That's necessary so lease holders can write data into unallocated space in the file. The lease is there to prevent third parties from modifying the layout without the lease holder being informed and taking appropriate action to allow that 3rd party modification to occur. If the lease holder modifies the mapping in a way that causes it's own internal state to screw up, then that's a bug in the lease holder application. Cheers, Dave.
On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote: > On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote: > > On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote: > > > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote: > > > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > > > > e.g. Process A has an exclusive layout lease on file F. It does an > > > > > IO to file F. The filesystem IO path checks that Process A owns the > > > > > lease on the file and so skips straight through layout breaking > > > > > because it owns the lease and is allowed to modify the layout. It > > > > > then takes the inode metadata locks to allocate new space and write > > > > > new data. > > > > > > > > > > Process B now tries to write to file F. The FS checks whether > > > > > Process B owns a layout lease on file F. It doesn't, so then it > > > > > tries to break the layout lease so the IO can proceed. The layout > > > > > breaking code sees that process A has an exclusive layout lease > > > > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > > > > break the lease and so the IO fails with -ETXTBSY. > > > > > > > > This description doesn't match the behaviour that RDMA wants either. > > > > Even if Process A has a lease on the file, an IO from Process A which > > > > results in blocks being freed from the file is going to result in the > > > > RDMA device being able to write to blocks which are now freed (and > > > > potentially reallocated to another file). > > > > > > I don't understand why this would not work for RDMA? As long as the layout > > > does not change the page pins can remain in place. > > > > Because process A had a layout lease (and presumably a MR) and the > > layout was still modified in way that invalidates the RDMA MR. > > The lease holder is allowed to modify the mapping it has a lease > over. That's necessary so lease holders can write data into > unallocated space in the file. The lease is there to prevent third > parties from modifying the layout without the lease holder being > informed and taking appropriate action to allow that 3rd party > modification to occur. > > If the lease holder modifies the mapping in a way that causes it's > own internal state to screw up, then that's a bug in the lease > holder application. Sounds like the lease semantics aren't the right ones for the longterm GUP users then. The point of the longterm GUP is so the pages can be written to, and if the filesystem is going to move the pages around when they're written to, that just won't work.
On Thu, Jun 13, 2019 at 01:34:05PM -0700, Ira Weiny wrote: > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > > > Are you suggesting that we have something like this from user space? > > > > > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > > > > > Rather than "unbreakable", perhaps a clearer description of the > > > > policy it entails is "exclusive"? > > > > > > > > i.e. what we are talking about here is an exclusive lease that > > > > prevents other processes from changing the layout. i.e. the > > > > mechanism used to guarantee a lease is exclusive is that the layout > > > > becomes "unbreakable" at the filesystem level, but the policy we are > > > > actually presenting to uses is "exclusive access"... > > > > > > That's rather different from the normal meaning of 'exclusive' in the > > > context of locks, which is "only one user can have access to this at > > > a time". > > > > > > Layout leases are not locks, they are a user access policy object. > > It is the process/fd which holds the lease and it's the process/fd > > that is granted exclusive access. This is exactly the same semantic > > as O_EXCL provides for granting exclusive access to a block device > > via open(), yes? > > > > > As I understand it, this is rather more like a 'shared' or > > > 'read' lock. The filesystem would be the one which wants an exclusive > > > lock, so it can modify the mapping of logical to physical blocks. > > > > ISTM that you're conflating internal filesystem implementation with > > application visible semantics. Yes, the filesystem uses internal > > locks to serialise the modification of the things the lease manages > > access too, but that has nothing to do with the access policy the > > lease provides to users. > > > > e.g. Process A has an exclusive layout lease on file F. It does an > > IO to file F. The filesystem IO path checks that Process A owns the > > lease on the file and so skips straight through layout breaking > > because it owns the lease and is allowed to modify the layout. It > > then takes the inode metadata locks to allocate new space and write > > new data. > > > > Process B now tries to write to file F. The FS checks whether > > Process B owns a layout lease on file F. It doesn't, so then it > > tries to break the layout lease so the IO can proceed. The layout > > breaking code sees that process A has an exclusive layout lease > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > break the lease and so the IO fails with -ETXTBSY. > > > > i.e. the exclusive layout lease prevents other processes from > > performing operations that may need to modify the layout from > > performing those operations. It does not "lock" the file/inode in > > any way, it just changes how the layout lease breaking behaves. > > Question: Do we expect Process A to get notified that Process B was attempting > to change the layout? In which case? In the non-exclusive case, yes, the lease gets recalled and the application needs to play nice and release it's references and drop the lease. In the exclusive case, no. The application has said "I don't play nice with others" and so we basically tell process B to get stuffed and process A can continue onwards oblivious to the wreckage it leaves behind.... > This changes the exclusivity semantics. While Process A has an exclusive lease > it could release it if notified to allow process B temporary exclusivity. And then it's not an exclusive lease - it's just a normal layout lease. Process B -does not need a lease- to write to the file. All the layout lease does is provide notification to applications that rely on the layout of the file being under their control that someone else is about to modify the layout. The lease holder that "plays nice" then releases the layout and drops it's lease, allowing process B to begin it's operation. Process A then immediately takes a new layout lease, and remaps the file layout via FIEMAP or by creating a new RDMA MR for the mmap region. THose operations get serialised by the filesystem because the operation being run by process B is run atomically w.r.t. the original lease being broken. Hence the new mapping that process A gets with it's new lease reflects whatever change was made by process B. IOWs, the "normal" layout lease recall behaviour provides "temporary exclusivity" for third parties. If you are able to release leases temporarily and regain them then there is no need for an exclusive lease. > Question 2: Do we expect other process' (say Process C) to also be able to map > and pin the file? I believe users will need this and for layout purposes it is > ok to do so. But this means that Process A does not have "exclusive" access to > the lease. This is an application architecture problem, not a layout lease or filesystem problem. :) i.e. if you have a single process controlling all the RDMA mappings, then you can use exclusive leases. If you have multiple processes that are uncoordinated and all require layout access to the same file then you can't use exclusive layout leases in the application. i.e. your application has to play nice with others. Indeed, this is more than a application architecture problem - it's actually a system wide architecture problem. e.g. the pNFS server cannot use exclusive layout leases because it has to play nice with anything else on the local filesystem that might require a layout lease. An example of this woudl be an app that provides coherent RDMA access to the same storage that pNFS is sharing (e.g. a userspace CIFS server). Hence I see that exclusive layout leases will end up being the exception rather than the norm, because most applications will need to play nice with other applications on the system that also directly access the storage under the filesystem.... > So given Process C has also placed a layout lease on the file. Indicating > that it does not want the layout to change. That is *not what layout leases provide*. Layout leases grant the owner the ability to map the layout and directly access the underlying storage and to do it safely because they will get a notification of 3rd party access that will invalidate their mapping. Layout leases do not prevent anyone from _changing_ the layout and, in fact, pNFS _requires_ the lease holder to be able to modify the layout. IOWs, the layout lease _as it stands now_ is a notification mechanism that tells the lease owner when someone else is about to modify the layout. It does not make the file layout immutable. The "exclusive" aspect of layout we have been discussing is a mechanism that prevents 3rd party modification of the layout by denying the ability to break the layout. This "exclusive" aspect does not make the layout immutable, either, it just means the layout is only modifiable by the exclusive lease holder. Cheers, Dave.
On Thu, Jun 13, 2019 at 07:31:07PM -0700, Matthew Wilcox wrote: > On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote: > > If the lease holder modifies the mapping in a way that causes it's > > own internal state to screw up, then that's a bug in the lease > > holder application. > > Sounds like the lease semantics aren't the right ones for the longterm > GUP users then. The point of the longterm GUP is so the pages can be > written to, and if the filesystem is going to move the pages around when > they're written to, that just won't work. And now we go full circle back to the constraints we decided on long ago because we can't rely on demand paging RDMA hardware any time soon to do everything we need to transparently support long-term GUP on file-backed mappings. i.e.: RDMA to file backed mappings must first preallocate and write zeros to the range of the file they are mapping so that the filesystem block mapping is complete and static for the life of the RDMA mapping that will pin it. IOWs, the layout lease will tell the RDMA application that the static setup it has already done to work correctly with a file backed mapping may be about to be broken by a third party..... -Dave.
On Thu, Jun 13, 2019 at 01:34:06PM -0700, Ira Weiny wrote: > On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote: > > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote: > > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote: > > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote: > > > > > > Are you suggesting that we have something like this from user space? > > > > > > > > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE); > > > > > > > > > > Rather than "unbreakable", perhaps a clearer description of the > > > > > policy it entails is "exclusive"? > > > > > > > > > > i.e. what we are talking about here is an exclusive lease that > > > > > prevents other processes from changing the layout. i.e. the > > > > > mechanism used to guarantee a lease is exclusive is that the layout > > > > > becomes "unbreakable" at the filesystem level, but the policy we are > > > > > actually presenting to uses is "exclusive access"... > > > > > > > > That's rather different from the normal meaning of 'exclusive' in the > > > > context of locks, which is "only one user can have access to this at > > > > a time". As I understand it, this is rather more like a 'shared' or > > > > 'read' lock. The filesystem would be the one which wants an exclusive > > > > lock, so it can modify the mapping of logical to physical blocks. > > > > > > > > The complication being that by default the filesystem has an exclusive > > > > lock on the mapping, and what we're trying to add is the ability for > > > > readers to ask the filesystem to give up its exclusive lock. > > > > > > This is an interesting view... > > > > > > And after some more thought, exclusive does not seem like a good name for this > > > because technically F_WRLCK _is_ an exclusive lease... > > > > > > In addition, the user does not need to take the "exclusive" write lease to be > > > notified of (broken by) an unexpected truncate. A "read" lease is broken by > > > truncate. (And "write" leases really don't do anything different WRT the > > > interaction of the FS and the user app. Write leases control "exclusive" > > > access between other file descriptors.) > > > > I've been assuming that there is only one type of layout lease - > > there is no use case I've heard of for read/write layout leases, and > > like you say there is zero difference in behaviour at the filesystem > > level - they all have to be broken to allow a non-lease truncate to > > proceed. > > > > IMO, taking a "read lease" to be able to modify and write to the > > underlying mapping of a file makes absolutely no sense at all. > > IOWs, we're talking exaclty about a revokable layout lease vs an > > exclusive layout lease here, and so read/write really doesn't match > > the policy or semantics we are trying to provide. > > I humbly disagree, at least depending on how you look at it... :-D > > The patches as they stand expect the user to take a "read" layout lease which > indicates they are currently using "reading" the layout as is. > They are not > changing ("writing" to) the layout. As I said in a another email in the thread, a layout lease does not make the layout "read only". It just means the lease owner will be notified when someone else is about to modify it. The lease owner can modify the mapping themselves, and they will not get notified about their own modifications. > They then pin pages which locks parts of > the layout and therefore they expect no "writers" to change the layout. Except they can change the layout themselves. It's perfectly valid to get a layout lease, write() from offset 0 to EOF and fsync() to intiialise the file and allocate all the space in the file, then mmap() it and hand to off to RMDA, all while holding the layout lease. > The "write" layout lease breaks the "read" layout lease indicating that the > layout is being written to. Layout leases do not work this way. > In fact, this is what NFS does right now. The lease it puts on the file is of > "read" type. > > nfs4layouts.c: > static int > nfsd4_layout_setlease(struct nfs4_layout_stateid *ls) > { > ... > fl->fl_flags = FL_LAYOUT; > fl->fl_type = F_RDLCK; > ... > } Yes, the existing /implementation/ uses F_RDLCK, but that doesn't mean the layout is "read only". Look at the pNFS mapping layout code - the ->map_blocks export operation: int (*map_blocks)(struct inode *inode, loff_t offset, u64 len, struct iomap *iomap, bool write, u32 *device_generation); ^^^^^^^^^^ Yup, it has a write variable that, when set, causes the filesystem to _allocate_ blocks if the range to be written to falls over a hole in the file. IOWs, a pNFS layout lease can modify the file layout - you're conflating use of a "read lock" API to mean that what the lease _manages_ is "read only". That is not correct. Layouts are /always writeable/ by the lease owner(s), the question here is what we do with third parties attempting to modify a layout covered by an "exclusive" layout lease. Hence, I'll repeat: > > we're talking exaclty about a revokable layout lease vs an > > exclusive layout lease here, and so read/write really doesn't match > > the policy or semantics we are trying to provide. Cheers, Dave.
On Thu 13-06-19 08:27:55, Matthew Wilcox wrote: > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote: > > e.g. Process A has an exclusive layout lease on file F. It does an > > IO to file F. The filesystem IO path checks that Process A owns the > > lease on the file and so skips straight through layout breaking > > because it owns the lease and is allowed to modify the layout. It > > then takes the inode metadata locks to allocate new space and write > > new data. > > > > Process B now tries to write to file F. The FS checks whether > > Process B owns a layout lease on file F. It doesn't, so then it > > tries to break the layout lease so the IO can proceed. The layout > > breaking code sees that process A has an exclusive layout lease > > granted, and so returns -ETXTBSY to process B - it is not allowed to > > break the lease and so the IO fails with -ETXTBSY. > > This description doesn't match the behaviour that RDMA wants either. > Even if Process A has a lease on the file, an IO from Process A which > results in blocks being freed from the file is going to result in the > RDMA device being able to write to blocks which are now freed (and > potentially reallocated to another file). I think you're partially wrong here. You are correct that the lease won't stop process A from doing truncate on the file. *But* there are still page pins in existence so truncate will block on waiting for these pins to go away (after all this is a protection that guards all short-term page pin users). So there is no problem with blocks being freed under the RDMA app. Yes, the app will effectively deadlock and sysadmin has to kill it. IMO an acceptable answer for doing something stupid and unsupportable... Honza
From: Ira Weiny <ira.weiny@intel.com> ... V1,000,000 ;-) Pre-requisites: John Hubbard's put_user_pages() patch series.[1] Jan Kara's ext4_break_layouts() fixes[2] Based on the feedback from LSFmm and the LWN article which resulted. I've decided to take a slightly different tack on this problem. The real issue is that there is no use case for a user to have RDMA pinn'ed memory which is then truncated. So really any solution we present which: A) Prevents file system corruption or data leaks ...and... B) Informs the user that they did something wrong Should be an acceptable solution. Because this is slightly new behavior. And because this is gonig to be specific to DAX (because of the lack of a page cache) we have made the user "opt in" to this behavior. The following patches implement the following solution. 1) The user has to opt in to allowing GUP pins on a file with a layout lease (now made visible). 2) GUP will fail (EPERM) if a layout lease is not taken 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. 4) The user has the option of holding the layout lease to receive a SIGIO for notification to the original thread that another thread has tried to delete their data. Furthermore this indicates that if the user needs to GUP the file again they will need to retake the Layout lease before doing so. NOTE: If the user releases the layout lease or if it has been broken by another operation further GUP operations on the file will fail without re-taking the lease. This means that if a user would like to register pieces of a file and continue to register other pieces later they would be advised to keep the layout lease, get a SIGIO notification, and retake the lease. NOTE2: Truncation of pages which are not actively pinned will succeed. Similar to accessing an mmap to this area GUP pins of that memory may fail. A general overview follows for background. It should be noted that one solution for this problem is to use RDMA's On Demand Paging (ODP). There are 2 big reasons this may not work. 1) The hardware being used for RDMA may not support ODP 2) ODP may be detrimental to the over all network (cluster or cloud) performance Therefore, in order to support RDMA to File system pages without On Demand Paging (ODP) a number of things need to be done. 1) GUP "longterm" users need to inform the other subsystems that they have taken a pin on a page which may remain pinned for a very "long time".[3] 2) Any page which is "controlled" by a file system needs to have special handling. The details of the handling depends on if the page is page cache fronted or not. 2a) A page cache fronted page which has been pinned by GUP long term can use a bounce buffer to allow the file system to write back snap shots of the page. This is handled by the FS recognizing the GUP long term pin and making a copy of the page to be written back. NOTE: this patch set does not address this path. 2b) A FS "controlled" page which is not page cache fronted is either easier to deal with or harder depending on the operation the filesystem is trying to do. 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the FS can no longer use the pages in question until the pin has been removed. This patch set presents a solution to this by introducing some reasonable restrictions on user space applications. 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch then there is nothing which need be done. Data is Read or Written directly to the page. This is an easy case which would currently work if not for GUP long term pins being disabled. Therefore this patch set need not change access to the file data but does allow for GUP pins after 2ba above is dealt with. This patch series and presents a solution for problem 2ba) [1] https://github.com/johnhubbard/linux/tree/gup_dma_core [2] ext4/dev branch: - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev Specific patches: [2a] ext4: wait for outstanding dio during truncate in nojournal mode - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8 [2b] ext4: do not delete unlinked inode from orphan list on failed truncate - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14 [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c [3] The definition of long time is debatable but it has been established that RDMAs use of pages, minutes or hours after the pin is the extreme case which makes this problem most severe. Ira Weiny (10): fs/locks: Add trace_leases_conflict fs/locks: Export F_LAYOUT lease to user space mm/gup: Pass flags down to __gup_device_huge* calls mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages fs/ext4: Teach ext4 to break layout leases fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range fs/ext4: Fail truncate if pages are GUP pinned fs/xfs: Teach xfs to use new dax_layout_busy_page() fs/xfs: Fail truncate if pages are GUP pinned mm/gup: Remove FOLL_LONGTERM DAX exclusion fs/Kconfig | 1 + fs/dax.c | 38 ++++++--- fs/ext4/ext4.h | 2 +- fs/ext4/extents.c | 6 +- fs/ext4/inode.c | 26 +++++-- fs/locks.c | 97 ++++++++++++++++++++--- fs/xfs/xfs_file.c | 24 ++++-- fs/xfs/xfs_inode.h | 5 +- fs/xfs/xfs_ioctl.c | 15 +++- fs/xfs/xfs_iops.c | 14 +++- fs/xfs/xfs_pnfs.c | 14 ++-- include/linux/dax.h | 9 ++- include/linux/fs.h | 2 +- include/linux/mm.h | 2 + include/trace/events/filelock.h | 35 +++++++++ include/uapi/asm-generic/fcntl.h | 3 + mm/gup.c | 129 ++++++++++++------------------- mm/huge_memory.c | 12 +++ 18 files changed, 299 insertions(+), 135 deletions(-)