Message ID | 20190809225833.6657-1-ira.weiny@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | RDMA/FS DAX truncate proposal V1,000,002 ;-) | expand |
Hello! On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > Pre-requisites > ============== > Based on mmotm tree. > > Based on the feedback from LSFmm, the LWN article, the RFC series since > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > Solution summary > ================ > > The real issue is that there is no use case for a user to have RDMA pinn'ed > memory which is then truncated. So really any solution we present which: > > A) Prevents file system corruption or data leaks > ...and... > B) Informs the user that they did something wrong > > Should be an acceptable solution. > > Because this is slightly new behavior. And because this is going to be > specific to DAX (because of the lack of a page cache) we have made the user > "opt in" to this behavior. > > The following patches implement the following solution. > > 0) Registrations to Device DAX char devs are not affected > > 1) The user has to opt in to allowing page pins on a file with an exclusive > layout lease. Both exclusive and layout lease flags are user visible now. > > 2) page pins will fail if the lease is not active when the file back page is > encountered. > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. So I didn't fully grok the patch set yet but by "pinned DAX page" do you mean a page which has corresponding file_pin covering it? Or do you mean a page which has pincount increased? If the first then I'd rephrase this to be less ambiguous, if the second then I think it is wrong. > 4) The user has the option of holding the lease or releasing it. If they > release it no other pin calls will work on the file. Last time we spoke the plan was that the lease is kept while the pages are pinned (and an attempt to release the lease would block until the pages are unpinned). That also makes it clear that the *lease* is what is making truncate and hole punch fail with ETXTBUSY and the file_pin structure is just an implementation detail how the existence is efficiently tracked (and what keeps the backing file for the pages open so that the lease does not get auto-destroyed). Why did you change this? > 5) Closing the file is ok. > > 6) Unmapping the file is ok > > 7) Pins against the files are tracked back to an owning file or an owning mm > depending on the internal subsystem needs. With RDMA there is an owning > file which is related to the pined file. > > 8) Only RDMA is currently supported If you currently only need "owning file" variant in your patch set, then I'd just implement that and leave "owning mm" variant for later if it proves to be necessary. The things are complex enough as is... > 9) Truncation of pages which are not actively pinned nor covered by a lease > will succeed. Otherwise I like the design. Honza
On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > Hello! > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > > Pre-requisites > > ============== > > Based on mmotm tree. > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since > > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > > > Solution summary > > ================ > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > memory which is then truncated. So really any solution we present which: > > > > A) Prevents file system corruption or data leaks > > ...and... > > B) Informs the user that they did something wrong > > > > Should be an acceptable solution. > > > > Because this is slightly new behavior. And because this is going to be > > specific to DAX (because of the lack of a page cache) we have made the user > > "opt in" to this behavior. > > > > The following patches implement the following solution. > > > > 0) Registrations to Device DAX char devs are not affected > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive > > layout lease. Both exclusive and layout lease flags are user visible now. > > > > 2) page pins will fail if the lease is not active when the file back page is > > encountered. > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you > mean a page which has corresponding file_pin covering it? Or do you mean a > page which has pincount increased? If the first then I'd rephrase this to > be less ambiguous, if the second then I think it is wrong. I mean the second. but by "fail" I mean hang. Right now the "normal" page pincount processing will hang the truncate. Given the discussion with John H we can make this a bit better if we use something like FOLL_PIN and the page count bias to indicate this type of pin. Then I could fail the truncate outright. but that is not done yet. so... I used the word "fail" to be a bit more vague as the final implementation may return ETXTBUSY or hang as noted. > > > 4) The user has the option of holding the lease or releasing it. If they > > release it no other pin calls will work on the file. > > Last time we spoke the plan was that the lease is kept while the pages are > pinned (and an attempt to release the lease would block until the pages are > unpinned). That also makes it clear that the *lease* is what is making > truncate and hole punch fail with ETXTBUSY and the file_pin structure is > just an implementation detail how the existence is efficiently tracked (and > what keeps the backing file for the pages open so that the lease does not > get auto-destroyed). Why did you change this? closing the file _and_ unmaping it will cause the lease to be released regardless of if we allow this or not. As we discussed preventing the close seemed intractable. I thought about failing the munmap but that seemed wrong as well. But more importantly AFAIK RDMA can pass its memory pins to other processes via FD passing... This means that one could pin this memory, pass it to another process and exit. The file lease on the pin'ed file is lost. The file lease is just a key to get the memory pin. Once unlocked the procfs tracking keeps track of where that pin goes and which processes need to be killed to get rid of it. > > > 5) Closing the file is ok. > > > > 6) Unmapping the file is ok > > > > 7) Pins against the files are tracked back to an owning file or an owning mm > > depending on the internal subsystem needs. With RDMA there is an owning > > file which is related to the pined file. > > > > 8) Only RDMA is currently supported > > If you currently only need "owning file" variant in your patch set, then > I'd just implement that and leave "owning mm" variant for later if it > proves to be necessary. The things are complex enough as is... I can do that... I was trying to get io_uring working as well with the owning_mm but I should save that for later. > > > 9) Truncation of pages which are not actively pinned nor covered by a lease > > will succeed. > > Otherwise I like the design. Thanks, Ira > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Wed 14-08-19 11:08:49, Ira Weiny wrote: > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > Hello! > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > > > Pre-requisites > > > ============== > > > Based on mmotm tree. > > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > > > > > Solution summary > > > ================ > > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > > memory which is then truncated. So really any solution we present which: > > > > > > A) Prevents file system corruption or data leaks > > > ...and... > > > B) Informs the user that they did something wrong > > > > > > Should be an acceptable solution. > > > > > > Because this is slightly new behavior. And because this is going to be > > > specific to DAX (because of the lack of a page cache) we have made the user > > > "opt in" to this behavior. > > > > > > The following patches implement the following solution. > > > > > > 0) Registrations to Device DAX char devs are not affected > > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive > > > layout lease. Both exclusive and layout lease flags are user visible now. > > > > > > 2) page pins will fail if the lease is not active when the file back page is > > > encountered. > > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you > > mean a page which has corresponding file_pin covering it? Or do you mean a > > page which has pincount increased? If the first then I'd rephrase this to > > be less ambiguous, if the second then I think it is wrong. > > I mean the second. but by "fail" I mean hang. Right now the "normal" page > pincount processing will hang the truncate. Given the discussion with John H > we can make this a bit better if we use something like FOLL_PIN and the page > count bias to indicate this type of pin. Then I could fail the truncate > outright. but that is not done yet. > > so... I used the word "fail" to be a bit more vague as the final implementation > may return ETXTBUSY or hang as noted. Ah, OK. Hanging is fine in principle but with longterm pins, your work makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that e.g. DIO will use page pins as well for its buffers and we must wait there until the pin is released. So please just clarify your 'fail' here a bit :). > > > 4) The user has the option of holding the lease or releasing it. If they > > > release it no other pin calls will work on the file. > > > > Last time we spoke the plan was that the lease is kept while the pages are > > pinned (and an attempt to release the lease would block until the pages are > > unpinned). That also makes it clear that the *lease* is what is making > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is > > just an implementation detail how the existence is efficiently tracked (and > > what keeps the backing file for the pages open so that the lease does not > > get auto-destroyed). Why did you change this? > > closing the file _and_ unmaping it will cause the lease to be released > regardless of if we allow this or not. > > As we discussed preventing the close seemed intractable. Yes, preventing the application from closing the file is difficult. But from a quick look at your patches it seemed to me that you actually hold a backing file reference from the file_pin structure thus even though the application closes its file descriptor, the struct file (and thus the lease) lives further until the file_pin gets released. And that should last as long as the pages are pinned. Am I missing something? > I thought about failing the munmap but that seemed wrong as well. But more > importantly AFAIK RDMA can pass its memory pins to other processes via FD > passing... This means that one could pin this memory, pass it to another > process and exit. The file lease on the pin'ed file is lost. Not if file_pin grabs struct file reference as I mentioned above... > The file lease is just a key to get the memory pin. Once unlocked the procfs > tracking keeps track of where that pin goes and which processes need to be > killed to get rid of it. I think having file lease being just a key to get the pin is conceptually wrong. The lease is what expresses: "I'm accessing these blocks directly, don't touch them without coordinating with me." So it would be only natural if we maintained the lease while we are accessing blocks instead of transferring this protection responsibility to another structure - namely file_pin - and letting the lease go. But maybe I miss some technical reason why maintaining file lease is difficult. If that's the case, I'd like to hear what... > > > 5) Closing the file is ok. > > > > > > 6) Unmapping the file is ok > > > > > > 7) Pins against the files are tracked back to an owning file or an owning mm > > > depending on the internal subsystem needs. With RDMA there is an owning > > > file which is related to the pined file. > > > > > > 8) Only RDMA is currently supported > > > > If you currently only need "owning file" variant in your patch set, then > > I'd just implement that and leave "owning mm" variant for later if it > > proves to be necessary. The things are complex enough as is... > > I can do that... I was trying to get io_uring working as well with the > owning_mm but I should save that for later. Ah, OK. Yes, I guess io_uring can be next step. Honza
On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > Hello! > > > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > > > > Pre-requisites > > > > ============== > > > > Based on mmotm tree. > > > > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > > > > > > > Solution summary > > > > ================ > > > > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > > > memory which is then truncated. So really any solution we present which: > > > > > > > > A) Prevents file system corruption or data leaks > > > > ...and... > > > > B) Informs the user that they did something wrong > > > > > > > > Should be an acceptable solution. > > > > > > > > Because this is slightly new behavior. And because this is going to be > > > > specific to DAX (because of the lack of a page cache) we have made the user > > > > "opt in" to this behavior. > > > > > > > > The following patches implement the following solution. > > > > > > > > 0) Registrations to Device DAX char devs are not affected > > > > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive > > > > layout lease. Both exclusive and layout lease flags are user visible now. > > > > > > > > 2) page pins will fail if the lease is not active when the file back page is > > > > encountered. > > > > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. > > > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you > > > mean a page which has corresponding file_pin covering it? Or do you mean a > > > page which has pincount increased? If the first then I'd rephrase this to > > > be less ambiguous, if the second then I think it is wrong. > > > > I mean the second. but by "fail" I mean hang. Right now the "normal" page > > pincount processing will hang the truncate. Given the discussion with John H > > we can make this a bit better if we use something like FOLL_PIN and the page > > count bias to indicate this type of pin. Then I could fail the truncate > > outright. but that is not done yet. > > > > so... I used the word "fail" to be a bit more vague as the final implementation > > may return ETXTBUSY or hang as noted. > > Ah, OK. Hanging is fine in principle but with longterm pins, your work > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that > e.g. DIO will use page pins as well for its buffers and we must wait there > until the pin is released. So please just clarify your 'fail' here a bit > :). It will fail with ETXTBSY. I've fixed a bug... See below. > > > > > 4) The user has the option of holding the lease or releasing it. If they > > > > release it no other pin calls will work on the file. > > > > > > Last time we spoke the plan was that the lease is kept while the pages are > > > pinned (and an attempt to release the lease would block until the pages are > > > unpinned). That also makes it clear that the *lease* is what is making > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is > > > just an implementation detail how the existence is efficiently tracked (and > > > what keeps the backing file for the pages open so that the lease does not > > > get auto-destroyed). Why did you change this? > > > > closing the file _and_ unmaping it will cause the lease to be released > > regardless of if we allow this or not. > > > > As we discussed preventing the close seemed intractable. > > Yes, preventing the application from closing the file is difficult. But > from a quick look at your patches it seemed to me that you actually hold a > backing file reference from the file_pin structure thus even though the > application closes its file descriptor, the struct file (and thus the > lease) lives further until the file_pin gets released. And that should last > as long as the pages are pinned. Am I missing something? > > > I thought about failing the munmap but that seemed wrong as well. But more > > importantly AFAIK RDMA can pass its memory pins to other processes via FD > > passing... This means that one could pin this memory, pass it to another > > process and exit. The file lease on the pin'ed file is lost. > > Not if file_pin grabs struct file reference as I mentioned above... > > > The file lease is just a key to get the memory pin. Once unlocked the procfs > > tracking keeps track of where that pin goes and which processes need to be > > killed to get rid of it. > > I think having file lease being just a key to get the pin is conceptually > wrong. The lease is what expresses: "I'm accessing these blocks directly, > don't touch them without coordinating with me." So it would be only natural > if we maintained the lease while we are accessing blocks instead of > transferring this protection responsibility to another structure - namely > file_pin - and letting the lease go. We do transfer that protection to the file_pin but we don't have to "let the lease" go. We just keep the lease with the file_pin as you said. See below... > But maybe I miss some technical reason > why maintaining file lease is difficult. If that's the case, I'd like to hear > what... Ok, I've thought a bit about what you said and indeed it should work that way. The reason I had to think a bit is that I was not sure why I thought we needed to hang... Turns out there were a couple of reasons... 1 not so good and 1 ok but still not good enough to allow this... 1) I had a bug in the XFS code which should have failed rather than hanging... So this was not a good reason... And I was able to find/fix it... Thanks! 2) Second reason is that I thought I did not have a good way to tell if the lease was actually in use. What I mean is that letting the lease go should be ok IFF we don't have any pins... I was thinking that without John's code we don't have a way to know if there are any pins... But that is wrong... All we have to do is check !list_empty(file->file_pins) So now with this detail I think you are right, we should be able to hold the lease through the struct file even if the process no longer has any "references" to it (ie closes and munmaps the file). I'm going to add a patch to fail releasing the lease and remove this (item 4) as part of the overall solution. > > > > > 5) Closing the file is ok. > > > > > > > > 6) Unmapping the file is ok > > > > > > > > 7) Pins against the files are tracked back to an owning file or an owning mm > > > > depending on the internal subsystem needs. With RDMA there is an owning > > > > file which is related to the pined file. > > > > > > > > 8) Only RDMA is currently supported > > > > > > If you currently only need "owning file" variant in your patch set, then > > > I'd just implement that and leave "owning mm" variant for later if it > > > proves to be necessary. The things are complex enough as is... > > > > I can do that... I was trying to get io_uring working as well with the > > owning_mm but I should save that for later. > > Ah, OK. Yes, I guess io_uring can be next step. FWIW I have split the mm_struct stuff out. I can keep it as a follow on series for other users later. At this point I have to solve the issue Jason brought up WRT the RDMA file reference counting. Thanks! Ira
On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote: > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > > Hello! > > > > > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > > > > > Pre-requisites > > > > > ============== > > > > > Based on mmotm tree. > > > > > > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since > > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > > > > > > > > > Solution summary > > > > > ================ > > > > > > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > > > > memory which is then truncated. So really any solution we present which: > > > > > > > > > > A) Prevents file system corruption or data leaks > > > > > ...and... > > > > > B) Informs the user that they did something wrong > > > > > > > > > > Should be an acceptable solution. > > > > > > > > > > Because this is slightly new behavior. And because this is going to be > > > > > specific to DAX (because of the lack of a page cache) we have made the user > > > > > "opt in" to this behavior. > > > > > > > > > > The following patches implement the following solution. > > > > > > > > > > 0) Registrations to Device DAX char devs are not affected > > > > > > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive > > > > > layout lease. Both exclusive and layout lease flags are user visible now. > > > > > > > > > > 2) page pins will fail if the lease is not active when the file back page is > > > > > encountered. > > > > > > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. > > > > > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you > > > > mean a page which has corresponding file_pin covering it? Or do you mean a > > > > page which has pincount increased? If the first then I'd rephrase this to > > > > be less ambiguous, if the second then I think it is wrong. > > > > > > I mean the second. but by "fail" I mean hang. Right now the "normal" page > > > pincount processing will hang the truncate. Given the discussion with John H > > > we can make this a bit better if we use something like FOLL_PIN and the page > > > count bias to indicate this type of pin. Then I could fail the truncate > > > outright. but that is not done yet. > > > > > > so... I used the word "fail" to be a bit more vague as the final implementation > > > may return ETXTBUSY or hang as noted. > > > > Ah, OK. Hanging is fine in principle but with longterm pins, your work > > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that > > e.g. DIO will use page pins as well for its buffers and we must wait there > > until the pin is released. So please just clarify your 'fail' here a bit > > :). > > It will fail with ETXTBSY. I've fixed a bug... See below. > > > > > > > > 4) The user has the option of holding the lease or releasing it. If they > > > > > release it no other pin calls will work on the file. > > > > > > > > Last time we spoke the plan was that the lease is kept while the pages are > > > > pinned (and an attempt to release the lease would block until the pages are > > > > unpinned). That also makes it clear that the *lease* is what is making > > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is > > > > just an implementation detail how the existence is efficiently tracked (and > > > > what keeps the backing file for the pages open so that the lease does not > > > > get auto-destroyed). Why did you change this? > > > > > > closing the file _and_ unmaping it will cause the lease to be released > > > regardless of if we allow this or not. > > > > > > As we discussed preventing the close seemed intractable. > > > > Yes, preventing the application from closing the file is difficult. But > > from a quick look at your patches it seemed to me that you actually hold a > > backing file reference from the file_pin structure thus even though the > > application closes its file descriptor, the struct file (and thus the > > lease) lives further until the file_pin gets released. And that should last > > as long as the pages are pinned. Am I missing something? > > > > > I thought about failing the munmap but that seemed wrong as well. But more > > > importantly AFAIK RDMA can pass its memory pins to other processes via FD > > > passing... This means that one could pin this memory, pass it to another > > > process and exit. The file lease on the pin'ed file is lost. > > > > Not if file_pin grabs struct file reference as I mentioned above... > > > > > The file lease is just a key to get the memory pin. Once unlocked the procfs > > > tracking keeps track of where that pin goes and which processes need to be > > > killed to get rid of it. > > > > I think having file lease being just a key to get the pin is conceptually > > wrong. The lease is what expresses: "I'm accessing these blocks directly, > > don't touch them without coordinating with me." So it would be only natural > > if we maintained the lease while we are accessing blocks instead of > > transferring this protection responsibility to another structure - namely > > file_pin - and letting the lease go. > > We do transfer that protection to the file_pin but we don't have to "let the > lease" go. We just keep the lease with the file_pin as you said. See below... > > > But maybe I miss some technical reason > > why maintaining file lease is difficult. If that's the case, I'd like to hear > > what... > > Ok, I've thought a bit about what you said and indeed it should work that way. > The reason I had to think a bit is that I was not sure why I thought we needed > to hang... Turns out there were a couple of reasons... 1 not so good and 1 ok > but still not good enough to allow this... > > 1) I had a bug in the XFS code which should have failed rather than hanging... > So this was not a good reason... And I was able to find/fix it... Thanks! > > 2) Second reason is that I thought I did not have a good way to tell if the > lease was actually in use. What I mean is that letting the lease go should > be ok IFF we don't have any pins... I was thinking that without John's code > we don't have a way to know if there are any pins... But that is wrong... > All we have to do is check > > !list_empty(file->file_pins) Oops... I got my "struct files" mixed up... The RDMA struct file has the file_pins hanging off it... This will not work. I'll have to try something else to prevent this. However, I don't want to walk all the pages of the inode. Also I'm concerned about just failing if they happen to be pinned. They need to be LONGTERM pinned... Otherwise we might have a transient failure of an unlock based on some internal kernel transient pin... :-/ Ira > > So now with this detail I think you are right, we should be able to hold the > lease through the struct file even if the process no longer has any > "references" to it (ie closes and munmaps the file). > > I'm going to add a patch to fail releasing the lease and remove this (item 4) > as part of the overall solution. > > > > > > > > 5) Closing the file is ok. > > > > > > > > > > 6) Unmapping the file is ok > > > > > > > > > > 7) Pins against the files are tracked back to an owning file or an owning mm > > > > > depending on the internal subsystem needs. With RDMA there is an owning > > > > > file which is related to the pined file. > > > > > > > > > > 8) Only RDMA is currently supported > > > > > > > > If you currently only need "owning file" variant in your patch set, then > > > > I'd just implement that and leave "owning mm" variant for later if it > > > > proves to be necessary. The things are complex enough as is... > > > > > > I can do that... I was trying to get io_uring working as well with the > > > owning_mm but I should save that for later. > > > > Ah, OK. Yes, I guess io_uring can be next step. > > FWIW I have split the mm_struct stuff out. I can keep it as a follow on series > for other users later. At this point I have to solve the issue Jason brought > up WRT the RDMA file reference counting. > > Thanks! > Ira > > _______________________________________________ > Linux-nvdimm mailing list > Linux-nvdimm@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > 2) Second reason is that I thought I did not have a good way to tell if the > lease was actually in use. What I mean is that letting the lease go should > be ok IFF we don't have any pins... I was thinking that without John's code > we don't have a way to know if there are any pins... But that is wrong... > All we have to do is check > > !list_empty(file->file_pins) > > So now with this detail I think you are right, we should be able to hold the > lease through the struct file even if the process no longer has any > "references" to it (ie closes and munmaps the file). I really, really dislike the idea of zombie layout leases. It's a nasty hack for poor application behaviour. This is a "we allow use after layout lease release" API, and I think encoding largely untraceable zombie objects into an API is very poor design. From the fcntl man page: LEASES Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) re‐ fer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed. Leases are associated with *open* file descriptors, not the lifetime of the struct file in the kernel. If the application closes the open fds that refer to the lease, then the kernel does not guarantee, and the application has no right to expect, that the lease remains active in any way once the application closes all direct references to the lease. IOWs, applications using layout leases need to hold the lease fd open for as long as the want access to the physical file layout. It is a also a requirement of the layout lease that the holder releases the resources it holds on the layout before it releases the layout lease, exclusive lease or not. Closing the fd indicates they do not need access to the file any more, and so the lease should be reclaimed at that point. I'm of a mind to make the last close() on a file block if there's an active layout lease to prevent processes from zombie-ing layout leases like this. i.e. you can't close the fd until resources that pin the lease have been released. Cheers, Dave.
On Sat 17-08-19 12:26:03, Dave Chinner wrote: > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > 2) Second reason is that I thought I did not have a good way to tell if the > > lease was actually in use. What I mean is that letting the lease go should > > be ok IFF we don't have any pins... I was thinking that without John's code > > we don't have a way to know if there are any pins... But that is wrong... > > All we have to do is check > > > > !list_empty(file->file_pins) > > > > So now with this detail I think you are right, we should be able to hold the > > lease through the struct file even if the process no longer has any > > "references" to it (ie closes and munmaps the file). > > I really, really dislike the idea of zombie layout leases. It's a > nasty hack for poor application behaviour. This is a "we allow use > after layout lease release" API, and I think encoding largely > untraceable zombie objects into an API is very poor design. > > From the fcntl man page: > > LEASES > Leases are associated with an open file description (see > open(2)). This means that duplicate file descriptors > (created by, for example, fork(2) or dup(2)) re‐ fer to > the same lease, and this lease may be modified or > released using any of these descriptors. Furthermore, the > lease is released by either an explicit F_UNLCK operation on > any of these duplicate file descriptors, or when all such > file descriptors have been closed. > > Leases are associated with *open* file descriptors, not the > lifetime of the struct file in the kernel. If the application closes > the open fds that refer to the lease, then the kernel does not > guarantee, and the application has no right to expect, that the > lease remains active in any way once the application closes all > direct references to the lease. > > IOWs, applications using layout leases need to hold the lease fd > open for as long as the want access to the physical file layout. It > is a also a requirement of the layout lease that the holder releases > the resources it holds on the layout before it releases the layout > lease, exclusive lease or not. Closing the fd indicates they do not > need access to the file any more, and so the lease should be > reclaimed at that point. > > I'm of a mind to make the last close() on a file block if there's an > active layout lease to prevent processes from zombie-ing layout > leases like this. i.e. you can't close the fd until resources that > pin the lease have been released. Yeah, so this was my initial though as well [1]. But as the discussion in that thread revealed, the problem with blocking last close is that kernel does not really expect close to block. You could easily deadlock e.g. if the process gets SIGKILL, file with lease has fd 10, and the RDMA context holding pages pinned has fd 15. Or you could wait for another process to release page pins and blocking SIGKILL on that is also bad. So in the end the least bad solution we've come up with were these "zombie" leases as you call them and tracking them in /proc so that userspace at least has a way of seeing them. But if you can come up with a different solution, I'm certainly not attached to the current one... Honza [1] https://lore.kernel.org/lkml/20190606104203.GF7433@quack2.suse.cz
On Fri 16-08-19 16:20:07, Ira Weiny wrote: > On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote: > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > > > Hello! > > > > > > > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote: > > > > > > Pre-requisites > > > > > > ============== > > > > > > Based on mmotm tree. > > > > > > > > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since > > > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1] > > > > > > > > > > > > Solution summary > > > > > > ================ > > > > > > > > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > > > > > memory which is then truncated. So really any solution we present which: > > > > > > > > > > > > A) Prevents file system corruption or data leaks > > > > > > ...and... > > > > > > B) Informs the user that they did something wrong > > > > > > > > > > > > Should be an acceptable solution. > > > > > > > > > > > > Because this is slightly new behavior. And because this is going to be > > > > > > specific to DAX (because of the lack of a page cache) we have made the user > > > > > > "opt in" to this behavior. > > > > > > > > > > > > The following patches implement the following solution. > > > > > > > > > > > > 0) Registrations to Device DAX char devs are not affected > > > > > > > > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive > > > > > > layout lease. Both exclusive and layout lease flags are user visible now. > > > > > > > > > > > > 2) page pins will fail if the lease is not active when the file back page is > > > > > > encountered. > > > > > > > > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail. > > > > > > > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you > > > > > mean a page which has corresponding file_pin covering it? Or do you mean a > > > > > page which has pincount increased? If the first then I'd rephrase this to > > > > > be less ambiguous, if the second then I think it is wrong. > > > > > > > > I mean the second. but by "fail" I mean hang. Right now the "normal" page > > > > pincount processing will hang the truncate. Given the discussion with John H > > > > we can make this a bit better if we use something like FOLL_PIN and the page > > > > count bias to indicate this type of pin. Then I could fail the truncate > > > > outright. but that is not done yet. > > > > > > > > so... I used the word "fail" to be a bit more vague as the final implementation > > > > may return ETXTBUSY or hang as noted. > > > > > > Ah, OK. Hanging is fine in principle but with longterm pins, your work > > > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that > > > e.g. DIO will use page pins as well for its buffers and we must wait there > > > until the pin is released. So please just clarify your 'fail' here a bit > > > :). > > > > It will fail with ETXTBSY. I've fixed a bug... See below. > > > > > > > > > > > 4) The user has the option of holding the lease or releasing it. If they > > > > > > release it no other pin calls will work on the file. > > > > > > > > > > Last time we spoke the plan was that the lease is kept while the pages are > > > > > pinned (and an attempt to release the lease would block until the pages are > > > > > unpinned). That also makes it clear that the *lease* is what is making > > > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is > > > > > just an implementation detail how the existence is efficiently tracked (and > > > > > what keeps the backing file for the pages open so that the lease does not > > > > > get auto-destroyed). Why did you change this? > > > > > > > > closing the file _and_ unmaping it will cause the lease to be released > > > > regardless of if we allow this or not. > > > > > > > > As we discussed preventing the close seemed intractable. > > > > > > Yes, preventing the application from closing the file is difficult. But > > > from a quick look at your patches it seemed to me that you actually hold a > > > backing file reference from the file_pin structure thus even though the > > > application closes its file descriptor, the struct file (and thus the > > > lease) lives further until the file_pin gets released. And that should last > > > as long as the pages are pinned. Am I missing something? > > > > > > > I thought about failing the munmap but that seemed wrong as well. But more > > > > importantly AFAIK RDMA can pass its memory pins to other processes via FD > > > > passing... This means that one could pin this memory, pass it to another > > > > process and exit. The file lease on the pin'ed file is lost. > > > > > > Not if file_pin grabs struct file reference as I mentioned above... > > > > > > > The file lease is just a key to get the memory pin. Once unlocked the procfs > > > > tracking keeps track of where that pin goes and which processes need to be > > > > killed to get rid of it. > > > > > > I think having file lease being just a key to get the pin is conceptually > > > wrong. The lease is what expresses: "I'm accessing these blocks directly, > > > don't touch them without coordinating with me." So it would be only natural > > > if we maintained the lease while we are accessing blocks instead of > > > transferring this protection responsibility to another structure - namely > > > file_pin - and letting the lease go. > > > > We do transfer that protection to the file_pin but we don't have to "let the > > lease" go. We just keep the lease with the file_pin as you said. See below... > > > > > But maybe I miss some technical reason > > > why maintaining file lease is difficult. If that's the case, I'd like to hear > > > what... > > > > Ok, I've thought a bit about what you said and indeed it should work that way. > > The reason I had to think a bit is that I was not sure why I thought we needed > > to hang... Turns out there were a couple of reasons... 1 not so good and 1 ok > > but still not good enough to allow this... > > > > 1) I had a bug in the XFS code which should have failed rather than hanging... > > So this was not a good reason... And I was able to find/fix it... Thanks! > > > > 2) Second reason is that I thought I did not have a good way to tell if the > > lease was actually in use. What I mean is that letting the lease go should > > be ok IFF we don't have any pins... I was thinking that without John's code > > we don't have a way to know if there are any pins... But that is wrong... > > All we have to do is check > > > > !list_empty(file->file_pins) > > Oops... I got my "struct files" mixed up... The RDMA struct file has the > file_pins hanging off it... This will not work. > > I'll have to try something else to prevent this. However, I don't want to walk > all the pages of the inode. > > Also I'm concerned about just failing if they happen to be pinned. They need > to be LONGTERM pinned... Otherwise we might have a transient failure of an > unlock based on some internal kernel transient pin... :-/ My solution for this was that file_pin would contain counter of pinned pages which vaddr_pin_pages() would increment and vaddr_unpin_pages() would decrement. Checking whether there's any outstanding page pinned attached to the file_pin is then trivial... Honza
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: > On Sat 17-08-19 12:26:03, Dave Chinner wrote: > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > 2) Second reason is that I thought I did not have a good way to tell if the > > > lease was actually in use. What I mean is that letting the lease go should > > > be ok IFF we don't have any pins... I was thinking that without John's code > > > we don't have a way to know if there are any pins... But that is wrong... > > > All we have to do is check > > > > > > !list_empty(file->file_pins) > > > > > > So now with this detail I think you are right, we should be able to hold the > > > lease through the struct file even if the process no longer has any > > > "references" to it (ie closes and munmaps the file). > > > > I really, really dislike the idea of zombie layout leases. It's a > > nasty hack for poor application behaviour. This is a "we allow use > > after layout lease release" API, and I think encoding largely > > untraceable zombie objects into an API is very poor design. > > > > From the fcntl man page: > > > > LEASES > > Leases are associated with an open file description (see > > open(2)). This means that duplicate file descriptors > > (created by, for example, fork(2) or dup(2)) re‐ fer to > > the same lease, and this lease may be modified or > > released using any of these descriptors. Furthermore, the > > lease is released by either an explicit F_UNLCK operation on > > any of these duplicate file descriptors, or when all such > > file descriptors have been closed. > > > > Leases are associated with *open* file descriptors, not the > > lifetime of the struct file in the kernel. If the application closes > > the open fds that refer to the lease, then the kernel does not > > guarantee, and the application has no right to expect, that the > > lease remains active in any way once the application closes all > > direct references to the lease. > > > > IOWs, applications using layout leases need to hold the lease fd > > open for as long as the want access to the physical file layout. It > > is a also a requirement of the layout lease that the holder releases > > the resources it holds on the layout before it releases the layout > > lease, exclusive lease or not. Closing the fd indicates they do not > > need access to the file any more, and so the lease should be > > reclaimed at that point. > > > > I'm of a mind to make the last close() on a file block if there's an > > active layout lease to prevent processes from zombie-ing layout > > leases like this. i.e. you can't close the fd until resources that > > pin the lease have been released. > > Yeah, so this was my initial though as well [1]. But as the discussion in > that thread revealed, the problem with blocking last close is that kernel > does not really expect close to block. You could easily deadlock e.g. if > the process gets SIGKILL, file with lease has fd 10, and the RDMA context > holding pages pinned has fd 15. Sure, I did think about this a bit about it before suggesting it :) The last close is an interesting case because the __fput() call actually runs from task_work() context, not where the last reference is actually dropped. So it already has certain specific interactions with signals and task exit processing via task_add_work() and task_work_run(). task_add_work() calls set_notify_resume(task), so if nothing else triggers when returning to userspace we run this path: exit_to_usermode_loop() tracehook_notify_resume() task_work_run() __fput() locks_remove_file() locks_remove_lease() .... It's worth noting that locks_remove_lease() does a percpu_down_read() which means we can already block in this context removing leases.... If there is a signal pending, the task work is run this way (before the above notify path): exit_to_usermode_loop() do_signal() get_signal() task_work_run() __fput() We can detect this case via signal_pending() and even SIGKILL via fatal_signal_pending(), and so we can decide not to block based on the fact the process is about to be reaped and so the lease largely doesn't matter anymore. I'd argue that it is close and we can't easily back out, so we'd only break the block on a fatal signal.... And then, of course, is the call path through do_exit(), which has the PF_EXITING task flag set: do_exit() exit_task_work() task_work_run() __fput() and so it's easy to avoid blocking in this case, too. So that leaves just the normal close() syscall exit case, where the application has full control of the order in which resources are released. We've already established that we can block in this context. Blocking in an interruptible state will allow fatal signal delivery to wake us, and then we fall into the fatal_signal_pending() case if we get a SIGKILL while blocking. Hence I think blocking in this case would be OK - it indicates an application bug (releasing a lease before releasing the resources) but leaves SIGKILL available to administrators to resolve situations involving buggy applications. This requires applications to follow the rules: any process that pins physical resources must have an active reference to a layout lease, either via a duplicated fd or it's own private lease. If the app doesn't play by the rules, it hangs in close() until it is killed. > Or you could wait for another process to > release page pins and blocking SIGKILL on that is also bad. Again, each individual process that pins pages from the layout must have it's own active layout lease reference. > So in the end > the least bad solution we've come up with were these "zombie" leases as you > call them and tracking them in /proc so that userspace at least has a way > of seeing them. But if you can come up with a different solution, I'm > certainly not attached to the current one... It might be the "least bad" solution, but it's still a pretty bad one. And one that I don't think is necessary if we simply enforce the "process must have active references for the entire time the process uses the resource" rule. That's the way file access has always worked, I don't see why we should be doing anything different for access to the physical layout of files... Cheers, Dave.
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > So that leaves just the normal close() syscall exit case, where the > > application has full control of the order in which resources are > > released. We've already established that we can block in this > > context. Blocking in an interruptible state will allow fatal signal > > delivery to wake us, and then we fall into the > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > The major problem with RDMA is that it doesn't always wait on close() for the > MR holding the page pins to be destoyed. This is done to avoid a > deadlock of the form: > > uverbs_destroy_ufile_hw() > mutex_lock() > [..] > mmput() > exit_mmap() > remove_vma() > fput(); > file_operations->release() > ib_uverbs_close() > uverbs_destroy_ufile_hw() > mutex_lock() <-- Deadlock > > But, as I said to Ira earlier, I wonder if this is now impossible on > modern kernels and we can switch to making the whole thing > synchronous. That would resolve RDMA's main problem with this. I'm still looking into this... but my bigger concern is that the RDMA FD can be passed to other processes via SCM_RIGHTS. Which means the process holding the pin may _not_ be the one with the open file and layout lease... Ira
On 8/19/19 2:24 AM, Dave Chinner wrote: > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: >> On Sat 17-08-19 12:26:03, Dave Chinner wrote: >>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: >>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: >>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote: >>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: ... > The last close is an interesting case because the __fput() call > actually runs from task_work() context, not where the last reference > is actually dropped. So it already has certain specific interactions > with signals and task exit processing via task_add_work() and > task_work_run(). > > task_add_work() calls set_notify_resume(task), so if nothing else > triggers when returning to userspace we run this path: > > exit_to_usermode_loop() > tracehook_notify_resume() > task_work_run() > __fput() > locks_remove_file() > locks_remove_lease() > .... > > It's worth noting that locks_remove_lease() does a > percpu_down_read() which means we can already block in this context > removing leases.... > > If there is a signal pending, the task work is run this way (before > the above notify path): > > exit_to_usermode_loop() > do_signal() > get_signal() > task_work_run() > __fput() > > We can detect this case via signal_pending() and even SIGKILL via > fatal_signal_pending(), and so we can decide not to block based on > the fact the process is about to be reaped and so the lease largely > doesn't matter anymore. I'd argue that it is close and we can't > easily back out, so we'd only break the block on a fatal signal.... > > And then, of course, is the call path through do_exit(), which has > the PF_EXITING task flag set: > > do_exit() > exit_task_work() > task_work_run() > __fput() > > and so it's easy to avoid blocking in this case, too. Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins memory with FOLL_LONGTERM, and wondering how to make that work here. These are close to files, in how they're handled, but just different enough that it's not clear to me how to make work with this system. > > So that leaves just the normal close() syscall exit case, where the > application has full control of the order in which resources are > released. We've already established that we can block in this > context. Blocking in an interruptible state will allow fatal signal > delivery to wake us, and then we fall into the > fatal_signal_pending() case if we get a SIGKILL while blocking. > > Hence I think blocking in this case would be OK - it indicates an > application bug (releasing a lease before releasing the resources) > but leaves SIGKILL available to administrators to resolve situations > involving buggy applications. > > This requires applications to follow the rules: any process > that pins physical resources must have an active reference to a > layout lease, either via a duplicated fd or it's own private lease. > If the app doesn't play by the rules, it hangs in close() until it > is killed. +1 for these rules, assuming that we can make them work. They are easy to explain and intuitive. thanks,
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > So that leaves just the normal close() syscall exit case, where the > > application has full control of the order in which resources are > > released. We've already established that we can block in this > > context. Blocking in an interruptible state will allow fatal signal > > delivery to wake us, and then we fall into the > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > The major problem with RDMA is that it doesn't always wait on close() for the > MR holding the page pins to be destoyed. This is done to avoid a > deadlock of the form: > > uverbs_destroy_ufile_hw() > mutex_lock() > [..] > mmput() > exit_mmap() > remove_vma() > fput(); > file_operations->release() I think this is wrong, and I'm pretty sure it's an example of why the final __fput() call is moved out of line. fput() fput_many() task_add_work(f, __fput()) and the call chain ends there. Before the syscall returns to userspace, it then runs the __fput() call through the task_work_run() interfaces, and hence the call chain is just: task_work_run __fput > file_operations->release() > ib_uverbs_close() > uverbs_destroy_ufile_hw() > mutex_lock() <-- Deadlock And there is no deadlock because nothing holds the mutex at this point. Cheers, Dave.
On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote: > On 8/19/19 2:24 AM, Dave Chinner wrote: > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote: > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > ... > > The last close is an interesting case because the __fput() call > > actually runs from task_work() context, not where the last reference > > is actually dropped. So it already has certain specific interactions > > with signals and task exit processing via task_add_work() and > > task_work_run(). > > > > task_add_work() calls set_notify_resume(task), so if nothing else > > triggers when returning to userspace we run this path: > > > > exit_to_usermode_loop() > > tracehook_notify_resume() > > task_work_run() > > __fput() > > locks_remove_file() > > locks_remove_lease() > > .... > > > > It's worth noting that locks_remove_lease() does a > > percpu_down_read() which means we can already block in this context > > removing leases.... > > > > If there is a signal pending, the task work is run this way (before > > the above notify path): > > > > exit_to_usermode_loop() > > do_signal() > > get_signal() > > task_work_run() > > __fput() > > > > We can detect this case via signal_pending() and even SIGKILL via > > fatal_signal_pending(), and so we can decide not to block based on > > the fact the process is about to be reaped and so the lease largely > > doesn't matter anymore. I'd argue that it is close and we can't > > easily back out, so we'd only break the block on a fatal signal.... > > > > And then, of course, is the call path through do_exit(), which has > > the PF_EXITING task flag set: > > > > do_exit() > > exit_task_work() > > task_work_run() > > __fput() > > > > and so it's easy to avoid blocking in this case, too. > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins > memory with FOLL_LONGTERM, and wondering how to make that work here. I'm not sure how this interacts with file mappings? I mean, this is just pinning anonymous pages for direct data placement into userspace, right? Are you asking "what if this pinned memory was a file mapping?", or something else? > These are close to files, in how they're handled, but just different > enough that it's not clear to me how to make work with this system. I'm guessing that if they are pinning a file backed mapping, they are trying to dma direct to the file (zero copy into page cache?) and so they'll need to either play by ODP rules or take layout leases, too.... Cheers, Dave.
On 8/19/19 6:20 PM, Dave Chinner wrote: > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote: >> On 8/19/19 2:24 AM, Dave Chinner wrote: >>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: >>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote: >>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: >>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: >>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote: >>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: >> ... >> >> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins >> memory with FOLL_LONGTERM, and wondering how to make that work here. > > I'm not sure how this interacts with file mappings? I mean, this > is just pinning anonymous pages for direct data placement into > userspace, right? > > Are you asking "what if this pinned memory was a file mapping?", > or something else? Yes, mainly that one. Especially since the FOLL_LONGTERM flag is already there in xdp_umem_pin_pages(), unconditionally. So the simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is not set) are not going to work here. > >> These are close to files, in how they're handled, but just different >> enough that it's not clear to me how to make work with this system. > > I'm guessing that if they are pinning a file backed mapping, they > are trying to dma direct to the file (zero copy into page cache?) > and so they'll need to either play by ODP rules or take layout > leases, too.... > OK. I was just wondering if there was some simple way to dig up a struct file associated with a socket (I don't think so), but it sounds like this is an exercise that's potentially different for each subsystem. thanks,
On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote: > On 8/19/19 6:20 PM, Dave Chinner wrote: > > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote: > > > On 8/19/19 2:24 AM, Dave Chinner wrote: > > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: > > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote: > > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > ... > > > > > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins > > > memory with FOLL_LONGTERM, and wondering how to make that work here. > > > > I'm not sure how this interacts with file mappings? I mean, this > > is just pinning anonymous pages for direct data placement into > > userspace, right? > > > > Are you asking "what if this pinned memory was a file mapping?", > > or something else? > > Yes, mainly that one. Especially since the FOLL_LONGTERM flag is > already there in xdp_umem_pin_pages(), unconditionally. So the > simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is > not set) are not going to work here. > > > > > > > These are close to files, in how they're handled, but just different > > > enough that it's not clear to me how to make work with this system. > > > > I'm guessing that if they are pinning a file backed mapping, they > > are trying to dma direct to the file (zero copy into page cache?) > > and so they'll need to either play by ODP rules or take layout > > leases, too.... > > > > OK. I was just wondering if there was some simple way to dig up a > struct file associated with a socket (I don't think so), but it sounds > like this is an exercise that's potentially different for each subsystem. AFAIA, there is no struct file here - the memory that has been pinned is just something mapped into the application's address space. It seems to me that the socket here is equivalent of the RDMA handle that that owns the hardware that pins the pages. Again, that RDMA handle is not aware of waht the mapping represents, hence need to hold a layout lease if it's a file mapping. SO from the filesystem persepctive, there's no difference between XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly into the filesystem's backing store and that will require use of layout leases to perform safely. Cheers, Dave.
On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > application has full control of the order in which resources are > > > > released. We've already established that we can block in this > > > > context. Blocking in an interruptible state will allow fatal signal > > > > delivery to wake us, and then we fall into the > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > MR holding the page pins to be destoyed. This is done to avoid a > > > deadlock of the form: > > > > > > uverbs_destroy_ufile_hw() > > > mutex_lock() > > > [..] > > > mmput() > > > exit_mmap() > > > remove_vma() > > > fput(); > > > file_operations->release() > > > > I think this is wrong, and I'm pretty sure it's an example of why > > the final __fput() call is moved out of line. > > Yes, I think so too, all I can say is this *used* to happen, as we > have special code avoiding it, which is the code that is messing up > Ira's lifetime model. > > Ira, you could try unraveling the special locking, that solves your > lifetime issues? Yes I will try to prove this out... But I'm still not sure this fully solves the problem. This only ensures that the process which has the RDMA context (RDMA FD) is safe with regard to hanging the close for the "data file FD" (the file which has pinned pages) in that _same_ process. But what about the scenario. Process A has the RDMA context FD and data file FD (with lease) open. Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B. Process A attempts to exit (hangs because data file FD is pinned). Admin kills process A. kill works because we have allowed for it... Process B _still_ has the RDMA context FD open _and_ therefore still holds the file pins. Truncation still fails. Admin does not know which process is holding the pin. What am I missing? Ira
On 8/21/19 11:13 AM, Jason Gunthorpe wrote: > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: >> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: >>> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: >>>> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: >>>>> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: >>>>> >>>>>> So that leaves just the normal close() syscall exit case, where the >>>>>> application has full control of the order in which resources are >>>>>> released. We've already established that we can block in this >>>>>> context. Blocking in an interruptible state will allow fatal signal >>>>>> delivery to wake us, and then we fall into the >>>>>> fatal_signal_pending() case if we get a SIGKILL while blocking. >>>>> >>>>> The major problem with RDMA is that it doesn't always wait on close() for the >>>>> MR holding the page pins to be destoyed. This is done to avoid a >>>>> deadlock of the form: >>>>> >>>>> uverbs_destroy_ufile_hw() >>>>> mutex_lock() >>>>> [..] >>>>> mmput() >>>>> exit_mmap() >>>>> remove_vma() >>>>> fput(); >>>>> file_operations->release() >>>> >>>> I think this is wrong, and I'm pretty sure it's an example of why >>>> the final __fput() call is moved out of line. >>> >>> Yes, I think so too, all I can say is this *used* to happen, as we >>> have special code avoiding it, which is the code that is messing up >>> Ira's lifetime model. >>> >>> Ira, you could try unraveling the special locking, that solves your >>> lifetime issues? >> >> Yes I will try to prove this out... But I'm still not sure this fully solves >> the problem. >> >> This only ensures that the process which has the RDMA context (RDMA FD) is safe >> with regard to hanging the close for the "data file FD" (the file which has >> pinned pages) in that _same_ process. But what about the scenario. > > Oh, I didn't think we were talking about that. Hanging the close of > the datafile fd contingent on some other FD's closure is a recipe for > deadlock.. > > IMHO the pin refcnt is held by the driver char dev FD, that is the > object you need to make it visible against. If you do that, it might make it a lot simpler to add lease support to drivers like XDP, which is otherwise looking pretty difficult to set up with an fd. (It's socket-based, and not immediately clear where to connect up the fd.) thanks,
On 8/19/19 8:36 PM, Dave Chinner wrote: > On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote: >> On 8/19/19 6:20 PM, Dave Chinner wrote: >>> On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote: >>>> On 8/19/19 2:24 AM, Dave Chinner wrote: >>>>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: >>>>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote: >>>>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: >>>>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: >>>>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote: >>>>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: >>>> ... > AFAIA, there is no struct file here - the memory that has been pinned > is just something mapped into the application's address space. > > It seems to me that the socket here is equivalent of the RDMA handle > that that owns the hardware that pins the pages. Again, that RDMA > handle is not aware of waht the mapping represents, hence need to > hold a layout lease if it's a file mapping. > > SO from the filesystem persepctive, there's no difference between > XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly > into the filesystem's backing store and that will require use of > layout leases to perform safely. > OK, got it! Makes perfect sense. thanks,
On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > > > application has full control of the order in which resources are > > > > > > released. We've already established that we can block in this > > > > > > context. Blocking in an interruptible state will allow fatal signal > > > > > > delivery to wake us, and then we fall into the > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > > > MR holding the page pins to be destoyed. This is done to avoid a > > > > > deadlock of the form: > > > > > > > > > > uverbs_destroy_ufile_hw() > > > > > mutex_lock() > > > > > [..] > > > > > mmput() > > > > > exit_mmap() > > > > > remove_vma() > > > > > fput(); > > > > > file_operations->release() > > > > > > > > I think this is wrong, and I'm pretty sure it's an example of why > > > > the final __fput() call is moved out of line. > > > > > > Yes, I think so too, all I can say is this *used* to happen, as we > > > have special code avoiding it, which is the code that is messing up > > > Ira's lifetime model. > > > > > > Ira, you could try unraveling the special locking, that solves your > > > lifetime issues? > > > > Yes I will try to prove this out... But I'm still not sure this fully solves > > the problem. > > > > This only ensures that the process which has the RDMA context (RDMA FD) is safe > > with regard to hanging the close for the "data file FD" (the file which has > > pinned pages) in that _same_ process. But what about the scenario. > > Oh, I didn't think we were talking about that. Hanging the close of > the datafile fd contingent on some other FD's closure is a recipe for > deadlock.. The discussion between Jan and Dave was concerning what happens when a user calls fd = open() fnctl(...getlease...) addr = mmap(fd...) ib_reg_mr() <pin> munmap(addr...) close(fd) Dave suggested: "I'm of a mind to make the last close() on a file block if there's an active layout lease to prevent processes from zombie-ing layout leases like this. i.e. you can't close the fd until resources that pin the lease have been released." -- Dave https://lkml.org/lkml/2019/8/16/994 > > IMHO the pin refcnt is held by the driver char dev FD, that is the > object you need to make it visible against. I'm sorry but what do you mean by "driver char dev FD"? > > Why not just have a single table someplace of all the layout leases > with the file they are held on and the FD/socket/etc that is holding > the pin? Make it independent of processes and FDs? If it is independent of processes how will we know which process is blocking the truncate? Using a global table is an interesting idea but I still believe the users are going to want to track this to specific processes. It's not clear to me how that would be done with a global table. I agree the XDP/socket case is bothersome... I was thinking that somewhere the fd of the socket could be hooked up in this case. But taking a look at it reveals that is not going to be easy. And I assume XDP has the same issue WRT SCM_RIGHTS and the ability to share the xdp context? Ira > > Jason
On Wed, Aug 21, 2019 at 11:57:03AM -0700, 'Ira Weiny' wrote: > On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote: > > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: > > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > > > > application has full control of the order in which resources are > > > > > > > released. We've already established that we can block in this > > > > > > > context. Blocking in an interruptible state will allow fatal signal > > > > > > > delivery to wake us, and then we fall into the > > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > > > > MR holding the page pins to be destoyed. This is done to avoid a > > > > > > deadlock of the form: > > > > > > > > > > > > uverbs_destroy_ufile_hw() > > > > > > mutex_lock() > > > > > > [..] > > > > > > mmput() > > > > > > exit_mmap() > > > > > > remove_vma() > > > > > > fput(); > > > > > > file_operations->release() > > > > > > > > > > I think this is wrong, and I'm pretty sure it's an example of why > > > > > the final __fput() call is moved out of line. > > > > > > > > Yes, I think so too, all I can say is this *used* to happen, as we > > > > have special code avoiding it, which is the code that is messing up > > > > Ira's lifetime model. > > > > > > > > Ira, you could try unraveling the special locking, that solves your > > > > lifetime issues? > > > > > > Yes I will try to prove this out... But I'm still not sure this fully solves > > > the problem. > > > > > > This only ensures that the process which has the RDMA context (RDMA FD) is safe > > > with regard to hanging the close for the "data file FD" (the file which has > > > pinned pages) in that _same_ process. But what about the scenario. > > > > Oh, I didn't think we were talking about that. Hanging the close of > > the datafile fd contingent on some other FD's closure is a recipe for > > deadlock.. > > The discussion between Jan and Dave was concerning what happens when a user > calls > > fd = open() > fnctl(...getlease...) > addr = mmap(fd...) > ib_reg_mr() <pin> > munmap(addr...) > close(fd) > > Dave suggested: > > "I'm of a mind to make the last close() on a file block if there's an > active layout lease to prevent processes from zombie-ing layout > leases like this. i.e. you can't close the fd until resources that > pin the lease have been released." > > -- Dave https://lkml.org/lkml/2019/8/16/994 I think this may all be easier if there was a way to block a dup() if it comes from an SCM_RIGHTS. Does anyone know if that is easy? I assume it would just mean passing some flag through the dup() call chain. Jason, if we did that would it break RDMA use cases? I personally don't know of any. We could pass data back from vaddr_pin indicating that a file pin has been taken and predicate the blocking of SCM_RIGHTS on that? Of course if the user called: fd = open() fnctl(...getlease...) addr = mmap(fd...) ib_reg_mr() <pin> munmap(addr...) close(fd) fork() <=== in another thread because close is hanging Would that dup() "fd" above into the child? Or maybe that would be part of the work to make close() hang? Ensure the fd/file is still in the FD table so it gets dupped??? Ira > > > > > IMHO the pin refcnt is held by the driver char dev FD, that is the > > object you need to make it visible against. > > I'm sorry but what do you mean by "driver char dev FD"? > > > > > Why not just have a single table someplace of all the layout leases > > with the file they are held on and the FD/socket/etc that is holding > > the pin? Make it independent of processes and FDs? > > If it is independent of processes how will we know which process is blocking > the truncate? Using a global table is an interesting idea but I still believe > the users are going to want to track this to specific processes. It's not > clear to me how that would be done with a global table. > > I agree the XDP/socket case is bothersome... I was thinking that somewhere the > fd of the socket could be hooked up in this case. But taking a look at it > reveals that is not going to be easy. And I assume XDP has the same issue WRT > SCM_RIGHTS and the ability to share the xdp context? > > Ira > > > > > Jason
On Wed, Aug 21, 2019 at 11:43:30AM -0700, John Hubbard wrote: > On 8/19/19 8:36 PM, Dave Chinner wrote: > > On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote: > > > On 8/19/19 6:20 PM, Dave Chinner wrote: > > > > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote: > > > > > On 8/19/19 2:24 AM, Dave Chinner wrote: > > > > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote: > > > > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote: > > > > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote: > > > > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote: > > > > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote: > > > > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote: > > > > > ... > > AFAIA, there is no struct file here - the memory that has been pinned > > is just something mapped into the application's address space. > > > > It seems to me that the socket here is equivalent of the RDMA handle > > that that owns the hardware that pins the pages. Again, that RDMA > > handle is not aware of waht the mapping represents, hence need to > > hold a layout lease if it's a file mapping. > > > > SO from the filesystem persepctive, there's no difference between > > XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly > > into the filesystem's backing store and that will require use of > > layout leases to perform safely. > > > > OK, got it! Makes perfect sense. Just to chime in here... Yea from the FS perspective it is the same. But on the driver side it is more complicated because of how the references to the pins can be shared among other processes. See the other branch of this thread https://lkml.org/lkml/2019/8/21/828 Ira > > thanks, > -- > John Hubbard > NVIDIA
On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote: > > > > Oh, I didn't think we were talking about that. Hanging the close of > > > the datafile fd contingent on some other FD's closure is a recipe for > > > deadlock.. > > > > The discussion between Jan and Dave was concerning what happens when a user > > calls > > > > fd = open() > > fnctl(...getlease...) > > addr = mmap(fd...) > > ib_reg_mr() <pin> > > munmap(addr...) > > close(fd) > > I don't see how blocking close(fd) could work. Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I can't prove it won't.. Maybe we are all just touching a different part of this elephant[1] but the above scenario or one without munmap is very reasonably something a user would do. So we can either allow the close to complete (my current patches) or try to make it block like Dave is suggesting. I don't disagree with Dave with the semantics being nice and clean for the filesystem. But the fact that RDMA, and potentially others, can "pass the pins" to other processes is something I spent a lot of time trying to work out. > > Write it like this: > > fd = open() > uverbs = open(/dev/uverbs) > fnctl(...getlease...) > addr = mmap(fd...) > ib_reg_mr() <pin> > munmap(addr...) > <sigkill> > > The order FD's are closed during sigkill is not deterministic, so when > all the fputs happen during a kill'd exit we could end up blocking in > close(fd) as close(uverbs) will come after in the close > list. close(uverbs) is the thing that does the dereg_mr and releases > the pin. Of course, that is a different scenario which needs to be fixed in my patch set. Now that my servers are back up I can hopefully make progress. (Power was down for them yesterday). > > We don't need complexity with dup to create problems. No but that complexity _will_ come unless we "zombie" layout leases. Ira [1] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant > > Jason >
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > > application has full control of the order in which resources are > > > > > released. We've already established that we can block in this > > > > > context. Blocking in an interruptible state will allow fatal signal > > > > > delivery to wake us, and then we fall into the > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > > MR holding the page pins to be destoyed. This is done to avoid a > > > > deadlock of the form: > > > > > > > > uverbs_destroy_ufile_hw() > > > > mutex_lock() > > > > [..] > > > > mmput() > > > > exit_mmap() > > > > remove_vma() > > > > fput(); > > > > file_operations->release() > > > > > > I think this is wrong, and I'm pretty sure it's an example of why > > > the final __fput() call is moved out of line. > > > > Yes, I think so too, all I can say is this *used* to happen, as we > > have special code avoiding it, which is the code that is messing up > > Ira's lifetime model. > > > > Ira, you could try unraveling the special locking, that solves your > > lifetime issues? > > Yes I will try to prove this out... But I'm still not sure this fully solves > the problem. > > This only ensures that the process which has the RDMA context (RDMA FD) is safe > with regard to hanging the close for the "data file FD" (the file which has > pinned pages) in that _same_ process. But what about the scenario. > > Process A has the RDMA context FD and data file FD (with lease) open. > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B. Passing the RDMA context dependent on a file layout lease to another process that doesn't have a file layout lease or a reference to the original lease should be considered a violation of the layout lease. Process B does not have an active layout lease, and so by the rules of layout leases, it is not allowed to pin the layout of the file. > Process A attempts to exit (hangs because data file FD is pinned). > > Admin kills process A. kill works because we have allowed for it... > > Process B _still_ has the RDMA context FD open _and_ therefore still holds the > file pins. > > Truncation still fails. > > Admin does not know which process is holding the pin. > > What am I missing? Application does not hold the correct file layout lease references. Passing the fd via SCM_RIGHTS to a process without a layout lease is equivalent to not using layout leases in the first place. Cheers, Dave.
On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote: > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote: > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote: > > > > > > Oh, I didn't think we were talking about that. Hanging the close of > > > > the datafile fd contingent on some other FD's closure is a recipe for > > > > deadlock.. > > > > > > The discussion between Jan and Dave was concerning what happens when a user > > > calls > > > > > > fd = open() > > > fnctl(...getlease...) > > > addr = mmap(fd...) > > > ib_reg_mr() <pin> > > > munmap(addr...) > > > close(fd) > > > > I don't see how blocking close(fd) could work. > > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I > can't prove it won't.. Right, I proposed it as a possible way of making sure application developers don't do this. It _could_ be made to work (e.g. recording longterm page pins on the vma->file), but this is tangential to the discussion of requiring active references to all resources covered by the layout lease. I think allowing applications to behave like the above is simply poor system level design, regardless of the interaction with filesystems and layout leases. > Maybe we are all just touching a different part of this > elephant[1] but the above scenario or one without munmap is very reasonably > something a user would do. So we can either allow the close to complete (my > current patches) or try to make it block like Dave is suggesting. > > I don't disagree with Dave with the semantics being nice and clean for the > filesystem. I'm not trying to make it "nice and clean for the filesystem". The problem is not just RDMA/DAX - anything that is directly accessing the block device under the filesystem has the same set of issues. That is, the filesystem controls the life cycle of the blocks in the block device, so direct access to the blocks by any means needs to be co-ordinated with the filesystem. Pinning direct access to a file via page pins attached to a hardware context that the filesystem knows nothing about is not an access model that the filesystems can support. IOWs, anyone looking at this problem just from the RDMA POV of page pins is not seeing all the other direct storage access mechainsms that we need to support in the filesystems. RDMA on DAX is just one of them. pNFS is another. Remote acces via NVMeOF is another. XDP -> DAX (direct file data placement from the network hardware) is another. There are /lots/ of different direct storage access mechanisms that filesystems need to support and we sure as hell do not want to have to support special case semantics for every single one of them. Hence if we don't start with a sane model for arbitrating direct access to the storage at the filesystem level we'll never get this stuff to work reliably, let alone work together coherently. An application that wants a direct data path to storage should have a single API that enables then to safely access the storage, regardless of how they are accessing the storage. From that perspective, what we are talking about here with RDMA doing "mmap, page pin, unmap, close" and "pass page pins via SCM_RIGHTS" are fundamentally unworkable from the filesystem perspective. They are use-after-free situations from the filesystem perspective - they do not hold direct references to anything in the filesystem, and so the filesytem is completely unaware of them. The filesystem needs to be aware of /all users/ of it's resources if it's going to manage them sanely. It needs to be able to corectly coordinate modifications to ownership of the underlying storage with all the users directly accessing that physical storage regardless of the mechanism being used to access the storage. IOWs, access control must be independent of the mechanism used to gain access to the storage hardware. That's what file layout leases are defining - the filesystem support model for allowing direct storage access from userspace. It's not defining "RDMA/FSDAX" access rules, it's defining a generic direct access model. And one of the rules in this model is "if you don't have an active reference to the file layout, you are not allowed to directly access the layout.". Anything else is unsupportable from the filesystem perspective - designing an access mechanism that allows userspace to retain access indefinitely by relying on hidden details of kernel subsystem implementations is a terrible architecture. Not only does it bleed kernel implementation into the API and the behavioural model, it means we can't ever change that internal kernel behaviour because userspace may be dependent on it. I shouldn't be having to point out how bad this is from a system design perspective. That's where the "nice and clean" semantics come from - starting from "what can we actually support?", "what exactly do all the different direct access mechanisms actually require?", "does it work for future technologies not yet on our radar?" and working from there. So I'm not just looking at what we are doing right now, I'm looking at 15 years down the track when we still have to support layout leases and we've got hardware we haven't dreamed of yet. If the model is clean, simple, robust, implementation independent and has well defined semantics, then it should stand the test of time. i.e. the "nice and clean" semantics have nothign to do with the filesystem per se, but everything to do with ensuring the mechanism is generic and workable for direct storage access for a long time into the future. We can't force people to use layout leases - at all, let alone correctly - but if you want filesystems and enterprise distros to support direct access to filesystem controlled storage, then direct access applications need to follow a sane set of rules that are supportable at the filesystem level. > But the fact that RDMA, and potentially others, can "pass the > pins" to other processes is something I spent a lot of time trying to work out. There's nothing in file layout lease architecture that says you can't "pass the pins" to another process. All the file layout lease requirements say is that if you are going to pass a resource for which the layout lease guarantees access for to another process, then the destination process already have a valid, active layout lease that covers the range of the pins being passed to it via the RDMA handle. i.e. as the pins pass from one process to another, they pass from the protection of the lease process A holds to the protection that the lease process B holds. This can probably even be done by duplicating the lease fd and passing it by SCM_RIGHTS first..... Cheers, Dave.
On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote: > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > > > application has full control of the order in which resources are > > > > > > released. We've already established that we can block in this > > > > > > context. Blocking in an interruptible state will allow fatal signal > > > > > > delivery to wake us, and then we fall into the > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > > > MR holding the page pins to be destoyed. This is done to avoid a > > > > > deadlock of the form: > > > > > > > > > > uverbs_destroy_ufile_hw() > > > > > mutex_lock() > > > > > [..] > > > > > mmput() > > > > > exit_mmap() > > > > > remove_vma() > > > > > fput(); > > > > > file_operations->release() > > > > > > > > I think this is wrong, and I'm pretty sure it's an example of why > > > > the final __fput() call is moved out of line. > > > > > > Yes, I think so too, all I can say is this *used* to happen, as we > > > have special code avoiding it, which is the code that is messing up > > > Ira's lifetime model. > > > > > > Ira, you could try unraveling the special locking, that solves your > > > lifetime issues? > > > > Yes I will try to prove this out... But I'm still not sure this fully solves > > the problem. > > > > This only ensures that the process which has the RDMA context (RDMA FD) is safe > > with regard to hanging the close for the "data file FD" (the file which has > > pinned pages) in that _same_ process. But what about the scenario. > > > > Process A has the RDMA context FD and data file FD (with lease) open. > > > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B. > > Passing the RDMA context dependent on a file layout lease to another > process that doesn't have a file layout lease or a reference to the > original lease should be considered a violation of the layout lease. > Process B does not have an active layout lease, and so by the rules > of layout leases, it is not allowed to pin the layout of the file. > I don't disagree with the semantics of this. I just don't know how to enforce it. > > Process A attempts to exit (hangs because data file FD is pinned). > > > > Admin kills process A. kill works because we have allowed for it... > > > > Process B _still_ has the RDMA context FD open _and_ therefore still holds the > > file pins. > > > > Truncation still fails. > > > > Admin does not know which process is holding the pin. > > > > What am I missing? > > Application does not hold the correct file layout lease references. > Passing the fd via SCM_RIGHTS to a process without a layout lease > is equivalent to not using layout leases in the first place. Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS in this case? I'm ok with that but not sure how to implement it right now. To that end, I would like to simplify this slightly because I'm not convinced that SCM_RIGHTS is a problem we need to solve right now. ie I don't know of a user who wants to do this. Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by definition leases) exist underneath the "RDMA FD" (or other direct access FD, like XDP etc) being duplicated. Later, if this becomes a use case we will need to code up the proper checks, potentially within each of the subsystems. This is because, with RDMA at least, there are potentially large numbers of MR's and file leases which may have to be checked. Ira
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote: > > > > But the fact that RDMA, and potentially others, can "pass the > > > pins" to other processes is something I spent a lot of time trying to work out. > > > > There's nothing in file layout lease architecture that says you > > can't "pass the pins" to another process. All the file layout lease > > requirements say is that if you are going to pass a resource for > > which the layout lease guarantees access for to another process, > > then the destination process already have a valid, active layout > > lease that covers the range of the pins being passed to it via the > > RDMA handle. > > How would the kernel detect and enforce this? There are many ways to > pass a FD. AFAIC, that's not really a kernel problem. It's more of an application design constraint than anything else. i.e. if the app passes the IB context to another process without a lease, then the original process is still responsible for recalling the lease and has to tell that other process to release the IB handle and it's resources. > IMHO it is wrong to try and create a model where the file lease exists > independently from the kernel object relying on it. In other words the > IB MR object itself should hold a reference to the lease it relies > upon to function properly. That still doesn't work. Leases are not individually trackable or reference counted objects objects - they are attached to a struct file bUt, in reality, they are far more restricted than a struct file. That is, a lease specifically tracks the pid and the _open fd_ it was obtained for, so it is essentially owned by a specific process context. Hence a lease is not able to be passed to a separate process context and have it still work correctly for lease break notifications. i.e. the layout break signal gets delivered to original process that created the struct file, if it still exists and has the original fd still open. It does not get sent to the process that currently holds a reference to the IB context. So while a struct file passed to another process might still have an active lease, and you can change the owner of the struct file via fcntl(F_SETOWN), you can't associate the existing lease with a the new fd in the new process and so layout break signals can't be directed at the lease fd.... This really means that a lease can only be owned by a single process context - it can't be shared across multiple processes (so I was wrong about dup/pass as being a possible way of passing them) because there's only one process that can "own" a struct file, and that where signals are sent when the lease needs to be broken. So, fundamentally, if you want to pass a resource that pins a file layout between processes, both processes need to hold a layout lease on that file range. And that means exclusive leases and passing layouts between processes are fundamentally incompatible because you can't hold two exclusive leases on the same file range.... Cheers, Dave.
On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote: > On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote: > > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote: > > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote: > > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote: > > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote: > > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote: > > > > > > > > > > > > > So that leaves just the normal close() syscall exit case, where the > > > > > > > application has full control of the order in which resources are > > > > > > > released. We've already established that we can block in this > > > > > > > context. Blocking in an interruptible state will allow fatal signal > > > > > > > delivery to wake us, and then we fall into the > > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking. > > > > > > > > > > > > The major problem with RDMA is that it doesn't always wait on close() for the > > > > > > MR holding the page pins to be destoyed. This is done to avoid a > > > > > > deadlock of the form: > > > > > > > > > > > > uverbs_destroy_ufile_hw() > > > > > > mutex_lock() > > > > > > [..] > > > > > > mmput() > > > > > > exit_mmap() > > > > > > remove_vma() > > > > > > fput(); > > > > > > file_operations->release() > > > > > > > > > > I think this is wrong, and I'm pretty sure it's an example of why > > > > > the final __fput() call is moved out of line. > > > > > > > > Yes, I think so too, all I can say is this *used* to happen, as we > > > > have special code avoiding it, which is the code that is messing up > > > > Ira's lifetime model. > > > > > > > > Ira, you could try unraveling the special locking, that solves your > > > > lifetime issues? > > > > > > Yes I will try to prove this out... But I'm still not sure this fully solves > > > the problem. > > > > > > This only ensures that the process which has the RDMA context (RDMA FD) is safe > > > with regard to hanging the close for the "data file FD" (the file which has > > > pinned pages) in that _same_ process. But what about the scenario. > > > > > > Process A has the RDMA context FD and data file FD (with lease) open. > > > > > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B. > > > > Passing the RDMA context dependent on a file layout lease to another > > process that doesn't have a file layout lease or a reference to the > > original lease should be considered a violation of the layout lease. > > Process B does not have an active layout lease, and so by the rules > > of layout leases, it is not allowed to pin the layout of the file. > > > > I don't disagree with the semantics of this. I just don't know how to enforce > it. > > > > Process A attempts to exit (hangs because data file FD is pinned). > > > > > > Admin kills process A. kill works because we have allowed for it... > > > > > > Process B _still_ has the RDMA context FD open _and_ therefore still holds the > > > file pins. > > > > > > Truncation still fails. > > > > > > Admin does not know which process is holding the pin. > > > > > > What am I missing? > > > > Application does not hold the correct file layout lease references. > > Passing the fd via SCM_RIGHTS to a process without a layout lease > > is equivalent to not using layout leases in the first place. > > Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS > in this case? I'm ok with that but not sure how to implement it right now. > > To that end, I would like to simplify this slightly because I'm not convinced > that SCM_RIGHTS is a problem we need to solve right now. ie I don't know of a > user who wants to do this. I don't think we can support it, let alone want to. SCM_RIGHTS was a mistake made years ago that has been causing bugs and complexity to try and avoid those bugs ever since. I'm only taking about it because someone else raised it and I asummed they raised it because they want it to "work". > Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by > definition leases) exist underneath the "RDMA FD" (or other direct access FD, > like XDP etc) being duplicated. Sounds like a fine idea to me. Cheers, Dave.
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote: > On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote: > > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote: > > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote: > > > > > > > > Oh, I didn't think we were talking about that. Hanging the close of > > > > > the datafile fd contingent on some other FD's closure is a recipe for > > > > > deadlock.. > > > > > > > > The discussion between Jan and Dave was concerning what happens when a user > > > > calls > > > > > > > > fd = open() > > > > fnctl(...getlease...) > > > > addr = mmap(fd...) > > > > ib_reg_mr() <pin> > > > > munmap(addr...) > > > > close(fd) > > > > > > I don't see how blocking close(fd) could work. > > > > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I > > can't prove it won't.. > > Right, I proposed it as a possible way of making sure application > developers don't do this. It _could_ be made to work (e.g. recording > longterm page pins on the vma->file), but this is tangential to > the discussion of requiring active references to all resources > covered by the layout lease. > > I think allowing applications to behave like the above is simply > poor system level design, regardless of the interaction with > filesystems and layout leases. > > > Maybe we are all just touching a different part of this > > elephant[1] but the above scenario or one without munmap is very reasonably > > something a user would do. So we can either allow the close to complete (my > > current patches) or try to make it block like Dave is suggesting. My belief when writing the current series was that hanging the close would cause deadlock. But it seems I was wrong because of the delayed __fput(). So far, I have not been able to get RDMA to have an issue like Jason suggested would happen (or used to happen). So from that perspective it may be ok to hang the close. > > > > I don't disagree with Dave with the semantics being nice and clean for the > > filesystem. > > I'm not trying to make it "nice and clean for the filesystem". > > The problem is not just RDMA/DAX - anything that is directly > accessing the block device under the filesystem has the same set of > issues. That is, the filesystem controls the life cycle of the > blocks in the block device, so direct access to the blocks by any > means needs to be co-ordinated with the filesystem. Pinning direct > access to a file via page pins attached to a hardware context that > the filesystem knows nothing about is not an access model that the > filesystems can support. > > IOWs, anyone looking at this problem just from the RDMA POV of page > pins is not seeing all the other direct storage access mechainsms > that we need to support in the filesystems. RDMA on DAX is just one > of them. pNFS is another. Remote acces via NVMeOF is another. XDP > -> DAX (direct file data placement from the network hardware) is > another. There are /lots/ of different direct storage access > mechanisms that filesystems need to support and we sure as hell do > not want to have to support special case semantics for every single > one of them. My use of struct file was based on the fact that FDs are a primary interface for linux and my thought was that they would be more universal than having file pin information stored in an RDMA specific structure. XDP is not as direct; it uses sockets. But sockets also have a struct file which I believe could be used in a similar manner. I'm not 100% sure of the xdp_umem lifetime yet but it seems that my choice of using struct file was a good one in this respect. > > Hence if we don't start with a sane model for arbitrating direct > access to the storage at the filesystem level we'll never get this > stuff to work reliably, let alone work together coherently. An > application that wants a direct data path to storage should have a > single API that enables then to safely access the storage, > regardless of how they are accessing the storage. > > From that perspective, what we are talking about here with RDMA > doing "mmap, page pin, unmap, close" and "pass page pins via > SCM_RIGHTS" are fundamentally unworkable from the filesystem > perspective. They are use-after-free situations from the filesystem > perspective - they do not hold direct references to anything in the > filesystem, and so the filesytem is completely unaware of them. I see your point of view but looking at it from a different point of view I don't see this as a "use after free". The user has explicitly registered this memory (and layout) with another direct access subsystem (RDMA for example) so why do they need to keep the FD around? > > The filesystem needs to be aware of /all users/ of it's resources if > it's going to manage them sanely. From the way I look at it the underlying filesystem _is_ aware of the leases with my patch set. And so to is the user. It is just not through the original "data file fd". And the owner of the lease becomes the subsystem object ("RDMA FD" in this case) which is holding the pins. Furthermore, the lease is maintained and transferred automatically through the normal FD processing. (Furthermore, tracking of these pins is available for whatever subsystem by tracking them with struct file; _not_ just RDMA). When those subsystem objects are released the "data file lease" will be released as well. That was the design. > > > But the fact that RDMA, and potentially others, can "pass the > > pins" to other processes is something I spent a lot of time trying to work out. > > There's nothing in file layout lease architecture that says you > can't "pass the pins" to another process. All the file layout lease > requirements say is that if you are going to pass a resource for > which the layout lease guarantees access for to another process, > then the destination process already have a valid, active layout > lease that covers the range of the pins being passed to it via the > RDMA handle. > > i.e. as the pins pass from one process to another, they pass from > the protection of the lease process A holds to the protection that > the lease process B holds. This can probably even be done by > duplicating the lease fd and passing it by SCM_RIGHTS first..... My worry with this is how to enforce it. As I said in the other thread I think we could potentially block SCM_RIGHTS use in the short term. But I'm not sure about blocking every call which may "dup()" an FD to random processes. Ira
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote: > > > > > > But the fact that RDMA, and potentially others, can "pass the > > > > pins" to other processes is something I spent a lot of time trying to work out. > > > > > > There's nothing in file layout lease architecture that says you > > > can't "pass the pins" to another process. All the file layout lease > > > requirements say is that if you are going to pass a resource for > > > which the layout lease guarantees access for to another process, > > > then the destination process already have a valid, active layout > > > lease that covers the range of the pins being passed to it via the > > > RDMA handle. > > > > How would the kernel detect and enforce this? There are many ways to > > pass a FD. > > AFAIC, that's not really a kernel problem. It's more of an > application design constraint than anything else. i.e. if the app > passes the IB context to another process without a lease, then the > original process is still responsible for recalling the lease and > has to tell that other process to release the IB handle and it's > resources. > > > IMHO it is wrong to try and create a model where the file lease exists > > independently from the kernel object relying on it. In other words the > > IB MR object itself should hold a reference to the lease it relies > > upon to function properly. > > That still doesn't work. Leases are not individually trackable or > reference counted objects objects - they are attached to a struct > file bUt, in reality, they are far more restricted than a struct > file. > > That is, a lease specifically tracks the pid and the _open fd_ it > was obtained for, so it is essentially owned by a specific process > context. Hence a lease is not able to be passed to a separate > process context and have it still work correctly for lease break > notifications. i.e. the layout break signal gets delivered to > original process that created the struct file, if it still exists > and has the original fd still open. It does not get sent to the > process that currently holds a reference to the IB context. > The fcntl man page says: "Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed." From this I took it that the child process FD would have the lease as well _and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not seem to work the same way as dup() so I'm not so sure. Ira > > So while a struct file passed to another process might still have > an active lease, and you can change the owner of the struct file > via fcntl(F_SETOWN), you can't associate the existing lease with a > the new fd in the new process and so layout break signals can't be > directed at the lease fd.... > > This really means that a lease can only be owned by a single process > context - it can't be shared across multiple processes (so I was > wrong about dup/pass as being a possible way of passing them) > because there's only one process that can "own" a struct file, and > that where signals are sent when the lease needs to be broken. > > So, fundamentally, if you want to pass a resource that pins a file > layout between processes, both processes need to hold a layout lease > on that file range. And that means exclusive leases and passing > layouts between processes are fundamentally incompatible because you > can't hold two exclusive leases on the same file range.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote: > > > > > > > > But the fact that RDMA, and potentially others, can "pass the > > > > > pins" to other processes is something I spent a lot of time trying to work out. > > > > > > > > There's nothing in file layout lease architecture that says you > > > > can't "pass the pins" to another process. All the file layout lease > > > > requirements say is that if you are going to pass a resource for > > > > which the layout lease guarantees access for to another process, > > > > then the destination process already have a valid, active layout > > > > lease that covers the range of the pins being passed to it via the > > > > RDMA handle. > > > > > > How would the kernel detect and enforce this? There are many ways to > > > pass a FD. > > > > AFAIC, that's not really a kernel problem. It's more of an > > application design constraint than anything else. i.e. if the app > > passes the IB context to another process without a lease, then the > > original process is still responsible for recalling the lease and > > has to tell that other process to release the IB handle and it's > > resources. > > > > > IMHO it is wrong to try and create a model where the file lease exists > > > independently from the kernel object relying on it. In other words the > > > IB MR object itself should hold a reference to the lease it relies > > > upon to function properly. > > > > That still doesn't work. Leases are not individually trackable or > > reference counted objects objects - they are attached to a struct > > file bUt, in reality, they are far more restricted than a struct > > file. > > > > That is, a lease specifically tracks the pid and the _open fd_ it > > was obtained for, so it is essentially owned by a specific process > > context. Hence a lease is not able to be passed to a separate > > process context and have it still work correctly for lease break > > notifications. i.e. the layout break signal gets delivered to > > original process that created the struct file, if it still exists > > and has the original fd still open. It does not get sent to the > > process that currently holds a reference to the IB context. > > > > The fcntl man page says: > > "Leases are associated with an open file description (see open(2)). This means > that duplicate file descriptors (created by, for example, fork(2) or dup(2)) > refer to the same lease, and this lease may be modified or released using any > of these descriptors. Furthermore, the lease is released by either an > explicit F_UNLCK operation on any of these duplicate file descriptors, or when > all such file descriptors have been closed." Right, the lease is attached to the struct file, so it follows where-ever the struct file goes. That doesn't mean it's actually useful when the struct file is duplicated and/or passed to another process. :/ AFAICT, the problem is that when we take another reference to the struct file, or when the struct file is passed to a different process, nothing updates the lease or lease state attached to that struct file. > From this I took it that the child process FD would have the lease as well > _and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not > seem to work the same way as dup() so I'm not so sure. Sure, that part works because the struct file is passed. It doesn't end up with the same fd number in the other process, though. The issue is that layout leases need to notify userspace when they are broken by the kernel, so a lease stores the owner pid/tid in the file->f_owner field via __f_setown(). It also keeps a struct fasync attached to the file_lock that records the fd that the lease was created on. When a signal needs to be sent to userspace for that lease, we call kill_fasync() and that walks the list of fasync structures on the lease and calls: send_sigio(fown, fa->fa_fd, band); And it does for every fasync struct attached to a lease. Yes, a lease can track multiple fds, but it can only track them in a single process context. The moment the struct file is shared with another process, the lease is no longer capable of sending notifications to all the lease holders. Yes, you can change the owning process via F_SETOWNER, but that's still only a single process context, and you can't change the fd in the fasync list. You can add new fd to an existing lease by calling F_SETLEASE on the new fd, but you still only have a single process owner context for signal delivery. As such, leases that require callbacks to userspace are currently only valid within the process context the lease was taken in. Indeed, even closing the fd the lease was taken on without F_UNLCKing it first doesn't mean the lease has been torn down if there is some other reference to the struct file. That means the original lease owner will still get SIGIO delivered to that fd on a lease break regardless of whether it is open or not. ANd if we implement "layout lease not released within SIGIO response timeout" then that process will get killed, despite the fact it may not even have a reference to that file anymore. So, AFAICT, leases that require userspace callbacks only work within their original process context while they original fd is still open. Cheers, Dave.
On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote: > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > > > > > > > IMHO it is wrong to try and create a model where the file lease exists > > > > independently from the kernel object relying on it. In other words the > > > > IB MR object itself should hold a reference to the lease it relies > > > > upon to function properly. > > > > > > That still doesn't work. Leases are not individually trackable or > > > reference counted objects objects - they are attached to a struct > > > file bUt, in reality, they are far more restricted than a struct > > > file. > > > > > > That is, a lease specifically tracks the pid and the _open fd_ it > > > was obtained for, so it is essentially owned by a specific process > > > context. Hence a lease is not able to be passed to a separate > > > process context and have it still work correctly for lease break > > > notifications. i.e. the layout break signal gets delivered to > > > original process that created the struct file, if it still exists > > > and has the original fd still open. It does not get sent to the > > > process that currently holds a reference to the IB context. But this is an exclusive layout lease which does not send a signal. There is no way to break it. > > > > > > > The fcntl man page says: > > > > "Leases are associated with an open file description (see open(2)). This means > > that duplicate file descriptors (created by, for example, fork(2) or dup(2)) > > refer to the same lease, and this lease may be modified or released using any > > of these descriptors. Furthermore, the lease is released by either an > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when > > all such file descriptors have been closed." > > Right, the lease is attached to the struct file, so it follows > where-ever the struct file goes. That doesn't mean it's actually > useful when the struct file is duplicated and/or passed to another > process. :/ > > AFAICT, the problem is that when we take another reference to the > struct file, or when the struct file is passed to a different > process, nothing updates the lease or lease state attached to that > struct file. Ok, I probably should have made this more clear in the cover letter but _only_ the process which took the lease can actually pin memory. That pinned memory _can_ be passed to another process but those sub-process' can _not_ use the original lease to pin _more_ of the file. They would need to take their own lease to do that. Sorry for not being clear on that. > > > From this I took it that the child process FD would have the lease as well > > _and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not > > seem to work the same way as dup() so I'm not so sure. > > Sure, that part works because the struct file is passed. It doesn't > end up with the same fd number in the other process, though. > > The issue is that layout leases need to notify userspace when they > are broken by the kernel, so a lease stores the owner pid/tid in the > file->f_owner field via __f_setown(). It also keeps a struct fasync > attached to the file_lock that records the fd that the lease was > created on. When a signal needs to be sent to userspace for that > lease, we call kill_fasync() and that walks the list of fasync > structures on the lease and calls: > > send_sigio(fown, fa->fa_fd, band); > > And it does for every fasync struct attached to a lease. Yes, a > lease can track multiple fds, but it can only track them in a single > process context. The moment the struct file is shared with another > process, the lease is no longer capable of sending notifications to > all the lease holders. > > Yes, you can change the owning process via F_SETOWNER, but that's > still only a single process context, and you can't change the fd in > the fasync list. You can add new fd to an existing lease by calling > F_SETLEASE on the new fd, but you still only have a single process > owner context for signal delivery. > > As such, leases that require callbacks to userspace are currently > only valid within the process context the lease was taken in. But for long term pins we are not requiring callbacks. > Indeed, even closing the fd the lease was taken on without > F_UNLCKing it first doesn't mean the lease has been torn down if > there is some other reference to the struct file. That means the > original lease owner will still get SIGIO delivered to that fd on a > lease break regardless of whether it is open or not. ANd if we > implement "layout lease not released within SIGIO response timeout" > then that process will get killed, despite the fact it may not even > have a reference to that file anymore. I'm not seeing that as a problem. This is all a result of the application failing to do the right thing. The code here is simply keeping the kernel consistent and safe so that an admin or the user themselves can unwind the badness without damage to the file system. > > So, AFAICT, leases that require userspace callbacks only work within > their original process context while they original fd is still open. But they _work_ IFF the application actually expects to do something with the SIGIO. The application could just as well chose to ignore the SIGIO without closing the FD which would do the same thing. If the application expected to do something with the SIGIO but closed the FD then it's really just the applications fault. So after thinking on this for a day I don't think we have a serious issue. Even the "zombie" lease is just an application error and it is already possible to get something like this. If the application passes the FD to another process and closes their FD then SIGIO's don't get delivered but there is a lease hanging off the struct file until it is destroyed. No harm, no foul. In the case of close it is _not_ true that users don't have a way to release the lease. It is just that they can't call F_UNLCK to do so. Once they have "zombie'ed" the lease (again an application error) the only recourse is to unpin the file through the subsystem which pinned the page. Probably through killing the process. Ira
On 8/28/19 7:02 PM, Ira Weiny wrote: > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote: >> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: >>> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: >>>> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: ... >> >> Sure, that part works because the struct file is passed. It doesn't >> end up with the same fd number in the other process, though. >> >> The issue is that layout leases need to notify userspace when they >> are broken by the kernel, so a lease stores the owner pid/tid in the >> file->f_owner field via __f_setown(). It also keeps a struct fasync >> attached to the file_lock that records the fd that the lease was >> created on. When a signal needs to be sent to userspace for that >> lease, we call kill_fasync() and that walks the list of fasync >> structures on the lease and calls: >> >> send_sigio(fown, fa->fa_fd, band); >> >> And it does for every fasync struct attached to a lease. Yes, a >> lease can track multiple fds, but it can only track them in a single >> process context. The moment the struct file is shared with another >> process, the lease is no longer capable of sending notifications to >> all the lease holders. >> >> Yes, you can change the owning process via F_SETOWNER, but that's >> still only a single process context, and you can't change the fd in >> the fasync list. You can add new fd to an existing lease by calling >> F_SETLEASE on the new fd, but you still only have a single process >> owner context for signal delivery. >> >> As such, leases that require callbacks to userspace are currently >> only valid within the process context the lease was taken in. > > But for long term pins we are not requiring callbacks. > Hi Ira, If "require callbacks to userspace" means sending SIGIO, then actually FOLL_LONGTERM *does* require those callbacks. Because we've been, so far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease. What am I missing here? thanks,
On Wed, Aug 28, 2019 at 08:27:23PM -0700, John Hubbard wrote: > On 8/28/19 7:02 PM, Ira Weiny wrote: > > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote: > > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: > > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > ... > > > > > > Sure, that part works because the struct file is passed. It doesn't > > > end up with the same fd number in the other process, though. > > > > > > The issue is that layout leases need to notify userspace when they > > > are broken by the kernel, so a lease stores the owner pid/tid in the > > > file->f_owner field via __f_setown(). It also keeps a struct fasync > > > attached to the file_lock that records the fd that the lease was > > > created on. When a signal needs to be sent to userspace for that > > > lease, we call kill_fasync() and that walks the list of fasync > > > structures on the lease and calls: > > > > > > send_sigio(fown, fa->fa_fd, band); > > > > > > And it does for every fasync struct attached to a lease. Yes, a > > > lease can track multiple fds, but it can only track them in a single > > > process context. The moment the struct file is shared with another > > > process, the lease is no longer capable of sending notifications to > > > all the lease holders. > > > > > > Yes, you can change the owning process via F_SETOWNER, but that's > > > still only a single process context, and you can't change the fd in > > > the fasync list. You can add new fd to an existing lease by calling > > > F_SETLEASE on the new fd, but you still only have a single process > > > owner context for signal delivery. > > > > > > As such, leases that require callbacks to userspace are currently > > > only valid within the process context the lease was taken in. > > > > But for long term pins we are not requiring callbacks. > > > > Hi Ira, > > If "require callbacks to userspace" means sending SIGIO, then actually > FOLL_LONGTERM *does* require those callbacks. Because we've been, so > far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease. > > What am I missing here? We agreed back in June that the layout lease would have 2 "levels". The "normal" layout lease would cause SIGIO and could be broken and another "exclusive" level which could _not_ be broken. Because we _can't_ _trust_ user space to react to the SIGIO properly the "exclusive" lease is required to take the longterm pins. Also this is the lease which causes the truncate to fail (return ETXTBSY) because the kernel can't break the lease. The vaddr_pin struct in the current RFC is there for a couple of reasons. 1) To ensure that we have a way to correlate the long term pin user with the file if the data file FD's are closed. (ie the application has zombie'd the lease). 2) And more importantly as a token the vaddr_pin*() callers use to be able to properly ref count the file itself while in use. Ira
On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote: > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote: > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > > > "Leases are associated with an open file description (see open(2)). This means > > > that duplicate file descriptors (created by, for example, fork(2) or dup(2)) > > > refer to the same lease, and this lease may be modified or released using any > > > of these descriptors. Furthermore, the lease is released by either an > > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when > > > all such file descriptors have been closed." > > > > Right, the lease is attached to the struct file, so it follows > > where-ever the struct file goes. That doesn't mean it's actually > > useful when the struct file is duplicated and/or passed to another > > process. :/ > > > > AFAICT, the problem is that when we take another reference to the > > struct file, or when the struct file is passed to a different > > process, nothing updates the lease or lease state attached to that > > struct file. > > Ok, I probably should have made this more clear in the cover letter but _only_ > the process which took the lease can actually pin memory. Sure, no question about that. > That pinned memory _can_ be passed to another process but those sub-process' can > _not_ use the original lease to pin _more_ of the file. They would need to > take their own lease to do that. Yes, they would need a new lease to extend it. But that ignores the fact they don't have a lease on the existing pins they are using and have no control over the lease those pins originated under. e.g. the originating process dies (for whatever reason) and now we have pins without a valid lease holder. If something else now takes an exclusive lease on the file (because the original exclusive lease no longer exists), it's not going to work correctly because of the zombied page pins caused by closing the exclusive lease they were gained under. IOWs, pages pinned under an exclusive lease are no longer "exclusive" the moment the original exclusive lease is dropped, and pins passed to another process are no longer covered by the original lease they were created under. > Sorry for not being clear on that. I know exactly what you are saying. What I'm failing to get across is that file layout leases don't actually allow the behaviour you want to have. > > As such, leases that require callbacks to userspace are currently > > only valid within the process context the lease was taken in. > > But for long term pins we are not requiring callbacks. Regardless, we still require an active lease for long term pins so that other lease holders fail operations appropriately. And that exclusive lease must follow the process that pins the pages so that the life cycle is the same... > > Indeed, even closing the fd the lease was taken on without > > F_UNLCKing it first doesn't mean the lease has been torn down if > > there is some other reference to the struct file. That means the > > original lease owner will still get SIGIO delivered to that fd on a > > lease break regardless of whether it is open or not. ANd if we > > implement "layout lease not released within SIGIO response timeout" > > then that process will get killed, despite the fact it may not even > > have a reference to that file anymore. > > I'm not seeing that as a problem. This is all a result of the application > failing to do the right thing. How is that not a problem? Cheers, Dave.
On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote: > On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote: > > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote: > > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote: > > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote: > > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote: > > > > "Leases are associated with an open file description (see open(2)). This means > > > > that duplicate file descriptors (created by, for example, fork(2) or dup(2)) > > > > refer to the same lease, and this lease may be modified or released using any > > > > of these descriptors. Furthermore, the lease is released by either an > > > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when > > > > all such file descriptors have been closed." > > > > > > Right, the lease is attached to the struct file, so it follows > > > where-ever the struct file goes. That doesn't mean it's actually > > > useful when the struct file is duplicated and/or passed to another > > > process. :/ > > > > > > AFAICT, the problem is that when we take another reference to the > > > struct file, or when the struct file is passed to a different > > > process, nothing updates the lease or lease state attached to that > > > struct file. > > > > Ok, I probably should have made this more clear in the cover letter but _only_ > > the process which took the lease can actually pin memory. > > Sure, no question about that. > > > That pinned memory _can_ be passed to another process but those sub-process' can > > _not_ use the original lease to pin _more_ of the file. They would need to > > take their own lease to do that. > > Yes, they would need a new lease to extend it. But that ignores the > fact they don't have a lease on the existing pins they are using and > have no control over the lease those pins originated under. e.g. > the originating process dies (for whatever reason) and now we have > pins without a valid lease holder. Define "valid lease holder"? > > If something else now takes an exclusive lease on the file (because > the original exclusive lease no longer exists), it's not going to > work correctly because of the zombied page pins caused by closing > the exclusive lease they were gained under. IOWs, pages pinned under > an exclusive lease are no longer "exclusive" the moment the original > exclusive lease is dropped, and pins passed to another process are > no longer covered by the original lease they were created under. The page pins are not zombied the lease is. The lease still exists, it can't be dropped while the pins are in place. I need to double check the implementation but that was the intent. Yep just did a quick check, I have a test for that. If the page pins exist then the lease can _not_ be released. Closing the FD will "zombie" the lease but it and the struct file will still exist until the pins go away. Furthermore, a "zombie" lease is _not_ sufficient to pin more pages. (I have a test for this too.) I apologize that I don't have something to submit to xfstests. I'm new to that code base. I'm happy to share the code I have which I've been using to test... But it is pretty rough as it has undergone a number of changes. I think it would be better to convert my test series to xfstests. However, I don't know if it is ok to require RDMA within those tests. Right now that is the only sub-system I have allowed to create these page pins. So I'm not sure what to do at this time. I'm open to suggestions. > > > Sorry for not being clear on that. > > I know exactly what you are saying. What I'm failing to get across > is that file layout leases don't actually allow the behaviour you > want to have. Not currently, no. But we are discussing the semantics to allow them _to_ have the behavior needed. > > > > As such, leases that require callbacks to userspace are currently > > > only valid within the process context the lease was taken in. > > > > But for long term pins we are not requiring callbacks. > > Regardless, we still require an active lease for long term pins so > that other lease holders fail operations appropriately. And that > exclusive lease must follow the process that pins the pages so that > the life cycle is the same... I disagree. See below. > > > > Indeed, even closing the fd the lease was taken on without > > > F_UNLCKing it first doesn't mean the lease has been torn down if > > > there is some other reference to the struct file. That means the > > > original lease owner will still get SIGIO delivered to that fd on a > > > lease break regardless of whether it is open or not. ANd if we > > > implement "layout lease not released within SIGIO response timeout" > > > then that process will get killed, despite the fact it may not even > > > have a reference to that file anymore. > > > > I'm not seeing that as a problem. This is all a result of the application > > failing to do the right thing. > > How is that not a problem? The application has taken an exclusive lease and they don't have to let it go. IOW, there is little difference between the application closing the FD and creating a zombie lease vs keeping the FD open with a real lease. Because no SIGIO is sent and there is no need to react to it anyway as the intention is to keep the lease active and the layout pinned "indefinitely". Furthermore, in both cases the admin must kill the application to change the layout forcibly. Basically applications don't _have_ to do the right thing but the kernel and the filesystem is still protected while the admin has a way to correct the situation given a bad application. Therefore, from the POV of the kernel and file system I don't see a problem. Ira
From: Ira Weiny <ira.weiny@intel.com> Pre-requisites ============== Based on mmotm tree. Based on the feedback from LSFmm, the LWN article, the RFC series since then, and a ton of scenarios I've worked in my mind and/or tested...[1] Solution summary ================ The real issue is that there is no use case for a user to have RDMA pinn'ed memory which is then truncated. So really any solution we present which: A) Prevents file system corruption or data leaks ...and... B) Informs the user that they did something wrong Should be an acceptable solution. Because this is slightly new behavior. And because this is going to be specific to DAX (because of the lack of a page cache) we have made the user "opt in" to this behavior. The following patches implement the following solution. 0) Registrations to Device DAX char devs are not affected 1) The user has to opt in to allowing page pins on a file with an exclusive layout lease. Both exclusive and layout lease flags are user visible now. 2) page pins will fail if the lease is not active when the file back page is encountered. 3) Any truncate or hole punch operation on a pinned DAX page will fail. 4) The user has the option of holding the lease or releasing it. If they release it no other pin calls will work on the file. 5) Closing the file is ok. 6) Unmapping the file is ok 7) Pins against the files are tracked back to an owning file or an owning mm depending on the internal subsystem needs. With RDMA there is an owning file which is related to the pined file. 8) Only RDMA is currently supported 9) Truncation of pages which are not actively pinned nor covered by a lease will succeed. Reporting of pinned files in procfs =================================== A number of alternatives were explored for how to report the file pins within procfs. The following incorporates ideas from Jan Kara, Jason Gunthorpe, Dave Chinner, Dan Williams and myself. A new entry is added to procfs /proc/<pid>/file_pins For processes which have pinned DAX file memory file_pins reference come in 2 flavors. Those which are attached to another open file descriptor (For example what is done in the RDMA subsytem) and those which are attached to a process mm. For those which are attached to another open file descriptor (such as RDMA) the file pin references go through the 'struct file' associated with that pin. In RDMA this is the RDMA context struct file. The resulting output from proc fs is something like. $ cat /proc/<pid>/file_pins 3: /dev/infiniband/uverbs0 /mnt/pmem/foo Where '3' is the file descriptor (and file path) of the rdma context within the process. The paths of the files pinned using that context are then listed. RDMA contexts may have multiple MR each of which may have multiple files pinned within them. So an output like the following is possible. $ cat /proc/<pid>/file_pins 4: /dev/infiniband/uverbs0 /mnt/pmem/foo /mnt/pmem/bar /mnt/pmem/another /mnt/pmem/one The actual memory regions associated with the file pins are not reported. For processes which are pinning memory which is not associated with a specific file descriptor memory pins are reported directly as paths to the file. $ cat /proc/<pid>/file_pins /mnt/pmem/foo Putting the above together if a process was using RDMA and another subsystem the output could be something like: $ cat /proc/<pid>/file_pins 4: /dev/infiniband/uverbs0 /mnt/pmem/foo /mnt/pmem/bar /mnt/pmem/another /mnt/pmem/one /mnt/pmem/foo /mnt/pmem/another /mnt/pmem/mm_mapped_file [1] https://lkml.org/lkml/2019/6/5/1046 Background ========== It should be noted that one solution for this problem is to use RDMA's On Demand Paging (ODP). There are 2 big reasons this may not work. 1) The hardware being used for RDMA may not support ODP 2) ODP may be detrimental to the over all network (cluster or cloud) performance Therefore, in order to support RDMA to File system pages without On Demand Paging (ODP) a number of things need to be done. 1) "longterm" GUP users need to inform other subsystems that they have taken a pin on a page which may remain pinned for a very "long time". The definition of long time is debatable but it has been established that RDMAs use of pages for, minutes, hours, or even days after the pin is the extreme case which makes this problem most severe. 2) Any page which is "controlled" by a file system needs to have special handling. The details of the handling depends on if the page is page cache fronted or not. 2a) A page cache fronted page which has been pinned by GUP long term can use a bounce buffer to allow the file system to write back snap shots of the page. This is handled by the FS recognizing the GUP long term pin and making a copy of the page to be written back. NOTE: this patch set does not address this path. 2b) A FS "controlled" page which is not page cache fronted is either easier to deal with or harder depending on the operation the filesystem is trying to do. 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the FS can no longer use the pages in question until the pin has been removed. This patch set presents a solution to this by introducing some reasonable restrictions on user space applications. 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch then there is nothing which need be done. Data is Read or Written directly to the page. This is an easy case which would currently work if not for GUP long term pins being disabled. Therefore this patch set need not change access to the file data but does allow for GUP pins after 2ba above is dealt with. This patch series and presents a solution for problem 2ba) Ira Weiny (19): fs/locks: Export F_LAYOUT lease to user space fs/locks: Add Exclusive flag to user Layout lease mm/gup: Pass flags down to __gup_device_huge* calls mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages fs/ext4: Teach ext4 to break layout leases fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range fs/xfs: Teach xfs to use new dax_layout_busy_page() fs/xfs: Fail truncate if page lease can't be broken mm/gup: Introduce vaddr_pin structure mm/gup: Pass a NULL vaddr_pin through GUP fast mm/gup: Pass follow_page_context further down the call stack mm/gup: Prep put_user_pages() to take an vaddr_pin struct {mm,file}: Add file_pins objects fs/locks: Associate file pins while performing GUP mm/gup: Introduce vaddr_pin_pages() RDMA/uverbs: Add back pointer to system file object RDMA/umem: Convert to vaddr_[pin|unpin]* operations. {mm,procfs}: Add display file_pins proc mm/gup: Remove FOLL_LONGTERM DAX exclusion drivers/infiniband/core/umem.c | 26 +- drivers/infiniband/core/umem_odp.c | 16 +- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_main.c | 1 + fs/Kconfig | 1 + fs/dax.c | 38 ++- fs/ext4/ext4.h | 2 +- fs/ext4/extents.c | 6 +- fs/ext4/inode.c | 26 +- fs/file_table.c | 4 + fs/locks.c | 291 +++++++++++++++++- fs/proc/base.c | 214 +++++++++++++ fs/xfs/xfs_file.c | 21 +- fs/xfs/xfs_inode.h | 5 +- fs/xfs/xfs_ioctl.c | 15 +- fs/xfs/xfs_iops.c | 14 +- include/linux/dax.h | 12 +- include/linux/file.h | 49 +++ include/linux/fs.h | 5 +- include/linux/huge_mm.h | 17 -- include/linux/mm.h | 69 +++-- include/linux/mm_types.h | 2 + include/rdma/ib_umem.h | 2 +- include/uapi/asm-generic/fcntl.h | 5 + kernel/fork.c | 3 + mm/gup.c | 418 ++++++++++++++++---------- mm/huge_memory.c | 18 +- mm/internal.h | 28 ++ 28 files changed, 1048 insertions(+), 261 deletions(-)