diff mbox series

[v5,3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA

Message ID 20240808183340.483468-4-martin.oliveira@eideticom.com (mailing list archive)
State New
Headers show
Series Enable P2PDMA in Userspace RDMA | expand

Commit Message

Martin Oliveira Aug. 8, 2024, 6:33 p.m. UTC
This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).

The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.

The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>

[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
 mm/gup.c | 5 -----
 1 file changed, 5 deletions(-)

Comments

Christoph Hellwig Aug. 12, 2024, 6:39 a.m. UTC | #1
On Thu, Aug 08, 2024 at 12:33:39PM -0600, Martin Oliveira wrote:
> This check existed originally due to concerns that P2PDMA needed to copy
> fsdax until pgmap refcounts were fixed (see [1]).
> 
> The P2PDMA infrastructure will only call unmap_mapping_range() when the
> underlying device is unbound, and immediately after unmapping it waits
> for the reference of all ZONE_DEVICE pages to be released before
> continuing. This does not allow for a page to be reused and no user
> access fault is therefore possible. It does not have the same problem as
> fsdax.
> 
> The one minor concern with FOLL_LONGTERM pins is they will block device
> unbind until userspace releases them all.

This is unfortunately not really minor unless we have a well documented
way to force this :(
Jason Gunthorpe Aug. 12, 2024, 11:12 p.m. UTC | #2
On Sun, Aug 11, 2024 at 11:39:22PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 08, 2024 at 12:33:39PM -0600, Martin Oliveira wrote:
> > This check existed originally due to concerns that P2PDMA needed to copy
> > fsdax until pgmap refcounts were fixed (see [1]).
> > 
> > The P2PDMA infrastructure will only call unmap_mapping_range() when the
> > underlying device is unbound, and immediately after unmapping it waits
> > for the reference of all ZONE_DEVICE pages to be released before
> > continuing. This does not allow for a page to be reused and no user
> > access fault is therefore possible. It does not have the same problem as
> > fsdax.
> > 
> > The one minor concern with FOLL_LONGTERM pins is they will block device
> > unbind until userspace releases them all.
> 
> This is unfortunately not really minor unless we have a well documented
> way to force this :(

It is not that different from blocking driver unbind while FDs are
open which a lot of places do in various ways?

Jason
Christoph Hellwig Aug. 13, 2024, 5:41 a.m. UTC | #3
On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > This is unfortunately not really minor unless we have a well documented
> > way to force this :(
> 
> It is not that different from blocking driver unbind while FDs are
> open which a lot of places do in various ways?

Where do we block driver unbind with an open resource?  The whole
concept is that open resources will pin the in-memory object (and
modulo for a modular driver), but never an unbind or hardware
unplug, of which unbind really just is a simulation.
Jason Gunthorpe Aug. 13, 2024, 4:05 p.m. UTC | #4
On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > This is unfortunately not really minor unless we have a well documented
> > > way to force this :(
> > 
> > It is not that different from blocking driver unbind while FDs are
> > open which a lot of places do in various ways?
> 
> Where do we block driver unbind with an open resource?  

I keep seeing it in different subsystems, safe driver unbind is really
hard. :\ eg I think VFIO has some waits in it

> The whole concept is that open resources will pin the in-memory
> object (and modulo for a modular driver), but never an unbind or
> hardware unplug, of which unbind really just is a simulation.

Yes, ideally, but not every part of the kernel hits that ideal in my
experience. It is alot of work and some places don't have any good
solutions, like here.

Jason
Dan Williams Aug. 13, 2024, 5:03 p.m. UTC | #5
Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> > On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > > This is unfortunately not really minor unless we have a well documented
> > > > way to force this :(
> > > 
> > > It is not that different from blocking driver unbind while FDs are
> > > open which a lot of places do in various ways?
> > 
> > Where do we block driver unbind with an open resource?  
> 
> I keep seeing it in different subsystems, safe driver unbind is really
> hard. :\ eg I think VFIO has some waits in it
> 
> > The whole concept is that open resources will pin the in-memory
> > object (and modulo for a modular driver), but never an unbind or
> > hardware unplug, of which unbind really just is a simulation.
> 
> Yes, ideally, but not every part of the kernel hits that ideal in my
> experience. It is alot of work and some places don't have any good
> solutions, like here.

...but there is a distinction between transient and permanent waits,
right? The difficult aspect of FOLL_LONGTERM is the holder has no idea
someone is trying to cleanup and may never drop its pin.
Jason Gunthorpe Aug. 14, 2024, 12:02 a.m. UTC | #6
On Tue, Aug 13, 2024 at 10:03:55AM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> > > On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > > > This is unfortunately not really minor unless we have a well documented
> > > > > way to force this :(
> > > > 
> > > > It is not that different from blocking driver unbind while FDs are
> > > > open which a lot of places do in various ways?
> > > 
> > > Where do we block driver unbind with an open resource?  
> > 
> > I keep seeing it in different subsystems, safe driver unbind is really
> > hard. :\ eg I think VFIO has some waits in it
> > 
> > > The whole concept is that open resources will pin the in-memory
> > > object (and modulo for a modular driver), but never an unbind or
> > > hardware unplug, of which unbind really just is a simulation.
> > 
> > Yes, ideally, but not every part of the kernel hits that ideal in my
> > experience. It is alot of work and some places don't have any good
> > solutions, like here.
> 
> ...but there is a distinction between transient and permanent waits,
> right? The difficult aspect of FOLL_LONGTERM is the holder has no idea
> someone is trying to cleanup and may never drop its pin.

It is the quite similar to userspace holding a FD open while a driver
is trying to unbind. The FD holder has possibly no idea things are
waiting on it.

Nice subsystems allow the FD to keep existing while the driver is
unplugged, but still many have to wait for the FD to close as
disconnecting an active driver from it's FD requires some pretty
careful design.

Jason
Christoph Hellwig Aug. 14, 2024, 4:26 a.m. UTC | #7
On Tue, Aug 13, 2024 at 01:05:02PM -0300, Jason Gunthorpe wrote:
> > Where do we block driver unbind with an open resource?  
> 
> I keep seeing it in different subsystems, safe driver unbind is really
> hard. :\ eg I think VFIO has some waits in it

Well, waits for what?  Usuaully an unbind has to wait for something,
but waiting for unbound references is always a bug.  And I can't
think of any sensible subsystem doing this.
Jason Gunthorpe Aug. 15, 2024, 4:38 p.m. UTC | #8
On Tue, Aug 13, 2024 at 09:26:55PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 13, 2024 at 01:05:02PM -0300, Jason Gunthorpe wrote:
> > > Where do we block driver unbind with an open resource?  
> > 
> > I keep seeing it in different subsystems, safe driver unbind is really
> > hard. :\ eg I think VFIO has some waits in it
> 
> Well, waits for what? 

Looks like it waits for the fds to close at least. It has weird
handshake where it notifies userspace that the device is closing and
the userspace needs to close the fd. See vfio_unregister_group_dev()

In the past VFIO couldn't do anything else because it had VMAs open
that point to the device's mmio and it would be wrong to unbind the
driver and leave those uncleaned. VFIO now knows how to zap VMA so
maybe this could be improved, but it looks like a sizable uplift.

> Usuaully an unbind has to wait for something, but waiting for
> unbound references is always a bug.  And I can't think of any
> sensible subsystem doing this.

I've seen several cases over the years.. It can be hard in many cases
to deal with the issue. Like the VMA's above for instance. rdma also
had to learn how to do zap before it could do fast unbind.

Some places just have bugs where they don't wait and Weird Stuff
happens. There is a strange misconception that module refcounts
protect against unbind :\

Not saying this is good design, or desirable, just pointing out we
seem to tolerate this in enough other places.

In this case the only resolution I could think of would be to have the
rdma stack somehow register a notifier on the pgmap. Then we could
revoke the RDMA access synchronously during destruction and return
those refcounts.

That does seems a bit complicated though..

Jason
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3831fb..df267b7890aa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2547,11 +2547,6 @@  static bool is_valid_gup_args(struct page **pages, int *locked,
 	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
 		return false;
 
-	/* We want to allow the pgmap to be hot-unplugged at all times */
-	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
-			 (gup_flags & FOLL_PCI_P2PDMA)))
-		return false;
-
 	*gup_flags_p = gup_flags;
 	return true;
 }