
vduse: avoid using __GFP_NOFAIL

Message ID 20240805082106.65847-1-jasowang@redhat.com (mailing list archive)
State New
Series vduse: avoid using __GFP_NOFAIL

Commit Message

Jason Wang Aug. 5, 2024, 8:21 a.m. UTC
Barry said [1]:

"""
mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
__GFP_NOFAIL without direct reclamation may just result in a busy
loop within non-sleepable contexts.
"""

Unfortunately, we do that under a read lock. A possible way to fix that
is to move the page allocation out of the lock into the caller, but
having to allocate a huge number of pages and an auxiliary page array
seems to be problematic as well per Tetsuo [2]:

"""
You should implement proper error handling instead of using
__GFP_NOFAIL if count can become large.
"""

So I chose another way, which does not release the kernel bounce pages
when the user tries to register userspace bounce pages. Then we don't
need to do allocation in a path that is not expected to fail (e.g. in
the release). We pay for this with more memory usage, but further
optimizations could be done on top.

[1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
[2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480

Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
 drivers/vdpa/vdpa_user/iova_domain.h |  1 +
 2 files changed, 11 insertions(+), 8 deletions(-)

Comments

Jason Wang Aug. 5, 2024, 8:23 a.m. UTC | #1
On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---

Note for Yongji:

I can only test it without userspace bounce pages, as neither DPDK nor
libvduse users use that. Would you like to review and test this?

Thanks
Michael S. Tsirkin Aug. 5, 2024, 8:25 a.m. UTC | #2
On Mon, Aug 05, 2024 at 04:21:06PM +0800, Jason Wang wrote:
> Barry said [1]:
> 
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
> 
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
> 
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
> 
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need

userspace

> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further

what does "we pay this for more memory usage" mean?
Do you mean "we pay for this by using more memory"?
How much more?

> optimizations could be done on top.
> 
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> 
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
>  drivers/vdpa/vdpa_user/iova_domain.h |  1 +
>  2 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..933d2f7cd49a 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>  				enum dma_data_direction dir)
>  {
>  	struct vduse_bounce_map *map;
> +	struct page *page;
>  	unsigned int offset;
>  	void *addr;
>  	size_t sz;
> @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>  			    map->orig_phys == INVALID_PHYS_ADDR))
>  			return;
>  
> -		addr = kmap_local_page(map->bounce_page);
> +		page = domain->user_bounce_pages ?
> +		       map->user_bounce_page : map->bounce_page;
> +
> +		addr = kmap_local_page(page);
>  		do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
>  		kunmap_local(addr);
>  		size -= sz;
> @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
>  				memcpy_to_page(pages[i], 0,
>  					       page_address(map->bounce_page),
>  					       PAGE_SIZE);
> -			__free_page(map->bounce_page);
>  		}
> -		map->bounce_page = pages[i];
> +		map->user_bounce_page = pages[i];
>  		get_page(pages[i]);
>  	}
>  	domain->user_bounce_pages = true;
> @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
>  		struct page *page = NULL;
>  
>  		map = &domain->bounce_maps[i];
> -		if (WARN_ON(!map->bounce_page))
> +		if (WARN_ON(!map->user_bounce_page))
>  			continue;
>  
>  		/* Copy user page to kernel page if it's in use */
>  		if (map->orig_phys != INVALID_PHYS_ADDR) {
> -			page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> +			page = map->bounce_page;
>  			memcpy_from_page(page_address(page),
> -					 map->bounce_page, 0, PAGE_SIZE);
> +					 map->user_bounce_page, 0, PAGE_SIZE);
>  		}
> -		put_page(map->bounce_page);
> -		map->bounce_page = page;
> +		put_page(map->user_bounce_page);
>  	}
>  	domain->user_bounce_pages = false;
>  out:
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index f92f22a7267d..7f3f0928ec78 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -21,6 +21,7 @@
>  
>  struct vduse_bounce_map {
>  	struct page *bounce_page;
> +	struct page *user_bounce_page;
>  	u64 orig_phys;
>  };
>  
> -- 
> 2.31.1
Yongji Xie Aug. 5, 2024, 10:42 a.m. UTC | #3
On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > Barry said [1]:
> >
> > """
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> > ""“
> >
> > Unfortuantely, we do that under read lock. A possible way to fix that
> > is to move the pages allocation out of the lock into the caller, but
> > having to allocate a huge number of pages and auxiliary page array
> > seems to be problematic as well per Tetsuon [2]:
> >
> > """
> > You should implement proper error handling instead of using
> > __GFP_NOFAIL if count can become large.
> > """
> >

I think the problem is that it's hard to do the error handling in
fops->release() currently.

So can we temporarily hold the user page refcount and release it when
vduse_dev_open()/vduse_domain_release() is executed? The kernel page
allocation and memcpy can then be done in vduse_dev_open(), which allows
some error handling.
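
As a rough sketch of that direction (assuming the registered user page is
still remembered in something like map->user_bounce_page; the helper name,
loop bound and partial-failure cleanup below are illustrative only, not a
tested implementation):

	/* Called from vduse_dev_open(): sleepable, so allocation failures
	 * can be reported back to userspace instead of using __GFP_NOFAIL. */
	static int vduse_domain_restore_kernel_bounce_pages(struct vduse_iova_domain *domain)
	{
		unsigned long pfn, bounce_pfns = domain->bounce_size >> PAGE_SHIFT;

		for (pfn = 0; pfn < bounce_pfns; pfn++) {
			struct vduse_bounce_map *map = &domain->bounce_maps[pfn];
			struct page *page;

			if (!map->user_bounce_page)
				continue;

			page = alloc_page(GFP_KERNEL);
			if (!page)
				return -ENOMEM;	/* cleanup of earlier pages omitted here */

			/* Preserve the data of pages that are still in use */
			if (map->orig_phys != INVALID_PHYS_ADDR)
				memcpy_from_page(page_address(page),
						 map->user_bounce_page, 0, PAGE_SIZE);

			map->bounce_page = page;
			put_page(map->user_bounce_page);
			map->user_bounce_page = NULL;
		}

		return 0;
	}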

> > So I choose another way, which does not release kernel bounce pages
> > when user tries to register usersapce bounce pages. Then we don't need
> > to do allocation in the path which is not expected to be fail (e.g in
> > the release). We pay this for more memory usage but further
> > optimizations could be done on top.
> >
> > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> >
> > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
>
> Note for YongJi:
>
> I can only test it without usersapce bounce pages as neither DPDK nor
> libvduse users use that. Would you want to review and have a test for
> this?
>

I can do some tests for it.

Thanks,
Yongji
Jason Wang Aug. 6, 2024, 2:26 a.m. UTC | #4
On Mon, Aug 5, 2024 at 4:26 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Aug 05, 2024 at 04:21:06PM +0800, Jason Wang wrote:
> > Barry said [1]:
> >
> > """
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> > ""“
> >
> > Unfortuantely, we do that under read lock. A possible way to fix that
> > is to move the pages allocation out of the lock into the caller, but
> > having to allocate a huge number of pages and auxiliary page array
> > seems to be problematic as well per Tetsuon [2]:
> >
> > """
> > You should implement proper error handling instead of using
> > __GFP_NOFAIL if count can become large.
> > """
> >
> > So I choose another way, which does not release kernel bounce pages
> > when user tries to register usersapce bounce pages. Then we don't need
>
> userspace
>
> > to do allocation in the path which is not expected to be fail (e.g in
> > the release). We pay this for more memory usage but further
>
> what does "we pay this for more memory usage" mean?
> Do you mean "we pay for this by using more memory"?

Yes.

> How much more?

It depends on the workload, but basically it's bounded by the size of the
bounce buffer:

Default size 64M

#define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)

Maximum 1G:

#define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
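
As a rough upper bound (assuming 4 KiB pages), keeping the kernel bounce
pages resident while userspace bounce pages are registered costs at most:

    64 MiB / 4 KiB = 16,384 extra pages   (default bounce size)
    1 GiB  / 4 KiB = 262,144 extra pages  (maximum bounce size)

on top of the userspace bounce buffer itself.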

Thanks
Jason Wang Aug. 6, 2024, 2:28 a.m. UTC | #5
On Mon, Aug 5, 2024 at 6:42 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > Barry said [1]:
> > >
> > > """
> > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > loop within non-sleepable contexts.
> > > ""“
> > >
> > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > is to move the pages allocation out of the lock into the caller, but
> > > having to allocate a huge number of pages and auxiliary page array
> > > seems to be problematic as well per Tetsuon [2]:
> > >
> > > """
> > > You should implement proper error handling instead of using
> > > __GFP_NOFAIL if count can become large.
> > > """
> > >
>
> I think the problem is it's hard to do the error handling in
> fops->release() currently.

vduse_dev_dereg_umem() has the same problem; it's very hard to allow it to fail.

>
> So can we temporarily hold the user page refcount, and release it when
> vduse_dev_open()/vduse_domain_release() is executed. The kernel page
> allocation and memcpy can be done in vduse_dev_open() which allows
> some error handling.

Just to make sure I understand this: the free is probably not the big
issue, but the allocation itself is.

And if we do the memcpy() in open(), it seems to be a subtle
userspace-noticeable change? (Or I don't see how copying in
vduse_dev_open() can help here.)

>
> > > So I choose another way, which does not release kernel bounce pages
> > > when user tries to register usersapce bounce pages. Then we don't need
> > > to do allocation in the path which is not expected to be fail (e.g in
> > > the release). We pay this for more memory usage but further
> > > optimizations could be done on top.
> > >
> > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > >
> > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> >
> > Note for YongJi:
> >
> > I can only test it without usersapce bounce pages as neither DPDK nor
> > libvduse users use that. Would you want to review and have a test for
> > this?
> >
>
> I can do some tests for it.

Great.

>
> Thanks,
> Yongji
>

Thanks
Barry Song Aug. 6, 2024, 2:30 a.m. UTC | #6
On Mon, Aug 5, 2024 at 8:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“

The current code will result in returning a NULL pointer rather than
a busy loop.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
        ...
        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
                ...
        }
        ...
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}


We have two choices to address the issue:
1. busy-loop
2. BUG_ON

The patch below chose option 2:
https://lore.kernel.org/linux-mm/20240731000155.109583-5-21cnbao@gmail.com/

>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
>  drivers/vdpa/vdpa_user/iova_domain.h |  1 +
>  2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..933d2f7cd49a 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>                                 enum dma_data_direction dir)
>  {
>         struct vduse_bounce_map *map;
> +       struct page *page;
>         unsigned int offset;
>         void *addr;
>         size_t sz;
> @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>                             map->orig_phys == INVALID_PHYS_ADDR))
>                         return;
>
> -               addr = kmap_local_page(map->bounce_page);
> +               page = domain->user_bounce_pages ?
> +                      map->user_bounce_page : map->bounce_page;
> +
> +               addr = kmap_local_page(page);
>                 do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
>                 kunmap_local(addr);
>                 size -= sz;
> @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
>                                 memcpy_to_page(pages[i], 0,
>                                                page_address(map->bounce_page),
>                                                PAGE_SIZE);
> -                       __free_page(map->bounce_page);
>                 }
> -               map->bounce_page = pages[i];
> +               map->user_bounce_page = pages[i];
>                 get_page(pages[i]);
>         }
>         domain->user_bounce_pages = true;
> @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
>                 struct page *page = NULL;
>
>                 map = &domain->bounce_maps[i];
> -               if (WARN_ON(!map->bounce_page))
> +               if (WARN_ON(!map->user_bounce_page))
>                         continue;
>
>                 /* Copy user page to kernel page if it's in use */
>                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> +                       page = map->bounce_page;
>                         memcpy_from_page(page_address(page),
> -                                        map->bounce_page, 0, PAGE_SIZE);
> +                                        map->user_bounce_page, 0, PAGE_SIZE);
>                 }
> -               put_page(map->bounce_page);
> -               map->bounce_page = page;
> +               put_page(map->user_bounce_page);
>         }
>         domain->user_bounce_pages = false;
>  out:
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index f92f22a7267d..7f3f0928ec78 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -21,6 +21,7 @@
>
>  struct vduse_bounce_map {
>         struct page *bounce_page;
> +       struct page *user_bounce_page;
>         u64 orig_phys;
>  };
>
> --
> 2.31.1
>
Yongji Xie Aug. 6, 2024, 3:10 a.m. UTC | #7
On Tue, Aug 6, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 5, 2024 at 6:42 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > Barry said [1]:
> > > >
> > > > """
> > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > loop within non-sleepable contexts.
> > > > ""“
> > > >
> > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > is to move the pages allocation out of the lock into the caller, but
> > > > having to allocate a huge number of pages and auxiliary page array
> > > > seems to be problematic as well per Tetsuon [2]:
> > > >
> > > > """
> > > > You should implement proper error handling instead of using
> > > > __GFP_NOFAIL if count can become large.
> > > > """
> > > >
> >
> > I think the problem is it's hard to do the error handling in
> > fops->release() currently.
>
> vduse_dev_dereg_umem() should be the same, it's very hard to allow it to fail.
>
> >
> > So can we temporarily hold the user page refcount, and release it when
> > vduse_dev_open()/vduse_domain_release() is executed. The kernel page
> > allocation and memcpy can be done in vduse_dev_open() which allows
> > some error handling.
>
> Just to make sure I understand this, the free is probably not the big
> issue but the allocation itself.
>

Yes, so deferring the allocation might be a solution.

> And if we do the memcpy() in open(), it seems to be a subtle userspace
> noticeable change? (Or I don't get how copying in vduse_dev_open() can
> help here).
>

Maybe we don't need to do the copy in open(). We can hold the user
page refcount until the inflight I/O is completed. That means the
allocation of new kernel pages can be done in
vduse_domain_map_bounce_page() and the release of old user pages can
be done in vduse_domain_unmap_bounce_page(). Of course, we still have
a copy (old user page -> new user page) if the daemon calls
vduse_dev_reg_umem() again.
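
A minimal sketch of how that lazy path might look, with hypothetical
helper names (locking and the user_bounce_pages bookkeeping are omitted;
this is not a tested implementation):

	/* Sketch: make sure a kernel bounce page exists; meant to be called
	 * from vduse_domain_map_bounce_page(), where -ENOMEM can already be
	 * propagated to the caller. */
	static int vduse_domain_get_kernel_bounce_page(struct vduse_bounce_map *map)
	{
		if (map->bounce_page)
			return 0;

		map->bounce_page = alloc_page(GFP_ATOMIC);
		return map->bounce_page ? 0 : -ENOMEM;
	}

	/* Sketch: drop the stale user page once the in-flight I/O using it has
	 * completed; meant to be called from vduse_domain_unmap_bounce_page(). */
	static void vduse_domain_put_user_bounce_page(struct vduse_bounce_map *map)
	{
		if (map->user_bounce_page) {
			put_page(map->user_bounce_page);
			map->user_bounce_page = NULL;
		}
	}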

Thanks,
Yongji
Jason Wang Aug. 12, 2024, 6:59 a.m. UTC | #8
On Thu, Aug 8, 2024 at 7:09 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Thu, Aug 8, 2024 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Aug 7, 2024 at 2:54 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > >
> > > On Wed, Aug 7, 2024 at 12:38 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Aug 7, 2024 at 11:13 AM Yongji Xie <xieyongji@bytedance.com> wrote:
> > > > >
> > > > > On Wed, Aug 7, 2024 at 10:39 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 6, 2024 at 11:10 AM Yongji Xie <xieyongji@bytedance.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 6, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 5, 2024 at 6:42 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Barry said [1]:
> > > > > > > > > > >
> > > > > > > > > > > """
> > > > > > > > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > > > > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > > > > > > > loop within non-sleepable contexts.
> > > > > > > > > > > ""“
> > > > > > > > > > >
> > > > > > > > > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > > > > > > > > is to move the pages allocation out of the lock into the caller, but
> > > > > > > > > > > having to allocate a huge number of pages and auxiliary page array
> > > > > > > > > > > seems to be problematic as well per Tetsuon [2]:
> > > > > > > > > > >
> > > > > > > > > > > """
> > > > > > > > > > > You should implement proper error handling instead of using
> > > > > > > > > > > __GFP_NOFAIL if count can become large.
> > > > > > > > > > > """
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think the problem is it's hard to do the error handling in
> > > > > > > > > fops->release() currently.
> > > > > > > >
> > > > > > > > vduse_dev_dereg_umem() should be the same, it's very hard to allow it to fail.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So can we temporarily hold the user page refcount, and release it when
> > > > > > > > > vduse_dev_open()/vduse_domain_release() is executed. The kernel page
> > > > > > > > > allocation and memcpy can be done in vduse_dev_open() which allows
> > > > > > > > > some error handling.
> > > > > > > >
> > > > > > > > Just to make sure I understand this, the free is probably not the big
> > > > > > > > issue but the allocation itself.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, so defer the allocation might be a solution.
> > > > > >
> > > > > > Would you mind posting a patch for this?
> > > > > >
> > > > > > >
> > > > > > > > And if we do the memcpy() in open(), it seems to be a subtle userspace
> > > > > > > > noticeable change? (Or I don't get how copying in vduse_dev_open() can
> > > > > > > > help here).
> > > > > > > >
> > > > > > >
> > > > > > > Maybe we don't need to do the copy in open(). We can hold the user
> > > > > > > page refcount until the inflight I/O is completed. That means the
> > > > > > > allocation of new kernel pages can be done in
> > > > > > > vduse_domain_map_bounce_page() and the release of old user pages can
> > > > > > > be done in vduse_domain_unmap_bounce_page().
> > > > > >
> > > > > > This seems to be a subtle userspace noticeable behaviour?
> > > > > >
> > > > >
> > > > > Yes, userspace needs to ensure that it does not reuse the old user
> > > > > pages for other purposes before vduse_dev_dereg_umem() returns
> > > > > successfully. The vduse_dev_dereg_umem() will only return successfully
> > > > > when there is no inflight I/O which means we don't need to allocate
> > > > > extra kernel pages to store data. If we can't accept this, then your
> > > > > current patch might be the most suitable.
> > > >
> > > > It might be better to not break.
> > > >
> > > > Actually during my testing, the read_lock in the do_bounce path slows
> > > > down the performance. Remove read_lock or use rcu_read_lock() to give
> > > > 20% improvement of PPS.
> > > >
> > >
> > > Looks like rcu_read_lock() should be OK here.
> >
> > The tricky part is that we may still end up with behaviour changes (or lose
> > some of the synchronization between kernel and bounce pages):
> >
> > RCU allows the read to be executed in parallel with the writer. So
> > bouncing could be done in parallel with
> > vduse_domain_add_user_bounce_pages(), there would be a race in two
> > memcpy.
> >
>
> Hmm...this is a problem. We may still need some userspace noticeable
> behaviour, e.g. only allowing reg_umem/dereg_umem when the device is
> not started.

Exactly, maybe have a new userspace flag.
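
To illustrate the memcpy race that plain rcu_read_lock() would allow (a
simplified interleaving, not a trace):

	CPU 0: vduse_domain_bounce()           CPU 1: vduse_domain_add_user_bounce_pages()
	rcu_read_lock()
	sees user_bounce_pages == false,
	picks map->bounce_page
	                                       copies kernel bounce page into the
	                                       newly registered user page
	do_bounce() writes fresh data into
	the kernel bounce page
	                                       sets domain->user_bounce_pages = true
	rcu_read_unlock()

The data bounced by CPU 0 after CPU 1's copy only lands in the kernel page,
so the newly registered user page silently misses it; the current
read_lock()/write_lock() pair is what serializes the two copies.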

Thanks

>
> Thanks,
> Yongji
>
Jason Wang Aug. 12, 2024, 7 a.m. UTC | #9
On Thu, Aug 8, 2024 at 6:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Thu, Aug 8, 2024 at 10:58 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > >
> > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > Barry said [1]:
> > > >
> > > > """
> > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > loop within non-sleepable contexts.
> > > > ""“
> > > >
> > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > is to move the pages allocation out of the lock into the caller, but
> > > > having to allocate a huge number of pages and auxiliary page array
> > > > seems to be problematic as well per Tetsuon [2]:
> > > >
> > > > """
> > > > You should implement proper error handling instead of using
> > > > __GFP_NOFAIL if count can become large.
> > > > """
> > > >
> > > > So I choose another way, which does not release kernel bounce pages
> > > > when user tries to register usersapce bounce pages. Then we don't need
> > > > to do allocation in the path which is not expected to be fail (e.g in
> > > > the release). We pay this for more memory usage but further
> > > > optimizations could be done on top.
> > > >
> > > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > > >
> > > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > >
> > > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > > Tested-by: Xie Yongji <xieyongji@bytedance.com>
> >
> > Thanks.
> >
> > >
> > > Have tested it with qemu-storage-daemon [1]:
> > >
> > > $ qemu-storage-daemon \
> > >     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > >     --monitor chardev=charmonitor \
> > >     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > > \
> > >     --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> > >
> > > [1] https://github.com/bytedance/qemu/tree/vduse-umem
> >
> > Great, would you want to post them to the Qemu?
> >
>
> Looks like qemu-storage-daemon would not benefit from this feature
> which is designed for some hugepage users such as SPDK/DPDK.

Yes, but maybe for testing purposes like here?

Thanks

>
> Thanks,
> Yongji
>
Yongji Xie Aug. 12, 2024, 7:21 a.m. UTC | #10
On Mon, Aug 12, 2024 at 3:00 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 8, 2024 at 6:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > On Thu, Aug 8, 2024 at 10:58 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > > >
> > > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > Barry said [1]:
> > > > >
> > > > > """
> > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > loop within non-sleepable contexts.
> > > > > ""“
> > > > >
> > > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > > is to move the pages allocation out of the lock into the caller, but
> > > > > having to allocate a huge number of pages and auxiliary page array
> > > > > seems to be problematic as well per Tetsuon [2]:
> > > > >
> > > > > """
> > > > > You should implement proper error handling instead of using
> > > > > __GFP_NOFAIL if count can become large.
> > > > > """
> > > > >
> > > > > So I choose another way, which does not release kernel bounce pages
> > > > > when user tries to register usersapce bounce pages. Then we don't need
> > > > > to do allocation in the path which is not expected to be fail (e.g in
> > > > > the release). We pay this for more memory usage but further
> > > > > optimizations could be done on top.
> > > > >
> > > > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > > > >
> > > > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > ---
> > > >
> > > > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > > > Tested-by: Xie Yongji <xieyongji@bytedance.com>
> > >
> > > Thanks.
> > >
> > > >
> > > > Have tested it with qemu-storage-daemon [1]:
> > > >
> > > > $ qemu-storage-daemon \
> > > >     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > >     --monitor chardev=charmonitor \
> > > >     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > > > \
> > > >     --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> > > >
> > > > [1] https://github.com/bytedance/qemu/tree/vduse-umem
> > >
> > > Great, would you want to post them to the Qemu?
> > >
> >
> > Looks like qemu-storage-daemon would not benefit from this feature
> > which is designed for some hugepage users such as SPDK/DPDK.
>
> Yes, but maybe for testing purposes like here?
>

OK for me.

Thanks,
Yongji

Patch

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 791d38d6284c..933d2f7cd49a 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -162,6 +162,7 @@  static void vduse_domain_bounce(struct vduse_iova_domain *domain,
 				enum dma_data_direction dir)
 {
 	struct vduse_bounce_map *map;
+	struct page *page;
 	unsigned int offset;
 	void *addr;
 	size_t sz;
@@ -178,7 +179,10 @@  static void vduse_domain_bounce(struct vduse_iova_domain *domain,
 			    map->orig_phys == INVALID_PHYS_ADDR))
 			return;
 
-		addr = kmap_local_page(map->bounce_page);
+		page = domain->user_bounce_pages ?
+		       map->user_bounce_page : map->bounce_page;
+
+		addr = kmap_local_page(page);
 		do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
 		kunmap_local(addr);
 		size -= sz;
@@ -270,9 +274,8 @@  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
 				memcpy_to_page(pages[i], 0,
 					       page_address(map->bounce_page),
 					       PAGE_SIZE);
-			__free_page(map->bounce_page);
 		}
-		map->bounce_page = pages[i];
+		map->user_bounce_page = pages[i];
 		get_page(pages[i]);
 	}
 	domain->user_bounce_pages = true;
@@ -297,17 +300,16 @@  void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
 		struct page *page = NULL;
 
 		map = &domain->bounce_maps[i];
-		if (WARN_ON(!map->bounce_page))
+		if (WARN_ON(!map->user_bounce_page))
 			continue;
 
 		/* Copy user page to kernel page if it's in use */
 		if (map->orig_phys != INVALID_PHYS_ADDR) {
-			page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
+			page = map->bounce_page;
 			memcpy_from_page(page_address(page),
-					 map->bounce_page, 0, PAGE_SIZE);
+					 map->user_bounce_page, 0, PAGE_SIZE);
 		}
-		put_page(map->bounce_page);
-		map->bounce_page = page;
+		put_page(map->user_bounce_page);
 	}
 	domain->user_bounce_pages = false;
 out:
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index f92f22a7267d..7f3f0928ec78 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -21,6 +21,7 @@ 
 
 struct vduse_bounce_map {
 	struct page *bounce_page;
+	struct page *user_bounce_page;
 	u64 orig_phys;
 };