
[v2] RDMA/umem: minor bug fix and cleanup in error handling paths

Message ID 20190302202435.31889-1-jhubbard@nvidia.com (mailing list archive)
State New, archived
Series [v2] RDMA/umem: minor bug fix and cleanup in error handling paths

Commit Message

john.hubbard@gmail.com March 2, 2019, 8:24 p.m. UTC
From: John Hubbard <jhubbard@nvidia.com>

1. Bug fix: the error handling now releases pages starting
at the first page that experienced an error.

2. Refinement: release_pages() is better than put_page()
in a loop.

3. Dead code removal: the check for (user_virt & ~page_mask)
is checking for a condition that can never happen,
because earlier:

    user_virt = user_virt & page_mask;

...so, remove that entire phrase.

4. Minor: As long as I'm here, shorten up a couple of long lines
in the same function, without harming the ability to
grep for the printed error message.

Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---

v2: Fixes a build failure reported by the kbuild test robot, by directly
    including pagemap.h

 drivers/infiniband/core/umem_odp.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

Comments

Ira Weiny March 2, 2019, 7:44 p.m. UTC | #1
FWIW, I don't have ODP hardware either, so I can't test this.

On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> 1. Bug fix: the error handling now releases pages starting
> at the first page that experienced an error.
> 
> 2. Refinement: release_pages() is better than put_page()
> in a loop.
> 
> 3. Dead code removal: the check for (user_virt & ~page_mask)
> is checking for a condition that can never happen,
> because earlier:
> 
>     user_virt = user_virt & page_mask;
> 
> ...so, remove that entire phrase.
> 
> 4. Minor: As long as I'm here, shorten up a couple of long lines
> in the same function, without harming the ability to
> grep for the printed error message.
> 
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: linux-rdma@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> 
> v2: Fixes a build failure reported by the kbuild test robot, by directly
>     including pagemap.h
> 
>  drivers/infiniband/core/umem_odp.c | 25 ++++++++++---------------
>  1 file changed, 10 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index acb882f279cb..83872c1f3f2c 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -40,6 +40,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/hugetlb.h>
>  #include <linux/interval_tree_generic.h>
> +#include <linux/pagemap.h>
>  
>  #include <rdma/ib_verbs.h>
>  #include <rdma/ib_umem.h>
> @@ -648,25 +649,17 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
>  
>  		if (npages < 0) {
>  			if (npages != -EAGAIN)
> -				pr_warn("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
> +				pr_warn("fail to get %zu user pages with error %d\n",
> +					gup_num_pages, npages);
>  			else
> -				pr_debug("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
> +				pr_debug("fail to get %zu user pages with error %d\n",
> +					 gup_num_pages, npages);
>  			break;
>  		}
>  
>  		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
>  		mutex_lock(&umem_odp->umem_mutex);
>  		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
> -			if (user_virt & ~page_mask) {
> -				p += PAGE_SIZE;
> -				if (page_to_phys(local_page_list[j]) != p) {
> -					ret = -EFAULT;
> -					break;
> -				}
> -				put_page(local_page_list[j]);
> -				continue;
> -			}
> -

I think this is trying to account for compound pages. (i.e. page_mask could
represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
But putting the page in that case seems to be the wrong thing to do?

Yes this was added by Artemy[1] now cc'ed.
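
(To spell out what the removed block does for a huge page, here is an
annotated copy of the code above; the comments are one reading of it,
not authoritative:)

	if (user_virt & ~page_mask) {
		/* user_virt is not huge-page aligned here, so this
		 * iteration should be on a tail page of the huge page
		 * that starts at the aligned address. */
		p += PAGE_SIZE;
		if (page_to_phys(local_page_list[j]) != p) {
			/* Not physically contiguous with the previous
			 * page, so it cannot be part of the same huge
			 * page. */
			ret = -EFAULT;
			break;
		}
		/* Drop the extra reference that get_user_pages() took
		 * for this tail page; put_page() resolves tail pages
		 * to the head page. */
		put_page(local_page_list[j]);
		continue;
	}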

>  			ret = ib_umem_odp_map_dma_single_page(
>  					umem_odp, k, local_page_list[j],
>  					access_mask, current_seq);
> @@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
>  		mutex_unlock(&umem_odp->umem_mutex);
>  
>  		if (ret < 0) {
> -			/* Release left over pages when handling errors. */
> -			for (++j; j < npages; ++j)
> -				put_page(local_page_list[j]);
> +			/*
> +			 * Release pages, starting at the first page
> +			 * that experienced an error.
> +			 */
> +			release_pages(&local_page_list[j], npages - j);

My concern here is that release_pages() handles compound pages, perhaps
differently from the above code, so calling it may not work?  But I've not
really spent a lot of time on it...

:-/
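
(For comparison, release_pages()'s per-page logic at this point in time
looks roughly like the sketch below, with the LRU batching and locking
elided. Note that it resolves compound heads the same way put_page()
does:)

	static void release_pages_sketch(struct page **pages, int nr)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct page *page = compound_head(pages[i]);

			if (!put_page_testzero(page))
				continue;
			if (PageCompound(page)) {
				__put_compound_page(page);
				continue;
			}
			/* ... batch zero-refcount base pages onto a
			 * pages_to_free list ... */
		}
	}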

Ira

[1]

commit 403cd12e2cf759ead5cbdcb62bf9872b9618d400
Author: Artemy Kovalyov <artemyko@mellanox.com>
Date:   Wed Apr 5 09:23:55 2017 +0300

    IB/umem: Add contiguous ODP support

    Currently ODP supports only regular MMU pages.
    Add ODP support for regions consisting of physically contiguous chunks
    of arbitrary order (huge pages for instance) to improve performance.

    Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>
    Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov March 3, 2019, 9:52 a.m. UTC | #2
On 02/03/2019 21:44, Ira Weiny wrote:
> 
> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> ...
>> 3. Dead code removal: the check for (user_virt & ~page_mask)
>> is checking for a condition that can never happen,
>> because earlier:
>>
>>      user_virt = user_virt & page_mask;
>>
>> ...so, remove that entire phrase.
>>
>>   
>>   		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
>>   		mutex_lock(&umem_odp->umem_mutex);
>>   		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
>> -			if (user_virt & ~page_mask) {
>> -				p += PAGE_SIZE;
>> -				if (page_to_phys(local_page_list[j]) != p) {
>> -					ret = -EFAULT;
>> -					break;
>> -				}
>> -				put_page(local_page_list[j]);
>> -				continue;
>> -			}
>> -
> 
> I think this is trying to account for compound pages. (i.e. page_mask could
> represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
> But putting the page in that case seems to be the wrong thing to do?
> 
> Yes this was added by Artemy[1] now cc'ed.

Right, this is for huge pages, please keep it.
put_page() is needed to decrement the refcount of the head page.
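
(For context, put_page() always resolves a tail page to its head before
dropping the count. A simplified sketch of the implementation of this
era, with the devmap special case omitted:)

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);	/* tail -> head */
		if (put_page_testzero(page))
			__put_page(page);
	}
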
Ira Weiny March 3, 2019, 4:55 p.m. UTC | #3
On Sun, Mar 03, 2019 at 11:52:41AM +0200, Artemy Kovalyov wrote:
> 
> 
> On 02/03/2019 21:44, Ira Weiny wrote:
> > 
> > On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > > ...
> > > 3. Dead code removal: the check for (user_virt & ~page_mask)
> > > is checking for a condition that can never happen,
> > > because earlier:
> > > 
> > >      user_virt = user_virt & page_mask;
> > > 
> > > ...so, remove that entire phrase.
> > > 
> > >   		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
> > >   		mutex_lock(&umem_odp->umem_mutex);
> > >   		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
> > > -			if (user_virt & ~page_mask) {
> > > -				p += PAGE_SIZE;
> > > -				if (page_to_phys(local_page_list[j]) != p) {
> > > -					ret = -EFAULT;
> > > -					break;
> > > -				}
> > > -				put_page(local_page_list[j]);
> > > -				continue;
> > > -			}
> > > -
> > 
> > I think this is trying to account for compound pages. (i.e. page_mask could
> > represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
> > But putting the page in that case seems to be the wrong thing to do?
> > 
> > Yes this was added by Artemy[1] now cc'ed.
> 
> Right, this is for huge pages, please keep it.
> put_page() is needed to decrement the refcount of the head page.

You mean decrement the refcount of the _non_-head pages?

Ira

>
John Hubbard March 3, 2019, 10:37 p.m. UTC | #4
On 3/3/19 1:52 AM, Artemy Kovalyov wrote:
> 
> 
> On 02/03/2019 21:44, Ira Weiny wrote:
>>
>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>>> From: John Hubbard <jhubbard@nvidia.com>
>>>
>>> ...
>>> 3. Dead code removal: the check for (user_virt & ~page_mask)
>>> is checking for a condition that can never happen,
>>> because earlier:
>>>
>>>      user_virt = user_virt & page_mask;
>>>
>>> ...so, remove that entire phrase.
>>>
>>>           bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
>>>           mutex_lock(&umem_odp->umem_mutex);
>>>           for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
>>> -            if (user_virt & ~page_mask) {
>>> -                p += PAGE_SIZE;
>>> -                if (page_to_phys(local_page_list[j]) != p) {
>>> -                    ret = -EFAULT;
>>> -                    break;
>>> -                }
>>> -                put_page(local_page_list[j]);
>>> -                continue;
>>> -            }
>>> -
>>
>> I think this is trying to account for compound pages. (i.e. page_mask could
>> represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
>> But putting the page in that case seems to be the wrong thing to do?
>>
>> Yes this was added by Artemy[1] now cc'ed.
> 
> Right, this is for huge pages, please keep it.
>> put_page() is needed to decrement the refcount of the head page.
> 

OK, thanks for explaining! Artemy, while you're here, any thoughts about the
release_pages, and the change of the starting point, from the other part of the 
patch:

@@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
	mutex_unlock(&umem_odp->umem_mutex);

  		if (ret < 0) {
-			/* Release left over pages when handling errors. */
-			for (++j; j < npages; ++j)
-				put_page(local_page_list[j]);
+			/*
+			 * Release pages, starting at the first page
+			 * that experienced an error.
+			 */
+			release_pages(&local_page_list[j], npages - j);
  			break;
  		}
  	}

?

thanks,
Leon Romanovsky March 4, 2019, 6:44 a.m. UTC | #5
On Sun, Mar 03, 2019 at 02:37:33PM -0800, John Hubbard wrote:
> On 3/3/19 1:52 AM, Artemy Kovalyov wrote:
> >
> >
> > On 02/03/2019 21:44, Ira Weiny wrote:
> > >
> > > On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
> > > > From: John Hubbard <jhubbard@nvidia.com>
> > > >
> > > > ...
> > > > 3. Dead code removal: the check for (user_virt & ~page_mask)
> > > > is checking for a condition that can never happen,
> > > > because earlier:
> > > >
> > > >      user_virt = user_virt & page_mask;
> > > >
> > > > ...so, remove that entire phrase.
> > > >
> > > >           bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
> > > >           mutex_lock(&umem_odp->umem_mutex);
> > > >           for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
> > > > -            if (user_virt & ~page_mask) {
> > > > -                p += PAGE_SIZE;
> > > > -                if (page_to_phys(local_page_list[j]) != p) {
> > > > -                    ret = -EFAULT;
> > > > -                    break;
> > > > -                }
> > > > -                put_page(local_page_list[j]);
> > > > -                continue;
> > > > -            }
> > > > -
> > >
> > > I think this is trying to account for compound pages. (i.e. page_mask could
> > > represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
> > > But putting the page in that case seems to be the wrong thing to do?
> > >
> > > Yes this was added by Artemy[1] now cc'ed.
> >
> > Right, this is for huge pages, please keep it.
> > > put_page() is needed to decrement the refcount of the head page.
> >
>
> OK, thanks for explaining! Artemy, while you're here, any thoughts about the
> release_pages, and the change of the starting point, from the other part
> of the patch:

Your release_pages code is the right fix; it would be great if you could
prepare a proper standalone patch.

Thanks

>
> @@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp
> *umem_odp, u64 user_virt,
> 	mutex_unlock(&umem_odp->umem_mutex);
>
>  		if (ret < 0) {
> -			/* Release left over pages when handling errors. */
> -			for (++j; j < npages; ++j)
> -				put_page(local_page_list[j]);
> +			/*
> +			 * Release pages, starting at the first page
> +			 * that experienced an error.
> +			 */
> +			release_pages(&local_page_list[j], npages - j);
>  			break;
>  		}
>  	}
>
> ?
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
Ira Weiny March 4, 2019, 8:13 p.m. UTC | #6
On Mon, Mar 04, 2019 at 03:11:05PM -0800, John Hubbard wrote:
> On 3/3/19 8:55 AM, Ira Weiny wrote:
> > On Sun, Mar 03, 2019 at 11:52:41AM +0200, Artemy Kovalyov wrote:
> >>
> >>
> >> On 02/03/2019 21:44, Ira Weiny wrote:
> >>>
> >>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
> >>>> From: John Hubbard <jhubbard@nvidia.com>
> >>>>
> >>>> ...
> >>>> 3. Dead code removal: the check for (user_virt & ~page_mask)
> >>>> is checking for a condition that can never happen,
> >>>> because earlier:
> >>>>
> >>>>      user_virt = user_virt & page_mask;
> >>>>
> >>>> ...so, remove that entire phrase.
> >>>>
> >>>>   		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
> >>>>   		mutex_lock(&umem_odp->umem_mutex);
> >>>>   		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
> >>>> -			if (user_virt & ~page_mask) {
> >>>> -				p += PAGE_SIZE;
> >>>> -				if (page_to_phys(local_page_list[j]) != p) {
> >>>> -					ret = -EFAULT;
> >>>> -					break;
> >>>> -				}
> >>>> -				put_page(local_page_list[j]);
> >>>> -				continue;
> >>>> -			}
> >>>> -
> >>>
> >>> I think this is trying to account for compound pages. (i.e. page_mask could
> >>> represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
> >>> But putting the page in that case seems to be the wrong thing to do?
> >>>
> >>> Yes this was added by Artemy[1] now cc'ed.
> >>
> >> Right, this is for huge pages, please keep it.
> >> put_page() is needed to decrement the refcount of the head page.
> > 
> > You mean decrement the refcount of the _non_-head pages?
> > 
> > Ira
> > 
> 
> Actually, I'm sure Artemy means head page, because put_page() always
> operates on the head page. 
> 
> And this reminds me that I have a problem to solve nearby: get_user_pages
> on huge pages increments the page->_refcount *for each tail page* as well.
> That's a minor problem for my put_user_page() 
> patchset, because my approach so far assumed that I could just change us
> over to:
> 
> get_user_page(): increments page->_refcount by a large amount (1024)
> 
> put_user_page(): decrements page->_refcount by a large amount (1024)
> 
> ...and just stop doing the odd (to me) technique of incrementing once for
> each tail page. I cannot see any reason why that's actually required, as
> opposed to just "raise the page->_refcount enough to avoid losing the head
> page too soon".

What about splitting a huge page?

From Documentation/vm/transhuge.rst

<quote>
split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries. But we don't have enough information on how to distribute any
additional pins (i.e. from get_user_pages). split_huge_page() fails any
requests to split pinned huge page: it expects page count to be equal to
sum of mapcount of all sub-pages plus one (split_huge_page caller must
have reference for head page).
</quote>

FWIW, I'm not sure why it needs to "store" the reference in the head page for
this.  I don't see any check to make sure the ref has been "stored" but I'm not
really familiar with the compound page code yet.

Ira

> 
> However, it may be tricky to do this in one pass. Probably at first, I'll have
> to do this horrible thing approach:
> 
> get_user_page(): increments page->_refcount by a large amount (1024)
> 
> put_user_page(): decrements page->_refcount by a large amount (1024) MULTIPLIED
>                  by the number of tail pages. argghhh that's ugly.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
>
John Hubbard March 4, 2019, 11:11 p.m. UTC | #7
On 3/3/19 8:55 AM, Ira Weiny wrote:
> On Sun, Mar 03, 2019 at 11:52:41AM +0200, Artemy Kovalyov wrote:
>>
>>
>> On 02/03/2019 21:44, Ira Weiny wrote:
>>>
>>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> ...
>>>> 3. Dead code removal: the check for (user_virt & ~page_mask)
>>>> is checking for a condition that can never happen,
>>>> because earlier:
>>>>
>>>>      user_virt = user_virt & page_mask;
>>>>
>>>> ...so, remove that entire phrase.
>>>>
>>>>   		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
>>>>   		mutex_lock(&umem_odp->umem_mutex);
>>>>   		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
>>>> -			if (user_virt & ~page_mask) {
>>>> -				p += PAGE_SIZE;
>>>> -				if (page_to_phys(local_page_list[j]) != p) {
>>>> -					ret = -EFAULT;
>>>> -					break;
>>>> -				}
>>>> -				put_page(local_page_list[j]);
>>>> -				continue;
>>>> -			}
>>>> -
>>>
>>> I think this is trying to account for compound pages. (i.e. page_mask could
>>> represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
>>> But putting the page in that case seems to be the wrong thing to do?
>>>
>>> Yes this was added by Artemy[1] now cc'ed.
>>
>> Right, this is for huge pages, please keep it.
>> put_page() is needed to decrement the refcount of the head page.
> 
> You mean decrement the refcount of the _non_-head pages?
> 
> Ira
> 

Actually, I'm sure Artemy means head page, because put_page() always
operates on the head page. 

And this reminds me that I have a problem to solve nearby: get_user_pages
on huge pages increments the page->_refcount *for each tail page* as well.
That's a minor problem for my put_user_page() 
patchset, because my approach so far assumed that I could just change us
over to:

get_user_page(): increments page->_refcount by a large amount (1024)

put_user_page(): decrements page->_refcount by a large amount (1024)

...and just stop doing the odd (to me) technique of incrementing once for
each tail page. I cannot see any reason why that's actually required, as
opposed to just "raise the page->_refcount enough to avoid losing the head
page too soon".

However, it may be tricky to do this in one pass. Probably at first, I'll have
to do this horrible thing approach:

get_user_page(): increments page->_refcount by a large amount (1024)

put_user_page(): decrements page->_refcount by a large amount (1024) MULTIPLIED
                 by the number of tail pages. argghhh that's ugly.
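
(A sketch of the biased scheme being described; the bias value and the
helper names are hypothetical, nothing like this exists yet:)

	#define GUP_BIAS 1024	/* hypothetical large refcount bias */

	static void gup_pin_page(struct page *page)
	{
		page_ref_add(compound_head(page), GUP_BIAS);
	}

	static void gup_unpin_page(struct page *page)
	{
		page_ref_sub(compound_head(page), GUP_BIAS);
	}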

thanks,
John Hubbard March 4, 2019, 11:36 p.m. UTC | #8
On 3/4/19 3:11 PM, John Hubbard wrote:
> On 3/3/19 8:55 AM, Ira Weiny wrote:
>> On Sun, Mar 03, 2019 at 11:52:41AM +0200, Artemy Kovalyov wrote:
>>>
>>>
>>> On 02/03/2019 21:44, Ira Weiny wrote:
>>>>
>>>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>
>>>>> ...
>>>>> 3. Dead code removal: the check for (user_virt & ~page_mask)
>>>>> is checking for a condition that can never happen,
>>>>> because earlier:
>>>>>
>>>>>      user_virt = user_virt & page_mask;
>>>>>
>>>>> ...so, remove that entire phrase.
>>>>>
>>>>>   		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
>>>>>   		mutex_lock(&umem_odp->umem_mutex);
>>>>>   		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
>>>>> -			if (user_virt & ~page_mask) {
>>>>> -				p += PAGE_SIZE;
>>>>> -				if (page_to_phys(local_page_list[j]) != p) {
>>>>> -					ret = -EFAULT;
>>>>> -					break;
>>>>> -				}
>>>>> -				put_page(local_page_list[j]);
>>>>> -				continue;
>>>>> -			}
>>>>> -
>>>>
>>>> I think this is trying to account for compound pages. (i.e. page_mask could
>>>> represent more than PAGE_SIZE, which is what user_virt is being incremented by.)
>>>> But putting the page in that case seems to be the wrong thing to do?
>>>>
>>>> Yes this was added by Artemy[1] now cc'ed.
>>>
>>> Right, this is for huge pages, please keep it.
>>> put_page() is needed to decrement the refcount of the head page.
>>
>> You mean decrement the refcount of the _non_-head pages?
>>
>> Ira
>>
> 
> Actually, I'm sure Artemy means head page, because put_page() always
> operates on the head page. 
> 
> And this reminds me that I have a problem to solve nearby: get_user_pages
> on huge pages increments the page->_refcount *for each tail page* as well.
> That's a minor problem for my put_user_page() 
> patchset, because my approach so far assumed that I could just change us
> over to:
> 
> get_user_page(): increments page->_refcount by a large amount (1024)
> 
> put_user_page(): decrements page->_refcount by a large amount (1024)
> 
> ...and just stop doing the odd (to me) technique of incrementing once for
> each tail page. I cannot see any reason why that's actually required, as
> opposed to just "raise the page->_refcount enough to avoid losing the head
> page too soon".
> 
> However, it may be tricky to do this in one pass. Probably at first, I'll have
> to do this horrible thing approach:
> 
> get_user_page(): increments page->_refcount by a large amount (1024)
> 
> put_user_page(): decrements page->_refcount by a large amount (1024) MULTIPLIED
>                  by the number of tail pages. argghhh that's ugly.
> 

I see that this is still not stated quite right.

...to clarify, I mean to leave the existing behavior alone. So it would be
the call sites (not put_user_page as the above says) that would be doing all
that decrementing. The call sites know how many decrements are appropriate.

Unless someone thinks of a clever way to clean this up in one shot. I'm not
really seeing any way.

thanks,
Jason Gunthorpe March 5, 2019, 12:53 a.m. UTC | #9
On Mon, Mar 04, 2019 at 03:11:05PM -0800, John Hubbard wrote:

> get_user_page(): increments page->_refcount by a large amount (1024)
> 
> put_user_page(): decrements page->_refcount by a large amount (1024)
> 
> ...and just stop doing the odd (to me) technique of incrementing once for
> each tail page. I cannot see any reason why that's actually required, as
> opposed to just "raise the page->_refcount enough to avoid losing the head
> page too soon".

I'd very much like to see this in the infiniband umem code - the extra
work and cost of touching every page in a huge page is very much
undesired.

Jason
John Hubbard March 5, 2019, 8:10 p.m. UTC | #10
On 3/4/19 12:13 PM, Ira Weiny wrote:
[snip]
>> And this reminds me that I have a problem to solve nearby: get_user_pages
>> on huge pages increments the page->_refcount *for each tail page* as well.
>> That's a minor problem for my put_user_page() 
>> patchset, because my approach so far assumed that I could just change us
>> over to:
>>
>> get_user_page(): increments page->_refcount by a large amount (1024)
>>
>> put_user_page(): decrements page->_refcount by a large amount (1024)
>>
>> ...and just stop doing the odd (to me) technique of incrementing once for
>> each tail page. I cannot see any reason why that's actually required, as
>> opposed to just "raise the page->_refcount enough to avoid losing the head
>> page too soon".
> 
> What about splitting a huge page?
> 
> From Documentation/vm/transhuge.rst
> 
> <quote>
> split_huge_page internally has to distribute the refcounts in the head
> page to the tail pages before clearing all PG_head/tail bits from the page
> structures. It can be done easily for refcounts taken by page table
> entries. But we don't have enough information on how to distribute any
> additional pins (i.e. from get_user_pages). split_huge_page() fails any
> requests to split pinned huge page: it expects page count to be equal to
> sum of mapcount of all sub-pages plus one (split_huge_page caller must
> have reference for head page).
> </quote>
> 

heh, so in the end, split_huge_page just needs enough information to say
"no" for gup pages. So as long as page->_refcount avoids one particular
value, the code keeps working. :)
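
(The check described in the doc amounts to something like the following
simplified sketch, not the kernel's exact code:)

	/* Any references beyond the page-table mappings, plus the
	 * caller's one reference on the head page, count as pins. */
	static bool can_split_sketch(struct page *head)
	{
		return page_count(head) == total_mapcount(head) + 1;
	}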


> FWIW, I'm not sure why it needs to "store" the reference in the head page for
> this.  I don't see any check to make sure the ref has been "stored" but I'm not
> really familiar with the compound page code yet.
> 
> Ira
> 

Thanks for peeking at this, I'll look deeper too.

thanks,
Artemy Kovalyov March 6, 2019, 1:02 a.m. UTC | #11
On 04/03/2019 00:37, John Hubbard wrote:
> On 3/3/19 1:52 AM, Artemy Kovalyov wrote:
>>
>>
>> On 02/03/2019 21:44, Ira Weiny wrote:
>>>
>>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> ...
> 
> OK, thanks for explaining! Artemy, while you're here, any thoughts about the
> release_pages, and the change of the starting point, from the other part of the
> patch:
> 
> @@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> u64 user_virt,
> 	mutex_unlock(&umem_odp->umem_mutex);
> 
>    		if (ret < 0) {
> -			/* Release left over pages when handling errors. */
> -			for (++j; j < npages; ++j)
release_pages() is an optimized batch put_page(), so it's OK.
But! Releasing starting from the page *after* the one that caused the
failure in ib_umem_odp_map_dma_single_page() is what's correct, because
the failure flow of that function has already called put_page().
So release_pages(&local_page_list[j+1], npages - j-1) would be correct.
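
(In other words, a sketch of the corrected error path:)

	if (ret < 0) {
		/*
		 * The failing call already dropped its own page
		 * reference, so release only the pages *after*
		 * index j.
		 */
		release_pages(&local_page_list[j + 1], npages - j - 1);
		break;
	}
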
> -				put_page(local_page_list[j]);
> +			/*
> +			 * Release pages, starting at the first page
> +			 * that experienced an error.
> +			 */
> +			release_pages(&local_page_list[j], npages - j);
>    			break;
>    		}
>    	}
> 
> ?
> 
> thanks,
>
Jason Gunthorpe March 6, 2019, 1:32 a.m. UTC | #12
On Wed, Mar 06, 2019 at 03:02:36AM +0200, Artemy Kovalyov wrote:
> 
> 
> On 04/03/2019 00:37, John Hubbard wrote:
> > On 3/3/19 1:52 AM, Artemy Kovalyov wrote:
> > > 
> > > 
> > > On 02/03/2019 21:44, Ira Weiny wrote:
> > > > 
> > > > On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
> > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > > 
> > > > > ...
> > 
> > OK, thanks for explaining! Artemy, while you're here, any thoughts about the
> > release_pages, and the change of the starting point, from the other part of the
> > patch:
> > 
> > @@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > u64 user_virt,
> > 	mutex_unlock(&umem_odp->umem_mutex);
> > 
> >    		if (ret < 0) {
> > -			/* Release left over pages when handling errors. */
> > -			for (++j; j < npages; ++j)
> release_pages() is an optimized batch put_page(), so it's OK.
> But! Releasing starting from the page *after* the one that caused the
> failure in ib_umem_odp_map_dma_single_page() is what's correct, because
> the failure flow of that function has already called put_page().
> So release_pages(&local_page_list[j+1], npages - j-1) would be correct.

Someone send a fixup patch please...

Jason
John Hubbard March 6, 2019, 1:34 a.m. UTC | #13
On 3/5/19 5:32 PM, Jason Gunthorpe wrote:
> On Wed, Mar 06, 2019 at 03:02:36AM +0200, Artemy Kovalyov wrote:
>>
>>
>> On 04/03/2019 00:37, John Hubbard wrote:
>>> On 3/3/19 1:52 AM, Artemy Kovalyov wrote:
>>>>
>>>>
>>>> On 02/03/2019 21:44, Ira Weiny wrote:
>>>>>
>>>>> On Sat, Mar 02, 2019 at 12:24:35PM -0800, john.hubbard@gmail.com wrote:
>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>>
>>>>>> ...
>>>
>>> OK, thanks for explaining! Artemy, while you're here, any thoughts about the
>>> release_pages, and the change of the starting point, from the other part of the
>>> patch:
>>>
>>> @@ -684,9 +677,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
>>> u64 user_virt,
>>> 	mutex_unlock(&umem_odp->umem_mutex);
>>>
>>>    		if (ret < 0) {
>>> -			/* Release left over pages when handling errors. */
>>> -			for (++j; j < npages; ++j)
>> release_pages() is an optimized batch put_page(), so it's OK.
>> But! Releasing starting from the page *after* the one that caused the
>> failure in ib_umem_odp_map_dma_single_page() is what's correct, because
>> the failure flow of that function has already called put_page().
>> So release_pages(&local_page_list[j+1], npages - j-1) would be correct.
> 
> Someone send a fixup patch please...
> 
> Jason

Yeah, I'm on it. Just need to double-check that this is the case. But Jason,
you're confirming it already, so that helps too.

Patch coming shortly.

thanks,
John Hubbard March 6, 2019, 1:37 a.m. UTC | #14
On 3/5/19 5:34 PM, John Hubbard wrote:
[snip]
>>> So release_pages(&local_page_list[j+1], npages - j-1) would be correct.
>>
>> Someone send a fixup patch please...
>>
>> Jason
> 
> Yeah, I'm on it. Just need to double-check that this is the case. But Jason,
> you're confirming it already, so that helps too.
> 
> Patch coming shortly.
> 

Jason, btw, do you prefer a patch that fixes the previous one, or a new 
patch that stands alone? (I'm not sure how this tree is maintained, exactly.)

thanks,
Jason Gunthorpe March 6, 2019, 1:51 a.m. UTC | #15
On Tue, Mar 05, 2019 at 05:37:18PM -0800, John Hubbard wrote:
> On 3/5/19 5:34 PM, John Hubbard wrote:
> [snip]
> >>> So release_pages(&local_page_list[j+1], npages - j-1) would be correct.
> >>
> >> Someone send a fixup patch please...
> >>
> >> Jason
> > 
> > Yeah, I'm on it. Just need to double-check that this is the case. But Jason,
> > you're confirming it already, so that helps too.

I didn't look; just assuming Artemy is right since he knows this
code.

> > Patch coming shortly.
> > 
> 
> Jason, btw, do you prefer a patch that fixes the previous one, or a new 
> patch that stands alone? (I'm not sure how this tree is maintained, exactly.)

It has passed the point when I should apply a fixup; rdma in general has
approximately a 1-day window after the 'thanks applied' message
where a rebase is possible.

Otherwise the tree is strictly no-rebase.

Jason
John Hubbard March 6, 2019, 2:04 a.m. UTC | #16
On 3/5/19 5:51 PM, Jason Gunthorpe wrote:
> On Tue, Mar 05, 2019 at 05:37:18PM -0800, John Hubbard wrote:
>> On 3/5/19 5:34 PM, John Hubbard wrote:
>> [snip]
>>>>> So release_pages(&local_page_list[j+1], npages - j-1) would be correct.
>>>>
>>>> Someone send a fixup patch please...
>>>>
>>>> Jason
>>>
>>> Yeah, I'm on it. Just need to double-check that this is the case. But Jason,
>>> you're confirming it already, so that helps too.
> 
> I didn't look; just assuming Artemy is right since he knows this
> code.
> 

OK. And I've confirmed it, too.

>>> Patch coming shortly.
>>>
>>
>> Jason, btw, do you prefer a patch that fixes the previous one, or a new 
>> patch that stands alone? (I'm not sure how this tree is maintained, exactly.)
> 
> It has passed the point when I should apply a fixup; rdma in general has
> approximately a 1-day window after the 'thanks applied' message
> where a rebase is possible.
> 
> Otherwise the tree is strictly no-rebase.
> 
> Jason
> 

Got it, patch is posted now.

thanks,

Patch

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index acb882f279cb..83872c1f3f2c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -40,6 +40,7 @@ 
 #include <linux/vmalloc.h>
 #include <linux/hugetlb.h>
 #include <linux/interval_tree_generic.h>
+#include <linux/pagemap.h>
 
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_umem.h>
@@ -648,25 +649,17 @@  int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 
 		if (npages < 0) {
 			if (npages != -EAGAIN)
-				pr_warn("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
+				pr_warn("fail to get %zu user pages with error %d\n",
+					gup_num_pages, npages);
 			else
-				pr_debug("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
+				pr_debug("fail to get %zu user pages with error %d\n",
+					 gup_num_pages, npages);
 			break;
 		}
 
 		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
 		mutex_lock(&umem_odp->umem_mutex);
 		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
-			if (user_virt & ~page_mask) {
-				p += PAGE_SIZE;
-				if (page_to_phys(local_page_list[j]) != p) {
-					ret = -EFAULT;
-					break;
-				}
-				put_page(local_page_list[j]);
-				continue;
-			}
-
 			ret = ib_umem_odp_map_dma_single_page(
 					umem_odp, k, local_page_list[j],
 					access_mask, current_seq);
@@ -684,9 +677,11 @@  int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 		mutex_unlock(&umem_odp->umem_mutex);
 
 		if (ret < 0) {
-			/* Release left over pages when handling errors. */
-			for (++j; j < npages; ++j)
-				put_page(local_page_list[j]);
+			/*
+			 * Release pages, starting at the first page
+			 * that experienced an error.
+			 */
+			release_pages(&local_page_list[j], npages - j);
 			break;
 		}
 	}