[v2,rdma-next,1/5] RDMA/umem: Add API to find best driver supported page size in an MR

Message ID 20190419134353.12684-2-shiraz.saleem@intel.com (mailing list archive)
State Superseded
Series Introduce a DMA block iterator

Commit Message

Saleem, Shiraz April 19, 2019, 1:43 p.m. UTC
This helper iterates through the SG list to find the best page size to use
from a bitmap of HW supported page sizes. Drivers that support multiple
page sizes, but not mixed sizes in an MR can use this API.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Gal Pressman <galpress@amazon.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
 drivers/infiniband/core/umem.c | 57 ++++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_umem.h         | 19 ++++++++++++++
 include/rdma/ib_verbs.h        | 24 ++++++++++++++++++
 3 files changed, 100 insertions(+)

Comments

Jason Gunthorpe April 25, 2019, 2:25 p.m. UTC | #1
On Fri, Apr 19, 2019 at 08:43:49AM -0500, Shiraz Saleem wrote:
> This helper iterates through the SG list to find the best page size to use
> from a bitmap of HW supported page sizes. Drivers that support multiple
> page sizes, but not mixed sizes in an MR can use this API.
> 
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Gal Pressman <galpress@amazon.com>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>  drivers/infiniband/core/umem.c | 57 ++++++++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_umem.h         | 19 ++++++++++++++
>  include/rdma/ib_verbs.h        | 24 ++++++++++++++++++
>  3 files changed, 100 insertions(+)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 7e912a9..8624ba1 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -127,6 +127,63 @@ static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
>  }
>  
>  /**
> + * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
> + *
> + * @umem: umem struct
> + * @pgsz_bitmap: bitmap of HW supported page sizes
> + * @flags: see enum ib_umem_find_best_pgsz_flags
> + * @uvirt_addr: user-space virtual MR base address (provided if
> + * IB_UMEM_VA_BASED_OFFSET flag is set)
> + *
> + * This helper is intended for HW that supports multiple page
> + * sizes but can do only a single page size in an MR.
> + *
> + * Returns 0 if the umem requires page sizes not supported by
> + * the driver to be mapped. Drivers always supporting PAGE_SIZE
> + * or smaller will never see a 0 result.
> + */
> +unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> +				     unsigned long pgsz_bitmap,
> +				     unsigned int flags,
> +				     unsigned long uvirt_addr)
> +{
> +	struct scatterlist *sg;
> +	dma_addr_t mask = 0;
> +	unsigned int best_pg_bit;
> +	int i;
> +
> +	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
> +		return 0;
> +
> +	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
> +		dma_addr_t dma_addr_start, dma_addr_end;
> +
> +		dma_addr_start = sg_dma_address(sg);
> +		dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
> +		if (!i)
> +			/* first SGE: can start on partial page size.
> +			 * Ignore alignment of start addr.
> +			 * For HW that uses VA_BASED_OFFSET, minimal alignment
> +			 * restricted by end of the first page in virtual space.
> +			 */
> +			mask |= (flags & IB_UMEM_VA_BASED_OFFSET ?
> +					((uvirt_addr + sg_dma_len(sg)) | dma_addr_end) :
> +					dma_addr_end);

This doesn't look quite right to me..

Let's say we have a 2M page with physical address 0x2600000 mapped into
virtual address 0x6400000

If we make an MR with VA 0x6400FFF and length 2M-4095 I expect the 2M
page size with the NIC.

We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000 as
the SGL is always rounded to pages

So the above mask math will compute:

 (0x6400FFF + 0x200000) | 0x2800000 = 0x6600fff | 0x2800000 = 0x6e00fff

and the low 0xFFF bits in the mask force the page size down to 4K
instead of the expected 2M.

I have a feeling the uvirt_addr should be computed more like

  if (flags & IB_UMEM_VA_BASED_OFFSET)
        mask |= uvirt_addr ^ umem->addr;

As umem->addr & hw_page_mask implicitly includes the offset into the
page, and if the uvirt_addr has a different offset bit then we can't
map 1:1 and thus have to reduce the page size.

For the above example if uvirt_addr = 0x6400FFF and umem->addr =
0x6400FFF because they are the same then the xor is 0.

If uvirt_addr = 0 (ie ZERO_BASED) and umem->addr is 0x6400FFF then the
function fails because this HW cannot create a zero based MR that is
not page aligned

If uvirt_addr = 0xFFF (ie offset'd ZERO_BASED) then (0xFFF ^
0x6400FFF) = 0x6400000 setting the max HW page size to 4M (as
expected)

.. and a more reasonable case of say, uvirt_addr = 0 and umem->addr =
0x271000, then 0 ^ 0x271000 -> means the biggest page size is 4K,
which is correct.
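
These three cases can be checked mechanically with a minimal,
standalone C sketch of the XOR restriction (illustrative only;
max_pgsz() is an ad hoc helper, not part of the patch):

#include <assert.h>

/* Lowest set bit of the accumulated mask caps the HW page size;
 * a mask of 0 means the IOVA adds no restriction at all.
 */
static unsigned long max_pgsz(unsigned long mask)
{
	return mask ? (mask & -mask) : ~0UL;
}

int main(void)
{
	/* uvirt_addr == umem->addr: XOR is 0, no restriction */
	assert(max_pgsz(0x6400FFFUL ^ 0x6400FFFUL) == ~0UL);
	/* offset'd ZERO_BASED: 0xFFF ^ 0x6400FFF = 0x6400000, 4M cap */
	assert(max_pgsz(0xFFFUL ^ 0x6400FFFUL) == 0x400000UL);
	/* ZERO_BASED over an unaligned buffer: capped at 4K */
	assert(max_pgsz(0UL ^ 0x271000UL) == 0x1000UL);
	return 0;
}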

Yes?

Jason
Jason Gunthorpe April 30, 2019, 6:05 p.m. UTC | #2
On Tue, Apr 30, 2019 at 01:36:14PM +0000, Saleem, Shiraz wrote:

> >If we make a MR with VA 0x6400FFF and length 2M-4095 I expect the 2M page
> >size with the NIC.
> >
> >We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000 as the
> >SGL is always rounded to pages
> 
> why isn't sg_dma_len 2M-4095? Is it because we compute npages in ib_umem_get()
> to build the SGL? Could using (addr & PAGE_MASK) and then adding dma_len help take
> care of this?  

We always round to page sizes when building the SGL, so the start is
rounded down and the end remains the same.

> >I have a feeling the uvirt_addr should be computed more like
> >
> >  if (flags & IB_UMEM_VA_BASED_OFFSET)
> >        mask |= uvirt_addr ^ umem->addr;
> 
> I am not following.
> 
> For a case like below where uvirt_addr = umem_addr, mask = 0 and we run rdma_find_pg_bit on it
> we'll pick 4K instead of 2M which is wrong.
 
> uvirt_addr [0x7f2b98200000]
> best_pgsz [0x201000]
> umem->address [0x7f2b98200000]
> dma_addr_start [0x130e200000]
> dma_addr_end [0x130e400000]
> sg_dma_len [2097152]

??

0x7f2b98200000 ^ 0x7f2b98200000 = 0

So mask isn't altered by the requested VA and you get 2M pages.
 
> Somewhere in the computation of mask, we need to take the va_addr + len and
> make sure that's 2M aligned? Correct?

No, we only care about the bits in the VA that differ from bits in the
PA, ie the bits that can't just pass through the lookup process

ie we want to compute the maximum value such that

(VA + off) & MASK == (PA + off) & MASK

Which is the hard upper bound on our page size.
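
Concretely, for the 2M example earlier (a hedged sanity check; the
constants come from that example, the expression itself is ad hoc):

/* VA 0x6400FFF maps to PA 0x2600FFF when phys 0x2600000 backs virt
 * 0x6400000 with one 2M page: the low 21 bits agree, so 2M works.
 */
assert(((0x6400FFFULL ^ 0x2600FFFULL) & (0x200000ULL - 1)) == 0);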


> >.. and a more resonable case of say, uvirt_addr = 0 and umem->addr =
> >0x271000, then 0 ^ 0x271000 -> means the biggest page size is 4k, which is
> >correct.
> >
> 
> For ZERO_BASED, is the uvirt_addr just added to umem->addr or not?

No, for ZERO_BASED uvirt_addr == 0 and umem->addr is some userspace
address

Jason
Saleem, Shiraz April 30, 2019, 7:54 p.m. UTC | #3
>Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On Tue, Apr 30, 2019 at 01:36:14PM +0000, Saleem, Shiraz wrote:
>
>> >If we make a MR with VA 0x6400FFF and length 2M-4095 I expect the 2M
>> >page size with the NIC.
>> >
>> >We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000 as
>> >the SGL is always rounded to pages
>>
>> why isn't it the sg_dma_len 2M-4095? Is it because we compute npages
>> in ib_umem_get() to build the SGL? Could using (addr & PAGE_MASK) and
>> then adding dma_len help take care of this?
>
>We always round to page sizes when building the SGL, so the start is rounded
>down and the end remains the same.
>
>> >I have a feeling the uvirt_addr should be computed more like
>> >
>> >  if (flags & IB_UMEM_VA_BASED_OFFSET)
>> >        mask |= uvirt_addr ^ umem->addr;
>>
>> I am not following.
>>
>> For a case like below where uvirt_addr = umem_addr, mask = 0 and we
>> run rdma_find_pgbit on it we ll pick 4K instead of 2M which is wrong.
>
>> uvirt_addr [0x7f2b98200000]
>> best_pgsz [0x201000]
>> umem->address [0x7f2b98200000]
>> dma_addr_start [0x130e200000]
>> dma_addr_end [0x130e400000]
>> sg_dma_len [2097152]
>
>??
>
>0x7f2b98200000 ^ 0x7f2b98200000 = 0
>
>So mask isn't altered by the requested VA and you get 2M pages.
I am still missing the point. mask was initialized to 0 and if we just do
"mask |= uvirt_addr ^ umem->addr" for the first SGE, then you ll
always get 0 for mask (and one page size) when uvirt_addr = umem->addr
irrespective of their values.

Maybe 'mask' is the wrong variable name to use, but the intention of the algo is to
OR together the dma_addr_start of every sg but the first, and the dma_addr_end
of every sg but the last. mask is initialized to 0 and tracks the computed value after
running through all SGs. Basically, mask tracks the lowest set bit across the entire set,
and running rdma_find_pg_bit on it gets the best page size to use.
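
For illustration, with three hypothetical 2M-aligned SGEs starting at
0x130e200000, that accumulation looks like:

u64 mask = 0;

mask |= 0x130e400000;			/* first SGE: OR its end only */
mask |= 0x130e400000 | 0x130e600000;	/* interior SGE: start and end */
mask |= 0x130e600000;			/* last SGE: OR its start only */

The lowest set bit of the result is 0x200000, so rdma_find_pg_bit
picks the 2M bit here.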
Jason Gunthorpe April 30, 2019, 9 p.m. UTC | #4
On Tue, Apr 30, 2019 at 07:54:24PM +0000, Saleem, Shiraz wrote:
> >Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
> >supported page size in an MR
> >
> >On Tue, Apr 30, 2019 at 01:36:14PM +0000, Saleem, Shiraz wrote:
> >
> >> >If we make a MR with VA 0x6400FFF and length 2M-4095 I expect the 2M
> >> >page size with the NIC.
> >> >
> >> >We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000 as
> >> >the SGL is always rounded to pages
> >>
> >> why isn't it the sg_dma_len 2M-4095? Is it because we compute npages
> >> in ib_umem_get() to build the SGL? Could using (addr & PAGE_MASK) and
> >> then adding dma_len help take care of this?
> >
> >We always round to page sizes when building the SGL, so the start is rounded
> >down and the end remains the same.
> >
> >> >I have a feeling the uvirt_addr should be computed more like
> >> >
> >> >  if (flags & IB_UMEM_VA_BASED_OFFSET)
> >> >        mask |= uvirt_addr ^ umem->addr;
> >>
> >> I am not following.
> >>
> >> For a case like below where uvirt_addr = umem_addr, mask = 0 and we
> >> run rdma_find_pgbit on it we ll pick 4K instead of 2M which is wrong.
> >
> >> uvirt_addr [0x7f2b98200000]
> >> best_pgsz [0x201000]
> >> umem->address [0x7f2b98200000]
> >> dma_addr_start [0x130e200000]
> >> dma_addr_end [0x130e400000]
> >> sg_dma_len [2097152]
> >
> >??
> >
> >0x7f2b98200000 ^ 0x7f2b98200000 = 0
> >
> >So mask isn't altered by the requested VA and you get 2M pages.

> I am still missing the point. mask was initialized to 0 and if we just do
> "mask |= uvirt_addr ^ umem->addr" for the first SGE, then you ll
> always get 0 for mask (and one page size) when uvirt_addr = umem->addr
> irrespective of their values.

This is because mask shouldn't start as zero - the highest possible
mask is something like log2_ru(umem length)

ie absent other considerations the page size at the NIC should be the
size of the umem.

Then we scan the sgl and reduce that value based on the physical
address layout

Then we reduce it again based on the uvirt vs address difference

Oring a '0' into the mask means that step contributes no restriction.

..

So I think the algorithm is just off as is, it should be more like

 // Page size can't be larger than the length of the MR
 mask = log2_ru(umem length); 

 // offset into the first SGL for umem->addr
 pgoff = umem->address & PAGE_MASK;
 va = uvirt_addr;

 for_each_sgl()
   mask |= (sgl->dma_addr + pgoff) ^ va
   if (not first or last)
       // Interior SGLs limit by the length as well
       mask |= sgl->length;
   va += sgl->length - pgoff;
   pgoff = 0;

The key is the xor of the VA to the PA at every step because that is
really the restriction we are looking at. If any VA/PA bits differ for
any address in the MR that immediately reduces the max page size.

Remember what the HW does is
  PA = MR_PAGE[VA] | (VA & PAGE_MASK)

So any situation where (PA & PAGE_MASK) != (VA & PAGE_MASK) is a
violation - this is the calculation the XOR is doing.
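
Spelled out as C, a sketch of that algorithm (hypothetical stand-ins,
not the patch's code: struct hyp_sge is a flattened SGE, log2_ru is
taken as the rounded-up power of two itself so it can be OR'd with
addresses, and pgoff uses the ~PAGE_MASK correction discussed below):

/* kernel-style sketch; assumes u64, PAGE_MASK and roundup_pow_of_two()
 * from the usual kernel headers
 */
struct hyp_sge {
	u64 dma_addr;
	u64 length;
};

static unsigned long hyp_pgsz_mask(const struct hyp_sge *sgl, int nents,
				   u64 umem_addr, u64 umem_len, u64 iova)
{
	/* page size can't be larger than the length of the MR */
	unsigned long mask = roundup_pow_of_two(umem_len);
	/* offset of umem->addr into its first page */
	u64 pgoff = umem_addr & ~PAGE_MASK;
	u64 va = iova;
	int i;

	for (i = 0; i < nents; i++) {
		/* any bit where VA and PA disagree caps the page size */
		mask |= (sgl[i].dma_addr + pgoff) ^ va;
		/* interior SGEs must be whole pages, so their length
		 * restricts the page size as well
		 */
		if (i != 0 && i != nents - 1)
			mask |= sgl[i].length;
		va += sgl[i].length - pgoff;
		pgoff = 0;
	}
	return mask;	/* feed to rdma_find_pg_bit() */
}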

Jason
Saleem, Shiraz May 3, 2019, 3:22 p.m. UTC | #5
>Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On Tue, Apr 30, 2019 at 07:54:24PM +0000, Saleem, Shiraz wrote:
>> >Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best
>> >driver supported page size in an MR
>> >
>> >On Tue, Apr 30, 2019 at 01:36:14PM +0000, Saleem, Shiraz wrote:
>> >
>> >> >If we make a MR with VA 0x6400FFF and length 2M-4095 I expect the
>> >> >2M page size with the NIC.
>> >> >
>> >> >We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000
>> >> >as the SGL is always rounded to pages
>> >>
>> >> why isn't it the sg_dma_len 2M-4095? Is it because we compute
>> >> npages in ib_umem_get() to build the SGL? Could using (addr &
>> >> PAGE_MASK) and then adding dma_len help take care of this?
>> >
>> >We always round to page sizes when building the SGL, so the start is
>> >rounded down and the end remains the same.
>> >
>> >> >I have a feeling the uvirt_addr should be computed more like
>> >> >
>> >> >  if (flags & IB_UMEM_VA_BASED_OFFSET)
>> >> >        mask |= uvirt_addr ^ umem->addr;
>> >>
>> >> I am not following.
>> >>
>> >> For a case like below where uvirt_addr = umem_addr, mask = 0 and we
>> >> run rdma_find_pgbit on it we ll pick 4K instead of 2M which is wrong.
>> >
>> >> uvirt_addr [0x7f2b98200000]
>> >> best_pgsz [0x201000]
>> >> umem->address [0x7f2b98200000]
>> >> dma_addr_start [0x130e200000]
>> >> dma_addr_end [0x130e400000]
>> >> sg_dma_len [2097152]
>> >
>> >??
>> >
>> >0x7f2b98200000 ^ 0x7f2b98200000 = 0
>> >
>> >So mask isn't altered by the requested VA and you get 2M pages.
>
>> I am still missing the point. mask was initialized to 0 and if we just
>> do "mask |= uvirt_addr ^ umem->addr" for the first SGE, then you ll
>> always get 0 for mask (and one page size) when uvirt_addr = umem->addr
>> irrespective of their values.
>
>This is because mask shouldn't start as zero - the highest possible mask is
>something like log2_ru(umem length)
>
>ie absent other considerations the page size at the NIC should be the size of the
>umem.
>
>Then we scan the sgl and reduce that value based on the physical address
>layout
>
>Then we reduce it again based on the uvirt vs address difference
>
>Oring a '0' into the mask means that step contributes no restriction.
>
>..
>
>So I think the algorithm is just off as is, it should be more like
>
> // Page size can't be larger than the length of the MR
> mask = log2_ru(umem length);
>
> // offset into the first SGL for umem->addr
> pgoff = umem->address & PAGE_MASK;
> va = uvirt_addr;
>

Did you mean pgoff = umem->address & ~PAGE_MASK?

> for_each_sgl()
>   mask |= (sgl->dma_addr + pgoff) ^ va
>   if (not first or last)
>       // Interior SGLs limit by the length as well
>       mask |= sgl->length;
>   va += sgl->length - pgoff;
>   pgoff = 0;
>
>The key is the xor of the VA to the PA at every step because that is really the
>restriction we are looking at. If any VA/PA bits differ for any address in the MR
>that immediately reduces the max page size.
>
>Remember what the HW does is
>  PA = MR_PAGE[VA] | (VA & PAGE_MASK)
>
>So any situation where PA & PAGE_MASK != VA is a violation - this is the
>calculation the XOR is doing.
>
>Jason
Jason Gunthorpe May 3, 2019, 3:28 p.m. UTC | #6
On Fri, May 03, 2019 at 03:22:59PM +0000, Saleem, Shiraz wrote:
> >This is because mask shouldn't start as zero - the highest possible mask is
> >something like log2_ru(umem length)
> >
> >ie absent other considerations the page size at the NIC should be the size of the
> >umem.
> >
> >Then we scan the sgl and reduce that value based on the physical address
> >layout
> >
> >Then we reduce it again based on the uvirt vs address difference
> >
> >Oring a '0' into the mask means that step contributes no restriction.
> >
> >..
> >
> >So I think the algorithm is just off as is, it should be more like
> >
> > // Page size can't be larger than the length of the MR
> > mask = log2_ru(umem length);
> >
> > // offset into the first SGL for umem->addr
> > pgoff = umem->address & PAGE_MASK;
> > va = uvirt_addr;
> >
> 
> Did you mean pgoff = umem->address & ~PAGE_MASK?

Yes...

But really even that is not what it should be, the 'pgoff' should
simply be the 'offset' member of the first sgl and we should set it
correctly based on the umem->address when building the sgl in the core code.

Jason
Saleem, Shiraz May 3, 2019, 4:01 p.m. UTC | #7
>Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On Fri, May 03, 2019 at 03:22:59PM +0000, Saleem, Shiraz wrote:
>> >This is because mask shouldn't start as zero - the highest possible
>> >mask is something like log2_ru(umem length)
>> >
>> >ie absent other considerations the page size at the NIC should be the
>> >size of the umem.
>> >
>> >Then we scan the sgl and reduce that value based on the physical
>> >address layout
>> >
>> >Then we reduce it again based on the uvirt vs address difference
>> >
>> >Oring a '0' into the mask means that step contributes no restriction.
>> >
>> >..
>> >
>> >So I think the algorithm is just off as is, it should be more like
>> >
>> > // Page size can't be larger than the length of the MR
>> > mask = log2_ru(umem length);
>> >
>> > // offset into the first SGL for umem->addr
>> > pgoff = umem->address & PAGE_MASK;
>> > va = uvirt_addr;
>> >
>>
>> Did you mean pgoff = umem->address & ~PAGE_MASK?
>
>Yes...

OK. Don't we need something like this for zero-based VA?
va = uvirt_addr ? uvirt_addr : umem->addr;

Also can we do this for all HW?
Or do we keep the IB_UMEM_VA_BASED_OFFSET flag and OR
in the dma_addr_end (for first SGL) and dma_addr_start (for last SGL)
to the mask when the flag is not set?

> for_each_sgl()
>   mask |= (sgl->dma_addr + pgoff) ^ va
>   if (not first or last)
>       // Interior SGLs limit by the length as well
>       mask |= sgl->length;
>   va += sgl->length - pgoff;
>   pgoff = 0;
>

>
>But really even that is not what it should be, the 'pgoff' should simply be the
>'offset' member of the first sgl and we should set it correctly based on the
>umem->address when building the sgl in the core code.
>

I can look at updating it after this series, along with the core code
changes to not over-allocate the scatter table.
Jason Gunthorpe May 3, 2019, 11:50 p.m. UTC | #8
On Fri, May 03, 2019 at 04:01:27PM +0000, Saleem, Shiraz wrote:
> >Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
> >supported page size in an MR
> >
> >On Fri, May 03, 2019 at 03:22:59PM +0000, Saleem, Shiraz wrote:
> >> >This is because mask shouldn't start as zero - the highest possible
> >> >mask is something like log2_ru(umem length)
> >> >
> >> >ie absent other considerations the page size at the NIC should be the
> >> >size of the umem.
> >> >
> >> >Then we scan the sgl and reduce that value based on the physical
> >> >address layout
> >> >
> >> >Then we reduce it again based on the uvirt vs address difference
> >> >
> >> >Oring a '0' into the mask means that step contributes no restriction.
> >> >
> >> >..
> >> >
> >> >So I think the algorithm is just off as is, it should be more like
> >> >
> >> > // Page size can't be larger than the length of the MR
> >> > mask = log2_ru(umem length);
> >> >
> >> > // offset into the first SGL for umem->addr
> >> > pgoff = umem->address & PAGE_MASK;
> >> > va = uvirt_addr;
> >> >
> >>
> >> Did you mean pgoff = umem->address & ~PAGE_MASK?
> >
> >Yes...
> 
> OK. Don't we need something like this for zero based VA?
> va = uvirt_addr ?  uvirt_addr :  umem->addr;

Every MR is created with an IOVA (here called VA). Before we get here
the caller should figure out the IOVA and it should either be 0 or
umem->address in the cases we implement today *however* it can really
be anything and this code shouldn't care..

> Also can we do this for all HW?

All hardware has to support an arbitrary IOVA.

> Or do we keep the IB_UMEM_VA_BASED_OFFSET flag and OR
> in the dma_addr_end (for first SGL) and dma_addr_start (for last SGL)
> to the mask when the flag is not set?

I wasn't totally sure what that flag did..

If we don't have any drivers that need something different today then
I would focus entirely on the:

 PA = MR_PAGE_TABLE[IOVA/PAGE_SIZE] | (IOVA & PAGE_MASK)

Case, which is what I outlined.

The ^ is checking that the (IOVA & PAGE_MASK) scheme will work.
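
A toy model of that HW translation (hypothetical names; PAGE_MASK here
is the low-bits offset mask, as above):

/* PA = MR_PAGE_TABLE[IOVA / PAGE_SIZE] | (IOVA & PAGE_MASK) */
static u64 hyp_hw_translate(const u64 *mr_page_table, u64 iova,
			    unsigned int pg_shift)
{
	u64 off_mask = (1ULL << pg_shift) - 1;

	return mr_page_table[iova >> pg_shift] | (iova & off_mask);
}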

> > for_each_sgl()
> >   mask |= (sgl->dma_addr + pgoff) ^ va
> >   if (not first or last)
> >       // Interior SGLs limit by the length as well
> >       mask |= sgl->length;
> >   va += sgl->length - pgoff;
> >   pgoff = 0;
> >
> 
> >
> >But really even that is not what it should be, the 'pgoff' should simply be the
> >'offset' member of the first sgl and we should set it correctly based on the
> >umem->address when building the sgl in the core code.
> >
> 
> I can look to update it post this series.
> And the core code changes to not over allocate scatter table.

Sure

Let's try to get this sorted out right away so it can go in this merge
window, which might start on Monday.

Jason
Saleem, Shiraz May 6, 2019, 1:52 p.m. UTC | #9
>Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On Fri, May 03, 2019 at 04:01:27PM +0000, Saleem, Shiraz wrote:
>> >Subject: Re: [PATCH v2 rdma-next 1/5] RDMA/umem: Add API to find best
>> >driver supported page size in an MR
>> >
>> >On Fri, May 03, 2019 at 03:22:59PM +0000, Saleem, Shiraz wrote:
>> >> >This is because mask shouldn't start as zero - the highest
>> >> >possible mask is something like log2_ru(umem length)
>> >> >
>> >> >ie absent other considerations the page size at the NIC should be
>> >> >the size of the umem.
>> >> >
>> >> >Then we scan the sgl and reduce that value based on the physical
>> >> >address layout
>> >> >
>> >> >Then we reduce it again based on the uvirt vs address difference
>> >> >
>> >> >Oring a '0' into the mask means that step contributes no restriction.
>> >> >
>> >> >..
>> >> >
>> >> >So I think the algorithm is just off as is, it should be more like
>> >> >
>> >> > // Page size can't be larger than the length of the MR
>> >> > mask = log2_ru(umem length);
>> >> >
>> >> > // offset into the first SGL for umem->addr
>> >> > pgoff = umem->address & PAGE_MASK;
>> >> > va = uvirt_addr;
>> >> >
>> >>
>> >> Did you mean pgoff = umem->address & ~PAGE_MASK?
>> >
>> >Yes...
>>
>> OK. Don't we need something like this for zero based VA?
>> va = uvirt_addr ?  uvirt_addr :  umem->addr;
>
>Every MR is created with an IOVA (here called VA). Before we get here the caller
>should figure out the IOVA and it should either be 0 or
>umem->address in the cases we implement today *however* it can really
>be anything and this code shouldn't care..
>
>> Also can we do this for all HW?
>
>All hardware has to support an arbitary IOVA.
>
>> Or do we keep the IB_UMEM_VA_BASED_OFFSET flag and OR in the
>> dma_addr_end (for first SGL) and dma_addr_start (for last SGL) to the
>> mask when the flag is not set?
>
>I wasn't totally sure what that flag did..
>
>If we don't have any drivers that need something different today then I would
>focus entirely on the:
>
> PA = MR_PAGE_TABLE[IOVA/PAGE_SIZE] | (IOVA & PAGE_MASK)
>

OK. Agreed.

Shiraz

Patch

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7e912a9..8624ba1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -127,6 +127,63 @@  static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
 }
 
 /**
+ * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
+ *
+ * @umem: umem struct
+ * @pgsz_bitmap: bitmap of HW supported page sizes
+ * @flags: see enum ib_umem_find_best_pgsz_flags
+ * @uvirt_addr: user-space virtual MR base address (provided if
+ * IB_UMEM_VA_BASED_OFFSET flag is set)
+ *
+ * This helper is intended for HW that supports multiple page
+ * sizes but can do only a single page size in an MR.
+ *
+ * Returns 0 if the umem requires page sizes not supported by
+ * the driver to be mapped. Drivers always supporting PAGE_SIZE
+ * or smaller will never see a 0 result.
+ */
+unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
+				     unsigned long pgsz_bitmap,
+				     unsigned int flags,
+				     unsigned long uvirt_addr)
+{
+	struct scatterlist *sg;
+	dma_addr_t mask = 0;
+	unsigned int best_pg_bit;
+	int i;
+
+	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
+		return 0;
+
+	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+		dma_addr_t dma_addr_start, dma_addr_end;
+
+		dma_addr_start = sg_dma_address(sg);
+		dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
+		if (!i)
+			/* first SGE: can start on partial page size.
+			 * Ignore alignment of start addr.
+			 * For HW that uses VA_BASED_OFFSET, minimal alignment
+			 * restricted by end of the first page in virtual space.
+			 */
+			mask |= (flags & IB_UMEM_VA_BASED_OFFSET ?
+					((uvirt_addr + sg_dma_len(sg)) | dma_addr_end) :
+					dma_addr_end);
+		else if (i == (umem->nmap - 1))
+			/* last SGE: Can end on a partial page size.
+			 * Ignore alignment of end addr.
+			 */
+			mask |= dma_addr_start;
+		else
+			mask |= (dma_addr_start | dma_addr_end);
+	}
+	best_pg_bit = rdma_find_pg_bit(mask, pgsz_bitmap);
+
+	return BIT_ULL(best_pg_bit);
+}
+EXPORT_SYMBOL(ib_umem_find_best_pgsz);
+
+/**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
  * If access flags indicate ODP memory, avoid pinning. Instead, stores
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index b13a2e9..3fb403a 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -56,6 +56,14 @@  struct ib_umem {
 	unsigned int    sg_nents;
 };
 
+enum ib_umem_find_best_pgsz_flags {
+	/* Flag for HW that uses least significant bits (eg: 12 bits for
+	 * 4K pages, 21 bits for 2M pages) of the VA to indicate start offset
+	 * into the DMA page list.
+	 */
+	IB_UMEM_VA_BASED_OFFSET = (1 << 0),
+};
+
 /* Returns the offset of the umem start relative to the first page. */
 static inline int ib_umem_offset(struct ib_umem *umem)
 {
@@ -87,6 +95,10 @@  struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 int ib_umem_page_count(struct ib_umem *umem);
 int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		      size_t length);
+unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
+				     unsigned long pgsz_bitmap,
+				     unsigned int flags,
+				     unsigned long uvirt_addr);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
@@ -104,6 +116,13 @@  static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
 		      		    size_t length) {
 	return -EINVAL;
 }
+static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
+					 unsigned long pgsz_bitmap,
+					 unsigned int flags,
+					 unsigned long uvirt_addr) {
+	return 0;
+}
+
+
 #endif /* CONFIG_INFINIBAND_USER_MEM */
 
 #endif /* IB_UMEM_H */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 43a75ab..720ce23 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3214,6 +3214,30 @@  static inline bool rdma_cap_read_inv(struct ib_device *dev, u32 port_num)
 	return rdma_protocol_iwarp(dev, port_num);
 }
 
+/**
+ * rdma_find_pg_bit - Find page bit given address and HW supported page sizes
+ *
+ * @addr: address
+ * @pgsz_bitmap: bitmap of HW supported page sizes
+ */
+static inline unsigned int rdma_find_pg_bit(unsigned long addr,
+					    unsigned long pgsz_bitmap)
+{
+	unsigned long align;
+	unsigned long pgsz;
+
+	align = addr & -addr;
+
+	/* Find page bit such that addr is aligned to the highest supported
+	 * HW page size
+	 */
+	pgsz = pgsz_bitmap & ~(-align << 1);
+	if (!pgsz)
+		return __ffs(pgsz_bitmap);
+
+	return __fls(pgsz);
+}
+
 int ib_set_vf_link_state(struct ib_device *device, int vf, u8 port,
 			 int state);
 int ib_get_vf_config(struct ib_device *device, int vf, u8 port,
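
As a usage note for rdma_find_pg_bit(): with a hypothetical driver
bitmap of 4K|2M|1G, the example masks from the thread behave as
discussed (a sketch, not driver code):

unsigned long pgsz_bitmap = SZ_4K | SZ_2M | SZ_1G;	/* 0x40201000 */

rdma_find_pg_bit(0x6400000, pgsz_bitmap);	/* 4M-aligned: returns 21 (2M) */
rdma_find_pg_bit(0x271000, pgsz_bitmap);	/* 4K-aligned: returns 12 (4K) */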