
[v3,25/25] IB/mlx4: Workaround for mlx4_alloc_priv_pages() array allocator

Message ID 20160620161200.10809.45762.stgit@manet.1015granger.net (mailing list archive)
State New, archived

Commit Message

Chuck Lever III June 20, 2016, 4:12 p.m. UTC
Ensure the MR's PBL array never occupies the last 8 bytes of a page.
This eliminates random "Local Protection Error" flushes when SLUB
debugging is enabled.

Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    2 +-
 drivers/infiniband/hw/mlx4/mr.c      |   40 +++++++++++++++++++---------------
 2 files changed, 23 insertions(+), 19 deletions(-)



Comments

Or Gerlitz June 21, 2016, 5:52 a.m. UTC | #1
On Mon, Jun 20, 2016 at 7:12 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
> This eliminates random "Local Protection Error" flushes when SLUB
> debugging is enabled.
>
> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')

Can't the driver advertise a smaller quantity for whatever it is that
ends up occupying those last eight bytes (255 or 511 of
attr XX instead of 256 or 512)?


> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Sagi Grimberg June 22, 2016, 11:56 a.m. UTC | #2
On 20/06/16 19:12, Chuck Lever wrote:
> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
> This eliminates random "Local Protection Error" flushes when SLUB
> debugging is enabled.
>
> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   drivers/infiniband/hw/mlx4/mlx4_ib.h |    2 +-
>   drivers/infiniband/hw/mlx4/mr.c      |   40 +++++++++++++++++++---------------
>   2 files changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index 6c5ac5d..29acda2 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,7 @@ struct mlx4_ib_mr {
>   	u32			max_pages;
>   	struct mlx4_mr		mmr;
>   	struct ib_umem	       *umem;
> -	void			*pages_alloc;
> +	size_t			page_map_size;
>   };
>
>   struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index 6312721..b90e47c 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -277,20 +277,27 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>   		      struct mlx4_ib_mr *mr,
>   		      int max_pages)
>   {
> -	int size = max_pages * sizeof(u64);
> -	int add_size;
>   	int ret;
>
> -	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> -
> -	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -	if (!mr->pages_alloc)
> +	/* Round mapping size up to ensure DMA cacheline
> +	 * alignment, and cache the size to avoid mult/div
> +	 * in fast path.
> +	 */
> +	mr->page_map_size = roundup(max_pages * sizeof(u64),
> +				    MLX4_MR_PAGES_ALIGN);
> +	if (mr->page_map_size > PAGE_SIZE)
> +		return -EINVAL;
> +
> +	/* This is overkill, but hardware requires that the
> +	 * PBL array begins at a properly aligned address and
> +	 * never occupies the last 8 bytes of a page.
> +	 */
> +	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> +	if (!mr->pages)
>   		return -ENOMEM;

Again, I'm not convinced that this is a better choice than allocating
the exact needed size as dma coherent, but given that the dma coherent
allocations are always page aligned I wonder if it's not the same
effect...

In any event, we can move forward with this for now:

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Sagi Grimberg June 22, 2016, 1:29 p.m. UTC | #3
>> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
>> This eliminates random "Local Protection Error" flushes when SLUB
>> debugging is enabled.
>>
>> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
>
> Can't the driver advertise a smaller quantity for whatever it is that
> ends up occupying those last eight bytes (255 or 511 of
> attr XX instead of 256 or 512)?

Not sure I understand your question. Are you suggesting that the driver
would expose that it's capable of 256 pages per MR?
Or Gerlitz June 22, 2016, 1:47 p.m. UTC | #4
On Wed, Jun 22, 2016 at 4:29 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
>>> This eliminates random "Local Protection Error" flushes when SLUB
>>> debugging is enabled.
>>>
>>> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
>>
>>
>> Can't the driver advertise a smaller quantity for whatever it is that
>> ends up occupying those last eight bytes (255 or 511 of
>> attr XX instead of 256 or 512)?

> Not sure I understand your question. Are you suggesting that the driver
> would expose that it's capable of 256 pages per MR?

I am asking whether we can advertise something else that would keep us
from falling into the last eight bytes of a page, e.g. if we fall there
because we advertise some attr N, let's advertise N-1.

just asking
Sagi Grimberg June 22, 2016, 2:02 p.m. UTC | #5
>>> Can't the driver advertise a smaller quantity for whatever it is that
>>> ends up occupying those last eight bytes (255 or 511 of
>>> attr XX instead of 256 or 512)?
>
>> Not sure I understand your question. Are you suggesting that the driver
>> would expose that it's capable of 256 pages per MR?
>
> I am asking whether we can advertise something else that would keep us
> from falling into the last eight bytes of a page, e.g. if we fall there
> because we advertise some attr N, let's advertise N-1.
>
> just asking

We do advertise 511 pages, but the code still needs an aligned
allocation to avoid the last 8 bytes.
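
To put numbers on that, a rough sketch, assuming MLX4_MR_PAGES_ALIGN is
0x40 and 4 KB pages:

	/* 511 advertised entries, 8 bytes each */
	size_t pbl_bytes = 511 * sizeof(u64);				/* 4088 */
	size_t map_bytes = roundup(pbl_bytes, MLX4_MR_PAGES_ALIGN);	/* 4096 */

	/*
	 * The rounded mapping is exactly PAGE_SIZE, so only a buffer that
	 * starts on a page boundary keeps the array clear of the last
	 * 8 bytes of any page; kzalloc() with SLUB debugging enabled
	 * gives no such alignment guarantee.
	 */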
Sagi Grimberg June 22, 2016, 2:04 p.m. UTC | #6
> +	/* This is overkill, but hardware requires that the
> +	 * PBL array begins at a properly aligned address and
> +	 * never occupies the last 8 bytes of a page.
> +	 */
> +	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> +	if (!mr->pages)
>   		return -ENOMEM;

Again, I'm not convinced that this is a better choice than allocating
the exact needed size as dma coherent, but given that the dma coherent
allocations are always page aligned I wonder if it's not the same
effect...

In any event, we can move forward with this for now:

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Leon Romanovsky June 22, 2016, 2:09 p.m. UTC | #7
On Wed, Jun 22, 2016 at 05:04:22PM +0300, Sagi Grimberg wrote:
> 
> >+	/* This is overkill, but hardware requires that the
> >+	 * PBL array begins at a properly aligned address and
> >+	 * never occupies the last 8 bytes of a page.
> >+	 */
> >+	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> >+	if (!mr->pages)
> >  		return -ENOMEM;
> 
> Again, I'm not convinced that this is a better choice than allocating
> the exact needed size as dma coherent, but given that the dma coherent
> allocations are always page aligned I wonder if it's not the same
> effect...
> 
> In any event, we can move forward with this for now:
> 
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Thanks Sagi,

As an update, we don't yet have a reliable answer on whether a similar
limitation exists for mlx5 too.
Chuck Lever III June 22, 2016, 2:47 p.m. UTC | #8
> On Jun 22, 2016, at 10:04 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> +    /* This is overkill, but hardware requires that the
>> +     * PBL array begins at a properly aligned address and
>> +     * never occupies the last 8 bytes of a page.
>> +     */
>> +    mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
>> +    if (!mr->pages)
>>          return -ENOMEM;
> 
> Again, I'm not convinced that this is a better choice than allocating
> the exact needed size as dma coherent, but given that the dma coherent
> allocations are always page aligned I wonder if it's not the same
> effect...

My concerns with DMA coherent were:

1. That pool may be a somewhat limited resource?

2. IMO DMA-API.txt suggests DMA coherent will perform less
well in some cases. Macro benchmarks I ran seemed to show
there was a slight performance hit with that approach, though
it was nearly in the noise.

I agree that the over-allocation in the streaming solution is a
concern. But as you say, there may be little we can do about it.

Wrt Or's comment, the device's maximum page list depth
is advertised to consumers via the device's attributes. However,
it would be defensive if there was a sanity check added in
mlx4_alloc_priv_pages to ensure that the max_pages argument
is a reasonable value (i.e., that the calculated array size does
indeed fit into a page).
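
A minimal sketch of such a guard, assuming the limit is the page-sized
PBL buffer (the v3 patch below expresses the same bound through the
roundup()/PAGE_SIZE comparison on page_map_size):

	/* hypothetical check at the top of mlx4_alloc_priv_pages() */
	if (max_pages > PAGE_SIZE / sizeof(u64))	/* 512 on 4 KB pages */
		return -EINVAL;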

> In any event, we can move forward with this for now:
> 
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Thanks, I'll add that! Though as before, I'm happy to drop this
patch if there is a different preferred official fix.
Leon Romanovsky June 22, 2016, 3:50 p.m. UTC | #9
On Wed, Jun 22, 2016 at 10:47:27AM -0400, Chuck Lever wrote:
> 
> > On Jun 22, 2016, at 10:04 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> > 
> > 
> >> +    /* This is overkill, but hardware requires that the
> >> +     * PBL array begins at a properly aligned address and
> >> +     * never occupies the last 8 bytes of a page.
> >> +     */
> >> +    mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> >> +    if (!mr->pages)
> >>          return -ENOMEM;
> > 
> > Again, I'm not convinced that this is a better choice than allocating
> > the exact needed size as dma coherent, but given that the dma coherent
> > allocations are always page aligned I wonder if it's not the same
> > effect...
> 
> My concerns with DMA coherent were:
> 
> 1. That pool may be a somewhat limited resource?
> 
> 2. IMO DMA-API.txt suggests DMA coherent will perform less
> well in some cases. Macro benchmarks I ran seemed to show
> there was a slight performance hit with that approach, though
> it was nearly in the noise.
> 
> I agree that the over-allocation in the streaming solution is a
> concern. But as you say, there may be little we can do about it.

According to [1], dma_alloc_coherent doesn't allocate from a pool, but
calls __get_free_page().

"A DMA pool is an allocation mechanism for small, coherent DMA mappings.
Mappings obtained from dma_alloc_coherent may have a minimum size of one
page."

> 
> Wrt Or's comment, the device's maximum page list depth
> is advertised to consumers via the device's attributes. However,
> it would be defensive if there was a sanity check added in
> mlx4_alloc_priv_pages to ensure that the max_pages argument
> is a reasonable value (i.e., that the calculated array size does
> indeed fit into a page).
> 
> > In any event, we can move forward with this for now:
> > 
> > Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> 
> Thanks, I'll add that! Though as before, I'm happy to drop this
> patch if there is a different preferred official fix.

We submitted your version of the patch, with minor changes to the
comments and commit message, together with Sagi's Reviewed-by tag [2].

[1] http://www.makelinux.net/ldd3/chp-15-sect-4
[2] https://patchwork.kernel.org/patch/9193075/

Christoph Hellwig June 22, 2016, 4:20 p.m. UTC | #10
On Wed, Jun 22, 2016 at 06:50:03PM +0300, Leon Romanovsky wrote:
> According to [1], dma_alloc_coherent doesn't allocate from a pool, but
> calls __get_free_page().

Use the source, Luke!  dma_alloc_coherent implementations vary widely.
While some indeed use alloc_pages, there are various complications,
especially on non-x86 architectures, or if the DMA mask is not the full
64 bits allowed.

> "A DMA pool is an allocation mechanism for small, coherent DMA mappings.
> Mappings obtained from dma_alloc_coherent may have a minimum size of one
> page."

Just because it's not using the dma_pool_* API doesn't mean it can't
allocate from various resource-constrained pools.  A trivial example is
the swiotlb used for address-limited devices when no hardware IOMMU
is available, but other architectures have even more complicated
implementations by default.  Take a look at
arch/arm/mm/dma-mapping.c:__dma_alloc() for fun.
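
For reference, the dma-coherent alternative discussed above might look
roughly like the following in mlx4_alloc_priv_pages(). This is a sketch
only, not the applied fix; error handling is elided and the field names
follow the existing mlx4_ib_mr layout:

	dma_addr_t dma;

	mr->page_map_size = roundup(max_pages * sizeof(u64),
				    MLX4_MR_PAGES_ALIGN);
	mr->pages = dma_alloc_coherent(device->dma_device, mr->page_map_size,
				       &dma, GFP_KERNEL);
	if (!mr->pages)
		return -ENOMEM;
	mr->page_map = dma;

	/*
	 * The matching teardown in mlx4_free_priv_pages() would be:
	 *	dma_free_coherent(device->dma_device, mr->page_map_size,
	 *			  mr->pages, mr->page_map);
	 */

Whether that would actually be preferable to the streaming mapping is
the open question here, given how much the coherent allocators differ
between architectures.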

Patch

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6c5ac5d..29acda2 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,7 @@  struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
-	void			*pages_alloc;
+	size_t			page_map_size;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..b90e47c 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -277,20 +277,27 @@  mlx4_alloc_priv_pages(struct ib_device *device,
 		      struct mlx4_ib_mr *mr,
 		      int max_pages)
 {
-	int size = max_pages * sizeof(u64);
-	int add_size;
 	int ret;
 
-	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
-
-	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-	if (!mr->pages_alloc)
+	/* Round mapping size up to ensure DMA cacheline
+	 * alignment, and cache the size to avoid mult/div
+	 * in fast path.
+	 */
+	mr->page_map_size = roundup(max_pages * sizeof(u64),
+				    MLX4_MR_PAGES_ALIGN);
+	if (mr->page_map_size > PAGE_SIZE)
+		return -EINVAL;
+
+	/* This is overkill, but hardware requires that the
+	 * PBL array begins at a properly aligned address and
+	 * never occupies the last 8 bytes of a page.
+	 */
+	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
+	if (!mr->pages)
 		return -ENOMEM;
 
-	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
-				      size, DMA_TO_DEVICE);
+				      mr->page_map_size, DMA_TO_DEVICE);
 
 	if (dma_mapping_error(device->dma_device, mr->page_map)) {
 		ret = -ENOMEM;
@@ -298,9 +305,9 @@  mlx4_alloc_priv_pages(struct ib_device *device,
 	}
 
 	return 0;
-err:
-	kfree(mr->pages_alloc);
 
+err:
+	free_page((unsigned long)mr->pages);
 	return ret;
 }
 
@@ -309,11 +316,10 @@  mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 {
 	if (mr->pages) {
 		struct ib_device *device = mr->ibmr.device;
-		int size = mr->max_pages * sizeof(u64);
 
 		dma_unmap_single(device->dma_device, mr->page_map,
-				 size, DMA_TO_DEVICE);
-		kfree(mr->pages_alloc);
+				 mr->page_map_size, DMA_TO_DEVICE);
+		free_page((unsigned long)mr->pages);
 		mr->pages = NULL;
 	}
 }
@@ -537,14 +543,12 @@  int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 	mr->npages = 0;
 
 	ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
-				   sizeof(u64) * mr->max_pages,
-				   DMA_TO_DEVICE);
+				   mr->page_map_size, DMA_TO_DEVICE);
 
 	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
 
 	ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
-				      sizeof(u64) * mr->max_pages,
-				      DMA_TO_DEVICE);
+				      mr->page_map_size, DMA_TO_DEVICE);
 
 	return rc;
 }