Message ID: 20160620161200.10809.45762.stgit@manet.1015granger.net (mailing list archive)
State: New, archived
On Mon, Jun 20, 2016 at 7:12 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
> This eliminates random "Local Protection Error" flushes when SLUB
> debugging is enabled.
>
> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')

Can't the driver advertise a smaller quantity for whatever occupies
those last eight bytes (255 or 511 of attr XX instead of 256 or 512)?

> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 20/06/16 19:12, Chuck Lever wrote:
> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
> This eliminates random "Local Protection Error" flushes when SLUB
> debugging is enabled.
>
> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  drivers/infiniband/hw/mlx4/mlx4_ib.h |  2 +-
>  drivers/infiniband/hw/mlx4/mr.c      | 40 +++++++++++++++++++---------------
>  2 files changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index 6c5ac5d..29acda2 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,7 @@ struct mlx4_ib_mr {
>  	u32			max_pages;
>  	struct mlx4_mr		mmr;
>  	struct ib_umem	       *umem;
> -	void			*pages_alloc;
> +	size_t			page_map_size;
>  };
>
>  struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index 6312721..b90e47c 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -277,20 +277,27 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>  			   struct mlx4_ib_mr *mr,
>  			   int max_pages)
>  {
> -	int size = max_pages * sizeof(u64);
> -	int add_size;
>  	int ret;
>
> -	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> -
> -	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -	if (!mr->pages_alloc)
> +	/* Round mapping size up to ensure DMA cacheline
> +	 * alignment, and cache the size to avoid mult/div
> +	 * in fast path.
> +	 */
> +	mr->page_map_size = roundup(max_pages * sizeof(u64),
> +				    MLX4_MR_PAGES_ALIGN);
> +	if (mr->page_map_size > PAGE_SIZE)
> +		return -EINVAL;
> +
> +	/* This is overkill, but hardware requires that the
> +	 * PBL array begins at a properly aligned address and
> +	 * never occupies the last 8 bytes of a page.
> +	 */
> +	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> +	if (!mr->pages)
> 		return -ENOMEM;

Again, I'm not convinced that this is a better choice than allocating
the exact needed size as dma coherent, but given that the dma coherent
allocations are always page aligned I wonder if it's not the same
effect...

In any event, we can move forward with this for now:

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
>> This eliminates random "Local Protection Error" flushes when SLUB
>> debugging is enabled.
>>
>> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
>
> Can't the driver advertise a smaller quantity for whatever occupies
> those last eight bytes (255 or 511 of attr XX instead of 256 or 512)?

Not sure I understand your question. Are you suggesting that the driver
would expose that it's capable of 256 pages per MR?
On Wed, Jun 22, 2016 at 4:29 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>> Ensure the MR's PBL array never occupies the last 8 bytes of a page.
>>> This eliminates random "Local Protection Error" flushes when SLUB
>>> debugging is enabled.
>>>
>>> Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
>>
>> Can't the driver advertise a smaller quantity for whatever occupies
>> those last eight bytes (255 or 511 of attr XX instead of 256 or 512)?
>
> Not sure I understand your question. Are you suggesting that the driver
> would expose that it's capable of 256 pages per MR?

I am asking whether we can advertise something else that would keep us
from falling into the last eight bytes of a page; e.g., if we fall there
because we advertise some attr N, let's advertise N-1.

Just asking.
>>> Can't the driver advertise a smaller quantity for whatever occupies
>>> those last eight bytes (255 or 511 of attr XX instead of 256 or 512)?
>
>> Not sure I understand your question. Are you suggesting that the driver
>> would expose that it's capable of 256 pages per MR?
>
> I am asking whether we can advertise something else that would keep us
> from falling into the last eight bytes of a page; e.g., if we fall there
> because we advertise some attr N, let's advertise N-1.
>
> Just asking.

We do advertise 511 pages, but the code still needs an aligned
allocation to avoid the last 8 bytes.
On Wed, Jun 22, 2016 at 05:04:22PM +0300, Sagi Grimberg wrote:
> >+	/* This is overkill, but hardware requires that the
> >+	 * PBL array begins at a properly aligned address and
> >+	 * never occupies the last 8 bytes of a page.
> >+	 */
> >+	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
> >+	if (!mr->pages)
> > 		return -ENOMEM;
>
> Again, I'm not convinced that this is a better choice than allocating
> the exact needed size as dma coherent, but given that the dma coherent
> allocations are always page aligned I wonder if it's not the same
> effect...
>
> In any event, we can move forward with this for now:
>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Thanks Sagi.

As an update, we don't yet have a reliable answer on whether a similar
limitation exists in mlx5 too.
> On Jun 22, 2016, at 10:04 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> +	/* This is overkill, but hardware requires that the
>> +	 * PBL array begins at a properly aligned address and
>> +	 * never occupies the last 8 bytes of a page.
>> +	 */
>> +	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
>> +	if (!mr->pages)
>> 		return -ENOMEM;
>
> Again, I'm not convinced that this is a better choice than allocating
> the exact needed size as dma coherent, but given that the dma coherent
> allocations are always page aligned I wonder if it's not the same
> effect...

My concerns with DMA coherent were:

1. That pool may be a somewhat limited resource?

2. IMO DMA-API.txt suggests DMA coherent will perform less well in some
cases. Macro benchmarks I ran seemed to show there was a slight
performance hit with that approach, though it was nearly in the noise.

I agree that the over-allocation in the streaming solution is a concern.
But as you say, there may be little we can do about it.

Wrt Or's comment, the device's maximum page list depth is advertised to
consumers via the device's attributes. However, it would be defensive if
there was a sanity check added in mlx4_alloc_priv_pages to ensure that
the max_pages argument is a reasonable value (i.e., that the calculated
array size does indeed fit into a page).

> In any event, we can move forward with this for now:
>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Thanks, I'll add that! Though as before, I'm happy to drop this patch if
there is a different preferred official fix.
On Wed, Jun 22, 2016 at 10:47:27AM -0400, Chuck Lever wrote:
> > On Jun 22, 2016, at 10:04 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> > Again, I'm not convinced that this is a better choice than allocating
> > the exact needed size as dma coherent, but given that the dma coherent
> > allocations are always page aligned I wonder if it's not the same
> > effect...
>
> My concerns with DMA coherent were:
>
> 1. That pool may be a somewhat limited resource?
>
> 2. IMO DMA-API.txt suggests DMA coherent will perform less well in some
> cases. Macro benchmarks I ran seemed to show there was a slight
> performance hit with that approach, though it was nearly in the noise.
>
> I agree that the over-allocation in the streaming solution is a concern.
> But as you say, there may be little we can do about it.

According to [1], dma_alloc_coherent doesn't allocate from a pool, but
calls __get_free_page():

"A DMA pool is an allocation mechanism for small, coherent DMA mappings.
Mappings obtained from dma_alloc_coherent may have a minimum size of one
page."

> Wrt Or's comment, the device's maximum page list depth is advertised to
> consumers via the device's attributes. However, it would be defensive if
> there was a sanity check added in mlx4_alloc_priv_pages to ensure that
> the max_pages argument is a reasonable value (i.e., that the calculated
> array size does indeed fit into a page).
>
> Thanks, I'll add that! Though as before, I'm happy to drop this patch
> if there is a different preferred official fix.

We submitted your version of the patch, with minor changes to the
comments and commit message, together with Sagi's Reviewed-by tag [2].

[1] http://www.makelinux.net/ldd3/chp-15-sect-4
[2] https://patchwork.kernel.org/patch/9193075/
On Wed, Jun 22, 2016 at 06:50:03PM +0300, Leon Romanovsky wrote:
> According to [1], dma_alloc_coherent doesn't allocate from a pool, but
> calls __get_free_page().

Use the source, Luke! dma_alloc_coherent implementations vary widely.
While some indeed use alloc_pages, there are various complications,
especially on non-x86 architectures, or if the dma mask is not the full
64 bits allowed.

> "A DMA pool is an allocation mechanism for small, coherent DMA mappings.
> Mappings obtained from dma_alloc_coherent may have a minimum size of one
> page."

Just because it's not using the dma_pool_* API, it could allocate from
various resource-constrained pools. A trivial example is the swiotlb
used for address-limited devices if no hardware IOMMU is available, but
other architectures have even more complicated implementations by
default. Take a look at arch/arm/mm/dma-mapping.c:__dma_alloc() for fun.
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6c5ac5d..29acda2 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,7 @@ struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
-	void			*pages_alloc;
+	size_t			page_map_size;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..b90e47c 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -277,20 +277,27 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 			   struct mlx4_ib_mr *mr,
 			   int max_pages)
 {
-	int size = max_pages * sizeof(u64);
-	int add_size;
 	int ret;
 
-	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
-
-	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-	if (!mr->pages_alloc)
+	/* Round mapping size up to ensure DMA cacheline
+	 * alignment, and cache the size to avoid mult/div
+	 * in fast path.
+	 */
+	mr->page_map_size = roundup(max_pages * sizeof(u64),
+				    MLX4_MR_PAGES_ALIGN);
+	if (mr->page_map_size > PAGE_SIZE)
+		return -EINVAL;
+
+	/* This is overkill, but hardware requires that the
+	 * PBL array begins at a properly aligned address and
+	 * never occupies the last 8 bytes of a page.
+	 */
+	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
+	if (!mr->pages)
 		return -ENOMEM;
 
-	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
-				      size, DMA_TO_DEVICE);
+				      mr->page_map_size, DMA_TO_DEVICE);
 
 	if (dma_mapping_error(device->dma_device, mr->page_map)) {
 		ret = -ENOMEM;
@@ -298,9 +305,9 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 	}
 
 	return 0;
-err:
-	kfree(mr->pages_alloc);
 
+err:
+	free_page((unsigned long)mr->pages);
 	return ret;
 }
 
@@ -309,11 +316,10 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 {
 	if (mr->pages) {
 		struct ib_device *device = mr->ibmr.device;
-		int size = mr->max_pages * sizeof(u64);
 
 		dma_unmap_single(device->dma_device, mr->page_map,
-				 size, DMA_TO_DEVICE);
-		kfree(mr->pages_alloc);
+				 mr->page_map_size, DMA_TO_DEVICE);
+		free_page((unsigned long)mr->pages);
 		mr->pages = NULL;
 	}
 }
@@ -537,14 +543,12 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 	mr->npages = 0;
 
 	ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
-				   sizeof(u64) * mr->max_pages,
-				   DMA_TO_DEVICE);
+				   mr->page_map_size, DMA_TO_DEVICE);
 
 	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
 
 	ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
-				      sizeof(u64) * mr->max_pages,
-				      DMA_TO_DEVICE);
+				      mr->page_map_size, DMA_TO_DEVICE);
 
 	return rc;
 }
Ensure the MR's PBL array never occupies the last 8 bytes of a page.
This eliminates random "Local Protection Error" flushes when SLUB
debugging is enabled.

Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  2 +-
 drivers/infiniband/hw/mlx4/mr.c      | 40 +++++++++++++++++++---------------
 2 files changed, 23 insertions(+), 19 deletions(-)