Message ID: 57473025.5020801@grimberg.me
State: Superseded
> On May 26, 2016, at 1:19 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>>>> When debugging is disabled, kzalloc returns page-aligned
>>>>>> addresses:
>>>>>
>>>>> Is it defined somewhere that regular kzalloc/kmalloc guarantees to
>>>>> return a page-aligned address as you see in your testing? If so, the
>>>>> debug mode should behave the same. Otherwise we can consider using
>>>>> any allocation flag that can force that, if such exists.
>>>>> Let's get other people's input here.
>>>>
>>>> My understanding is that the fact that k[mz]alloc() returns a
>>>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
>>>> side effect of the implementation and not something callers of that
>>>> function should rely on. I think the only assumption k[mz]alloc()
>>>> callers should rely on is that the allocated memory respects
>>>> ARCH_KMALLOC_MINALIGN.
>>>
>>> I agree. mlx4_alloc_priv_pages() is carefully designed to
>>> correct the alignment of the buffer, so it already assumes
>>> that it is not getting a page-aligned buffer.
>>>
>>> The alignment isn't the problem here, though. It's that
>>> the buffer contains a page boundary. That is guaranteed
>>> to be the case for HCAs that support more than 512
>>> sges, so that will have to be addressed (at least in
>>> mlx5).
>>
>> rrr...
>>
>> I think we should make the pages allocations dma coherent
>> in order to fix that...
>>
>> Nice catch Chuck.
>
> Does this untested patch help (if so, mlx5 will need an identical patch)?

Thanks, Sagi.

Is it safe? Yes, IPoIB and NFS/RDMA work after the patch is applied.

Is it effective? I booted with slub_debug=zfpu, and I am not able to
reproduce the Local Protection Error WC flushes. However, it's not clear
whether that's because DMA-coherent memory is not allocated out of a
slab cache, and thus is not subject to SLUB debugging. <shrug>

To test it more thoroughly, mlx4_alloc_priv_pages() could allocate a
two-page buffer and push the address of the array up far enough that it
crosses the page boundary. I didn't try that.
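A rough sketch of that more thorough test (hypothetical and untested;
the offset arithmetic and the __be64 type of mr->pages are assumptions,
not taken from the thread) could look like this inside
mlx4_alloc_priv_pages():

	int size = max_pages * sizeof(u64);
	u8 *boundary;

	/* over-allocate by a page so there is room to shift the array */
	mr->pages_alloc = kzalloc(size + PAGE_SIZE, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* first page boundary strictly inside the buffer */
	boundary = PTR_ALIGN((u8 *)mr->pages_alloc + 1, PAGE_SIZE);

	/* keep the hardware alignment, but start just below the boundary
	 * so any array longer than a handful of entries crosses it */
	mr->pages = (__be64 *)(boundary - MLX4_MR_PAGES_ALIGN);

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);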
(Also, your patch can delete the definition of MLX4_MR_PAGES_ALIGN.)

Tested-by: Chuck Lever <chuck.lever@oracle.com>

> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>  	u32			max_pages;
>  	struct mlx4_mr		mmr;
>  	struct ib_umem	       *umem;
> -	void			*pages_alloc;
>  };
>
>  struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c
> b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>  		      int max_pages)
>  {
>  	int size = max_pages * sizeof(u64);
> -	int add_size;
> -	int ret;
> -
> -	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
>
> -	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -	if (!mr->pages_alloc)
> +	mr->pages = dma_alloc_coherent(device->dma_device, size,
> +				       &mr->page_map, GFP_KERNEL);
> +	if (!mr->pages)
>  		return -ENOMEM;
>
> -	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
> -
> -	mr->page_map = dma_map_single(device->dma_device, mr->pages,
> -				      size, DMA_TO_DEVICE);
> -
> -	if (dma_mapping_error(device->dma_device, mr->page_map)) {
> -		ret = -ENOMEM;
> -		goto err;
> -	}
> -
>  	return 0;
> -err:
> -	kfree(mr->pages_alloc);
> -
> -	return ret;
>  }
>
>  static void
> @@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
>  		struct ib_device *device = mr->ibmr.device;
>  		int size = mr->max_pages * sizeof(u64);
>
> -		dma_unmap_single(device->dma_device, mr->page_map,
> -				 size, DMA_TO_DEVICE);
> -		kfree(mr->pages_alloc);
> +		dma_free_coherent(device->dma_device, size,
> +				  mr->pages, mr->page_map);
>  		mr->pages = NULL;
>  	}
>  }
> @@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
>  		      unsigned int sg_offset)
>  {
>  	struct mlx4_ib_mr *mr = to_mmr(ibmr);
> -	int rc;
>
>  	mr->npages = 0;
>
> -	ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
> -				   sizeof(u64) * mr->max_pages,
> -				   DMA_TO_DEVICE);
> -
> -	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
> -
> -	ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
> -				      sizeof(u64) * mr->max_pages,
> -				      DMA_TO_DEVICE);
> -
> -	return rc;
> +	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
>  }
> --

--
Chuck Lever
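The definition Chuck mentions is, if memory serves, the 0x40 alignment
constant in mlx4_ib.h, so the extra (hypothetical) hunk would be roughly:

--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ ... @@
-#define MLX4_MR_PAGES_ALIGN	0x40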
On Thu, May 26, 2016 at 08:19:33PM +0300, Sagi Grimberg wrote:
>
> >>>>> When debugging is disabled, kzalloc returns page-aligned
> >>>>> addresses:
> >>>>
> >>>> Is it defined somewhere that regular kzalloc/kmalloc guarantees to
> >>>> return a page-aligned address as you see in your testing? If so, the
> >>>> debug mode should behave the same. Otherwise we can consider using
> >>>> any allocation flag that can force that, if such exists.
> >>>> Let's get other people's input here.
> >>>
> >>> My understanding is that the fact that k[mz]alloc() returns a
> >>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
> >>> side effect of the implementation and not something callers of that
> >>> function should rely on. I think the only assumption k[mz]alloc()
> >>> callers should rely on is that the allocated memory respects
> >>> ARCH_KMALLOC_MINALIGN.
> >>
> >> I agree. mlx4_alloc_priv_pages() is carefully designed to
> >> correct the alignment of the buffer, so it already assumes
> >> that it is not getting a page-aligned buffer.
> >>
> >> The alignment isn't the problem here, though. It's that
> >> the buffer contains a page boundary. That is guaranteed
> >> to be the case for HCAs that support more than 512
> >> sges, so that will have to be addressed (at least in
> >> mlx5).
> >
> > rrr...
> >
> > I think we should make the pages allocations dma coherent
> > in order to fix that...
> >
> > Nice catch Chuck.
>
> Does this untested patch help (if so, mlx5 will need an identical patch)?
>
> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>  	u32			max_pages;
>  	struct mlx4_mr		mmr;
>  	struct ib_umem	       *umem;
> -	void			*pages_alloc;
>  };
>
>  struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c
> b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>  		      int max_pages)
>  {
>  	int size = max_pages * sizeof(u64);
> -	int add_size;
> -	int ret;
> -
> -	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
>
> -	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -	if (!mr->pages_alloc)
> +	mr->pages = dma_alloc_coherent(device->dma_device, size,
> +				       &mr->page_map, GFP_KERNEL);

Sagi,
I'm wondering if allocation from ZONE_DMA is the right way to
replace ZONE_NORMAL allocations.

I don't remember if memory compaction works on ZONE_DMA and it makes
me nervous to think about long and multiple alloc/dealloc MR scenarios.
On Thu, 26 May 2016, Leon Romanovsky wrote:

> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.
>
> I don't remember if memory compaction works on ZONE_DMA and it makes
> me nervous to think about long and multiple alloc/dealloc MR scenarios.

ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT
era (such as floppy drives). Please do not use it.
>> Sagi,
>> I'm wondering if allocation from ZONE_DMA is the right way to
>> replace ZONE_NORMAL allocations.
>>
>> I don't remember if memory compaction works on ZONE_DMA and it makes
>> me nervous to think about long and multiple alloc/dealloc MR scenarios.
>
> ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT
> era (such as floppy drives). Please do not use it.

Really? It's used all over the stack...

So the suggestion is to restrict the allocation so that it does not
cross a page boundary (which can easily be done by making
MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or larger than PAGE_SIZE/2)?
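Concretely, that variant would keep today's kzalloc scheme and only
grow the alignment. A minimal sketch, assuming mlx4's 511-pages-per-MR
limit keeps the array under PAGE_SIZE (511 * sizeof(u64) = 4088 bytes):

#define MLX4_MR_PAGES_ALIGN	PAGE_SIZE

	int size = max_pages * sizeof(u64);
	/* same formula as the existing code, now padding up to a page */
	int add_size = max_t(int, MLX4_MR_PAGES_ALIGN -
				  ARCH_KMALLOC_MINALIGN, 0);

	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* page-aligned, and size < PAGE_SIZE, so no boundary inside */
	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);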
On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > +	mr->pages = dma_alloc_coherent(device->dma_device, size,
> > +				       &mr->page_map, GFP_KERNEL);
>
> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.

Where do you see a ZONE_DMA allocation?
On Sun, May 29, 2016 at 10:02:53AM +0300, Sagi Grimberg wrote:
> > ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT
> > era (such as floppy drives). Please do not use it.
>
> Really? It's used all over the stack...
>
> So the suggestion is to restrict the allocation so that it does not
> cross a page boundary (which can easily be done by making
> MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or larger than PAGE_SIZE/2)?

We could simply switch to alloc_pages, e.g. something like the patch
below:

But I have to admit I really like the coherent dma mapping version;
for all sane architectures that works much better, although it sucks
for some architectures where dma coherent mappings cause a bit of
overhead.
>> Really? It's used all over the stack...
>>
>> So the suggestion is to restrict the allocation so that it does not
>> cross a page boundary (which can easily be done by making
>> MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or larger than PAGE_SIZE/2)?
>
> We could simply switch to alloc_pages, e.g. something like the patch
> below:

Patch below? :)

> But I have to admit I really like the coherent dma mapping version,
> for all sane architectures that works much better, although it sucks
> for some architectures where dma coherent mappings cause a bit of
> overhead.

But this is specific to mlx4, which supports up to 511 pages per MR;
mlx5 will need a coherent allocation anyway, right (it supports up to
64K pages per MR)?
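For illustration only (the alloc_pages patch never made it to the
list): a hypothetical conversion along the lines Christoph describes
might look like this, reusing the existing mr->pages and mr->page_map
fields:

static int
mlx4_alloc_priv_pages(struct ib_device *device,
		      struct mlx4_ib_mr *mr, int max_pages)
{
	int size = max_pages * sizeof(u64);
	struct page *page;

	/* a fresh page is always page-aligned, and 511 entries
	 * (4088 bytes) fit in one page, so the array cannot cross
	 * a page boundary */
	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
	if (!page)
		return -ENOMEM;
	mr->pages = page_address(page);

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dma_device, mr->page_map)) {
		__free_pages(page, get_order(size));
		return -ENOMEM;
	}
	return 0;
}

The matching free path would call dma_unmap_single() and then
free_pages(), analogous to the current kfree()-based teardown.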
On Sun, May 29, 2016 at 12:10:40AM -0700, Christoph Hellwig wrote:
> On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > > +	mr->pages = dma_alloc_coherent(device->dma_device, size,
> > > +				       &mr->page_map, GFP_KERNEL);
> >
> > Sagi,
> > I'm wondering if allocation from ZONE_DMA is the right way to
> > replace ZONE_NORMAL allocations.
>
> Where do you see a ZONE_DMA allocation?

Thanks for pointing that out. It caused me to reread the code more
carefully, and it looks like I mistakenly assumed a GFP_DMA flag had
been added to the proposed code, which is definitely not the case.

Sorry for the noise.
On Sun, 29 May 2016, Sagi Grimberg wrote:

> > > Sagi,
> > > I'm wondering if allocation from ZONE_DMA is the right way to
> > > replace ZONE_NORMAL allocations.
> > >
> > > I don't remember if memory compaction works on ZONE_DMA and it makes
> > > me nervous to think about long and multiple alloc/dealloc MR scenarios.
> >
> > ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT
> > era (such as floppy drives). Please do not use it.
>
> Really? It's used all over the stack...

On x86 that is the case. Other arches may have different uses for
ZONE_DMA, but the intent is to support legacy devices unable to write
to the whole of memory (which is usually ZONE_NORMAL).

From include/linux/mmzone.h:

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	/*
	 * ZONE_DMA is used when there are devices that are not able
	 * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
	 * carve out the portion of memory that is needed for these devices.
	 * The range is arch specific.
	 *
	 * Some examples
	 *
	 * Architecture		Limit
	 * ---------------------------
	 * parisc, ia64, sparc	<4G
	 * s390			<2G
	 * arm			Various
	 * alpha		Unlimited or 0-16MB.
	 *
	 * i386, x86_64 and multiple other arches
	 *			<16M.
	 */
	ZONE_DMA,
#endif
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ba328177eae9..78e9b3addfea 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@ struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
-	void			*pages_alloc;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index b04f6238e7e2..becb4a65c755 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 		      int max_pages)
 {
 	int size = max_pages * sizeof(u64);
-	int add_size;
-	int ret;
-
-	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
 
-	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-	if (!mr->pages_alloc)
+	mr->pages = dma_alloc_coherent(device->dma_device, size,
+				       &mr->page_map, GFP_KERNEL);
+	if (!mr->pages)
 		return -ENOMEM;
 
-	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
-	mr->page_map = dma_map_single(device->dma_device, mr->pages,
-				      size, DMA_TO_DEVICE);
-
-	if (dma_mapping_error(device->dma_device, mr->page_map)) {
-		ret = -ENOMEM;
-		goto err;
-	}
-
 	return 0;
-err:
-	kfree(mr->pages_alloc);
-
-	return ret;
 }
 
 static void
@@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 		struct ib_device *device = mr->ibmr.device;
 		int size = mr->max_pages * sizeof(u64);
 
-		dma_unmap_single(device->dma_device, mr->page_map,
-				 size, DMA_TO_DEVICE);
-		kfree(mr->pages_alloc);
+		dma_free_coherent(device->dma_device, size,
+				  mr->pages, mr->page_map);
 		mr->pages = NULL;
 	}
 }
@@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		      unsigned int sg_offset)
 {
 	struct mlx4_ib_mr *mr = to_mmr(ibmr);
-	int rc;
 
 	mr->npages = 0;
 
-	ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
-				   sizeof(u64) * mr->max_pages,
-				   DMA_TO_DEVICE);
-
-	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
-
-	ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
-				      sizeof(u64) * mr->max_pages,
-				      DMA_TO_DEVICE);
-
-	return rc;
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
 }