
RDMA Read: Local protection error

Message ID 57473025.5020801@grimberg.me (mailing list archive)
State Superseded

Commit Message

Sagi Grimberg May 26, 2016, 5:19 p.m. UTC
>>>>> When debugging is disabled, kzalloc returns page-aligned
>>>>> addresses:
>>>>
>>>> Is it defined somewhere that regular kzalloc/kmalloc guarantees to
>>>> return a page-aligned address, as you see in your testing? If so, the
>>>> debug mode should behave the same. Otherwise we can consider using an
>>>> allocation flag that can force that, if such exists.
>>>> Let's get other people's input here.
>>>
>>> My understanding is that the fact that k[mz]alloc() returns a
>>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
>>> side effect of the implementation and not something callers of that
>>> function should rely on. I think the only assumption k[mz]alloc()
>>> callers should rely on is that the allocated memory respects
>>> ARCH_KMALLOC_MINALIGN.
>>
>> I agree. mlx4_alloc_priv_pages() is carefully designed to
>> correct the alignment of the buffer, so it already assumes
>> that it is not getting a page-aligned buffer.
>>
>> The alignment isn't the problem here, though. It's that
>> the buffer contains a page boundary. That is guaranteed
>> to be the case for HCAs that support more than 512
>> sges, so that will have to be addressed (at least in
>> mlx5).
>
> rrr...
>
> I think we should make the pages allocations dma coherent
> in order to fix that...
>
> Nice catch, Chuck.

Does this untested patch help (if so, mlx5 will need an identical patch)?

--
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ba328177eae9..78e9b3addfea 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@ struct mlx4_ib_mr {
         u32                     max_pages;
         struct mlx4_mr          mmr;
         struct ib_umem         *umem;
-       void                    *pages_alloc;
  };

  struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index b04f6238e7e2..becb4a65c755 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
                       int max_pages)
  {
         int size = max_pages * sizeof(u64);
-       int add_size;
-       int ret;
-
-       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

-       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-       if (!mr->pages_alloc)
+       mr->pages = dma_alloc_coherent(device->dma_device, size,
+                               &mr->page_map, GFP_KERNEL);
+       if (!mr->pages)
                 return -ENOMEM;

-       mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
-       mr->page_map = dma_map_single(device->dma_device, mr->pages,
-                                     size, DMA_TO_DEVICE);
-
-       if (dma_mapping_error(device->dma_device, mr->page_map)) {
-               ret = -ENOMEM;
-               goto err;
-       }
-
         return 0;
-err:
-       kfree(mr->pages_alloc);
-
-       return ret;
  }

  static void
@@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
                 struct ib_device *device = mr->ibmr.device;
                 int size = mr->max_pages * sizeof(u64);

-               dma_unmap_single(device->dma_device, mr->page_map,
-                                size, DMA_TO_DEVICE);
-               kfree(mr->pages_alloc);
+               dma_free_coherent(device->dma_device, size,
+                               mr->pages, mr->page_map);
                 mr->pages = NULL;
         }
  }
@@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
                 unsigned int sg_offset)
  {
         struct mlx4_ib_mr *mr = to_mmr(ibmr);
-       int rc;

         mr->npages = 0;

-       ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
-                                  sizeof(u64) * mr->max_pages,
-                                  DMA_TO_DEVICE);
-
-       rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
-
-       ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
-                                     sizeof(u64) * mr->max_pages,
-                                     DMA_TO_DEVICE);
-
-       return rc;
+       return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
  }
--
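
For reference, the arithmetic behind Chuck's 512-sge point quoted
above, assuming 4 KiB pages: each entry in the MR page list is a u64,
so

    512 entries * sizeof(u64) = 512 * 8 = 4096 bytes = PAGE_SIZE

Any HCA advertising more than 512 sges therefore needs a page list
longer than one page, which must contain a page boundary however the
buffer is aligned. mlx4's limit of 511 entries (4088 bytes) does fit
in a single page, but only if the allocation happens not to straddle
a boundary.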

Comments

Chuck Lever III May 26, 2016, 5:57 p.m. UTC | #1
> On May 26, 2016, at 1:19 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
>>>>>> 
>>>>>> When debugging is disabled, kzalloc returns page-aligned
>>>>>> addresses:
>>>>> 
>>>>> Is it defined somewhere that regular kzalloc/kmalloc guarantees to
>>>>> return a page-aligned address, as you see in your testing? If so, the
>>>>> debug mode should behave the same. Otherwise we can consider using an
>>>>> allocation flag that can force that, if such exists.
>>>>> Let's get other people's input here.
>>>> 
>>>> My understanding is that the fact that k[mz]alloc() returns a
>>>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
>>>> side effect of the implementation and not something callers of that
>>>> function should rely on. I think the only assumption k[mz]alloc()
>>>> callers should rely on is that the allocated memory respects
>>>> ARCH_KMALLOC_MINALIGN.
>>> 
>>> I agree. mlx4_alloc_priv_pages() is carefully designed to
>>> correct the alignment of the buffer, so it already assumes
>>> that it is not getting a page-aligned buffer.
>>> 
>>> The alignment isn't the problem here, though. It's that
>>> the buffer contains a page boundary. That is guaranteed
>>> to be the case for HCAs that support more than 512
>>> sges, so that will have to be addressed (at least in
>>> mlx5).
>> 
>> rrr...
>> 
>> I think we should make the pages allocations dma coherent
>> in order to fix that...
>> 
>> Nice catch, Chuck.
> 
> Does this untested patch help (if so, mlx5 will need an identical patch)?

Thanks, Sagi.

Is it safe?

Yes, IPoIB and NFS/RDMA work after the patch is applied.

Is it effective?

I booted with slub_debug=zfpu, and I am not able to reproduce
the Local Protection Error WC flushes.


However, it's not clear whether that's simply because DMA-coherent
memory is not allocated out of a slab cache, and thus is not subject
to SLUB debugging. <shrug>

To test it more thoroughly, mlx4_alloc_priv_pages() could
allocate a two-page buffer and push the address of the
array up far enough that it would cross the page boundary.
I didn't try that.
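
A minimal sketch of that check, applied to the original kzalloc +
dma_map_single scheme (untested; the straddle arithmetic is
illustrative only, not code from this thread):

	/* Test-only: allocate two pages and start the array one u64
	 * short of a page boundary, so that ib_sg_to_pages() is
	 * guaranteed to write across the boundary.
	 */
	mr->pages_alloc = kzalloc(2 * PAGE_SIZE, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	mr->pages = (__be64 *)(PTR_ALIGN(mr->pages_alloc + sizeof(u64),
					 PAGE_SIZE) - sizeof(u64));

The 511-entry (4088-byte) array still ends inside the two-page buffer,
so the dma_map_single() of size bytes that follows stays within the
allocation.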

(Also, your patch can delete the definition of
MLX4_MR_PAGES_ALIGN)

Tested-by: Chuck Lever <chuck.lever@oracle.com>


> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>        u32                     max_pages;
>        struct mlx4_mr          mmr;
>        struct ib_umem         *umem;
> -       void                    *pages_alloc;
> };
> 
> struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>                      int max_pages)
> {
>        int size = max_pages * sizeof(u64);
> -       int add_size;
> -       int ret;
> -
> -       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> 
> -       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -       if (!mr->pages_alloc)
> +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> +                               &mr->page_map, GFP_KERNEL);
> +       if (!mr->pages)
>                return -ENOMEM;
> 
> -       mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
> -
> -       mr->page_map = dma_map_single(device->dma_device, mr->pages,
> -                                     size, DMA_TO_DEVICE);
> -
> -       if (dma_mapping_error(device->dma_device, mr->page_map)) {
> -               ret = -ENOMEM;
> -               goto err;
> -       }
> -
>        return 0;
> -err:
> -       kfree(mr->pages_alloc);
> -
> -       return ret;
> }
> 
> static void
> @@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
>                struct ib_device *device = mr->ibmr.device;
>                int size = mr->max_pages * sizeof(u64);
> 
> -               dma_unmap_single(device->dma_device, mr->page_map,
> -                                size, DMA_TO_DEVICE);
> -               kfree(mr->pages_alloc);
> +               dma_free_coherent(device->dma_device, size,
> +                               mr->pages, mr->page_map);
>                mr->pages = NULL;
>        }
> }
> @@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
>                unsigned int sg_offset)
> {
>        struct mlx4_ib_mr *mr = to_mmr(ibmr);
> -       int rc;
> 
>        mr->npages = 0;
> 
> -       ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
> -                                  sizeof(u64) * mr->max_pages,
> -                                  DMA_TO_DEVICE);
> -
> -       rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
> -
> -       ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
> -                                     sizeof(u64) * mr->max_pages,
> -                                     DMA_TO_DEVICE);
> -
> -       return rc;
> +       return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
> }
> --

--
Chuck Lever



Leon Romanovsky May 26, 2016, 7:23 p.m. UTC | #2
On Thu, May 26, 2016 at 08:19:33PM +0300, Sagi Grimberg wrote:
> 
> >>>>>When debugging is disabled, kzalloc returns page-aligned
> >>>>>addresses:
> >>>>
> >>>>Is it defined somewhere that regular kzalloc/kmalloc guarantees to
> >>>>return a page-aligned address, as you see in your testing? If so, the
> >>>>debug mode should behave the same. Otherwise we can consider using an
> >>>>allocation flag that can force that, if such exists.
> >>>>Let's get other people's input here.
> >>>
> >>>My understanding is that the fact that k[mz]alloc() returns a
> >>>page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
> >>>side effect of the implementation and not something callers of that
> >>>function should rely on. I think the only assumption k[mz]alloc()
> >>>callers should rely on is that the allocated memory respects
> >>>ARCH_KMALLOC_MINALIGN.
> >>
> >>I agree. mlx4_alloc_priv_pages() is carefully designed to
> >>correct the alignment of the buffer, so it already assumes
> >>that it is not getting a page-aligned buffer.
> >>
> >>The alignment isn't the problem here, though. It's that
> >>the buffer contains a page boundary. That is guaranteed
> >>to be the case for HCAs that support more than 512
> >>sges, so that will have to be addressed (at least in
> >>mlx5).
> >
> >rrr...
> >
> >I think we should make the pages allocations dma coherent
> >in order to fix that...
> >
> >Nice catch, Chuck.
> 
> Does this untested patch help (if so, mlx5 will need an identical patch)?
> 
> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>         u32                     max_pages;
>         struct mlx4_mr          mmr;
>         struct ib_umem         *umem;
> -       void                    *pages_alloc;
>  };
> 
>  struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>                       int max_pages)
>  {
>         int size = max_pages * sizeof(u64);
> -       int add_size;
> -       int ret;
> -
> -       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> 
> -       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -       if (!mr->pages_alloc)
> +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> +                               &mr->page_map, GFP_KERNEL);

Sagi,
I'm wondering if allocation from ZONE_DMA is the right way to
replace ZONE_NORMAL allocations.

I don't remember if memory compaction works on ZONE_DMA, and it makes
me nervous to think about long-running, repeated MR alloc/dealloc scenarios.
Christoph Lameter (Ampere) May 26, 2016, 8:12 p.m. UTC | #3
On Thu, 26 May 2016, Leon Romanovsky wrote:

> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.
>
> I don't remember if memory compaction works on ZONE_DMA, and it makes
> me nervous to think about long-running, repeated MR alloc/dealloc scenarios.

ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
(such as floppy drives). Please do not use it.

Sagi Grimberg May 29, 2016, 7:02 a.m. UTC | #4
>> Sagi,
>> I'm wondering if allocation from ZONE_DMA is the right way to
>> replace ZONE_NORMAL allocations.
>>
>> I don't remember if memory compaction works on ZONE_DMA, and it makes
>> me nervous to think about long-running, repeated MR alloc/dealloc scenarios.
>
> ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> (such as floppy drives). Please do not use it.

Really? It's used all over the stack...

So the suggestion is to restrict the allocation so that it does not
cross a page boundary (which can easily be done by making
MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or anything larger than PAGE_SIZE/2)?
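
A minimal sketch of that alternative, assuming the existing kzalloc +
PTR_ALIGN scheme in mlx4_alloc_priv_pages() stays in place:

	/* With the array aligned to PAGE_SIZE, its at most
	 * 511 * 8 = 4088 bytes can never cross a page boundary.
	 */
	#define MLX4_MR_PAGES_ALIGN PAGE_SIZE

	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);

The cost is close to a full extra page of slack per MR to guarantee
the alignment, which is part of what makes the alloc_pages idea in a
later reply attractive.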
Christoph Hellwig May 29, 2016, 7:10 a.m. UTC | #5
On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> > +                               &mr->page_map, GFP_KERNEL);
> 
> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.

Where do you see a ZONE_DMA allocation?
Christoph Hellwig May 29, 2016, 7:17 a.m. UTC | #6
On Sun, May 29, 2016 at 10:02:53AM +0300, Sagi Grimberg wrote:
> >ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> >(such as floppy drives). Please do not use it.
> 
> Really? It's used all over the stack...
> 
> So the suggestion is to restrict the allocation so that it does not
> cross a page boundary (which can easily be done by making
> MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or anything larger than PAGE_SIZE/2)?

We could simply switch to alloc_pages, e.g. something like the patch below:

But I have to admit I really like the coherent dma mapping version,
for all sane architectures that works much better, although it sucks
for some architectures where dma coherent mappings cause a bit of
overhead.
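
The patch did not make it into this message (see Sagi's reply below).
As a rough sketch only, and not Christoph's missing patch, an
alloc_pages-based allocation might look like:

	struct page *page;

	/* A freshly allocated page is page-aligned by construction,
	 * so mlx4's at most 4088-byte array cannot cross a page
	 * boundary.
	 */
	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!page)
		return -ENOMEM;
	mr->pages = page_address(page);

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dma_device, mr->page_map)) {
		__free_page(page);
		return -ENOMEM;
	}
	return 0;

with free_page() replacing kfree() on the teardown side and the
existing dma_sync calls in mlx4_ib_map_mr_sg() kept as-is.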
Sagi Grimberg May 29, 2016, 8:13 a.m. UTC | #7
>> Really? It's used all over the stack...
>>
>> So the suggestion is to restrict the allocation so that it does not
>> cross a page boundary (which can easily be done by making
>> MLX4_MR_PAGES_ALIGN be PAGE_SIZE, or anything larger than PAGE_SIZE/2)?
>
> We could simply switch to alloc_pages, e.g. something like the patch below:

Patch below? :)

> But I have to admit I really like the coherent dma mapping version,
> for all sane architectures that works much better, although it sucks
> for some architectures where dma coherent mappings cause a bit of
> overhead.

But this is specific to mlx4, which supports up to 511 pages per MR;
mlx5 will need a coherent allocation anyway, right (it supports up to
64K pages per MR)?
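
For scale, assuming 4 KiB pages:

    64K entries * sizeof(u64) = 65536 * 8 = 512 KiB = 128 pages

so an mlx5 page list spans many page boundaries no matter how it is
aligned, and an alignment-based fix cannot work there.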
Leon Romanovsky May 29, 2016, 8:56 a.m. UTC | #8
On Sun, May 29, 2016 at 12:10:40AM -0700, Christoph Hellwig wrote:
> On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > > +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> > > +                               &mr->page_map, GFP_KERNEL);
> > 
> > Sagi,
> > I'm wondering if allocation from ZONE_DMA is the right way to
> > replace ZONE_NORMAL allocations.
> 
> Where do you see a ZONE_DMA allocation?

Thanks for pointing that out; it caused me to reread the code more
carefully, and it looks like I mistakenly read a GFP_DMA flag into the
proposed code, which is definitely not there.

Sorry for the noise.
Christoph Lameter (Ampere) May 31, 2016, 3:14 p.m. UTC | #9
On Sun, 29 May 2016, Sagi Grimberg wrote:

>
> > > Sagi,
> > > I'm wondering if allocation from ZONE_DMA is the right way to
> > > replace ZONE_NORMAL allocations.
> > >
> > > I don't remember if memory compaction works on ZONE_DMA, and it makes
> > > me nervous to think about long-running, repeated MR alloc/dealloc scenarios.
> >
> > ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> > (such as floppy drives). Please do not use it.
>
> Really? It's used all over the stack...

On x86 that is the case. Other arches may have different uses for ZONE_DMA,
but the intent is to support legacy devices unable to write to the whole
of memory (that is usually ZONE_NORMAL).

from include/linux/mmzone.h

enum zone_type {
#ifdef CONFIG_ZONE_DMA
        /*
         * ZONE_DMA is used when there are devices that are not able
         * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
         * carve out the portion of memory that is needed for these
         * devices. The range is arch specific.
         *
         * Some examples
         *
         * Architecture         Limit
         * ---------------------------
         * parisc, ia64, sparc  <4G
         * s390                 <2G
         * arm                  Various
         * alpha                Unlimited or 0-16MB.
         *
         * i386, x86_64 and multiple other arches
         *                      <16M.
         */
        ZONE_DMA,
#endif


Patch

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ba328177eae9..78e9b3addfea 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@  struct mlx4_ib_mr {
         u32                     max_pages;
         struct mlx4_mr          mmr;
         struct ib_umem         *umem;
-       void                    *pages_alloc;
  };

  struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index b04f6238e7e2..becb4a65c755 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,30 +278,13 @@  mlx4_alloc_priv_pages(struct ib_device *device,
                       int max_pages)
  {
         int size = max_pages * sizeof(u64);
-       int add_size;
-       int ret;
-
-       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

-       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-       if (!mr->pages_alloc)
+       mr->pages = dma_alloc_coherent(device->dma_device, size,
+                               &mr->page_map, GFP_KERNEL);
+       if (!mr->pages)
                 return -ENOMEM;

-       mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
-       mr->page_map = dma_map_single(device->dma_device, mr->pages,
-                                     size, DMA_TO_DEVICE);
-
-       if (dma_mapping_error(device->dma_device, mr->page_map)) {
-               ret = -ENOMEM;
-               goto err;
-       }
-
         return 0;
-err:
-       kfree(mr->pages_alloc);
-
-       return ret;
  }