diff mbox

[v2,01/24] mlx4-ib: Use coherent memory for priv pages

Message ID 20160616143518.GX5408@leon.nu (mailing list archive)
State Superseded
Headers show

Commit Message

Leon Romanovsky June 16, 2016, 2:35 p.m. UTC
On Wed, Jun 15, 2016 at 12:40:07PM -0400, Chuck Lever wrote:
> 
> > On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >> From: Sagi Grimberg <sagi@grimberg.me>
> >> 
> >> kmalloc doesn't guarantee the returned memory is all on one page.
> > 
> > IMHO, the patch posted by Christoph at that thread is best way to go,
> > because you changed streaming DMA mappings to be coherent DMA mappings [1].
> > 
> > "The kernel developers recommend the use of streaming mappings over
> > coherent mappings whenever possible" [1].
> > 
> > [1] http://www.makelinux.net/ldd3/chp-15-sect-4
> 
> Hi Leon-
> 
> I'll happily drop this patch from my 4.8 series as soon
> as an official mlx4/mlx5 fix is merged.
> 
> Meanwhile, I notice some unexplained instability (driver
> resets, list corruption, and so on) when I test NFS/RDMA
> without this patch included. So it is attached to the
> series for anyone with mlx4 who wants to pull my topic
> branch and try it out.

hi Chuck,

We plan to send the attached patch during our second round of fixes for
mlx4/mlx5 and would be grateful if you could provide your
Tested-by tag beforehand.

Thanks

Comments

Sagi Grimberg June 16, 2016, 9:10 p.m. UTC | #1
>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>
>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
>>>> From: Sagi Grimberg <sagi@grimberg.me>
>>>>
>>>> kmalloc doesn't guarantee the returned memory is all on one page.
>>>
>>> IMHO, the patch posted by Christoph at that thread is best way to go,
>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
>>>
>>> "The kernel developers recommend the use of streaming mappings over
>>> coherent mappings whenever possible" [1].
>>>
>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
>>
>> Hi Leon-
>>
>> I'll happily drop this patch from my 4.8 series as soon
>> as an official mlx4/mlx5 fix is merged.
>>
>> Meanwhile, I notice some unexplained instability (driver
>> resets, list corruption, and so on) when I test NFS/RDMA
>> without this patch included. So it is attached to the
>> series for anyone with mlx4 who wants to pull my topic
>> branch and try it out.
>
> hi Chuck,
>
> We plan to send attached patch during our second round of fixes for
> mlx4/mlx5 and would be grateful to you if you could provide your
> Tested-by tag before.

First of all, IIRC the patch author was Christoph, wasn't he?

Plus, you do realize that this patch makes the pages allocation
page-granular. On systems with a large page size this
is completely redundant; it might even be harmful, as the storage
ULPs need lots of MRs.

Also, I don't see how that solves the issue, I'm not sure I even
understand the issue. Do you? Were you able to reproduce it?

IFF the pages buffer end not being aligned to a cacheline is problematic,
then why not extend it to end on a cacheline? Why extend to the next full page?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever III June 16, 2016, 9:58 p.m. UTC | #2
> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
> 
> 
>>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>> 
>>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
>>>>> From: Sagi Grimberg <sagi@grimberg.me>
>>>>> 
>>>>> kmalloc doesn't guarantee the returned memory is all on one page.
>>>> 
>>>> IMHO, the patch posted by Christoph at that thread is best way to go,
>>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
>>>> 
>>>> "The kernel developers recommend the use of streaming mappings over
>>>> coherent mappings whenever possible" [1].
>>>> 
>>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
>>> 
>>> Hi Leon-
>>> 
>>> I'll happily drop this patch from my 4.8 series as soon
>>> as an official mlx4/mlx5 fix is merged.
>>> 
>>> Meanwhile, I notice some unexplained instability (driver
>>> resets, list corruption, and so on) when I test NFS/RDMA
>>> without this patch included. So it is attached to the
>>> series for anyone with mlx4 who wants to pull my topic
>>> branch and try it out.
>> 
>> hi Chuck,
>> 
>> We plan to send attached patch during our second round of fixes for
>> mlx4/mlx5 and would be grateful to you if you could provide your
>> Tested-by tag before.

Fwiw, Tested-by: Chuck Lever <chuck.lever@oracle.com>


> First of all, IIRC the patch author was Christoph wasn't he.
> 
> Plus, you do realize that this patch makes the pages allocation
> in granularity of pages. In systems with a large page size this
> is completely redundant, it might even be harmful as the storage
> ULPs need lots of MRs.

I agree that the official fix should take a conservative
approach to allocating this resource; there will be lots
of MRs in an active system. This fix doesn't seem too
careful.


> Also, I don't see how that solves the issue, I'm not sure I even
> understand the issue. Do you? Were you able to reproduce it?

The issue is that dma_map_single() does not seem to DMA map
portions of a memory region that are past the end of the first
page of that region. Maybe that's a bug?

This patch works around that behavior by guaranteeing that

a) the memory region starts at the beginning of a page, and
b) the memory region is never larger than a page

This patch is not sufficient to repair mlx5, because b)
cannot be satisfied in that case; the array of __be64's can
be larger than 512 entries.


> IFF the pages buffer end not being aligned to a cacheline is problematic
> then why not extent it to end in a cacheline? Why in the next full page?

I think the patch description justifies the choice of
solution, but does not describe the original issue at
all. The original issue had nothing to do with cacheline
alignment.

Lastly, this patch should remove the definition of
MLX4_MR_PAGES_ALIGN.


--
Chuck Lever



Leon Romanovsky June 17, 2016, 9:05 a.m. UTC | #3
On Fri, Jun 17, 2016 at 12:10:33AM +0300, Sagi Grimberg wrote:
> 
> >>>On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >>>
> >>>On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >>>>From: Sagi Grimberg <sagi@grimberg.me>
> >>>>
> >>>>kmalloc doesn't guarantee the returned memory is all on one page.
> >>>
> >>>IMHO, the patch posted by Christoph at that thread is best way to go,
> >>>because you changed streaming DMA mappings to be coherent DMA mappings [1].
> >>>
> >>>"The kernel developers recommend the use of streaming mappings over
> >>>coherent mappings whenever possible" [1].
> >>>
> >>>[1] http://www.makelinux.net/ldd3/chp-15-sect-4
> >>
> >>Hi Leon-
> >>
> >>I'll happily drop this patch from my 4.8 series as soon
> >>as an official mlx4/mlx5 fix is merged.
> >>
> >>Meanwhile, I notice some unexplained instability (driver
> >>resets, list corruption, and so on) when I test NFS/RDMA
> >>without this patch included. So it is attached to the
> >>series for anyone with mlx4 who wants to pull my topic
> >>branch and try it out.
> >
> >hi Chuck,
> >
> >We plan to send attached patch during our second round of fixes for
> >mlx4/mlx5 and would be grateful to you if you could provide your
> >Tested-by tag before.
> 
> First of all, IIRC the patch author was Christoph wasn't he.

Do you think the author's name can produce different results in
verification/bug reproduction? We haven't sent this patch officially
yet, and all relevant authors will be acknowledged and honored when the
official submission (second round of IB fixes) comes.

> 
> Plus, you do realize that this patch makes the pages allocation
> in granularity of pages. In systems with a large page size this
> is completely redundant, it might even be harmful as the storage
> ULPs need lots of MRs.

I see correct execution of the driver as an important goal that comes
before the various micro-optimizations, which can follow later.

Did you ask yourself why so few users use ARCH_KMALLOC_MINALIGN?

➜  linux-rdma git:(master) grep -rI ARCH_KMALLOC_MINALIGN drivers/*
drivers/infiniband/hw/mlx4/mr.c:        add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
drivers/infiniband/hw/mlx5/mr.c:        add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
drivers/md/dm-crypt.c:                ARCH_KMALLOC_MINALIGN);
drivers/usb/core/buffer.c:       * ARCH_KMALLOC_MINALIGN.
drivers/usb/core/buffer.c:      if (ARCH_KMALLOC_MINALIGN <= 32)
drivers/usb/core/buffer.c:      else if (ARCH_KMALLOC_MINALIGN <= 64)
drivers/usb/core/buffer.c:      else if (ARCH_KMALLOC_MINALIGN <= 128)
drivers/usb/misc/usbtest.c:     return (unsigned long)buf & (ARCH_KMALLOC_MINALIGN - 1);

> 
> Also, I don't see how that solves the issue, I'm not sure I even
> understand the issue. Do you? Were you able to reproduce it?

Yes, the issue is that the address supplied to dma_map_single wasn't
aligned to the DMA cacheline size.

And as the one who wrote this code for mlx5, it would be great if you
could give me a pointer: why did you choose to use MLX5_UMR_ALIGN in that
function? It adds 2048 bytes instead of the 64 needed for mlx4.

> 
> IFF the pages buffer end not being aligned to a cacheline is problematic
> then why not extent it to end in a cacheline? Why in the next full page?

It will fit in one page.

Leon Romanovsky June 17, 2016, 9:20 a.m. UTC | #4
On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
> 
> > On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
> > 
> > 
> >>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >>>> 
> >>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >>>>> From: Sagi Grimberg <sagi@grimberg.me>
> >>>>> 
> >>>>> kmalloc doesn't guarantee the returned memory is all on one page.
> >>>> 
> >>>> IMHO, the patch posted by Christoph at that thread is best way to go,
> >>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
> >>>> 
> >>>> "The kernel developers recommend the use of streaming mappings over
> >>>> coherent mappings whenever possible" [1].
> >>>> 
> >>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
> >>> 
> >>> Hi Leon-
> >>> 
> >>> I'll happily drop this patch from my 4.8 series as soon
> >>> as an official mlx4/mlx5 fix is merged.
> >>> 
> >>> Meanwhile, I notice some unexplained instability (driver
> >>> resets, list corruption, and so on) when I test NFS/RDMA
> >>> without this patch included. So it is attached to the
> >>> series for anyone with mlx4 who wants to pull my topic
> >>> branch and try it out.
> >> 
> >> hi Chuck,
> >> 
> >> We plan to send attached patch during our second round of fixes for
> >> mlx4/mlx5 and would be grateful to you if you could provide your
> >> Tested-by tag before.
> 
> Fwiw, Tested-by: Chuck Lever <chuck.lever@oracle.com>

Thanks, I appreciate it.

> 
> 
> > First of all, IIRC the patch author was Christoph wasn't he.
> > 
> > Plus, you do realize that this patch makes the pages allocation
> > in granularity of pages. In systems with a large page size this
> > is completely redundant, it might even be harmful as the storage
> > ULPs need lots of MRs.
> 
> I agree that the official fix should take a conservative
> approach to allocating this resource; there will be lots
> of MRs in an active system. This fix doesn't seem too
> careful.

In the mlx5 driver, we always added 2048 bytes to such allocations, for
reasons unknown to me. And that doesn't seem like a conservative approach
either.

> 
> 
> > Also, I don't see how that solves the issue, I'm not sure I even
> > understand the issue. Do you? Were you able to reproduce it?
> 
> The issue is that dma_map_single() does not seem to DMA map
> portions of a memory region that are past the end of the first
> page of that region. Maybe that's a bug?

No, I didn't find support for that. dma_map_single expects
contiguous memory aligned to a cache line; there is no requirement that
it be bounded by a page.

> 
> This patch works around that behavior by guaranteeing that
> 
> a) the memory region starts at the beginning of a page, and
> b) the memory region is never larger than a page

b) the memory region ends on a cache line.

> 
> This patch is not sufficient to repair mlx5, because b)
> cannot be satisfied in that case; the array of __be64's can
> be larger than 512 entries.
> 
> 
> > IFF the pages buffer end not being aligned to a cacheline is problematic
> > then why not extent it to end in a cacheline? Why in the next full page?
> 
> I think the patch description justifies the choice of
> solution, but does not describe the original issue at
> all. The original issue had nothing to do with cacheline
> alignment.

I disagree; kmalloc with the supplied flags will return contiguous memory,
which is enough for dma_map_single. The issue is cache line alignment.

> 
> Lastly, this patch should remove the definition of
> MLX4_MR_PAGES_ALIGN.

Thanks, I missed it.

> 
> 
> --
> Chuck Lever
> 
> 
>
Chuck Lever III June 17, 2016, 7:55 p.m. UTC | #5
> On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
>> 
>>> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
>> 
>>> First of all, IIRC the patch author was Christoph wasn't he.
>>> 
>>> Plus, you do realize that this patch makes the pages allocation
>>> in granularity of pages. In systems with a large page size this
>>> is completely redundant, it might even be harmful as the storage
>>> ULPs need lots of MRs.
>> 
>> I agree that the official fix should take a conservative
>> approach to allocating this resource; there will be lots
>> of MRs in an active system. This fix doesn't seem too
>> careful.
> 
> In mlx5 system, we always added 2048 bytes to such allocations, for
> reasons unknown to me. And it doesn't seem as a conservative approach
> either.

The mlx5 approach is much better than allocating a whole
page, when you consider platforms with 64KB pages.

A 1MB payload (for NFS) on such a platform comprises just
16 pages. So xprtrdma will allocate MRs with support for
16 pages. That's a priv pages array of 128 bytes, and you
just put it in a 64KB page all by itself.

So maybe adding 2048 bytes is not optimal either. But I
think sticking with kmalloc here is a more optimal choice.


>>> Also, I don't see how that solves the issue, I'm not sure I even
>>> understand the issue. Do you? Were you able to reproduce it?
>> 
>> The issue is that dma_map_single() does not seem to DMA map
>> portions of a memory region that are past the end of the first
>> page of that region. Maybe that's a bug?
> 
> No, I didn't find support for that. Function dma_map_single expects
> contiguous memory aligned to cache line, there is no limitation to be
> page bounded.

There certainly isn't, but that doesn't mean there can't
be a bug somewhere ;-) and maybe not in dma_map_single.
It could be that the "array on one page only" limitation
is somewhere else in the mlx4 driver, or even in the HCA
firmware.


>> This patch works around that behavior by guaranteeing that
>> 
>> a) the memory region starts at the beginning of a page, and
>> b) the memory region is never larger than a page
> 
> b) the memory region ends on cache line.

I think we demonstrated pretty clearly that the issue
occurs only when the end of the priv pages array crosses
into a new page.

We didn't see any problem otherwise.


>> This patch is not sufficient to repair mlx5, because b)
>> cannot be satisfied in that case; the array of __be64's can
>> be larger than 512 entries.
>> 
>> 
>>> IFF the pages buffer end not being aligned to a cacheline is problematic
>>> then why not extent it to end in a cacheline? Why in the next full page?
>> 
>> I think the patch description justifies the choice of
>> solution, but does not describe the original issue at
>> all. The original issue had nothing to do with cacheline
>> alignment.
> 
> I disagree, kmalloc with supplied flags will return contiguous memory
> which is enough for dma_map_single. It is cache line alignment.

The reason I find this hard to believe is that there is
no end alignment guarantee at all in this code, but it
works without issue when SLUB debugging is not enabled.

xprtrdma allocates 256 elements in this array on x86.
The code makes the array start on an 0x40 byte boundary.
I'm pretty sure that means the end of that array will
also be on at least an 0x40 byte boundary, and thus
aligned to the DMA cacheline, whether or not SLUB
debugging is enabled.

Notice that in the current code, if the consumer requests
an odd number of SGs, that array can't possibly end on
an alignment boundary. But we've never had a complaint.

SLUB debugging changes the alignment of lots of things,
but mlx4_alloc_priv_pages is the only breakage that has
been reported.

DMA-API.txt says:

> [T]he mapped region must begin exactly on a cache line
> boundary and end exactly on one (to prevent two separately
> mapped regions from sharing a single cache line)

The way I read this, cacheline alignment shouldn't be
an issue at all, as long as DMA cachelines aren't
shared between mappings.

If I simply increase the memory allocation size a little
and ensure the end of the mapping is aligned, that should
be enough to prevent DMA cacheline sharing with another
memory allocation on the same page. But I still see Local
Protection Errors when SLUB debugging is enabled, on my
system (with patches to allocate more pages per MR).

I'm not convinced this has anything to do with DMA
cacheline alignment. The reason your patch fixes this
issue is because it keeps the entire array on one page.


--
Chuck Lever



Sagi Grimberg June 19, 2016, 7:05 a.m. UTC | #6
>> First of all, IIRC the patch author was Christoph wasn't he.
>
> Do you think that author's name can provide different results in
> verification/bug reproduction? We didn't send this patch officially
> yet and all relevant authors will be acknowledged and honored when the
> official (second round of IB fixes) will come.

Not sure what you mean here. What different results?

>> Plus, you do realize that this patch makes the pages allocation
>> in granularity of pages. In systems with a large page size this
>> is completely redundant, it might even be harmful as the storage
>> ULPs need lots of MRs.
>
> I see proper execution of the driver as an important goal which goes
> before various micro optimizations, which will come after.

I still don't understand how this fixes the original bug report from
Chuck. I sent a patch to make the pages allocation DMA-coherent,
which fixes the issue. Yishai is the driver maintainer; he should
decide how this issue should be addressed.

In any event, if we end-up aligning to page size I would expect to
see a FIXME comment saying we can do better...

> id you ask yourself, why are not so many users use that ARCH_KMALLOC_MINALIGN?
>
> ➜  linux-rdma git:(master) grep -rI ARCH_KMALLOC_MINALIGN drivers/*
> drivers/infiniband/hw/mlx4/mr.c:        add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> drivers/infiniband/hw/mlx5/mr.c:        add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> drivers/md/dm-crypt.c:                ARCH_KMALLOC_MINALIGN);
> drivers/usb/core/buffer.c:       * ARCH_KMALLOC_MINALIGN.
> drivers/usb/core/buffer.c:      if (ARCH_KMALLOC_MINALIGN <= 32)
> drivers/usb/core/buffer.c:      else if (ARCH_KMALLOC_MINALIGN <= 64)
> drivers/usb/core/buffer.c:      else if (ARCH_KMALLOC_MINALIGN <= 128)
> drivers/usb/misc/usbtest.c:     return (unsigned long)buf & (ARCH_KMALLOC_MINALIGN - 1);
>

Not sure how to answer that; it is used to make the allocation
padding exactly as large as needed to align the pointer.

>>
>> Also, I don't see how that solves the issue, I'm not sure I even
>> understand the issue. Do you? Were you able to reproduce it?
>
> Yes, the issue is that address supplied to dma_map_single wasn't aligned
> to DMA cacheline size.

That's not true, Leon; the address is always aligned to
MLX4_MR_PAGES_ALIGN, which is 64B.

>
> And as the one, who wrote this code for mlx5, it will be great if you
> can give me a pointer. Why did you chose to use MLX5_UMR_ALIGN in that
> function? This will add 2048 bytes instead of 64 for mlx4.

The PRM states that the data descriptors array (MTT/KLM) pointer
must be aligned to 2K. I wish it didn't but it does. You can see in the
code that each registration that goes via UMR does the exact same thing.

>> IFF the pages buffer end not being aligned to a cacheline is problematic
>> then why not extent it to end in a cacheline? Why in the next full page?
>
> It will fit in one page.

Yeah, but that single page can be 64KB for a 128-pointer array...
Sagi Grimberg June 19, 2016, 9:48 a.m. UTC | #7
>> First of all, IIRC the patch author was Christoph wasn't he.
>>
>> Plus, you do realize that this patch makes the pages allocation
>> in granularity of pages. In systems with a large page size this
>> is completely redundant, it might even be harmful as the storage
>> ULPs need lots of MRs.
>
> I agree that the official fix should take a conservative
> approach to allocating this resource; there will be lots
> of MRs in an active system. This fix doesn't seem too
> careful.
>
>
>> Also, I don't see how that solves the issue, I'm not sure I even
>> understand the issue. Do you? Were you able to reproduce it?
>
> The issue is that dma_map_single() does not seem to DMA map
> portions of a memory region that are past the end of the first
> page of that region. Maybe that's a bug?

That seems weird to me; from looking at the code I didn't see
any indication that such a mapping would fail. Maybe we are seeing
an mlx4-specific issue? If this is some kind of generic dma-mapping
bug, mlx5 would suffer from the same problem, right? Does it?

> This patch works around that behavior by guaranteeing that
>
> a) the memory region starts at the beginning of a page, and
> b) the memory region is never larger than a page
>
> This patch is not sufficient to repair mlx5, because b)
> cannot be satisfied in that case; the array of __be64's can
> be larger than 512 entries.

If a single page boundary is indeed the root-cause then I agree
this would not solve the problem for mlx5.

>> IFF the pages buffer end not being aligned to a cacheline is problematic
>> then why not extent it to end in a cacheline? Why in the next full page?
>
> I think the patch description justifies the choice of
> solution, but does not describe the original issue at
> all. The original issue had nothing to do with cacheline
> alignment.
>
> Lastly, this patch should remove the definition of
> MLX4_MR_PAGES_ALIGN.

The mlx4 PRM explicitly states that the translation (pages) vector
should align to 64 bytes and this is where this define comes from,
hence I don't think it should be removed from the code.
Sagi Grimberg June 19, 2016, 9:58 a.m. UTC | #8
>> In mlx5 system, we always added 2048 bytes to such allocations, for
>> reasons unknown to me. And it doesn't seem as a conservative approach
>> either.
>
> The mlx5 approach is much better than allocating a whole
> page, when you consider platforms with 64KB pages.
>
> A 1MB payload (for NFS) on such a platform comprises just
> 16 pages. So xprtrdma will allocate MRs with support for
> 16 pages. That's a priv pages array of 128 bytes, and you
> just put it in a 64KB page all by itself.
>
> So maybe adding 2048 bytes is not optimal either. But I
> think sticking with kmalloc here is a more optimal choice.

Again, the 2K constraint is not coming from any sort of dma mapping
alignment consideration, it comes from the _device_ limitation requiring
the translation vector to be aligned to 2K.

>>>> Also, I don't see how that solves the issue, I'm not sure I even
>>>> understand the issue. Do you? Were you able to reproduce it?
>>>
>>> The issue is that dma_map_single() does not seem to DMA map
>>> portions of a memory region that are past the end of the first
>>> page of that region. Maybe that's a bug?
>>
>> No, I didn't find support for that. Function dma_map_single expects
>> contiguous memory aligned to cache line, there is no limitation to be
>> page bounded.
>
> There certainly isn't, but that doesn't mean there can't
> be a bug somewhere ;-) and maybe not in dma_map_single.
> It could be that the "array on one page only" limitation
> is somewhere else in the mlx4 driver, or even in the HCA
> firmware.

I'm starting to think this is the case. Leon, I think it's time
to get the FW/HW guys involved...

>> I disagree, kmalloc with supplied flags will return contiguous memory
>> which is enough for dma_map_single. It is cache line alignment.
>
> The reason I find this hard to believe is that there is
> no end alignment guarantee at all in this code, but it
> works without issue when SLUB debugging is not enabled.
>
> xprtrdma allocates 256 elements in this array on x86.
> The code makes the array start on an 0x40 byte boundary.
> I'm pretty sure that means the end of that array will
> also be on at least an 0x40 byte boundary, and thus
> aligned to the DMA cacheline, whether or not SLUB
> debugging is enabled.
>
> Notice that in the current code, if the consumer requests
> an odd number of SGs, that array can't possibly end on
> an alignment boundary. But we've never had a complaint.
>
> SLUB debugging changes the alignment of lots of things,
> but mlx4_alloc_priv_pages is the only breakage that has
> been reported.

I tend to agree; I even have a feeling that this won't happen
on mlx5.

> DMA-API.txt says:
>
>> [T]he mapped region must begin exactly on a cache line
>> boundary and end exactly on one (to prevent two separately
>> mapped regions from sharing a single cache line)
>
> The way I read this, cacheline alignment shouldn't be
> an issue at all, as long as DMA cachelines aren't
> shared between mappings.
>
> If I simply increase the memory allocation size a little
> and ensure the end of the mapping is aligned, that should
> be enough to prevent DMA cacheline sharing with another
> memory allocation on the same page. But I still see Local
> Protection Errors when SLUB debugging is enabled, on my
> system (with patches to allocate more pages per MR).
>
> I'm not convinced this has anything to do with DMA
> cacheline alignment. The reason your patch fixes this
> issue is because it keeps the entire array on one page.

I share this feeling; I wrote several times that I don't understand
how this patch solves the issue, and I would appreciate it if someone
could explain it to me (preferably with evidence).
diff mbox

Patch

From 213f61b44b54edbcbf272e694e889c61412be579 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yishaih@mellanox.com>
Date: Thu, 16 Jun 2016 11:13:41 +0300
Subject: [PATCH] IB/mlx4: Ensure cache line boundaries for dma_map_single

"In order for memory mapped by this API to operate
correctly, the mapped region must begin exactly on a cache line
boundary and end exactly on one (to prevent two separately mapped
regions from sharing a single cache line).
Therefore, it is recommended that driver writers who
don't take special care to determine the cache line size at run time
only map virtual regions that begin and end on page boundaries (which
are guaranteed also to be cache line boundaries)." [1]

This patch uses __get_free_pages instead of kzalloc
to ensure that the above holds on all architectures and in all
SLUB debug configurations.

[1] https://www.kernel.org/doc/Documentation/DMA-API.txt

issue: 802618
Change-Id: Iee8176b183290213b1b4e66f1835f5c90f067075
Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
---

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6c5ac5d..4a8bbe4 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@ 
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
-	void			*pages_alloc;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..56b8d87 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,16 +278,12 @@ 
 		      int max_pages)
 {
 	int size = max_pages * sizeof(u64);
-	int add_size;
 	int ret;
 
-	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
-
-	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-	if (!mr->pages_alloc)
+	mr->pages = (__be64 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+					       get_order(size));
+	if (!mr->pages)
 		return -ENOMEM;
-
-	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
 
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
 				      size, DMA_TO_DEVICE);
@@ -299,7 +295,7 @@ 
 
 	return 0;
 err:
-	kfree(mr->pages_alloc);
+	free_pages((unsigned long)mr->pages, get_order(size));
 
 	return ret;
 }
@@ -313,7 +309,7 @@ 
 
 		dma_unmap_single(device->dma_device, mr->page_map,
 				 size, DMA_TO_DEVICE);
-		kfree(mr->pages_alloc);
+		free_pages((unsigned long)mr->pages, get_order(size));
 		mr->pages = NULL;
 	}
 }