diff mbox

[v2,01/24] mlx4-ib: Use coherent memory for priv pages

Message ID 20160618105650.GD5408@leon.nu (mailing list archive)
State New, archived
Headers show

Commit Message

Leon Romanovsky June 18, 2016, 10:56 a.m. UTC
On Fri, Jun 17, 2016 at 03:55:56PM -0400, Chuck Lever wrote:
> 
> > On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
> >> 
> >>> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
> >> 
> >>> First of all, IIRC the patch author was Christoph, wasn't he?
> >>> 
> >>> Plus, you do realize that this patch makes the pages allocation
> >>> at page granularity. On systems with a large page size this
> >>> is completely redundant; it might even be harmful, as the storage
> >>> ULPs need lots of MRs.
> >> 
> >> I agree that the official fix should take a conservative
> >> approach to allocating this resource; there will be lots
> >> of MRs in an active system. This fix doesn't seem too
> >> careful.
> > 
> > In the mlx5 driver, we always added 2048 bytes to such allocations, for
> > reasons unknown to me. And it doesn't seem like a conservative approach
> > either.
> 
> The mlx5 approach is much better than allocating a whole
> page, when you consider platforms with 64KB pages.
> 
> A 1MB payload (for NFS) on such a platform comprises just
> 16 pages. So xprtrdma will allocate MRs with support for
> 16 pages. That's a priv pages array of 128 bytes, and you
> just put it in a 64KB page all by itself.
> 
> So maybe adding 2048 bytes is not optimal either. But I
> think sticking with kmalloc here is a more optimal choice.

I agree with your points and Sagi's; I just preferred a working solution
over an optimal one. I'll send an optimal version.
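
To make the page-size arithmetic quoted above concrete, here is a small
sketch (illustration only; the helper name is made up, and u64 is assumed
to be the usual kernel 64-bit type):

	static size_t pbl_bytes_for_1mb_payload(void)
	{
		size_t payload   = 1024 * 1024;         /* 1 MB NFS payload */
		size_t page_sz   = 64 * 1024;           /* 64 KB pages */
		size_t max_pages = payload / page_sz;   /* 16 pages */

		/* 16 entries * 8 bytes = 128 bytes of priv pages array,
		 * which a per-MR page allocation would round up to 64 KB. */
		return max_pages * sizeof(u64);
	}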

> 
> 
> >>> Also, I don't see how that solves the issue, I'm not sure I even
> >>> understand the issue. Do you? Were you able to reproduce it?
> >> 
> >> The issue is that dma_map_single() does not seem to DMA map
> >> portions of a memory region that are past the end of the first
> >> page of that region. Maybe that's a bug?
> > 
> > No, I didn't find support for that. dma_map_single() expects
> > contiguous memory aligned to a cache line; there is no limitation
> > that it be page-bounded.
> 
> There certainly isn't, but that doesn't mean there can't
> be a bug somewhere ;-) and maybe not in dma_map_single.
> It could be that the "array on one page only" limitation
> is somewhere else in the mlx4 driver, or even in the HCA
> firmware.

We checked with the HW/FW/arch teams before responding.

> 
> 
> >> This patch works around that behavior by guaranteeing that
> >> 
> >> a) the memory region starts at the beginning of a page, and
> >> b) the memory region is never larger than a page
> > 
> > b) the memory region ends on a cache line boundary.
> 
> I think we demonstrated pretty clearly that the issue
> occurs only when the end of the priv pages array crosses
> into a new page.
> 
> We didn't see any problem otherwise.

SLUB debug does exactly one thing: it changes alignment, and this is why
no issue was observed there before.

> 
> >> This patch is not sufficient to repair mlx5, because b)
> >> cannot be satisfied in that case; the array of __be64's can
> >> be larger than 512 entries.
> >> 
> >> 
> >>> IFF the pages buffer end not being aligned to a cacheline is problematic,
> >>> then why not extend it to end on a cacheline boundary? Why to the next full page?
> >> 
> >> I think the patch description justifies the choice of
> >> solution, but does not describe the original issue at
> >> all. The original issue had nothing to do with cacheline
> >> alignment.
> > 
> > I disagree, kmalloc with supplied flags will return contiguous memory
> > which is enough for dma_map_single. It is cache line alignment.
> 
> The reason I find this hard to believe is that there is
> no end alignment guarantee at all in this code, but it
> works without issue when SLUB debugging is not enabled.
> 
> xprtrdma allocates 256 elements in this array on x86.
> The code makes the array start on an 0x40 byte boundary.
> I'm pretty sure that means the end of that array will
> also be on at least an 0x40 byte boundary, and thus
> aligned to the DMA cacheline, whether or not SLUB
> debugging is enabled.
> 
> Notice that in the current code, if the consumer requests
> an odd number of SGs, that array can't possibly end on
> an alignment boundary. But we've never had a complaint.
> 
> SLUB debugging changes the alignment of lots of things,
> but mlx4_alloc_priv_pages is the only breakage that has
> been reported.

I think it is related to the custom logic which exists in this function only.
I posted grep output earlier to emphasize it.

For example, adding a 2K region after the actual data will ensure
alignment.
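
For reference, the allocation logic under discussion in
mlx4_alloc_priv_pages() looks roughly like the sketch below. The first
lines are taken from the experiment diff quoted later in this thread; the
remainder is a paraphrase and may differ in detail from the exact tree:

	int size = max_pages * sizeof(u64);
	int add_size;

	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* only the start of the array is aligned; nothing constrains where
	 * the array ends or whether it crosses a page boundary */
	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);
	/* dma_mapping_error() check and kfree() on failure omitted here */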

> 
> DMA-API.txt says:
> 
> > [T]he mapped region must begin exactly on a cache line
> > boundary and end exactly on one (to prevent two separately
> > mapped regions from sharing a single cache line)
> 
> The way I read this, cacheline alignment shouldn't be
> an issue at all, as long as DMA cachelines aren't
> shared between mappings.
> 
> If I simply increase the memory allocation size a little
> and ensure the end of the mapping is aligned, that should
> be enough to prevent DMA cacheline sharing with another
> memory allocation on the same page. But I still see Local
> Protection Errors when SLUB debugging is enabled, on my
> system (with patches to allocate more pages per MR).
> 
> I'm not convinced this has anything to do with DMA
> cacheline alignment. The reason your patch fixes this
> issue is because it keeps the entire array on one page.

If you don't mind, we can do an experiment.
Let's add padding which will prevent the alignment issue and
will definitely make the array cross the page boundary.

Comments

Chuck Lever June 18, 2016, 8:08 p.m. UTC | #1
> On Jun 18, 2016, at 6:56 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Fri, Jun 17, 2016 at 03:55:56PM -0400, Chuck Lever wrote:
>> 
>>> On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> 
>>> On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
>>>> 
>>>> I think the patch description justifies the choice of
>>>> solution, but does not describe the original issue at
>>>> all. The original issue had nothing to do with cacheline
>>>> alignment.
>>> 
>>> I disagree, kmalloc with supplied flags will return contiguous memory
>>> which is enough for dma_map_single. It is cache line alignment.
>> 
>> The reason I find this hard to believe is that there is
>> no end alignment guarantee at all in this code, but it
>> works without issue when SLUB debugging is not enabled.
>> 
>> xprtrdma allocates 256 elements in this array on x86.
>> The code makes the array start on an 0x40 byte boundary.
>> I'm pretty sure that means the end of that array will
>> also be on at least an 0x40 byte boundary, and thus
>> aligned to the DMA cacheline, whether or not SLUB
>> debugging is enabled.
>> 
>> Notice that in the current code, if the consumer requests
>> an odd number of SGs, that array can't possibly end on
>> an alignment boundary. But we've never had a complaint.
>> 
>> SLUB debugging changes the alignment of lots of things,
>> but mlx4_alloc_priv_pages is the only breakage that has
>> been reported.
> 
> I think it is related to the custom logic which exists in this function only.
> I posted grep output earlier to emphasize it.
> 
> For example, adding a 2K region after the actual data will ensure
> alignment.
> 
>> 
>> DMA-API.txt says:
>> 
>>> [T]he mapped region must begin exactly on a cache line
>>> boundary and end exactly on one (to prevent two separately
>>> mapped regions from sharing a single cache line)
>> 
>> The way I read this, cacheline alignment shouldn't be
>> an issue at all, as long as DMA cachelines aren't
>> shared between mappings.
>> 
>> If I simply increase the memory allocation size a little
>> and ensure the end of the mapping is aligned, that should
>> be enough to prevent DMA cacheline sharing with another
>> memory allocation on the same page. But I still see Local
>> Protection Errors when SLUB debugging is enabled, on my
>> system (with patches to allocate more pages per MR).
>> 
>> I'm not convinced this has anything to do with DMA
>> cacheline alignment. The reason your patch fixes this
>> issue is because it keeps the entire array on one page.
> 
> If you don't mind, we can do an experiment.
> Let's add padding which will prevent the alignment issue and
> will definitely make the array cross the page boundary.
> 
> diff --git a/drivers/infiniband/hw/mlx4/mr.c
> b/drivers/infiniband/hw/mlx4/mr.c
> index 6312721..41e277e 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -280,8 +280,10 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>        int size = max_pages * sizeof(u64);
> 	int add_size;
> 	int ret;
> -
> +/*
>        add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> +*/
> +       add_size = 2048;
> 
>        mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> 	if (!mr->pages_alloc)

Hi Leon-

I created a different patch, see attachment. It aligns
the start _and_ end of the DMA mapped region, places
large arrays so they encounter a page boundary, and
leaves slack space around each array so there is no
possibility of a shared DMA cacheline or other activity
in that memory.

I am able to reproduce the Local Protection Errors with
this patch applied and SLUB debugging disabled.
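
(Chuck's attachment is not reproduced in this archive. Purely to
illustrate the approach he describes, and not his actual patch, a test
allocation along these lines would over-allocate, then park a
cacheline-aligned array across a page boundary with unused space on both
sides; variable placement and field types here are assumptions:)

	char *buf;
	unsigned long boundary;
	int size = max_pages * sizeof(u64);	/* assume a multiple of 64 */

	/* room for the array, a full extra page, and slack on both sides */
	buf = kzalloc(size + PAGE_SIZE + 256, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* first page boundary comfortably inside the buffer */
	boundary = PAGE_ALIGN((unsigned long)buf + 128);

	/* start 64 bytes before that boundary: start and end are 64-byte
	 * aligned, no other allocation shares the mapped cachelines, and
	 * the array is guaranteed to straddle the page boundary */
	mr->pages = (__be64 *)(boundary - 64);
	mr->pages_alloc = (void *)buf;		/* kept only for kfree() */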


--
Chuck Lever
Sagi Grimberg June 19, 2016, 10:04 a.m. UTC | #2
> Hi Leon-
>
> I created a different patch, see attachment. It aligns
> the start _and_ end of the DMA mapped region, places
> large arrays so they encounter a page boundary, and
> leaves slack space around each array so there is no
> possibility of a shared DMA cacheline or other activity
> in that memory.
>
> I am able to reproduce the Local Protection Errors with
> this patch applied and SLUB debugging disabled.

Thanks Chuck for proving that the dma alignment is not the issue here.

I suggest that we go with my dma coherent patch for now until Leon and
the Mellanox team can debug this one with the HW/FW folks and find out
what is going on.

Leon, I have had my share of debugging in this area on mlx4/mlx5. If you
want, I can help with debugging this one.
Or Gerlitz June 19, 2016, 7:38 p.m. UTC | #3
On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:

>> I am able to reproduce the Local Protection Errors with
>> this patch applied and SLUB debugging disabled.

> Thanks Chuck for proving that the dma alignment is not the issue here.
>
> I suggest that we go with my dma coherent patch for now until Leon and
> the Mellanox team can debug this one with the HW/FW folks and find out
> what is going on.
>
> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
> want I can help with debugging this one.

Hi Sagi, Leon and Co,

From a quick reading of the patch I got the impression that some scheme
which used to work is now broken. Did we get a bisection result
pointing to the upstream commit which introduced the regression? I
didn't see such a note along the thread; basically, I think this is
where we should be starting. Thoughts? I also added the mlx4 core/IB
maintainer.

Or.
Or Gerlitz June 19, 2016, 7:43 p.m. UTC | #4
On Sun, Jun 19, 2016 at 10:38 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:

> From a quick reading of the patch I got the impression that some scheme
> which used to work is now broken. Did we get a bisection result
> pointing to the upstream commit which introduced the regression? I
> didn't see such a note along the thread; basically, I think this is
> where we should be starting. Thoughts? I also added the mlx4 core/IB
> maintainer.

Oh, I missed the 1st post of the thread pointing to commit
1b2cd0fc673c ('IB/mlx4: Support the new memory [...]') -- looking at
the patch, the only thing which is explicitly visible to upper layers
is the setting of the ib_dev.map_mr_sg API call. So is there NFS code
which depends on this verb being exported, and if yes, does it do X,
else Y?
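
For context, the consumer-side pattern looks roughly like the sketch
below (not the xprtrdma code itself; the variable names are placeholders
and the ib_map_mr_sg() signature shown is the 4.6-era one).
ib_map_mr_sg() lands in the driver's map_mr_sg/set_page hooks, which fill
exactly the priv pages array this thread is about, and the result is then
posted as a fast-registration WR:

	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *bad_wr;
	int n;

	n = ib_map_mr_sg(mr, sgl, nents, PAGE_SIZE);
	if (n < nents)
		return -EIO;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr        = mr;
	reg_wr.key       = mr->rkey;
	reg_wr.access    = IB_ACCESS_LOCAL_WRITE |
			   IB_ACCESS_REMOTE_READ |
			   IB_ACCESS_REMOTE_WRITE;

	return ib_post_send(qp, &reg_wr.wr, &bad_wr);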
Chuck Lever June 19, 2016, 8:02 p.m. UTC | #5
> On Jun 19, 2016, at 3:38 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> 
> On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
> 
>>> I am able to reproduce the Local Protection Errors with
>>> this patch applied and SLUB debugging disabled.
> 
>> Thanks Chuck for proving that the dma alignment is not the issue here.
>> 
>> I suggest that we go with my dma coherent patch for now until Leon and
>> the Mellanox team can debug this one with the HW/FW folks and find out
>> what is going on.
>> 
>> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
>> want I can help with debugging this one.
> 
> Hi Sagi, Leon and Co,
> 
> From quick reading of the patch I got the impression that some scheme
> which used to work is now broken, did we get a bisection result
> pointing to the upstream commit which introduce the regression?

Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')

The problem was introduced by the new FR API. I reported
this issue back in late April:

  http://marc.info/?l=linux-rdma&m=146194706501705&w=2

and bisected the appearance of symptoms to:

commit d86bd1bece6fc41d59253002db5441fe960a37f6
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Tue Mar 15 14:55:12 2016 -0700

    mm/slub: support left redzone

The left redzone changes the alignment characteristics
of regions returned by kmalloc. Further diagnosis showed
the problem was with mlx4_alloc_priv_pages(), and the
WR flush occurred only when mr->pages happened to contain
a page boundary.

What we don't understand is why a page boundary in that
array is a problem.
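
A quick way to confirm that correlation (a debugging sketch only, not a
proposed patch; the helper is hypothetical) would be to warn whenever the
array handed to dma_map_single() spans a page boundary:

	static void mlx4_warn_pbl_span(struct mlx4_ib_mr *mr, int max_pages)
	{
		unsigned long start = (unsigned long)mr->pages;
		unsigned long end   = start + max_pages * sizeof(u64) - 1;

		if ((start >> PAGE_SHIFT) != (end >> PAGE_SHIFT))
			pr_warn("mlx4: priv pages array %#lx-%#lx crosses a page boundary\n",
				start, end);
	}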


> I
> didn't see such note along the thread, basically, I think this is
> where we should be starting, thoughts? I also added the mlx4 core/IB
> maintainer.

Yishai was notified about this issue on May 25:

http://marc.info/?l=linux-rdma&m=146419192913960&w=2


--
Chuck Lever



Leon Romanovsky June 20, 2016, 5:44 a.m. UTC | #6
On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
> 

Thanks Chuck and Sagi for the help.

<...>

> 
> > I
> > didn't see such note along the thread, basically, I think this is
> > where we should be starting, thoughts? I also added the mlx4 core/IB
> > maintainer.
> 
> Yishai was notified about this issue on May 25:
> 
> http://marc.info/?l=linux-rdma&m=146419192913960&w=2

Yishai and I are following this thread closely, and we are working on
finding the root cause of this issue.

Thanks

> 
> 
> --
> Chuck Lever
> 
> 
>
Sagi Grimberg June 20, 2016, 6:34 a.m. UTC | #7
> Yishai and I are following this thread closely, and we are working on
> finding the root cause of this issue.

Thanks Leon and Yishai, let me know if you need any help with this.

Do you agree we should move forward with the original patch until
we get this resolved?

Also, did anyone find out if this is happening in mlx5 as well
(Chuck?)? If not, then this would narrow the root cause down to an
mlx4-specific issue.
Leon Romanovsky June 20, 2016, 7:01 a.m. UTC | #8
On Mon, Jun 20, 2016 at 09:34:28AM +0300, Sagi Grimberg wrote:
> 
> >Yishai and me follow this thread closely and we work on finding the
> >root cause of this issue.
> 
> Thanks Leon and Yishai, let me know if you need any help with this.
> 
> Do you agree we should move forward with the original patch until
> we get this resolved?

We will do our best to meet our fixes submission target, which is
planned to be this week.

And we will certainly submit one of the two: your proposed patch,
or Yishai's fix if there is one.

Right now, your patch is running in our verification.

Thanks
Sagi Grimberg June 20, 2016, 8:35 a.m. UTC | #9
> We will do our best to meet our fixes submission target, which is
> planned to be this week.
>
> And we will certainly submit one of the two: your proposed patch,
> or Yishai's fix if there is one.
>
> Right now, your patch is running in our verification.

Sure, thanks Leon.
Yishai Hadas June 20, 2016, 1:41 p.m. UTC | #10
On 6/20/2016 8:44 AM, Leon Romanovsky wrote:
> On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
>>
>
> Thanks Chuck and Sagi for the help.
>
> <...>
>
>>
>>> I
>>> didn't see such note along the thread, basically, I think this is
>>> where we should be starting, thoughts? I also added the mlx4 core/IB
>>> maintainer.
>>
>> Yishai was notified about this issue on May 25:
>>
>> http://marc.info/?l=linux-rdma&m=146419192913960&w=2
>
> Yishai and me follow this thread closely and we work on finding the
> root cause of this issue.
>

Just found the root cause of the problem: it turns out to be a hardware
limitation that is described as part of the PRM. The driver code had to
be written accordingly; I confirmed that internally with the relevant people.

From the PRM:
"The PBL should be physically contiguous, must reside in a
64-byte-aligned address, and must not include the last 8 bytes of a page."

The last sentence means that only one page can be used, since the last 8
bytes of a page must not be included. That's why there is a hard limit of
511 entries in the code.
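
To spell the constraint out (a sketch only, not driver code; the helper
name is made up): with 4 KB pages, a PBL that is 64-byte aligned and must
avoid the last 8 bytes of a page can hold at most (4096 - 8) / 8 = 511
entries, and it must not cross into a second page at all:

	static bool pbl_layout_ok(unsigned long addr, int nentries)
	{
		unsigned long end          = addr + nentries * sizeof(u64);
		unsigned long last_allowed = (addr & PAGE_MASK) + PAGE_SIZE - 8;

		/* "must reside in a 64-byte-aligned address" */
		if (!IS_ALIGNED(addr, 64))
			return false;

		/* "must not include the last 8 bytes of a page"; staying
		 * below this limit also keeps the PBL on a single page */
		return end <= last_allowed;
	}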

Re the candidate fix that you sent: from an initial review it makes sense;
we'll formally confirm it soon after finalizing the regression testing
on our side.

Thanks Chuck and Sagi for evaluating and working on a solution.
Sagi Grimberg June 21, 2016, 1:56 p.m. UTC | #11
> Just found the root cause of the problem: it turns out to be a hardware
> limitation that is described as part of the PRM. The driver code had to
> be written accordingly; I confirmed that internally with the relevant people.
>
> From the PRM:
> "The PBL should be physically contiguous, must reside in a
> 64-byte-aligned address, and must not include the last 8 bytes of a page."
>
> The last sentence means that only one page can be used, since the last 8
> bytes of a page must not be included. That's why there is a hard limit of
> 511 entries in the code.
>
> Re the candidate fix that you sent, from initial review it makes sense,
> we'll formally confirm it soon after finalizing the regression testing
> in our side.
>
> Thanks Chuck and Sagi for evaluating and working on a solution.

Thanks Yishai,

That clears up the root cause.

Does the same hold for mlx5, or can we leave it alone?
Laurence Oberman June 21, 2016, 2:35 p.m. UTC | #12
----- Original Message -----
> From: "Sagi Grimberg" <sagigrim@gmail.com>
> To: "Yishai Hadas" <yishaih@dev.mellanox.co.il>, "Chuck Lever" <chuck.lever@oracle.com>
> Cc: leon@kernel.org, "Or Gerlitz" <gerlitz.or@gmail.com>, "Yishai Hadas" <yishaih@mellanox.com>, "linux-rdma"
> <linux-rdma@vger.kernel.org>, "Linux NFS Mailing List" <linux-nfs@vger.kernel.org>, "Majd Dibbiny"
> <majd@mellanox.com>
> Sent: Tuesday, June 21, 2016 9:56:44 AM
> Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages
> 
> 
> > Just found the root cause of the problem, it was found to be a hardware
> > limitation that is described as part of the PRM. The driver code had to
> > be written accordingly, confirmed that internally with the relevant people.
> >
> >  From PRM:
> > "The PBL should be physically contiguous, must reside in a
> > 64-byte-aligned address, and must not include the last 8 bytes of a page."
> >
> > The last sentence pointed that only one page can be used as the last 8
> > bytes should not be included. That's why there is a hard limit in the
> > code for 511 entries.
> >
> > Re the candidate fix that you sent, from initial review it makes sense,
> > we'll formally confirm it soon after finalizing the regression testing
> > in our side.
> >
> > Thanks Chuck and Sagi for evaluating and working on a solution.
> 
> Thanks Yishai,
> 
> That clears up the root-cause.
> 
> Does the same holds for mlx5? or we can leave it alone?
> 

Also wondering about mlx5, because the default there is coherent and
increasing the allowed queue depth got me into the swiotlb error situation.
Backing the queue depth down to 32, per Bart's suggestion, avoids the
swiotlb errors.
Likely 128 is too high anyway, but the weird part of my testing, as already
mentioned, is that it is only seen during reconnect activity.

Thanks
Laurence
diff mbox

Patch

diff --git a/drivers/infiniband/hw/mlx4/mr.c
b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..41e277e 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -280,8 +280,10 @@  mlx4_alloc_priv_pages(struct ib_device *device,
        int size = max_pages * sizeof(u64);
	int add_size;
	int ret;
-
+/*
        add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+*/
+       add_size = 2048;

        mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	if (!mr->pages_alloc)