Message ID | 20160618105650.GD5408@leon.nu (mailing list archive) |
---|---|
State | New, archived |
> On Jun 18, 2016, at 6:56 AM, Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Jun 17, 2016 at 03:55:56PM -0400, Chuck Lever wrote:
>>
>>> On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>
>>> On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
>>>>
>>>> I think the patch description justifies the choice of
>>>> solution, but does not describe the original issue at
>>>> all. The original issue had nothing to do with cacheline
>>>> alignment.
>>>
>>> I disagree, kmalloc with supplied flags will return contiguous memory
>>> which is enough for dma_map_single. It is cache line alignment.
>>
>> The reason I find this hard to believe is that there is
>> no end alignment guarantee at all in this code, but it
>> works without issue when SLUB debugging is not enabled.
>>
>> xprtrdma allocates 256 elements in this array on x86.
>> The code makes the array start on an 0x40 byte boundary.
>> I'm pretty sure that means the end of that array will
>> also be on at least an 0x40 byte boundary, and thus
>> aligned to the DMA cacheline, whether or not SLUB
>> debugging is enabled.
>>
>> Notice that in the current code, if the consumer requests
>> an odd number of SGs, that array can't possibly end on
>> an alignment boundary. But we've never had a complaint.
>>
>> SLUB debugging changes the alignment of lots of things,
>> but mlx4_alloc_priv_pages is the only breakage that has
>> been reported.
>
> I think it is related to custom logic which is in this function only.
> I posted grep output earlier to emphasize it.
>
> For example adds 2K region after the actual data and it will ensure
> alignment.
>
>>
>> DMA-API.txt says:
>>
>>> [T]he mapped region must begin exactly on a cache line
>>> boundary and end exactly on one (to prevent two separately
>>> mapped regions from sharing a single cache line)
>>
>> The way I read this, cacheline alignment shouldn't be
>> an issue at all, as long as DMA cachelines aren't
>> shared between mappings.
>>
>> If I simply increase the memory allocation size a little
>> and ensure the end of the mapping is aligned, that should
>> be enough to prevent DMA cacheline sharing with another
>> memory allocation on the same page. But I still see Local
>> Protection Errors when SLUB debugging is enabled, on my
>> system (with patches to allocate more pages per MR).
>>
>> I'm not convinced this has anything to do with DMA
>> cacheline alignment. The reason your patch fixes this
>> issue is because it keeps the entire array on one page.
>
> If you don't mind, we can do an experiment.
> Let's add padding which will prevent alignment issue and
> for sure will cross the page boundary.
>
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index 6312721..41e277e 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -280,8 +280,10 @@ mlx4_alloc_priv_pages(struct ib_device *device,
> 	int size = max_pages * sizeof(u64);
> 	int add_size;
> 	int ret;
> -
> +/*
> 	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> +*/
> +	add_size = 2048;
>
> 	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> 	if (!mr->pages_alloc)

Hi Leon-

I created a different patch, see attachment. It aligns
the start _and_ end of the DMA mapped region, places
large arrays so they encounter a page boundary, and
leaves slack space around each array so there is no
possibility of a shared DMA cacheline or other activity
in that memory.

I am able to reproduce the Local Protection Errors with
this patch applied and SLUB debugging disabled.

--
Chuck Lever
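As a rough illustration of the approach Chuck describes (a start- and
end-aligned mapping with slack around the array), a sketch could look
like the following. The helper name and the alignment constant are
assumptions for illustration only; this is not the attached patch:

/* Sketch only, not the attached patch: place the page-list array on a
 * DMA-cacheline-aligned start, make its mapped length a multiple of
 * that alignment, and leave at least one full cacheline of slack
 * before and after it so no other allocation can share a DMA
 * cacheline with the mapped region.  PRIV_PAGES_DMA_ALIGN is an
 * assumed constant; pages_alloc/pages are the existing driver fields.
 */
#define PRIV_PAGES_DMA_ALIGN	64	/* assumed DMA cacheline size */

static int
alloc_priv_pages_padded(struct mlx4_ib_mr *mr, int max_pages)
{
	size_t size = ALIGN(max_pages * sizeof(u64), PRIV_PAGES_DMA_ALIGN);

	/* three extra cachelines guarantee >= one cacheline of slack
	 * on each side of the aligned array, whatever kzalloc returns
	 */
	mr->pages_alloc = kzalloc(size + 3 * PRIV_PAGES_DMA_ALIGN,
				  GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* start on an aligned boundary; size is a multiple of the
	 * alignment, so the end of the mapped region is aligned too
	 */
	mr->pages = PTR_ALIGN(mr->pages_alloc + PRIV_PAGES_DMA_ALIGN,
			      PRIV_PAGES_DMA_ALIGN);
	return 0;
}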
> Hi Leon-
>
> I created a different patch, see attachment. It aligns
> the start _and_ end of the DMA mapped region, places
> large arrays so they encounter a page boundary, and
> leaves slack space around each array so there is no
> possibility of a shared DMA cacheline or other activity
> in that memory.
>
> I am able to reproduce the Local Protection Errors with
> this patch applied and SLUB debugging disabled.

Thanks Chuck for proving that the dma alignment is not the issue here.

I suggest that we go with my dma coherent patch for now until Leon and
the Mellanox team can debug this one with the HW/FW folks and find out
what is going on.

Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
want I can help with debugging this one.
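For context, the "dma coherent patch" approach amounts to allocating
the page-list array from coherent DMA memory instead of kmalloc'ing it
and streaming-mapping it. A simplified sketch of that idea follows; it
is not the actual patch under discussion, the page_map and max_pages
field names are assumptions borrowed from the surrounding driver code,
and error handling is trimmed:

/* Sketch: coherent allocation for the MR page list.  Coherent memory
 * removes the need for dma_map_single/dma_sync calls on mr->pages.
 */
static int
mlx4_alloc_priv_pages_coherent(struct ib_device *device,
			       struct mlx4_ib_mr *mr, int max_pages)
{
	int size = max_pages * sizeof(u64);

	mr->pages = dma_alloc_coherent(device->dma_device, size,
				       &mr->page_map, GFP_KERNEL);
	if (!mr->pages)
		return -ENOMEM;

	mr->max_pages = max_pages;
	return 0;
}

static void
mlx4_free_priv_pages_coherent(struct ib_device *device,
			      struct mlx4_ib_mr *mr)
{
	if (!mr->pages)
		return;
	dma_free_coherent(device->dma_device,
			  mr->max_pages * sizeof(u64),
			  mr->pages, mr->page_map);
	mr->pages = NULL;
}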
On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:

>> I am able to reproduce the Local Protection Errors with
>> this patch applied and SLUB debugging disabled.

> Thanks Chuck for proving that the dma alignment is not the issue here.
>
> I suggest that we go with my dma coherent patch for now until Leon and
> the Mellanox team can debug this one with the HW/FW folks and find out
> what is going on.
>
> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
> want I can help with debugging this one.

Hi Sagi, Leon and Co,

From a quick reading of the patch I got the impression that some scheme
which used to work is now broken. Did we get a bisection result
pointing to the upstream commit which introduced the regression? I
didn't see such a note along the thread; basically, I think this is
where we should be starting, thoughts? I also added the mlx4 core/IB
maintainer.

Or.
On Sun, Jun 19, 2016 at 10:38 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:

> From a quick reading of the patch I got the impression that some scheme
> which used to work is now broken. Did we get a bisection result
> pointing to the upstream commit which introduced the regression? I
> didn't see such a note along the thread; basically, I think this is
> where we should be starting, thoughts? I also added the mlx4 core/IB
> maintainer.

Oh, I missed the first post of the thread pointing to commit
1b2cd0fc673c ('IB/mlx4: Support the new memory [...]') -- looking at
the patch, the only thing which is explicitly visible to upper layers
is the setting of the ib_dev.map_mr_sg API call. So is there NFS code
which depends on this verb being exported, and if yes, does X, else
does Y?
> On Jun 19, 2016, at 3:38 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>
> On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <sagigrim@gmail.com> wrote:
>
>>> I am able to reproduce the Local Protection Errors with
>>> this patch applied and SLUB debugging disabled.
>
>> Thanks Chuck for proving that the dma alignment is not the issue here.
>>
>> I suggest that we go with my dma coherent patch for now until Leon and
>> the Mellanox team can debug this one with the HW/FW folks and find out
>> what is going on.
>>
>> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
>> want I can help with debugging this one.
>
> Hi Sagi, Leon and Co,
>
> From a quick reading of the patch I got the impression that some scheme
> which used to work is now broken. Did we get a bisection result
> pointing to the upstream commit which introduced the regression?

Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')

The problem was introduced by the new FR API. I reported this issue
back in late April:

http://marc.info/?l=linux-rdma&m=146194706501705&w=2

and bisected the appearance of symptoms to:

commit d86bd1bece6fc41d59253002db5441fe960a37f6
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Tue Mar 15 14:55:12 2016 -0700

    mm/slub: support left redzone

The left redzone changes the alignment characteristics of regions
returned by kmalloc. Further diagnosis showed the problem was with
mlx4_alloc_priv_pages(), and the WR flush occurred only when
mr->pages happened to contain a page boundary.

What we don't understand is why a page boundary in that array is a
problem.

> I didn't see such a note along the thread; basically, I think this is
> where we should be starting, thoughts? I also added the mlx4 core/IB
> maintainer.

Yishai was notified about this issue on May 25:

http://marc.info/?l=linux-rdma&m=146419192913960&w=2

--
Chuck Lever
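To make the trigger condition concrete: the symptom appears only when
the kmalloc'ed array straddles two pages, which SLUB's left redzone can
cause by shifting the start of the allocation. An illustrative check
for that condition (not driver code) would be:

/* Illustrative helper, not from the driver: returns true when a
 * buffer returned by kmalloc spans a page boundary.
 */
static bool spans_page_boundary(const void *buf, size_t len)
{
	unsigned long first = (unsigned long)buf >> PAGE_SHIFT;
	unsigned long last = ((unsigned long)buf + len - 1) >> PAGE_SHIFT;

	return first != last;
}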
On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
>

Thanks Chuck and Sagi for the help.

<...>

>
>> I didn't see such a note along the thread; basically, I think this is
>> where we should be starting, thoughts? I also added the mlx4 core/IB
>> maintainer.
>
> Yishai was notified about this issue on May 25:
>
> http://marc.info/?l=linux-rdma&m=146419192913960&w=2

Yishai and I are following this thread closely, and we are working on
finding the root cause of this issue.

Thanks

> --
> Chuck Lever
> Yishai and I are following this thread closely, and we are working on
> finding the root cause of this issue.

Thanks Leon and Yishai, let me know if you need any help with this.

Do you agree we should move forward with the original patch until
we get this resolved?

Also, did anyone find out if this is happening in mlx5 as well?
(Chuck?) If not, then this would narrow the root cause down to an
mlx4-specific issue.
On Mon, Jun 20, 2016 at 09:34:28AM +0300, Sagi Grimberg wrote:
>
>> Yishai and I are following this thread closely, and we are working on
>> finding the root cause of this issue.
>
> Thanks Leon and Yishai, let me know if you need any help with this.
>
> Do you agree we should move forward with the original patch until
> we get this resolved?

We will do our best to meet our fixes submission target, which is
planned to be this week.

And for sure we will submit one of the two: your proposed patch, or
Yishai's fix if any.

Right now, your patch is running in our verification.

Thanks
> We will do our best to meet our fixes submission target, which is
> planned to be this week.
>
> And for sure we will submit one of the two: your proposed patch, or
> Yishai's fix if any.
>
> Right now, your patch is running in our verification.

Sure, thanks Leon.
On 6/20/2016 8:44 AM, Leon Romanovsky wrote:
> On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
>>
>
> Thanks Chuck and Sagi for the help.
>
> <...>
>
>>
>>> I didn't see such a note along the thread; basically, I think this is
>>> where we should be starting, thoughts? I also added the mlx4 core/IB
>>> maintainer.
>>
>> Yishai was notified about this issue on May 25:
>>
>> http://marc.info/?l=linux-rdma&m=146419192913960&w=2
>
> Yishai and I are following this thread closely, and we are working on
> finding the root cause of this issue.
>

Just found the root cause of the problem: it is a hardware limitation
that is described as part of the PRM. The driver code had to be written
accordingly; this was confirmed internally with the relevant people.

From the PRM:
"The PBL should be physically contiguous, must reside in a
64-byte-aligned address, and must not include the last 8 bytes of a
page."

The last sentence points out that only one page can be used, since the
last 8 bytes should not be included. That's why there is a hard limit
in the code of 511 entries.

Re the candidate fix that you sent, from initial review it makes sense;
we'll formally confirm it soon after finalizing the regression testing
on our side.

Thanks Chuck and Sagi for evaluating and working on a solution.
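One way to honour the constraint Yishai quotes is to give each MR's
PBL a page of its own and cap the entry count so the last 8 bytes of
that page are never used. A sketch of that idea follows, with assumed
names, and not necessarily the fix that was eventually merged:

/* Sketch: the PBL is physically contiguous and 64-byte aligned because
 * it occupies a single, freshly allocated page, and the entry count is
 * capped so the last 8 bytes of the page are never touched.
 * MAX_PBL_ENTRIES_PER_PAGE is an assumed name (511 with 4K pages); the
 * page would be released with free_page() when the MR is destroyed.
 */
#define MAX_PBL_ENTRIES_PER_PAGE	((PAGE_SIZE - 8) / sizeof(u64))

static int mlx4_alloc_pbl_page(struct mlx4_ib_mr *mr, int max_pages)
{
	if (max_pages > MAX_PBL_ENTRIES_PER_PAGE)
		return -EINVAL;

	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
	if (!mr->pages)
		return -ENOMEM;

	mr->max_pages = max_pages;
	return 0;
}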
> Just found the root cause of the problem: it is a hardware limitation
> that is described as part of the PRM. The driver code had to be written
> accordingly; this was confirmed internally with the relevant people.
>
> From the PRM:
> "The PBL should be physically contiguous, must reside in a
> 64-byte-aligned address, and must not include the last 8 bytes of a
> page."
>
> The last sentence points out that only one page can be used, since the
> last 8 bytes should not be included. That's why there is a hard limit
> in the code of 511 entries.
>
> Re the candidate fix that you sent, from initial review it makes sense;
> we'll formally confirm it soon after finalizing the regression testing
> on our side.
>
> Thanks Chuck and Sagi for evaluating and working on a solution.

Thanks Yishai,

That clears up the root cause.

Does the same hold for mlx5? Or can we leave it alone?
----- Original Message -----
> From: "Sagi Grimberg" <sagigrim@gmail.com>
> To: "Yishai Hadas" <yishaih@dev.mellanox.co.il>, "Chuck Lever" <chuck.lever@oracle.com>
> Cc: leon@kernel.org, "Or Gerlitz" <gerlitz.or@gmail.com>, "Yishai Hadas" <yishaih@mellanox.com>,
> "linux-rdma" <linux-rdma@vger.kernel.org>, "Linux NFS Mailing List" <linux-nfs@vger.kernel.org>,
> "Majd Dibbiny" <majd@mellanox.com>
> Sent: Tuesday, June 21, 2016 9:56:44 AM
> Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages
>
>> Just found the root cause of the problem: it is a hardware limitation
>> that is described as part of the PRM. The driver code had to be written
>> accordingly; this was confirmed internally with the relevant people.
>>
>> From the PRM:
>> "The PBL should be physically contiguous, must reside in a
>> 64-byte-aligned address, and must not include the last 8 bytes of a
>> page."
>>
>> The last sentence points out that only one page can be used, since the
>> last 8 bytes should not be included. That's why there is a hard limit
>> in the code of 511 entries.
>>
>> Re the candidate fix that you sent, from initial review it makes sense;
>> we'll formally confirm it soon after finalizing the regression testing
>> on our side.
>>
>> Thanks Chuck and Sagi for evaluating and working on a solution.
>
> Thanks Yishai,
>
> That clears up the root cause.
>
> Does the same hold for mlx5? Or can we leave it alone?

Also wondering about mlx5, because the default is coherent and
increasing the allowed queue depth got me into the swiotlb error
situation. Backing the queue depth down to 32, per Bart's suggestion,
avoids the swiotlb errors. Likely 128 is too high anyway, but the weird
part of my testing, as already mentioned, was that it's only seen
during reconnect activity.

Thanks
Laurence
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..41e277e 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -280,8 +280,10 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 	int size = max_pages * sizeof(u64);
 	int add_size;
 	int ret;
-
+/*
 	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+*/
+	add_size = 2048;
 
 	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
 	if (!mr->pages_alloc)