
[rdma-next] RDMA/i40iw: Hold read semaphore while looking up VMA

Message ID 20180701163624.28620-1-leon@kernel.org (mailing list archive)
State Accepted
Delegated to: Jason Gunthorpe

Commit Message

Leon Romanovsky July 1, 2018, 4:36 p.m. UTC
From: Leon Romanovsky <leonro@mellanox.com>

VMA lookup is supposed to be performed while mmap_sem is held.

Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/i40iw/i40iw_verbs.c | 2 ++
 1 file changed, 2 insertions(+)

--
2.14.4


Comments

Jason Gunthorpe July 4, 2018, 5:54 p.m. UTC | #1
On Sun, Jul 01, 2018 at 07:36:24PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> VMA lookup is supposed to be performed while mmap_sem is held.
> 
> Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/infiniband/hw/i40iw/i40iw_verbs.c | 2 ++
>  1 file changed, 2 insertions(+)

As discussed, this entire approach in the driver is wrong, but we may
as well have the right locking until it gets fixed :(

Applied to for-next

Thanks,
Jason

> diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> index 8884ff71a634..7d85414742ff 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> @@ -1410,6 +1410,7 @@ static void i40iw_set_hugetlb_values(u64 addr, struct i40iw_mr *iwmr)
>  	struct vm_area_struct *vma;
>  	struct hstate *h;
> 
> +	down_read(&current->mm->mmap_sem);
>  	vma = find_vma(current->mm, addr);
>  	if (vma && is_vm_hugetlb_page(vma)) {
>  		h = hstate_vma(vma);
> @@ -1418,6 +1419,7 @@ static void i40iw_set_hugetlb_values(u64 addr, struct i40iw_mr *iwmr)
>  			iwmr->page_msk = huge_page_mask(h);
>  		}
>  	}
> +	up_read(&current->mm->mmap_sem);
>  }
> 
>  /**
Shiraz Saleem July 21, 2018, 1:31 p.m. UTC | #2
On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> VMA lookup is supposed to be performed while mmap_sem is held.
> 
> Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>
Thanks for fixing this bug.

Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
Jason Gunthorpe July 23, 2018, 8:39 p.m. UTC | #3
On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@mellanox.com>
> > 
> > VMA lookup is supposed to be performed while mmap_sem is held.
> > 
> > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> >
> Thanks for fixing this bug.
> 
> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 

This code in the driver all looks wrong to me, it should not rely on
the huge pages flag and it shouldn't do this VM walking.

Can you please fix it?

Thanks,
Jason
Shiraz Saleem July 23, 2018, 10:19 p.m. UTC | #4
On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > > 
> > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > 
> > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > >
> > Thanks for fixing this bug.
> > 
> > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> 
> This code in the driver all looks wrong to me, it should not rely on
> the huge pages flag and it shouldn't do this VM walking.
>
Hi Jason - Why is it that we can't use umem->hugetlb to determine if the
MR is composed of huge pages? Isn't that why the flag was defined in
the first place? All the VMAs w.r.t. the user-space buffer are
scanned to make this determination in ib_umem_get.
Isn't it better to use that than to implement it in each individual driver?

The umem->hugetlb flag is checked once in i40iw during MR creation.

Shiraz
> 
Jason Gunthorpe July 23, 2018, 10:29 p.m. UTC | #5
On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > 
> > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > 
> > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > >
> > > Thanks for fixing this bug.
> > > 
> > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > 
> > This code in the driver all looks wrong to me, it should not rely on
> > the huge pages flag and it shouldn't do this VM walking.
> >
> Hi Jason - Why is that we cant use umem->hugetlb to determine if
> MR is composed of huge pages? Isnt that why the flag was defined for
> in the first place. All the VMAs w.r.t the user-space buffer are
> scanned to make this determination in ib_umem_get.
> Isnt it better to use it than implement it in individual driver.

It was a mistake to have this flag and all drivers except i40iw and
bnxt don't use it. I wish to remove it.

Drivers are supposed to upgrade the page order by looking for runs of
suitable physical addresses after DMA translation, not by inspecting
the VMA. Something like mlx5_ib_cont_pages() and related

The core could provide support code for this common algorithm..
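
For illustration, here is a rough, untested sketch of the kind of scan
such support code could do -- walk the DMA-mapped SG list and find the
largest page shift that every block's address and length still honour
(the helper name and signature are made up for this example):

static unsigned int umem_best_page_shift(struct ib_umem *umem,
					 unsigned int max_shift)
{
	struct scatterlist *sg;
	u64 bits = 0;
	int i;

	/* OR together every DMA address and length; any low bit that is
	 * ever set rules out that page size and everything above it.
	 * (This glosses over the offset-into-first-page case that real
	 * drivers handle separately.)
	 */
	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
		bits |= sg_dma_address(sg) | sg_dma_len(sg);

	if (!bits)
		return max_shift;
	return min_t(unsigned int, __ffs64(bits), max_shift);
}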

Jason
Shiraz Saleem July 26, 2018, 4:34 p.m. UTC | #6
On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > 
> > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > 
> > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > >
> > > > Thanks for fixing this bug.
> > > > 
> > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > 
> > > This code in the driver all looks wrong to me, it should not rely on
> > > the huge pages flag and it shouldn't do this VM walking.
> > >
> > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > MR is composed of huge pages? Isnt that why the flag was defined for
> > in the first place. All the VMAs w.r.t the user-space buffer are
> > scanned to make this determination in ib_umem_get.
> > Isnt it better to use it than implement it in individual driver.
> 
> It was a mistake to have this flag and all drivers except i40iw and
> bnxt don't use it. I wish to remove it.
I see. What was the drawback to the approach?

 
> Drivers are supposed to upgrade the page order by looking for runs of
> suitable physical addresses after DMA translation, not by inspecting
> the VMA. Something like mlx5_ib_cont_pages() and related
> 
> The core could provde support code for this common algorithm..
It is a great idea for the core to replace the flag with the common algorithm
and avoid duplicating it in every driver. I’m not in favor of changing the driver
just to get rid of the flag and then changing it again to use the common algorithm in the core.

Shiraz
Jason Gunthorpe July 26, 2018, 4:46 p.m. UTC | #7
On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > > 
> > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > 
> > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > > >
> > > > > Thanks for fixing this bug.
> > > > > 
> > > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > > 
> > > > This code in the driver all looks wrong to me, it should not rely on
> > > > the huge pages flag and it shouldn't do this VM walking.
> > > >
> > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > scanned to make this determination in ib_umem_get.
> > > Isnt it better to use it than implement it in individual driver.
> > 
> > It was a mistake to have this flag and all drivers except i40iw and
> > bnxt don't use it. I wish to remove it.
> I see. What was the drawback to the approach?

It is just wrongly designed.. page merging is about combining physical
addresses within a HW supported size list, not about matching the CPU
huge page size(s) to HW.

We should never even care about 'huge pages' in the VM sense, and
certainly never scan a VM, and it is buggy/racy to do so, I
think. Leon already noticed missing locking, but I think there is
worse here since we have desync'd with the lock that get_user_pages
was run under.

> > Drivers are supposed to upgrade the page order by looking for runs of
> > suitable physical addresses after DMA translation, not by inspecting
> > the VMA. Something like mlx5_ib_cont_pages() and related
> > 
> > The core could provde support code for this common algorithm..
>
> It is a great idea for core to replace the flag with the common algorithm
> and avoid duplicating in every driver. I’m not in favor of changing the driver
> just to get rid of the flag and then again to use the common algorithm in core.

Well, someone will need to figure out what that core bit should look
like, and turns out this driver is buggy in this area, so.... :)

I think this would probably be something like passing in a bitmap (?)
of supported page sizes to ib_umem_get() and then ib_umem_get would
combine pages when it builds up the s/g list.

So each driver gets a minimum set of s/g entries within its
set of supported page sizes, and doesn't have to do any more work.

The algorithm is a bit tricky though, I think?
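
As a sketch of the idea (purely hypothetical interface -- nothing named
ib_umem_get_pgsz() or write_hw_pble() exists today, and 'mr' is just a
stand-in for the driver's MR object), the driver side could then reduce
to something like:

	struct scatterlist *sg;
	unsigned long hw_pgsz = SZ_4K | SZ_2M | SZ_1G;	/* HW-supported sizes */
	int i;

	umem = ib_umem_get_pgsz(pd->uobject->context, start, length,
				access_flags, hw_pgsz);
	if (IS_ERR(umem))
		return PTR_ERR(umem);

	/* Every SG entry is now sized and aligned to one of the sizes in
	 * hw_pgsz, so it can be copied straight into the HW page table. */
	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
		write_hw_pble(mr, sg_dma_address(sg), sg_dma_len(sg));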

Jason
Shiraz Saleem July 31, 2018, 2:19 p.m. UTC | #8
On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > 
> > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > 
> > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > > > >
> > > > > > Thanks for fixing this bug.
> > > > > > 
> > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > > > 
> > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > >
> > > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > > scanned to make this determination in ib_umem_get.
> > > > Isnt it better to use it than implement it in individual driver.
> > > 
> > > It was a mistake to have this flag and all drivers except i40iw and
> > > bnxt don't use it. I wish to remove it.
> > I see. What was the drawback to the approach?
> 
> It is just wongly designed.. page merging is about combining physical
> addresses within a HW supported size list, not about matching the CPU
> huge page size(s) to HW.

Well ok. That is a different way of looking at it.
Our assumption was tied to the fact that the OS makes
available only a few page sizes (depending on the CPU arch) and
huge pages fall in that bracket. So it made sense for it to be
supported by the core. But your approach seems to give more
flexibility and gets rid of the locking. 

> We should never even care about 'huge pages' in the VM sense, and
> certainly never scan a VM, and it is buggy/racy to do so, I
> think. Leon already noticed missing locking, but I think there is
> worse here since we have desync'd with the lock that get_user_pages
> was run under.

OK. At this point, we are looking to fix this in i40iw with an algorithm that
looks at physical page addresses.

Not sure I follow the desync argument as the pages are pinned and given
to us by the core. But if this is really a concern, there are other places
where the VMA scan is done too.

https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389
https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341

Shiraz
Jason Gunthorpe July 31, 2018, 2:31 p.m. UTC | #9
On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > > 
> > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > 
> > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > >
> > > > > > > Thanks for fixing this bug.
> > > > > > > 
> > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > > > > 
> > > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > >
> > > > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > > > scanned to make this determination in ib_umem_get.
> > > > > Isnt it better to use it than implement it in individual driver.
> > > > 
> > > > It was a mistake to have this flag and all drivers except i40iw and
> > > > bnxt don't use it. I wish to remove it.
> > > I see. What was the drawback to the approach?
> > 
> > It is just wongly designed.. page merging is about combining physical
> > addresses within a HW supported size list, not about matching the CPU
> > huge page size(s) to HW.
> 
> Well ok. That is a different way of looking at it.
> Our assumption was tied to the fact that the OS makes
> available only a few page sizes (depending on the CPU arch) and
> huge pages fall in that bracket. So it made sense for it be
> supported by core. But your approach seems to give more
> flexibility and gets rid of the locking. 

There are various patches that produce contiguous physical allocations
in various ways beyond the huge page size restriction. Drivers must
look at the physical pages to take advantage of that.

> > We should never even care about 'huge pages' in the VM sense, and
> > certainly never scan a VM, and it is buggy/racy to do so, I
> > think. Leon already noticed missing locking, but I think there is
> > worse here since we have desync'd with the lock that get_user_pages
> > was run under.
> 
> OK. At this point, we are looking to fix in i40iw with an algo that
> looks at physical page address.

Ideally I'd like this in the core code.. So if you are going to write
and test something new, lets do that.

> Not sure I follow the desync argument as the pages are pinned and given
> to us by the core. But if this is really a concern, there are other places
> where the VMA scan is done too.
>
>https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389

This is a bit tricky, but because it is done before get_user_pages,
and only influences the access flags, it is not racy in a way that
really matters. At worst a race would cause ib_umem_get to fail
when maybe it shouldn't have.

> https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341

ODP is supposed to not be racy with changes to the mm as it doesn't
use get_user_pages in the same way..

But not sure what this is doing, or what IB_ACCESS_HUGETLB is supposed
to be..

Also kind of looks wrong, wouldn't recommend copying it :)

Jason
Shiraz Saleem Aug. 14, 2018, 2 p.m. UTC | #10
On Tue, Jul 31, 2018 at 08:31:10AM -0600, Jason Gunthorpe wrote:
> On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> > On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > > > 
> > > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > > 
> > > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > > >
> > > > > > > > Thanks for fixing this bug.
> > > > > > > > 
> > > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > > > > > 
> > > > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > > >
> > > > > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > > > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > > > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > > > > scanned to make this determination in ib_umem_get.
> > > > > > Isnt it better to use it than implement it in individual driver.
> > > > > 
> > > > > It was a mistake to have this flag and all drivers except i40iw and
> > > > > bnxt don't use it. I wish to remove it.
> > > > I see. What was the drawback to the approach?
> > > 
> > > It is just wongly designed.. page merging is about combining physical
> > > addresses within a HW supported size list, not about matching the CPU
> > > huge page size(s) to HW.
> > 
> > Well ok. That is a different way of looking at it.
> > Our assumption was tied to the fact that the OS makes
> > available only a few page sizes (depending on the CPU arch) and
> > huge pages fall in that bracket. So it made sense for it be
> > supported by core. But your approach seems to give more
> > flexibility and gets rid of the locking. 
> 
> There are various patches that produce contiguous physical allocations
> in various ways beyond the huge page size restriction. Drivers must
> look at the physical pages to take advantage of that.
> 
> > > We should never even care about 'huge pages' in the VM sense, and
> > > certainly never scan a VM, and it is buggy/racy to do so, I
> > > think. Leon already noticed missing locking, but I think there is
> > > worse here since we have desync'd with the lock that get_user_pages
> > > was run under.
> > 
> > OK. At this point, we are looking to fix in i40iw with an algo that
> > looks at physical page address.
> 
> Ideally I'd like this in the core code.. So if you are going to write
> and test something new, lets do that.


Ok. As you suggested in an earlier thread, we could have drivers pass in a bitmap of supported
HW page sizes to ib_umem_get, and it would return a selected size along with whether the
MR is contiguous or not.

With respect to the algo in ib_umem_get, I am considering 2 options --

1. Scan the entire page list returned after pinning and check if it satisfies the
HW page size (starting with the default PAGE_SIZE and going higher). The page_list and the
selected HW size are then used to size the SG table and build the minimal SG list
depending on the HW size. Similar to sg_alloc_table_from_pages() (see the sketch below).

2. Leave the current pinning and SG list creation same.
https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192
Scan the SG entries to get the page list and check to see if it satisfies the HW page size.
Rebuild a new SG list (or collapse the current list) to have minimal entries if a HW
size (> PAGE_SIZE) is selected.
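
(As a loose illustration of the coalescing that option 1 refers to --
'pages', 'npages', 'addr' and 'length' stand in for what pinning
returned, and error handling is omitted:)

	struct sg_table sgt;
	int ret;

	/* Physically contiguous runs in 'pages' are merged into single,
	 * larger SG entries, so the table ends up with one entry per
	 * contiguous chunk rather than one per 4K page. */
	ret = sg_alloc_table_from_pages(&sgt, pages, npages,
					offset_in_page(addr), length,
					GFP_KERNEL);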

Thoughts?

Additionally a question I have is what HW size do we select for a contiguous MR.
For example, if a driver says it supports 4K, 64K and 1G, and we get an MR of 128K
that is contiguous, what does the core select for HW size since all 3 can be done?
Maybe the largest supported HW size that is <= 1/2 of MR len?
So in this case, it would be 64K. 

Shiraz


> > Not sure I follow the desync argument as the pages are pinned and given
> > to us by the core. But if this is really a concern, there are other places
> > where the VMA scan is done too.
> >
> >https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389
> 
> This is a bit tricky, but becausae it is done before get_user_pages,
> and only influences the access flags, it is not racy in a way that
> really matters. At worst a race would cause ib_umem_get to return fail
> when maybe it shouldn't have.
> 
> > https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341
> 
> ODP is supposed to not be racy with changes to the mm as it doesn't
> use get_user_pages in the same way..
> 
> But not sure what this is doing, or what IB_ACCESS_HUGETLB is supposed
> to be..
> 
> Also kind of looks wrong, wouldn't recommend copying it :)
> 
> Jason
Jason Gunthorpe Aug. 14, 2018, 8:03 p.m. UTC | #11
On Tue, Aug 14, 2018 at 09:00:17AM -0500, Shiraz Saleem wrote:
> On Tue, Jul 31, 2018 at 08:31:10AM -0600, Jason Gunthorpe wrote:
> > On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> > > On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > > > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > > > > 
> > > > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > > > 
> > > > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > > > > > > > >
> > > > > > > > > Thanks for fixing this bug.
> > > > > > > > > 
> > > > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> 
> > > > > > > > 
> > > > > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > > > >
> > > > > > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > > > > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > > > > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > > > > > scanned to make this determination in ib_umem_get.
> > > > > > > Isnt it better to use it than implement it in individual driver.
> > > > > > 
> > > > > > It was a mistake to have this flag and all drivers except i40iw and
> > > > > > bnxt don't use it. I wish to remove it.
> > > > > I see. What was the drawback to the approach?
> > > > 
> > > > It is just wongly designed.. page merging is about combining physical
> > > > addresses within a HW supported size list, not about matching the CPU
> > > > huge page size(s) to HW.
> > > 
> > > Well ok. That is a different way of looking at it.
> > > Our assumption was tied to the fact that the OS makes
> > > available only a few page sizes (depending on the CPU arch) and
> > > huge pages fall in that bracket. So it made sense for it be
> > > supported by core. But your approach seems to give more
> > > flexibility and gets rid of the locking. 
> > 
> > There are various patches that produce contiguous physical allocations
> > in various ways beyond the huge page size restriction. Drivers must
> > look at the physical pages to take advantage of that.
> > 
> > > > We should never even care about 'huge pages' in the VM sense, and
> > > > certainly never scan a VM, and it is buggy/racy to do so, I
> > > > think. Leon already noticed missing locking, but I think there is
> > > > worse here since we have desync'd with the lock that get_user_pages
> > > > was run under.
> > > 
> > > OK. At this point, we are looking to fix in i40iw with an algo that
> > > looks at physical page address.
> > 
> > Ideally I'd like this in the core code.. So if you are going to write
> > and test something new, lets do that.
> 
> 
> Ok. As you suggested in earlier thread, we could have drivers pass in bitmap of supported
> HW pg sizes to ib_umem_get and it would returned a selected size along with whether the
> MR is contiguous or not.
> 
> With respect to the algo in ib_umem_get, I am considering 2 options --
> 
> 1. Scan the entire page list returned after pinning check if satisfies the
> HW pg size (starting with default PAGE_SIZE and higher). The page_list and the
> selected HW size is then used size the SG table and build the minimal SG list
> depending on HW size. Similar to sg_alloc_table_from_pages().

It is unclear if saving memory like this is really needed or not, but
yes, that is possible.

> 2. Leave the current pinning and SG list creation same.
> https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192

Optimizing the SG table should optimize the dma_mapping overhead,
particularly with IOMMU, so I'd prefer to start by combining sg
entries.
 
> Additionally a question I have is what HW size do we select for a contiguous MR.
> For example, if a driver says its support 4K, 64K and 1G. And we get an MR of 128K
> that is contiguous, what does the core select for HW size since all 3 can be done?
> Maybe the largest supported HW size that is <= 1/2 of MR len?
> So in this case, it would be 64K. 

It should break it into the largest driver supported page size - the
goal of the algorithm is to give the driver the smallest list of
pages.

The tricky bit is that when we create the sg list the DMA mapping will
preserve the sizes, but not necessarily the alignment. So we might get
a 2M page starting at some random unaligned 4k offset. HW can't deal
with this, so another pass is needed post DMA mapping working on the
physical pages that would break up any unalignment.

AFAIK this is an obscure case that does not happen in the common high
performance path we expect.

The HW driver would have to iterate over the SG list with a helper
something like

for (ib_umem_start_phys_iter(umem, &state); state->len;
     ib_umem_next_phys_iter(umem, &state) {
     .. add to hw table..->phys = state->physaddr;
     .. add to hw table..->len = state->len;
}

Where the helpers guarantee that state->physaddr is properly aligned
for state->len (they only make things smaller, not bigger)

Something that looks kind of like this off the cuff untested thing, at
least for the sg pass:

struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
					struct page *page, unsigned long npages,
					unsigned long supported_pagesz)
{
	while (npages) {
		unsigned long pfn = page_to_pfn(page);
		/* How many trailing zero bits are in the page's kernel
		 * address? */
		unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
		unsigned int len_pages;
		unsigned int page_bit;

		/* Figure out the highest set bit in the supported sizes that
		 * is below the alignment of the page. ie if we have 12 zeros
		 * then we want to return bit 12 is set (ie the value 13).
		 * This algorithm guarantees that all pages are at least
		 * aligned on their size in the kernel address.
		 */
		page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
		if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
			return NULL;

		len_pages =
			min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
		sg_end = sg_next(sg_end);
		page = pfn_to_page(pfn + len_pages);
	}
	return sg_end;
}

/*
 * page_list is an array of npage struct page pointers, where each page is
 * PAGE_SIZE supported_pagesz is a bit mask of driver supported page sizes
 * where each bit indicates a supported page size, ie (1<<12) indicates that
 * 4K pages are supported.
 */
struct scatterlist *ib_umem_add_page_list(struct scatterlist *sg_end,
					  struct page **page_list,
					  unsigned long npages,
					  u64 supported_pagesz)
{
	unsigned long i = 0;

	while (i != npages) {
		struct page *first_page = page_list[i];
		unsigned long first_pfn = page_to_pfn(first_page);
		unsigned long len;

		/* Compute the number of contiguous pages we have starting at
		 * i */
		for (len = 0; i != npages &&
			      first_pfn + len == page_to_pfn(page_list[i]);
		     i++, len++)
			;

		/* Place the N contiguous pages into the scatter table */
		sg_end = ib_umem_add_page_sg(sg_end, first_page, len,
					     supported_pagesz);
		if (unlikely(!sg_end))
			return NULL;
	}
	return sg_end;
}
Shiraz Saleem Aug. 22, 2018, 9:23 p.m. UTC | #12
> > > Ideally I'd like this in the core code.. So if you are going to write
> > > and test something new, lets do that.

Hi Jason - Thanks for sharing your thoughts. I think I have a better handle on what needs to be
done, but I do have further questions. See inline.
 
> > 
> > Ok. As you suggested in earlier thread, we could have drivers pass in bitmap of supported
> > HW pg sizes to ib_umem_get and it would returned a selected size along with whether the
> > MR is contiguous or not.
> > 
> > With respect to the algo in ib_umem_get, I am considering 2 options --
> > 
> > 1. Scan the entire page list returned after pinning check if satisfies the
> > HW pg size (starting with default PAGE_SIZE and higher). The page_list and the
> > selected HW size is then used size the SG table and build the minimal SG list
> > depending on HW size. Similar to sg_alloc_table_from_pages().
> 
> It is unclear if saving memory like this is really needed or not, but
> yes, that is possible.
> 
> > 2. Leave the current pinning and SG list creation same.
> > https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192
> 
> Optimizing the SG table should optimize the dma_mapping overhead,
> particularly with IOMMU, so I'd prefer to start by combining sg
> entries.
>  
> > Additionally a question I have is what HW size do we select for a contiguous MR.
> > For example, if a driver says its support 4K, 64K and 1G. And we get an MR of 128K
> > that is contiguous, what does the core select for HW size since all 3 can be done?
> > Maybe the largest supported HW size that is <= 1/2 of MR len?
> > So in this case, it would be 64K. 
> 
> It should break it into the largest driver supported page size - the
> goal of the algorithm is to give the driver the smallest list of
> pages.
> 
> The tricky bit is that when we create the sg list the DMA mapping will
> preserve the sizes, but not necessarily the alignment. So we might get
> a 2M page starting at some random unaligned 4k offset. HW can't deal
> with this, so another pass is needed post DMA mapping working on the
> physical pages that would break up any unalignment.
> 
> AFAIK this is a obscure case that does not happen in the common high
> performance path we expect.
> 
> The HW driver would have to iterate ove the SG list with a helper
> something like
> 
> for (ib_umem_start_phys_iter(umem, &state); state->len;
>      ib_umem_next_phys_iter(umem, &state) {
>      .. add to hw table..->phys = state->physaddr;
>      .. add to hw table..->len = state->len;
> }
> 
> Where the helpers guarantee that state->physaddr is properly aligned
> for state->len (they only make things smaller, not bigger)

Not sure I am clear with this part "they only make things smaller, not bigger".
If the core chose 2M HW PG size and the start of the MR is at an offset
into the 2M page, then in driver we supply a 2M aligned address from the
SG entry into HW table, but HW can still do 2M. Is this the case you are referring to?

> Something that looks kind of like this off the cuff untested thing, at
> least for the sg pass:
> 
> struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> 					struct page *page, unsigned long npages,
> 					unsigned long supported_pagesz)
> {
> 	while (npages) {
> 		unsigned long pfn = page_to_pfn(page);
> 		/* How many trailing zero bits are in the page's kernel
> 		 * address? */
> 		unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> 		unsigned int len_pages;
> 		unsigned int page_bit;
> 
> 		/* Figure out the highest set bit in the supported sizes that
> 		 * is below the alignment of the page. ie if we have 12 zeros
> 		 * then we want to return bit 12 is set (ie the value 13).
> 		 * This algorithm guarantees that all pages are at least
> 		 * aligned on their size in the kernel address.
> 		 */

Comment doesn't align with code. page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?

> 		page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> 		if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> 			return NULL;
> 
> 		len_pages =
> 			min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));

For a 2M huge page where the start addr is offset into a 2M page, it looks like
we are creating 4K SG entries up to the naturally aligned boundary followed by
single SG 2M entries. Is that the intent?

> 		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> 		sg_end = sg_next(sg_end);
> 		page = pfn_to_page(pfn + len_pages);

npages -= len_pages;?

> 	}
> 	return sg_end;
> }
> 
> /*
>  * page_list is an array of npage struct page pointers, where each page is
>  * PAGE_SIZE supported_pagesz is a bit mask of driver supported page sizes
>  * where each bit indicates a supported page size, ie (1<<12) indicates that
>  * 4K pages are supported.
>  */

Does the page_list array contain all the npages of the MR when this function is called?
Or is this function called after each pinning of min (npages, 512) pages?
https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
I am guessing it's the former, in which case don't we have the same problem of saving
memory as with Option #1 I proposed?

> struct scatterlist *ib_umem_add_page_list(struct scatterlist *sg_end,
> 					  struct page **page_list,
> 					  unsigned long npages,
> 					  u64 supported_pagesz)
> {
> 	unsigned long i = 0;
> 
> 	while (i != npages) {
> 		struct page *first_page = page_list[i];
> 		unsigned long first_pfn = page_to_pfn(first_page);
> 		unsigned long len;
> 
> 		/* Compute the number of contiguous pages we have starting at
> 		 * i */
> 		for (len = 0; i != npages &&
> 			      first_pfn + len == page_to_pfn(page_list[i]);
> 		     i++, len++)
> 			;
> 
> 		/* Place the N continugous pages into the scatter table */
> 		sg_end = ib_umem_add_page_sg(sg_end, first_page, len,
> 					     supported_pagesz);
> 		if (unlikely(!sg_end))
> 			return NULL;
> 	}
> 	return sg_end;
> }
Jason Gunthorpe Aug. 22, 2018, 9:43 p.m. UTC | #13
On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote:
> > > > Ideally I'd like this in the core code.. So if you are going to write
> > It should break it into the largest driver supported page size - the
> > goal of the algorithm is to give the driver the smallest list of
> > pages.
> > 
> > The tricky bit is that when we create the sg list the DMA mapping will
> > preserve the sizes, but not necessarily the alignment. So we might get
> > a 2M page starting at some random unaligned 4k offset. HW can't deal
> > with this, so another pass is needed post DMA mapping working on the
> > physical pages that would break up any unalignment.
> > 
> > AFAIK this is a obscure case that does not happen in the common high
> > performance path we expect.
> > 
> > The HW driver would have to iterate ove the SG list with a helper
> > something like
> > 
> > for (ib_umem_start_phys_iter(umem, &state); state->len;
> >      ib_umem_next_phys_iter(umem, &state) {
> >      .. add to hw table..->phys = state->physaddr;
> >      .. add to hw table..->len = state->len;
> > }
> > 
> > Where the helpers guarantee that state->physaddr is properly aligned
> > for state->len (they only make things smaller, not bigger)
> 
> Not sure I am clear with this part "they only make things smaller,
> not bigger".  If the core chose 2M HW PG size and the start of the
> MR is at an offset into the 2M page, then in driver we supply a 2M
> aligned address from the SG entry into HW table, but HW can still do
> 2M. Is this the case you are referring to?

Sort of, if there is no DMA mapping then the alignment and sizing here
should match the alignment and sizing created below, so nothing should
change.

With the IOMMU we might get a situation where a 2M aligned region is
IOMMU mapped to a 2M unaligned region. In this case we have to break
the 2M pages back up into 4k pages (or whatever)

Or, another case, would be HW that uses a 64K native page size (eg
PPC), and a driver that only supports 4K.

In this case the 64K pages would be reduced to aligned 4k ranges for
the driver.

> > Something that looks kind of like this off the cuff untested thing, at
> > least for the sg pass:
> > 
> > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> > 					struct page *page, unsigned long npages,
> > 					unsigned long supported_pagesz)
> > {
> > 	while (npages) {
> > 		unsigned long pfn = page_to_pfn(page);
> > 		/* How many trailing zero bits are in the page's kernel
> > 		 * address? */
> > 		unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> > 		unsigned int len_pages;
> > 		unsigned int page_bit;
> > 
> > 		/* Figure out the highest set bit in the supported sizes that
> > 		 * is below the alignment of the page. ie if we have 12 zeros
> > 		 * then we want to return bit 12 is set (ie the value 13).
> > 		 * This algorithm guarantees that all pages are at least
> > 		 * aligned on their size in the kernel address.
> > 		 */
> 
> Comment doesnt align with code. page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?

I didn't test it, comment is describing what to do :)

> > 		page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> > 		if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> > 			return NULL;
> > 
> > 		len_pages =
> > 			min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
> 
> For a 2M huge page where the start addr is offset into a 2M page, it looks like
> we are creating 4K SG entries upto the naturally aligned boundary followed by
> single SG 2M entries. Is that the intent?

Yes. Or more specifically, this is a greedy algorithm, it always
selects the largest aligned block it can, and consumes those
pages.

For instance if we have a 4M region composed of 2M huge pages
And the driver can support 2M,1M and 4k
and the start of the MR is offset 4k into the region

We should get 255 4K pages, 1 1M page, 1 2M page.
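
(Checking the arithmetic: the MR covers 4M - 4k starting 4k into the
region; 255 * 4k fills the gap up to the first 1M boundary (1M - 4k),
one 1M block reaches the 2M boundary, and one 2M block covers the rest:
(1M - 4k) + 1M + 2M = 4M - 4k.)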

> > 		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> > 		sg_end = sg_next(sg_end);
> > 		page = pfn_to_page(pfn + len_pages);
> 
> npages -= len_pages;?

Yep

> > 	return sg_end;
> > }
> > 
> > /*
> >  * page_list is an array of npage struct page pointers, where each page is
> >  * PAGE_SIZE supported_pagesz is a bit mask of driver supported page sizes
> >  * where each bit indicates a supported page size, ie (1<<12) indicates that
> >  * 4K pages are supported.
> >  */
> 
> Does page_list array contains all the npages of the MR when this function is called?
> Or is this function called after each pinning of min (npages, 512) pages?
> https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
> I am guessing its the former in which case, dont we have the same problem of saving
> off memory as Option #1 I proposed?

It has to be the page list after pinning, yes.

And yes, this doesn't attempt to solve the problem with overallocating
the sg table. We can solve it in a couple of different ways, but that can
be a followup, IMHO.

After thinking about this some more I might suggest a slight change..

Have the sg table contain the largest contiguous regions possible, and
ignore alignment or driver capability. This minimizes the number of sg
table entries. Ie in the example above it would simply be an SG entry of
length 4M-4k

This is friendliest to the IOMMU, as now the IOMMU could use IOMMU huge
pages (if supported) and the IOMMU huge page size may differ from the
driver supported size.

Next, when processing in the driver, break up the giant sg entries
with a helper, as described. The helper is a bit more complicated
because the realignment work is pushed into it, but not seriously so.
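
A rough, untested sketch of the splitting step such a helper would do
for one merged SG entry (the helper name is made up, and '.. add to hw
table..' follows the shorthand used above):

/* Largest HW-supported, naturally aligned block starting at 'addr'
 * that still fits in the remaining 'len' bytes. */
static u64 next_aligned_block(u64 addr, u64 len, unsigned long supported_pgsz)
{
	unsigned long sizes = supported_pgsz;

	/* keep only sizes the current address is aligned to ... */
	if (addr)
		sizes &= GENMASK(__ffs64(addr), 0);
	/* ... and that fit in what is left */
	sizes &= GENMASK(fls64(len) - 1, 0);
	if (WARN_ON(!sizes))	/* supported_pgsz must include PAGE_SIZE */
		return 0;
	return 1ULL << __fls(sizes);
}

	/* driver loop for one SG entry */
	u64 addr = sg_dma_address(sg);
	u64 left = sg_dma_len(sg);

	while (left) {
		u64 blk = next_aligned_block(addr, left, supported_pgsz);

		.. add to hw table..->phys = addr;
		.. add to hw table..->len = blk;
		addr += blk;
		left -= blk;
	}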

Second, I think we have some HW that can handle only one page size? ie
if they want to use 2M pages then all pages must be 2M?

Another helper would be needed to sweep the sg list to get the best
page size, like this:

 best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap);
 for (ib_umem_start_phys_iter(umem, &state, best_size_bitmap); state->len;
      ib_umem_next_phys_iter(umem, &state) {
      .. add to hw table..->phys = state->physaddr;
      .. add to hw table..->len = state->len;
 }

HW that can support mixed pages would just use best_size ==
supported_sizes bitmaps

And we could support HW that can optimize with unaligned pages too, if
there is such a thing.
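
A first, untested stab at what that sweep could compute -- the largest
size in the bitmap that every interior SG boundary is aligned to, with
the first entry's start and the last entry's end left out since HW
normally takes an offset and length for those:

static unsigned long ib_umem_find_single_page_size(struct ib_umem *umem,
						   unsigned long supported_sizes)
{
	struct scatterlist *sg;
	unsigned long bits = 0;
	int i;

	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
		/* every interior boundary must land on the chosen size */
		if (i != 0)
			bits |= sg_dma_address(sg);
		if (i != umem->nmap - 1)
			bits |= sg_dma_address(sg) + sg_dma_len(sg);
	}

	/* drop sizes larger than the smallest boundary alignment ... */
	if (bits)
		supported_sizes &= GENMASK(__ffs(bits), 0);
	/* ... and return the largest size that survived, if any */
	return supported_sizes ? 1UL << __fls(supported_sizes) : 0;
}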

Jason
Shiraz Saleem Aug. 28, 2018, 7:27 p.m. UTC | #14
On Wed, Aug 22, 2018 at 03:43:56PM -0600, Jason Gunthorpe wrote:
> On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote:
> > > > > Ideally I'd like this in the core code.. So if you are going to write
> > > It should break it into the largest driver supported page size - the
> > > goal of the algorithm is to give the driver the smallest list of
> > > pages.
> > > 
> > > The tricky bit is that when we create the sg list the DMA mapping will
> > > preserve the sizes, but not necessarily the alignment. So we might get
> > > a 2M page starting at some random unaligned 4k offset. HW can't deal
> > > with this, so another pass is needed post DMA mapping working on the
> > > physical pages that would break up any unalignment.
> > > 
> > > AFAIK this is a obscure case that does not happen in the common high
> > > performance path we expect.
> > > 
> > > The HW driver would have to iterate ove the SG list with a helper
> > > something like
> > > 
> > > for (ib_umem_start_phys_iter(umem, &state); state->len;
> > >      ib_umem_next_phys_iter(umem, &state) {
> > >      .. add to hw table..->phys = state->physaddr;
> > >      .. add to hw table..->len = state->len;
> > > }
> > > 
> > > Where the helpers guarantee that state->physaddr is properly aligned
> > > for state->len (they only make things smaller, not bigger)
> > 
> > Not sure I am clear with this part "they only make things smaller,
> > not bigger".  If the core chose 2M HW PG size and the start of the
> > MR is at an offset into the 2M page, then in driver we supply a 2M
> > aligned address from the SG entry into HW table, but HW can still do
> > 2M. Is this the case you are referring to?
> 
> Sort of, if there is no DMA mapping then the alignment and sizing here
> should match the alignment and sizing created below, so nothing should
> change.
> 
> With the IOMMU we might get a situation where a 2M aligned region is
> IOMMU mapped to a 2M unaligned region. In this case we have to break
> the 2M pages back up into 4k pages (or whatever)
> 
> Or, another case, would be HW that uses a 64K native page size (eg
> PPC), and a driver that only supports 4K.
> 
> In this case the 64K pages would be reduced to aligned 4k ranges for
> the driver.
> 
> > > Something that looks kind of like this off the cuff untested thing, at
> > > least for the sg pass:
> > > 
> > > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> > > 					struct page *page, unsigned long npages,
> > > 					unsigned long supported_pagesz)
> > > {
> > > 	while (npages) {
> > > 		unsigned long pfn = page_to_pfn(page);
> > > 		/* How many trailing zero bits are in the page's kernel
> > > 		 * address? */
> > > 		unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> > > 		unsigned int len_pages;
> > > 		unsigned int page_bit;
> > > 
> > > 		/* Figure out the highest set bit in the supported sizes that
> > > 		 * is below the alignment of the page. ie if we have 12 zeros
> > > 		 * then we want to return bit 12 is set (ie the value 13).
> > > 		 * This algorithm guarantees that all pages are at least
> > > 		 * aligned on their size in the kernel address.
> > > 		 */
> > 
> > Comment doesnt align with code. page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?
> 
> I didn't test it, comment is describing what to do :)
> 
> > > 		page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> > > 		if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> > > 			return NULL;
> > > 
> > > 		len_pages =
> > > 			min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
> > 
> > For a 2M huge page where the start addr is offset into a 2M page, it looks like
> > we are creating 4K SG entries upto the naturally aligned boundary followed by
> > single SG 2M entries. Is that the intent?
> 
> Yes. Or more specifically, this is a greedy algorithim, it always
> selects the largest aligned block it can, and consumes those
> pages.
> 
> For instance if we have a 4M region composed of 2M huge pages
> And the driver can support 2M,1M and 4k
> and the start of the MR is offset 4k into the region
> 
> We should get 255 4K pages, 1 1M page, 1 2M page.
> 
> > > 		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> > > 		sg_end = sg_next(sg_end);
> > > 		page = pfn_to_page(pfn + len_pages);
> > 
> > npages -= len_pages;?
> 
> Yep
> 
> > > 	return sg_end;
> > > }
> > > 
> > > /*
> > >  * page_list is an array of npage struct page pointers, where each page is
> > >  * PAGE_SIZE supported_pagesz is a bit mask of driver supported page sizes
> > >  * where each bit indicates a supported page size, ie (1<<12) indicates that
> > >  * 4K pages are supported.
> > >  */
> > 
> > Does page_list array contains all the npages of the MR when this function is called?
> > Or is this function called after each pinning of min (npages, 512) pages?
> > https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
> > I am guessing its the former in which case, dont we have the same problem of saving
> > off memory as Option #1 I proposed?
> 
> It has to be the page list after pinning, yes.
> 
> And yes, this doesn't attempt to solve the problem with overallocating
> the sg table. We can solve in a couple of different ways, but that can
> be a followup, IMHO.
> 
> After thinking about this some more I might suggest a slight changes..
> 
> Have the sg table contain the largest contiguous regions possible, and
> ignore alignment or driver capability. This minimizes the sg table
> entry size. Ie in the example above it would simply be a SG entry of
> length 4M-4k

OK.
 
> This is friendliest to the IOMMU, as now the IOMMU could use IOMMU huge
> pages (if supported) and the IOMMU huge page size may differ from the
> driver supported size.
> 
> Next, when processing in the driver, break up the giant sg entries
> with a helper, as described. The helper is a bit more complicated
> because the realignment work is pushed into it, but not seriously so.
> 
> Second, I think we have some HW that can handle only one page size? ie
> if they want to use 2M pages then all pages must be 2M?

Yes.
 
> Another helper would be needed to sweep the sg list to get the best
> page size, like this:
> 
>  best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap);
>  for (ib_umem_start_phys_iter(umem, &statem, best_size_bitmap); state->len;
>       ib_umem_next_phys_iter(umem, &state) {
>       .. add to hw table..->phys = state->physaddr;
>       .. add to hw table..->len = state->len;
>  }
> 
> HW that can support mixed pages would just use best_size ==
> supported_sizes bitmaps
> 
OK. Let me spend some time now on the implementation aspects.

Shiraz

Patch

diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index 8884ff71a634..7d85414742ff 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -1410,6 +1410,7 @@  static void i40iw_set_hugetlb_values(u64 addr, struct i40iw_mr *iwmr)
 	struct vm_area_struct *vma;
 	struct hstate *h;

+	down_read(&current->mm->mmap_sem);
 	vma = find_vma(current->mm, addr);
 	if (vma && is_vm_hugetlb_page(vma)) {
 		h = hstate_vma(vma);
@@ -1418,6 +1419,7 @@  static void i40iw_set_hugetlb_values(u64 addr, struct i40iw_mr *iwmr)
 			iwmr->page_msk = huge_page_mask(h);
 		}
 	}
+	up_read(&current->mm->mmap_sem);
 }

 /**