mbox series

[rdma-next,v5,0/4] Dynamicaly allocate SG table from the pages

Message ID 20201004154340.1080481-1-leon@kernel.org (mailing list archive)
Headers show
Series Dynamicaly allocate SG table from the pages | expand

Message

Leon Romanovsky Oct. 4, 2020, 3:43 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v5:
 * Use sg_init_table to allocate table and avoid changes is __sg_alloc_table
 * Fix offset issue
v4: https://lore.kernel.org/lkml/20200927064647.3106737-1-leon@kernel.org
 * Fixed formatting in first patch.
 * Added fix (clear tmp_netnts) in first patch to fix i915 failure.
 * Added test patches
v3: https://lore.kernel.org/linux-rdma/20200922083958.2150803-1-leon@kernel.org/
 * Squashed Christopher's suggestion to avoid introduced new API, but extend existing one.
v2: https://lore.kernel.org/linux-rdma/20200916140726.839377-1-leon@kernel.org
 * Fixed indentations and comments
 * Deleted sg_alloc_next()
 * Squashed lib/scatterlist patches into one
v1: https://lore.kernel.org/lkml/20200910134259.1304543-1-leon@kernel.org
 * Changed _sg_chain to be __sg_chain
 * Added dependency on ARCH_NO_SG_CHAIN
 * Removed struct sg_append
v0:
 * https://lore.kernel.org/lkml/20200903121853.1145976-1-leon@kernel.org

--------------------------------------------------------------------------
From Maor:

This series extends __sg_alloc_table_from_pages to allow chaining of
new pages to already initialized SG table.

This allows for the drivers to utilize the optimization of merging contiguous
pages without a need to pre allocate all the pages and hold them in
a very large temporary buffer prior to the call to SG table initialization.

The second patch changes the Infiniband driver to use the new API. It
removes duplicate functionality from the code and benefits the
optimization of allocating dynamic SG table from pages.

In huge pages system of 2MB page size, without this change, the SG table
would contain x512 SG entries.
E.g. for 100GB memory registration:

             Number of entries      Size
    Before        26214400          600.0MB
    After            51200            1.2MB

Thanks

Maor Gottlieb (2):
  lib/scatterlist: Add support in dynamic allocation of SG table from
    pages
  RDMA/umem: Move to allocate SG table from pages

Tvrtko Ursulin (2):
  tools/testing/scatterlist: Rejuvenate bit-rotten test
  tools/testing/scatterlist: Show errors in human readable form

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 ++-
 drivers/infiniband/core/umem.c              |  94 ++-------------
 include/linux/scatterlist.h                 |  38 +++---
 lib/scatterlist.c                           | 125 ++++++++++++++++----
 tools/testing/scatterlist/Makefile          |   3 +-
 tools/testing/scatterlist/linux/mm.h        |  35 ++++++
 tools/testing/scatterlist/main.c            |  53 ++++++---
 8 files changed, 225 insertions(+), 150 deletions(-)

--
2.26.2

Comments

Jason Gunthorpe Oct. 5, 2020, 11:56 p.m. UTC | #1
On Sun, Oct 04, 2020 at 06:43:36PM +0300, Leon Romanovsky wrote:
> This series extends __sg_alloc_table_from_pages to allow chaining of
> new pages to already initialized SG table.
> 
> This allows for the drivers to utilize the optimization of merging contiguous
> pages without a need to pre allocate all the pages and hold them in
> a very large temporary buffer prior to the call to SG table initialization.
> 
> The second patch changes the Infiniband driver to use the new API. It
> removes duplicate functionality from the code and benefits the
> optimization of allocating dynamic SG table from pages.
> 
> In huge pages system of 2MB page size, without this change, the SG table
> would contain x512 SG entries.
> E.g. for 100GB memory registration:
> 
>              Number of entries      Size
>     Before        26214400          600.0MB
>     After            51200            1.2MB
> 
> Thanks
> 
> Maor Gottlieb (2):
>   lib/scatterlist: Add support in dynamic allocation of SG table from
>     pages
>   RDMA/umem: Move to allocate SG table from pages
> 
> Tvrtko Ursulin (2):
>   tools/testing/scatterlist: Rejuvenate bit-rotten test
>   tools/testing/scatterlist: Show errors in human readable form

This looks OK, I'm going to send it into linux-next on the hmm tree
for awhile to see if anything gets broken. If there is more
remarks/tags/etc please continue

Thanks,
Jason
Daniel Vetter Oct. 6, 2020, 10:41 a.m. UTC | #2
On Mon, Oct 05, 2020 at 08:56:50PM -0300, Jason Gunthorpe wrote:
> On Sun, Oct 04, 2020 at 06:43:36PM +0300, Leon Romanovsky wrote:
> > This series extends __sg_alloc_table_from_pages to allow chaining of
> > new pages to already initialized SG table.
> > 
> > This allows for the drivers to utilize the optimization of merging contiguous
> > pages without a need to pre allocate all the pages and hold them in
> > a very large temporary buffer prior to the call to SG table initialization.
> > 
> > The second patch changes the Infiniband driver to use the new API. It
> > removes duplicate functionality from the code and benefits the
> > optimization of allocating dynamic SG table from pages.
> > 
> > In huge pages system of 2MB page size, without this change, the SG table
> > would contain x512 SG entries.
> > E.g. for 100GB memory registration:
> > 
> >              Number of entries      Size
> >     Before        26214400          600.0MB
> >     After            51200            1.2MB
> > 
> > Thanks
> > 
> > Maor Gottlieb (2):
> >   lib/scatterlist: Add support in dynamic allocation of SG table from
> >     pages
> >   RDMA/umem: Move to allocate SG table from pages
> > 
> > Tvrtko Ursulin (2):
> >   tools/testing/scatterlist: Rejuvenate bit-rotten test
> >   tools/testing/scatterlist: Show errors in human readable form
> 
> This looks OK, I'm going to send it into linux-next on the hmm tree
> for awhile to see if anything gets broken. If there is more
> remarks/tags/etc please continue

An idea that just crossed my mind: A pin_user_pages_sgt might be useful
for both rdma and drm, since this would avoid the possible huge interim
struct pages array for thp pages. Or anything else that could be coalesced
down into a single sg entry.

Not sure it's worth it, but would at least give a slightly neater
interface I think.
-Daniel
Daniel Vetter Oct. 7, 2020, 8:15 a.m. UTC | #3
On Wed, Oct 7, 2020 at 9:22 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Oct 06, 2020 at 12:41:22PM +0200, Daniel Vetter wrote:
> > On Mon, Oct 05, 2020 at 08:56:50PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Oct 04, 2020 at 06:43:36PM +0300, Leon Romanovsky wrote:
> > > > This series extends __sg_alloc_table_from_pages to allow chaining of
> > > > new pages to already initialized SG table.
> > > >
> > > > This allows for the drivers to utilize the optimization of merging contiguous
> > > > pages without a need to pre allocate all the pages and hold them in
> > > > a very large temporary buffer prior to the call to SG table initialization.
> > > >
> > > > The second patch changes the Infiniband driver to use the new API. It
> > > > removes duplicate functionality from the code and benefits the
> > > > optimization of allocating dynamic SG table from pages.
> > > >
> > > > In huge pages system of 2MB page size, without this change, the SG table
> > > > would contain x512 SG entries.
> > > > E.g. for 100GB memory registration:
> > > >
> > > >              Number of entries      Size
> > > >     Before        26214400          600.0MB
> > > >     After            51200            1.2MB
> > > >
> > > > Thanks
> > > >
> > > > Maor Gottlieb (2):
> > > >   lib/scatterlist: Add support in dynamic allocation of SG table from
> > > >     pages
> > > >   RDMA/umem: Move to allocate SG table from pages
> > > >
> > > > Tvrtko Ursulin (2):
> > > >   tools/testing/scatterlist: Rejuvenate bit-rotten test
> > > >   tools/testing/scatterlist: Show errors in human readable form
> > >
> > > This looks OK, I'm going to send it into linux-next on the hmm tree
> > > for awhile to see if anything gets broken. If there is more
> > > remarks/tags/etc please continue
> >
> > An idea that just crossed my mind: A pin_user_pages_sgt might be useful
> > for both rdma and drm, since this would avoid the possible huge interim
> > struct pages array for thp pages. Or anything else that could be coalesced
> > down into a single sg entry.
> >
> > Not sure it's worth it, but would at least give a slightly neater
> > interface I think.
>
> We've talked about it. Christoph wants to see this area move to a biovec
> interface instead of sgl, but it might still be worthwhile to have an
> interm step at least as an API consolidation.

Hm but then we'd need a new struct for the mapped side of things
(which would still be what you get from dma-buf). That would be quite
a bit of work to roll out everywhere, and sgt isn't such a huge misfit
for passing buffer object mappings and system memory backing storage
around, and hence what we (very slowly) converging drivers/gpu towards
over the past 10 years or so.

And moving the dma_map step out of dma-buf doesn't work, because some
of the use-cases we have is for very special iommus which are managed
by the gpu driver directly. Stuff that e.g. rotates/retiles/compresses
on the fly, and is accessible by other (gfx related like video code,
camera, ..) devices. Not something I expect to ever be relevant for
rdma since this exist mostly on some small soc, but it's a thing.
Without that dma-buf could hand out biovec for struct_page backed
stuff, or some pfn_vec for the p2p stuff.

Anyway was just an idea, I guess we'll have to live with some
impedance mismatch since rolling out the one an only iovec structure
which suits everyone is I think impossible :-)

> Avoiding the page list would be complicated as we'd somehow have to
> code share the page table iterator scheme.

We're (slowly) getting towards thp for vram mappings and everything so
I guess for drivers/gpu we might make that happen. But yeah it'd be
not so pretty I think.
-Daniel