
[net-next] xsk: introduce xsk_dma_ops

Message ID 20230417032750.7086-1-xuanzhuo@linux.alibaba.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series: [net-next] xsk: introduce xsk_dma_ops

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 22 this patch: 27
netdev/cc_maintainers success CCed 14 of 14 maintainers
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 22 this patch: 22
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Xuan Zhuo April 17, 2023, 3:27 a.m. UTC
The purpose of this patch is to allow the driver to pass its own DMA
mapping callbacks (dma_ops) to xsk.

This is to cope with the case of virtio-net. If virtio does not have
VIRTIO_F_ACCESS_PLATFORM, then virtio cannot use the DMA API. In that
case, XSK cannot use the DMA API directly to obtain DMA addresses, so
XSK must support using the driver's own dma_ops.

On the other hand, XSK, as high-level code, should leave the underlying
DMA operations to the driver layer. The driver layer determines how the
DMA is finally done; XSK should not make such assumptions. Everything
becomes simpler if DMA is handled at the driver level.

More is here:
    https://lore.kernel.org/virtualization/1681265026.6082013-1-xuanzhuo@linux.alibaba.com/

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/net/xdp_sock_drv.h  | 20 +++++++++++++++-
 include/net/xsk_buff_pool.h | 19 +++++++++++++++
 net/xdp/xsk_buff_pool.c     | 47 +++++++++++++++++++++++--------------
 3 files changed, 68 insertions(+), 18 deletions(-)
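
For readers skimming the archive, the callback set this patch introduces can be
reconstructed from the hunks quoted by the kernel test robot and reviewers below.
A rough sketch (member order and some types are assumptions, and members such as
an unmap callback may be missing here; the posted diff is authoritative):

struct xsk_dma_ops {
	dma_addr_t (*map_page)(struct device *dev, struct page *page,
			       unsigned long offset, size_t size,
			       enum dma_data_direction dir,
			       unsigned long attrs);
	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
	bool (*need_sync)(struct device *dev, dma_addr_t dma_addr);
	void (*sync_single_range_for_device)(struct device *dev,
					     dma_addr_t dma_addr,
					     unsigned long offset, size_t size,
					     enum dma_data_direction dir);
	void (*sync_single_range_for_cpu)(struct device *dev,
					  dma_addr_t dma_addr,
					  unsigned long offset, size_t size,
					  enum dma_data_direction dir);
};

When no ops are supplied, xp_dma_map() fills these with the plain DMA API helpers
(see the code excerpt in the robot report below); a driver such as virtio-net
would install its own callbacks instead.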

Comments

Christoph Hellwig April 17, 2023, 4:24 a.m. UTC | #1
On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> The purpose of this patch is to allow driver pass the own dma_ops to
> xsk.

Drivers have no business passing around dma_ops, or even knowing about
them.
Xuan Zhuo April 17, 2023, 5:58 a.m. UTC | #2
On Sun, 16 Apr 2023 21:24:32 -0700, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> > The purpose of this patch is to allow driver pass the own dma_ops to
> > xsk.
>
> Drivers have no business passing around dma_ops, or even knowing about
> them.


There may be a misunderstanding: the "dma_ops" here is not the "dma_ops" of the DMA API.

I mean the callbacks that xsk uses to do DMA.

Maybe I should rename it in the next version.

Thanks.
kernel test robot April 17, 2023, 6:38 a.m. UTC | #3
Hi Xuan,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
patch link:    https://lore.kernel.org/r/20230417032750.7086-1-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next] xsk: introduce xsk_dma_ops
config: i386-randconfig-a011-20230417 (https://download.01.org/0day-ci/archive/20230417/202304171427.Uaryn9jl-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/28e766603a33761d7bd1fdd3e107595408319f7d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
        git checkout 28e766603a33761d7bd1fdd3e107595408319f7d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304171427.Uaryn9jl-lkp@intel.com/

All errors (new ones prefixed by >>):

>> net/xdp/xsk_buff_pool.c:430:26: error: incompatible function pointer types assigning to 'dma_addr_t (*)(struct device *, struct page *, unsigned long, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (*)(struct device *, struct page *, unsigned long, unsigned int, enum dma_data_direction, unsigned long)') from 'dma_addr_t (struct device *, struct page *, size_t, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (struct device *, struct page *, unsigned int, unsigned int, enum dma_data_direction, unsigned long)') [-Werror,-Wincompatible-function-pointer-types]
                   pool->dma_ops.map_page = dma_map_page_attrs;
                                          ^ ~~~~~~~~~~~~~~~~~~
   1 error generated.


vim +430 net/xdp/xsk_buff_pool.c

   409	
   410	int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
   411		       struct xsk_dma_ops *dma_ops,
   412		       unsigned long attrs, struct page **pages, u32 nr_pages)
   413	{
   414		struct xsk_dma_map *dma_map;
   415		dma_addr_t dma;
   416		int err;
   417		u32 i;
   418	
   419		dma_map = xp_find_dma_map(pool);
   420		if (dma_map) {
   421			err = xp_init_dma_info(pool, dma_map);
   422			if (err)
   423				return err;
   424	
   425			refcount_inc(&dma_map->users);
   426			return 0;
   427		}
   428	
   429		if (!dma_ops) {
 > 430			pool->dma_ops.map_page = dma_map_page_attrs;
   431			pool->dma_ops.mapping_error = dma_mapping_error;
   432			pool->dma_ops.need_sync = dma_need_sync;
   433			pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
   434			pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
   435			dma_ops = &pool->dma_ops;
   436		} else {
   437			pool->dma_ops = *dma_ops;
   438		}
   439	
   440		dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
   441		if (!dma_map)
   442			return -ENOMEM;
   443	
   444		for (i = 0; i < dma_map->dma_pages_cnt; i++) {
   445			dma = dma_ops->map_page(dev, pages[i], 0, PAGE_SIZE,
   446						DMA_BIDIRECTIONAL, attrs);
   447			if (dma_ops->mapping_error(dev, dma)) {
   448				__xp_dma_unmap(dma_map, dma_ops, attrs);
   449				return -ENOMEM;
   450			}
   451			if (dma_ops->need_sync(dev, dma))
   452				dma_map->dma_need_sync = true;
   453			dma_map->dma_pages[i] = dma;
   454		}
   455	
   456		if (pool->unaligned)
   457			xp_check_dma_contiguity(dma_map);
   458	
   459		err = xp_init_dma_info(pool, dma_map);
   460		if (err) {
   461			__xp_dma_unmap(dma_map, dma_ops, attrs);
   462			return err;
   463		}
   464	
   465		return 0;
   466	}
   467	EXPORT_SYMBOL(xp_dma_map);
   468
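
The error above is a parameter-type mismatch: dma_map_page_attrs() takes its page
offset as size_t, while the map_page member is declared with unsigned long, and on
32-bit targets (i386, mips32) size_t and unsigned long are distinct types. One
possible fix, not taken from any posted revision, is to declare the callback with
the exact prototype of dma_map_page_attrs():

/* Hypothetical fix for the -Wincompatible-function-pointer-types error:
 * match dma_map_page_attrs() exactly (size_t offset, not unsigned long). */
struct xsk_dma_ops {
	dma_addr_t (*map_page)(struct device *dev, struct page *page,
			       size_t offset, size_t size,
			       enum dma_data_direction dir,
			       unsigned long attrs);
	/* ... remaining callbacks unchanged ... */
};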
Michael S. Tsirkin April 17, 2023, 6:43 a.m. UTC | #4
On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> @@ -532,9 +545,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
>  	xskb->xdp.data_meta = xskb->xdp.data;
>  
>  	if (pool->dma_need_sync) {
> -		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> -						 pool->frame_len,
> -						 DMA_BIDIRECTIONAL);
> +		pool->dma_ops.sync_single_range_for_device(pool->dev, xskb->dma, 0,
> +							   pool->frame_len,
> +							   DMA_BIDIRECTIONAL);
>  	}
>  	return &xskb->xdp;
>  }
> @@ -670,15 +683,15 @@ EXPORT_SYMBOL(xp_raw_get_dma);
>  
>  void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
>  {
> -	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> -				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> +	xskb->pool->dma_ops.sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> +						      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
>  }
>  EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
>  
>  void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
>  				 size_t size)
>  {
> -	dma_sync_single_range_for_device(pool->dev, dma, 0,
> -					 size, DMA_BIDIRECTIONAL);
> +	pool->dma_ops.sync_single_range_for_device(pool->dev, dma, 0,
> +						   size, DMA_BIDIRECTIONAL);
>  }
>  EXPORT_SYMBOL(xp_dma_sync_for_device_slow);

So you add an indirect function call on data path? Won't this be costly?

> -- 
> 2.32.0.3.g01195cf9f
Xuan Zhuo April 17, 2023, 6:48 a.m. UTC | #5
On Mon, 17 Apr 2023 02:43:32 -0400, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> > @@ -532,9 +545,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
> >  	xskb->xdp.data_meta = xskb->xdp.data;
> >
> >  	if (pool->dma_need_sync) {
> > -		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > -						 pool->frame_len,
> > -						 DMA_BIDIRECTIONAL);
> > +		pool->dma_ops.sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > +							   pool->frame_len,
> > +							   DMA_BIDIRECTIONAL);
> >  	}
> >  	return &xskb->xdp;
> >  }
> > @@ -670,15 +683,15 @@ EXPORT_SYMBOL(xp_raw_get_dma);
> >
> >  void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
> >  {
> > -	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> > -				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> > +	xskb->pool->dma_ops.sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> > +						      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> >  }
> >  EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
> >
> >  void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
> >  				 size_t size)
> >  {
> > -	dma_sync_single_range_for_device(pool->dev, dma, 0,
> > -					 size, DMA_BIDIRECTIONAL);
> > +	pool->dma_ops.sync_single_range_for_device(pool->dev, dma, 0,
> > +						   size, DMA_BIDIRECTIONAL);
> >  }
> >  EXPORT_SYMBOL(xp_dma_sync_for_device_slow);
>
> So you add an indirect function call on data path? Won't this be costly?


Yes, this may introduce some cost. The good news is that in some cases
the sync is not necessary, because dma_need_sync will be false.

Thanks.


>
> > --
> > 2.32.0.3.g01195cf9f
>
kernel test robot April 17, 2023, 6:48 a.m. UTC | #6
Hi Xuan,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
patch link:    https://lore.kernel.org/r/20230417032750.7086-1-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next] xsk: introduce xsk_dma_ops
config: mips-randconfig-r021-20230416 (https://download.01.org/0day-ci/archive/20230417/202304171441.eZRwCNsh-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project 9638da200e00bd069e6dd63604e14cbafede9324)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mipsel-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/28e766603a33761d7bd1fdd3e107595408319f7d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
        git checkout 28e766603a33761d7bd1fdd3e107595408319f7d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=mips olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304171441.eZRwCNsh-lkp@intel.com/

All errors (new ones prefixed by >>):

>> net/xdp/xsk_buff_pool.c:430:26: error: incompatible function pointer types assigning to 'dma_addr_t (*)(struct device *, struct page *, unsigned long, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (*)(struct device *, struct page *, unsigned long, unsigned int, enum dma_data_direction, unsigned long)') from 'dma_addr_t (struct device *, struct page *, size_t, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (struct device *, struct page *, unsigned int, unsigned int, enum dma_data_direction, unsigned long)') [-Wincompatible-function-pointer-types]
                   pool->dma_ops.map_page = dma_map_page_attrs;
                                          ^ ~~~~~~~~~~~~~~~~~~
   1 error generated.


vim +430 net/xdp/xsk_buff_pool.c

   409	
   410	int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
   411		       struct xsk_dma_ops *dma_ops,
   412		       unsigned long attrs, struct page **pages, u32 nr_pages)
   413	{
   414		struct xsk_dma_map *dma_map;
   415		dma_addr_t dma;
   416		int err;
   417		u32 i;
   418	
   419		dma_map = xp_find_dma_map(pool);
   420		if (dma_map) {
   421			err = xp_init_dma_info(pool, dma_map);
   422			if (err)
   423				return err;
   424	
   425			refcount_inc(&dma_map->users);
   426			return 0;
   427		}
   428	
   429		if (!dma_ops) {
 > 430			pool->dma_ops.map_page = dma_map_page_attrs;
   431			pool->dma_ops.mapping_error = dma_mapping_error;
   432			pool->dma_ops.need_sync = dma_need_sync;
   433			pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
   434			pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
   435			dma_ops = &pool->dma_ops;
   436		} else {
   437			pool->dma_ops = *dma_ops;
   438		}
   439	
   440		dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
   441		if (!dma_map)
   442			return -ENOMEM;
   443	
   444		for (i = 0; i < dma_map->dma_pages_cnt; i++) {
   445			dma = dma_ops->map_page(dev, pages[i], 0, PAGE_SIZE,
   446						DMA_BIDIRECTIONAL, attrs);
   447			if (dma_ops->mapping_error(dev, dma)) {
   448				__xp_dma_unmap(dma_map, dma_ops, attrs);
   449				return -ENOMEM;
   450			}
   451			if (dma_ops->need_sync(dev, dma))
   452				dma_map->dma_need_sync = true;
   453			dma_map->dma_pages[i] = dma;
   454		}
   455	
   456		if (pool->unaligned)
   457			xp_check_dma_contiguity(dma_map);
   458	
   459		err = xp_init_dma_info(pool, dma_map);
   460		if (err) {
   461			__xp_dma_unmap(dma_map, dma_ops, attrs);
   462			return err;
   463		}
   464	
   465		return 0;
   466	}
   467	EXPORT_SYMBOL(xp_dma_map);
   468
Jakub Kicinski April 17, 2023, 6:56 p.m. UTC | #7
On Mon, 17 Apr 2023 13:58:01 +0800 Xuan Zhuo wrote:
> On Sun, 16 Apr 2023 21:24:32 -0700, Christoph Hellwig <hch@infradead.org> wrote:
> > On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:  
> > > The purpose of this patch is to allow driver pass the own dma_ops to
> > > xsk.  
> >
> > Drivers have no business passing around dma_ops, or even knowing about
> > them.  
> 
> May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA API.
> 
> I mean the callbacks for xsk to do dma.
> 
> Maybe, I should rename it in the next version.

Would you mind explaining this a bit more to folks like me who are not
familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
details from the stack, why is it not sufficient here.
Jakub Kicinski April 17, 2023, 6:57 p.m. UTC | #8
On Mon, 17 Apr 2023 11:56:10 -0700 Jakub Kicinski wrote:
> > May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA API.
> > 
> > I mean the callbacks for xsk to do dma.
> > 
> > Maybe, I should rename it in the next version.  
> 
> Would you mind explaining this a bit more to folks like me who are not
> familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
> details from the stack, why is it not sufficient here.

Umm.. also it'd help to post the user of the API in the same series.
I only see the XSK changes, maybe if the virtio changes were in 
the same series I could answer my own question.
Jason Wang April 18, 2023, 1:07 a.m. UTC | #9
On Tue, Apr 18, 2023 at 2:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 17 Apr 2023 11:56:10 -0700 Jakub Kicinski wrote:
> > > May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA API.
> > >
> > > I mean the callbacks for xsk to do dma.
> > >
> > > Maybe, I should rename it in the next version.
> >
> > Would you mind explaining this a bit more to folks like me who are not
> > familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
> > details from the stack, why is it not sufficient here.

The reason is that legacy virtio devices don't use the DMA API (see
vring_use_dma_api()).

AF_XDP assumes the netdev does DMA through the DMA API, which doesn't
hold in this case. We need a way to make it work.

Thanks

>
> Umm.. also it'd help to post the user of the API in the same series.
> I only see the XSK changes, maybe if the virtio changes were in
> the same series I could answer my own question.
>
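
For reference, a simplified paraphrase of the check Jason refers to (the real
helper lives in drivers/virtio/virtio_ring.c and also special-cases Xen domains;
this sketch is from memory, not copied from the tree):

/* Simplified: should the vring hand buffers to the DMA API? */
static bool vring_use_dma_api(const struct virtio_device *vdev)
{
	/* VIRTIO_F_ACCESS_PLATFORM negotiated: the device is subject to the
	 * platform's IOMMU/encryption rules, so real DMA mapping is required. */
	if (!virtio_has_dma_quirk(vdev))
		return true;

	/* Legacy device: guest physical addresses are passed through
	 * directly, with no DMA mapping (and no syncs) at all. */
	return false;
}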
Jakub Kicinski April 18, 2023, 1:19 a.m. UTC | #10
On Tue, 18 Apr 2023 09:07:30 +0800 Jason Wang wrote:
> > > Would you mind explaining this a bit more to folks like me who are not
> > > familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
> > > details from the stack, why is it not sufficient here.  
> 
> The reason is that legacy virtio device don't use DMA(vring_use_dma_api()).
> 
> The AF_XDP assumes DMA for netdev doesn't work in this case. We need a
> way to make it work.

Can we not push this down to the bus level? virtio has its own bus, it
can plug in whatever magic it wants into dma ops.

Doesn't have to be super fast for af_xdp's sake - for af_xdp dma mapping
is on the control path. You can keep using the if (vring_use_dma_api())
elsewhere for now if there is a perf concern.

Otherwise it really seems like we're bubbling up a virtio hack into
generic code :(
Xuan Zhuo April 18, 2023, 2:15 a.m. UTC | #11
On Mon, 17 Apr 2023 11:57:53 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 17 Apr 2023 11:56:10 -0700 Jakub Kicinski wrote:
> > > May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA API.
> > >
> > > I mean the callbacks for xsk to do dma.
> > >
> > > Maybe, I should rename it in the next version.
> >
> > Would you mind explaining this a bit more to folks like me who are not
> > familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
> > details from the stack, why is it not sufficient here.
>
> Umm.. also it'd help to post the user of the API in the same series.
> I only see the XSK changes, maybe if the virtio changes were in
> the same series I could answer my own question.


This [1] is similar code. It is an early version, but the idea is
similar to this patch.


[1] https://lore.kernel.org/virtualization/20230202110058.130695-1-xuanzhuo@linux.alibaba.com/


Thanks.
Xuan Zhuo April 18, 2023, 2:19 a.m. UTC | #12
On Mon, 17 Apr 2023 18:19:50 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 18 Apr 2023 09:07:30 +0800 Jason Wang wrote:
> > > > Would you mind explaining this a bit more to folks like me who are not
> > > > familiar with VirtIO?  DMA API is supposed to hide the DMA mapping
> > > > details from the stack, why is it not sufficient here.
> >
> > The reason is that legacy virtio device don't use DMA(vring_use_dma_api()).
> >
> > The AF_XDP assumes DMA for netdev doesn't work in this case. We need a
> > way to make it work.
>
> Can we not push this down to be bus level? virtio has its own bus it
> can plug in whatever magic it wants into dma ops.

It is actually not possible.

[1] https://lore.kernel.org/virtualization/ZDUCDeYLqAwQVJe7@infradead.org/

>
> Doesn't have to be super fast for af_xdp's sake - for af_xdp dma mapping
> is on the control path. You can keep using the if (vring_use_dma_api())
> elsewhere for now if there is a perf concern.

Sorry, I don't quite follow this paragraph.

The problem is that if vring_use_dma_api() is false, we cannot use the
DMA API in AF_XDP at all.

The good news is that, apart from some sync operations, everything is on
the control path, so the effect on performance should be very small,
because in most cases the sync is unnecessary.


>
> Otherwise it really seems like we're bubbling up a virtio hack into
> generic code :(

Can we view the purpose of this as handing the DMA operations back to the
driver? Although I don't know whether other drivers have similar
requirements.

Thanks.
Jakub Kicinski April 18, 2023, 2:54 a.m. UTC | #13
On Tue, 18 Apr 2023 10:19:39 +0800 Xuan Zhuo wrote:
> > Can we not push this down to be bus level? virtio has its own bus it
> > can plug in whatever magic it wants into dma ops.  
> 
> It is actually not possible.
> 
> [1] https://lore.kernel.org/virtualization/ZDUCDeYLqAwQVJe7@infradead.org/

Maybe Christoph, or Greg can comment.

AF_XDP, io_uring, and increasing number of pinned memory / zero copy
implementations need to do DMA mapping outside the drivers.
I don't think it's reasonable to be bubbling up custom per-subsystem
DMA ops into all of them for the sake of virtio.

> > Otherwise it really seems like we're bubbling up a virtio hack into
> > generic code :(  
> 
> Can we understand the purpose of this matter to back the DMA operation to the
> driver?

We understand what your code does.
Christoph Hellwig April 18, 2023, 5:01 a.m. UTC | #14
On Mon, Apr 17, 2023 at 07:54:00PM -0700, Jakub Kicinski wrote:
> AF_XDP, io_uring, and increasing number of pinned memory / zero copy
> implementations need to do DMA mapping outside the drivers.

You can't just do dma mapping outside the driver, because there are
drivers that do not require DMA mapping at all.  virtio is an example,
but all the classic s390 drivers and some other odd virtualization
ones are others.

> I don't think it's reasonable to be bubbling up custom per-subsystem
> DMA ops into all of them for the sake of virtio.

dma addresses and thus dma mappings are completely driver specific.
Upper layers have no business looking at them.
Jakub Kicinski April 18, 2023, 6:19 a.m. UTC | #15
On Mon, 17 Apr 2023 22:01:36 -0700 Christoph Hellwig wrote:
> On Mon, Apr 17, 2023 at 07:54:00PM -0700, Jakub Kicinski wrote:
> > AF_XDP, io_uring, and increasing number of pinned memory / zero copy
> > implementations need to do DMA mapping outside the drivers.  
> 
> You can't just do dma mapping outside the driver, because there are
> drivers that do not require DMA mapping at all.  virtio is an example,
> but all the classic s390 drivers and some other odd virtualization
> ones are others.

What bus are the classic s390 on (in terms of the device model)?

> > I don't think it's reasonable to be bubbling up custom per-subsystem
> > DMA ops into all of them for the sake of virtio.  
> 
> dma addresses and thus dma mappings are completely driver specific.
> Upper layers have no business looking at them.

Damn, that's unfortunate. Thinking aloud -- that means that if we want 
to continue to pull memory management out of networking drivers to
improve it for all, cross-optimize with the rest of the stack and
allow various upcoming forms of zero copy -- then we need to add an
equivalent of dma_ops and DMA API locally in networking?
Christoph Hellwig April 19, 2023, 5:16 a.m. UTC | #16
On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
> > You can't just do dma mapping outside the driver, because there are
> > drivers that do not require DMA mapping at all.  virtio is an example,
> > but all the classic s390 drivers and some other odd virtualization
> > ones are others.
> 
> What bus are the classic s390 on (in terms of the device model)?

I think most of them are based on struct ccw_device, but I'll let the
s390 maintainers fill in.

Another interesting case that isn't really relevant for your networking
guys, but that caused us problems, is RDMA.  For hardware RDMA devices
it wants the ULPs to DMA map, but it turns out we have various software
drivers that map to network drivers that do their own DMA mapping
at a much lower layer and after potentially splitting packets or
even mangling them.

> 
> > > I don't think it's reasonable to be bubbling up custom per-subsystem
> > > DMA ops into all of them for the sake of virtio.  
> > 
> > dma addresses and thus dma mappings are completely driver specific.
> > Upper layers have no business looking at them.
> 
> Damn, that's unfortunate. Thinking aloud -- that means that if we want 
> to continue to pull memory management out of networking drivers to
> improve it for all, cross-optimize with the rest of the stack and
> allow various upcoming forms of zero copy -- then we need to add an
> equivalent of dma_ops and DMA API locally in networking?

Can you explain what the actual use case is?

From the original patchset I suspect it is dma mapping something very
long term and then maybe doing syncs on it as needed?
Alexander Lobakin April 19, 2023, 1:14 p.m. UTC | #17
From: Christoph Hellwig <hch@infradead.org>
Date: Tue, 18 Apr 2023 22:16:53 -0700

> On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
>>> You can't just do dma mapping outside the driver, because there are
>>> drivers that do not require DMA mapping at all.  virtio is an example,
>>> but all the classic s390 drivers and some other odd virtualization
>>> ones are others.
>>
>> What bus are the classic s390 on (in terms of the device model)?
> 
> I think most of them are based on struct ccw_device, but I'll let the
> s390 maintainers fill in.
> 
> Another interesting case that isn't really relevant for your networking
> guys, but that caused as problems is RDMA.  For hardware RDMA devices
> it wants the ULPs to DMA map, but it turns out we have various software
> drivers that map to network drivers that do their own DMA mapping
> at a much lower layer and after potentially splitting packets or
> even mangling them.
> 
>>
>>>> I don't think it's reasonable to be bubbling up custom per-subsystem
>>>> DMA ops into all of them for the sake of virtio.  
>>>
>>> dma addresses and thus dma mappings are completely driver specific.
>>> Upper layers have no business looking at them.

Here it's not an "upper layer". XSk core doesn't look at them or pass
them between several drivers. It maps DMA solely via the struct device
passed from the driver and then just gets-sets addresses for this driver
only. Just like Page Pool does for regular Rx buffers. This got moved to
the XSk core to not repeat the same code pattern in each driver.

>>
>> Damn, that's unfortunate. Thinking aloud -- that means that if we want 
>> to continue to pull memory management out of networking drivers to
>> improve it for all, cross-optimize with the rest of the stack and
>> allow various upcoming forms of zero copy -- then we need to add an
>> equivalent of dma_ops and DMA API locally in networking?

Managing DMA addresses is totally fine as long as you don't try to pass
mapped addresses between different drivers :D Page Pool already does
that and I don't see a problem in that in general.

> 
> Can you explain what the actual use case is?
> 
> From the original patchset I suspect it is dma mapping something very
> long term and then maybe doing syncs on it as needed?

As I mentioned, XSk provides some handy wrappers to map DMA for drivers.
Previously, XSk was supported by real hardware drivers only, but here
the developer tries to add support to virtio-net. I suspect he needs to
use DMA mapping functions different from the ones regular drivers use.
So this is far from dma_map_ops, the author picked the wrong name :D
And correct, for XSk we map one big piece of memory only once and then
reuse it for buffers, no inflight map/unmap on hotpath (only syncs when
needed). So this mapping is longterm and is stored in XSk core structure
assigned to the driver which this mapping was done for.
I think Jakub thinks of something similar, but for the "regular" Rx/Tx,
not only XDP sockets :)

Thanks,
Olek
Alexander Lobakin April 19, 2023, 1:22 p.m. UTC | #18
From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Mon, 17 Apr 2023 11:27:50 +0800

> The purpose of this patch is to allow driver pass the own dma_ops to
> xsk.
> 
> This is to cope with the scene of virtio-net. If virtio does not have
> VIRTIO_F_ACCESS_PLATFORM, then virtio cannot use DMA API. In this case,
> XSK cannot use DMA API directly to achieve DMA address. Based on this
> scene, we must let XSK support driver to use the driver's dma_ops.
> 
> On the other hand, the implementation of XSK as a highlevel code
> should put the underlying operation of DMA to the driver layer.
> The driver layer determines the implementation of the final DMA. XSK
> should not make such assumptions. Everything will be simplified if DMA
> is done at the driver level.
> 
> More is here:
>     https://lore.kernel.org/virtualization/1681265026.6082013-1-xuanzhuo@linux.alibaba.com/
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

[...]

>  struct xsk_buff_pool {
>  	/* Members only used in the control path first. */
>  	struct device *dev;
> @@ -85,6 +102,7 @@ struct xsk_buff_pool {
>  	 * sockets share a single cq when the same netdev and queue id is shared.
>  	 */
>  	spinlock_t cq_lock;
> +	struct xsk_dma_ops dma_ops;

Why full struct, not a const pointer? You'll have indirect calls either
way, copying the full struct won't reclaim you much performance.

>  	struct xdp_buff_xsk *free_heads[];
>  };
>  

[...]

> @@ -424,18 +426,29 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
>  		return 0;
>  	}
>  
> +	if (!dma_ops) {
> +		pool->dma_ops.map_page = dma_map_page_attrs;
> +		pool->dma_ops.mapping_error = dma_mapping_error;
> +		pool->dma_ops.need_sync = dma_need_sync;
> +		pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
> +		pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
> +		dma_ops = &pool->dma_ops;
> +	} else {
> +		pool->dma_ops = *dma_ops;
> +	}

If DMA syncs are not needed on your x86_64 DMA-coherent system, it
doesn't mean we all don't need it. Instead of filling pointers with
"default" callbacks, you could instead avoid indirect calls at all when
no custom DMA ops are specified. Pls see how for example Christoph did
that for direct DMA. It would cost only one if-else for case without
custom DMA ops here instead of an indirect call each time.

(I *could* suggest using INDIRECT_CALL_WRAPPER(), but I won't, since
 it's more expensive than direct checking and I feel like it's more
 appropriate to check directly here)
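
A sketch of that direct check, assuming pool->dma_ops becomes a (possibly NULL)
pointer as suggested above; the helper name is made up for illustration and this
is not code from any posted revision:

static inline void xp_sync_range_for_device(struct xsk_buff_pool *pool,
					    dma_addr_t dma, size_t size)
{
	/* Common case: no custom ops, call the DMA API directly so there
	 * is no indirect call on the hot path. */
	if (likely(!pool->dma_ops))
		dma_sync_single_range_for_device(pool->dev, dma, 0, size,
						 DMA_BIDIRECTIONAL);
	else
		pool->dma_ops->sync_single_range_for_device(pool->dev, dma,
							    0, size,
							    DMA_BIDIRECTIONAL);
}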

> +
>  	dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
>  	if (!dma_map)
>  		return -ENOMEM;
[...]

Thanks,
Olek
Xuan Zhuo April 19, 2023, 1:40 p.m. UTC | #19
On Wed, 19 Apr 2023 15:14:48 +0200, Alexander Lobakin <aleksander.lobakin@intel.com> wrote:
> From: Christoph Hellwig <hch@infradead.org>
> Date: Tue, 18 Apr 2023 22:16:53 -0700
>
> > On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
> >>> You can't just do dma mapping outside the driver, because there are
> >>> drivers that do not require DMA mapping at all.  virtio is an example,
> >>> but all the classic s390 drivers and some other odd virtualization
> >>> ones are others.
> >>
> >> What bus are the classic s390 on (in terms of the device model)?
> >
> > I think most of them are based on struct ccw_device, but I'll let the
> > s390 maintainers fill in.
> >
> > Another interesting case that isn't really relevant for your networking
> > guys, but that caused as problems is RDMA.  For hardware RDMA devices
> > it wants the ULPs to DMA map, but it turns out we have various software
> > drivers that map to network drivers that do their own DMA mapping
> > at a much lower layer and after potentially splitting packets or
> > even mangling them.
> >
> >>
> >>>> I don't think it's reasonable to be bubbling up custom per-subsystem
> >>>> DMA ops into all of them for the sake of virtio.
> >>>
> >>> dma addresses and thus dma mappings are completely driver specific.
> >>> Upper layers have no business looking at them.
>
> Here it's not an "upper layer". XSk core doesn't look at them or pass
> them between several drivers. It maps DMA solely via the struct device
> passed from the driver and then just gets-sets addresses for this driver
> only. Just like Page Pool does for regular Rx buffers. This got moved to
> the XSk core to not repeat the same code pattern in each driver.
>
> >>
> >> Damn, that's unfortunate. Thinking aloud -- that means that if we want
> >> to continue to pull memory management out of networking drivers to
> >> improve it for all, cross-optimize with the rest of the stack and
> >> allow various upcoming forms of zero copy -- then we need to add an
> >> equivalent of dma_ops and DMA API locally in networking?
>
> Managing DMA addresses is totally fine as long as you don't try to pass
> mapped addresses between different drivers :D Page Pool already does
> that and I don't see a problem in that in general.
>
> >
> > Can you explain what the actual use case is?
> >
> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
>
> As I mentioned, XSk provides some handy wrappers to map DMA for drivers.
> Previously, XSk was supported by real hardware drivers only, but here
> the developer tries to add support to virtio-net. I suspect he needs to
> use DMA mapping functions different from which the regular driver use.
> So this is far from dma_map_ops, the author picked wrong name :D


Yes, "dma_ops" caused some misunderstanding; it is indeed the wrong name.

Thanks.


> And correct, for XSk we map one big piece of memory only once and then
> reuse it for buffers, no inflight map/unmap on hotpath (only syncs when
> needed). So this mapping is longterm and is stored in XSk core structure
> assigned to the driver which this mapping was done for.
> I think Jakub thinks of something similar, but for the "regular" Rx/Tx,
> not only XDP sockets :)
>
> Thanks,
> Olek
Xuan Zhuo April 19, 2023, 1:42 p.m. UTC | #20
On Wed, 19 Apr 2023 15:22:39 +0200, Alexander Lobakin <aleksander.lobakin@intel.com> wrote:
> From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Date: Mon, 17 Apr 2023 11:27:50 +0800
>
> > The purpose of this patch is to allow driver pass the own dma_ops to
> > xsk.
> >
> > This is to cope with the scene of virtio-net. If virtio does not have
> > VIRTIO_F_ACCESS_PLATFORM, then virtio cannot use DMA API. In this case,
> > XSK cannot use DMA API directly to achieve DMA address. Based on this
> > scene, we must let XSK support driver to use the driver's dma_ops.
> >
> > On the other hand, the implementation of XSK as a highlevel code
> > should put the underlying operation of DMA to the driver layer.
> > The driver layer determines the implementation of the final DMA. XSK
> > should not make such assumptions. Everything will be simplified if DMA
> > is done at the driver level.
> >
> > More is here:
> >     https://lore.kernel.org/virtualization/1681265026.6082013-1-xuanzhuo@linux.alibaba.com/
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>
> [...]
>
> >  struct xsk_buff_pool {
> >  	/* Members only used in the control path first. */
> >  	struct device *dev;
> > @@ -85,6 +102,7 @@ struct xsk_buff_pool {
> >  	 * sockets share a single cq when the same netdev and queue id is shared.
> >  	 */
> >  	spinlock_t cq_lock;
> > +	struct xsk_dma_ops dma_ops;
>
> Why full struct, not a const pointer? You'll have indirect calls either
> way, copying the full struct won't reclaim you much performance.
>
> >  	struct xdp_buff_xsk *free_heads[];
> >  };
> >
>
> [...]
>
> > @@ -424,18 +426,29 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
> >  		return 0;
> >  	}
> >
> > +	if (!dma_ops) {
> > +		pool->dma_ops.map_page = dma_map_page_attrs;
> > +		pool->dma_ops.mapping_error = dma_mapping_error;
> > +		pool->dma_ops.need_sync = dma_need_sync;
> > +		pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
> > +		pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
> > +		dma_ops = &pool->dma_ops;
> > +	} else {
> > +		pool->dma_ops = *dma_ops;
> > +	}
>
> If DMA syncs are not needed on your x86_64 DMA-coherent system, it
> doesn't mean we all don't need it. Instead of filling pointers with
> "default" callbacks, you could instead avoid indirect calls at all when
> no custom DMA ops are specified. Pls see how for example Christoph did
> that for direct DMA. It would cost only one if-else for case without
> custom DMA ops here instead of an indirect call each time.
>
> (I *could* suggest using INDIRECT_CALL_WRAPPER(), but I won't, since
>  it's more expensive than direct checking and I feel like it's more
>  appropriate to check directly here)

OK, I will fix it in the next version.

Thanks.




>
> > +
> >  	dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
> >  	if (!dma_map)
> >  		return -ENOMEM;
> [...]
>
> Thanks,
> Olek
Jakub Kicinski April 19, 2023, 4:45 p.m. UTC | #21
On Tue, 18 Apr 2023 22:16:53 -0700 Christoph Hellwig wrote:
> On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
> > Damn, that's unfortunate. Thinking aloud -- that means that if we want 
> > to continue to pull memory management out of networking drivers to
> > improve it for all, cross-optimize with the rest of the stack and
> > allow various upcoming forms of zero copy -- then we need to add an
> > equivalent of dma_ops and DMA API locally in networking?  
> 
> Can you explain what the actual use case is?
> 
> From the original patchset I suspect it is dma mapping something very
> long term and then maybe doing syncs on it as needed?

In this case yes, pinned user memory, it gets sliced up into MTU sized
chunks, fed into an Rx queue of a device, and user can see packets
without any copies.

Quite similar use case #2 is upcoming io_uring / "direct placement"
patches (former from Meta, latter for Google) which will try to receive
just the TCP data into pinned user memory.

And, as I think Olek mentioned, #3 is page_pool - which allocates 4k
pages, manages the DMA mappings, gives them to the device and tries 
to recycle back to the device once TCP is done with them (avoiding the
unmapping and even atomic ops on the refcount, as in the good case page
refcount is always 1). See page_pool_return_skb_page() for the
recycling flow.

In all those cases it's more flexible (and faster) to hide the DMA
mapping from the driver. All the cases are also opt-in so we don't need
to worry about complete oddball devices. And to answer your question in
all cases we hope mapping/unmapping will be relatively rare while
syncing will be frequent.

AFAIU the patch we're discussing implements custom dma_ops for case #1,
but the same thing will be needed for #2, and #3. Question to me is
whether we need netdev-wide net_dma_ops or device model can provide us
with a DMA API that'd work for SoC/PCIe/virt devices.
Christoph Hellwig April 20, 2023, 6:12 a.m. UTC | #22
On Wed, Apr 19, 2023 at 03:22:39PM +0200, Alexander Lobakin wrote:
> If DMA syncs are not needed on your x86_64 DMA-coherent system, it
> doesn't mean we all don't need it.

If the DMA isn't actually a DMA (as in the virtio case, or other
cases that instead have to do their own dma mapping at much lower
layers) syncs generally don't make sense.

> Instead of filling pointers with
> "default" callbacks, you could instead avoid indirect calls at all when
> no custom DMA ops are specified. Pls see how for example Christoph did
> that for direct DMA. It would cost only one if-else for case without
> custom DMA ops here instead of an indirect call each time.

So yes, I think the abstraction here should not be another layer of
DMA ops, but the option to DMA map or not at all.
Christoph Hellwig April 20, 2023, 6:16 a.m. UTC | #23
On Wed, Apr 19, 2023 at 03:14:48PM +0200, Alexander Lobakin wrote:
> >>> dma addresses and thus dma mappings are completely driver specific.
> >>> Upper layers have no business looking at them.
> 
> Here it's not an "upper layer". XSk core doesn't look at them or pass
> them between several drivers.

Same for upper layers :)  They just do abstract operations that can sit
on top of a variety of drivers.

> It maps DMA solely via the struct device
> passed from the driver and then just gets-sets addresses for this driver
> only. Just like Page Pool does for regular Rx buffers. This got moved to
> the XSk core to not repeat the same code pattern in each driver.

Which assumes that:

 a) a DMA mapping needs to be done at all
 b) it can be done using a struct device exposed to it
 c) that DMA mapping is actually at the same granularity that it
    operates on

all of which might not be true.

> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
> 
> As I mentioned, XSk provides some handy wrappers to map DMA for drivers.
> Previously, XSk was supported by real hardware drivers only, but here
> the developer tries to add support to virtio-net. I suspect he needs to
> use DMA mapping functions different from which the regular driver use.

Yes. For actual hardware virtio and some more complex virtualized
setups it works just like real hardware.  For legacy virtio there is
no DMA mapping involved at all.  Because of that all DMA mapping needs
to be done inside of virtio.

> So this is far from dma_map_ops, the author picked wrong name :D
> And correct, for XSk we map one big piece of memory only once and then
> reuse it for buffers, no inflight map/unmap on hotpath (only syncs when
> needed). So this mapping is longterm and is stored in XSk core structure
> assigned to the driver which this mapping was done for.
> I think Jakub thinks of something similar, but for the "regular" Rx/Tx,
> not only XDP sockets :)

FYI, dma_map_* is not intended for long-term mappings; it can lead
to starvation issues.  You need to use dma_alloc_* instead.  And
"you" in that case is as I said the driver, not an upper layer.
If it's just helper called by drivers and never from core code,
that's of course fine.
Christoph Hellwig April 20, 2023, 6:19 a.m. UTC | #24
On Wed, Apr 19, 2023 at 09:45:06AM -0700, Jakub Kicinski wrote:
> > Can you explain what the actual use case is?
> > 
> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
> 
> In this case yes, pinned user memory, it gets sliced up into MTU sized
> chunks, fed into an Rx queue of a device, and user can see packets
> without any copies.

How long is the lifetime of these mappings?  Because dma_map_*
assumes a temporary mapping and not one that is pinned basically
forever.

> Quite similar use case #2 is upcoming io_uring / "direct placement"
> patches (former from Meta, latter for Google) which will try to receive
> just the TCP data into pinned user memory.

I don't think we can just long term pin user memory here.  E.g. for
confidential computing cases we can't even ever do DMA straight to
userspace.  I had that conversation with Meta's block folks who
want to do something similar with io_uring and the only option is an
an allocator for memory that is known DMAable, e.g. through dma-bufs.

You guys really all need to get together and come up with a scheme
that actually works instead of piling these hacks over hacks.
Xuan Zhuo April 20, 2023, 9:11 a.m. UTC | #25
On Wed, 19 Apr 2023 23:19:22 -0700, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 19, 2023 at 09:45:06AM -0700, Jakub Kicinski wrote:
> > > Can you explain what the actual use case is?
> > >
> > > From the original patchset I suspect it is dma mapping something very
> > > long term and then maybe doing syncs on it as needed?
> >
> > In this case yes, pinned user memory, it gets sliced up into MTU sized
> > chunks, fed into an Rx queue of a device, and user can see packets
> > without any copies.
>
> How long is the life time of these mappings?  Because dma_map_*
> assumes a temporary mapping and not one that is pinned bascically
> forever.
>
> > Quite similar use case #2 is upcoming io_uring / "direct placement"
> > patches (former from Meta, latter for Google) which will try to receive
> > just the TCP data into pinned user memory.
>
> I don't think we can just long term pin user memory here.  E.g. for
> confidential computing cases we can't even ever do DMA straight to
> userspace.  I had that conversation with Meta's block folks who
> want to do something similar with io_uring and the only option is an
> an allocator for memory that is known DMAable, e.g. through dma-bufs.
>
> You guys really all need to get together and come up with a scheme
> that actually works instead of piling these hacks over hacks.

I think the cases Jakub mentioned are new requirements. From an
implementation perspective, compared to the patch I submitted, expanding
the DMA API to be compatible with some special hardware would be a good
solution.

I know that the current design of the DMA API only supports some physical
hardware, but can it be modified or expanded?

Going back to the previous idea: can we add a new ops (not dma_ops) to the
device, configured by the driver, so that the DMA API can be compatible
with some special hardware?


Thanks.
Alexander Lobakin April 20, 2023, 1:59 p.m. UTC | #26
From: Christoph Hellwig <hch@infradead.org>
Date: Wed, 19 Apr 2023 23:16:23 -0700

> On Wed, Apr 19, 2023 at 03:14:48PM +0200, Alexander Lobakin wrote:
>>>>> dma addresses and thus dma mappings are completely driver specific.
>>>>> Upper layers have no business looking at them.
>>
>> Here it's not an "upper layer". XSk core doesn't look at them or pass
>> them between several drivers.
> 
> Same for upper layers :)  The just do abstract operations that can sit
> on top of a variety of drivers.
> 
>> It maps DMA solely via the struct device
>> passed from the driver and then just gets-sets addresses for this driver
>> only. Just like Page Pool does for regular Rx buffers. This got moved to
>> the XSk core to not repeat the same code pattern in each driver.
> 
> Which assumes that:
> 
>  a) a DMA mapping needs to be done at all
>  b) it can be done using a struct device exposed to it
>  c) that DMA mapping is actually at the same granularity that it
>     operates on
> 
> all of which might not be true.
> 
>>> From the original patchset I suspect it is dma mapping something very
>>> long term and then maybe doing syncs on it as needed?
>>
>> As I mentioned, XSk provides some handy wrappers to map DMA for drivers.
>> Previously, XSk was supported by real hardware drivers only, but here
>> the developer tries to add support to virtio-net. I suspect he needs to
>> use DMA mapping functions different from which the regular driver use.
> 
> Yes,  For actual hardware virtio and some more complex virtualized
> setups it works just like real hardware.  For legacy virtio there is 
> no DMA maping involved at all.  Because of that all DMA mapping needs
> to be done inside of virtio.
> 
>> So this is far from dma_map_ops, the author picked wrong name :D
>> And correct, for XSk we map one big piece of memory only once and then
>> reuse it for buffers, no inflight map/unmap on hotpath (only syncs when
>> needed). So this mapping is longterm and is stored in XSk core structure
>> assigned to the driver which this mapping was done for.
>> I think Jakub thinks of something similar, but for the "regular" Rx/Tx,
>> not only XDP sockets :)
> 
> FYI, dma_map_* is not intended for long term mappings, can lead
> to starvation issues.  You need to use dma_alloc_* instead.  And
> "you" in that case is as I said the driver, not an upper layer.
> If it's just helper called by drivers and never from core code,
> that's of course fine.

Hmm, currently almost all Ethernet drivers map Rx pages once and then
just recycle them, keeping the original DMA mapping, which means pages
can have the same first mapping for a very long time, often even for the
lifetime of the struct device. Same for XDP sockets, the lifetime of DMA
mappings equals the lifetime of sockets.
Does it mean we'd better review that approach and try switching to
dma_alloc_*() family (non-coherent/caching in our case)?
Also, I remember I tried to do that for one my driver, but the thing
that all those functions zero the whole page(s) before returning them to
the driver ruins the performance -- we don't need to zero buffers for
receiving packets and spend a ton of cycles on it (esp. in cases when 4k
gets zeroed each time, but your main body of traffic is 64-byte frames).

Thanks,
Olek
Jakub Kicinski April 20, 2023, 2:13 p.m. UTC | #27
On Wed, 19 Apr 2023 23:19:22 -0700 Christoph Hellwig wrote:
> > In this case yes, pinned user memory, it gets sliced up into MTU sized
> > chunks, fed into an Rx queue of a device, and user can see packets
> > without any copies.  
> 
> How long is the life time of these mappings?  Because dma_map_*
> assumes a temporary mapping and not one that is pinned bascically
> forever.

Yeah, this one is "for ever".

> > Quite similar use case #2 is upcoming io_uring / "direct placement"
> > patches (former from Meta, latter for Google) which will try to receive
> > just the TCP data into pinned user memory.  
> 
> I don't think we can just long term pin user memory here.  E.g. for
> confidential computing cases we can't even ever do DMA straight to
> userspace.  I had that conversation with Meta's block folks who
> want to do something similar with io_uring and the only option is an
> an allocator for memory that is known DMAable, e.g. through dma-bufs.
> 
> You guys really all need to get together and come up with a scheme
> that actually works instead of piling these hacks over hacks.

Okay, that simplifies various aspects. We'll just used dma-bufs from
the start in the new APIs.
Christoph Hellwig April 20, 2023, 4:15 p.m. UTC | #28
On Thu, Apr 20, 2023 at 03:59:39PM +0200, Alexander Lobakin wrote:
> Hmm, currently almost all Ethernet drivers map Rx pages once and then
> just recycle them, keeping the original DMA mapping. Which means pages
> can have the same first mapping for very long time, often even for the
> lifetime of the struct device. Same for XDP sockets, the lifetime of DMA
> mappings equals the lifetime of sockets.
> Does it mean we'd better review that approach and try switching to
> dma_alloc_*() family (non-coherent/caching in our case)?

Yes, exactly.  dma_alloc_noncoherent can be used exactly as alloc_pages
+ dma_map_* by the driver (including the dma_sync_* calls on reuse), but
has a huge number of advantages.
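
A minimal sketch of that replacement for a single long-lived Rx buffer (helper
names are invented for illustration; error handling trimmed):

/* One allocation == one long-term mapping; only syncs on reuse. */
static void *rx_buf_alloc(struct device *dev, dma_addr_t *dma)
{
	return dma_alloc_noncoherent(dev, PAGE_SIZE, dma,
				     DMA_BIDIRECTIONAL, GFP_KERNEL);
}

static void rx_buf_free(struct device *dev, void *buf, dma_addr_t dma)
{
	dma_free_noncoherent(dev, PAGE_SIZE, buf, dma, DMA_BIDIRECTIONAL);
}

/* On recycling, no remapping, just like alloc_pages() + dma_map_page():
 *	dma_sync_single_for_cpu(dev, dma, PAGE_SIZE, DMA_BIDIRECTIONAL);
 *	... CPU touches buf ...
 *	dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_BIDIRECTIONAL);
 */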

> Also, I remember I tried to do that for one my driver, but the thing
> that all those functions zero the whole page(s) before returning them to
> the driver ruins the performance -- we don't need to zero buffers for
> receiving packets and spend a ton of cycles on it (esp. in cases when 4k
> gets zeroed each time, but your main body of traffic is 64-byte frames).

Hmm, the single zeroing when doing the initial allocation shows up
in these profiles?
Christoph Hellwig April 20, 2023, 4:18 p.m. UTC | #29
On Thu, Apr 20, 2023 at 05:11:48PM +0800, Xuan Zhuo wrote:
> I know that the current design of DMA API only supports some physical hardware,
> but can it be modified or expanded?

I think the important point is that for some cases there is no need
to DMA map at all, and upper layers should be fine with that by just
doing the DMA mapping in helpers called by the driver.

The virtio drivers then check if platform_access is set, then call the
generic dma mapping helper, or if not just allocate memory using
alloc_pages and also skip all the sync calls.
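
A sketch of where that branch would live, i.e. inside virtio-net rather than in
the XSK core (the function name is invented and the details are assumptions, not
a posted patch):

static dma_addr_t virtnet_xsk_map_page(struct virtio_device *vdev,
				       struct device *dma_dev,
				       struct page *page)
{
	/* Platform access negotiated: behave like real hardware and use
	 * the generic DMA mapping helper (syncs apply as usual). */
	if (virtio_has_feature(vdev, VIRTIO_F_ACCESS_PLATFORM))
		return dma_map_page(dma_dev, page, 0, PAGE_SIZE,
				    DMA_BIDIRECTIONAL);

	/* Legacy virtio: the ring takes guest physical addresses, so no
	 * DMA mapping and no sync calls are needed. */
	return (dma_addr_t)page_to_phys(page);
}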
Alexander Lobakin April 20, 2023, 4:42 p.m. UTC | #30
From: Christoph Hellwig <hch@infradead.org>
Date: Thu, 20 Apr 2023 09:15:23 -0700

> On Thu, Apr 20, 2023 at 03:59:39PM +0200, Alexander Lobakin wrote:
>> Hmm, currently almost all Ethernet drivers map Rx pages once and then
>> just recycle them, keeping the original DMA mapping. Which means pages
>> can have the same first mapping for very long time, often even for the
>> lifetime of the struct device. Same for XDP sockets, the lifetime of DMA
>> mappings equals the lifetime of sockets.
>> Does it mean we'd better review that approach and try switching to
>> dma_alloc_*() family (non-coherent/caching in our case)?
> 
> Yes, exactly.  dma_alloc_noncoherent can be used exactly as alloc_pages
> + dma_map_* by the driver (including the dma_sync_* calls on reuse), but
> has a huge number of advantages.
> 
>> Also, I remember I tried to do that for one my driver, but the thing
>> that all those functions zero the whole page(s) before returning them to
>> the driver ruins the performance -- we don't need to zero buffers for
>> receiving packets and spend a ton of cycles on it (esp. in cases when 4k
>> gets zeroed each time, but your main body of traffic is 64-byte frames).
> 
> Hmm, the single zeroing when doing the initial allocation shows up
> in these profiles?

When there's no recycling of pages, then yes. And since recycling is
done asynchronously, sometimes new allocations happen either way.
Anyways, that was roughly a couple years ago right when you introduced
dma_alloc_noncoherent(). Things might've been changed since then.
I could try again while next is closed (i.e. starting this Sunday), the
only thing I'd like to mention: Page Pool allocates pages via
alloc_pages_bulk_array_node(). Bulking helps a lot (and PP uses bulks of
16 IIRC), explicit node setting helps when Rx queues are distributed
between several nodes. We can then have one struct device for several nodes.
As I can see, there's now no function to allocate in bulks and no
explicit node setting option (e.g. mlx5 works around this using
set_dev_node() + allocate + set_dev_node(orig_node)). Could such options
be added in near future? That would help a lot switching to the
functions intended for use when DMA mappings can stay for a long time.
From what I see from the code, that shouldn't be a problem (except for
non-direct DMA cases, where we'd need to introduce new callbacks or
extend the existing ones).
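
For completeness, a sketch of the set_dev_node() workaround mentioned above
(names are invented; the real mlx5 code differs):

/* Temporarily point the device at the desired NUMA node so the DMA
 * allocation lands there, then restore the original node. */
static void *dma_alloc_on_node(struct device *dev, int node, size_t size,
			       dma_addr_t *dma)
{
	int orig_node = dev_to_node(dev);
	void *buf;

	set_dev_node(dev, node);
	buf = dma_alloc_noncoherent(dev, size, dma, DMA_BIDIRECTIONAL,
				    GFP_KERNEL);
	set_dev_node(dev, orig_node);

	return buf;
}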

Thanks,
Olek
Xuan Zhuo April 21, 2023, 7:31 a.m. UTC | #31
On Thu, 20 Apr 2023 07:13:49 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 19 Apr 2023 23:19:22 -0700 Christoph Hellwig wrote:
> > > In this case yes, pinned user memory, it gets sliced up into MTU sized
> > > chunks, fed into an Rx queue of a device, and user can see packets
> > > without any copies.
> >
> > How long is the life time of these mappings?  Because dma_map_*
> > assumes a temporary mapping and not one that is pinned bascically
> > forever.
>
> Yeah, this one is "for ever".
>
> > > Quite similar use case #2 is upcoming io_uring / "direct placement"
> > > patches (former from Meta, latter for Google) which will try to receive
> > > just the TCP data into pinned user memory.
> >
> > I don't think we can just long term pin user memory here.  E.g. for
> > confidential computing cases we can't even ever do DMA straight to
> > userspace.  I had that conversation with Meta's block folks who
> > want to do something similar with io_uring and the only option is an
> > an allocator for memory that is known DMAable, e.g. through dma-bufs.
> >
> > You guys really all need to get together and come up with a scheme
> > that actually works instead of piling these hacks over hacks.
>
> Okay, that simplifies various aspects. We'll just used dma-bufs from
> the start in the new APIs.


I am not particularly familiar with dma-bufs. I want to know if this mechanism
can solve the problem of virtio-net.

I saw this framework, which allows the driver to do something inside the
ops of dma-bufs.

If so, is it possible to propose a new patch based on dma-bufs?

Thanks.
Jakub Kicinski April 21, 2023, 1:50 p.m. UTC | #32
On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
> I am not particularly familiar with dma-bufs. I want to know if this mechanism
> can solve the problem of virtio-net.
> 
> I saw this framework, allowing the driver to do something inside the ops of
> dma-bufs.
> 
> If so, is it possible to propose a new patch based on dma-bufs?

I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
uAPI of XSK to allow user to pass in a dma-buf region rather than just
a user VA. So it may be a larger effort but architecturally it may be
the right solution.
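
For readers unfamiliar with the importer side, a rough sketch of what such a
rework would build on. The helper is hypothetical; only the dma_buf_get(),
dma_buf_attach() and dma_buf_map_attachment() calls are the existing kernel
API, and real code would also keep the dma_buf around for the matching
unmap/detach/put on teardown:

#include <linux/dma-buf.h>

/* Hypothetical: import a dma-buf fd passed in through a (not yet existing)
 * XSK uAPI extension and obtain device DMA addresses for it.
 * Reservation-object locking requirements are glossed over here.
 */
static struct sg_table *xsk_import_dmabuf(int fd, struct device *dev,
					  struct dma_buf_attachment **att)
{
	struct dma_buf *dmabuf = dma_buf_get(fd);
	struct sg_table *sgt;

	if (IS_ERR(dmabuf))
		return ERR_CAST(dmabuf);

	*att = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(*att)) {
		dma_buf_put(dmabuf);
		return ERR_CAST(*att);
	}

	/* The exporter hands back ready-to-use DMA addresses in the
	 * sg_table; no dma_map_page() calls in the driver.
	 */
	sgt = dma_buf_map_attachment(*att, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, *att);
		dma_buf_put(dmabuf);
	}
	return sgt;
}
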
Xuan Zhuo April 23, 2023, 1:54 a.m. UTC | #33
On Fri, 21 Apr 2023 06:50:59 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
> > I am not particularly familiar with dma-bufs. I want to know if this mechanism
> > can solve the problem of virtio-net.
> >
> > I saw this framework, allowing the driver to do something inside the ops of
> > dma-bufs.
> >
> > If so, is it possible to propose a new patch based on dma-bufs?
>
> I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
> uAPI of XSK to allow user to pass in a dma-buf region rather than just
> a user VA.

This seems to be a big job. Can we accept this patch first?

Thanks.

> So it may be a larger effort but architecturally it may be
> the right solution.
Alexander Lobakin April 24, 2023, 3:28 p.m. UTC | #34
From: Jakub Kicinski <kuba@kernel.org>
Date: Fri, 21 Apr 2023 06:50:59 -0700

> On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
>> I am not particularly familiar with dma-bufs. I want to know if this mechanism
>> can solve the problem of virtio-net.
>>
>> I saw this framework, allowing the driver to do something inside the ops of
>> dma-bufs.
>>
>> If so, is it possible to propose a new patch based on dma-bufs?
> 
> I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework

Oh no, not me. I suck at dma-bufs, tried to understand them several
times with no progress :D My knowledge is limited to "ok, if it's
DMA + userspace, then it's likely dma-buf" :smile_with_tear:

> uAPI of XSK to allow user to pass in a dma-buf region rather than just
> a user VA. So it may be a larger effort but architecturally it may be
> the right solution.
> 

I'm curious whether this could be done without tons of work. Switching
Page Pool to dma_alloc_noncoherent() is simpler :D But, as I wrote
above, we need to extend DMA API first to provide bulk allocations and
NUMA-aware allocations.
Can't we provide a shim for back-compat, i.e. if a program passes just a
user VA, create a dma-buf in the kernel already?

Thanks,
Olek
Jakub Kicinski April 24, 2023, 3:28 p.m. UTC | #35
On Sun, 23 Apr 2023 09:54:28 +0800 Xuan Zhuo wrote:
> On Fri, 21 Apr 2023 06:50:59 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:  
> > > I am not particularly familiar with dma-bufs. I want to know if this mechanism
> > > can solve the problem of virtio-net.
> > >
> > > I saw this framework, allowing the driver to do something inside the ops of
> > > dma-bufs.
> > >
> > > If so, is it possible to propose a new patch based on dma-bufs?  
> >
> > I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
> > uAPI of XSK to allow user to pass in a dma-buf region rather than just
> > a user VA.  
> 
> This seems to be a big job. Can we accept this patch first?

To me it looks like a very obvious workaround for the fact that virtio
does not make normal use of the DMA API, and (for reasons which perhaps
due to my meager intellect I do not grasp) you are not allowed to fix
that. Regardless, we have a path forward so I vote "no" to the patch
under discussion.
Xuan Zhuo April 25, 2023, 2:11 a.m. UTC | #36
On Mon, 24 Apr 2023 17:28:01 +0200, Alexander Lobakin <aleksander.lobakin@intel.com> wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Fri, 21 Apr 2023 06:50:59 -0700
>
> > On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
> >> I am not particularly familiar with dma-bufs. I want to know if this mechanism
> >> can solve the problem of virtio-net.
> >>
> >> I saw this framework, allowing the driver to do something inside the ops of
> >> dma-bufs.
> >>
> >> If so, is it possible to propose a new patch based on dma-bufs?
> >
> > I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
>
> Oh no, not me. I suck at dma-bufs, tried to understand them several
> times with no progress :D My knowledge is limited to "ok, if it's
> DMA + userspace, then it's likely dma-buf" :smile_with_tear:
>
> > uAPI of XSK to allow user to pass in a dma-buf region rather than just
> > a user VA. So it may be a larger effort but architecturally it may be
> > the right solution.
> >
>
> I'm curious whether this could be done without tons of work. Switching
> Page Pool to dma_alloc_noncoherent() is simpler :D But, as I wrote
> above, we need to extend DMA API first to provide bulk allocations and
> NUMA-aware allocations.
> Can't we provide a shim for back-compat, i.e. if a program passes just a
> user VA, create a dma-buf in the kernel already?


Yes

I think so too. If this is the case, will the workload be much smaller? Let me
try it.

Thanks.


>
> Thanks,
> Olek
Michael S. Tsirkin April 25, 2023, 8:12 a.m. UTC | #37
On Thu, Apr 20, 2023 at 09:18:21AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 20, 2023 at 05:11:48PM +0800, Xuan Zhuo wrote:
> > I know that the current design of DMA API only supports some physical hardware,
> > but can it be modified or expanded?
> 
> I think the important point is that for some cases there is no need
> to dma map at all, and upper layers should be fine by that by just
> doing the dma mapping in helpers called by the driver.
> 
> The virtio drivers then check if platform_access is set, then call the
> generic dma mapping helper, or if not just allocate memory using
> alloc_pages and also skip all the sync calls.

In theory, absolutely. In practice modern virtio devices are OK; we are
stuck supporting old legacy ones because legacy devices are needed to run
old guests. And then people turn
around and run a new guest on the same device,
for example because they switch back and forth e.g.
for data recovery? Or because whoever is selling the
host wants to opt for maximum compatibility.

Teaching all of Linux to sometimes use DMA and sometimes not
is a lot of work, and for limited benefit of these legacy systems.
We do it in a limited number of cases but generally
making DMA itself DTRT sounds more attractive.

So special DMA ops for these make some sense: yes, the
firmware-described DMA is wrong on these boxes, but
buggy firmware is not so unusual, is it?
Given that virtio devices actually are on a virtual bus (the virtio bus),
sticking the fake DMA ops on that bus seems to make sense
as a way to express this quirk.

No?
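
A sketch of that idea only, nothing more: the virtio_legacy_* names are made
up, while struct dma_map_ops and set_dma_ops() are the existing interfaces.
An identity dma_map_ops attached to legacy (!VIRTIO_F_ACCESS_PLATFORM)
devices would let upper layers keep calling the DMA API unconditionally:

#include <linux/dma-map-ops.h>

static dma_addr_t virtio_legacy_map_page(struct device *dev, struct page *page,
					 unsigned long offset, size_t size,
					 enum dma_data_direction dir,
					 unsigned long attrs)
{
	/* Legacy virtio expects guest-physical addresses. */
	return page_to_phys(page) + offset;
}

static void virtio_legacy_unmap_page(struct device *dev, dma_addr_t dma,
				     size_t size, enum dma_data_direction dir,
				     unsigned long attrs)
{
	/* No IOMMU, no bounce buffering, nothing to undo. */
}

static const struct dma_map_ops virtio_legacy_dma_ops = {
	.map_page	= virtio_legacy_map_page,
	.unmap_page	= virtio_legacy_unmap_page,
};

/* At device registration time, when VIRTIO_F_ACCESS_PLATFORM is not
 * negotiated:
 *
 *	set_dma_ops(&vdev->dev, &virtio_legacy_dma_ops);
 */

A real version would of course also need .map_sg, .alloc and friends; the
above only shows the shape of the quirk.
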
Christoph Hellwig May 1, 2023, 4:16 a.m. UTC | #38
On Tue, Apr 25, 2023 at 04:12:05AM -0400, Michael S. Tsirkin wrote:
> In theory, absolutely. In practice modern virtio devices are OK; we are
> stuck supporting old legacy ones because legacy devices are needed to run
> old guests. And then people turn
> around and run a new guest on the same device,
> for example because they switch back and forth e.g.
> for data recovery? Or because whoever is selling the
> host wants to opt for maximum compatibility.
> 
> Teaching all of Linux to sometimes use DMA and sometimes not
> is a lot of work, and for limited benefit of these legacy systems.

It's not like virtio is the only case where blindly assuming
you can use DMA operations in a higher layer is the problem.
Christoph Hellwig May 1, 2023, 4:28 a.m. UTC | #39
On Thu, Apr 20, 2023 at 06:42:17PM +0200, Alexander Lobakin wrote:
> When there's no recycling of pages, then yes. And since recycling is
> done asynchronously, sometimes new allocations happen either way.
> Anyways, that was roughly a couple of years ago, right when you introduced
> dma_alloc_noncoherent(). Things might've changed since then.
> I could try again while -next is closed (i.e. starting this Sunday); the
> only thing I'd like to mention: Page Pool allocates pages via
> alloc_pages_bulk_array_node(). Bulking helps a lot (PP uses bulks of
> 16 IIRC), and explicit node setting helps when Rx queues are distributed
> between several nodes. We can then have one struct device for several nodes.
> As far as I can see, there's currently no function to allocate in bulk and no
> explicit node-setting option (e.g. mlx5 works around this using
> set_dev_node() + allocate + set_dev_node(orig_node)). Could such options
> be added in the near future? That would help a lot with switching to the
> functions intended for use when DMA mappings can stay for a long time.
> From what I see from the code, that shouldn't be a problem (except for
> non-direct DMA cases, where we'd need to introduce new callbacks or
> extend the existing ones).

So the node hint is something we can trivially pass through, and
something the mlx5 maintainers should have done from the beginning
instead of this nasty hack.  Patches gladly accepted.

An alloc_pages_bulk_array_node-like allocator also seems doable; we
just need to make sure it has a decent fallback, as I don't think
we can wire it up to all the crazy legacy IOMMU drivers.
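
To make the ask concrete, the interface being discussed could look roughly
like the prototype below. This is purely hypothetical: nothing like it
exists in the DMA API today, and both the name and the signature are made up:

/* Hypothetical: a bulk, NUMA-aware variant of dma_alloc_noncoherent().
 * Returns the number of pages allocated and mapped, filling @pages and
 * @dma_handles; a generic fallback could simply loop over
 * dma_alloc_noncoherent().
 */
unsigned int dma_alloc_noncoherent_bulk_node(struct device *dev, int nid,
					     unsigned int nr_pages,
					     struct page **pages,
					     dma_addr_t *dma_handles,
					     enum dma_data_direction dir,
					     gfp_t gfp);
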

Patch

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 9c0d860609ba..181583ff6a26 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -67,7 +67,17 @@  static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool,
 {
 	struct xdp_umem *umem = pool->umem;
 
-	return xp_dma_map(pool, dev, attrs, umem->pgs, umem->npgs);
+	return xp_dma_map(pool, dev, NULL, attrs, umem->pgs, umem->npgs);
+}
+
+static inline int xsk_pool_dma_map_with_ops(struct xsk_buff_pool *pool,
+					    struct device *dev,
+					    struct xsk_dma_ops *dma_ops,
+					    unsigned long attrs)
+{
+	struct xdp_umem *umem = pool->umem;
+
+	return xp_dma_map(pool, dev, dma_ops, attrs, umem->pgs, umem->npgs);
 }
 
 static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp)
@@ -226,6 +236,14 @@  static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool,
 	return 0;
 }
 
+static inline int xsk_pool_dma_map_with_ops(struct xsk_buff_pool *pool,
+					    struct device *dev,
+					    struct xsk_dma_ops *dma_ops,
+					    unsigned long attrs)
+{
+	return 0;
+}
+
 static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp)
 {
 	return 0;
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 3e952e569418..1299b9d12484 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -43,6 +43,23 @@  struct xsk_dma_map {
 	bool dma_need_sync;
 };
 
+struct xsk_dma_ops {
+	dma_addr_t (*map_page)(struct device *dev, struct page *page,
+			       unsigned long offset, size_t size,
+			       enum dma_data_direction dir, unsigned long attrs);
+	void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
+			   size_t size, enum dma_data_direction dir,
+			   unsigned long attrs);
+	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
+	bool (*need_sync)(struct device *dev, dma_addr_t dma);
+	void (*sync_single_range_for_cpu)(struct device *dev, dma_addr_t addr,
+					  unsigned long offset, size_t size,
+					  enum dma_data_direction dir);
+	void (*sync_single_range_for_device)(struct device *dev, dma_addr_t addr,
+					     unsigned long offset, size_t size,
+					     enum dma_data_direction dir);
+};
+
 struct xsk_buff_pool {
 	/* Members only used in the control path first. */
 	struct device *dev;
@@ -85,6 +102,7 @@  struct xsk_buff_pool {
 	 * sockets share a single cq when the same netdev and queue id is shared.
 	 */
 	spinlock_t cq_lock;
+	struct xsk_dma_ops dma_ops;
 	struct xdp_buff_xsk *free_heads[];
 };
 
@@ -131,6 +149,7 @@  static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
 /* AF_XDP ZC drivers, via xdp_sock_buff.h */
 void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
 int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
+	       struct xsk_dma_ops *dma_ops,
 	       unsigned long attrs, struct page **pages, u32 nr_pages);
 void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
 struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b2df1e0f8153..646090cae8ec 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -328,7 +328,8 @@  static void xp_destroy_dma_map(struct xsk_dma_map *dma_map)
 	kfree(dma_map);
 }
 
-static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs)
+static void __xp_dma_unmap(struct xsk_dma_map *dma_map,
+			   struct xsk_dma_ops *dma_ops, unsigned long attrs)
 {
 	dma_addr_t *dma;
 	u32 i;
@@ -337,8 +338,8 @@  static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs)
 		dma = &dma_map->dma_pages[i];
 		if (*dma) {
 			*dma &= ~XSK_NEXT_PG_CONTIG_MASK;
-			dma_unmap_page_attrs(dma_map->dev, *dma, PAGE_SIZE,
-					     DMA_BIDIRECTIONAL, attrs);
+			dma_ops->unmap_page(dma_map->dev, *dma, PAGE_SIZE,
+					    DMA_BIDIRECTIONAL, attrs);
 			*dma = 0;
 		}
 	}
@@ -362,7 +363,7 @@  void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
 	if (!refcount_dec_and_test(&dma_map->users))
 		return;
 
-	__xp_dma_unmap(dma_map, attrs);
+	__xp_dma_unmap(dma_map, &pool->dma_ops, attrs);
 	kvfree(pool->dma_pages);
 	pool->dma_pages_cnt = 0;
 	pool->dev = NULL;
@@ -407,6 +408,7 @@  static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_
 }
 
 int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
+	       struct xsk_dma_ops *dma_ops,
 	       unsigned long attrs, struct page **pages, u32 nr_pages)
 {
 	struct xsk_dma_map *dma_map;
@@ -424,18 +426,29 @@  int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 		return 0;
 	}
 
+	if (!dma_ops) {
+		pool->dma_ops.map_page = dma_map_page_attrs;
+		pool->dma_ops.mapping_error = dma_mapping_error;
+		pool->dma_ops.need_sync = dma_need_sync;
+		pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
+		pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
+		dma_ops = &pool->dma_ops;
+	} else {
+		pool->dma_ops = *dma_ops;
+	}
+
 	dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
 	if (!dma_map)
 		return -ENOMEM;
 
 	for (i = 0; i < dma_map->dma_pages_cnt; i++) {
-		dma = dma_map_page_attrs(dev, pages[i], 0, PAGE_SIZE,
-					 DMA_BIDIRECTIONAL, attrs);
-		if (dma_mapping_error(dev, dma)) {
-			__xp_dma_unmap(dma_map, attrs);
+		dma = dma_ops->map_page(dev, pages[i], 0, PAGE_SIZE,
+					DMA_BIDIRECTIONAL, attrs);
+		if (dma_ops->mapping_error(dev, dma)) {
+			__xp_dma_unmap(dma_map, dma_ops, attrs);
 			return -ENOMEM;
 		}
-		if (dma_need_sync(dev, dma))
+		if (dma_ops->need_sync(dev, dma))
 			dma_map->dma_need_sync = true;
 		dma_map->dma_pages[i] = dma;
 	}
@@ -445,7 +458,7 @@  int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 
 	err = xp_init_dma_info(pool, dma_map);
 	if (err) {
-		__xp_dma_unmap(dma_map, attrs);
+		__xp_dma_unmap(dma_map, dma_ops, attrs);
 		return err;
 	}
 
@@ -532,9 +545,9 @@  struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
 	xskb->xdp.data_meta = xskb->xdp.data;
 
 	if (pool->dma_need_sync) {
-		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
-						 pool->frame_len,
-						 DMA_BIDIRECTIONAL);
+		pool->dma_ops.sync_single_range_for_device(pool->dev, xskb->dma, 0,
+							   pool->frame_len,
+							   DMA_BIDIRECTIONAL);
 	}
 	return &xskb->xdp;
 }
@@ -670,15 +683,15 @@  EXPORT_SYMBOL(xp_raw_get_dma);
 
 void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
 {
-	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
-				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
+	xskb->pool->dma_ops.sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
+						      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
 }
 EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
 
 void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
 				 size_t size)
 {
-	dma_sync_single_range_for_device(pool->dev, dma, 0,
-					 size, DMA_BIDIRECTIONAL);
+	pool->dma_ops.sync_single_range_for_device(pool->dev, dma, 0,
+						   size, DMA_BIDIRECTIONAL);
 }
 EXPORT_SYMBOL(xp_dma_sync_for_device_slow);
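
Not part of the patch above, for illustration only: a sketch of how a driver
that cannot go through the DMA API (the virtio-net case motivating this
series) might fill in the new callbacks. The my_* names and the identity
mapping are made up; only struct xsk_dma_ops and
xsk_pool_dma_map_with_ops() come from the patch:

#include <net/xdp_sock_drv.h>

static dma_addr_t my_map_page(struct device *dev, struct page *page,
			      unsigned long offset, size_t size,
			      enum dma_data_direction dir, unsigned long attrs)
{
	/* Hypothetical device without an IOMMU: hand back the physical
	 * address directly.
	 */
	return page_to_phys(page) + offset;
}

static void my_unmap_page(struct device *dev, dma_addr_t dma, size_t size,
			  enum dma_data_direction dir, unsigned long attrs)
{
	/* Nothing to undo for an identity mapping. */
}

static int my_mapping_error(struct device *dev, dma_addr_t dma)
{
	return 0;
}

static bool my_need_sync(struct device *dev, dma_addr_t dma)
{
	/* Returning false keeps pool->dma_need_sync false, so the pool
	 * never invokes the sync callbacks and they can stay unset here.
	 */
	return false;
}

static struct xsk_dma_ops my_xsk_dma_ops = {
	.map_page	= my_map_page,
	.unmap_page	= my_unmap_page,
	.mapping_error	= my_mapping_error,
	.need_sync	= my_need_sync,
};

/* In the driver's XDP setup path, instead of xsk_pool_dma_map():
 *
 *	err = xsk_pool_dma_map_with_ops(pool, dev, &my_xsk_dma_ops, attrs);
 */

Note that .unmap_page must always be provided, since __xp_dma_unmap() calls
it unconditionally; the default branch in xp_dma_map() above never sets it,
which looks like an oversight in the patch as posted.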