[PATCHv3,2/7] file: add ops to dma map bvec

Message ID	20220805162444.3985535-3-kbusch@fb.com (mailing list archive)
State	New
Headers	show Return-Path: <io-uring-owner@kernel.org> From: Keith Busch <kbusch@fb.com> To: <linux-nvme@lists.infradead.org>, <linux-block@vger.kernel.org>, <io-uring@vger.kernel.org>, <linux-fsdevel@vger.kernel.org> CC: <axboe@kernel.dk>, <hch@lst.de>, Alexander Viro <viro@zeniv.linux.org.uk>, Kernel Team <Kernel-team@fb.com>, Keith Busch <kbusch@kernel.org> Subject: [PATCHv3 2/7] file: add ops to dma map bvec Date: Fri, 5 Aug 2022 09:24:39 -0700 Message-ID: <20220805162444.3985535-3-kbusch@fb.com> In-Reply-To: <20220805162444.3985535-1-kbusch@fb.com> References: <20220805162444.3985535-1-kbusch@fb.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Precedence: bulk
Series	dma mapping optimisations \| expand [PATCHv3,0/7] dma mapping optimisations [PATCHv3,1/7] blk-mq: add ops to dma map bvec [PATCHv3,2/7] file: add ops to dma map bvec [PATCHv3,3/7] iov_iter: introduce type for preregistered dma tags [PATCHv3,4/7] block: add dma tag bio type [PATCHv3,5/7] io_uring: introduce file slot release helper [PATCHv3,6/7] io_uring: add support for dma pre-mapping [PATCHv3,7/7] nvme-pci: implement dma_map support

Keith Busch Aug. 5, 2022, 4:24 p.m. UTC

From: Keith Busch <kbusch@kernel.org>

The same buffer may be used for many subsequent IO's. Instead of setting
up the mapping per-IO, provide an interface that can allow a buffer to
be premapped just once and referenced again later, and implement for the
block device file.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c       | 20 ++++++++++++++++++++
 fs/file.c          | 15 +++++++++++++++
 include/linux/fs.h | 20 ++++++++++++++++++++
 3 files changed, 55 insertions(+)

Dave Chinner Aug. 8, 2022, 12:21 a.m. UTC | #1

On Fri, Aug 05, 2022 at 09:24:39AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The same buffer may be used for many subsequent IO's. Instead of setting
> up the mapping per-IO, provide an interface that can allow a buffer to
> be premapped just once and referenced again later, and implement for the
> block device file.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/fops.c       | 20 ++++++++++++++++++++
>  fs/file.c          | 15 +++++++++++++++
>  include/linux/fs.h | 20 ++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 29066ac5a2fa..db2d1e848f4b 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -670,6 +670,22 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  	return error;
>  }
>  
> +#ifdef CONFIG_HAS_DMA
> +void *blkdev_dma_map(struct file *filp, struct bio_vec *bvec, int nr_vecs)
> +{
> +	struct block_device *bdev = filp->private_data;
> +
> +	return block_dma_map(bdev, bvec, nr_vecs);
> +}
> +
> +void blkdev_dma_unmap(struct file *filp, void *dma_tag)
> +{
> +	struct block_device *bdev = filp->private_data;
> +
> +	return block_dma_unmap(bdev, dma_tag);
> +}
> +#endif
> +
>  const struct file_operations def_blk_fops = {
>  	.open		= blkdev_open,
>  	.release	= blkdev_close,
> @@ -686,6 +702,10 @@ const struct file_operations def_blk_fops = {
>  	.splice_read	= generic_file_splice_read,
>  	.splice_write	= iter_file_splice_write,
>  	.fallocate	= blkdev_fallocate,
> +#ifdef CONFIG_HAS_DMA
> +	.dma_map	= blkdev_dma_map,
> +	.dma_unmap	= blkdev_dma_unmap,
> +#endif
>  };
>  
>  static __init int blkdev_init(void)
> diff --git a/fs/file.c b/fs/file.c
> index 3bcc1ecc314a..767bf9d3205e 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1307,3 +1307,18 @@ int iterate_fd(struct files_struct *files, unsigned n,
>  	return res;
>  }
>  EXPORT_SYMBOL(iterate_fd);
> +
> +#ifdef CONFIG_HAS_DMA
> +void *file_dma_map(struct file *file, struct bio_vec *bvec, int nr_vecs)
> +{
> +	if (file->f_op->dma_map)
> +		return file->f_op->dma_map(file, bvec, nr_vecs);
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +void file_dma_unmap(struct file *file, void *dma_tag)
> +{
> +	if (file->f_op->dma_unmap)
> +		return file->f_op->dma_unmap(file, dma_tag);
> +}
> +#endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9f131e559d05..8652bad763f3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2092,6 +2092,10 @@ struct dir_context {
>  struct iov_iter;
>  struct io_uring_cmd;
>  
> +#ifdef CONFIG_HAS_DMA
> +struct bio_vec;
> +#endif
> +
>  struct file_operations {
>  	struct module *owner;
>  	loff_t (*llseek) (struct file *, loff_t, int);
> @@ -2134,6 +2138,10 @@ struct file_operations {
>  				   loff_t len, unsigned int remap_flags);
>  	int (*fadvise)(struct file *, loff_t, loff_t, int);
>  	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> +#ifdef CONFIG_HAS_DMA
> +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> +	void (*dma_unmap)(struct file *, void *);
> +#endif
>  } __randomize_layout;

This just smells wrong. Using a block layer specific construct as a
primary file operation parameter shouts "layering violation" to me.

Indeed, I can't see how this can be used by anything other than a
block device file on a single, stand-alone block device. It's
mapping a region of memory to something that has no file offset or
length associated with it, and the implementation of the callout is
specially pulling the bdev from the private file data.

What we really need is a callout that returns the bdevs that the
struct file is mapped to (one, or many), so the caller can then map
the memory addresses to the block devices itself. The caller then
needs to do an {file, offset, len} -> {bdev, sector, count}
translation so the io_uring code can then use the correct bdev and
dma mappings for the file offset that the user is doing IO to/from.

For a stand-alone block device, the "get bdevs" callout is pretty
simple. single device filesystems are trivial, too. XFS is trivial -
it will return 1 or 2 block devices. stacked bdevs need to iterate
recursively, as would filesystems like btrfs. Still, pretty easy,
and for the case you care about here has almost zero overhead.

Now you have a list of all the bdevs you are going to need to add
dma mappings for, and you can call the bdev directly to set them up.
THere is no need what-so-ever to do this through through the file
operations layer - it's completely contained at the block device
layer and below.

Then, for each file IO range, we need a mapping callout in the file
operations structure. That will take a  {file, offset, len} tuple
and return a {bdev, sector, count} tuple that maps part or all of
the file data.

Again, for a standalone block device, this is simply a translation
of filep->private to bdev, and offset,len from byte counts to sector
counts. Trival, almost no overhead at all.

For filesystems and stacked block devices, though, this gives you
back all the information you need to select the right set of dma
buffers and the {sector, count} information you need to issue the IO
correctly. Setting this up is now all block device layer
manipulation.[*]

This is where I think this patchset needs to go, not bulldoze
through abstractions that get in the way because all you are
implementing is a special fast path for a single niche use case. We
know how to make it work with filesystems and stacked devices, so
can we please start with an API that allows us to implement the
functionality without having to completely rewrite all the code that
you are proposing to add right now?

Cheers,

Dave.

[*] For the purposes of brevity, I'm ignoring the elephant in the
middle of the room: how do you ensure that the filesystem doesn't
run a truncate or hole punch while you have an outstanding DMA
mapping and io_uring is doing IO direct to file offset via that
mapping? i.e. how do you prevent such a non-filesystem controlled IO
path from accessing to stale data (i.e.  use-after-free) of on-disk
storage because there is nothing serialising it against other
filesystem operations?

This is very similar to the direct storage access issues that
DAX+RDMA and pNFS file layouts have. pNFS solves it with file layout
leases, and DAX has all the hooks into the filesystems needed to use
file layout leases in place, too. I'd suggest that IO via io_uring
persistent DMA mappings outside the scope of the filesysetm
controlled IO path also need layout lease guarantees to avoid user
IO from racing with truncate, etc....

Matthew Wilcox Aug. 8, 2022, 1:13 a.m. UTC | #2

On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > +#ifdef CONFIG_HAS_DMA
> > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > +	void (*dma_unmap)(struct file *, void *);
> > +#endif
> 
> This just smells wrong. Using a block layer specific construct as a
> primary file operation parameter shouts "layering violation" to me.

A bio_vec is also used for networking; it's in disguise as an skb_frag,
but it's there.

> What we really need is a callout that returns the bdevs that the
> struct file is mapped to (one, or many), so the caller can then map
> the memory addresses to the block devices itself. The caller then
> needs to do an {file, offset, len} -> {bdev, sector, count}
> translation so the io_uring code can then use the correct bdev and
> dma mappings for the file offset that the user is doing IO to/from.

I don't even know if what you're proposing is possible.  Consider a
network filesystem which might transparently be moved from one network
interface to another.  I don't even know if the filesystem would know
which network device is going to be used for the IO at the time of
IO submission.

I think a totally different model is needed where we can find out if
the bvec contains pages which are already mapped to the device, and map
them if they aren't.  That also handles a DM case where extra devices
are hot-added to a RAID, for example.

Dave Chinner Aug. 8, 2022, 2:15 a.m. UTC | #3

On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > +#ifdef CONFIG_HAS_DMA
> > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > +	void (*dma_unmap)(struct file *, void *);
> > > +#endif
> > 
> > This just smells wrong. Using a block layer specific construct as a
> > primary file operation parameter shouts "layering violation" to me.
> 
> A bio_vec is also used for networking; it's in disguise as an skb_frag,
> but it's there.

Which is just as awful. Just because it's done somewhere else
doesn't make it right.

> > What we really need is a callout that returns the bdevs that the
> > struct file is mapped to (one, or many), so the caller can then map
> > the memory addresses to the block devices itself. The caller then
> > needs to do an {file, offset, len} -> {bdev, sector, count}
> > translation so the io_uring code can then use the correct bdev and
> > dma mappings for the file offset that the user is doing IO to/from.
> 
> I don't even know if what you're proposing is possible.  Consider a
> network filesystem which might transparently be moved from one network
> interface to another.  I don't even know if the filesystem would know
> which network device is going to be used for the IO at the time of
> IO submission.

Sure, but nobody is suggesting we support direct DMA buffer mapping
and reuse for network devices right now, whereas we have working
code for block devices in front of us.

What I want to see is broad-based generic block device based
filesysetm support, not niche functionality that can only work on a
single type of block device. Network filesystems and devices are a
*long* way from being able to do anything like this, so I don't see
a need to cater for them at this point in time.

When someone has a network device abstraction and network filesystem
that can do direct data placement based on that device abstraction,
then we can talk about the high level interface we should use to
drive it....

> I think a totally different model is needed where we can find out if
> the bvec contains pages which are already mapped to the device, and map
> them if they aren't.  That also handles a DM case where extra devices
> are hot-added to a RAID, for example.

I cannot form a picture of what you are suggesting from such a brief
description. Care to explain in more detail?

Cheers,

Dave.

Matthew Wilcox Aug. 8, 2022, 2:49 a.m. UTC | #4

On Mon, Aug 08, 2022 at 12:15:01PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> > On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > > +#ifdef CONFIG_HAS_DMA
> > > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > > +	void (*dma_unmap)(struct file *, void *);
> > > > +#endif
> > > 
> > > This just smells wrong. Using a block layer specific construct as a
> > > primary file operation parameter shouts "layering violation" to me.
> > 
> > A bio_vec is also used for networking; it's in disguise as an skb_frag,
> > but it's there.
> 
> Which is just as awful. Just because it's done somewhere else
> doesn't make it right.
> 
> > > What we really need is a callout that returns the bdevs that the
> > > struct file is mapped to (one, or many), so the caller can then map
> > > the memory addresses to the block devices itself. The caller then
> > > needs to do an {file, offset, len} -> {bdev, sector, count}
> > > translation so the io_uring code can then use the correct bdev and
> > > dma mappings for the file offset that the user is doing IO to/from.
> > 
> > I don't even know if what you're proposing is possible.  Consider a
> > network filesystem which might transparently be moved from one network
> > interface to another.  I don't even know if the filesystem would know
> > which network device is going to be used for the IO at the time of
> > IO submission.
> 
> Sure, but nobody is suggesting we support direct DMA buffer mapping
> and reuse for network devices right now, whereas we have working
> code for block devices in front of us.

But we have working code already (merged) in the networking layer for
reusing pages that are mapped to particular devices.

> What I want to see is broad-based generic block device based
> filesysetm support, not niche functionality that can only work on a
> single type of block device. Network filesystems and devices are a
> *long* way from being able to do anything like this, so I don't see
> a need to cater for them at this point in time.
> 
> When someone has a network device abstraction and network filesystem
> that can do direct data placement based on that device abstraction,
> then we can talk about the high level interface we should use to
> drive it....
> 
> > I think a totally different model is needed where we can find out if
> > the bvec contains pages which are already mapped to the device, and map
> > them if they aren't.  That also handles a DM case where extra devices
> > are hot-added to a RAID, for example.
> 
> I cannot form a picture of what you are suggesting from such a brief
> description. Care to explain in more detail?

Let's suppose you have a RAID 5 of NVMe devices.  One fails and now
the RAID-5 is operating in degraded mode.  So you hot-unplug the failed
device, plug in a new NVMe drive and add it to the RAID.  The pages now
need to be DMA mapped to that new PCI device.

What I'm saying is that the set of devices that the pages need to be
mapped to is not static and cannot be known at "setup time", even given
the additional information that you were proposing earlier in this thread.
It has to be dynamically adjusted.

Dave Chinner Aug. 8, 2022, 7:31 a.m. UTC | #5

On Mon, Aug 08, 2022 at 03:49:09AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 12:15:01PM +1000, Dave Chinner wrote:
> > On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> > > On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > > > +#ifdef CONFIG_HAS_DMA
> > > > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > > > +	void (*dma_unmap)(struct file *, void *);
> > > > > +#endif
> > > > 
> > > > This just smells wrong. Using a block layer specific construct as a
> > > > primary file operation parameter shouts "layering violation" to me.
> > > 
> > > A bio_vec is also used for networking; it's in disguise as an skb_frag,
> > > but it's there.
> > 
> > Which is just as awful. Just because it's done somewhere else
> > doesn't make it right.
> > 
> > > > What we really need is a callout that returns the bdevs that the
> > > > struct file is mapped to (one, or many), so the caller can then map
> > > > the memory addresses to the block devices itself. The caller then
> > > > needs to do an {file, offset, len} -> {bdev, sector, count}
> > > > translation so the io_uring code can then use the correct bdev and
> > > > dma mappings for the file offset that the user is doing IO to/from.
> > > 
> > > I don't even know if what you're proposing is possible.  Consider a
> > > network filesystem which might transparently be moved from one network
> > > interface to another.  I don't even know if the filesystem would know
> > > which network device is going to be used for the IO at the time of
> > > IO submission.
> > 
> > Sure, but nobody is suggesting we support direct DMA buffer mapping
> > and reuse for network devices right now, whereas we have working
> > code for block devices in front of us.
> 
> But we have working code already (merged) in the networking layer for
> reusing pages that are mapped to particular devices.

Great! How is it hooked up to the network filesystems? I'm kinda
betting that it isn't at all - it's the kernel bypass paths that use
these device based mappings, right? And the user applications are
bound directly to the devices, unlike network filesytsems?

> > What I want to see is broad-based generic block device based
> > filesysetm support, not niche functionality that can only work on a
> > single type of block device. Network filesystems and devices are a
> > *long* way from being able to do anything like this, so I don't see
> > a need to cater for them at this point in time.
> > 
> > When someone has a network device abstraction and network filesystem
> > that can do direct data placement based on that device abstraction,
> > then we can talk about the high level interface we should use to
> > drive it....
> > 
> > > I think a totally different model is needed where we can find out if
> > > the bvec contains pages which are already mapped to the device, and map
> > > them if they aren't.  That also handles a DM case where extra devices
> > > are hot-added to a RAID, for example.
> > 
> > I cannot form a picture of what you are suggesting from such a brief
> > description. Care to explain in more detail?
> 
> Let's suppose you have a RAID 5 of NVMe devices.  One fails and now
> the RAID-5 is operating in degraded mode.

Yes, but this is purely an example of a stacked device type that
requires fined grained mapping of data offset to block device and
offset. When the device fails, it just doesn't return a data mapping
that points to the failed device.

> So you hot-unplug the failed
> device, plug in a new NVMe drive and add it to the RAID.  The pages now
> need to be DMA mapped to that new PCI device.

yup, and now the dma tags for the mappings to that sub-device return
errors, which then tell the application that it needs to remap the
dma buffers it is using.

That's just bog standard error handling - if a bdev goes away,
access to the dma tags have to return IO errors, and it is up to the
application level (i.e. the io_uring code) to handle that sanely.

> What I'm saying is that the set of devices that the pages need to be
> mapped to is not static and cannot be known at "setup time", even given
> the additional information that you were proposing earlier in this thread.
> It has to be dynamically adjusted.

Sure, I'm assuming that IOs based on dma tags will fail if there's a
bdev issue.  The DMA mappings have to be set up somewhere, and it
has to be done before the IO is started. That means there has to be
"discovery" done at "setup time", and if there's an issue between
setup and IO submission, then an error will result and the IO setup
code is going to have to handle that. I can't see how this would
work any other way....

Cheers,

Dave.

Pavel Begunkov Aug. 8, 2022, 10:14 a.m. UTC | #6

On 8/8/22 03:15, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
>> On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
>>>> +#ifdef CONFIG_HAS_DMA
>>>> +	void *(*dma_map)(struct file *, struct bio_vec *, int);
>>>> +	void (*dma_unmap)(struct file *, void *);
>>>> +#endif
>>>
>>> This just smells wrong. Using a block layer specific construct as a
>>> primary file operation parameter shouts "layering violation" to me.
>>
>> A bio_vec is also used for networking; it's in disguise as an skb_frag,
>> but it's there.
> 
> Which is just as awful. Just because it's done somewhere else
> doesn't make it right.
> 
>>> What we really need is a callout that returns the bdevs that the
>>> struct file is mapped to (one, or many), so the caller can then map
>>> the memory addresses to the block devices itself. The caller then
>>> needs to do an {file, offset, len} -> {bdev, sector, count}
>>> translation so the io_uring code can then use the correct bdev and
>>> dma mappings for the file offset that the user is doing IO to/from.
>>
>> I don't even know if what you're proposing is possible.  Consider a
>> network filesystem which might transparently be moved from one network
>> interface to another.  I don't even know if the filesystem would know
>> which network device is going to be used for the IO at the time of
>> IO submission.
> 
> Sure, but nobody is suggesting we support direct DMA buffer mapping
> and reuse for network devices right now, whereas we have working
> code for block devices in front of us.

Networking is not so far away, with zerocopy tx landed the next target
is peer-to-peer, i.e. transfers from a device memory. It's nothing
new and was already tried out quite some time ago, but to be fair,
it's not ready yet as this patchset. In any case, they have to use
common infra, which means we can't rely on struct block_device.

The first idea was to have a callback returning a struct device
pointer and failing when the file can have multiple devices or change
them on the fly. Networking already has a hook to assign a device to
a socket, we just need to make it's immutable after the assignment.
 From the userspace perspective, if host memory mapping failed it can
be re-registered as a normal io_uring registered buffer with no change
in the API on the submission side.

I like the idea to reserve ranges in the API for future use, but
as I understand it, io_uring would need to do device lookups based on
the I/O offset, which doesn't sound fast and I'm not convinced we want
to go this way now. Could work if the specified range covers only one
device but needs knowledge of how it's chunked and doesn't go well
when devices alternate every 4KB or so.

Another question is whether we want to have some kind of notion of
device groups so the userspace doesn't have to register a buffer
multiple times when the mapping can be shared b/w files.

> What I want to see is broad-based generic block device based
> filesysetm support, not niche functionality that can only work on a
> single type of block device. Network filesystems and devices are a
> *long* way from being able to do anything like this, so I don't see
> a need to cater for them at this point in time.
> 
> When someone has a network device abstraction and network filesystem
> that can do direct data placement based on that device abstraction,
> then we can talk about the high level interface we should use to
> drive it....
> 
>> I think a totally different model is needed where we can find out if
>> the bvec contains pages which are already mapped to the device, and map
>> them if they aren't.  That also handles a DM case where extra devices
>> are hot-added to a RAID, for example.
> 
> I cannot form a picture of what you are suggesting from such a brief
> description. Care to explain in more detail?

Keith Busch Aug. 8, 2022, 3:28 p.m. UTC | #7

On Mon, Aug 08, 2022 at 05:31:34PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 03:49:09AM +0100, Matthew Wilcox wrote:
> 
> > So you hot-unplug the failed
> > device, plug in a new NVMe drive and add it to the RAID.  The pages now
> > need to be DMA mapped to that new PCI device.
> 
> yup, and now the dma tags for the mappings to that sub-device return
> errors, which then tell the application that it needs to remap the
> dma buffers it is using.
> 
> That's just bog standard error handling - if a bdev goes away,
> access to the dma tags have to return IO errors, and it is up to the
> application level (i.e. the io_uring code) to handle that sanely.

I didn't think anyone should see IO errors in such scenarios. This feature is
more of an optional optimization, and everything should work as it does today
if a tag becomes invalid.

For md raid or multi-device filesystem, I imagined this would return dma tag
that demuxes to dma tags of the member devices. If any particular member device
doesn't have a dma tag for whatever reason, the filesystem or md would
transparently fall back to the registered bvec that it currently uses when it
needs to do IO to that device.

If you do a RAID hot-swap, MD could request a new dma tag for the new device
without io_uring knowing about the event. MD can continue servicing new IO
referencing its dma tag, and use the new device's tag only once the setup is
complete.

I'm not familiar enough with the networking side, but I thought the file level
abstraction would allow similar handling without io_uring's knowledge.

[PATCHv3,2/7] file: add ops to dma map bvec

Commit Message

Comments

Patch