mbox series

[v1,0/6] no-copy bvec

Message ID cover.1607976425.git.asml.silence@gmail.com (mailing list archive)
Headers show
Series no-copy bvec | expand

Message

Pavel Begunkov Dec. 15, 2020, 12:20 a.m. UTC
Instead of creating a full copy of iter->bvec into bio in direct I/O,
the patchset makes use of the one provided. It changes semantics and
obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
converts the only place that doesn't.

bio_iov_iter_get_pages() is still does iov_iter_advance(), which is
not great, but neccessary for revert to work. It's desirable to have
a fast version of iov_iter_advance(i, i->count), so we may want to
hack something up for that. E.g. allow to not keep it consistent
in some cases when i->count==0. Also we can add a separate bio pool
without inlined bvec. Very easy to do and shrinks bios from 3 to 2
cachelines.

Also as suggested it removes BIO_WORKINGSET from direct paths: blkdev,
iomap, fs/direct-io. Even though the last one is not very important as
more filesystems are converted to iomap, but still looks hacky. Maybe,
as Johannes mentioned in another thread, moving it to the writeback
code (or other option) would be better in the end. Afterwards?

since RFC:
- add target_core_file patch by Christoph
- make no-copy default behaviour, remove iter flag
- iter_advance() instead of hacks to revert to work
- add bvec iter_advance() optimisation patch
- remove PSI annotations from direct IO (iomap, block and fs/direct)
- note in d/f/porting

Christoph Hellwig (1):
  target/file: allocate the bvec array as part of struct
    target_core_file_cmd

Pavel Begunkov (5):
  iov_iter: optimise bvec iov_iter_advance()
  bio: deduplicate adding a page into bio
  block/psi: remove PSI annotations from direct IO
  bio: add a helper calculating nr segments to alloc
  block/iomap: don't copy bvec for direct IO

 Documentation/filesystems/porting.rst |   9 +++
 block/bio.c                           | 103 ++++++++++++--------------
 drivers/target/target_core_file.c     |  20 ++---
 fs/block_dev.c                        |   7 +-
 fs/direct-io.c                        |   2 +
 fs/iomap/direct-io.c                  |   9 +--
 include/linux/bio.h                   |   9 +++
 lib/iov_iter.c                        |  19 +++++
 8 files changed, 102 insertions(+), 76 deletions(-)

Comments

Ming Lei Dec. 15, 2020, 1:41 a.m. UTC | #1
On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
> Instead of creating a full copy of iter->bvec into bio in direct I/O,
> the patchset makes use of the one provided. It changes semantics and
> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
> converts the only place that doesn't.

Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
length bvec, which may not be supported by block layer or driver, so
this patchset has to address this case first.

Please see 7e24969022cb ("block: allow for_each_bvec to support zero len bvec").


thanks,
Ming
Pavel Begunkov Dec. 15, 2020, 11:14 a.m. UTC | #2
On 15/12/2020 01:41, Ming Lei wrote:
> On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
>> Instead of creating a full copy of iter->bvec into bio in direct I/O,
>> the patchset makes use of the one provided. It changes semantics and
>> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
>> converts the only place that doesn't.
> 
> Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
> length bvec, which may not be supported by block layer or driver, so
> this patchset has to address this case first.

The easiest for me would be to fallback to copy if there are zero bvecs,
e.g. finding such during iov_iter_alignment(), but do we know from where
zero bvecs can came? As it's internals we may want to forbid them if
there is not too much hassle.

> 
> Please see 7e24969022cb ("block: allow for_each_bvec to support zero len bvec").
Ming Lei Dec. 15, 2020, 12:03 p.m. UTC | #3
On Tue, Dec 15, 2020 at 11:14:20AM +0000, Pavel Begunkov wrote:
> On 15/12/2020 01:41, Ming Lei wrote:
> > On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
> >> Instead of creating a full copy of iter->bvec into bio in direct I/O,
> >> the patchset makes use of the one provided. It changes semantics and
> >> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
> >> converts the only place that doesn't.
> > 
> > Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
> > length bvec, which may not be supported by block layer or driver, so
> > this patchset has to address this case first.
> 
> The easiest for me would be to fallback to copy if there are zero bvecs,
> e.g. finding such during iov_iter_alignment(), but do we know from where
> zero bvecs can came? As it's internals we may want to forbid them if
> there is not too much hassle.

You may find clue from the following link:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html


Thanks,
Ming
Pavel Begunkov Dec. 15, 2020, 2:05 p.m. UTC | #4
On 15/12/2020 12:03, Ming Lei wrote:
> On Tue, Dec 15, 2020 at 11:14:20AM +0000, Pavel Begunkov wrote:
>> On 15/12/2020 01:41, Ming Lei wrote:
>>> On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
>>>> Instead of creating a full copy of iter->bvec into bio in direct I/O,
>>>> the patchset makes use of the one provided. It changes semantics and
>>>> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
>>>> converts the only place that doesn't.
>>>
>>> Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
>>> length bvec, which may not be supported by block layer or driver, so
>>> this patchset has to address this case first.
>>
>> The easiest for me would be to fallback to copy if there are zero bvecs,
>> e.g. finding such during iov_iter_alignment(), but do we know from where
>> zero bvecs can came? As it's internals we may want to forbid them if
>> there is not too much hassle.
> 
> You may find clue from the following link:
> 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html

Thanks for the link!

Al, you mentioned "Zero-length segments are not disallowed", do you have
a strong opinion on that? Apart from already diverged behaviour from the
block layer and getting in the way of this series, without it we'd also be
able to remove some extra ifs, e.g. in iterate_bvec()
Christoph Hellwig Dec. 22, 2020, 2:11 p.m. UTC | #5
On Tue, Dec 15, 2020 at 02:05:35PM +0000, Pavel Begunkov wrote:
> > You may find clue from the following link:
> > 
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html
> 
> Thanks for the link!
> 
> Al, you mentioned "Zero-length segments are not disallowed", do you have
> a strong opinion on that? Apart from already diverged behaviour from the
> block layer and getting in the way of this series, without it we'd also be
> able to remove some extra ifs, e.g. in iterate_bvec()

I'd prefer not to support zero-length ITER_BVEC and catching them
early, as the block layer can't deal with them either.  From a quick
look at iter_file_splice_write it should be pretty trivial to fix there,
although we'll need to audit other callers as well (even if I don't
expect them to submit this degenerate case).
Pavel Begunkov Dec. 23, 2020, 12:52 p.m. UTC | #6
On 22/12/2020 14:11, Christoph Hellwig wrote:
> On Tue, Dec 15, 2020 at 02:05:35PM +0000, Pavel Begunkov wrote:
>>> You may find clue from the following link:
>>>
>>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html
>>
>> Thanks for the link!
>>
>> Al, you mentioned "Zero-length segments are not disallowed", do you have
>> a strong opinion on that? Apart from already diverged behaviour from the
>> block layer and getting in the way of this series, without it we'd also be
>> able to remove some extra ifs, e.g. in iterate_bvec()
> 
> I'd prefer not to support zero-length ITER_BVEC and catching them
> early, as the block layer can't deal with them either.  From a quick
> look at iter_file_splice_write it should be pretty trivial to fix there,
> although we'll need to audit other callers as well (even if I don't
> expect them to submit this degenerate case).

Can scatterlist have 0-len entries? Those are directly translated into
bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c.
I've audited most of others by this moment, they're fine.

Thanks for other nits, they will go into next version.
Christoph Hellwig Dec. 23, 2020, 3:51 p.m. UTC | #7
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> Can scatterlist have 0-len entries? Those are directly translated into
> bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c.
> I've audited most of others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither.
I think the same is true for the SCSI target code, but please double
check.
James Bottomley Dec. 23, 2020, 4:04 p.m. UTC | #8
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> > Can scatterlist have 0-len entries? Those are directly translated
> > into bvecs, e.g. in nvme/target/io-cmd-file.c and
> > target/target_core_file.c. I've audited most of others by this
> > moment, they're fine.
> 
> For block layer SGLs we should never see them, and for nvme neither.
> I think the same is true for the SCSI target code, but please double
> check.

Right, no-one ever wants to see a 0-len scatter list entry.  The reason
is that every driver uses the sgl to program the device DMA engine in
the way NVME does.  a 0 length sgl would be a dangerous corner case:
some DMA engines would ignore it and others would go haywire, so if we
ever let a 0 length list down into the driver, they'd have to
understand the corner case behaviour of their DMA engine and filter it
accordingly, which is why we disallow them in the upper levels, since
they're effective nops anyway.

James
Douglas Gilbert Dec. 23, 2020, 8:23 p.m. UTC | #9
On 2020-12-23 11:04 a.m., James Bottomley wrote:
> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>> Can scatterlist have 0-len entries? Those are directly translated
>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>> target/target_core_file.c. I've audited most of others by this
>>> moment, they're fine.
>>
>> For block layer SGLs we should never see them, and for nvme neither.
>> I think the same is true for the SCSI target code, but please double
>> check.
> 
> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
> is that every driver uses the sgl to program the device DMA engine in
> the way NVME does.  a 0 length sgl would be a dangerous corner case:
> some DMA engines would ignore it and others would go haywire, so if we
> ever let a 0 length list down into the driver, they'd have to
> understand the corner case behaviour of their DMA engine and filter it
> accordingly, which is why we disallow them in the upper levels, since
> they're effective nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device)
the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
allow the "number of logical blocks" in their sgl_s to be zero and state
that it is _not_ to be considered an error.

Doug Gilbert
Pavel Begunkov Dec. 23, 2020, 8:32 p.m. UTC | #10
On 23/12/2020 20:23, Douglas Gilbert wrote:
> On 2020-12-23 11:04 a.m., James Bottomley wrote:
>> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>>> Can scatterlist have 0-len entries? Those are directly translated
>>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>>> target/target_core_file.c. I've audited most of others by this
>>>> moment, they're fine.
>>>
>>> For block layer SGLs we should never see them, and for nvme neither.
>>> I think the same is true for the SCSI target code, but please double
>>> check.
>>
>> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
>> is that every driver uses the sgl to program the device DMA engine in
>> the way NVME does.  a 0 length sgl would be a dangerous corner case:
>> some DMA engines would ignore it and others would go haywire, so if we
>> ever let a 0 length list down into the driver, they'd have to
>> understand the corner case behaviour of their DMA engine and filter it
>> accordingly, which is why we disallow them in the upper levels, since
>> they're effective nops anyway.
> 
> When using scatter gather lists at the far end (i.e. on the storage device)
> the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
> allow the "number of logical blocks" in their sgl_s to be zero and state
> that it is _not_ to be considered an error.

It's fine for my case unless it leaks them out of device driver to the
net/block layer/etc. Is it?
Christoph Hellwig Dec. 24, 2020, 6:41 a.m. UTC | #11
On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
> On 23/12/2020 20:23, Douglas Gilbert wrote:
> > On 2020-12-23 11:04 a.m., James Bottomley wrote:
> >> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> >>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> >>>> Can scatterlist have 0-len entries? Those are directly translated
> >>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
> >>>> target/target_core_file.c. I've audited most of others by this
> >>>> moment, they're fine.
> >>>
> >>> For block layer SGLs we should never see them, and for nvme neither.
> >>> I think the same is true for the SCSI target code, but please double
> >>> check.
> >>
> >> Right, no-one ever wants to see a 0-len scatter list entry.?? The reason
> >> is that every driver uses the sgl to program the device DMA engine in
> >> the way NVME does.?? a 0 length sgl would be a dangerous corner case:
> >> some DMA engines would ignore it and others would go haywire, so if we
> >> ever let a 0 length list down into the driver, they'd have to
> >> understand the corner case behaviour of their DMA engine and filter it
> >> accordingly, which is why we disallow them in the upper levels, since
> >> they're effective nops anyway.
> > 
> > When using scatter gather lists at the far end (i.e. on the storage device)
> > the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
> > allow the "number of logical blocks" in their sgl_s to be zero and state
> > that it is _not_ to be considered an error.
> 
> It's fine for my case unless it leaks them out of device driver to the
> net/block layer/etc. Is it?

None of the SCSI Command mentions above are supported by Linux,
nevermind mapped to struct scatterlist.
Douglas Gilbert Dec. 24, 2020, 4:45 p.m. UTC | #12
On 2020-12-24 1:41 a.m., Christoph Hellwig wrote:
> On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
>> On 23/12/2020 20:23, Douglas Gilbert wrote:
>>> On 2020-12-23 11:04 a.m., James Bottomley wrote:
>>>> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>>>>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>>>>> Can scatterlist have 0-len entries? Those are directly translated
>>>>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>>>>> target/target_core_file.c. I've audited most of others by this
>>>>>> moment, they're fine.
>>>>>
>>>>> For block layer SGLs we should never see them, and for nvme neither.
>>>>> I think the same is true for the SCSI target code, but please double
>>>>> check.
>>>>
>>>> Right, no-one ever wants to see a 0-len scatter list entry.?? The reason
>>>> is that every driver uses the sgl to program the device DMA engine in
>>>> the way NVME does.?? a 0 length sgl would be a dangerous corner case:
>>>> some DMA engines would ignore it and others would go haywire, so if we
>>>> ever let a 0 length list down into the driver, they'd have to
>>>> understand the corner case behaviour of their DMA engine and filter it
>>>> accordingly, which is why we disallow them in the upper levels, since
>>>> they're effective nops anyway.
>>>
>>> When using scatter gather lists at the far end (i.e. on the storage device)
>>> the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
>>> allow the "number of logical blocks" in their sgl_s to be zero and state
>>> that it is _not_ to be considered an error.
>>
>> It's fine for my case unless it leaks them out of device driver to the
>> net/block layer/etc. Is it?
> 
> None of the SCSI Command mentions above are supported by Linux,
> nevermind mapped to struct scatterlist.
> 

The POPULATE TOKEN / WRITE USING TOKEN pair can be viewed as a subset
of EXTENDED COPY (SPC-4) which also supports "range descriptors". It is
not clear if target_core_xcopy.c supports these range descriptors but
if it did, it would be trying to map them to struct scatterlist objects.

That said, it would be easy to skip the "number of logical blocks" == 0
case when translating range descriptors to sgl_s.

In my ddpt utility (a dd clone) I have generalized skip= and seek= to
optionally take sgl_s. If the last element in one of those sgl_s is
LBAn,0 then it is interpreted as "until the end of that device" which
is further restricted if the other sgl has a "hard" length or count=
is given. The point being a length of 0 can have meaning, a benefit
lost with NVMe's 0-based counts.

Doug Gilbert
James Bottomley Dec. 24, 2020, 5:30 p.m. UTC | #13
On Wed, 2020-12-23 at 15:23 -0500, Douglas Gilbert wrote:
> On 2020-12-23 11:04 a.m., James Bottomley wrote:
> > On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> > > On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> > > > Can scatterlist have 0-len entries? Those are directly
> > > > translated into bvecs, e.g. in nvme/target/io-cmd-file.c and
> > > > target/target_core_file.c. I've audited most of others by this
> > > > moment, they're fine.
> > > 
> > > For block layer SGLs we should never see them, and for nvme
> > > neither. I think the same is true for the SCSI target code, but
> > > please double check.
> > 
> > Right, no-one ever wants to see a 0-len scatter list entry.  The
> > reason is that every driver uses the sgl to program the device DMA
> > engine in the way NVME does.  a 0 length sgl would be a dangerous
> > corner case: some DMA engines would ignore it and others would go
> > haywire, so if we ever let a 0 length list down into the driver,
> > they'd have to understand the corner case behaviour of their DMA
> > engine and filter it accordingly, which is why we disallow them in
> > the upper levels, since they're effective nops anyway.
> 
> When using scatter gather lists at the far end (i.e. on the storage
> device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-
> 4) explicitly allow the "number of logical blocks" in their sgl_s to
> be zero and state that it is _not_ to be considered an error.

But that's pretty irrelevant.  The scatterlists that block has been
constructing to drive DMA engines pre-date SCSI's addition of SGLs by
decades (all SCSI commands before the object commands use a linear
buffer which is implemented in the HBA engine as a scatterlist but not
described by the SCSI standard as one).

So the answer to the question should the block layer emit zero length
sgl elements is "no" because they can confuse some DMA engines.

If there's a more theoretical question of whether the target driver in
adding commands it doesn't yet support should inject zero length SGL
elements into block because SCSI allows it, the answer is still "no"
because we don't want block to have SGLs that may confuse other DMA
engines.  There's lots of daft corner cases in the SCSI standard we
don't implement and a nop for SGL elements seems to be one of the more
hare brained because it adds no useful feature and merely causes
compatibility issues.

James