[RFC,0/9] block: add llseek(SEEK_HOLE/SEEK_DATA) support

Message ID 20240328203910.2370087-1-stefanha@redhat.com

Message

Stefan Hajnoczi March 28, 2024, 8:39 p.m. UTC
cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
This can speed up the process by reducing the amount of data read and it
preserves sparseness when writing to the output file.
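
As a rough illustration of the technique described above, the following Python sketch walks a file and records where its data and holes are. The helper name `map_extents` is hypothetical and not taken from the series; it assumes Linux, where `os.SEEK_DATA` and `os.SEEK_HOLE` are available, and assumes `size` is the current size of the file.

```python
import errno
import os


def map_extents(fd, size):
    """Return a list of (kind, start, end) tuples covering [0, size).

    kind is "data" or "hole". Files without hole tracking typically
    come back as a single data extent.
    """
    extents = []
    offset = 0
    while offset < size:
        try:
            data_start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno != errno.ENXIO:
                raise
            # ENXIO: no data at or after offset, so the rest is a hole.
            extents.append(("hole", offset, size))
            break
        if data_start > offset:
            extents.append(("hole", offset, data_start))
        # There is an implicit hole at end of file, so this finds an end.
        data_end = min(os.lseek(fd, data_start, os.SEEK_HOLE), size)
        extents.append(("data", data_start, data_end))
        offset = data_end
    return extents
```

A caller would open the file, take the size from `os.fstat(fd).st_size`, and then read only the ranges marked "data".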

This patch series is an initial attempt at implementing
llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
approach and suggestions for resolving the open issues.

In the block device world there are similar concepts to holes:
- SCSI has Logical Block Provisioning where the "mapped" state would be
  considered data and other states would be considered holes.
- NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
- Linux loop block devices and dm-linear targets can pass through queries to
  the backing file.
- dm-thin targets can query metadata to find holes.
- ...and you may be able to think of more examples.

Therefore it is possible to offer this functionality in block drivers.

In my use case a QEMU process in userspace copies the contents of a dm-thin
target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets
without this patch series.
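
To make the copy use case concrete, here is a minimal sketch of a sparse-aware copy loop. It is not QEMU's implementation; `sparse_copy` and its parameters are hypothetical names, and the sketch assumes the destination is a freshly created, empty regular file (so that unwritten ranges read as zeroes and `os.ftruncate` is permitted).

```python
import errno
import os


def sparse_copy(src_fd, dst_fd, size, chunk=1 << 20):
    """Copy size bytes from src_fd to dst_fd, reading only data extents.

    Returns the number of bytes actually read and written.
    """
    # Size the destination first so that trailing holes are preserved.
    os.ftruncate(dst_fd, size)
    copied = 0
    offset = 0
    while offset < size:
        try:
            start = os.lseek(src_fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # only a hole remains
            raise
        end = min(os.lseek(src_fd, start, os.SEEK_HOLE), size)
        pos = start
        while pos < end:
            buf = os.pread(src_fd, min(chunk, end - pos), pos)
            if not buf:
                break  # unexpected end of input; stop rather than spin
            view = memoryview(buf)
            written = 0
            while written < len(view):
                # pwrite may write fewer bytes than requested.
                written += os.pwrite(dst_fd, view[written:], pos + written)
            pos += len(buf)
            copied += len(buf)
        offset = end
    return copied
```

On a source that reports no holes, the loop degrades to a plain full copy, which matches the "entire device is data" fallback.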

Others have also wished for block device support for SEEK_HOLE. Here is an open
issue from the BorgBackup project:
https://github.com/borgbackup/borg/issues/5609

With these patches userspace can identify holes in loop, dm-linear, and dm-thin
devices. This is done by adding a seek_hole_data() callback to struct
block_device_operations. When the callback is NULL the entire device is
considered data. Device-mapper is extended along the same lines so that targets
can provide seek_hole_data() callbacks.
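
The dispatch described above can be modelled in a few lines of userspace Python. This is only a model of the stated behaviour, not the kernel code from the series: `blkdev_seek_model` is a hypothetical name, `None` stands in for a NULL callback, and returning ENXIO for offsets at or beyond the end of the device is an assumption borrowed from how lseek(2) behaves on regular files.

```python
import errno
import os


def blkdev_seek_model(seek_hole_data, dev_size, offset, whence):
    """Model of SEEK_HOLE/SEEK_DATA dispatch for a block device.

    seek_hole_data is a callable (offset, whence) -> offset, or None
    when the driver provides no callback.
    """
    if offset < 0 or offset >= dev_size:
        raise OSError(errno.ENXIO, os.strerror(errno.ENXIO))
    if seek_hole_data is not None:
        return seek_hole_data(offset, whence)
    # No callback: the whole device is data, so the only hole is the
    # implicit one at the end of the device.
    if whence == os.SEEK_DATA:
        return offset
    if whence == os.SEEK_HOLE:
        return dev_size
    raise OSError(errno.EINVAL, os.strerror(errno.EINVAL))
```

With no callback, `SEEK_DATA` returns the offset it was given and `SEEK_HOLE` returns the device size, so existing callers see one large data extent.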

I'm unfamiliar with much of this code and have probably missed locking
requirements. Since llseek() executes synchronously like ioctl() and is not an
asynchronous I/O request it's possible that my changes to the loop block driver
and dm-thin are broken (e.g. what if the loop device fd is changed during
llseek()?).

To run the tests:

  # make TARGETS=block_seek_hole -C tools/testing/selftests run_tests

The code is also available here:
https://gitlab.com/stefanha/linux/-/tree/block-seek-hole

Please take a look and let me know your thoughts. Thanks!

Stefan Hajnoczi (9):
  block: add llseek(SEEK_HOLE/SEEK_DATA) support
  loop: add llseek(SEEK_HOLE/SEEK_DATA) support
  selftests: block_seek_hole: add loop block driver tests
  dm: add llseek(SEEK_HOLE/SEEK_DATA) support
  selftests: block_seek_hole: add dm-zero test
  dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support
  selftests: block_seek_hole: add dm-linear test
  dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support
  selftests: block_seek_hole: add dm-thin test

 tools/testing/selftests/Makefile              |   1 +
 .../selftests/block_seek_hole/Makefile        |  17 +++
 include/linux/blkdev.h                        |   7 ++
 include/linux/device-mapper.h                 |   5 +
 block/fops.c                                  |  43 ++++++-
 drivers/block/loop.c                          |  36 +++++-
 drivers/md/dm-linear.c                        |  25 ++++
 drivers/md/dm-thin.c                          |  77 ++++++++++++
 drivers/md/dm.c                               |  68 ++++++++++
 .../testing/selftests/block_seek_hole/config  |   3 +
 .../selftests/block_seek_hole/dm_thin.sh      |  80 ++++++++++++
 .../selftests/block_seek_hole/dm_zero.sh      |  31 +++++
 .../selftests/block_seek_hole/map_holes.py    |  37 ++++++
 .../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++
 14 files changed, 540 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
 create mode 100644 tools/testing/selftests/block_seek_hole/config
 create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
 create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
 create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
 create mode 100755 tools/testing/selftests/block_seek_hole/test.py

Comments

Eric Blake March 28, 2024, 10:16 p.m. UTC | #1
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.

As a minor point of clarity (perhaps as much for my own records for
documenting research I've done over the years, and not necessarily
something you need to change in the commit message):

Userspace apps generally use lseek(2) from glibc or similar (perhaps
via its alias lseek64(), depending on whether userspace is using large
file offsets), rather than directly calling the _llseek() syscall.
But it all boils down to the same notion of seeking information about
various special offsets.

Also, historically, coreutils cp(1) and dd(1) did experiment with
using FS_IOC_FIEMAP ioctls when SEEK_HOLE was not available, but it
proved to cause more problems than it solved, so that is not currently
in favor.  Yes, we could teach more and more block devices to expose
specific ioctls for querying sparseness boundaries, and then teach
userspace apps a list of ioctls to try; but as cp(1) already learned,
having one common interface is much easier than an ever-growing ioctl
ladder to be copied across every client that would benefit from
knowing where the unallocated portions are.

> This can speed up the process by reducing the amount of data read and it
> preserves sparseness when writing to the output file.
> 
> This patch series is an initial attempt at implementing
> llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
> approach and suggestions for resolving the open issues.

One of your open issues was whether adjusting the offset of the block
device itself should also adjust the file offset of the underlying
file (at least in the case of loopback and dm-linear).  What would the
community think of adding two additional constants to the entire
family of *seek() functions, that have the effect of returning the
same offset as their SEEK_HOLE/SEEK_DATA counterparts but without
moving the file offset?

Explaining the idea by example, although I'm not stuck on these names:
suppose you have an fd visiting a file description of 2MiB in size,
with the first 1MiB being a hole and the second being data.

#define MiB (1024*1024)
lseek64(fd, MiB, SEEK_SET); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_HOLE); // returns 0, file offset changed to 0
lseek64(fd, 0, SEEK_DATA); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_PEEK_HOLE); // returns 0, but file offset left at MiB
lseek64(fd, 0, SEEK_SET); // returns 0, file offset changed to 0
lseek64(fd, 0, SEEK_PEEK_DATA); // returns MiB, but file offset left at 0

With semantics like that, it might be easier to implement just
SEEK_PEEK* in devices (don't worry about changing offsets, just about
reporting where the requested offset is), and then have a common layer
do the translation from llseek(...,offs,SEEK_HOLE) into a 2-step
llseek(...,llseek(...,offs,SEEK_PEEK_HOLE),SEEK_SET) if that makes life
easier under the hood.
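
The proposed split between "peek" and "seek" can be sketched with a toy model in Python. Nothing here is an existing API: `SparseFileModel` and its method names are hypothetical, the SEEK_PEEK_* constants are only the suggestion made above, and the model assumes non-overlapping data extents that lie within the file size.

```python
import errno


class SparseFileModel:
    """Toy file description: a size, data extents, and a file offset."""

    def __init__(self, size, data_extents):
        self.size = size
        self.data = sorted(data_extents)  # list of (start, end) pairs
        self.offset = 0

    def seek_set(self, offs):
        self.offset = offs
        return offs

    def peek_data(self, offs):
        """First data byte at or after offs; the file offset is untouched."""
        for start, end in self.data:
            if offs < end:
                return max(offs, start)
        raise OSError(errno.ENXIO, "no data at or after offset")

    def peek_hole(self, offs):
        """First hole at or after offs; the file offset is untouched."""
        if offs >= self.size:
            raise OSError(errno.ENXIO, "offset beyond end of file")
        for start, end in self.data:
            if start <= offs < end:
                offs = end  # skip past this data extent
        return offs  # equals size when data runs to the end

    # The common layer: seek = peek, then SEEK_SET to the result.
    def seek_hole(self, offs):
        return self.seek_set(self.peek_hole(offs))

    def seek_data(self, offs):
        return self.seek_set(self.peek_data(offs))
```

In this arrangement a driver would only need to supply the two peek operations, and the offset bookkeeping lives in one place.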

> 
> In the block device world there are similar concepts to holes:
> - SCSI has Logical Block Provisioning where the "mapped" state would be
>   considered data and other states would be considered holes.

BIG caveat here: the SCSI spec does not necessarily guarantee that
unmapped regions read as all zeroes; compare the difference between
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE.  While lseek(SEEK_HOLE)
on a regular file guarantees that future read() in that hole will see
NUL bytes, I'm not sure whether we want to make that guarantee for
block devices.  This may be yet another case where we might want to
add new SEEK_* constants to the *seek() family of functions that lets
the caller indicate whether they want offsets that are guaranteed to
read as zero, vs. merely offsets that are not allocated but may or may
not read as zero.  Skipping unallocated portions, even when you don't
know if the contents reliably read as zero, is still a useful goal in
some userspace programs.

> - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.

However, utilizing it in nbd.ko would require teaching the kernel to
handle structured or extended headers (right now, that is an extension
only supported in user-space implementations of the NBD protocol).  I
can see why you did not tackle that in this RFC series, even though
you mention it in the cover letter.

> - Linux loop block devices and dm-linear targets can pass through queries to
>   the backing file.
> - dm-thin targets can query metadata to find holes.
> - ...and you may be able to think of more examples.
> 
> Therefore it is possible to offer this functionality in block drivers.
> 
> In my use case a QEMU process in userspace copies the contents of a dm-thin
> target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets
> without this patch series.
> 
> Others have also wished for block device support for SEEK_HOLE. Here is an open
> issue from the BorgBackup project:
> https://github.com/borgbackup/borg/issues/5609
> 
> With these patches userspace can identify holes in loop, dm-linear, and dm-thin
> devices. This is done by adding a seek_hole_data() callback to struct
> block_device_operations. When the callback is NULL the entire device is
> considered data. Device-mapper is extended along the same lines so that targets
> can provide seek_hole_data() callbacks.
> 
> I'm unfamiliar with much of this code and have probably missed locking
> requirements. Since llseek() executes synchronously like ioctl() and is not an
> asynchronous I/O request it's possible that my changes to the loop block driver
> and dm-thin are broken (e.g. what if the loop device fd is changed during
> llseek()?).
> 
> To run the tests:
> 
>   # make TARGETS=block_seek_hole -C tools/testing/selftests run_tests
> 
> The code is also available here:
> https://gitlab.com/stefanha/linux/-/tree/block-seek-hole
> 
> Please take a look and let me know your thoughts. Thanks!
> 
> Stefan Hajnoczi (9):
>   block: add llseek(SEEK_HOLE/SEEK_DATA) support
>   loop: add llseek(SEEK_HOLE/SEEK_DATA) support
>   selftests: block_seek_hole: add loop block driver tests
>   dm: add llseek(SEEK_HOLE/SEEK_DATA) support
>   selftests: block_seek_hole: add dm-zero test
>   dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support
>   selftests: block_seek_hole: add dm-linear test
>   dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support
>   selftests: block_seek_hole: add dm-thin test
> 
>  tools/testing/selftests/Makefile              |   1 +
>  .../selftests/block_seek_hole/Makefile        |  17 +++
>  include/linux/blkdev.h                        |   7 ++
>  include/linux/device-mapper.h                 |   5 +
>  block/fops.c                                  |  43 ++++++-
>  drivers/block/loop.c                          |  36 +++++-
>  drivers/md/dm-linear.c                        |  25 ++++
>  drivers/md/dm-thin.c                          |  77 ++++++++++++
>  drivers/md/dm.c                               |  68 ++++++++++
>  .../testing/selftests/block_seek_hole/config  |   3 +
>  .../selftests/block_seek_hole/dm_thin.sh      |  80 ++++++++++++
>  .../selftests/block_seek_hole/dm_zero.sh      |  31 +++++
>  .../selftests/block_seek_hole/map_holes.py    |  37 ++++++
>  .../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++
>  14 files changed, 540 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
>  create mode 100644 tools/testing/selftests/block_seek_hole/config
>  create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
>  create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
>  create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
>  create mode 100755 tools/testing/selftests/block_seek_hole/test.py
> 
> -- 
> 2.44.0
>
Eric Blake March 28, 2024, 10:29 p.m. UTC | #2
Replying to myself,

On Thu, Mar 28, 2024 at 05:17:18PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
> 
> > 
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> >   considered data and other states would be considered holes.
> 
> BIG caveat here: the SCSI spec does not necessarily guarantee that
> unmapped regions read as all zeroes; compare the difference between
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE.  While lseek(SEEK_HOLE)
> on a regular file guarantees that future read() in that hole will see
> NUL bytes, I'm not sure whether we want to make that guarantee for
> block devices.  This may be yet another case where we might want to
> add new SEEK_* constants to the *seek() family of functions that lets
> the caller indicate whether they want offsets that are guaranteed to
> read as zero, vs. merely offsets that are not allocated but may or may
> not read as zero.  Skipping unallocated portions, even when you don't
> know if the contents reliably read as zero, is still a useful goal in
> some userspace programs.
> 
> > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.

The upstream NBD spec[1] took the time to represent two bits of
information per extent, _because_ of the knowledge that not all SCSI
devices with TRIM support actually guarantee a read of zeroes after
trimming.  That is, NBD chose to convey both:

NBD_STATE_HOLE: 1<<0 if region is unallocated, 0 if region has not been trimmed
NBD_STATE_ZERO: 1<<1 if region reads as zeroes, 0 if region contents might be nonzero

It is always safe to describe an extent as value 0 (both bits clear),
whether or not lseek(SEEK_DATA) returns the same offset; meanwhile,
traditional lseek(SEEK_HOLE) on filesystems generally translates to a
status of 3 (both bits set), but as it is (sometimes) possible to
determine that allocated data still reads as zero, or that unallocated
data may not necessarily read as zero, it is also possible to
implement NBD servers that don't report both bits in parallel.

If we are going to enhance llseek(2) to expose more information about
underlying block devices, possibly by adding more SEEK_ constants for
use in the entire family of *seek() API, it may be worth thinking
about whether it is worth userspace being able to query this
additional distinction between unallocated vs reads-as-zero.
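
For readers who want the two-bit encoding spelled out, here is a small Python sketch that decodes a status value using the bit positions given above. The helper names `describe_extent` and `can_skip_read` are hypothetical; only the two constants come from the NBD spec as quoted in this message.

```python
NBD_STATE_HOLE = 1 << 0  # set if the region is unallocated
NBD_STATE_ZERO = 1 << 1  # set if the region reads as zeroes


def describe_extent(flags):
    """Return (allocated, reads_as_zero) for one extent status value."""
    allocated = not (flags & NBD_STATE_HOLE)
    reads_as_zero = bool(flags & NBD_STATE_ZERO)
    return allocated, reads_as_zero


def can_skip_read(flags):
    """A copier may skip reading only when zeroes are guaranteed."""
    return bool(flags & NBD_STATE_ZERO)
```

A status of 3 corresponds to a traditional file hole, while a status of `NBD_STATE_HOLE` alone is the trimmed-but-not-zero case, which a copier must still read.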

[1] https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md#baseallocation-metadata-context
Stefan Hajnoczi March 28, 2024, 11:09 p.m. UTC | #3
On Thu, Mar 28, 2024 at 05:16:54PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > This can speed up the process by reducing the amount of data read and it
> > preserves sparseness when writing to the output file.
> > 
> > This patch series is an initial attempt at implementing
> > llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
> > approach and suggestions for resolving the open issues.
> 
> One of your open issues was whether adjusting the offset of the block
> device itself should also adjust the file offset of the underlying
> file (at least in the case of loopback and dm-linear).  What would the

Only the loop block driver has this issue. The dm-linear driver uses
blkdev_seek_hole_data(), which does not update the file offset because
it operates on a struct block_device instead of a struct file.

> 
> > 
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> >   considered data and other states would be considered holes.
> 
> BIG caveat here: the SCSI spec does not necessarily guarantee that
> unmapped regions read as all zeroes; compare the difference between
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE.  While lseek(SEEK_HOLE)
> on a regular file guarantees that future read() in that hole will see
> NUL bytes, I'm not sure whether we want to make that guarantee for
> block devices.  This may be yet another case where we might want to
> add new SEEK_* constants to the *seek() family of functions that lets
> the caller indicate whether they want offsets that are guaranteed to
> read as zero, vs. merely offsets that are not allocated but may or may
> not read as zero.  Skipping unallocated portions, even when you don't
> know if the contents reliably read as zero, is still a useful goal in
> some userspace programs.

SCSI initiators can check the Logical Block Provisioning Read Zeroes
(LBPRZ) field to determine whether or not zeroes are guaranteed. The sd
driver would only rely on the device when LBPRZ indicates that zeroes
will be read. Otherwise the driver would report that the device is
filled with data.

> 
> > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
> 
> However, utilizing it in nbd.ko would require teaching the kernel to
> handle structured or extended headers (right now, that is an extension
> only supported in user-space implementations of the NBD protocol).  I
> can see why you did not tackle that in this RFC series, even though
> you mention it in the cover letter.

Yes, I'm mostly interested in dm-thin. The loop block driver and
dm-linear are useful for testing so I modified them. I didn't try SCSI
or NBD.

Thanks,
Stefan
Christoph Hellwig April 2, 2024, 12:26 p.m. UTC | #4
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> In the block device world there are similar concepts to holes:
> - SCSI has Logical Block Provisioning where the "mapped" state would be
>   considered data and other states would be considered holes.

But for SCSI (and ATA and NVMe) unmapped/deallocated/etc blocks do
not have to return zeroes.  They could also return some other
initialization pattern.  So they are (unfortunately) not a 1:1
mapping to holes in sparse files.
Stefan Hajnoczi April 2, 2024, 1:04 p.m. UTC | #5
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> >   considered data and other states would be considered holes.
> 
> But for SCSI (and ATA and NVMe) unmapped/deallocated/etc blocks do
> not have to return zeroes.  They could also return some other
> initialization pattern.  So they are (unfortunately) not a 1:1
> mapping to holes in sparse files.

Hi Christoph,
There is a 1:1 mapping when the Logical Block Provisioning Read
Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning
VPD page.

Otherwise SEEK_HOLE/SEEK_DATA has to treat the device as filled with
data because it doesn't know where the holes are.
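
The rule in the two paragraphs above amounts to a single bit test followed by a fallback. The sketch below shows it in Python under stated assumptions: the function names are hypothetical, the LBPRZ value is assumed to have already been extracted from the VPD page, and `device_holes` is an illustrative list of unmapped (start, end) ranges rather than anything a real driver exposes in this form.

```python
def lbprz_guarantees_zeroes(lbprz):
    """True when the LBPRZ field matches xx1b, i.e. its low bit is set."""
    return (lbprz & 0b001) == 0b001


def reportable_holes(lbprz, dev_size, device_holes):
    """Return the (start, end) ranges that SEEK_HOLE may report as holes.

    Without a read-zeroes guarantee nothing can be reported, so the
    whole device is treated as data.
    """
    if not lbprz_guarantees_zeroes(lbprz):
        return []
    return [(s, min(e, dev_size)) for s, e in device_holes if s < dev_size]
```

An empty list is the conservative answer: a copier then reads every block, which is slower but never loses non-zero contents.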

Stefan
Eric Blake April 2, 2024, 1:31 p.m. UTC | #6
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> >   considered data and other states would be considered holes.
> 
> But for SCSI (and ATA and NVMe) unmapped/deallocated/etc blocks do
> not have to return zeroes.  They could also return some other
> initialization pattern.  So they are (unfortunately) not a 1:1
> mapping to holes in sparse files.

Yes, and Stefan already answered that:

https://lore.kernel.org/dm-devel/e2lcp3n5gpf7zmlpyn4nj7wsr36sffn23z5bmzlsghu6oapi5u@sdkcbpimi5is/t/#m58146a45951ec086966497e179a2b2715692712d

>> SCSI initiators can check the Logical Block Provisioning Read Zeroes
>> (LBPRZ) field to determine whether or not zeroes are guaranteed. The sd
>> driver would only rely on the device when LBPRZ indicates that zeroes
>> will be read. Otherwise the driver would report that the device is
>> filled with data.

As well as my question on whether the community would be open to
introducing new SEEK_* constants to allow orthogonality between
searching for zeroes (known to read as zero, whether or not it was
allocated) vs. sparseness (known to be unallocated, whether or not it
reads as zero), where the existing SEEK_HOLE seeks for both properties
at once.
Christoph Hellwig April 5, 2024, 7:02 a.m. UTC | #7
On Tue, Apr 02, 2024 at 09:04:46AM -0400, Stefan Hajnoczi wrote:
> Hi Christoph,
> There is a 1:1 mapping when the Logical Block Provisioning Read
> Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning
> VPD page.

Yes.  NVMe also has a similar field, but ATA does not.
Christoph Hellwig April 5, 2024, 7:02 a.m. UTC | #8
On Tue, Apr 02, 2024 at 08:31:09AM -0500, Eric Blake wrote:
> As well as my question on whether the community would be open to
> introducing new SEEK_* constants to allow orthogonality between
> searching for zeroes (known to read as zero, whether or not it was
> allocated) vs. sparseness (known to be unallocated, whether or not it
> reads as zero), where the existing SEEK_HOLE seeks for both properties
> at once.

That seems like quite an effort.  Is it worth it?