mbox series

[0/2] RFC: Issue with discards on raw block device without O_DIRECT

Message ID 20201111153913.41840-1-mlevitsk@redhat.com (mailing list archive)
Headers show
Series RFC: Issue with discards on raw block device without O_DIRECT | expand

Message

Maxim Levitsky Nov. 11, 2020, 3:39 p.m. UTC
clone of "starship_production"

Maxim Levitsky (2):
  file-posix: allow -EBUSY errors during write zeros on raw block
    devices
  qemu-img: align next status sector on destination alignment.

 block/file-posix.c |  1 +
 qemu-img.c         | 13 ++++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

Comments

Maxim Levitsky Nov. 11, 2020, 3:44 p.m. UTC | #1
On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> clone of "starship_production"

The git-publish destroyed the cover letter:

For the reference this is for bz #1872633

The issue is that current kernel code that implements 'fallocate'
on kernel block devices roughly works like that:

1. Flush the page cache on the range that is about to be discarded.
2. Issue the discard and wait for it to finish.
   (as far as I can see the discard doesn't go through the
   page cache).

3. Check if the page cache is dirty for this range,
   if it is dirty (meaning that someone wrote to it meanwhile)
   return -EBUSY.

This means that if qemu (or qemu-img) issues a write, and then
discard to the area that shares a page, -EBUSY can be returned by
the kernel.

On the other hand, for example, the ext4 implementation of discard
doesn't seem to be affected. It does take a lock on the inode to avoid
concurrent IO and flushes O_DIRECT writers prior to doing discard thought.

Doing fsync and retrying is seems to resolve this issue, but it might be a too big hammer.
Just retrying doesn't work, indicating that maybe the code that flushes the page
cache in (1) doesn't do this correctly ?

It also can be racy unless special means are done to block IO from happening
from qemu during this fsync.

This patch series contains two patches:

First patch just lets the file-posix ignore the -EBUSY errors, which is
technically enough to fail back to plain write in this case, but seems wrong.

And the second patch adds an optimization to qemu-img to avoid such a
fragmented write/discard in the first place.

Both patches make the reproducer work for this particular bugzilla,
but I don't think they are enough.

What do you think?

Best regards,
	Maxim Levitsky
Jan Kara Nov. 12, 2020, 11:19 a.m. UTC | #2
[added some relevant people and lists to CC]

On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
> 
> The git-publish destroyed the cover letter:
> 
> For the reference this is for bz #1872633
> 
> The issue is that current kernel code that implements 'fallocate'
> on kernel block devices roughly works like that:
> 
> 1. Flush the page cache on the range that is about to be discarded.
> 2. Issue the discard and wait for it to finish.
>    (as far as I can see the discard doesn't go through the
>    page cache).
> 
> 3. Check if the page cache is dirty for this range,
>    if it is dirty (meaning that someone wrote to it meanwhile)
>    return -EBUSY.
> 
> This means that if qemu (or qemu-img) issues a write, and then
> discard to the area that shares a page, -EBUSY can be returned by
> the kernel.

Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY which seems wrong to me. IMO we should handle this gracefully in the
kernel so we need to fix this.

> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO and flushes O_DIRECT writers prior to doing discard thought.

Well, filesystem hole punching is somewhat different beast than block device
discard (at least implementation wise).

> Doing fsync and retrying is seems to resolve this issue, but it might be
> a too big hammer.  Just retrying doesn't work, indicating that maybe the
> code that flushes the page cache in (1) doesn't do this correctly ?
> 
> It also can be racy unless special means are done to block IO from happening
> from qemu during this fsync.
> 
> This patch series contains two patches:
> 
> First patch just lets the file-posix ignore the -EBUSY errors, which is
> technically enough to fail back to plain write in this case, but seems wrong.
> 
> And the second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
> 
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
> 
> What do you think?

So if the EBUSY error happens because something happened to the page cache
outside of discarded range (like you describe above), that is a kernel bug
than needs to get fixed. EBUSY should really mean - someone wrote to the
discarded range while discard was running and userspace app has to deal
with that depending on what it aims to do...

								Honza
Jan Kara Nov. 12, 2020, noon UTC | #3
On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [added some relevant people and lists to CC]
> 
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> > 
> > The git-publish destroyed the cover letter:
> > 
> > For the reference this is for bz #1872633
> > 
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> > 
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 
> > 3. Check if the page cache is dirty for this range,
> >    if it is dirty (meaning that someone wrote to it meanwhile)
> >    return -EBUSY.
> > 
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
> 
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
> 
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> 
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
> 
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> > 
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> > 
> > This patch series contains two patches:
> > 
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> > 
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> > 
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> > 
> > What do you think?
> 
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...

So I was looking what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() is working on page
granularity and it is non-trivial to extend it to work on byte granularity
since we don't support something like "try to reclaim part of a page". But
I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
instead of truncate_inode_pages_range() again? I mean the EBUSY detection
cannot be reliable anyway and userspace has no way of knowing whether a
write happened before discard or after it so just discarding data is fine
from this point of view. Darrick?

								Honza
Maxim Levitsky Nov. 12, 2020, 3:38 p.m. UTC | #4
On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
> 
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> > 
> > The git-publish destroyed the cover letter:
> > 
> > For the reference this is for bz #1872633
> > 
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> > 
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 
> > 3. Check if the page cache is dirty for this range,
> >    if it is dirty (meaning that someone wrote to it meanwhile)
> >    return -EBUSY.
> > 
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
> 
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
> 
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> 
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
> 
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> > 
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> > 
> > This patch series contains two patches:
> > 
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> > 
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> > 
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> > 
> > What do you think?
> 
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...
I double checked this, those are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request)
I have attached the patch for this for reference.

ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
       fallocate 00007fe00000 00007fffefff
WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
       write 00007ffff000 00007ffffdff
ZERO: 0x00007ffffe00
0000801fefff (len:0x1ff200)
       fallocate 00007ffffe00 0000801fefff
FALLOCATE failed with error 16
qemu-img: error while writing at byte 2147483136: Device or resource busy


Best regards,
     Maxim Levitsky

> 
> 								Honza
Darrick J. Wong Nov. 12, 2020, 10:08 p.m. UTC | #5
On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> 
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files.  This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Nov. 13, 2020, 10:07 a.m. UTC | #6
On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> I double checked this, those are the writes/discards according to my debug
> prints (I print start and then start+len-1 for each request)
> I have attached the patch for this for reference.
> 
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
>        fallocate 00007fe00000 00007fffefff

Yeah, the end at 7ffff000 is indeed not 4k aligned...

> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
>        write 00007ffff000 00007ffffdff

.. and this write is following discarded area in the same page
(7ffff000..7ffffdff).

								Honza
Maxim Levitsky Dec. 7, 2020, 5:23 p.m. UTC | #7
On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote:
> On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> > On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > > [added some relevant people and lists to CC]
> > > 
> > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > > clone of "starship_production"
> > > > 
> > > > The git-publish destroyed the cover letter:
> > > > 
> > > > For the reference this is for bz #1872633
> > > > 
> > > > The issue is that current kernel code that implements 'fallocate'
> > > > on kernel block devices roughly works like that:
> > > > 
> > > > 1. Flush the page cache on the range that is about to be discarded.
> > > > 2. Issue the discard and wait for it to finish.
> > > >    (as far as I can see the discard doesn't go through the
> > > >    page cache).
> > > > 
> > > > 3. Check if the page cache is dirty for this range,
> > > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > > >    return -EBUSY.
> > > > 
> > > > This means that if qemu (or qemu-img) issues a write, and then
> > > > discard to the area that shares a page, -EBUSY can be returned by
> > > > the kernel.
> > > 
> > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > > kernel so we need to fix this.
> > > 
> > > > On the other hand, for example, the ext4 implementation of discard
> > > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > > 
> > > Well, filesystem hole punching is somewhat different beast than block device
> > > discard (at least implementation wise).
> > > 
> > > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > > 
> > > > It also can be racy unless special means are done to block IO from happening
> > > > from qemu during this fsync.
> > > > 
> > > > This patch series contains two patches:
> > > > 
> > > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > > technically enough to fail back to plain write in this case, but seems wrong.
> > > > 
> > > > And the second patch adds an optimization to qemu-img to avoid such a
> > > > fragmented write/discard in the first place.
> > > > 
> > > > Both patches make the reproducer work for this particular bugzilla,
> > > > but I don't think they are enough.
> > > > 
> > > > What do you think?
> > > 
> > > So if the EBUSY error happens because something happened to the page cache
> > > outside of discarded range (like you describe above), that is a kernel bug
> > > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > > discarded range while discard was running and userspace app has to deal
> > > with that depending on what it aims to do...
> > 
> > So I was looking what it would take to fix this inside the kernel. The
> > problem is that invalidate_inode_pages2_range() is working on page
> > granularity and it is non-trivial to extend it to work on byte granularity
> > since we don't support something like "try to reclaim part of a page". But
> > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> > instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> > cannot be reliable anyway and userspace has no way of knowing whether a
> > write happened before discard or after it so just discarding data is fine
> > from this point of view. Darrick?
> 
> Hmmm, I think I overlooked the fact that we can do buffered writes into
> a block device's pagecache without taking any of the usual locks that
> have to be held for filesystem files.  This is essentially a race
> between a not-page-aligned fallocate and a buffered write to a different
> sector that is mapped by a page that caches part of the fallocate range.
> 
> So yes, Jan is right that we need to use truncate_bdev_range instead of
> invalidate_inode_pages2_range because the former will zero the sub-page
> ranges on either end of the fallocate request instead of returning
> -EBUSY because someone dirtied a part of a page that wasn't involved in
> the fallocate operation.
> 
> I /probably/ just copy-pasta'd that invalidation call from directio
> without thinking hard enough about it, sorry about that. :(

Any update on this?
 
Today I took a look at both truncate_bdev_range,
and at invalidate_inode_pages2_range.
 
 
I see that the truncate_bdev_range can't really fail other 
than check for a mounted
filesystem.
 
It calls truncate_inode_pages_range which writes back all
dirty pages and waits for writeback to finish.
So it won't detect dirty pages as invalidate_inode_pages2_range does
 
 
Also AFAIK the
kernel page cache indeed 
only tracks the dirtiness of a whole page (e.g a 512 byte write, 
will cause the whole page to be written back, unless the 
write was done using O_DIRECT)
 
So if a page, part of
which is discarded, becomes dirty we can't really
know if someone wrote to the discarded region, or only near it.
 
I vote to keep the call to invalidate_inode_pages2_range but
exclude the non page
aligned areas from it.
 
This way we will still do a best effort detection of this case,
while not causing any false positives.
 
If you agree, I'll send a patch for this.


Best regards,
	Maxim Levitsky

> 
> --D
> 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR