diff mbox series

mm: always respect QUEUE_FLAG_STABLE_WRITES on the block device

Message ID 20230504105624.9789-1-idryomov@gmail.com (mailing list archive)
State New, archived
Headers show
Series mm: always respect QUEUE_FLAG_STABLE_WRITES on the block device | expand

Commit Message

Ilya Dryomov May 4, 2023, 10:56 a.m. UTC
Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
and a sb flag") introduced a regression for the raw block device use
case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
the effect of respecting it only when there is a filesystem mounted on
top of the block device.  If a filesystem is not mounted, block devices
that do integrity checking return sporadic checksum errors.

Additionally, this commit made the corresponding sysfs knob writeable
for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
is captured when the filesystem is mounted and isn't consulted after
that anywhere outside of swap code, changing it doesn't take immediate
effect even though dumping the knob shows the new value.  With no way
to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.

Resurrect the original stable writes behavior by changing
folio_wait_stable() to account for the case of a raw block device and
also:

- for the case of a filesystem, test QUEUE_FLAG_STABLE_WRITES flag
  each time instead of capturing it in the superblock so that changes
  are reflected immediately (thus aligning with the case of a raw block
  device)
- retain SB_I_STABLE_WRITES flag for filesystems that need stable
  writes independent of the underlying block device (currently just
  NFS)

Cc: stable@vger.kernel.org
Fixes: 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 fs/super.c          |  2 --
 mm/page-writeback.c | 12 +++++++++++-
 2 files changed, 11 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig May 4, 2023, 1:55 p.m. UTC | #1
On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote:
> Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
> and a sb flag") introduced a regression for the raw block device use
> case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
> the effect of respecting it only when there is a filesystem mounted on
> top of the block device.  If a filesystem is not mounted, block devices
> that do integrity checking return sporadic checksum errors.

With "If a file system is not mounted" you want to say "when accessing
a block device directly" here, right?  The two are not exclusive..

> Additionally, this commit made the corresponding sysfs knob writeable
> for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
> is captured when the filesystem is mounted and isn't consulted after
> that anywhere outside of swap code, changing it doesn't take immediate
> effect even though dumping the knob shows the new value.  With no way
> to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.

But very much intentional.  s_bdev often is not the only device
in a file system, and we should never reference if from core
helpers.

So I think we should go with something like this:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index db794399900734..aa36cc2a4530c1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
  */
 void folio_wait_stable(struct folio *folio)
 {
-	if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
+	struct inode *inode = folio_inode(folio);
+	struct super_block *sb = inode->i_sb;
+
+	if ((sb->s_iflags & SB_I_STABLE_WRITES) ||
+	    (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode))))
 		folio_wait_writeback(folio);
 }
 EXPORT_SYMBOL_GPL(folio_wait_stable);
Matthew Wilcox May 4, 2023, 2:16 p.m. UTC | #2
On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote:
> On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote:
> > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
> > and a sb flag") introduced a regression for the raw block device use
> > case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
> > the effect of respecting it only when there is a filesystem mounted on
> > top of the block device.  If a filesystem is not mounted, block devices
> > that do integrity checking return sporadic checksum errors.
> 
> With "If a file system is not mounted" you want to say "when accessing
> a block device directly" here, right?  The two are not exclusive..
> 
> > Additionally, this commit made the corresponding sysfs knob writeable
> > for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
> > is captured when the filesystem is mounted and isn't consulted after
> > that anywhere outside of swap code, changing it doesn't take immediate
> > effect even though dumping the knob shows the new value.  With no way
> > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.
> 
> But very much intentional.  s_bdev often is not the only device
> in a file system, and we should never reference if from core
> helpers.
> 
> So I think we should go with something like this:
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index db794399900734..aa36cc2a4530c1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
>   */
>  void folio_wait_stable(struct folio *folio)
>  {
> -	if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
> +	struct inode *inode = folio_inode(folio);
> +	struct super_block *sb = inode->i_sb;
> +
> +	if ((sb->s_iflags & SB_I_STABLE_WRITES) ||
> +	    (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode))))
>  		folio_wait_writeback(folio);
>  }
>  EXPORT_SYMBOL_GPL(folio_wait_stable);

I hate both of these patches ;-)  What we should do is add
AS_STABLE_WRITES, have the appropriate places call
mapping_set_stable_writes() and then folio_wait_stable() becomes

	if (mapping_test_stable_writes(folio->mapping))
		folio_wait_writeback(folio);

and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus
whatever else is going on there)
Ilya Dryomov May 4, 2023, 3:03 p.m. UTC | #3
On Thu, May 4, 2023 at 3:55 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote:
> > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
> > and a sb flag") introduced a regression for the raw block device use
> > case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
> > the effect of respecting it only when there is a filesystem mounted on
> > top of the block device.  If a filesystem is not mounted, block devices
> > that do integrity checking return sporadic checksum errors.
>
> With "If a file system is not mounted" you want to say "when accessing
> a block device directly" here, right?  The two are not exclusive..

Hi Christoph,

Right, I meant to say "when accessing a block device directly".

>
> > Additionally, this commit made the corresponding sysfs knob writeable
> > for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
> > is captured when the filesystem is mounted and isn't consulted after
> > that anywhere outside of swap code, changing it doesn't take immediate
> > effect even though dumping the knob shows the new value.  With no way
> > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.
>
> But very much intentional.  s_bdev often is not the only device
> in a file system, and we should never reference if from core
> helpers.
>
> So I think we should go with something like this:
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index db794399900734..aa36cc2a4530c1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
>   */
>  void folio_wait_stable(struct folio *folio)
>  {
> -       if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
> +       struct inode *inode = folio_inode(folio);
> +       struct super_block *sb = inode->i_sb;
> +
> +       if ((sb->s_iflags & SB_I_STABLE_WRITES) ||
> +           (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode))))
>                 folio_wait_writeback(folio);

Heh, this is almost exactly what I came up with initially (|| arms were
swapped in that version), but decided to improve upon after noticing the
writeable thing.

Thanks,

                Ilya
Ilya Dryomov May 4, 2023, 3:07 p.m. UTC | #4
On Thu, May 4, 2023 at 4:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote:
> > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote:
> > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
> > > and a sb flag") introduced a regression for the raw block device use
> > > case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
> > > the effect of respecting it only when there is a filesystem mounted on
> > > top of the block device.  If a filesystem is not mounted, block devices
> > > that do integrity checking return sporadic checksum errors.
> >
> > With "If a file system is not mounted" you want to say "when accessing
> > a block device directly" here, right?  The two are not exclusive..
> >
> > > Additionally, this commit made the corresponding sysfs knob writeable
> > > for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
> > > is captured when the filesystem is mounted and isn't consulted after
> > > that anywhere outside of swap code, changing it doesn't take immediate
> > > effect even though dumping the knob shows the new value.  With no way
> > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.
> >
> > But very much intentional.  s_bdev often is not the only device
> > in a file system, and we should never reference if from core
> > helpers.
> >
> > So I think we should go with something like this:
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index db794399900734..aa36cc2a4530c1 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
> >   */
> >  void folio_wait_stable(struct folio *folio)
> >  {
> > -     if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
> > +     struct inode *inode = folio_inode(folio);
> > +     struct super_block *sb = inode->i_sb;
> > +
> > +     if ((sb->s_iflags & SB_I_STABLE_WRITES) ||
> > +         (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode))))
> >               folio_wait_writeback(folio);
> >  }
> >  EXPORT_SYMBOL_GPL(folio_wait_stable);
>
> I hate both of these patches ;-)  What we should do is add
> AS_STABLE_WRITES, have the appropriate places call
> mapping_set_stable_writes() and then folio_wait_stable() becomes
>
>         if (mapping_test_stable_writes(folio->mapping))
>                 folio_wait_writeback(folio);
>
> and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus
> whatever else is going on there)

Hi Matthew,

We would still need something resembling Christoph's suggestion for
5.10 and 5.15 (at least).  Since this fixes a regression, would you
support merging the "ugly" version to facilitate backports or would
you rather see the AS/mapping-based refactor first?

Thanks,

                Ilya
Matthew Wilcox May 4, 2023, 3:37 p.m. UTC | #5
On Thu, May 04, 2023 at 05:07:10PM +0200, Ilya Dryomov wrote:
> On Thu, May 4, 2023 at 4:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I hate both of these patches ;-)  What we should do is add
> > AS_STABLE_WRITES, have the appropriate places call
> > mapping_set_stable_writes() and then folio_wait_stable() becomes
> >
> >         if (mapping_test_stable_writes(folio->mapping))
> >                 folio_wait_writeback(folio);
> >
> > and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus
> > whatever else is going on there)
> 
> Hi Matthew,
> 
> We would still need something resembling Christoph's suggestion for
> 5.10 and 5.15 (at least).  Since this fixes a regression, would you
> support merging the "ugly" version to facilitate backports or would
> you rather see the AS/mapping-based refactor first?

That's a terrible way of developing for Linux.  First, do it the right
way for mainline.  Then, see how easy the patch is to backport to relevant
kernel versions; if it's too ugly to cherry-pick, do something equivalent.
But never start out with the premise "This must be backported, so do it
as simply as possible first".
Jan Kara May 4, 2023, 3:55 p.m. UTC | #6
On Thu 04-05-23 15:16:39, Matthew Wilcox wrote:
> On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote:
> > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote:
> > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue
> > > and a sb flag") introduced a regression for the raw block device use
> > > case.  Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has
> > > the effect of respecting it only when there is a filesystem mounted on
> > > top of the block device.  If a filesystem is not mounted, block devices
> > > that do integrity checking return sporadic checksum errors.
> > 
> > With "If a file system is not mounted" you want to say "when accessing
> > a block device directly" here, right?  The two are not exclusive..
> > 
> > > Additionally, this commit made the corresponding sysfs knob writeable
> > > for debugging purposes.  However, because QUEUE_FLAG_STABLE_WRITES flag
> > > is captured when the filesystem is mounted and isn't consulted after
> > > that anywhere outside of swap code, changing it doesn't take immediate
> > > effect even though dumping the knob shows the new value.  With no way
> > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.
> > 
> > But very much intentional.  s_bdev often is not the only device
> > in a file system, and we should never reference if from core
> > helpers.
> > 
> > So I think we should go with something like this:
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index db794399900734..aa36cc2a4530c1 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
> >   */
> >  void folio_wait_stable(struct folio *folio)
> >  {
> > -	if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
> > +	struct inode *inode = folio_inode(folio);
> > +	struct super_block *sb = inode->i_sb;
> > +
> > +	if ((sb->s_iflags & SB_I_STABLE_WRITES) ||
> > +	    (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode))))
> >  		folio_wait_writeback(folio);
> >  }
> >  EXPORT_SYMBOL_GPL(folio_wait_stable);
> 
> I hate both of these patches ;-)  What we should do is add
> AS_STABLE_WRITES, have the appropriate places call
> mapping_set_stable_writes() and then folio_wait_stable() becomes
> 
> 	if (mapping_test_stable_writes(folio->mapping))
> 		folio_wait_writeback(folio);
> 
> and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus
> whatever else is going on there)

For bdev address_space that's easy but what Ilya also mentioned is a
problem when 'stable_write' flag gets toggled on the device and in that
case having to propagate the flag update to all the address_space
structures is a nightmare...

								Honza
Matthew Wilcox May 4, 2023, 4:16 p.m. UTC | #7
On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote:
> For bdev address_space that's easy but what Ilya also mentioned is a
> problem when 'stable_write' flag gets toggled on the device and in that
> case having to propagate the flag update to all the address_space
> structures is a nightmare...

We have a number of flags which don't take effect when modified on a
block device with a mounted filesystem on it.  For example, modifying
the readahead settings do not change existing files, only new ones.
Since this flag is only modifiable for debugging purposes, I think I'm
OK with it not affecting already-mounted filesystems.  It feels like a
decision that reasonable people could disagree on, though.
Dave Chinner May 4, 2023, 11:07 p.m. UTC | #8
On Thu, May 04, 2023 at 05:16:48PM +0100, Matthew Wilcox wrote:
> On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote:
> > For bdev address_space that's easy but what Ilya also mentioned is a
> > problem when 'stable_write' flag gets toggled on the device and in that
> > case having to propagate the flag update to all the address_space
> > structures is a nightmare...
> 
> We have a number of flags which don't take effect when modified on a
> block device with a mounted filesystem on it.  For example, modifying
> the readahead settings do not change existing files, only new ones.
> Since this flag is only modifiable for debugging purposes, I think I'm
> OK with it not affecting already-mounted filesystems.  It feels like a
> decision that reasonable people could disagree on, though.

I think an address space flag makes sense, because then we don't
even have to care about the special bdev sb/inode thing -
folio->mapping will already point at the bdev mapping and so do the
right thing.

That is, if the bdev changes stable_write state, it can toggle the
AS_STABLE_WRITE flag on it's inode->i_mapping straight away and all
the folios and files pointing to the bdev mapping will change
behaviour immediately.  Everything else retains the same behaviour
we have now - the stable_write state is persistent on the superblock
until the filesystem mount is cycled.

Cheers,

Dave.
Jan Kara May 5, 2023, 10:49 a.m. UTC | #9
On Fri 05-05-23 09:07:36, Dave Chinner wrote:
> On Thu, May 04, 2023 at 05:16:48PM +0100, Matthew Wilcox wrote:
> > On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote:
> > > For bdev address_space that's easy but what Ilya also mentioned is a
> > > problem when 'stable_write' flag gets toggled on the device and in that
> > > case having to propagate the flag update to all the address_space
> > > structures is a nightmare...
> > 
> > We have a number of flags which don't take effect when modified on a
> > block device with a mounted filesystem on it.  For example, modifying
> > the readahead settings do not change existing files, only new ones.
> > Since this flag is only modifiable for debugging purposes, I think I'm
> > OK with it not affecting already-mounted filesystems.  It feels like a
> > decision that reasonable people could disagree on, though.
> 
> I think an address space flag makes sense, because then we don't
> even have to care about the special bdev sb/inode thing -
> folio->mapping will already point at the bdev mapping and so do the
> right thing.
> 
> That is, if the bdev changes stable_write state, it can toggle the
> AS_STABLE_WRITE flag on it's inode->i_mapping straight away and all
> the folios and files pointing to the bdev mapping will change
> behaviour immediately.  Everything else retains the same behaviour
> we have now - the stable_write state is persistent on the superblock
> until the filesystem mount is cycled.

Yeah, I'm fine with this behavior. I just wasn't sure whether Ilya didn't
need the sysfs change to be visible in the filesystem so that was why I
pointed that out. But apparently he doesn't need it.

								Honza
diff mbox series

Patch

diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7dfe..6705b3506ae8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1213,8 +1213,6 @@  static int set_bdev_super(struct super_block *s, void *data)
 	s->s_dev = s->s_bdev->bd_dev;
 	s->s_bdi = bdi_get(s->s_bdev->bd_disk->bdi);
 
-	if (bdev_stable_writes(s->s_bdev))
-		s->s_iflags |= SB_I_STABLE_WRITES;
 	return 0;
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 516b1aa247e8..469bc57add8c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -3169,7 +3169,17 @@  EXPORT_SYMBOL_GPL(folio_wait_writeback_killable);
  */
 void folio_wait_stable(struct folio *folio)
 {
-	if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES)
+	struct inode *inode = folio_inode(folio);
+	struct super_block *sb = inode->i_sb;
+	bool stable_writes;
+
+	if (sb_is_blkdev_sb(sb))
+		stable_writes = bdev_stable_writes(I_BDEV(inode));
+	else
+		stable_writes = bdev_stable_writes(sb->s_bdev) ||
+				(sb->s_iflags & SB_I_STABLE_WRITES);
+
+	if (stable_writes)
 		folio_wait_writeback(folio);
 }
 EXPORT_SYMBOL_GPL(folio_wait_stable);