
[5/6] iomap: drop unnecessary state_lock when setting ifs uptodate bits

Message ID 20240731091305.2896873-6-yi.zhang@huaweicloud.com (mailing list archive)
State Accepted, archived
Series iomap: some minor non-critical fixes and improvements when block size < folio size

Commit Message

Zhang Yi July 31, 2024, 9:13 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
race when submitting multiple read bios for a page that spans more than
one file system block by adding a spinlock (now named state_lock)
to serialize the page uptodate updates. However, the race condition only
exists between the read I/O submitting and completing threads; it's
sufficient to use page lock to protect other paths, e.g. the buffered write
path. Now that large folios are supported, the spinlock has a bigger
impact on buffered write performance, so dropping it could reduce some
unnecessary locking overhead.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Comments

Matthew Wilcox July 31, 2024, 4:52 p.m. UTC | #1
On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> race when submitting multiple read bios for a page that spans more than
> one file system block by adding a spinlock (now named state_lock)
> to serialize the page uptodate updates. However, the race condition only
> exists between the read I/O submitting and completing threads; it's
> sufficient to use page lock to protect other paths, e.g. the buffered write
> path. Now that large folios are supported, the spinlock has a bigger
> impact on buffered write performance, so dropping it could reduce some
> unnecessary locking overhead.

This patch doesn't work.  If we get two read completions at the same
time for blocks belonging to the same folio, they will both write to
the uptodate array at the same time.
Zhang Yi Aug. 1, 2024, 1:52 a.m. UTC | #2
On 2024/8/1 0:52, Matthew Wilcox wrote:
> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>> race when submitting multiple read bios for a page that spans more than
>> one file system block by adding a spinlock (now named state_lock)
>> to serialize the page uptodate updates. However, the race condition only
>> exists between the read I/O submitting and completing threads; it's
>> sufficient to use page lock to protect other paths, e.g. the buffered write
>> path. Now that large folios are supported, the spinlock has a bigger
>> impact on buffered write performance, so dropping it could reduce some
>> unnecessary locking overhead.
> 
> This patch doesn't work.  If we get two read completions at the same
> time for blocks belonging to the same folio, they will both write to
> the uptodate array at the same time.
> 
This patch just drops the state_lock in the buffered write path; it
doesn't affect the read path. The uptodate setting in the read completion
path is still protected by the state_lock; please see
iomap_finish_folio_read(). So I think this patch doesn't affect the case
you mentioned, or am I missing something?

Thanks,
Yi.
Matthew Wilcox Aug. 1, 2024, 4:24 a.m. UTC | #3
On Thu, Aug 01, 2024 at 09:52:49AM +0800, Zhang Yi wrote:
> On 2024/8/1 0:52, Matthew Wilcox wrote:
> > On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> >> race when submitting multiple read bios for a page that spans more than
> >> one file system block by adding a spinlock (now named state_lock)
> >> to serialize the page uptodate updates. However, the race condition only
> >> exists between the read I/O submitting and completing threads; it's
> >> sufficient to use page lock to protect other paths, e.g. the buffered write
> >> path. Now that large folios are supported, the spinlock has a bigger
> >> impact on buffered write performance, so dropping it could reduce some
> >> unnecessary locking overhead.
> > 
> > This patch doesn't work.  If we get two read completions at the same
> > time for blocks belonging to the same folio, they will both write to
> > the uptodate array at the same time.
> > 
> This patch just drops the state_lock in the buffered write path; it
> doesn't affect the read path. The uptodate setting in the read completion
> path is still protected by the state_lock; please see
> iomap_finish_folio_read(). So I think this patch doesn't affect the case
> you mentioned, or am I missing something?

Oh, I see.  So the argument for locking correctness is that:

A. If ifs_set_range_uptodate() is called from iomap_finish_folio_read(),
   the state_lock is held.
B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
   either we know:
B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
    is the only place that can call ifs_set_range_uptodate() for this folio
B2. The caller of iomap_set_range_uptodate() holds the state lock

But I think you've assigned iomap_read_inline_data() to case B1 when I
think it's B2.  erofs can certainly have a file which consists of various
blocks elsewhere in the file and then a tail that is stored inline.

__iomap_write_begin() is case B1 because it holds the folio lock, and
   submits its read(s) synchronously.  Likewise __iomap_write_end() is
case B1.

But, um.  Why do we need to call iomap_set_range_uptodate() in both
write_begin() and write_end()?

And I think this is actively buggy:

               if (iomap_block_needs_zeroing(iter, block_start)) {
                        if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
                                return -EIO;
                        folio_zero_segments(folio, poff, from, to, poff + plen);
...
                iomap_set_range_uptodate(folio, poff, plen);

because we zero from 'poff' to 'from', then from 'to' to 'poff+plen',
but mark the entire range as uptodate.  And once a range is marked
as uptodate, it can be read from.

So we can do this:

 - Get a write request for bytes 1-4094 over a hole
 - allocate single page folio
 - zero bytes 0 and 4095
 - mark 0-4095 as uptodate
 - take page fault while trying to access the user address
 - read() to bytes 0-4095 now succeeds even though we haven't written
   1-4094 yet

And that page fault can be uffd or a buffer that's in an mmap that's
out on disc.  Plenty of time to make this race happen, and we leak
4094/4096 bytes of the previous contents of that folio to userspace.

Or did I miss something?
Zhang Yi Aug. 1, 2024, 9:19 a.m. UTC | #4
On 2024/8/1 12:24, Matthew Wilcox wrote:
> On Thu, Aug 01, 2024 at 09:52:49AM +0800, Zhang Yi wrote:
>> On 2024/8/1 0:52, Matthew Wilcox wrote:
>>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>>>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>>>> race when submitting multiple read bios for a page that spans more than
>>>> one file system block by adding a spinlock (now named state_lock)
>>>> to serialize the page uptodate updates. However, the race condition only
>>>> exists between the read I/O submitting and completing threads; it's
>>>> sufficient to use page lock to protect other paths, e.g. the buffered write
>>>> path. Now that large folios are supported, the spinlock has a bigger
>>>> impact on buffered write performance, so dropping it could reduce some
>>>> unnecessary locking overhead.
>>>
>>> This patch doesn't work.  If we get two read completions at the same
>>> time for blocks belonging to the same folio, they will both write to
>>> the uptodate array at the same time.
>>>
>> This patch just drops the state_lock in the buffered write path; it
>> doesn't affect the read path. The uptodate setting in the read completion
>> path is still protected by the state_lock; please see
>> iomap_finish_folio_read(). So I think this patch doesn't affect the case
>> you mentioned, or am I missing something?
> 
> Oh, I see.  So the argument for locking correctness is that:
> 
> A. If ifs_set_range_uptodate() is called from iomap_finish_folio_read(),
>    the state_lock is held.
> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
>    either we know:
> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
>     is the only place that can call ifs_set_range_uptodate() for this folio
> B2. The caller of iomap_set_range_uptodate() holds the state lock
> 
> But I think you've assigned iomap_read_inline_data() to case B1 when I
> think it's B2.  erofs can certainly have a file which consists of various
> blocks elsewhere in the file and then a tail that is stored inline.

Oh, you are right, thanks for pointing this out. I missed the case of
having both file blocks and inline data in one folio on erofs. So we
also need to hold the state_lock in iomap_read_inline_data(); it looks
like we'd better introduce a new common helper to do this job for B2.
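
For illustration, such a helper could simply wrap the existing locking
in one place (a minimal sketch; the name is illustrative, not taken
from the series):

/*
 * Sketch of a "case B2" helper: take the state_lock around the uptodate
 * bitmap update for callers that are not covered by the folio lock
 * (read completion, inline data).
 */
static void iomap_set_range_uptodate_locked(struct folio *folio,
                size_t off, size_t len)
{
        struct iomap_folio_state *ifs = folio->private;
        unsigned long flags;
        bool uptodate = true;

        if (ifs) {
                spin_lock_irqsave(&ifs->state_lock, flags);
                uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
                spin_unlock_irqrestore(&ifs->state_lock, flags);
        }

        if (uptodate)
                folio_mark_uptodate(folio);
}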

> 
> __iomap_write_begin() is case B1 because it holds the folio lock, and
> submits its read(s) synchronously.  Likewise __iomap_write_end() is
> case B1.
> 
> But, um.  Why do we need to call iomap_set_range_uptodate() in both
> write_begin() and write_end()?
> 
> And I think this is actively buggy:
> 
>                if (iomap_block_needs_zeroing(iter, block_start)) {
>                         if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
>                                 return -EIO;
>                         folio_zero_segments(folio, poff, from, to, poff + plen);
> ...
>                 iomap_set_range_uptodate(folio, poff, plen);
> 
> because we zero from 'poff' to 'from', then from 'to' to 'poff+plen',
> but mark the entire range as uptodate.  And once a range is marked
> as uptodate, it can be read from.
> 
> So we can do this:
> 
>  - Get a write request for bytes 1-4094 over a hole
>  - allocate single page folio
>  - zero bytes 0 and 4095
>  - mark 0-4095 as uptodate
>  - take page fault while trying to access the user address
>  - read() to bytes 0-4095 now succeeds even though we haven't written
>    1-4094 yet
> 
> And that page fault can be uffd or a buffer that's in an mmap that's
> out on disc.  Plenty of time to make this race happen, and we leak
> 4094/4096 bytes of the previous contents of that folio to userspace.
> 
> Or did I miss something?
> 

Indeed, this could happen on a filesystem that doesn't take the inode
lock in the buffered read path (I've verified it on my ext4 buffered
iomap branch), and I guess it could also happen after a short copy in
the write path. We don't need iomap_set_range_uptodate() for the
zeroing case in __iomap_write_begin().
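
For concreteness, the change implied here is to only mark the range
uptodate after an actual read, something like this sketch (not the
final patch; surrounding context abridged):

if (iomap_block_needs_zeroing(iter, block_start)) {
        if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
                return -EIO;
        /*
         * Only the head and tail of the range are zeroed here;
         * the middle still holds stale data until the copy-in
         * lands, so don't mark anything uptodate yet.
         */
        folio_zero_segments(folio, poff, from, to, poff + plen);
} else {
        int status = iomap_read_folio_sync(block_start, folio,
                        poff, plen, srcmap);
        if (status)
                return status;
        iomap_set_range_uptodate(folio, poff, plen);
}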

Thanks,
Yi.
Dave Chinner Aug. 2, 2024, 12:05 a.m. UTC | #5
On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> race when submitting multiple read bios for a page that spans more than
> one file system block by adding a spinlock (now named state_lock)
> to serialize the page uptodate updates. However, the race condition only
> exists between the read I/O submitting and completing threads;

when we do writeback on a folio that has multiple blocks on it we
can submit multiple bios for that, too. Hence the write completions
can race with each other and write submission, too.

Yes, write bio submission and completion only need to update ifs
accounting using an atomic operation, but the same race condition
exists even though the folio is fully locked at the point of bio
submission.


> it's
> sufficient to use page lock to protect other paths, e.g. the buffered write
                    ^^^^ folio
> path.
>
> Now that large folios are supported, the spinlock has a bigger
> impact on buffered write performance, so dropping it could reduce some
> unnecessary locking overhead.

From the point of view of simple to understand and maintain code, I
think this is a bad idea. The data structure is currently protected
by the state lock in all situations, but this change now makes it
protected by the state lock in one case and the folio lock in a
different case.

Making this change also misses the elephant in the room: the
buffered write path still needs the ifs->state_lock to update the
dirty bitmap. Hence we're effectively changing the serialisation
mechanism for only one of the two ifs state bitmaps that the
buffered write path has to update.

Indeed, we can't get rid of the ifs->state_lock from the dirty range
updates because iomap_dirty_folio() can be called without the folio
being locked through folio_mark_dirty() calling the ->dirty_folio()
aop.

IOWs, getting rid of the state lock from the uptodate range
changes does not actually get rid of it from the buffered IO path.
We still have to take it to update the dirty range, and so there's
an obvious way to optimise the state lock usage without changing any
of the bitmap access serialisation behaviour. i.e.  We combine the
uptodate and dirty range updates in __iomap_write_end() into a
single lock context such as:

static void iomap_set_range_dirty_uptodate(struct folio *folio,
                size_t off, size_t len)
{
        struct iomap_folio_state *ifs = folio->private;
        struct inode *inode;
        unsigned int blks_per_folio;
        unsigned int first_blk;
        unsigned int last_blk;
        unsigned int nr_blks;
        unsigned long flags;

        if (!ifs)
                return;

        inode = folio->mapping->host;
        blks_per_folio = i_blocks_per_folio(inode, folio);
        first_blk = (off >> inode->i_blkbits);
        last_blk = (off + len - 1) >> inode->i_blkbits;
        nr_blks = last_blk - first_blk + 1;

        spin_lock_irqsave(&ifs->state_lock, flags);
        /* uptodate bits occupy the first half of ifs->state */
        bitmap_set(ifs->state, first_blk, nr_blks);
        /* dirty bits occupy the second half */
        bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
        spin_unlock_irqrestore(&ifs->state_lock, flags);
}

This means we calculate the bitmap offsets only once, we take the
state lock only once, and we don't do anything if there is no
sub-folio state.
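
As a usage sketch, __iomap_write_end() would then collapse its separate
uptodate and dirty updates into one call (abridged; the exact landed
form may differ):

static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
                size_t copied, struct folio *folio)
{
        flush_dcache_folio(folio);

        /* a short copy into a !uptodate folio cannot be committed */
        if (unlikely(copied < len && !folio_test_uptodate(folio)))
                return false;

        /* one state_lock round trip for both bitmaps */
        iomap_set_range_dirty_uptodate(folio, offset_in_folio(folio, pos),
                        copied);
        filemap_dirty_folio(inode->i_mapping, folio);
        return true;
}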

If we then fix the __iomap_write_begin() code as Willy pointed out
to elide the erroneous uptodate range update, then we end up only
taking the state lock once per buffered write instead of 3 times per
write.

This patch only reduces it to twice per buffered write, so doing the
above should provide even better performance without needing to
change the underlying serialisation mechanism at all.

-Dave.
Zhang Yi Aug. 2, 2024, 2:57 a.m. UTC | #6
On 2024/8/2 8:05, Dave Chinner wrote:
> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>> race when submitting multiple read bios for a page that spans more than
>> one file system block by adding a spinlock (now named state_lock)
>> to serialize the page uptodate updates. However, the race condition only
>> exists between the read I/O submitting and completing threads;
> 
> when we do writeback on a folio that has multiple blocks on it we
> can submit multiple bios for that, too. Hence the write completions
> can race with each other and write submission, too.
> 
> Yes, write bio submission and completion only need to update ifs
> accounting using an atomic operation, but the same race condition
> exists even though the folio is fully locked at the point of bio
> submission.
> 
> 
>> it's
>> sufficient to use page lock to protect other paths, e.g. the buffered write
>                     ^^^^ folio
>> path.
>>
>> Now that large folios are supported, the spinlock has a bigger
>> impact on buffered write performance, so dropping it could reduce some
>> unnecessary locking overhead.
> 
> From the point of view of simple to understand and maintain code, I
> think this is a bad idea. The data structure is currently protected
> by the state lock in all situations, but this change now makes it
> protected by the state lock in one case and the folio lock in a
> different case.

Yeah, I agree that this is a side-effect of this change; after this patch,
we have to be careful to distinguish between the two cases B1 and B2 below,
as Willy mentioned.

B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
   either we know:
B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
    is the only place that can call ifs_set_range_uptodate() for this folio
B2. The caller of iomap_set_range_uptodate() holds the state lock

> 
> Making this change also misses the elephant in the room: the
> buffered write path still needs the ifs->state_lock to update the
> dirty bitmap. Hence we're effectively changing the serialisation
> mechanism for only one of the two ifs state bitmaps that the
> buffered write path has to update.
> 
> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> updates because iomap_dirty_folio() can be called without the folio
> being locked through folio_mark_dirty() calling the ->dirty_folio()
> aop.
> 

Sorry, I don't understand why folio_mark_dirty() could be called without
the folio lock (isn't this supposed to be a bug)?  IIUC, all file backed
folios must be locked before being marked dirty. Are there any exceptions,
or am I missing something?

> IOWs, getting rid of the state lock from the uptodate range
> changes does not actually get rid of it from the buffered IO path.
> We still have to take it to update the dirty range, and so there's
> an obvious way to optimise the state lock usage without changing any
> of the bitmap access serialisation behaviour. i.e.  We combine the
> uptodate and dirty range updates in __iomap_write_end() into a
> single lock context such as:
> 
> static void iomap_set_range_dirty_uptodate(struct folio *folio,
>                 size_t off, size_t len)
> {
>         struct iomap_folio_state *ifs = folio->private;
>         struct inode *inode;
>         unsigned int blks_per_folio;
>         unsigned int first_blk;
>         unsigned int last_blk;
>         unsigned int nr_blks;
>         unsigned long flags;
> 
>         if (!ifs)
>                 return;
> 
>         inode = folio->mapping->host;
>         blks_per_folio = i_blocks_per_folio(inode, folio);
>         first_blk = (off >> inode->i_blkbits);
>         last_blk = (off + len - 1) >> inode->i_blkbits;
>         nr_blks = last_blk - first_blk + 1;
> 
>         spin_lock_irqsave(&ifs->state_lock, flags);
>         /* uptodate bits occupy the first half of ifs->state */
>         bitmap_set(ifs->state, first_blk, nr_blks);
>         /* dirty bits occupy the second half */
>         bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
>         spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
> 
> This means we calculate the bitmap offsets only once, we take the
> state lock only once, and we don't do anything if there is no
> sub-folio state.
> 
> If we then fix the __iomap_write_begin() code as Willy pointed out
> to elide the erroneous uptodate range update, then we end up only
> taking the state lock once per buffered write instead of 3 times per
> write.
> 
> This patch only reduces it to twice per buffered write, so doing the
> above should provide even better performance without needing to
> change the underlying serialisation mechanism at all.
> 

Thanks for the suggestion. I've thought about this solution too, but I
didn't think we needed the state_lock when setting the ifs dirty bit
since the folio lock should work, so I changed my mind and planned to
drop all ifs state_lock usage in the write path (please see patch 6).
Please let me know if I'm wrong.

Thanks,
Yi.
Dave Chinner Aug. 2, 2024, 6:29 a.m. UTC | #7
On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> On 2024/8/2 8:05, Dave Chinner wrote:
> > On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
> >> race when submitting multiple read bios for a page that spans more than
> >> one file system block by adding a spinlock (now named state_lock)
> >> to serialize the page uptodate updates. However, the race condition only
> >> exists between the read I/O submitting and completing threads;
> > 
> > when we do writeback on a folio that has multiple blocks on it we
> > can submit multiple bios for that, too. Hence the write completions
> > can race with each other and write submission, too.
> > 
> > Yes, write bio submission and completion only need to update ifs
> > accounting using an atomic operation, but the same race condition
> > exists even though the folio is fully locked at the point of bio
> > submission.
> > 
> > 
> >> it's
> >> sufficient to use page lock to protect other paths, e.g. the buffered write
> >                     ^^^^ folio
> >> path.
> >>
> >> Now that large folios are supported, the spinlock has a bigger
> >> impact on buffered write performance, so dropping it could reduce some
> >> unnecessary locking overhead.
> > 
> > From the point of view of simple to understand and maintain code, I
> > think this is a bad idea. The data structure is currently protected
> > by the state lock in all situations, but this change now makes it
> > protected by the state lock in one case and the folio lock in a
> > different case.
> 
>> Yeah, I agree that this is a side-effect of this change; after this patch,
>> we have to be careful to distinguish between the two cases B1 and B2 below,
>> as Willy mentioned.
> 
> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
>    either we know:
> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
>     is the only place that can call ifs_set_range_uptodate() for this folio
> B2. The caller of iomap_set_range_uptodate() holds the state lock

Yes, I read that before I commented that I think it's a bad idea.
And then provided a method where we don't need to care about this at
all.
> 
> > 
> > Making this change also misses the elephant in the room: the
> > buffered write path still needs the ifs->state_lock to update the
> > dirty bitmap. Hence we're effectively changing the serialisation
> > mechanism for only one of the two ifs state bitmaps that the
> > buffered write path has to update.
> > 
> > Indeed, we can't get rid of the ifs->state_lock from the dirty range
> > updates because iomap_dirty_folio() can be called without the folio
> > being locked through folio_mark_dirty() calling the ->dirty_folio()
> > aop.
> > 
> 
>> Sorry, I don't understand why folio_mark_dirty() could be called without
>> the folio lock (isn't this supposed to be a bug)?  IIUC, all file backed
>> folios must be locked before being marked dirty. Are there any exceptions,
>> or am I missing something?

Yes: reading the code I pointed you at.

/**
 * folio_mark_dirty - Mark a folio as being modified.
 * @folio: The folio.
 *
 * The folio may not be truncated while this function is running.
 * Holding the folio lock is sufficient to prevent truncation, but some
 * callers cannot acquire a sleeping lock.  These callers instead hold
 * the page table lock for a page table which contains at least one page
 * in this folio.  Truncation will block on the page table lock as it
 * unmaps pages before removing the folio from its mapping.
 *
 * Return: True if the folio was newly dirtied, false if it was already dirty.
 */

So, yes, ->dirty_folio() can indeed be called without the folio
being locked and it is not a bug.
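
One concrete example is the unmap path, where the page table lock stands
in for the folio lock (a sketch abridged from mm/memory.c's
zap_pte_range(); surrounding context elided):

ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
...
/*
 * The PTE lock, not the folio lock, prevents truncation while the
 * dirty bit is transferred from the PTE to the folio.
 */
if (pte_dirty(ptent))
        folio_mark_dirty(folio);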

Hence we have to serialise ->dirty_folio against both
__iomap_write_begin() dirtying the folio and iomap_writepage_map()
clearing the dirty range.

And that means we always need to take the ifs->state_lock in
__iomap_write_begin() when we have an ifs attached to the folio.
Hence it is a) not correct and b) makes no sense to try to do
uptodate bitmap updates without it held...

> > IOWs, getting rid of the state lock from the uptodate range
> > changes does not actually get rid of it from the buffered IO path.
> > We still have to take it to update the dirty range, and so there's
> > an obvious way to optimise the state lock usage without changing any
> > of the bitmap access serialisation behaviour. i.e.  We combine the
> > uptodate and dirty range updates in __iomap_write_end() into a
> > single lock context such as:
> > 
> > static void iomap_set_range_dirty_uptodate(struct folio *folio,
> >                 size_t off, size_t len)
> > {
> >         struct iomap_folio_state *ifs = folio->private;
> >         struct inode *inode;
> >         unsigned int blks_per_folio;
> >         unsigned int first_blk;
> >         unsigned int last_blk;
> >         unsigned int nr_blks;
> >         unsigned long flags;
> > 
> >         if (!ifs)
> >                 return;
> > 
> >         inode = folio->mapping->host;
> >         blks_per_folio = i_blocks_per_folio(inode, folio);
> >         first_blk = (off >> inode->i_blkbits);
> >         last_blk = (off + len - 1) >> inode->i_blkbits;
> >         nr_blks = last_blk - first_blk + 1;
> > 
> >         spin_lock_irqsave(&ifs->state_lock, flags);
> >         /* uptodate bits occupy the first half of ifs->state */
> >         bitmap_set(ifs->state, first_blk, nr_blks);
> >         /* dirty bits occupy the second half */
> >         bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
> >         spin_unlock_irqrestore(&ifs->state_lock, flags);
> > }
> > 
> > This means we calculate the bitmap offsets only once, we take the
> > state lock only once, and we don't do anything if there is no
> > sub-folio state.
> > 
> > If we then fix the __iomap_write_begin() code as Willy pointed out
> > to elide the erroneous uptodate range update, then we end up only
> > taking the state lock once per buffered write instead of 3 times per
> > write.
> > 
> > This patch only reduces it to twice per buffered write, so doing the
> > above should provide even better performance without needing to
> > change the underlying serialisation mechanism at all.
> > 
> 
> Thanks for the suggestion. I've thought about this solution too, but I
> didn't think we needed the state_lock when setting the ifs dirty bit
> since the folio lock should work, so I changed my mind and planned to
> drop all ifs state_lock usage in the write path (please see patch 6).
> Please let me know if I'm wrong.

Whether it works or not is irrelevant: it is badly designed code
that you have proposed. We can achieve the same result without
changing the locking rules for the bitmap data via a small amount of
refactoring, and that is a much better solution than creating
complex and subtle locking rules for the object.

"But it works" doesn't mean the code is robust, maintainable code.

So, good optimisation, but NACK in this form. Please rework it to
only take the ifs->state_lock once for both bitmap updates in the
__iomap_write_end() path.

-Dave.
Zhang Yi Aug. 2, 2024, 11:13 a.m. UTC | #8
On 2024/8/2 14:29, Dave Chinner wrote:
> On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
>> On 2024/8/2 8:05, Dave Chinner wrote:
>>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Commit 1cea335d1db1 ("iomap: fix sub-page uptodate handling") fixed a
>>>> race when submitting multiple read bios for a page that spans more than
>>>> one file system block by adding a spinlock (now named state_lock)
>>>> to serialize the page uptodate updates. However, the race condition only
>>>> exists between the read I/O submitting and completing threads;
>>>
>>> when we do writeback on a folio that has multiple blocks on it we
>>> can submit multiple bios for that, too. Hence the write completions
>>> can race with each other and write submission, too.
>>>
>>> Yes, write bio submission and completion only need to update ifs
>>> accounting using an atomic operation, but the same race condition
>>> exists even though the folio is fully locked at the point of bio
>>> submission.
>>>
>>>
>>>> it's
>>>> sufficient to use page lock to protect other paths, e.g. the buffered write
>>>                     ^^^^ folio
>>>> path.
>>>>
>>>> Now that large folios are supported, the spinlock has a bigger
>>>> impact on buffered write performance, so dropping it could reduce some
>>>> unnecessary locking overhead.
>>>
>>> From the point of view of simple to understand and maintain code, I
>>> think this is a bad idea. The data structure is currently protected
>>> by the state lock in all situations, but this change now makes it
>>> protected by the state lock in one case and the folio lock in a
>>> different case.
>>
>> Yeah, I agree that this is a side-effect of this change; after this patch,
>> we have to be careful to distinguish between the two cases B1 and B2 below,
>> as Willy mentioned.
>>
>> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
>>    either we know:
>> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
>>     is the only place that can call ifs_set_range_uptodate() for this folio
>> B2. The caller of iomap_set_range_uptodate() holds the state lock
> 
> Yes, I read that before I commented that I think it's a bad idea.
> And then provided a method where we don't need to care about this at
> all.
>>
>>>
>>> Making this change also misses the elephant in the room: the
>>> buffered write path still needs the ifs->state_lock to update the
>>> dirty bitmap. Hence we're effectively changing the serialisation
>>> mechanism for only one of the two ifs state bitmaps that the
>>> buffered write path has to update.
>>>
>>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
>>> updates because iomap_dirty_folio() can be called without the folio
>>> being locked through folio_mark_dirty() calling the ->dirty_folio()
>>> aop.
>>>
>>
>> Sorry, I don't understand why folio_mark_dirty() could be called without
>> the folio lock (isn't this supposed to be a bug)?  IIUC, all file backed
>> folios must be locked before being marked dirty. Are there any exceptions,
>> or am I missing something?
> 
> Yes: reading the code I pointed you at.
> 
> /**
>  * folio_mark_dirty - Mark a folio as being modified.
>  * @folio: The folio.
>  *
>  * The folio may not be truncated while this function is running.
>  * Holding the folio lock is sufficient to prevent truncation, but some
>  * callers cannot acquire a sleeping lock.  These callers instead hold
>  * the page table lock for a page table which contains at least one page
>  * in this folio.  Truncation will block on the page table lock as it
>  * unmaps pages before removing the folio from its mapping.
>  *
>  * Return: True if the folio was newly dirtied, false if it was already dirty.
>  */
> 
> So, yes, ->dirty_folio() can indeed be called without the folio
> being locked and it is not a bug.

Ha, right, I missed the comments of this function. It means that there
are some special callers that hold the page table lock instead of the
folio lock, is it pte_alloc_map_lock?

I checked all the filesystem related callers and didn't find any real
caller that marks a folio dirty without holding the folio lock and that
could affect current filesystems using the iomap framework; it's just a
potential possibility in the future, am I right?

> 
> Hence we have to serialise ->dirty_folio against both
> __iomap_write_begin() dirtying the folio and iomap_writepage_map()
> clearing the dirty range.
> 

Both __iomap_write_begin() and iomap_writepage_map() run under the folio
lock now (locked in iomap_get_folio() and writeback_get_folio()); is there
anything special about this case?

> And that means we always need to take the ifs->state_lock in
> __iomap_write_begin() when we have an ifs attached to the folio.
> Hence it is a) not correct and b) makes no sense to try to do
> uptodate bitmap updates without it held...
> 
>>> IOWs, getting rid of the state lock from the uptodate range
>>> changes does not actually get rid of it from the buffered IO path.
>>> We still have to take it to update the dirty range, and so there's
>>> an obvious way to optimise the state lock usage without changing any
>>> of the bitmap access serialisation behaviour. i.e.  We combine the
>>> uptodate and dirty range updates in __iomap_write_end() into a
>>> single lock context such as:
>>>
>>> static void iomap_set_range_dirty_uptodate(struct folio *folio,
>>>                 size_t off, size_t len)
>>> {
>>>         struct iomap_folio_state *ifs = folio->private;
>>>         struct inode *inode;
>>>         unsigned int blks_per_folio;
>>>         unsigned int first_blk;
>>>         unsigned int last_blk;
>>>         unsigned int nr_blks;
>>>         unsigned long flags;
>>> 
>>>         if (!ifs)
>>>                 return;
>>> 
>>>         inode = folio->mapping->host;
>>>         blks_per_folio = i_blocks_per_folio(inode, folio);
>>>         first_blk = (off >> inode->i_blkbits);
>>>         last_blk = (off + len - 1) >> inode->i_blkbits;
>>>         nr_blks = last_blk - first_blk + 1;
>>> 
>>>         spin_lock_irqsave(&ifs->state_lock, flags);
>>>         /* uptodate bits occupy the first half of ifs->state */
>>>         bitmap_set(ifs->state, first_blk, nr_blks);
>>>         /* dirty bits occupy the second half */
>>>         bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
>>>         spin_unlock_irqrestore(&ifs->state_lock, flags);
>>> }
>>>
>>> This means we calculate the bitmap offsets only once, we take the
>>> state lock only once, and we don't do anything if there is no
>>> sub-folio state.
>>>
>>> If we then fix the __iomap_write_begin() code as Willy pointed out
>>> to elide the erroneous uptodate range update, then we end up only
>>> taking the state lock once per buffered write instead of 3 times per
>>> write.
>>>
>>> This patch only reduces it to twice per buffered write, so doing the
>>> above should provide even better performance without needing to
>>> change the underlying serialisation mechanism at all.
>>>
>>
>> Thanks for the suggestion. I've thought about this solution too, but I
>> didn't think we needed the state_lock when setting the ifs dirty bit
>> since the folio lock should work, so I changed my mind and planned to
>> drop all ifs state_lock usage in the write path (please see patch 6).
>> Please let me know if I'm wrong.
> 
> Whether it works or not is irrelevant: it is badly designed code
> that you have proposed. We can achieve the same result without
> changing the locking rules for the bitmap data via a small amount of
> refactoring, and that is a much better solution than creating
> complex and subtle locking rules for the object.
> 
> "But it works" doesn't mean the code is robust, maintainable code.
> 
> So, good optimisation, but NACK in this form. Please rework it to
> only take the ifs->state_lock once for both bitmap updates in the
> __iomap_write_end() path.
> 

OK, sure, this looks reasonable to me. I will revise it as you suggested
and check the performance gain again in my next iteration.

Thanks,
Yi.
Jan Kara Aug. 5, 2024, 12:42 p.m. UTC | #9
On Fri 02-08-24 19:13:11, Zhang Yi wrote:
> On 2024/8/2 14:29, Dave Chinner wrote:
> > On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> >> On 2024/8/2 8:05, Dave Chinner wrote:
> >>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> >>> Making this change also misses the elephant in the room: the
> >>> buffered write path still needs the ifs->state_lock to update the
> >>> dirty bitmap. Hence we're effectively changing the serialisation
> >>> mechanism for only one of the two ifs state bitmaps that the
> >>> buffered write path has to update.
> >>>
> >>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> >>> updates because iomap_dirty_folio() can be called without the folio
> >>> being locked through folio_mark_dirty() calling the ->dirty_folio()
> >>> aop.
> >>>
> >>
> >> Sorry, I don't understand why folio_mark_dirty() could be called without
> >> the folio lock (isn't this supposed to be a bug)?  IIUC, all file backed
> >> folios must be locked before being marked dirty. Are there any exceptions,
> >> or am I missing something?
> > 
> > Yes: reading the code I pointed you at.
> > 
> > /**
> >  * folio_mark_dirty - Mark a folio as being modified.
> >  * @folio: The folio.
> >  *
> >  * The folio may not be truncated while this function is running.
> >  * Holding the folio lock is sufficient to prevent truncation, but some
> >  * callers cannot acquire a sleeping lock.  These callers instead hold
> >  * the page table lock for a page table which contains at least one page
> >  * in this folio.  Truncation will block on the page table lock as it
> >  * unmaps pages before removing the folio from its mapping.
> >  *
> >  * Return: True if the folio was newly dirtied, false if it was already dirty.
> >  */
> > 
> > So, yes, ->dirty_folio() can indeed be called without the folio
> > being locked and it is not a bug.
> 
> Ha, right, I missed the comments of this function. It means that there
> are some special callers that hold the page table lock instead of the
> folio lock, is it pte_alloc_map_lock?
> 
> I checked all the filesystem related callers and didn't find any real
> caller that marks a folio dirty without holding the folio lock and that
> could affect current filesystems using the iomap framework; it's just a
> potential possibility in the future, am I right?

There used to be quite a few places doing that. Now that I've checked,
all the places I was aware of actually got converted to call
folio_mark_dirty() under a folio lock (in particular all the cases
happening on IO completion, folio unmap etc.). Matthew, are you aware
of any place where folio_mark_dirty() would be called for regular file
page cache (block device page cache is obviously in a different
situation) without the folio lock held?

								Honza
Jan Kara Aug. 5, 2024, 2 p.m. UTC | #10
Actually add Matthew to CC ;)

On Mon 05-08-24 14:42:52, Jan Kara wrote:
> On Fri 02-08-24 19:13:11, Zhang Yi wrote:
> > On 2024/8/2 14:29, Dave Chinner wrote:
> > > On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
> > >> On 2024/8/2 8:05, Dave Chinner wrote:
> > >>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
> > >>> Making this change also misses the elephant in the room: the
> > >>> buffered write path still needs the ifs->state_lock to update the
> > >>> dirty bitmap. Hence we're effectively changing the serialisation
> > >>> mechanism for only one of the two ifs state bitmaps that the
> > >>> buffered write path has to update.
> > >>>
> > >>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
> > >>> updates because iomap_dirty_folio() can be called without the folio
> > >>> being locked through folio_mark_dirty() calling the ->dirty_folio()
> > >>> aop.
> > >>>
> > >>
> > >> Sorry, I don't understand why folio_mark_dirty() could be called without
> > >> the folio lock (isn't this supposed to be a bug)?  IIUC, all file backed
> > >> folios must be locked before being marked dirty. Are there any exceptions,
> > >> or am I missing something?
> > > 
> > > Yes: reading the code I pointed you at.
> > > 
> > > /**
> > >  * folio_mark_dirty - Mark a folio as being modified.
> > >  * @folio: The folio.
> > >  *
> > >  * The folio may not be truncated while this function is running.
> > >  * Holding the folio lock is sufficient to prevent truncation, but some
> > >  * callers cannot acquire a sleeping lock.  These callers instead hold
> > >  * the page table lock for a page table which contains at least one page
> > >  * in this folio.  Truncation will block on the page table lock as it
> > >  * unmaps pages before removing the folio from its mapping.
> > >  *
> > >  * Return: True if the folio was newly dirtied, false if it was already dirty.
> > >  */
> > > 
> > > So, yes, ->dirty_folio() can indeed be called without the folio
> > > being locked and it is not a bug.
> > 
> > Ha, right, I missed the comments of this function. It means that there
> > are some special callers that hold the page table lock instead of the
> > folio lock, is it pte_alloc_map_lock?
> > 
> > I checked all the filesystem related callers and didn't find any real
> > caller that marks a folio dirty without holding the folio lock and that
> > could affect current filesystems using the iomap framework; it's just a
> > potential possibility in the future, am I right?
> 
> There used to be quite a few places doing that. Now that I've checked,
> all the places I was aware of actually got converted to call
> folio_mark_dirty() under a folio lock (in particular all the cases
> happening on IO completion, folio unmap etc.). Matthew, are you aware
> of any place where folio_mark_dirty() would be called for regular file
> page cache (block device page cache is obviously in a different
> situation) without the folio lock held?
> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
Matthew Wilcox Aug. 5, 2024, 3:48 p.m. UTC | #11
On Mon, Aug 05, 2024 at 04:00:23PM +0200, Jan Kara wrote:
> Actually add Matthew to CC ;)

It's OK, I was reading.

FWIW, I agree with Dave; the locking complexity in this patch was
horrendous.  I was going to get to the same critique he had, but I first
wanted to understand what the thought process was.

> > > Ha, right, I missed the comments of this function. It means that there
> > > are some special callers that hold the page table lock instead of the
> > > folio lock, is it pte_alloc_map_lock?
> > > 
> > > I checked all the filesystem related callers and didn't find any real
> > > caller that marks a folio dirty without holding the folio lock and that
> > > could affect current filesystems using the iomap framework; it's just a
> > > potential possibility in the future, am I right?

Filesystems are normally quite capable of taking the folio lock to
prevent truncation.  It's the MM code that needs the "or holding the
page table lock" get-out clause.  I forget exactly which callers it
is; I worked through them a few times.  It's not hard to put a
WARN_ON_RATELIMIT() into folio_mark_dirty() and get a good sampling.
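
For instance, something like this in folio_mark_dirty() would do for the
sampling (a sketch; the ratelimit state is local to the experiment, not
something in the tree):

static DEFINE_RATELIMIT_STATE(unlocked_dirty_rs,
                              DEFAULT_RATELIMIT_INTERVAL,
                              DEFAULT_RATELIMIT_BURST);

/* catch a sample of callers that don't hold the folio lock */
WARN_ON_RATELIMIT(!folio_test_locked(folio), &unlocked_dirty_rs);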

There's also a "or holding a buffer_head locked" get-out clause that
I'm not sure is documented anywhere, but obviously that doesn't apply
to the iomap code.

> > There used to be quite a few places doing that. Now that I've checked,
> > all the places I was aware of actually got converted to call
> > folio_mark_dirty() under a folio lock (in particular all the cases
> > happening on IO completion, folio unmap etc.). Matthew, are you aware
> > of any place where folio_mark_dirty() would be called for regular file
> > page cache (block device page cache is obviously in a different
> > situation) without the folio lock held?

Yes, the MM code definitely applies to regular files as well as block
devices.
Zhang Yi Aug. 7, 2024, 11:39 a.m. UTC | #12
On 2024/8/5 23:48, Matthew Wilcox wrote:
> On Mon, Aug 05, 2024 at 04:00:23PM +0200, Jan Kara wrote:
>> Actually add Matthew to CC ;)
> 
> It's OK, I was reading.
> 
> FWIW, I agree with Dave; the locking complexity in this patch was
> horrendous.  I was going to get to the same critique he had, but I first
> wanted to understand what the thought process was.

Yes, I'd like to change it to use the solution Dave suggested.

> 
>>>> Ha, right, I missed the comments of this function. It means that there
>>>> are some special callers that hold the page table lock instead of the
>>>> folio lock, is it pte_alloc_map_lock?
>>>>
>>>> I checked all the filesystem related callers and didn't find any real
>>>> caller that marks a folio dirty without holding the folio lock and that
>>>> could affect current filesystems using the iomap framework; it's just a
>>>> potential possibility in the future, am I right?
> 
> Filesystems are normally quite capable of taking the folio lock to
> prevent truncation.  It's the MM code that needs the "or holding the
> page table lock" get-out clause.  I forget exactly which callers it
> is; I worked through them a few times.  It's not hard to put a
> WARN_ON_RATELIMIT() into folio_mark_dirty() and get a good sampling.
> 
> There's also a "or holding a buffer_head locked" get-out clause that
> I'm not sure is documented anywhere, but obviously that doesn't apply
> to the iomap code.

Thanks for your answer; I've found some callers.

Thanks,
Yi.

> 
>>> There used to be quite a few places doing that. Now that I've checked,
>>> all the places I was aware of actually got converted to call
>>> folio_mark_dirty() under a folio lock (in particular all the cases
>>> happening on IO completion, folio unmap etc.). Matthew, are you aware
>>> of any place where folio_mark_dirty() would be called for regular file
>>> page cache (block device page cache is obviously in a different
>>> situation) without the folio lock held?
> 
> Yes, the MM code definitely applies to regular files as well as block
> devices.
>

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f5668895df66..248f4a586f8f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -73,14 +73,10 @@  static void iomap_set_range_uptodate(struct folio *folio, size_t off,
 		size_t len)
 {
 	struct iomap_folio_state *ifs = folio->private;
-	unsigned long flags;
 	bool uptodate = true;
 
-	if (ifs) {
-		spin_lock_irqsave(&ifs->state_lock, flags);
+	if (ifs)
 		uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
-		spin_unlock_irqrestore(&ifs->state_lock, flags);
-	}
 
 	if (uptodate)
 		folio_mark_uptodate(folio);
@@ -395,7 +391,18 @@  static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, poff, plen);
+		if (ifs) {
+			/*
+			 * Hold state_lock to protect from submitting multiple
+			 * bios for the same locked folio and then getting
+			 * multiple callbacks in parallel.
+			 */
+			spin_lock_irq(&ifs->state_lock);
+			iomap_set_range_uptodate(folio, poff, plen);
+			spin_unlock_irq(&ifs->state_lock);
+		} else {
+			folio_mark_uptodate(folio);
+		}
 		goto done;
 	}