
[12/27] ext4: introduce seq counter for the extent status entry

Message ID 20241022111059.2566137-13-yi.zhang@huaweicloud.com (mailing list archive)
State New
Series ext4: use iomap for regular file's buffered I/O path and enable large folio

Commit Message

Zhang Yi Oct. 22, 2024, 11:10 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

In iomap_write_iter(), the iomap buffered write framework does not hold
any locks between querying the inode's extent mapping info and performing
page cache writes. As a result, the extent mapping can be changed by
concurrent I/O in flight. Similarly, in iomap_writepage_map(), the
writeback process faces the same problem: concurrent changes can
invalidate the extent mapping before the I/O is submitted.

Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. We will increment this number whenever the extent status
tree changes, thereby preparing for the buffered write iomap conversion.
This patch also adjusts the trace code style to make checkpatch.pl
happy.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h              |  1 +
 fs/ext4/extents_status.c    | 13 ++++++++-
 fs/ext4/super.c             |  1 +
 include/trace/events/ext4.h | 57 +++++++++++++++++++++----------------
 4 files changed, 46 insertions(+), 26 deletions(-)

Comments

Jan Kara Dec. 4, 2024, 12:42 p.m. UTC | #1
On Tue 22-10-24 19:10:43, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> In the iomap_write_iter(), the iomap buffered write frame does not hold
> any locks between querying the inode extent mapping info and performing
> page cache writes. As a result, the extent mapping can be changed due to
> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
> write-back process faces a similar problem: concurrent changes can
> invalidate the extent mapping before the I/O is submitted.
> 
> Therefore, both of these processes must recheck the mapping info after
> acquiring the folio lock. To address this, similar to XFS, we propose
> introducing an extent sequence number to serve as a validity cookie for
> the extent. We will increment this number whenever the extent status
> tree changes, thereby preparing for the buffered write iomap conversion.
> Besides, it also changes the trace code style to make checkpatch.pl
> happy.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Overall using some sequence counter makes sense.

> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index c786691dabd3..bea4f87db502 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>  	return es->es_lblk + es->es_len - 1;
>  }
>  
> +static inline void ext4_es_inc_seq(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
> +}

This looks potentially dangerous because we can lose i_es_seq updates this
way. Like

CPU1					CPU2
x = READ_ONCE(ei->i_es_seq)
					x = READ_ONCE(ei->i_es_seq)
					WRITE_ONCE(ei->i_es_seq, x + 1)
					...
					potentially many times
WRITE_ONCE(ei->i_es_seq, x + 1)
  -> the counter goes back leading to possibly false equality checks

I think you'll need to use atomic_t and appropriate functions here.

> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  	BUG_ON(end < lblk);
>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>  
> +	ext4_es_inc_seq(inode);

I'm somewhat wondering: Are extent status tree modifications the right
place to advance the sequence counter? The counter needs to advance
whenever the mapping information changes. This means that we'd be
needlessly advancing the counter (and thus possibly forcing retries) when
we are just adding new information from ordinary extent tree into cache.
Also someone can be doing extent tree manipulations without touching extent
status tree (if the information was already pruned from there). So I think
needs some very good documentation what are the expectations from the
sequence counter and explanations why they are satisfied so that we don't
break this in the future.

								Honza
Zhang Yi Dec. 6, 2024, 8:55 a.m. UTC | #2
On 2024/12/4 20:42, Jan Kara wrote:
> On Tue 22-10-24 19:10:43, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> In the iomap_write_iter(), the iomap buffered write frame does not hold
>> any locks between querying the inode extent mapping info and performing
>> page cache writes. As a result, the extent mapping can be changed due to
>> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
>> write-back process faces a similar problem: concurrent changes can
>> invalidate the extent mapping before the I/O is submitted.
>>
>> Therefore, both of these processes must recheck the mapping info after
>> acquiring the folio lock. To address this, similar to XFS, we propose
>> introducing an extent sequence number to serve as a validity cookie for
>> the extent. We will increment this number whenever the extent status
>> tree changes, thereby preparing for the buffered write iomap conversion.
>> Besides, it also changes the trace code style to make checkpatch.pl
>> happy.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Overall using some sequence counter makes sense.
> 
>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>> index c786691dabd3..bea4f87db502 100644
>> --- a/fs/ext4/extents_status.c
>> +++ b/fs/ext4/extents_status.c
>> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>>  	return es->es_lblk + es->es_len - 1;
>>  }
>>  
>> +static inline void ext4_es_inc_seq(struct inode *inode)
>> +{
>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>> +
>> +	WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
>> +}
> 
> This looks potentially dangerous because we can lose i_es_seq updates this
> way. Like
> 
> CPU1					CPU2
> x = READ_ONCE(ei->i_es_seq)
> 					x = READ_ONCE(ei->i_es_seq)
> 					WRITE_ONCE(ei->i_es_seq, x + 1)
> 					...
> 					potentially many times
> WRITE_ONCE(ei->i_es_seq, x + 1)
>   -> the counter goes back leading to possibly false equality checks
> 

In my current implementation, I don't think this race condition can
happen, since all ext4_es_inc_seq() invocations are under
EXT4_I(inode)->i_es_lock. So I think it works fine as it is, or have I
missed something?

> I think you'll need to use atomic_t and appropriate functions here.
> 
>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>  	BUG_ON(end < lblk);
>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>  
>> +	ext4_es_inc_seq(inode);
> 
> I'm somewhat wondering: Are extent status tree modifications the right
> place to advance the sequence counter? The counter needs to advance
> whenever the mapping information changes. This means that we'd be
> needlessly advancing the counter (and thus possibly forcing retries) when
> we are just adding new information from ordinary extent tree into cache.
> Also someone can be doing extent tree manipulations without touching extent
> status tree (if the information was already pruned from there). 

Sorry, I don't quite understand here. IIUC, we can't modify the extent
tree without also touching the extent status tree; otherwise, the extent
status tree would become stale, potentially leading to undesirable and
unexpected outcomes later on, since the extent lookup paths rely on and
always trust the status tree. If this situation happened, would it be
considered a bug? Additionally, I have checked the code but didn't find
any concrete case where this could happen. Have I overlooked something?

> So I think
> needs some very good documentation what are the expectations from the
> sequence counter and explanations why they are satisfied so that we don't
> break this in the future.
> 

Yeah, that's a good suggestion. Where do you suggest putting this
documentation? How about at the front of extents_status.c?

Thanks,
Yi.
Jan Kara Dec. 6, 2024, 4:21 p.m. UTC | #3
On Fri 06-12-24 16:55:01, Zhang Yi wrote:
> On 2024/12/4 20:42, Jan Kara wrote:
> > On Tue 22-10-24 19:10:43, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> In the iomap_write_iter(), the iomap buffered write frame does not hold
> >> any locks between querying the inode extent mapping info and performing
> >> page cache writes. As a result, the extent mapping can be changed due to
> >> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
> >> write-back process faces a similar problem: concurrent changes can
> >> invalidate the extent mapping before the I/O is submitted.
> >>
> >> Therefore, both of these processes must recheck the mapping info after
> >> acquiring the folio lock. To address this, similar to XFS, we propose
> >> introducing an extent sequence number to serve as a validity cookie for
> >> the extent. We will increment this number whenever the extent status
> >> tree changes, thereby preparing for the buffered write iomap conversion.
> >> Besides, it also changes the trace code style to make checkpatch.pl
> >> happy.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Overall using some sequence counter makes sense.
> > 
> >> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> >> index c786691dabd3..bea4f87db502 100644
> >> --- a/fs/ext4/extents_status.c
> >> +++ b/fs/ext4/extents_status.c
> >> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
> >>  	return es->es_lblk + es->es_len - 1;
> >>  }
> >>  
> >> +static inline void ext4_es_inc_seq(struct inode *inode)
> >> +{
> >> +	struct ext4_inode_info *ei = EXT4_I(inode);
> >> +
> >> +	WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
> >> +}
> > 
> > This looks potentially dangerous because we can lose i_es_seq updates this
> > way. Like
> > 
> > CPU1					CPU2
> > x = READ_ONCE(ei->i_es_seq)
> > 					x = READ_ONCE(ei->i_es_seq)
> > 					WRITE_ONCE(ei->i_es_seq, x + 1)
> > 					...
> > 					potentially many times
> > WRITE_ONCE(ei->i_es_seq, x + 1)
> >   -> the counter goes back leading to possibly false equality checks
> > 
> 
> In my current implementation, I don't think this race condition can
> happen since all ext4_es_inc_seq() invocations are under
> EXT4_I(inode)->i_es_lock. So I think it works fine now, or was I
> missed something?

Hum, as far as I've checked, at least the place in ext4_es_insert_extent()
where you call ext4_es_inc_seq() doesn't hold i_es_lock (yet). If you meant
to protect the updates by i_es_lock, then move the call sites and please
add a comment about it. Also it should be enough to do:

WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);

since we cannot be really racing with other writers.

> > I think you'll need to use atomic_t and appropriate functions here.
> > 
> >> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >>  	BUG_ON(end < lblk);
> >>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
> >>  
> >> +	ext4_es_inc_seq(inode);
> > 
> > I'm somewhat wondering: Are extent status tree modifications the right
> > place to advance the sequence counter? The counter needs to advance
> > whenever the mapping information changes. This means that we'd be
> > needlessly advancing the counter (and thus possibly forcing retries) when
> > we are just adding new information from ordinary extent tree into cache.
> > Also someone can be doing extent tree manipulations without touching extent
> > status tree (if the information was already pruned from there). 
> 
> Sorry, I don't quite understand here. IIUC, we can't modify the extent
> tree without also touching extent status tree; otherwise, the extent
> status tree will become stale, potentially leading to undesirable and
> unexpected outcomes later on, as the extent lookup paths rely on and
> always trust the status tree. If this situation happens, would it be
> considered a bug? Additionally, I have checked the code but didn't find
> any concrete cases where this could happen. Was I overlooked something?

What I'm worried about is that this seems a bit fragile because e.g. in
ext4_collapse_range() we do:

ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
<now go and manipulate the extent tree>

So if somebody managed to sneak in between ext4_es_remove_extent() and
the extent tree manipulation, they could get a block mapping that is
invalidated shortly afterwards by the extent tree changes. And as I'm checking now,
writeback code *can* sneak in there because during extent tree
manipulations we call ext4_datasem_ensure_credits() which can drop
i_data_sem to restart a transaction.

Now we do writeout & invalidate page cache before we start to do these
extent tree dances so I don't see how this could lead to *actual* use
after free issues but it makes me somewhat nervous. So that's why I'd like
to have some clear rules from which it is obvious that the counter makes
sure we do not use stale mappings.

> > So I think
> > needs some very good documentation what are the expectations from the
> > sequence counter and explanations why they are satisfied so that we don't
> > break this in the future.
> 
> Yeah, it's a good suggestion, where do you suggest putting this
> documentation, how about in the front of extents_status.c?

I think at the function incrementing the counter would be fine.

								Honza
Zhang Yi Dec. 9, 2024, 8:32 a.m. UTC | #4
On 2024/12/7 0:21, Jan Kara wrote:
> On Fri 06-12-24 16:55:01, Zhang Yi wrote:
>> On 2024/12/4 20:42, Jan Kara wrote:
>>> On Tue 22-10-24 19:10:43, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> In the iomap_write_iter(), the iomap buffered write frame does not hold
>>>> any locks between querying the inode extent mapping info and performing
>>>> page cache writes. As a result, the extent mapping can be changed due to
>>>> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
>>>> write-back process faces a similar problem: concurrent changes can
>>>> invalidate the extent mapping before the I/O is submitted.
>>>>
>>>> Therefore, both of these processes must recheck the mapping info after
>>>> acquiring the folio lock. To address this, similar to XFS, we propose
>>>> introducing an extent sequence number to serve as a validity cookie for
>>>> the extent. We will increment this number whenever the extent status
>>>> tree changes, thereby preparing for the buffered write iomap conversion.
>>>> Besides, it also changes the trace code style to make checkpatch.pl
>>>> happy.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> Overall using some sequence counter makes sense.
>>>
>>>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>>>> index c786691dabd3..bea4f87db502 100644
>>>> --- a/fs/ext4/extents_status.c
>>>> +++ b/fs/ext4/extents_status.c
>>>> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>>>>  	return es->es_lblk + es->es_len - 1;
>>>>  }
>>>>  
>>>> +static inline void ext4_es_inc_seq(struct inode *inode)
>>>> +{
>>>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>>>> +
>>>> +	WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
>>>> +}
>>>
> >>> This looks potentially dangerous because we can lose i_es_seq updates this
>>> way. Like
>>>
>>> CPU1					CPU2
>>> x = READ_ONCE(ei->i_es_seq)
>>> 					x = READ_ONCE(ei->i_es_seq)
>>> 					WRITE_ONCE(ei->i_es_seq, x + 1)
>>> 					...
>>> 					potentially many times
>>> WRITE_ONCE(ei->i_es_seq, x + 1)
>>>   -> the counter goes back leading to possibly false equality checks
>>>
>>
>> In my current implementation, I don't think this race condition can
>> happen since all ext4_es_inc_seq() invocations are under
>> EXT4_I(inode)->i_es_lock. So I think it works fine now, or was I
>> missed something?
> 
> Hum, as far as I've checked, at least the place in ext4_es_insert_extent()
> where you call ext4_es_inc_seq() doesn't hold i_es_lock (yet). If you meant
> to protect the updates by i_es_lock, then move the call sites and please
> add a comment about it. Also it should be enough to do:
> 
> WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);
> 
> since we cannot be really racing with other writers.

Oh, sorry, I mentioned the wrong lock. What I intended to say is
i_data_sem.

Currently, all instances where we update the extent status tree will
hold i_data_sem in write mode, preventing any race conditions in these
scenarios. However, we may hold i_data_sem in read mode while loading
a new entry from the extent tree (e.g., ext4_map_query_blocks()). In
these cases, a race condition could occur, but it doesn't modify the
extents, and the newly loaded range should not be related to the
mapping range we obtained (if it overlaps with a range we have, it
must first remove the old extent entries, which is protected by
i_data_sem, ensuring that i_es_seq increases by at least one).
Therefore, it should not use a stale mapping or trigger any real issues.

However, after thinking about it again, I agree with you that this
approach is subtle, fragile, and hard to understand, so now I think
we should move it under i_es_lock.

> 
>>> I think you'll need to use atomic_t and appropriate functions here.
>>>
>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>  	BUG_ON(end < lblk);
>>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>  
>>>> +	ext4_es_inc_seq(inode);
>>>
>>> I'm somewhat wondering: Are extent status tree modifications the right
>>> place to advance the sequence counter? The counter needs to advance
>>> whenever the mapping information changes. This means that we'd be
>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>> we are just adding new information from ordinary extent tree into cache.
>>> Also someone can be doing extent tree manipulations without touching extent
>>> status tree (if the information was already pruned from there). 
>>
>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>> tree without also touching extent status tree; otherwise, the extent
>> status tree will become stale, potentially leading to undesirable and
>> unexpected outcomes later on, as the extent lookup paths rely on and
>> always trust the status tree. If this situation happens, would it be
>> considered a bug? Additionally, I have checked the code but didn't find
>> any concrete cases where this could happen. Was I overlooked something?
> 
> What I'm worried about is that this seems a bit fragile because e.g. in
> ext4_collapse_range() we do:
> 
> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
> <now go and manipulate the extent tree>
> 
> So if somebody managed to sneak in between ext4_es_remove_extent() and
> the extent tree manipulation, he could get a block mapping which is shortly
> after invalidated by the extent tree changes. And as I'm checking now,
> writeback code *can* sneak in there because during extent tree
> manipulations we call ext4_datasem_ensure_credits() which can drop
> i_data_sem to restart a transaction.
> 
> Now we do writeout & invalidate page cache before we start to do these
> extent tree dances so I don't see how this could lead to *actual* use
> after free issues but it makes me somewhat nervous. So that's why I'd like
> to have some clear rules from which it is obvious that the counter makes
> sure we do not use stale mappings.

Yes, I see. I think the rule should be as follows:

First, when the iomap infrastructure is creating or querying file
mapping information, we must ensure that the mapping information
always passes through the extent status tree, which means
ext4_map_blocks(), ext4_map_query_blocks(), and
ext4_map_create_blocks() should cache the extent status entries that
we intend to use.

Second, when updating the extent tree, we must hold the i_data_sem in
write mode and update the extent status tree atomically. Additionally,
if we cannot update the extent tree while holding a single i_data_sem,
we should first remove all related extent status entries within the
specified range, then manipulate the extent tree, ensuring that the
extent status entries are always up-to-date if they exist (as
ext4_collapse_range() does).

Finally, if we want to manipulate the extent tree without caching, we
should also remove the extent status entries first.

In summary, ensure that the extent status tree and the extent tree are
consistent under one i_data_sem. If we can't, remove the extent status
entry before manipulating the extent tree.

Do you agree?

> 
>>> So I think
>>> needs some very good documentation what are the expectations from the
>>> sequence counter and explanations why they are satisfied so that we don't
>>> break this in the future.
>>
>> Yeah, it's a good suggestion, where do you suggest putting this
>> documentation, how about in the front of extents_status.c?
> 
> I think at the function incrementing the counter would be fine.
> 

Sure, thanks for pointing this out.

Thanks,
Yi.
Jan Kara Dec. 10, 2024, 12:57 p.m. UTC | #5
On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> On 2024/12/7 0:21, Jan Kara wrote:
> >>> I think you'll need to use atomic_t and appropriate functions here.
> >>>
> >>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >>>>  	BUG_ON(end < lblk);
> >>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
> >>>>  
> >>>> +	ext4_es_inc_seq(inode);
> >>>
> >>> I'm somewhat wondering: Are extent status tree modifications the right
> >>> place to advance the sequence counter? The counter needs to advance
> >>> whenever the mapping information changes. This means that we'd be
> >>> needlessly advancing the counter (and thus possibly forcing retries) when
> >>> we are just adding new information from ordinary extent tree into cache.
> >>> Also someone can be doing extent tree manipulations without touching extent
> >>> status tree (if the information was already pruned from there). 
> >>
> >> Sorry, I don't quite understand here. IIUC, we can't modify the extent
> >> tree without also touching extent status tree; otherwise, the extent
> >> status tree will become stale, potentially leading to undesirable and
> >> unexpected outcomes later on, as the extent lookup paths rely on and
> >> always trust the status tree. If this situation happens, would it be
> >> considered a bug? Additionally, I have checked the code but didn't find
> >> any concrete cases where this could happen. Was I overlooked something?
> > 
> > What I'm worried about is that this seems a bit fragile because e.g. in
> > ext4_collapse_range() we do:
> > 
> > ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
> > <now go and manipulate the extent tree>
> > 
> > So if somebody managed to sneak in between ext4_es_remove_extent() and
> > the extent tree manipulation, he could get a block mapping which is shortly
> > after invalidated by the extent tree changes. And as I'm checking now,
> > writeback code *can* sneak in there because during extent tree
> > manipulations we call ext4_datasem_ensure_credits() which can drop
> > i_data_sem to restart a transaction.
> > 
> > Now we do writeout & invalidate page cache before we start to do these
> > extent tree dances so I don't see how this could lead to *actual* use
> > after free issues but it makes me somewhat nervous. So that's why I'd like
> > to have some clear rules from which it is obvious that the counter makes
> > sure we do not use stale mappings.
> 
> Yes, I see. I think the rule should be as follows:
> 
> First, when the iomap infrastructure is creating or querying file
> mapping information, we must ensure that the mapping information
> always passes through the extent status tree, which means
> ext4_map_blocks(), ext4_map_query_blocks(), and
> ext4_map_create_blocks() should cache the extent status entries that
> we intend to use.

OK, this currently holds. There's just one snag: during fast-commit
replay, ext4_es_insert_extent() doesn't do anything. I don't think there's
any race possible during that stage, but it's another case to think about.

> Second, when updating the extent tree, we must hold the i_data_sem in
> write mode and update the extent status tree atomically.

Fine.

> Additionally,
> if we cannot update the extent tree while holding a single i_data_sem,
> we should first remove all related extent status entries within the
> specified range, then manipulate the extent tree, ensuring that the
> extent status entries are always up-to-date if they exist (as
> ext4_collapse_range() does).

In this case, I think we need to provide more details. In particular I
would require that in all such cases we must:
a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
   provides exclusion against page faults, reads, writes
b) evict all page cache in the affected range -> should stop writeback -
   *but* currently there's one case which could be problematic. Assume we
   do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
   the above and starts removing blocks, needs to restart transaction so it
   drops i_data_sem. Writeback starts for page N+1, needs to load extent
   block into memory, ext4_cache_extents() now loads back some extents
   covering range 0..N into extent status tree. So the only protection
   against using freed blocks is that nobody should be mapping anything in
   the range 0..N because we hold those locks & have evicted page cache.

So I think we need to also document, that anybody mapping blocks needs to
hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
ext4_map_blocks() to catch cases we missed. Asserting for the page lock will
not really be doable, but luckily only page writeback needs that, so it can
get an exemption from the assert.

> Finally, if we want to manipulate the extent tree without caching, we
> should also remove the extent status entries first.

Based on the above, I don't think this is really needed. We must only make
sure that after all extent tree updates are done, and before we release the
invalidate_lock, all extents from the extent status tree in the modified
range are evicted / replaced to match reality.

								Honza
Zhang Yi Dec. 11, 2024, 7:59 a.m. UTC | #6
On 2024/12/10 20:57, Jan Kara wrote:
> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>> On 2024/12/7 0:21, Jan Kara wrote:
>>>>> I think you'll need to use atomic_t and appropriate functions here.
>>>>>
>>>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>>>  	BUG_ON(end < lblk);
>>>>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>>>  
>>>>>> +	ext4_es_inc_seq(inode);
>>>>>
>>>>> I'm somewhat wondering: Are extent status tree modifications the right
>>>>> place to advance the sequence counter? The counter needs to advance
>>>>> whenever the mapping information changes. This means that we'd be
>>>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>>>> we are just adding new information from ordinary extent tree into cache.
>>>>> Also someone can be doing extent tree manipulations without touching extent
>>>>> status tree (if the information was already pruned from there). 
>>>>
>>>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>>>> tree without also touching extent status tree; otherwise, the extent
>>>> status tree will become stale, potentially leading to undesirable and
>>>> unexpected outcomes later on, as the extent lookup paths rely on and
>>>> always trust the status tree. If this situation happens, would it be
>>>> considered a bug? Additionally, I have checked the code but didn't find
>>>> any concrete cases where this could happen. Was I overlooked something?
>>>
>>> What I'm worried about is that this seems a bit fragile because e.g. in
>>> ext4_collapse_range() we do:
>>>
>>> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
>>> <now go and manipulate the extent tree>
>>>
>>> So if somebody managed to sneak in between ext4_es_remove_extent() and
>>> the extent tree manipulation, he could get a block mapping which is shortly
>>> after invalidated by the extent tree changes. And as I'm checking now,
>>> writeback code *can* sneak in there because during extent tree
>>> manipulations we call ext4_datasem_ensure_credits() which can drop
>>> i_data_sem to restart a transaction.
>>>
>>> Now we do writeout & invalidate page cache before we start to do these
>>> extent tree dances so I don't see how this could lead to *actual* use
>>> after free issues but it makes me somewhat nervous. So that's why I'd like
>>> to have some clear rules from which it is obvious that the counter makes
>>> sure we do not use stale mappings.
>>
>> Yes, I see. I think the rule should be as follows:
>>
>> First, when the iomap infrastructure is creating or querying file
>> mapping information, we must ensure that the mapping information
>> always passes through the extent status tree, which means
>> ext4_map_blocks(), ext4_map_query_blocks(), and
>> ext4_map_create_blocks() should cache the extent status entries that
>> we intend to use.
> 
> OK, this currently holds. There's just one snag that during fastcommit
> replay ext4_es_insert_extent() doesn't do anything. I don't think there's
> any race possible during that stage but it's another case to think about.

OK.

> 
>> Second, when updating the extent tree, we must hold the i_data_sem in
>> write mode and update the extent status tree atomically.
> 
> Fine.
> 
>> Additionally,
>> if we cannot update the extent tree while holding a single i_data_sem,
>> we should first remove all related extent status entries within the
>> specified range, then manipulate the extent tree, ensuring that the
>> extent status entries are always up-to-date if they exist (as
>> ext4_collapse_range() does).
> 
> In this case, I think we need to provide more details. In particular I
> would require that in all such cases we must:
> a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
>    provides exclusion against page faults, reads, writes

Yes.

> b) evict all page cache in the affected range -> should stop writeback -
>    *but* currently there's one case which could be problematic. Assume we
>    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>    the above and starts removing blocks, needs to restart transaction so it
>    drops i_data_sem. Writeback starts for page N+1, needs to load extent
>    block into memory, ext4_cache_extents() now loads back some extents
>    covering range 0..N into extent status tree. 

This completely confuses me. Do you mean the case below?

There are many extent entries in the page range 0..N+1, for example,

   0                                  N N+1
   |                                  |/
  [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
                |      |
                N-a    N-b

Punch hole removes the extent entries from N..0
(ext4_ext_remove_space() removes blocks from end to start), and could
drop i_data_sem just after manipulating (removing) the extent entry
[N-a, N-b]. At the same time, a concurrent writeback starts writing
back page N+1; since writeback only holds the page lock and doesn't
hold i_rwsem or the invalidate_lock, it may load the extents 0..N-a
back into the extent status tree while finding the extent that
contains N+1. So it may finally leave some stale extent status
entries after the punch hole is done?

If my understanding is correct, isn't that a problem that already
exists now, even without this patch series?

>    So the only protection
>    against using freed blocks is that nobody should be mapping anything in
>    the range 0..N because we hold those locks & have evicted page cache.
> 
> So I think we need to also document, that anybody mapping blocks needs to
> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> not be really doable but luckily only page writeback needs that so that can
> get some exemption from the assert.

In the case above, it seems that merely holding a page lock is
insufficient?

> 
>> Finally, if we want to manipulate the extent tree without caching, we
>> should also remove the extent status entries first.
> 
> Based on the above, I don't think this is really needed. We only must make
> sure that after all extent tree updates are done and before we release
> invalidate_lock, all extents from extent status tree in the modified range
> must be evicted / replaced to match reality.
> 
Yeah, I agree with you.

Thanks,
Yi.
Jan Kara Dec. 11, 2024, 4 p.m. UTC | #7
On Wed 11-12-24 15:59:51, Zhang Yi wrote:
> On 2024/12/10 20:57, Jan Kara wrote:
> > On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> > b) evict all page cache in the affected range -> should stop writeback -
> >    *but* currently there's one case which could be problematic. Assume we
> >    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
> >    the above and starts removing blocks, needs to restart transaction so it
> >    drops i_data_sem. Writeback starts for page N+1, needs to load extent
> >    block into memory, ext4_cache_extents() now loads back some extents
> >    covering range 0..N into extent status tree. 
> 
> This completely confuses me. Do you mention the case below,
> 
> There are many extent entries in the page range 0..N+1, for example,
> 
>    0                                  N N+1
>    |                                  |/
>   [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>                 |      |
>                 N-a    N-b
> 
> Punch hole is removing each extent entries from N..0
> (ext4_ext_remove_space() removes blocks from end to start), and could
> drop i_data_sem just after manipulating(removing) the extent entry
> [N-a,N-b], At the same time, a concurrent writeback start write back
> page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
> and invalidate_lock. It may load back the extents 0..N-a into the
> extent status tree again while finding extent that contains N+1?

Yes, because when we load extents from the extent tree, we insert all
extents from the leaf of the extent tree into the extent status tree.
That's what the ext4_cache_extents() call does.

> Finally it may leave some stale extent status entries after punch hole
> is done?

Yes, there may be stale extents in the extent status tree when
ext4_ext_remove_space() returns. But punch hole in particular then does:

ext4_es_insert_extent(inode, first_block, hole_len, ~0,
                                      EXTENT_STATUS_HOLE, 0);

which overwrites these stale extents with appropriate information.
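The point above — that the final ext4_es_insert_extent() of an EXTENT_STATUS_HOLE entry overwrites any stale entries writeback re-cached in the meantime — can be illustrated with a toy model (array-backed, with invented names; the real code manipulates an rbtree under i_es_lock):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the punch-hole sequence: an array stands in for the
 * per-inode extent status tree, one slot per logical block. */
#define TOY_NBLOCKS 8

enum toy_es { ES_NONE, ES_WRITTEN, ES_HOLE };

static enum toy_es es_tree[TOY_NBLOCKS];

/* Analogue of ext4_es_insert_extent(): whatever was cached for the
 * range is overwritten with the new status, stale or not. */
static void toy_es_insert(unsigned int lblk, unsigned int len,
			  enum toy_es status)
{
	for (unsigned int i = lblk; i < lblk + len && i < TOY_NBLOCKS; i++)
		es_tree[i] = status;
}

static int toy_range_is_hole(unsigned int lblk, unsigned int len)
{
	for (unsigned int i = lblk; i < lblk + len && i < TOY_NBLOCKS; i++)
		if (es_tree[i] != ES_HOLE)
			return 0;
	return 1;
}
```

Running the scenario from the thread through this model: punch partially removes the range, writeback re-caches stale written entries, and the final hole insert still leaves the whole punched range marked as a hole.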

> If my understanding is correct, isn't that a problem that exists now?
> I mean without this patch series.

Yes, the situation isn't really related to your patches. But with your
patches we are starting to rely even more on extent status tree vs extent
tree consistency. So I wanted to spell out this situation to verify that
no new problem is introduced, and so that we create rules that handle
this situation well.

> >    So the only protection
> >    against using freed blocks is that nobody should be mapping anything in
> >    the range 0..N because we hold those locks & have evicted page cache.
> > 
> > So I think we need to also document, that anybody mapping blocks needs to
> > hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> > ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> > not be really doable but luckily only page writeback needs that so that can
> > get some exemption from the assert.
> 
> In the case above, it seems that merely holding a page lock is
> insufficient?

Well, holding page lock(s) for the range you are operating on is enough to
make sure there cannot be parallel operations on that range like truncate,
punch hole or similar: those operations always remove the page cache before
starting their work, and because they hold the invalidate_lock, new pages
cannot be created while they are working.
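The assertion suggested earlier in this thread for ext4_map_blocks() might look roughly like the sketch below. This is purely illustrative: the struct, flag, and helper names are invented stand-ins for rwsem_is_locked()/lockdep checks, with writeback exempted because it only holds folio locks:

```c
#include <assert.h>

/* Toy stand-in for the locking state a real assert would query via
 * rwsem_is_locked() or lockdep annotations. */
struct toy_locks {
	int i_rwsem_held;
	int invalidate_lock_held;
};

/* Invented flag: writeback callers only hold folio locks, so they are
 * exempt from the rwsem/invalidate_lock requirement. */
#define TOY_MAP_FROM_WRITEBACK 0x1

/* Sketch of the check: anybody mapping blocks must hold i_rwsem or the
 * invalidate_lock, unless the call comes from page writeback. */
static int toy_map_blocks_allowed(const struct toy_locks *l, int flags)
{
	if (flags & TOY_MAP_FROM_WRITEBACK)
		return 1;	/* folio locks provide the exclusion */
	return l->i_rwsem_held || l->invalidate_lock_held;
}
```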

								Honza
Zhang Yi Dec. 12, 2024, 2:32 a.m. UTC | #8
On 2024/12/12 0:00, Jan Kara wrote:
> On Wed 11-12-24 15:59:51, Zhang Yi wrote:
>> On 2024/12/10 20:57, Jan Kara wrote:
>>> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>>> b) evict all page cache in the affected range -> should stop writeback -
>>>    *but* currently there's one case which could be problematic. Assume we
>>>    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>>>    the above and starts removing blocks, needs to restart transaction so it
>>>    drops i_data_sem. Writeback starts for page N+1, needs to load extent
>>>    block into memory, ext4_cache_extents() now loads back some extents
>>>    covering range 0..N into extent status tree. 
>>
>> This completely confuses me. Do you mention the case below,
>>
>> There are many extent entries in the page range 0..N+1, for example,
>>
>>    0                                  N N+1
>>    |                                  |/
>>   [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>>                 |      |
>>                 N-a    N-b
>>
>> Punch hole is removing each extent entries from N..0
>> (ext4_ext_remove_space() removes blocks from end to start), and could
>> drop i_data_sem just after manipulating(removing) the extent entry
>> [N-a,N-b], At the same time, a concurrent writeback start write back
>> page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
>> and invalidate_lock. It may load back the extents 0..N-a into the
>> extent status tree again while finding extent that contains N+1?
> 
> Yes, because when we load extents from extent tree, we insert all extents
> from the leaf of the extent tree into extent status tree. That's what
> ext4_cache_extents() call does.
> 
>> Finally it may leave some stale extent status entries after punch hole
>> is done?
> 
> Yes, there may be stale extents in the extent status tree when
> ext4_ext_remove_space() returns. But punch hole in particular then does:
> 
> ext4_es_insert_extent(inode, first_block, hole_len, ~0,
>                                       EXTENT_STATUS_HOLE, 0);
> 
> which overwrites these stale extents with appropriate information.
> 

Yes, that's correct! I missed this insert yesterday. It looks fine now,
as punch hole holds the i_rwsem and invalidate_lock, and has evicted the
page cache in this case. Thanks a lot for your detailed explanation. I
will add this documentation in my next iteration.

Thanks!
Yi.

>> If my understanding is correct, isn't that a problem that exists now?
>> I mean without this patch series.
> 
> Yes, the situation isn't really related to your patches. But with your
> patches we are starting to rely even more on extent status tree vs extent
> tree consistency. So I wanted to spell out this situation to verify new
> problem isn't introduced and so that we create rules that handle this
> situation well.
> 
>>>    So the only protection
>>>    against using freed blocks is that nobody should be mapping anything in
>>>    the range 0..N because we hold those locks & have evicted page cache.
>>>
>>> So I think we need to also document, that anybody mapping blocks needs to
>>> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
>>> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
>>> not be really doable but luckily only page writeback needs that so that can
>>> get some exemption from the assert.
>>
>> In the case above, it seems that merely holding a page lock is
>> insufficient?
> 
> Well, holding page lock(s) for the range you are operating on is enough to
> make sure there cannot be parallel operations on that range like truncate,
> punch hole or similar, because they always remove the page cache before
> starting their work and because they hold invalidate_lock, new pages cannot
> be created while they are working.
> 
> 								Honza
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6d0267afd4c1..44f6867d3037 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1123,6 +1123,7 @@  struct ext4_inode_info {
 	ext4_lblk_t i_es_shrink_lblk;	/* Offset where we start searching for
 					   extents to shrink. Protected by
 					   i_es_lock  */
+	unsigned int i_es_seq;		/* Change counter for extents */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index c786691dabd3..bea4f87db502 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -204,6 +204,13 @@  static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
 	return es->es_lblk + es->es_len - 1;
 }
 
+static inline void ext4_es_inc_seq(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
+}
+
 /*
  * search through the tree for an delayed extent with a given offset.  If
  * it can't be found, try to find next extent.
@@ -872,6 +879,7 @@  void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	BUG_ON(end < lblk);
 	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
 
+	ext4_es_inc_seq(inode);
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, pblk, status);
@@ -1519,13 +1527,15 @@  void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
 		return;
 
-	trace_ext4_es_remove_extent(inode, lblk, len);
 	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
 		 lblk, len, inode->i_ino);
 
 	if (!len)
 		return;
 
+	ext4_es_inc_seq(inode);
+	trace_ext4_es_remove_extent(inode, lblk, len);
+
 	end = lblk + len - 1;
 	BUG_ON(end < lblk);
 
@@ -2107,6 +2117,7 @@  void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
 	WARN_ON_ONCE((EXT4_B2C(sbi, lblk) == EXT4_B2C(sbi, end)) &&
 		     end_allocated);
 
+	ext4_es_inc_seq(inode);
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 16a4ce704460..a01e0bbe57c8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1409,6 +1409,7 @@  static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_es_all_nr = 0;
 	ei->i_es_shk_nr = 0;
 	ei->i_es_shrink_lblk = 0;
+	ei->i_es_seq = 0;
 	ei->i_reserved_data_blocks = 0;
 	spin_lock_init(&(ei->i_block_reservation_lock));
 	ext4_init_pending_tree(&ei->i_pending_tree);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f2bf9035216 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2176,12 +2176,13 @@  DECLARE_EVENT_CLASS(ext4__es_extent,
 	TP_ARGS(inode, es),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,		dev		)
-		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	lblk		)
-		__field(	ext4_lblk_t,	len		)
-		__field(	ext4_fsblk_t,	pblk		)
-		__field(	char, status	)
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(ext4_lblk_t,	lblk)
+		__field(ext4_lblk_t,	len)
+		__field(ext4_fsblk_t,	pblk)
+		__field(char,		status)
+		__field(unsigned int,	seq)
 	),
 
 	TP_fast_assign(
@@ -2191,13 +2192,15 @@  DECLARE_EVENT_CLASS(ext4__es_extent,
 		__entry->len	= es->es_len;
 		__entry->pblk	= ext4_es_show_pblock(es);
 		__entry->status	= ext4_es_status(es);
+		__entry->seq	= EXT4_I(inode)->i_es_seq;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s seq %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->lblk, __entry->len,
-		  __entry->pblk, show_extent_status(__entry->status))
+		  __entry->pblk, show_extent_status(__entry->status),
+		  __entry->seq)
 );
 
 DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
@@ -2218,10 +2221,11 @@  TRACE_EVENT(ext4_es_remove_extent,
 	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,	dev			)
-		__field(	ino_t,	ino			)
-		__field(	loff_t,	lblk			)
-		__field(	loff_t,	len			)
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(loff_t,		lblk)
+		__field(loff_t,		len)
+		__field(unsigned int,	seq)
 	),
 
 	TP_fast_assign(
@@ -2229,12 +2233,13 @@  TRACE_EVENT(ext4_es_remove_extent,
 		__entry->ino	= inode->i_ino;
 		__entry->lblk	= lblk;
 		__entry->len	= len;
+		__entry->seq	= EXT4_I(inode)->i_es_seq;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+	TP_printk("dev %d,%d ino %lu es [%lld/%lld) seq %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len)
+		  __entry->lblk, __entry->len, __entry->seq)
 );
 
 TRACE_EVENT(ext4_es_find_extent_range_enter,
@@ -2486,14 +2491,15 @@  TRACE_EVENT(ext4_es_insert_delayed_extent,
 	TP_ARGS(inode, es, lclu_allocated, end_allocated),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,		dev		)
-		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	lblk		)
-		__field(	ext4_lblk_t,	len		)
-		__field(	ext4_fsblk_t,	pblk		)
-		__field(	char,		status		)
-		__field(	bool,		lclu_allocated	)
-		__field(	bool,		end_allocated	)
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(ext4_lblk_t,	lblk)
+		__field(ext4_lblk_t,	len)
+		__field(ext4_fsblk_t,	pblk)
+		__field(char,		status)
+		__field(bool,		lclu_allocated)
+		__field(bool,		end_allocated)
+		__field(unsigned int,	seq)
 	),
 
 	TP_fast_assign(
@@ -2505,15 +2511,16 @@  TRACE_EVENT(ext4_es_insert_delayed_extent,
 		__entry->status		= ext4_es_status(es);
 		__entry->lclu_allocated	= lclu_allocated;
 		__entry->end_allocated	= end_allocated;
+		__entry->seq		= EXT4_I(inode)->i_es_seq;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s "
-		  "allocated %d %d",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s allocated %d %d seq %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->lblk, __entry->len,
 		  __entry->pblk, show_extent_status(__entry->status),
-		  __entry->lclu_allocated, __entry->end_allocated)
+		  __entry->lclu_allocated, __entry->end_allocated,
+		  __entry->seq)
 );
 
 /* fsmap traces */
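The counter added by this patch is only the producer side; the consumer arrives later in the series. Its expected shape is the XFS-style sample/recheck/retry loop modeled below (a user-space toy with invented names; the real consumer samples i_es_seq when building the iomap and rechecks it under the folio lock):

```c
#include <assert.h>

/* Toy model (not kernel code) of the revalidation loop the i_es_seq
 * cookie enables: sample the counter together with the mapping lookup,
 * lock the folio, and retry the lookup if the extent status tree
 * changed in between. */
struct toy_inode {
	unsigned int i_es_seq;
};

static unsigned int sample_seq(const struct toy_inode *ino)
{
	return ino->i_es_seq;
}

/* Returns how many lookup attempts were needed before the mapping stayed
 * stable; racing_updates simulates concurrent extent status changes that
 * land between the lookup and the folio lock. */
static int toy_write_begin(struct toy_inode *ino, int racing_updates)
{
	int attempts = 0;
	unsigned int cookie;

	do {
		attempts++;
		cookie = sample_seq(ino);	/* query mapping + cookie */
		if (racing_updates > 0) {	/* tree changed concurrently */
			ino->i_es_seq++;
			racing_updates--;
		}
		/* folio is now locked; recheck the cookie */
	} while (sample_seq(ino) != cookie);

	return attempts;
}
```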