Message ID | 20240802115120.362902-7-yi.zhang@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | ext4: simplify the counting and management of delalloc reserved blocks | expand |
On Fri 02-08-24 19:51:16, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > Now that we update data reserved space for delalloc after allocating > new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is > enabled, we also need to query the extents_status tree to calculate the > exact reserved clusters. This is complicated now and it appears that > it's better to do this job in ext4_es_insert_extent(), because > __es_remove_extent() have already count delalloc blocks when removing > delalloc extents and __revise_pending() return new adding pending count, > we could update the reserved blocks easily in ext4_es_insert_extent(). > > Thers is one special case needs to concern is the quota claiming, when > bigalloc is enabled, if the delayed cluster allocation has been raced > by another no-delayed allocation(e.g. from fallocate) which doesn't > cover the delayed blocks: > > |< one cluster >| > hhhhhhhhhhhhhhhhhhhdddddddddd > ^ ^ > |< >| < fallocate this range, don't claim quota again > > We can't claim quota as usual because the fallocate has already claimed > it in ext4_mb_new_blocks(), we could notice this case through the > removed delalloc blocks count. > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> ... > @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, > __free_pending(pr); > pr = NULL; > } > + pending = err3; > } > error: > write_unlock(&EXT4_I(inode)->i_es_lock); > + /* > + * Reduce the reserved cluster count to reflect successful deferred > + * allocation of delayed allocated clusters or direct allocation of > + * clusters discovered to be delayed allocated. Once allocated, a > + * cluster is not included in the reserved count. > + * > + * When bigalloc is enabled, allocating non-delayed allocated blocks > + * which belong to delayed allocated clusters (from fallocate, filemap, > + * DIO, or clusters allocated when delalloc has been disabled by > + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), > + * so release the quota reservations made for any previously delayed > + * allocated clusters. > + */ > + resv_used = rinfo.delonly_cluster + pending; > + if (resv_used) > + ext4_da_update_reserve_space(inode, resv_used, > + rinfo.delonly_block); I'm not sure I understand here. We are inserting extent into extent status tree. We are replacing resv_used clusters worth of space with delayed allocation reservation with normally allocated clusters so we need to release the reservation (mballoc already reduced freeclusters counter). That I understand. In normal case we should also claim quota because we are converting from reserved into allocated state. Now if we allocated blocks under this range (e.g. from fallocate()) without EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating blocks for this extent or not. At this point it would seem much clearer if we passed flag to ext4_es_insert_extent() whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating extent or not instead of computing delonly_block and somehow infering from that. But maybe I miss some obvious reason why that is correct. Honza
On 2024/8/8 1:41, Jan Kara wrote: > On Fri 02-08-24 19:51:16, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> Now that we update data reserved space for delalloc after allocating >> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is >> enabled, we also need to query the extents_status tree to calculate the >> exact reserved clusters. This is complicated now and it appears that >> it's better to do this job in ext4_es_insert_extent(), because >> __es_remove_extent() have already count delalloc blocks when removing >> delalloc extents and __revise_pending() return new adding pending count, >> we could update the reserved blocks easily in ext4_es_insert_extent(). >> >> Thers is one special case needs to concern is the quota claiming, when >> bigalloc is enabled, if the delayed cluster allocation has been raced >> by another no-delayed allocation(e.g. from fallocate) which doesn't >> cover the delayed blocks: >> >> |< one cluster >| >> hhhhhhhhhhhhhhhhhhhdddddddddd >> ^ ^ >> |< >| < fallocate this range, don't claim quota again >> >> We can't claim quota as usual because the fallocate has already claimed >> it in ext4_mb_new_blocks(), we could notice this case through the >> removed delalloc blocks count. >> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > ... >> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, >> __free_pending(pr); >> pr = NULL; >> } >> + pending = err3; >> } >> error: >> write_unlock(&EXT4_I(inode)->i_es_lock); >> + /* >> + * Reduce the reserved cluster count to reflect successful deferred >> + * allocation of delayed allocated clusters or direct allocation of >> + * clusters discovered to be delayed allocated. Once allocated, a >> + * cluster is not included in the reserved count. >> + * >> + * When bigalloc is enabled, allocating non-delayed allocated blocks >> + * which belong to delayed allocated clusters (from fallocate, filemap, >> + * DIO, or clusters allocated when delalloc has been disabled by >> + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), >> + * so release the quota reservations made for any previously delayed >> + * allocated clusters. >> + */ >> + resv_used = rinfo.delonly_cluster + pending; >> + if (resv_used) >> + ext4_da_update_reserve_space(inode, resv_used, >> + rinfo.delonly_block); > > I'm not sure I understand here. We are inserting extent into extent status > tree. We are replacing resv_used clusters worth of space with delayed > allocation reservation with normally allocated clusters so we need to > release the reservation (mballoc already reduced freeclusters counter). > That I understand. In normal case we should also claim quota because we are > converting from reserved into allocated state. Now if we allocated blocks > under this range (e.g. from fallocate()) without > EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here > instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is > related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating > blocks for this extent or not. > Oh, this is really complicated due to the bigalloc feature, please let me explain it more clearly by listing all related situations. There are 2 types of paths of allocating delayed/reserved cluster: 1. Normal case, normally allocate delayed clusters from the write back path. 2. Special case, allocate blocks under this delayed range, e.g. from fallocate(). There are 4 situations below: A. bigalloc is disabled. This case is simple, after path 2, we don't need to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is detected. If the flag is set, we must be replacing a delayed extent and rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal to set EXT4_GET_BLOCKS_DELALLOC_RESERVE. B. bigalloc is enabled, there a 3 sub-cases of allocating a delayed cluster: B0.Allocating a whole delayed cluster, this case is the same to A. |< one cluster >| ddddddd+ddddddd+ddddddd+ddddddd ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ allocating the whole range B1.Allocating delayed blocks in a reserved cluster, this case is the same to A, too. |< one cluster >| hhhhhhh+hhhhhhh+ddddddd+ddddddd ^^^^^^^ allocating this range B2.Allocating blocks which doesn't cover the delayed blocks in one reserved cluster, |< one cluster >| hhhhhhh+hhhhhhh+hhhhhhh+ddddddd ^^^^^^^ fallocating this range This case must from path 2, which means allocating blocks without EXT4_GET_BLOCKS_DELALLOC_RESERVE. In this case, rinfo.delonly_block must be 0 since we are not replacing any delayed extents, so rinfo.delonly_block == 0 means allocate blocks without EXT4_MAP_DELAYED detected, which further means that EXT4_GET_BLOCKS_DELALLOC_RESERVE is not set. So I think we could use rinfo.delonly_block to identify this case. As the cases listed above, I think we could use rinfo.delonly_block to determine whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE is set, so I use it to determine if we need to claim quota or release quota. > At this point it would seem much clearer if we passed flag to > ext4_es_insert_extent() whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set > when allocating extent or not instead of computing delonly_block and > somehow infering from that. But maybe I miss some obvious reason why that > is correct. > Yeah, I agree that infer from computing delonly_block is little obscure and not clear enough, passing a flag is a clearer solution, but we have to pass one more parameter to ext4_es_insert_extent() which could only be set or not set in the allocating path in ext4_map_create_blocks(), other 5 callers don't care about it (so they should always have no EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set theoretically). I have no strong feeling of which one is better, which one do you perfer after reading my explanation above? Thanks, Yi.
On Thu 08-08-24 19:18:30, Zhang Yi wrote: > On 2024/8/8 1:41, Jan Kara wrote: > > On Fri 02-08-24 19:51:16, Zhang Yi wrote: > >> From: Zhang Yi <yi.zhang@huawei.com> > >> > >> Now that we update data reserved space for delalloc after allocating > >> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is > >> enabled, we also need to query the extents_status tree to calculate the > >> exact reserved clusters. This is complicated now and it appears that > >> it's better to do this job in ext4_es_insert_extent(), because > >> __es_remove_extent() have already count delalloc blocks when removing > >> delalloc extents and __revise_pending() return new adding pending count, > >> we could update the reserved blocks easily in ext4_es_insert_extent(). > >> > >> Thers is one special case needs to concern is the quota claiming, when > >> bigalloc is enabled, if the delayed cluster allocation has been raced > >> by another no-delayed allocation(e.g. from fallocate) which doesn't > >> cover the delayed blocks: > >> > >> |< one cluster >| > >> hhhhhhhhhhhhhhhhhhhdddddddddd > >> ^ ^ > >> |< >| < fallocate this range, don't claim quota again > >> > >> We can't claim quota as usual because the fallocate has already claimed > >> it in ext4_mb_new_blocks(), we could notice this case through the > >> removed delalloc blocks count. > >> > >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > > ... > >> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, > >> __free_pending(pr); > >> pr = NULL; > >> } > >> + pending = err3; > >> } > >> error: > >> write_unlock(&EXT4_I(inode)->i_es_lock); > >> + /* > >> + * Reduce the reserved cluster count to reflect successful deferred > >> + * allocation of delayed allocated clusters or direct allocation of > >> + * clusters discovered to be delayed allocated. Once allocated, a > >> + * cluster is not included in the reserved count. > >> + * > >> + * When bigalloc is enabled, allocating non-delayed allocated blocks > >> + * which belong to delayed allocated clusters (from fallocate, filemap, > >> + * DIO, or clusters allocated when delalloc has been disabled by > >> + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), > >> + * so release the quota reservations made for any previously delayed > >> + * allocated clusters. > >> + */ > >> + resv_used = rinfo.delonly_cluster + pending; > >> + if (resv_used) > >> + ext4_da_update_reserve_space(inode, resv_used, > >> + rinfo.delonly_block); > > > > I'm not sure I understand here. We are inserting extent into extent status > > tree. We are replacing resv_used clusters worth of space with delayed > > allocation reservation with normally allocated clusters so we need to > > release the reservation (mballoc already reduced freeclusters counter). > > That I understand. In normal case we should also claim quota because we are > > converting from reserved into allocated state. Now if we allocated blocks > > under this range (e.g. from fallocate()) without > > EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here > > instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is > > related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating > > blocks for this extent or not. > > Oh, this is really complicated due to the bigalloc feature, please let me > explain it more clearly by listing all related situations. > > There are 2 types of paths of allocating delayed/reserved cluster: > 1. Normal case, normally allocate delayed clusters from the write back path. > 2. Special case, allocate blocks under this delayed range, e.g. from > fallocate(). > > There are 4 situations below: > > A. bigalloc is disabled. This case is simple, after path 2, we don't need > to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we > set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is > detected. If the flag is set, we must be replacing a delayed extent and > rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal > to set EXT4_GET_BLOCKS_DELALLOC_RESERVE. Right. So fallocate() will call ext4_map_blocks() and ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED which you (due to patch 2 of this series) transform into EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc accounting through in ext4_ext_map_blocks() but this patch moved the update to ext4_es_insert_extent(). But there is one cornercase even here AFAICT: Suppose fallocate is called for range 0..16k, we have delalloc extent at 8k..16k. In this case ext4_map_blocks() at block 0 will not find the delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc without using delalloc reservation but then ext4_es_insert_extent() will still have rinfo.delonly_block > 0 so we claim the quota reservation instead of releasing it? > B. bigalloc is enabled, there a 3 sub-cases of allocating a delayed > cluster: > B0.Allocating a whole delayed cluster, this case is the same to A. > > |< one cluster >| > ddddddd+ddddddd+ddddddd+ddddddd > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ allocating the whole range I agree. In this case there's no difference. > B1.Allocating delayed blocks in a reserved cluster, this case is the same > to A, too. > > |< one cluster >| > hhhhhhh+hhhhhhh+ddddddd+ddddddd > ^^^^^^^ > allocating this range Yes, if the allocation starts within delalloc range, we will have EXT4_GET_BLOCKS_DELALLOC_RESERVE set and ndelonly_blocks will always be > 0. > B2.Allocating blocks which doesn't cover the delayed blocks in one reserved > cluster, > > |< one cluster >| > hhhhhhh+hhhhhhh+hhhhhhh+ddddddd > ^^^^^^^ > fallocating this range > > This case must from path 2, which means allocating blocks without > EXT4_GET_BLOCKS_DELALLOC_RESERVE. In this case, rinfo.delonly_block must > be 0 since we are not replacing any delayed extents, so > rinfo.delonly_block == 0 means allocate blocks without EXT4_MAP_DELAYED > detected, which further means that EXT4_GET_BLOCKS_DELALLOC_RESERVE is > not set. So I think we could use rinfo.delonly_block to identify this > case. Well, this is similar to the non-bigalloc case I was asking about above. Why the allocated unwritten extent cannot extend past the start of delalloc extent? I didn't find anything that would disallow that... Honza
On 2024/8/9 2:36, Jan Kara wrote: > On Thu 08-08-24 19:18:30, Zhang Yi wrote: >> On 2024/8/8 1:41, Jan Kara wrote: >>> On Fri 02-08-24 19:51:16, Zhang Yi wrote: >>>> From: Zhang Yi <yi.zhang@huawei.com> >>>> >>>> Now that we update data reserved space for delalloc after allocating >>>> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is >>>> enabled, we also need to query the extents_status tree to calculate the >>>> exact reserved clusters. This is complicated now and it appears that >>>> it's better to do this job in ext4_es_insert_extent(), because >>>> __es_remove_extent() have already count delalloc blocks when removing >>>> delalloc extents and __revise_pending() return new adding pending count, >>>> we could update the reserved blocks easily in ext4_es_insert_extent(). >>>> >>>> Thers is one special case needs to concern is the quota claiming, when >>>> bigalloc is enabled, if the delayed cluster allocation has been raced >>>> by another no-delayed allocation(e.g. from fallocate) which doesn't >>>> cover the delayed blocks: >>>> >>>> |< one cluster >| >>>> hhhhhhhhhhhhhhhhhhhdddddddddd >>>> ^ ^ >>>> |< >| < fallocate this range, don't claim quota again >>>> >>>> We can't claim quota as usual because the fallocate has already claimed >>>> it in ext4_mb_new_blocks(), we could notice this case through the >>>> removed delalloc blocks count. >>>> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >>> ... >>>> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, >>>> __free_pending(pr); >>>> pr = NULL; >>>> } >>>> + pending = err3; >>>> } >>>> error: >>>> write_unlock(&EXT4_I(inode)->i_es_lock); >>>> + /* >>>> + * Reduce the reserved cluster count to reflect successful deferred >>>> + * allocation of delayed allocated clusters or direct allocation of >>>> + * clusters discovered to be delayed allocated. Once allocated, a >>>> + * cluster is not included in the reserved count. >>>> + * >>>> + * When bigalloc is enabled, allocating non-delayed allocated blocks >>>> + * which belong to delayed allocated clusters (from fallocate, filemap, >>>> + * DIO, or clusters allocated when delalloc has been disabled by >>>> + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), >>>> + * so release the quota reservations made for any previously delayed >>>> + * allocated clusters. >>>> + */ >>>> + resv_used = rinfo.delonly_cluster + pending; >>>> + if (resv_used) >>>> + ext4_da_update_reserve_space(inode, resv_used, >>>> + rinfo.delonly_block); >>> >>> I'm not sure I understand here. We are inserting extent into extent status >>> tree. We are replacing resv_used clusters worth of space with delayed >>> allocation reservation with normally allocated clusters so we need to >>> release the reservation (mballoc already reduced freeclusters counter). >>> That I understand. In normal case we should also claim quota because we are >>> converting from reserved into allocated state. Now if we allocated blocks >>> under this range (e.g. from fallocate()) without >>> EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here >>> instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is >>> related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating >>> blocks for this extent or not. >> >> Oh, this is really complicated due to the bigalloc feature, please let me >> explain it more clearly by listing all related situations. >> >> There are 2 types of paths of allocating delayed/reserved cluster: >> 1. Normal case, normally allocate delayed clusters from the write back path. >> 2. Special case, allocate blocks under this delayed range, e.g. from >> fallocate(). >> >> There are 4 situations below: >> >> A. bigalloc is disabled. This case is simple, after path 2, we don't need >> to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we >> set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is >> detected. If the flag is set, we must be replacing a delayed extent and >> rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal >> to set EXT4_GET_BLOCKS_DELALLOC_RESERVE. > > Right. So fallocate() will call ext4_map_blocks() and > ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED > which you (due to patch 2 of this series) transform into > EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc > accounting through in ext4_ext_map_blocks() but this patch moved the update > to ext4_es_insert_extent(). But there is one cornercase even here AFAICT: > > Suppose fallocate is called for range 0..16k, we have delalloc extent at > 8k..16k. In this case ext4_map_blocks() at block 0 will not find the > delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc > without using delalloc reservation but then ext4_es_insert_extent() will > still have rinfo.delonly_block > 0 so we claim the quota reservation > instead of releasing it? > After commit 6430dea07e85 ("ext4: correct the hole length returned by ext4_map_blocks()"), the fallocate range 0-16K would be divided into two rounds. When we first calling ext4_map_blocks() with 0-16K, the map range will be corrected to 0-8k by ext4_ext_determine_insert_hole() and the allocating range should not cover any delayed range. Then ext4_alloc_file_blocks() will call ext4_map_blocks() again to allocate 8K-16K in the second round, in this round, we are allocating a real delayed range. Please below graph for details, ext4_alloc_file_blocks() //0-16K ext4_map_blocks() //0-16K ext4_es_lookup_extent() //find nothing ext4_ext_map_blocks(0) ext4_ext_determine_insert_hole() //change map range to 0-8K ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under hole ext4_map_blocks() //8-16K ext4_es_lookup_extent() //find delayed extent ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under a whole delayed range, //use rinfo.delonly_block > 0 is okay Hence the allocating range can't mixed with delayed and non-delayed extent at a time, and the rinfo.delonly_block > 0 should work. >> B. bigalloc is enabled, there a 3 sub-cases of allocating a delayed >> cluster: >> B0.Allocating a whole delayed cluster, this case is the same to A. >> >> |< one cluster >| >> ddddddd+ddddddd+ddddddd+ddddddd >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ allocating the whole range > > I agree. In this case there's no difference. > > >> B1.Allocating delayed blocks in a reserved cluster, this case is the same >> to A, too. >> >> |< one cluster >| >> hhhhhhh+hhhhhhh+ddddddd+ddddddd >> ^^^^^^^ >> allocating this range > > Yes, if the allocation starts within delalloc range, we will have > EXT4_GET_BLOCKS_DELALLOC_RESERVE set and ndelonly_blocks will always be > > 0. > >> B2.Allocating blocks which doesn't cover the delayed blocks in one reserved >> cluster, >> >> |< one cluster >| >> hhhhhhh+hhhhhhh+hhhhhhh+ddddddd >> ^^^^^^^ >> fallocating this range >> >> This case must from path 2, which means allocating blocks without >> EXT4_GET_BLOCKS_DELALLOC_RESERVE. In this case, rinfo.delonly_block must >> be 0 since we are not replacing any delayed extents, so >> rinfo.delonly_block == 0 means allocate blocks without EXT4_MAP_DELAYED >> detected, which further means that EXT4_GET_BLOCKS_DELALLOC_RESERVE is >> not set. So I think we could use rinfo.delonly_block to identify this >> case. > > Well, this is similar to the non-bigalloc case I was asking about above. > Why the allocated unwritten extent cannot extend past the start of delalloc > extent? I didn't find anything that would disallow that... > The same to above, ext4_ext_determine_insert_hole() should work fine for this case. Thanks, Yi.
On Fri 09-08-24 11:35:49, Zhang Yi wrote: > On 2024/8/9 2:36, Jan Kara wrote: > > On Thu 08-08-24 19:18:30, Zhang Yi wrote: > >> On 2024/8/8 1:41, Jan Kara wrote: > >>> On Fri 02-08-24 19:51:16, Zhang Yi wrote: > >>>> From: Zhang Yi <yi.zhang@huawei.com> > >>>> > >>>> Now that we update data reserved space for delalloc after allocating > >>>> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is > >>>> enabled, we also need to query the extents_status tree to calculate the > >>>> exact reserved clusters. This is complicated now and it appears that > >>>> it's better to do this job in ext4_es_insert_extent(), because > >>>> __es_remove_extent() have already count delalloc blocks when removing > >>>> delalloc extents and __revise_pending() return new adding pending count, > >>>> we could update the reserved blocks easily in ext4_es_insert_extent(). > >>>> > >>>> Thers is one special case needs to concern is the quota claiming, when > >>>> bigalloc is enabled, if the delayed cluster allocation has been raced > >>>> by another no-delayed allocation(e.g. from fallocate) which doesn't > >>>> cover the delayed blocks: > >>>> > >>>> |< one cluster >| > >>>> hhhhhhhhhhhhhhhhhhhdddddddddd > >>>> ^ ^ > >>>> |< >| < fallocate this range, don't claim quota again > >>>> > >>>> We can't claim quota as usual because the fallocate has already claimed > >>>> it in ext4_mb_new_blocks(), we could notice this case through the > >>>> removed delalloc blocks count. > >>>> > >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > >>> ... > >>>> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, > >>>> __free_pending(pr); > >>>> pr = NULL; > >>>> } > >>>> + pending = err3; > >>>> } > >>>> error: > >>>> write_unlock(&EXT4_I(inode)->i_es_lock); > >>>> + /* > >>>> + * Reduce the reserved cluster count to reflect successful deferred > >>>> + * allocation of delayed allocated clusters or direct allocation of > >>>> + * clusters discovered to be delayed allocated. Once allocated, a > >>>> + * cluster is not included in the reserved count. > >>>> + * > >>>> + * When bigalloc is enabled, allocating non-delayed allocated blocks > >>>> + * which belong to delayed allocated clusters (from fallocate, filemap, > >>>> + * DIO, or clusters allocated when delalloc has been disabled by > >>>> + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), > >>>> + * so release the quota reservations made for any previously delayed > >>>> + * allocated clusters. > >>>> + */ > >>>> + resv_used = rinfo.delonly_cluster + pending; > >>>> + if (resv_used) > >>>> + ext4_da_update_reserve_space(inode, resv_used, > >>>> + rinfo.delonly_block); > >>> > >>> I'm not sure I understand here. We are inserting extent into extent status > >>> tree. We are replacing resv_used clusters worth of space with delayed > >>> allocation reservation with normally allocated clusters so we need to > >>> release the reservation (mballoc already reduced freeclusters counter). > >>> That I understand. In normal case we should also claim quota because we are > >>> converting from reserved into allocated state. Now if we allocated blocks > >>> under this range (e.g. from fallocate()) without > >>> EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here > >>> instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is > >>> related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating > >>> blocks for this extent or not. > >> > >> Oh, this is really complicated due to the bigalloc feature, please let me > >> explain it more clearly by listing all related situations. > >> > >> There are 2 types of paths of allocating delayed/reserved cluster: > >> 1. Normal case, normally allocate delayed clusters from the write back path. > >> 2. Special case, allocate blocks under this delayed range, e.g. from > >> fallocate(). > >> > >> There are 4 situations below: > >> > >> A. bigalloc is disabled. This case is simple, after path 2, we don't need > >> to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we > >> set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is > >> detected. If the flag is set, we must be replacing a delayed extent and > >> rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal > >> to set EXT4_GET_BLOCKS_DELALLOC_RESERVE. > > > > Right. So fallocate() will call ext4_map_blocks() and > > ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED > > which you (due to patch 2 of this series) transform into > > EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc > > accounting through in ext4_ext_map_blocks() but this patch moved the update > > to ext4_es_insert_extent(). But there is one cornercase even here AFAICT: > > > > Suppose fallocate is called for range 0..16k, we have delalloc extent at > > 8k..16k. In this case ext4_map_blocks() at block 0 will not find the > > delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc > > without using delalloc reservation but then ext4_es_insert_extent() will > > still have rinfo.delonly_block > 0 so we claim the quota reservation > > instead of releasing it? > > > > After commit 6430dea07e85 ("ext4: correct the hole length returned by > ext4_map_blocks()"), the fallocate range 0-16K would be divided into two > rounds. When we first calling ext4_map_blocks() with 0-16K, the map range > will be corrected to 0-8k by ext4_ext_determine_insert_hole() and the > allocating range should not cover any delayed range. Eww, subtle, subtle, subtle... And isn't this also racy? We drop i_data_sem in ext4_map_blocks() after we do the initial lookup. So there can be some changes to both the extent tree and extent status tree before we grab i_data_sem again for the allocation. We hold inode_lock so there can be only writeback and page faults racing with us but e.g. ext4_page_mkwrite() -> block_page_mkwrite -> ext4_da_get_block_prep() -> ext4_da_map_blocks() can add delayed extent into extent status tree in that window causing breakage, can't it? > Then > ext4_alloc_file_blocks() will call ext4_map_blocks() again to allocate > 8K-16K in the second round, in this round, we are allocating a real > delayed range. Please below graph for details, > > ext4_alloc_file_blocks() //0-16K > ext4_map_blocks() //0-16K > ext4_es_lookup_extent() //find nothing > ext4_ext_map_blocks(0) > ext4_ext_determine_insert_hole() //change map range to 0-8K > ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under hole > ext4_map_blocks() //8-16K > ext4_es_lookup_extent() //find delayed extent > ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) > //allocate blocks under a whole delayed range, > //use rinfo.delonly_block > 0 is okay > > Hence the allocating range can't mixed with delayed and non-delayed extent > at a time, and the rinfo.delonly_block > 0 should work. Besides the race above I agree. So either we need to trim mapping extent in ext4_map_blocks() after re-acquiring i_data_sem or we need to deal with unwritten extents that are partially delalloc. I'm more and more leaning towards just passing the information whether delalloc was used or not to extent status tree insertion. Because that can deal with partial extents just fine... Thanks for your patience with me :). Honza
On 2024/8/10 0:20, Jan Kara wrote: > On Fri 09-08-24 11:35:49, Zhang Yi wrote: >> On 2024/8/9 2:36, Jan Kara wrote: >>> On Thu 08-08-24 19:18:30, Zhang Yi wrote: >>>> On 2024/8/8 1:41, Jan Kara wrote: >>>>> On Fri 02-08-24 19:51:16, Zhang Yi wrote: >>>>>> From: Zhang Yi <yi.zhang@huawei.com> >>>>>> >>>>>> Now that we update data reserved space for delalloc after allocating >>>>>> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is >>>>>> enabled, we also need to query the extents_status tree to calculate the >>>>>> exact reserved clusters. This is complicated now and it appears that >>>>>> it's better to do this job in ext4_es_insert_extent(), because >>>>>> __es_remove_extent() have already count delalloc blocks when removing >>>>>> delalloc extents and __revise_pending() return new adding pending count, >>>>>> we could update the reserved blocks easily in ext4_es_insert_extent(). >>>>>> >>>>>> Thers is one special case needs to concern is the quota claiming, when >>>>>> bigalloc is enabled, if the delayed cluster allocation has been raced >>>>>> by another no-delayed allocation(e.g. from fallocate) which doesn't >>>>>> cover the delayed blocks: >>>>>> >>>>>> |< one cluster >| >>>>>> hhhhhhhhhhhhhhhhhhhdddddddddd >>>>>> ^ ^ >>>>>> |< >| < fallocate this range, don't claim quota again >>>>>> >>>>>> We can't claim quota as usual because the fallocate has already claimed >>>>>> it in ext4_mb_new_blocks(), we could notice this case through the >>>>>> removed delalloc blocks count. >>>>>> >>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >>>>> ... >>>>>> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, >>>>>> __free_pending(pr); >>>>>> pr = NULL; >>>>>> } >>>>>> + pending = err3; >>>>>> } >>>>>> error: >>>>>> write_unlock(&EXT4_I(inode)->i_es_lock); >>>>>> + /* >>>>>> + * Reduce the reserved cluster count to reflect successful deferred >>>>>> + * allocation of delayed allocated clusters or direct allocation of >>>>>> + * clusters discovered to be delayed allocated. Once allocated, a >>>>>> + * cluster is not included in the reserved count. >>>>>> + * >>>>>> + * When bigalloc is enabled, allocating non-delayed allocated blocks >>>>>> + * which belong to delayed allocated clusters (from fallocate, filemap, >>>>>> + * DIO, or clusters allocated when delalloc has been disabled by >>>>>> + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), >>>>>> + * so release the quota reservations made for any previously delayed >>>>>> + * allocated clusters. >>>>>> + */ >>>>>> + resv_used = rinfo.delonly_cluster + pending; >>>>>> + if (resv_used) >>>>>> + ext4_da_update_reserve_space(inode, resv_used, >>>>>> + rinfo.delonly_block); >>>>> >>>>> I'm not sure I understand here. We are inserting extent into extent status >>>>> tree. We are replacing resv_used clusters worth of space with delayed >>>>> allocation reservation with normally allocated clusters so we need to >>>>> release the reservation (mballoc already reduced freeclusters counter). >>>>> That I understand. In normal case we should also claim quota because we are >>>>> converting from reserved into allocated state. Now if we allocated blocks >>>>> under this range (e.g. from fallocate()) without >>>>> EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here >>>>> instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is >>>>> related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating >>>>> blocks for this extent or not. >>>> >>>> Oh, this is really complicated due to the bigalloc feature, please let me >>>> explain it more clearly by listing all related situations. >>>> >>>> There are 2 types of paths of allocating delayed/reserved cluster: >>>> 1. Normal case, normally allocate delayed clusters from the write back path. >>>> 2. Special case, allocate blocks under this delayed range, e.g. from >>>> fallocate(). >>>> >>>> There are 4 situations below: >>>> >>>> A. bigalloc is disabled. This case is simple, after path 2, we don't need >>>> to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we >>>> set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is >>>> detected. If the flag is set, we must be replacing a delayed extent and >>>> rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal >>>> to set EXT4_GET_BLOCKS_DELALLOC_RESERVE. >>> >>> Right. So fallocate() will call ext4_map_blocks() and >>> ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED >>> which you (due to patch 2 of this series) transform into >>> EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc >>> accounting through in ext4_ext_map_blocks() but this patch moved the update >>> to ext4_es_insert_extent(). But there is one cornercase even here AFAICT: >>> >>> Suppose fallocate is called for range 0..16k, we have delalloc extent at >>> 8k..16k. In this case ext4_map_blocks() at block 0 will not find the >>> delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc >>> without using delalloc reservation but then ext4_es_insert_extent() will >>> still have rinfo.delonly_block > 0 so we claim the quota reservation >>> instead of releasing it? >>> >> >> After commit 6430dea07e85 ("ext4: correct the hole length returned by >> ext4_map_blocks()"), the fallocate range 0-16K would be divided into two >> rounds. When we first calling ext4_map_blocks() with 0-16K, the map range >> will be corrected to 0-8k by ext4_ext_determine_insert_hole() and the >> allocating range should not cover any delayed range. > > Eww, subtle, subtle, subtle... And isn't this also racy? We drop i_data_sem > in ext4_map_blocks() after we do the initial lookup. So there can be some > changes to both the extent tree and extent status tree before we grab > i_data_sem again for the allocation. We hold inode_lock so there can be > only writeback and page faults racing with us but e.g. ext4_page_mkwrite() > -> block_page_mkwrite -> ext4_da_get_block_prep() -> ext4_da_map_blocks() > can add delayed extent into extent status tree in that window causing > breakage, can't it? Oh! you are totally right, I missed that current ext4_fallocate() doesn't hold invalidate_lock for the normal fallocate path, hence there's nothing could prevent this race now, thanks a lot for pointing this out. > >> Then >> ext4_alloc_file_blocks() will call ext4_map_blocks() again to allocate >> 8K-16K in the second round, in this round, we are allocating a real >> delayed range. Please below graph for details, >> >> ext4_alloc_file_blocks() //0-16K >> ext4_map_blocks() //0-16K >> ext4_es_lookup_extent() //find nothing >> ext4_ext_map_blocks(0) >> ext4_ext_determine_insert_hole() //change map range to 0-8K >> ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under hole >> ext4_map_blocks() //8-16K >> ext4_es_lookup_extent() //find delayed extent >> ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) >> //allocate blocks under a whole delayed range, >> //use rinfo.delonly_block > 0 is okay >> >> Hence the allocating range can't mixed with delayed and non-delayed extent >> at a time, and the rinfo.delonly_block > 0 should work. > > Besides the race above I agree. So either we need to trim mapping extent in > ext4_map_blocks() after re-acquiring i_data_sem Yeah, if we keep on using this solution, it looks like we have to add similar logic we've done in ext4_da_map_blocks() a few months ago into the begin of the new helper ext4_map_create_blocks(). I guess it may expensive and not worth now. if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) { map->m_len = min_t(unsigned int, map->m_len, es.es_len - (map->m_lblk - es.es_lblk)); } else retval = ext4_map_query_blocks(NULL, inode, map); ... } > or we need to deal with > unwritten extents that are partially delalloc. I'm more and more leaning > towards just passing the information whether delalloc was used or not to > extent status tree insertion. Because that can deal with partial extents > just fine... > Yeah, I agree with you, passing the information to ext4_es_init_extent() is simple and looks fine. I will change to use this solution. > Thanks for your patience with me :). > Anytime! I appreciate your review and suggestions as well. :) Thanks, Yi.
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index e067f2dd0335..a58240fdfe3f 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4356,43 +4356,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, goto out; } - /* - * Reduce the reserved cluster count to reflect successful deferred - * allocation of delayed allocated clusters or direct allocation of - * clusters discovered to be delayed allocated. Once allocated, a - * cluster is not included in the reserved count. - */ - if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) { - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) { - /* - * When allocating delayed allocated clusters, simply - * reduce the reserved cluster count and claim quota - */ - ext4_da_update_reserve_space(inode, allocated_clusters, - 1); - } else { - ext4_lblk_t lblk, len; - unsigned int n; - - /* - * When allocating non-delayed allocated clusters - * (from fallocate, filemap, DIO, or clusters - * allocated when delalloc has been disabled by - * ext4_nonda_switch), reduce the reserved cluster - * count by the number of allocated clusters that - * have previously been delayed allocated. Quota - * has been claimed by ext4_mb_new_blocks() above, - * so release the quota reservations made for any - * previously delayed allocated clusters. - */ - lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk); - len = allocated_clusters << sbi->s_cluster_bits; - n = ext4_es_delayed_clu(inode, lblk, len); - if (n > 0) - ext4_da_update_reserve_space(inode, (int) n, 0); - } - } - /* * Cache the extent and update transaction to commit on fdatasync only * when it is _not_ an unwritten extent. diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index 3107e07ffe46..2daf61cfcf58 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -858,6 +858,8 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, struct extent_status newes; ext4_lblk_t end = lblk + len - 1; int err1 = 0, err2 = 0, err3 = 0; + struct rsvd_info rinfo; + int resv_used, pending = 0; struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct extent_status *es1 = NULL; struct extent_status *es2 = NULL; @@ -896,7 +898,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, pr = __alloc_pending(true); write_lock(&EXT4_I(inode)->i_es_lock); - err1 = __es_remove_extent(inode, lblk, end, NULL, es1); + err1 = __es_remove_extent(inode, lblk, end, &rinfo, es1); if (err1 != 0) goto error; /* Free preallocated extent if it didn't get used. */ @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, __free_pending(pr); pr = NULL; } + pending = err3; } error: write_unlock(&EXT4_I(inode)->i_es_lock); + /* + * Reduce the reserved cluster count to reflect successful deferred + * allocation of delayed allocated clusters or direct allocation of + * clusters discovered to be delayed allocated. Once allocated, a + * cluster is not included in the reserved count. + * + * When bigalloc is enabled, allocating non-delayed allocated blocks + * which belong to delayed allocated clusters (from fallocate, filemap, + * DIO, or clusters allocated when delalloc has been disabled by + * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(), + * so release the quota reservations made for any previously delayed + * allocated clusters. + */ + resv_used = rinfo.delonly_cluster + pending; + if (resv_used) + ext4_da_update_reserve_space(inode, resv_used, + rinfo.delonly_block); if (err1 || err2 || err3 < 0) goto retry; diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index d8ca7f64f952..7404f0935c90 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -652,13 +652,6 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode, ext4_update_inode_fsync_trans(handle, inode, 1); count = ar.len; - /* - * Update reserved blocks/metadata blocks after successful block - * allocation which had been deferred till now. - */ - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) - ext4_da_update_reserve_space(inode, count, 1); - got_it: map->m_flags |= EXT4_MAP_MAPPED; map->m_pblk = le32_to_cpu(chain[depth-1].key);