
[v3,7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

Message ID 20230116080216.249195-8-ojaswin@linux.ibm.com (mailing list archive)
State New, archived
Series ext4: Convert inode preallocation list to an rbtree

Commit Message

Ojaswin Mujoo Jan. 16, 2023, 8:02 a.m. UTC
Currently, the kernel uses i_prealloc_list to hold all the inode
preallocations. This is known to cause degradation in performance in
workloads which perform a large number of sparse writes on a single file.
This is mainly because functions like ext4_mb_normalize_request() and
ext4_mb_use_preallocated() iterate over this complete list, resulting in
slowdowns when a large number of PAs are present.

Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
the inode preallocation list and adding logic to continually trim the
list if it grows above the threshold. However, our testing revealed that
a hardcoded value is not suitable for all kinds of workloads.

To optimize this, add an rbtree to the inode and hold the inode
preallocations in this rbtree. This will make iterating over inode PAs
faster and scale much better than a linked list. Additionally, we also
had to remove the LRU logic that was used when trimming the list
(in ext4_mb_release_context()) as it would add extra overhead to the
rbtree. Discards now happen in lowest-logical-offset-first order.

** Locking notes **

With the introduction of an rbtree to maintain inode PAs, we can't use
RCU to walk the tree for searching since that can result in partial
traversals which might miss some nodes (or entire subtrees) while
discards happen in parallel (which happens under a lock). Hence this
patch converts the ei->i_prealloc_lock spinlock to an rwlock.

Almost all the codepaths that read/modify the PA rbtrees are protected
by the higher level inode->i_data_sem (except
ext4_mb_discard_group_preallocations() and ext4_clear_inode()). IIUC, the
only place we need lock protection is when one thread is "searching" the
PA rbtree (earlier protected under rcu_read_lock()) and another is
"deleting" PAs in the ext4_mb_discard_group_preallocations() function
(which iterates over all the PAs using grp->bb_prealloc_list and deletes
them from the tree without taking any inode lock (i_data_sem)).

So, this patch converts all rcu_read_lock/unlock() paths for the inode
PA list to use read_lock(), and all places where we were using the
ei->i_prealloc_lock spinlock now use write_lock().
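
For illustration, the search-path conversion roughly looks as follows
(simplified excerpt; the exact code is in the hunks below):

	/* before */
	rcu_read_lock();
	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
		...
	}
	rcu_read_unlock();

	/* after */
	read_lock(&ei->i_prealloc_lock);
	for (iter = ei->i_prealloc_node.rb_node; iter;
	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
		...
	}
	read_unlock(&ei->i_prealloc_lock);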

Note that this makes the fast path (searching for the right PA, e.g. in
ext4_mb_use_preallocated() or ext4_mb_normalize_request()) use
read_lock() instead of rcu_read_lock/unlock(). It can therefore now
block on the slow discard path (ext4_mb_discard_group_preallocations()),
which takes the write_lock().

But this is not as bad as it looks. This is because -

1. The slow path only occurs when the normal allocation has failed and
   we can say that we are low on disk space. One can argue this scenario
   won't be very frequent.

2. ext4_mb_discard_group_preallocations() locks and unlocks the rwlock
   for every individual PA it deletes. This gives the fast path enough
   opportunity to acquire the read_lock for searching the inode PA
   list.

** Design changes around deleted PAs **

In ext4_mb_pa_adjust_overlap(), it is possible for an allocating thread
to come across a PA that is marked deleted via
ext4_mb_discard_group_preallocations() but not yet removed from the
inode PA rbtree. In such a case, we ignore any overlaps with this PA
node and simply move forward in the rbtree by comparing the logical
start of the to-be-inserted PA with that of the deleted PA node.

Although this usually doesn't cause an issue and we can move forward in
the tree, in one special case, i.e. if the range of the deleted PA lies
completely inside the normalized range, we might need to traverse both
sub-trees under such a deleted PA.

To simplify this special scenario and also as an optimization, undelete
the PA if the allocating thread determines that this PA might be needed
in the near future. This results in the following changes:

- ext4_mb_use_preallocated(): Use a deleted PA if the original request lies in it.
- ext4_mb_pa_adjust_overlap(): Undelete a PA if it is deleted but there
		is an overlap with the normalized range.
- ext4_mb_discard_group_preallocations(): Roll back the delete operation if the
		allocation path undeletes a PA before it is erased from the inode
		PA rbtree.

Since this covers the special case we discussed above, we will always
un-delete the PA when we encounter the special case and we can then
adjust for overlap and traverse the PA rbtree without any issues.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/ext4.h    |   4 +-
 fs/ext4/mballoc.c | 239 ++++++++++++++++++++++++++++++++++------------
 fs/ext4/mballoc.h |   6 +-
 fs/ext4/super.c   |   4 +-
 4 files changed, 183 insertions(+), 70 deletions(-)

Comments

Jan Kara Jan. 16, 2023, 12:23 p.m. UTC | #1
On Mon 16-01-23 13:32:15, Ojaswin Mujoo wrote:
> Currently, the kernel uses i_prealloc_list to hold all the inode
> preallocations. This is known to cause degradation in performance in
> workloads which perform large number of sparse writes on a single file.
> This is mainly because functions like ext4_mb_normalize_request() and
> ext4_mb_use_preallocated() iterate over this complete list, resulting in
> slowdowns when large number of PAs are present.
> 
> Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
> the inode preallocation list and adding logic to continually trim the
> list if it grows above the threshold, however our testing revealed that
> a hardcoded value is not suitable for all kinds of workloads.
> 
> To optimize this, add an rbtree to the inode and hold the inode
> preallocations in this rbtree. This will make iterating over inode PAs
> faster and scale much better than a linked list. Additionally, we also
> had to remove the LRU logic that was added during trimming of the list
> (in ext4_mb_release_context()) as it will add extra overhead in rbtree.
> The discards now happen in the lowest-logical-offset-first order.
> 
> ** Locking notes **
> 
> With the introduction of rbtree to maintain inode PAs, we can't use RCU
> to walk the tree for searching since it can result in partial traversals
> which might miss some nodes(or entire subtrees) while discards happen
> in parallel (which happens under a lock).  Hence this patch converts the
> ei->i_prealloc_lock spin_lock to rw_lock.
> 
> Almost all the codepaths that read/modify the PA rbtrees are protected
> by the higher level inode->i_data_sem (except
> ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the
> only place we need lock protection is when one thread is reading
> "searching" the PA rbtree (earlier protected under rcu_read_lock()) and
> another is "deleting" the PAs in ext4_mb_discard_group_preallocations()
> function (which iterates all the PAs using the grp->bb_prealloc_list and
> deletes PAs from the tree without taking any inode lock (i_data_sem)).
> 
> So, this patch converts all rcu_read_lock/unlock() paths for inode list
> PA to use read_lock() and all places where we were using
> ei->i_prealloc_lock spinlock will now be using write_lock().
> 
> Note that this makes the fast path (searching of the right PA e.g.
> ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use
> read_lock() instead of rcu_read_lock/unlock().  Ths also will now block
> due to slow discard path (ext4_mb_discard_group_preallocations()) which
> uses write_lock().
> 
> But this is not as bad as it looks. This is because -
> 
> 1. The slow path only occurs when the normal allocation failed and we
>    can say that we are low on disk space.  One can argue this scenario
>    won't be much frequent.
> 
> 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock
>    for deleting every individual PA.  This gives enough opportunity for
>    the fast path to acquire the read_lock for searching the PA inode
>    list.
> 
> ** Design changes around deleted PAs **
> 
> In ext4_mb_adjust_overlap(), it is possible for an allocating thread to
> come across a PA that is marked deleted via
> ext4_mb_discard_group_preallocations() but not yet removed from the
> inode PA rbtree. In such a case, we ignore any overlaps with this PA
> node and simply move forward in the rbtree by comparing logical start of
> to-be-inserted PA and the deleted PA node.
> 
> Although this usually doesn't cause an issue and we can move forward in
> the tree, in one speacial case, i.e if range of deleted PA lies
> completely inside the normalized range, we might require to travese both
> the sub-trees under such a deleted PA.
> 
> To simplify this special scenario and also as an optimization, undelete
> the PA If the allocating thread determines that this PA might be needed
> in the near future. This results in the following changes:
> 
> - ext4_mb_use_preallocated(): Use a deleted PA if original request lies in it.
> - ext4_mb_pa_adjust_overlap(): Undelete a PA if it is deleted but there
> 		is an overlap with the normalized range.
> - ext4_mb_discard_group_preallocations(): Rollback delete operation if
> 		allocation path undeletes a PA before it is erased from inode PA
> 		list.
> 
> Since this covers the special case we discussed above, we will always
> un-delete the PA when we encounter the special case and we can then
> adjust for overlap and traverse the PA rbtree without any issues.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

So I find this putting back of an already deleted inode PA very fragile. For
example, in the current code I suspect you've missed a case in ext4_mb_put_pa()
which can mark an inode PA as deleted (so it can then be spotted by
ext4_mb_pa_adjust_overlap() and marked as in use again) but
ext4_mb_put_pa() still goes on and destroys the PA.

So I'd really love to stay with the invariant that once PA is marked as
deleted, it can never go alive again. Since finding such deleted PA that is
overlapping our new range should be really rare, cannot we just make
ext4_mb_pa_adjust_overlap() rb_erase() this deleted PA and restart the
adjustment search? Since rb_erase() is not safe to be called twice, we'd
have to record somewhere in the PA that the erasure has already happened
(probably we could have two flags in 'deleted' field - deleted from group
list, deleted from object (lg_list/inode_node)) but that is still much more
robust...

								Honza

> ---
>  fs/ext4/ext4.h    |   4 +-
>  fs/ext4/mballoc.c | 239 ++++++++++++++++++++++++++++++++++------------
>  fs/ext4/mballoc.h |   6 +-
>  fs/ext4/super.c   |   4 +-
>  4 files changed, 183 insertions(+), 70 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 140e1eb300d1..fad5f087e4c6 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1120,8 +1120,8 @@ struct ext4_inode_info {
>  
>  	/* mballoc */
>  	atomic_t i_prealloc_active;
> -	struct list_head i_prealloc_list;
> -	spinlock_t i_prealloc_lock;
> +	struct rb_root i_prealloc_node;
> +	rwlock_t i_prealloc_lock;
>  
>  	/* extents status tree */
>  	struct ext4_es_tree i_es_tree;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 53c4efd34d1c..85598079b7ce 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3984,6 +3984,44 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
>  	mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
>  }
>  
> +static void ext4_mb_mark_pa_inuse(struct super_block *sb,
> +				    struct ext4_prealloc_space *pa)
> +{
> +	struct ext4_inode_info *ei;
> +
> +	if (pa->pa_deleted == 0) {
> +		ext4_warning(
> +			sb, "pa already inuse, type:%d, pblk:%llu, lblk:%u, len:%d\n",
> +			pa->pa_type, pa->pa_pstart, pa->pa_lstart, pa->pa_len);
> +		return;
> +	}
> +
> +	pa->pa_deleted = 0;
> +
> +	if (pa->pa_type == MB_INODE_PA) {
> +		ei = EXT4_I(pa->pa_inode);
> +		atomic_inc(&ei->i_prealloc_active);
> +	}
> +}
> +
> +/*
> + * This function returns the next element to look at during inode
> + * PA rbtree walk. We assume that we have held the inode PA rbtree lock
> + * (ei->i_prealloc_lock)
> + *
> + * new_start	The start of the range we want to compare
> + * cur_start	The existing start that we are comparing against
> + * node	The node of the rb_tree
> + */
> +static inline struct rb_node*
> +ext4_mb_pa_rb_next_iter(ext4_lblk_t new_start, ext4_lblk_t cur_start, struct rb_node *node)
> +{
> +	if (new_start < cur_start)
> +		return node->rb_left;
> +	else
> +		return node->rb_right;
> +}
> +
>  static inline void
>  ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
>  			  ext4_lblk_t start, ext4_lblk_t end)
> @@ -3992,27 +4030,31 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
>  	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
>  	struct ext4_prealloc_space *tmp_pa;
>  	ext4_lblk_t tmp_pa_start, tmp_pa_end;
> +	struct rb_node *iter;
>  
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> -		spin_lock(&tmp_pa->pa_lock);
> -		if (tmp_pa->pa_deleted == 0) {
> -			tmp_pa_start = tmp_pa->pa_lstart;
> -			tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +	read_lock(&ei->i_prealloc_lock);
> +	iter = ei->i_prealloc_node.rb_node;
> +	while (iter) {
> +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> +				  pa_node.inode_node);
> +		tmp_pa_start = tmp_pa->pa_lstart;
> +		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>  
> +		spin_lock(&tmp_pa->pa_lock);
> +		if (tmp_pa->pa_deleted == 0)
>  			BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> -		}
>  		spin_unlock(&tmp_pa->pa_lock);
> +
> +		iter = ext4_mb_pa_rb_next_iter(start, tmp_pa_start, iter);
>  	}
> -	rcu_read_unlock();
> +	read_unlock(&ei->i_prealloc_lock);
>  }
> -
>  /*
>   * Given an allocation context "ac" and a range "start", "end", check
>   * and adjust boundaries if the range overlaps with any of the existing
>   * preallocatoins stored in the corresponding inode of the allocation context.
>   *
> - *Parameters:
> + * Parameters:
>   *	ac			allocation context
>   *	start			start of the new range
>   *	end			end of the new range
> @@ -4024,6 +4066,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
>  	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
>  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>  	struct ext4_prealloc_space *tmp_pa;
> +	struct rb_node *iter;
>  	ext4_lblk_t new_start, new_end;
>  	ext4_lblk_t tmp_pa_start, tmp_pa_end;
>  
> @@ -4031,25 +4074,40 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
>  	new_end = *end;
>  
>  	/* check we don't cross already preallocated blocks */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> -		if (tmp_pa->pa_deleted)
> -			continue;
> -		spin_lock(&tmp_pa->pa_lock);
> -		if (tmp_pa->pa_deleted) {
> -			spin_unlock(&tmp_pa->pa_lock);
> -			continue;
> -		}
> +	read_lock(&ei->i_prealloc_lock);
> +	for (iter = ei->i_prealloc_node.rb_node; iter;
> +	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
> +		int is_overlap;
>  
> +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> +				  pa_node.inode_node);
>  		tmp_pa_start = tmp_pa->pa_lstart;
>  		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +		is_overlap = !(tmp_pa_start >= new_end || tmp_pa_end <= new_start);
> +
> +		spin_lock(&tmp_pa->pa_lock);
> +		if (tmp_pa->pa_deleted) {
> +			if (is_overlap) {
> +				/*
> +				 * Normalized range overlaps with range of this
> +				 * deleted PA, that means we might need it in
> +				 * near future. Since the PA is yet to be
> +				 * removed from inode PA tree and freed, lets
> +				 * just undelete it.
> +				 */
> +				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
> +			} else {
> +				spin_unlock(&tmp_pa->pa_lock);
> +				continue;
> +			}
> +		}
>  
>  		/* PA must not overlap original request */
>  		BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
>  			ac->ac_o_ex.fe_logical < tmp_pa_start));
>  
>  		/* skip PAs this normalized request doesn't overlap with */
> -		if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
> +		if (!is_overlap) {
>  			spin_unlock(&tmp_pa->pa_lock);
>  			continue;
>  		}
> @@ -4065,7 +4123,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
>  		}
>  		spin_unlock(&tmp_pa->pa_lock);
>  	}
> -	rcu_read_unlock();
> +	read_unlock(&ei->i_prealloc_lock);
>  
>  	/* XXX: extra loop to check we really don't overlap preallocations */
>  	ext4_mb_pa_assert_overlap(ac, new_start, new_end);
> @@ -4192,7 +4250,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	ext4_mb_pa_adjust_overlap(ac, &start, &end);
>  
>  	size = end - start;
> -
>  	/*
>  	 * In this function "start" and "size" are normalized for better
>  	 * alignment and length such that we could preallocate more blocks.
> @@ -4401,6 +4458,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>  	struct ext4_locality_group *lg;
>  	struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
>  	ext4_lblk_t tmp_pa_start, tmp_pa_end;
> +	struct rb_node *iter;
>  	ext4_fsblk_t goal_block;
>  
>  	/* only data can be preallocated */
> @@ -4408,14 +4466,17 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>  		return false;
>  
>  	/* first, try per-file preallocation */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> +	read_lock(&ei->i_prealloc_lock);
> +	for (iter = ei->i_prealloc_node.rb_node; iter;
> +	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical, tmp_pa_start, iter)) {
> +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
>  
>  		/* all fields in this condition don't change,
>  		 * so we can skip locking for them */
>  		tmp_pa_start = tmp_pa->pa_lstart;
>  		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>  
> +		/* original request start doesn't lie in this PA */
>  		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
>  		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
>  			continue;
> @@ -4433,17 +4494,25 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>  
>  		/* found preallocated blocks, use them */
>  		spin_lock(&tmp_pa->pa_lock);
> -		if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free) {
> +		if (tmp_pa->pa_free) {
> +			if (tmp_pa->pa_deleted == 1) {
> +				/*
> +				 * Since PA is yet to be deleted from inode PA
> +				 * rbtree, just undelete it and use it.
> +				 */
> +				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
> +			}
> +
>  			atomic_inc(&tmp_pa->pa_count);
>  			ext4_mb_use_inode_pa(ac, tmp_pa);
>  			spin_unlock(&tmp_pa->pa_lock);
>  			ac->ac_criteria = 10;
> -			rcu_read_unlock();
> +			read_unlock(&ei->i_prealloc_lock);
>  			return true;
>  		}
>  		spin_unlock(&tmp_pa->pa_lock);
>  	}
> -	rcu_read_unlock();
> +	read_unlock(&ei->i_prealloc_lock);
>  
>  	/* can we use group allocation? */
>  	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
> @@ -4596,6 +4665,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
>  {
>  	ext4_group_t grp;
>  	ext4_fsblk_t grp_blk;
> +	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
>  
>  	/* in this short window concurrent discard can set pa_deleted */
>  	spin_lock(&pa->pa_lock);
> @@ -4641,16 +4711,41 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
>  	ext4_unlock_group(sb, grp);
>  
>  	if (pa->pa_type == MB_INODE_PA) {
> -		spin_lock(pa->pa_node_lock.inode_lock);
> -		list_del_rcu(&pa->pa_node.inode_list);
> -		spin_unlock(pa->pa_node_lock.inode_lock);
> +		write_lock(pa->pa_node_lock.inode_lock);
> +		rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
> +		write_unlock(pa->pa_node_lock.inode_lock);
> +		ext4_mb_pa_free(pa);
>  	} else {
>  		spin_lock(pa->pa_node_lock.lg_lock);
>  		list_del_rcu(&pa->pa_node.lg_list);
>  		spin_unlock(pa->pa_node_lock.lg_lock);
> +		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
>  	}
> +}
>  
> -	call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> +static void ext4_mb_pa_rb_insert(struct rb_root *root, struct rb_node *new)
> +{
> +	struct rb_node **iter = &root->rb_node, *parent = NULL;
> +	struct ext4_prealloc_space *iter_pa, *new_pa;
> +	ext4_lblk_t iter_start, new_start;
> +
> +	while (*iter) {
> +		iter_pa = rb_entry(*iter, struct ext4_prealloc_space,
> +				   pa_node.inode_node);
> +		new_pa = rb_entry(new, struct ext4_prealloc_space,
> +				   pa_node.inode_node);
> +		iter_start = iter_pa->pa_lstart;
> +		new_start = new_pa->pa_lstart;
> +
> +		parent = *iter;
> +		if (new_start < iter_start)
> +			iter = &((*iter)->rb_left);
> +		else
> +			iter = &((*iter)->rb_right);
> +	}
> +
> +	rb_link_node(new, parent, iter);
> +	rb_insert_color(new, root);
>  }
>  
>  /*
> @@ -4716,7 +4811,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	pa->pa_len = ac->ac_b_ex.fe_len;
>  	pa->pa_free = pa->pa_len;
>  	spin_lock_init(&pa->pa_lock);
> -	INIT_LIST_HEAD(&pa->pa_node.inode_list);
>  	INIT_LIST_HEAD(&pa->pa_group_list);
>  	pa->pa_deleted = 0;
>  	pa->pa_type = MB_INODE_PA;
> @@ -4736,9 +4830,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  
>  	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>  
> -	spin_lock(pa->pa_node_lock.inode_lock);
> -	list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
> -	spin_unlock(pa->pa_node_lock.inode_lock);
> +	write_lock(pa->pa_node_lock.inode_lock);
> +	ext4_mb_pa_rb_insert(&ei->i_prealloc_node, &pa->pa_node.inode_node);
> +	write_unlock(pa->pa_node_lock.inode_lock);
>  	atomic_inc(&ei->i_prealloc_active);
>  }
>  
> @@ -4904,6 +4998,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  	struct ext4_prealloc_space *pa, *tmp;
>  	struct list_head list;
>  	struct ext4_buddy e4b;
> +	struct ext4_inode_info *ei;
>  	int err;
>  	int free = 0;
>  
> @@ -4967,18 +5062,45 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  			list_del_rcu(&pa->pa_node.lg_list);
>  			spin_unlock(pa->pa_node_lock.lg_lock);
>  		} else {
> -			spin_lock(pa->pa_node_lock.inode_lock);
> -			list_del_rcu(&pa->pa_node.inode_list);
> -			spin_unlock(pa->pa_node_lock.inode_lock);
> +			/*
> +			 * The allocation path might undelete a PA
> +			 * incase it expects it to be used again in near
> +			 * future. In that case, rollback the ongoing delete
> +			 * operation and don't remove the PA from inode PA
> +			 * tree.
> +			 *
> +			 * TODO: See if we need pa_lock since there might no
> +			 * path that contends with it once the rbtree writelock
> +			 * is taken.
> +			 */
> +			write_lock(pa->pa_node_lock.inode_lock);
> +			spin_lock(&pa->pa_lock);
> +			if (pa->pa_deleted == 0) {
> +				free -= pa->pa_free;
> +				list_add(&pa->pa_group_list,
> +					 &grp->bb_prealloc_list);
> +				list_del(&pa->u.pa_tmp_list);
> +
> +				spin_unlock(&pa->pa_lock);
> +				write_unlock(pa->pa_node_lock.inode_lock);
> +				continue;
> +			}
> +			spin_unlock(&pa->pa_lock);
> +
> +			ei = EXT4_I(pa->pa_inode);
> +			rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
> +			write_unlock(pa->pa_node_lock.inode_lock);
>  		}
>  
> -		if (pa->pa_type == MB_GROUP_PA)
> +		list_del(&pa->u.pa_tmp_list);
> +
> +		if (pa->pa_type == MB_GROUP_PA) {
>  			ext4_mb_release_group_pa(&e4b, pa);
> -		else
> +			call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> +		} else {
>  			ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> -
> -		list_del(&pa->u.pa_tmp_list);
> -		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> +			ext4_mb_pa_free(pa);
> +		}
>  	}
>  
>  	ext4_unlock_group(sb, group);
> @@ -5008,6 +5130,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  	ext4_group_t group = 0;
>  	struct list_head list;
>  	struct ext4_buddy e4b;
> +	struct rb_node *iter;
>  	int err;
>  
>  	if (!S_ISREG(inode->i_mode)) {
> @@ -5030,17 +5153,18 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  
>  repeat:
>  	/* first, collect all pa's in the inode */
> -	spin_lock(&ei->i_prealloc_lock);
> -	while (!list_empty(&ei->i_prealloc_list) && needed) {
> -		pa = list_entry(ei->i_prealloc_list.prev,
> -				struct ext4_prealloc_space, pa_node.inode_list);
> +	write_lock(&ei->i_prealloc_lock);
> +	for (iter = rb_first(&ei->i_prealloc_node); iter && needed; iter = rb_next(iter)) {
> +		pa = rb_entry(iter, struct ext4_prealloc_space,
> +				pa_node.inode_node);
>  		BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
> +
>  		spin_lock(&pa->pa_lock);
>  		if (atomic_read(&pa->pa_count)) {
>  			/* this shouldn't happen often - nobody should
>  			 * use preallocation while we're discarding it */
>  			spin_unlock(&pa->pa_lock);
> -			spin_unlock(&ei->i_prealloc_lock);
> +			write_unlock(&ei->i_prealloc_lock);
>  			ext4_msg(sb, KERN_ERR,
>  				 "uh-oh! used pa while discarding");
>  			WARN_ON(1);
> @@ -5051,7 +5175,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  		if (pa->pa_deleted == 0) {
>  			ext4_mb_mark_pa_deleted(sb, pa);
>  			spin_unlock(&pa->pa_lock);
> -			list_del_rcu(&pa->pa_node.inode_list);
> +			rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
>  			list_add(&pa->u.pa_tmp_list, &list);
>  			needed--;
>  			continue;
> @@ -5059,7 +5183,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  
>  		/* someone is deleting pa right now */
>  		spin_unlock(&pa->pa_lock);
> -		spin_unlock(&ei->i_prealloc_lock);
> +		write_unlock(&ei->i_prealloc_lock);
>  
>  		/* we have to wait here because pa_deleted
>  		 * doesn't mean pa is already unlinked from
> @@ -5076,7 +5200,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  		schedule_timeout_uninterruptible(HZ);
>  		goto repeat;
>  	}
> -	spin_unlock(&ei->i_prealloc_lock);
> +	write_unlock(&ei->i_prealloc_lock);
>  
>  	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
>  		BUG_ON(pa->pa_type != MB_INODE_PA);
> @@ -5108,7 +5232,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
>  		put_bh(bitmap_bh);
>  
>  		list_del(&pa->u.pa_tmp_list);
> -		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> +		ext4_mb_pa_free(pa);
>  	}
>  }
>  
> @@ -5482,7 +5606,6 @@ static void ext4_mb_trim_inode_pa(struct inode *inode)
>  static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>  {
>  	struct inode *inode = ac->ac_inode;
> -	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>  	struct ext4_prealloc_space *pa = ac->ac_pa;
>  	if (pa) {
> @@ -5509,16 +5632,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>  			}
>  		}
>  
> -		if (pa->pa_type == MB_INODE_PA) {
> -			/*
> -			 * treat per-inode prealloc list as a lru list, then try
> -			 * to trim the least recently used PA.
> -			 */
> -			spin_lock(pa->pa_node_lock.inode_lock);
> -			list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
> -			spin_unlock(pa->pa_node_lock.inode_lock);
> -		}
> -
>  		ext4_mb_put_pa(ac, ac->ac_sb, pa);
>  	}
>  	if (ac->ac_bitmap_page)
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index 398a6688c341..f8e8ee493867 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -115,7 +115,7 @@ struct ext4_free_data {
>  
>  struct ext4_prealloc_space {
>  	union {
> -		struct list_head	inode_list; /* for inode PAs */
> +		struct rb_node	inode_node;		/* for inode PA rbtree */
>  		struct list_head	lg_list;	/* for lg PAs */
>  	} pa_node;
>  	struct list_head	pa_group_list;
> @@ -132,10 +132,10 @@ struct ext4_prealloc_space {
>  	ext4_grpblk_t		pa_free;	/* how many blocks are free */
>  	unsigned short		pa_type;	/* pa type. inode or group */
>  	union {
> -		spinlock_t		*inode_lock;	/* locks the inode list holding this PA */
> +		rwlock_t		*inode_lock;	/* locks the rbtree holding this PA */
>  		spinlock_t		*lg_lock;	/* locks the lg list holding this PA */
>  	} pa_node_lock;
> -	struct inode		*pa_inode;	/* hack, for history only */
> +	struct inode		*pa_inode;	/* used to get the inode during group discard */
>  };
>  
>  enum {
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 72ead3b56706..5fb3e401de6b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1325,9 +1325,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	inode_set_iversion(&ei->vfs_inode, 1);
>  	ei->i_flags = 0;
>  	spin_lock_init(&ei->i_raw_lock);
> -	INIT_LIST_HEAD(&ei->i_prealloc_list);
> +	ei->i_prealloc_node = RB_ROOT;
>  	atomic_set(&ei->i_prealloc_active, 0);
> -	spin_lock_init(&ei->i_prealloc_lock);
> +	rwlock_init(&ei->i_prealloc_lock);
>  	ext4_es_init_tree(&ei->i_es_tree);
>  	rwlock_init(&ei->i_es_lock);
>  	INIT_LIST_HEAD(&ei->i_es_list);
> -- 
> 2.31.1
>
Ojaswin Mujoo Jan. 17, 2023, 10:30 a.m. UTC | #2
On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > Since this covers the special case we discussed above, we will always
> > un-delete the PA when we encounter the special case and we can then
> > adjust for overlap and traverse the PA rbtree without any issues.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Hi Jan,
Thanks for the review, sharing some of my thoughts below.

> 
> So I find this putting back of already deleted inode PA very fragile. For
> example in current code I suspect you've missed a case in ext4_mb_put_pa()
> which can mark inode PA (so it can then be spotted by
> ext4_mb_pa_adjust_overlap() and marked as in use again) but
> ext4_mb_put_pa() still goes on and destroys the PA.

The 2 code paths that clash here are:

ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()

Since these are the only code paths from which these 2 functions are
called, for a given inode, access will always be serialized by the upper
level ei->i_data_sem, which is always taken when writing data blocks
using ext4_mb_new_blocks().

From my understanding of the code, I feel only
ext4_mb_discard_group_preallocations() can race against other functions
that are modifying the PA rbtree since it does not take any inode locks.

That being said, I do understand your concerns regarding the solution;
however, I'm willing to work with the community to ensure our
implementation of this undelete feature is as robust as possible. Along
with fixing the bug reported here [1], I believe that it is also a good
optimization to have, especially when the disk is nearly full and we are
seeing a lot of group discards going on.

Also, in case the deleted PA lies completely inside our new range, it is
much better to just undelete and use it rather than deleting the
existing PA and reallocating the range again. I think the advantage
would be even bigger in the ext4_mb_use_preallocated() function, where we
can just undelete and use the PA and skip the entire allocation, in case
the original range lies in a deleted PA.

> 
> So I'd really love to stay with the invariant that once PA is marked as
> deleted, it can never go alive again. Since finding such deleted PA that is
> overlapping our new range should be really rare, cannot we just make
> ext4_mb_pa_adjust_overlap() rb_erase() this deleted PA and restart the
> adjustment search? Since rb_erase() is not safe to be called twice, we'd
> have to record somewhere in the PA that the erasure has already happened
> (probably we could have two flags in 'deleted' field - deleted from group
> list, deleted from object (lg_list/inode_node)) but that is still much more
> robust...
Right, this is an approach we did consider, but we felt it would be
better to take this opportunity to make the above optimization. Would
love to hear your thoughts on this.

Thanks,
Ojaswin

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Jan. 17, 2023, 11:03 a.m. UTC | #3
On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > Since this covers the special case we discussed above, we will always
> > > un-delete the PA when we encounter the special case and we can then
> > > adjust for overlap and traverse the PA rbtree without any issues.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> 
> Hi Jan,
> Thanks for the review, sharing some of my thoughts below.
> 
> > 
> > So I find this putting back of already deleted inode PA very fragile. For
> > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > which can mark inode PA (so it can then be spotted by
> > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > ext4_mb_put_pa() still goes on and destroys the PA.
> 
> The 2 code paths that clash here are:
> 
> ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> 
> Since these are the only code paths from which these 2 functions are
> called, for a given inode, access will always be serialized by the upper
> level ei->i_data_sem, which is always taken when writing data blocks
> using ext4_mb_new_block(). 

Indeed, inode->i_data_sem prevents the race I was afraid of.
 
> From my understanding of the code, I feel only
> ext4_mb_discard_group_preallocations() can race against other functions
> that are modifying the PA rbtree since it does not take any inode locks.
> 
> That being said, I do understand your concerns regarding the solution,
> however I'm willing to work with the community to ensure our
> implementation of this undelete feature is as robust as possible. Along
> with fixing the bug reported here [1], I believe that it is also a good
> optimization to have especially when the disk is near full and we are
> seeing a lot of group discards going on. 
> 
> Also, in case the deleted PA completely lies inside our new range, it is
> much better to just undelete and use it rather than deleting the
> existing PA and reallocating the range again. I think the advantage
> would be even bigger in ext4_mb_use_preallocated() function where we can
> just undelete and use the PA and skip the entire allocation, incase original
> range lies in a deleted PA.

Thanks for the explanation. However I think you're optimizing the wrong thing.
We are running out of space (to run ext4_mb_discard_group_preallocations()
at all) and we allocate from an area covered by PA that we've just decided
to discard - if anything relies on performance of the filesystem in ENOSPC
conditions it has serious problems no matter what. Sure, we should deliver
the result (either ENOSPC or some block allocation) in a reasonable time
but the performance does not really matter much because all the scanning
and flushing is going to slow down everything a lot anyway. One additional
scan of the rbtree is really negligible in this case. So what we should
rather optimize for in this case is the code simplicity and maintainability
of this rare corner-case that will also likely get only a small amount of
testing. And in terms of code simplicity the delete & restart solution
seems to be much better (at least as far as I'm imagining it - maybe the
code will prove me wrong ;)).

								Honza
Ojaswin Mujoo Jan. 19, 2023, 6:27 a.m. UTC | #4
On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > Since this covers the special case we discussed above, we will always
> > > > un-delete the PA when we encounter the special case and we can then
> > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > 
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > 
> > Hi Jan,
> > Thanks for the review, sharing some of my thoughts below.
> > 
> > > 
> > > So I find this putting back of already deleted inode PA very fragile. For
> > > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > > which can mark inode PA (so it can then be spotted by
> > > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > > ext4_mb_put_pa() still goes on and destroys the PA.
> > 
> > The 2 code paths that clash here are:
> > 
> > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > 
> > Since these are the only code paths from which these 2 functions are
> > called, for a given inode, access will always be serialized by the upper
> > level ei->i_data_sem, which is always taken when writing data blocks
> > using ext4_mb_new_block(). 
> 
> Indeed, inode->i_data_sem prevents the race I was afraid of.
>  
> > From my understanding of the code, I feel only
> > ext4_mb_discard_group_preallocations() can race against other functions
> > that are modifying the PA rbtree since it does not take any inode locks.
> > 
> > That being said, I do understand your concerns regarding the solution,
> > however I'm willing to work with the community to ensure our
> > implementation of this undelete feature is as robust as possible. Along
> > with fixing the bug reported here [1], I believe that it is also a good
> > optimization to have especially when the disk is near full and we are
> > seeing a lot of group discards going on. 
> > 
> > Also, in case the deleted PA completely lies inside our new range, it is
> > much better to just undelete and use it rather than deleting the
> > existing PA and reallocating the range again. I think the advantage
> > would be even bigger in ext4_mb_use_preallocated() function where we can
> > just undelete and use the PA and skip the entire allocation, incase original
> > range lies in a deleted PA.
> 
> Thanks for explantion. However I think you're optimizing the wrong thing.
> We are running out of space (to run ext4_mb_discard_group_preallocations()
> at all) and we allocate from an area covered by PA that we've just decided
> to discard - if anything relies on performance of the filesystem in ENOSPC
> conditions it has serious problems no matter what. Sure, we should deliver
> the result (either ENOSPC or some block allocation) in a reasonable time
> but the performance does not really matter much because all the scanning
> and flushing is going to slow down everything a lot anyway. One additional
> scan of the rbtree is really negligible in this case. So what we should
> rather optimize for in this case is the code simplicity and maintainability
> of this rare corner-case that will also likely get only a small amount of
> testing. And in terms of code simplicity the delete & restart solution
> seems to be much better (at least as far as I'm imagining it - maybe the
> code will prove me wrong ;)).
Hi Jan,

So I did try out the 'rb_erase from ext4_mb_pa_adjust_overlap() and retry'
method, with an extra pa_removed flag, but the locking is getting pretty
messy. I'm not sure if such a design is possible with the locks we currently
have.

Basically, the issue I'm facing is that we have to drop the
read locks and acquire the write locks in
ext4_mb_pa_adjust_overlap(), which looks something like this:

				spin_unlock(&tmp_pa->pa_lock);
				read_unlock(&ei->i_prealloc_lock);

				write_lock(&ei->i_prealloc_lock);
				spin_lock(&tmp_pa->pa_lock);

We have to preserve the lock ordering and drop both the tree and PA locks to
avoid deadlocks. With this approach, the issue is that in between dropping and
reacquiring these locks, the group discard path can go ahead and free the PA
memory after calling rb_erase() on it, which can result in a use after free in
the adjust overlap path. This is because the PA is freed without any locks in
the discard path, as it assumes no other thread can hold a reference to it.
This assumption was true earlier since our allocation path never gave up the
rbtree lock, but it no longer holds with this approach. Essentially, having
two different places where a PA can be deleted brings in additional challenges
and complexity, which might make things worse from a maintainer's/reviewer's
point of view.

After brainstorming a bit, I think there might be a few alternatives here:

1. Instead of deleting the PA in the adjust overlap thread, make it sleep till
the group discard path deletes/frees it. At that point we can wake it up and
retry the allocation.

* Pros: We can be sure the PA will have been removed by the time we retry, so
we don't waste extra retries.
* Cons: Extra complexity in code.

2. Just go for a retry in adjust overlap without doing anything. In the ideal
case, by the time we start retrying the PA might already be removed. Worst
case: we keep looping again and again since the discard path has not deleted
it yet.

* Pros: Simplest approach, code remains straightforward.
* Cons: We can end up uselessly retrying if the discard path doesn't delete the PA fast enough.

3. The approach of undeleting the PA (proposed in this patchset) that we've already discussed.

Now, to be honest, I still prefer the undelete PA approach as it makes more
sense to me and I think the code is simple enough as there are not many paths
that might race. Mostly just adjust_overlap and group discard or
use_preallocated and group discard.

However, I'm still open to improving the approach you suggested to overcome
the issues discussed in this thread. For reference, I'm also attaching the
diff of the changes I made to implement the rb_erase-and-retry solution to
this mail. (The diff is on top of this patch series.)

Regards,
Ojaswin
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
commit 9c69b141d8815d6ecba409c2ac119dd4d6f1ef76
Author: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Date:   Wed Jan 18 17:14:49 2023 +0530

    ext4: Retry rbtree walk after deleting PA
    
    This commit has use after free issue and possibly other issues.
    Just an initial draft
    
    Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 273a98bcaa0d..66a9ef416d00 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4073,6 +4073,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	new_end = *end;
 
 	/* check we don't cross already preallocated blocks */
+retry:
 	read_lock(&ei->i_prealloc_lock);
 	for (iter = ei->i_prealloc_node.rb_node; iter;
 	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
@@ -4089,12 +4090,34 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 			if (is_overlap) {
 				/*
 				 * Normalized range overlaps with range of this
-				 * deleted PA, that means we might need it in
-				 * near future. Since the PA is yet to be
-				 * removed from inode PA tree and freed, lets
-				 * just undelete it.
+				 * deleted PA. It can be tricky to decide how to
+				 * descend in the tree so it's easier to just
+				 * delete the PA from rbtree and retry the
+				 * operation. Since such a case is rare and
+				 * usually only happens when FS is under high
+				 * utilization, the performance hit due to retry
+				 * should be minimal.
 				 */
-				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
+				spin_unlock(&tmp_pa->pa_lock);
+				read_unlock(&ei->i_prealloc_lock);
+				/*
+				 * !!BUG!! As we giveup the tree and PA locks,
+				 * it might be possible that the group discard
+				 * path has already removed this PA from the
+				 * tree or the pa has already been freed. This
+				 * results in use after free when acquiring
+				 * pa_lock below.
+				 */
+				write_lock(&ei->i_prealloc_lock);
+				spin_lock(&tmp_pa->pa_lock);
+				if (tmp_pa->pa_removed == 0) {
+					rb_erase(&tmp_pa->pa_node.inode_node,
+						 &ei->i_prealloc_node);
+					tmp_pa->pa_removed = 1;
+				}
+				spin_unlock(&tmp_pa->pa_lock);
+				write_unlock(&ei->i_prealloc_lock);
+				goto retry;
 			} else {
 				spin_unlock(&tmp_pa->pa_lock);
 				continue;
@@ -4493,15 +4516,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 		/* found preallocated blocks, use them */
 		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_free) {
-			if (tmp_pa->pa_deleted == 1) {
-				/*
-				 * Since PA is yet to be deleted from inode PA
-				 * rbtree, just undelete it and use it.
-				 */
-				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
-			}
-
+		if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free) {
 			atomic_inc(&tmp_pa->pa_count);
 			ext4_mb_use_inode_pa(ac, tmp_pa);
 			spin_unlock(&tmp_pa->pa_lock);
@@ -5061,33 +5076,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 			list_del_rcu(&pa->pa_node.lg_list);
 			spin_unlock(pa->pa_node_lock.lg_lock);
 		} else {
-			/*
-			 * The allocation path might undelete a PA
-			 * incase it expects it to be used again in near
-			 * future. In that case, rollback the ongoing delete
-			 * operation and don't remove the PA from inode PA
-			 * tree.
-			 *
-			 * TODO: See if we need pa_lock since there might no
-			 * path that contends with it once the rbtree writelock
-			 * is taken.
-			 */
 			write_lock(pa->pa_node_lock.inode_lock);
 			spin_lock(&pa->pa_lock);
-			if (pa->pa_deleted == 0) {
-				free -= pa->pa_free;
-				list_add(&pa->pa_group_list,
-					 &grp->bb_prealloc_list);
-				list_del(&pa->u.pa_tmp_list);
-
-				spin_unlock(&pa->pa_lock);
-				write_unlock(pa->pa_node_lock.inode_lock);
-				continue;
+			if (pa->pa_removed == 0) {
+				ei = EXT4_I(pa->pa_inode);
+				rb_erase(&pa->pa_node.inode_node,
+					 &ei->i_prealloc_node);
+				pa->pa_removed = 1;
 			}
 			spin_unlock(&pa->pa_lock);
-
-			ei = EXT4_I(pa->pa_inode);
-			rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
 			write_unlock(pa->pa_node_lock.inode_lock);
 		}
 
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 6d85ee8674a6..3ea5701215fc 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -120,7 +120,9 @@ struct ext4_prealloc_space {
 	} u;
 	spinlock_t		pa_lock;
 	atomic_t		pa_count;
-	unsigned		pa_deleted;
+	unsigned		pa_deleted;	/* Has PA been marked deleted */
+	unsigned		pa_removed;	/* Has PA been removed from its
+						   respecitve data structure */
 	ext4_fsblk_t		pa_pstart;	/* phys. block */
 	ext4_lblk_t		pa_lstart;	/* log. block */
 	ext4_grpblk_t		pa_len;		/* len of preallocated chunk */
Jan Kara Jan. 27, 2023, 2:43 p.m. UTC | #5
Hi Ojaswin!

I'm sorry for the somewhat delayed reply...

On Thu 19-01-23 11:57:25, Ojaswin Mujoo wrote:
> On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> > On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > > Since this covers the special case we discussed above, we will always
> > > > > un-delete the PA when we encounter the special case and we can then
> > > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > > 
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > > 
> > > Hi Jan,
> > > Thanks for the review, sharing some of my thoughts below.
> > > 
> > > > 
> > > > So I find this putting back of already deleted inode PA very fragile. For
> > > > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > > > which can mark inode PA (so it can then be spotted by
> > > > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > > > ext4_mb_put_pa() still goes on and destroys the PA.
> > > 
> > > The 2 code paths that clash here are:
> > > 
> > > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > > 
> > > Since these are the only code paths from which these 2 functions are
> > > called, for a given inode, access will always be serialized by the upper
> > > level ei->i_data_sem, which is always taken when writing data blocks
> > > using ext4_mb_new_block(). 
> > 
> > Indeed, inode->i_data_sem prevents the race I was afraid of.
> >  
> > > From my understanding of the code, I feel only
> > > ext4_mb_discard_group_preallocations() can race against other functions
> > > that are modifying the PA rbtree since it does not take any inode locks.
> > > 
> > > That being said, I do understand your concerns regarding the solution,
> > > however I'm willing to work with the community to ensure our
> > > implementation of this undelete feature is as robust as possible. Along
> > > with fixing the bug reported here [1], I believe that it is also a good
> > > optimization to have especially when the disk is near full and we are
> > > seeing a lot of group discards going on. 
> > > 
> > > Also, in case the deleted PA completely lies inside our new range, it is
> > > much better to just undelete and use it rather than deleting the
> > > existing PA and reallocating the range again. I think the advantage
> > > would be even bigger in ext4_mb_use_preallocated() function where we can
> > > just undelete and use the PA and skip the entire allocation, incase original
> > > range lies in a deleted PA.
> > 
> > Thanks for explantion. However I think you're optimizing the wrong thing.
> > We are running out of space (to run ext4_mb_discard_group_preallocations()
> > at all) and we allocate from an area covered by PA that we've just decided
> > to discard - if anything relies on performance of the filesystem in ENOSPC
> > conditions it has serious problems no matter what. Sure, we should deliver
> > the result (either ENOSPC or some block allocation) in a reasonable time
> > but the performance does not really matter much because all the scanning
> > and flushing is going to slow down everything a lot anyway. One additional
> > scan of the rbtree is really negligible in this case. So what we should
> > rather optimize for in this case is the code simplicity and maintainability
> > of this rare corner-case that will also likely get only a small amount of
> > testing. And in terms of code simplicity the delete & restart solution
> > seems to be much better (at least as far as I'm imagining it - maybe the
> > code will prove me wrong ;)).
> Hi Jan,
> 
> So I did try out the 'rb_erase from ext4_mb_adjust_overlap() and retry' method,
> with ane extra pa_removed flag, but the locking is getting pretty messy. I'm
> not sure if such a design is possible is the lock we currently have. 
> 
> Basically, the issue I'm facing is that we are having to drop the
> locks read locks and accquire the write locks in
> ext4_mb_adjust_overlap(), which looks something like this:
> 
> 				spin_unlock(&tmp_pa->pa_lock);
> 				read_unlock(&ei->i_prealloc_lock);
> 
> 				write_lock(&ei->i_prealloc_lock);
> 				spin_lock(&tmp_pa->pa_lock);
> 
> We have to preserve the order and drop both tree and PA locks to avoid
> deadlocks.  With this approach, the issue is that in between dropping and
> accquiring this lock, the group discard path can actually go ahead and free the
> PA memory after calling rb erase on it, which can result in use after free in
> the adjust overlap path.  This is because the PA is freed without any locks in
> discard path, as it assumes no other thread will have a reference to it. This
> assumption was true earlier since our allocation path never gave up the rbtree
> lock however it is not possible with this approach now.  Essentially, the
> concept of having two different areas where a PA can be deleted is bringing in
> additional challenges and complexity, which might make things worse from a
> maintainers/reviewers point of view.

Right, I didn't realize that. That is nasty.

> After brainstorming a bit, I think there might be a few alternatives here:
> 
> 1. Instead of deleting PA in the adjust overlap thread, make it sleep till group
> discard path goes ahead and deletes/frees it. At this point we can wake it up and retry
> allocation. 
> 
> * Pros: We can be sure that PA would have been removed at the time of retry so
> we don't waste extra retries. C
> * Cons: Extra complexity in code. 
> 
> 2. Just go for a retry in adjust overlap without doing anything. In ideal case,
> by the time we start retrying the PA might be already removed. Worse case: We
> keep looping again and again since discard path has not deleted it yet.
> 
> * Pros: Simplest approach, code remains straightforward.
> * Cons: We can end up uselessly retrying if the discard path doesn't delete the PA fast enough.

Well, I think cond_resched() + goto retry would be OK here. We could also
cycle the corresponding group lock which would wait for
ext4_mb_discard_group_preallocations() to finish but that is going to burn
the CPU even more than the cond_resched() + retry as we'll be just spinning
on the spinlock. Sleeping is IMHO not warranted as the whole
ext4_mb_discard_group_preallocations() is running under a spinlock anyway
so it should better be a very short sleep.
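
Roughly, in terms of the draft diff above, the overlap-with-a-deleted-PA
branch in ext4_mb_pa_adjust_overlap() would then just become something
like:

				spin_unlock(&tmp_pa->pa_lock);
				read_unlock(&ei->i_prealloc_lock);
				cond_resched();
				goto retry;

i.e. no write_lock() and no rb_erase() in the allocation path at all.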

Or actually I have one more possible solution: what the adjusting function
does is look up the PA before and after ac->ac_o_ex.fe_logical and
trim start & end to not overlap these PAs. So we could just look up these
two PAs (ignoring the deleted state) and then just iterate from these with
rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
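
An untested sketch of that lookup (the helper name is made up, the
structures are the ones from this patch, and pa_lock handling is left
out; the caller would hold read_lock(&ei->i_prealloc_lock)):

static void ext4_mb_pa_lookup_neighbours(struct ext4_inode_info *ei,
					 ext4_lblk_t lblk,
					 struct ext4_prealloc_space **left,
					 struct ext4_prealloc_space **right)
{
	struct rb_node *node = ei->i_prealloc_node.rb_node;
	struct rb_node *l = NULL, *r = NULL;

	/* find the closest PAs on each side of lblk, deleted or not */
	while (node) {
		struct ext4_prealloc_space *pa =
			rb_entry(node, struct ext4_prealloc_space,
				 pa_node.inode_node);

		if (lblk < pa->pa_lstart) {
			r = node;
			node = node->rb_left;
		} else {
			l = node;
			node = node->rb_right;
		}
	}

	/* then walk outwards, skipping PAs already marked deleted */
	while (l && rb_entry(l, struct ext4_prealloc_space,
			     pa_node.inode_node)->pa_deleted)
		l = rb_prev(l);
	while (r && rb_entry(r, struct ext4_prealloc_space,
			     pa_node.inode_node)->pa_deleted)
		r = rb_next(r);

	*left = l ? rb_entry(l, struct ext4_prealloc_space,
			     pa_node.inode_node) : NULL;
	*right = r ? rb_entry(r, struct ext4_prealloc_space,
			      pa_node.inode_node) : NULL;
}

ext4_mb_pa_adjust_overlap() would then only have to trim new_start and
new_end against *left and *right, without descending through deleted
nodes.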
 
> 3. The approach of undeleting the PA (proposed in this patchset) that
> we've already discussed.
> 
> Now, to be honest, I still prefer the undelete PA approach as it makes more
> sense to me and I think the code is simple enough as there are not many paths
> that might race. Mostly just adjust_overlap and group discard or
> use_preallocated and group discard.

Yeah, I'm still not too keen on this but I'm willing to reconsider if the
above approach proves to be too expensive under ENOSPC conditions...

									Honza
Ojaswin Mujoo Feb. 3, 2023, 8:36 a.m. UTC | #6
On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> 
> Well, I think cond_resched() + goto retry would be OK here. We could also
> cycle the corresponding group lock which would wait for
> ext4_mb_discard_group_preallocations() to finish but that is going to burn
> the CPU even more than the cond_resched() + retry as we'll be just spinning
> on the spinlock. Sleeping is IMHO not warranted as the whole
> ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> so it should better be a very short sleep.
> 
> Or actually I have one more possible solution: What the adjusting function
> is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> trims start & end to not overlap these PAs. So we could just lookup these
> two PAs (ignoring the deleted state) and then just iterate from these with
> rb_prev() & rb_next() until we find not-deleted ones. What do you think? 

Hey Jan, 

Just thought I'd update you: I'm trying this solution out and it looks
good, but I'm hitting a few bugs in the implementation. Will update here
once I have it working correctly.

Regards,
Ojaswin
Ojaswin Mujoo Feb. 8, 2023, 11:25 a.m. UTC | #7
On Fri, Feb 03, 2023 at 02:06:56PM +0530, Ojaswin Mujoo wrote:
> On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> > 
> > Well, I think cond_resched() + goto retry would be OK here. We could also
> > cycle the corresponding group lock which would wait for
> > ext4_mb_discard_group_preallocations() to finish but that is going to burn
> > the CPU even more than the cond_resched() + retry as we'll be just spinning
> > on the spinlock. Sleeping is IMHO not warranted as the whole
> > ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> > so it should better be a very short sleep.
> > 
> > Or actually I have one more possible solution: What the adjusting function
> > is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> > trims start & end to not overlap these PAs. So we could just lookup these
> > two PAs (ignoring the deleted state) and then just iterate from these with
> > rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
> 
> Hey Jan, 
> 
> Just thought I'd update you, I'm trying this solution out, and it looks
> good but I'm hitting a few bugs in the implementation. Will update here
> once I have it working correctly.

Alright, so after spending some time on the bugs I'm hitting, I'm
seeing some strange behavior. Basically, it seems like in scenarios
where we are not able to allocate as many blocks as the normalized goal
request, we can sometimes end up adding a PA that overlaps with existing
PAs in the inode PA list/tree. This behavior exists even before this
particular patchset. Due to the presence of such overlapping PAs, the
above logic was failing in some cases.

From my understanding of the code, this seems to be a BUG. We should not
be adding overlapping PA ranges as that causes us to preallocate
multiple blocks for the same logical offset in a file, however I would
also like to know if my understanding is incorrect and if this is an
intended behavior.

----- Analysis of the issue ------

Here's my analysis of the behavior, which I did by adding some BUG_ONs
and running generic/269 (4k bs). It happens pretty often, like once
every 5-10 runs. Testing was done on Ted's dev branch, without applying
this patch series.

1. Taking an example of a real scenario I hit: after we find the best
possible len, we enter the ext4_mb_new_inode_pa() function with the
following values for the start and end of the extents:

## format: <start>/<end>(<len>)
orig_ex:503/510(7) goal_ex:0/512(512) best_ex:0/394(394)

2. Since (best_ex len < goal_ex len) we enter the PA window adjustment
if condition here:

	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len)
		...
	}

3. Here, we calculate wins, winl and off and adjust the logical start and
end of the best found extent. The idea is to make sure that the best extent
at least covers the original request. In this example, the values are:

winl:503 wins:387 off:109

and win = min(winl, wins, off) = 109

4. We then adjust the logical start of the best ex as:

		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - EXT4_NUM_B2C(sbi, win);

which makes the new best extent as:

best_ex: 394/788(394)

As we can see, the best extent overflows outside the goal range, and
hence we no longer have any guarantee that it will not overlap with
another PA, since we only check overlaps against the goal start and end.
We then initialize the new PA with the logical start and end of the best
extent and finally add it to the inode PA list.

In my testing I was able to actually see overlapping PAs being added to
the inode list.
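
For reference, the adjustment arithmetic for the numbers above can be
reproduced with a small standalone sketch (userspace C, block units; it
mirrors the logic of the window adjustment but is not the kernel code):

#include <stdio.h>

int main(void)
{
	/* values from the example above, in blocks (4k bs) */
	unsigned int o_start = 503, o_len = 7;	/* orig_ex  */
	unsigned int g_start = 0;		/* goal_ex  */
	unsigned int b_len = 394;		/* best_ex  */

	unsigned int winl = o_start - g_start;		/* 503 */
	unsigned int wins = b_len - o_len;		/* 387 */
	unsigned int win = winl < wins ? winl : wins;	/* 387 */
	unsigned int off = o_start % b_len;		/* 109 */

	if (off && off < win)
		win = off;				/* 109 */

	/* adjusted logical start of the best extent */
	unsigned int b_start = o_start - win;		/* 394 */

	printf("best_ex: %u/%u(%u)\n", b_start, b_start + b_len, b_len);
	return 0;
}

This prints "best_ex: 394/788(394)", i.e. the extent that spills past the
goal end of 512.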

----------- END ---------------

Again, I would like to know if this is a BUG or intended. If it's a BUG,
is it okay for us to make sure the adjusted best extent length doesn't
extend beyond the goal range?

Also, another thing I noticed is that after ext4_mb_normalize_request(),
sometimes the original range can also exceed the normalized goal range,
which again was a bit surprising to me since my understanding was
that the normalized range would always encompass the original range.

Hoping to get some insights into the above points.

Regards,
ojaswin
Jan Kara Feb. 9, 2023, 10:54 a.m. UTC | #8
Hello Ojaswin!

On Wed 08-02-23 16:55:05, Ojaswin Mujoo wrote:
> On Fri, Feb 03, 2023 at 02:06:56PM +0530, Ojaswin Mujoo wrote:
> > On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> > > 
> > > Well, I think cond_resched() + goto retry would be OK here. We could also
> > > cycle the corresponding group lock which would wait for
> > > ext4_mb_discard_group_preallocations() to finish but that is going to burn
> > > the CPU even more than the cond_resched() + retry as we'll be just spinning
> > > on the spinlock. Sleeping is IMHO not warranted as the whole
> > > ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> > > so it should better be a very short sleep.
> > > 
> > > Or actually I have one more possible solution: What the adjusting function
> > > is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> > > trims start & end to not overlap these PAs. So we could just lookup these
> > > two PAs (ignoring the deleted state) and then just iterate from these with
> > > rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
> > 
> > Hey Jan, 
> > 
> > Just thought I'd update you, I'm trying this solution out, and it looks
> > good but I'm hitting a few bugs in the implementation. Will update here
> > once I have it working correctly.
> 
> Alright, so after spending some time on these bugs I'm hitting I'm
> seeing some strange behavior. Basically, it seems like in scenarios
> where we are not able to allocate as many block as the normalized goal
> request, we can sometimes end up adding a PA that overlaps with existing
> PAs in the inode PA list/tree. This behavior exists even before this
> particular patchset. Due to presence of such overlapping PAs, the above
> logic was failing in some cases.
> 
> From my understanding of the code, this seems to be a BUG. We should not
> be adding overlapping PA ranges as that causes us to preallocate
> multiple blocks for the same logical offset in a file, however I would
> also like to know if my understanding is incorrect and if this is an
> intended behavior.
> 
> ----- Analysis of the issue ------
> 
> Here's my analysis of the behavior, which I did by adding some BUG_ONs
> and running generic/269 (4k bs). It happens pretty often, like once
> every 5-10 runs. Testing was done without applying this patch series on
> the Ted's dev branch.
> 
> 1. So taking an example of a real scenario I hit. After we find the best
> len possible, we enter the ext4_mb_new_inode_pa() function with the
> following values for start and end of the extents:
> 
> ## format: <start>/<end>(<len>)
> orig_ex:503/510(7) goal_ex:0/512(512) best_ex:0/394(394)
> 
> 2. Since (best_ex len < goal_ex len) we enter the PA window adjustment
> if condition here:
> 
> 	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len)
> 		...
> 	}
> 
> 3. Here, we calc wins, winl and off and adjust logical start and end of
> the best found extent. The idea is to make sure that the best extent
> atleast covers the original request. In this example, the values are:
> 
> winl:503 wins:387 off:109
> 
> and win = min(winl, wins, off) = 109
> 
> 4. We then adjust the logical start of the best ex as:
> 
> 		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - EXT4_NUM_B2C(sbi, win);
> 
> which makes the new best extent as:
> 
> best_ex: 394/788(394)
> 
> As we can see, the best extent overflows outside the goal range, and
> hence we don't have any guarentee anymore that it will not overlap with
> another PA since we only check overlaps with the goal start and end.
> We then initialze the new PA with the logical start and end of the best
> extent and finaly add it to the inode PA list.
> 
> In my testing I was able to actually see overlapping PAs being added to
> the inode list.
> 
> ----------- END ---------------
> 
> Again, I would like to know if this is a BUG or intended. If its a BUG,
> is it okay for us to make sure the adjusted best extent length doesn't 
> extend the goal length? 

Good spotting. So I guess your understanding of mballoc is better than mine
by now :) but at least in my mental model I would also expect the resulting
preallocation to stay within the goal extent. What is causing the issue
here is this code in ext4_mb_new_inode_pa():

                offs = ac->ac_o_ex.fe_logical %
                        EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
                if (offs && offs < win)
                        win = offs;

so we apparently try to align the logical offset of the allocation to a
multiple of the allocated size but that just does not make much sense when
we found some random leftover extent with shorter-than-goal size. So what
I'd do in the shorter-than-goal preallocation case is:

1) If we can place the allocation at the end of the goal window and still
cover the original allocation request, do that.

2) Otherwise if we can place the allocation at the start of the goal
window and still cover the original allocation request, do that.

3) Otherwise place the allocation at the start of the original allocation
request.

This would seem to reasonably reduce fragmentation of preallocated space
and still keep things simple.
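
For illustration, a minimal sketch of this placement policy (hypothetical
helper, block units, assuming best_len < goal_len and that the goal window
starts at or before the original start; one possible reading where "cover
the original request" means covering at least its starting block -- not
meant as the actual kernel code):

static ext4_lblk_t place_short_pa(ext4_lblk_t goal_start,
				  ext4_lblk_t goal_end,
				  ext4_lblk_t o_start,
				  ext4_lblk_t best_len)
{
	/* 1) end of the goal window still covers the original start */
	if (goal_end - best_len <= o_start)
		return goal_end - best_len;
	/* 2) start of the goal window still covers the original start */
	if (goal_start + best_len > o_start)
		return goal_start;
	/* 3) otherwise start the preallocation at the original request */
	return o_start;
}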

> Also, another thing I noticed is that after ext4_mb_normalize_request(),
> sometimes the original range can also exceed the normalized goal range,
> which is again was a bit surprising to me since my understanding was
> that normalized range would always encompass the orignal range.

Well, isn't that because (part of) the requested original range is already
preallocated? Or what causes the goal range to be shortened?

								Honza
Ojaswin Mujoo Feb. 9, 2023, 5:54 p.m. UTC | #9
On Thu, Feb 09, 2023 at 11:54:18AM +0100, Jan Kara wrote:
> Hello Ojaswin!
> 
> On Wed 08-02-23 16:55:05, Ojaswin Mujoo wrote:
> > On Fri, Feb 03, 2023 at 02:06:56PM +0530, Ojaswin Mujoo wrote:
> > > On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> > > > 
> > > > Well, I think cond_resched() + goto retry would be OK here. We could also
> > > > cycle the corresponding group lock which would wait for
> > > > ext4_mb_discard_group_preallocations() to finish but that is going to burn
> > > > the CPU even more than the cond_resched() + retry as we'll be just spinning
> > > > on the spinlock. Sleeping is IMHO not warranted as the whole
> > > > ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> > > > so it should better be a very short sleep.
> > > > 
> > > > Or actually I have one more possible solution: What the adjusting function
> > > > is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> > > > trims start & end to not overlap these PAs. So we could just lookup these
> > > > two PAs (ignoring the deleted state) and then just iterate from these with
> > > > rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
> > > 
> > > Hey Jan, 
> > > 
> > > Just thought I'd update you, I'm trying this solution out, and it looks
> > > good but I'm hitting a few bugs in the implementation. Will update here
> > > once I have it working correctly.
> > 
> > Alright, so after spending some time on these bugs I'm hitting I'm
> > seeing some strange behavior. Basically, it seems like in scenarios
> > where we are not able to allocate as many block as the normalized goal
> > request, we can sometimes end up adding a PA that overlaps with existing
> > PAs in the inode PA list/tree. This behavior exists even before this
> > particular patchset. Due to presence of such overlapping PAs, the above
> > logic was failing in some cases.
> > 
> > From my understanding of the code, this seems to be a BUG. We should not
> > be adding overlapping PA ranges as that causes us to preallocate
> > multiple blocks for the same logical offset in a file, however I would
> > also like to know if my understanding is incorrect and if this is an
> > intended behavior.
> > 
> > ----- Analysis of the issue ------
> > 
> > Here's my analysis of the behavior, which I did by adding some BUG_ONs
> > and running generic/269 (4k bs). It happens pretty often, like once
> > every 5-10 runs. Testing was done without applying this patch series on
> > the Ted's dev branch.
> > 
> > 1. So taking an example of a real scenario I hit. After we find the best
> > len possible, we enter the ext4_mb_new_inode_pa() function with the
> > following values for start and end of the extents:
> > 
> > ## format: <start>/<end>(<len>)
> > orig_ex:503/510(7) goal_ex:0/512(512) best_ex:0/394(394)
> > 
> > 2. Since (best_ex len < goal_ex len) we enter the PA window adjustment
> > if condition here:
> > 
> > 	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len)
> > 		...
> > 	}
> > 
> > 3. Here, we calc wins, winl and off and adjust logical start and end of
> > the best found extent. The idea is to make sure that the best extent
> > atleast covers the original request. In this example, the values are:
> > 
> > winl:503 wins:387 off:109
> > 
> > and win = min(winl, wins, off) = 109
> > 
> > 4. We then adjust the logical start of the best ex as:
> > 
> > 		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - EXT4_NUM_B2C(sbi, win);
> > 
> > which makes the new best extent as:
> > 
> > best_ex: 394/788(394)
> > 
> > As we can see, the best extent overflows outside the goal range, and
> > hence we don't have any guarentee anymore that it will not overlap with
> > another PA since we only check overlaps with the goal start and end.
> > We then initialze the new PA with the logical start and end of the best
> > extent and finaly add it to the inode PA list.
> > 
> > In my testing I was able to actually see overlapping PAs being added to
> > the inode list.
> > 
> > ----------- END ---------------
> > 
> > Again, I would like to know if this is a BUG or intended. If its a BUG,
> > is it okay for us to make sure the adjusted best extent length doesn't 
> > extend the goal length? 
> 
> Good spotting. So I guess your understanding of mballoc is better than mine
> by now :) but at least in my mental model I would also expect the resulting
> preallocation to stay withing the goal extent. What is causing here the
> issue is this code in ext4_mb_new_inode_pa():
> 
>                 offs = ac->ac_o_ex.fe_logical %
>                         EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>                 if (offs && offs < win)
>                         win = offs;
> 
> so we apparently try to align the logical offset of the allocation to a
> multiple of the allocated size but that just does not make much sense when
Hi Jan!

Yep, it is indeed the offset calculation that is causing issues in this
particular example. Any idea why this was originally added?

> we found some random leftover extent with shorter-than-goal size. So what
> I'd do in the shorter-than-goal preallocation case is:
> 
> 1) If we can place the allocation at the end of goal window and still cover
> the original allocation request, do that.
> 
> 2) Otherwise if we can place the allocation at the start of the goal
> window and still cover the original allocation request, do that.
> 
> 3) Otherwise place the allocation at the start of the original allocation
> request.
> 
> This would seem to reasonably reduce fragmentation of preallocated space
> and still keep things simple.
This looks like a good approach to me and it will take care of the issue
caused by the offset calculation.

However, after commenting out the offset calculation bit in PA window
adjustment logic, I noticed that there is one more way that such an
overflow can happen, which would need to be addressed before we can
implement the above approach. Basically, this happens when we end up
with a goal len greater than the original len.

See my comments at the end for more info.

> 
> > Also, another thing I noticed is that after ext4_mb_normalize_request(),
> > sometimes the original range can also exceed the normalized goal range,
> > which is again was a bit surprising to me since my understanding was
> > that normalized range would always encompass the orignal range.
> 
> Well, isn't that because (part of) the requested original range is already
> preallocated? Or what causes the goal range to be shortened?
> 
Yes, I think that pre-existing PAs could be one of the cases.

Other than that, I'm also seeing some cases of sparse writes which can cause
ext4_mb_normalize_request() to end up with an original range that overflows
out of the goal range. For example, I observed these values just after the
if/else-if/else conditions in the function, before we check whether the range
overlaps pre-existing PAs:

orig_ex:2045/2055(len:10) normalized_range:0/2048, orig_isize:8417280

Basically, since isize is large and we are doing a sparse write, we end
up in the following if condition:

	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
				  (8<<20)>>bsbits, max, 8 * 1024)) {
		start_off = ((loff_t)ac->ac_o_ex.fe_logical >> (23 - bsbits)) << 23;
		size = 8 * 1024 * 1024;
	}

Resulting in normalized range less than original range.
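
Plugging the numbers in (4k bs, so bsbits = 12), the arithmetic is easy to
reproduce with a small standalone sketch (userspace C, just the math from
this branch; not the kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int bsbits = 12;		/* 4k block size */
	unsigned long long fe_logical = 2045;	/* original start, in blocks */

	unsigned long long start_off =
		(fe_logical >> (23 - bsbits)) << 23;	/* 0 bytes */
	unsigned long long size = 8ULL * 1024 * 1024;	/* 8 MiB */

	/* prints "normalized: 0/2048", while the original is 2045/2055 */
	printf("normalized: %llu/%llu\n",
	       start_off >> bsbits, (start_off + size) >> bsbits);
	return 0;
}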

Now, in any case, once we get such an overflow, if we enter the PA window
adjustment in the ext4_mb_new_inode_pa() function, we will again end up with
a best extent overflowing out of the goal extent, since we would try to cover
the original extent.

So yeah, it seems like there are 2 cases where we could end up with
overlapping PAs:

1. Due to the off calculation in the PA window adjustment, as we discussed.
2. Due to the original extent overflowing out of the goal extent.

I think the 3-step solution you proposed works well to counter 1 but not 2, so
we probably need some more logic on top of your solution to take care of that.
I'll think some more on how to fix this, but I think this will be a separate
patch.

Regards,
Ojaswin 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Feb. 10, 2023, 2:37 p.m. UTC | #10
On Thu 09-02-23 23:24:31, Ojaswin Mujoo wrote:
> On Thu, Feb 09, 2023 at 11:54:18AM +0100, Jan Kara wrote:
> > On Wed 08-02-23 16:55:05, Ojaswin Mujoo wrote:
> > > On Fri, Feb 03, 2023 at 02:06:56PM +0530, Ojaswin Mujoo wrote:
> > > > On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> > > > > 
> > > > > Well, I think cond_resched() + goto retry would be OK here. We could also
> > > > > cycle the corresponding group lock which would wait for
> > > > > ext4_mb_discard_group_preallocations() to finish but that is going to burn
> > > > > the CPU even more than the cond_resched() + retry as we'll be just spinning
> > > > > on the spinlock. Sleeping is IMHO not warranted as the whole
> > > > > ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> > > > > so it should better be a very short sleep.
> > > > > 
> > > > > Or actually I have one more possible solution: What the adjusting function
> > > > > is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> > > > > trims start & end to not overlap these PAs. So we could just lookup these
> > > > > two PAs (ignoring the deleted state) and then just iterate from these with
> > > > > rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
> > > > 
> > > > Hey Jan, 
> > > > 
> > > > Just thought I'd update you, I'm trying this solution out, and it looks
> > > > good but I'm hitting a few bugs in the implementation. Will update here
> > > > once I have it working correctly.
> > > 
> > > Alright, so after spending some time on these bugs I'm hitting I'm
> > > seeing some strange behavior. Basically, it seems like in scenarios
> > > where we are not able to allocate as many block as the normalized goal
> > > request, we can sometimes end up adding a PA that overlaps with existing
> > > PAs in the inode PA list/tree. This behavior exists even before this
> > > particular patchset. Due to presence of such overlapping PAs, the above
> > > logic was failing in some cases.
> > > 
> > > From my understanding of the code, this seems to be a BUG. We should not
> > > be adding overlapping PA ranges as that causes us to preallocate
> > > multiple blocks for the same logical offset in a file, however I would
> > > also like to know if my understanding is incorrect and if this is an
> > > intended behavior.
> > > 
> > > ----- Analysis of the issue ------
> > > 
> > > Here's my analysis of the behavior, which I did by adding some BUG_ONs
> > > and running generic/269 (4k bs). It happens pretty often, like once
> > > every 5-10 runs. Testing was done without applying this patch series on
> > > the Ted's dev branch.
> > > 
> > > 1. So taking an example of a real scenario I hit. After we find the best
> > > len possible, we enter the ext4_mb_new_inode_pa() function with the
> > > following values for start and end of the extents:
> > > 
> > > ## format: <start>/<end>(<len>)
> > > orig_ex:503/510(7) goal_ex:0/512(512) best_ex:0/394(394)
> > > 
> > > 2. Since (best_ex len < goal_ex len) we enter the PA window adjustment
> > > if condition here:
> > > 
> > > 	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len)
> > > 		...
> > > 	}
> > > 
> > > 3. Here, we calc wins, winl and off and adjust logical start and end of
> > > the best found extent. The idea is to make sure that the best extent
> > > atleast covers the original request. In this example, the values are:
> > > 
> > > winl:503 wins:387 off:109
> > > 
> > > and win = min(winl, wins, off) = 109
> > > 
> > > 4. We then adjust the logical start of the best ex as:
> > > 
> > > 		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - EXT4_NUM_B2C(sbi, win);
> > > 
> > > which makes the new best extent as:
> > > 
> > > best_ex: 394/788(394)
> > > 
> > > As we can see, the best extent overflows outside the goal range, and
> > > hence we don't have any guarentee anymore that it will not overlap with
> > > another PA since we only check overlaps with the goal start and end.
> > > We then initialze the new PA with the logical start and end of the best
> > > extent and finaly add it to the inode PA list.
> > > 
> > > In my testing I was able to actually see overlapping PAs being added to
> > > the inode list.
> > > 
> > > ----------- END ---------------
> > > 
> > > Again, I would like to know if this is a BUG or intended. If its a BUG,
> > > is it okay for us to make sure the adjusted best extent length doesn't 
> > > extend the goal length? 
> > 
> > Good spotting. So I guess your understanding of mballoc is better than mine
> > by now :) but at least in my mental model I would also expect the resulting
> > preallocation to stay withing the goal extent. What is causing here the
> > issue is this code in ext4_mb_new_inode_pa():
> > 
> >                 offs = ac->ac_o_ex.fe_logical %
> >                         EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> >                 if (offs && offs < win)
> >                         win = offs;
> > 
> > so we apparently try to align the logical offset of the allocation to a
> > multiple of the allocated size but that just does not make much sense when
> 
> Yep, it is indeed the offset calculation that is cauing issues in this
> particular example. Any idea why this was originally added?

So I believe mballoc tries to align everything (offsets & lengths) to powers
of two to reduce fragmentation and simplify the work for the buddy allocator.
If ac->ac_b_ex.fe_len is a power-of-two, the alignment makes sense. But
once we had to resort to higher allocator passes and just got some
random-length extent, the alignment stops making sense.
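
For example, plugging in the numbers from your report, and assuming
(hypothetically) the allocator had found a 256-block power-of-two extent
instead of the 394-block one:

	winl = 503, wins = 256 - 7 = 249, off = 503 % 256 = 247
	win  = min(503, 249, 247) = 247
	fe_logical = 503 - 247 = 256	-> a multiple of 256, i.e. aligned

whereas with the actual 394-block extent the same trimming just lands the PA
start at block 394, which is not aligned to any useful boundary.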

> > we found some random leftover extent with shorter-than-goal size. So what
> > I'd do in the shorter-than-goal preallocation case is:
> > 
> > 1) If we can place the allocation at the end of goal window and still cover
> > the original allocation request, do that.
> > 
> > 2) Otherwise if we can place the allocation at the start of the goal
> > window and still cover the original allocation request, do that.
> > 
> > 3) Otherwise place the allocation at the start of the original allocation
> > request.
> > 
> > This would seem to reasonably reduce fragmentation of preallocated space
> > and still keep things simple.
> This looks like a good approach to me and it will take care of the issue
> caused due to offset calculation.
> 
> However, after commenting out the offset calculation bit in PA window
> adjustment logic, I noticed that there is one more way that such an
> overflow can happen, which would need to be addressed before we can
> implement the above approach. Basically, this happens when we end up
> with a goal len greater than the original len.

You probably mean goal end block smaller than original end block here... At
least that's what you speak about below.

> See my comments at the end for more info.
> 
> > 
> > > Also, another thing I noticed is that after ext4_mb_normalize_request(),
> > > sometimes the original range can also exceed the normalized goal range,
> > > which is again was a bit surprising to me since my understanding was
> > > that normalized range would always encompass the orignal range.
> > 
> > Well, isn't that because (part of) the requested original range is already
> > preallocated? Or what causes the goal range to be shortened?
> > 
> Yes I think that pre existing PAs could be one of the cases.
> 
> Other than that, I'm also seeing some cases of sparse writes which can cause
> ext4_mb_normalize_request() to result in having an original range that
> overflows out of the goal range. For example, I observed these values just
> after the if else if else conditions in the function, before we check if range
> overlaps pre existing PAs:
> 
> orig_ex:2045/2055(len:10) normalized_range:0/2048, orig_isize:8417280
> 
> Basically, since isize is large and we are doing a sparse write, we end
> up in the following if condition:
> 
> 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
> 								(8<<20)>>bsbits, max, 8 * 1024)) {
> 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >> (23 - bsbits)) << 23;
> 		size = 8 * 1024 * 1024;
>  }
> 
> Resulting in normalized range less than original range.

I see.

> Now, in any case, once we get such an overflow, if we try to enter the PA
> adjustment window in ext4_mb_new_inode_pa() function, we will again end
> up with a best extent overflowing out of goal extent since we would try
> to cover the original extent. 
> 
> So yeah, seems like there are 2 cases where we could result in
> overlapping PAs:
> 
> 1. Due to off calculation in PA adjustment window, as we discussed.  2.
> Due to original extent overflowing out of goal extent.
> 
> I think the 3 step solution you proposed works well to counter 1 but not
> 2, so we probably need some more logic on top of your solution to take
> care of that.  I'll think some more on how to fix this but I think this
> will be as a separate patch.

Well, my algorithm will still result in the preallocation being within the goal
range AFAICS. In the example above we had:

Orig 2045/2055 Goal 0/2048

Suppose we found 200 blocks. So we try placing the preallocation like:

1848/2048; it covers the original starting block 2045, so we are fine and
create preallocation 1848/2048. Nothing has overflown the goal window...

								Honza
Ojaswin Mujoo Feb. 13, 2023, 5:58 p.m. UTC | #11
Hi Jan!

On Fri, Feb 10, 2023 at 03:37:53PM +0100, Jan Kara wrote:
> On Thu 09-02-23 23:24:31, Ojaswin Mujoo wrote:
> > On Thu, Feb 09, 2023 at 11:54:18AM +0100, Jan Kara wrote:
> > > On Wed 08-02-23 16:55:05, Ojaswin Mujoo wrote:
> > > > On Fri, Feb 03, 2023 at 02:06:56PM +0530, Ojaswin Mujoo wrote:
> > > > > On Fri, Jan 27, 2023 at 03:43:12PM +0100, Jan Kara wrote:
> > > > > > 
> > > > > > Well, I think cond_resched() + goto retry would be OK here. We could also
> > > > > > cycle the corresponding group lock which would wait for
> > > > > > ext4_mb_discard_group_preallocations() to finish but that is going to burn
> > > > > > the CPU even more than the cond_resched() + retry as we'll be just spinning
> > > > > > on the spinlock. Sleeping is IMHO not warranted as the whole
> > > > > > ext4_mb_discard_group_preallocations() is running under a spinlock anyway
> > > > > > so it should better be a very short sleep.
> > > > > > 
> > > > > > Or actually I have one more possible solution: What the adjusting function
> > > > > > is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
> > > > > > trims start & end to not overlap these PAs. So we could just lookup these
> > > > > > two PAs (ignoring the deleted state) and then just iterate from these with
> > > > > > rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
> > > > > 
> > > > > Hey Jan, 
> > > > > 
> > > > > Just thought I'd update you, I'm trying this solution out, and it looks
> > > > > good but I'm hitting a few bugs in the implementation. Will update here
> > > > > once I have it working correctly.
> > > > 
> > > > Alright, so after spending some time on these bugs I'm hitting I'm
> > > > seeing some strange behavior. Basically, it seems like in scenarios
> > > > where we are not able to allocate as many block as the normalized goal
> > > > request, we can sometimes end up adding a PA that overlaps with existing
> > > > PAs in the inode PA list/tree. This behavior exists even before this
> > > > particular patchset. Due to presence of such overlapping PAs, the above
> > > > logic was failing in some cases.
> > > > 
> > > > From my understanding of the code, this seems to be a BUG. We should not
> > > > be adding overlapping PA ranges as that causes us to preallocate
> > > > multiple blocks for the same logical offset in a file, however I would
> > > > also like to know if my understanding is incorrect and if this is an
> > > > intended behavior.
> > > > 
> > > > ----- Analysis of the issue ------
> > > > 
> > > > Here's my analysis of the behavior, which I did by adding some BUG_ONs
> > > > and running generic/269 (4k bs). It happens pretty often, like once
> > > > every 5-10 runs. Testing was done without applying this patch series on
> > > > the Ted's dev branch.
> > > > 
> > > > 1. So taking an example of a real scenario I hit. After we find the best
> > > > len possible, we enter the ext4_mb_new_inode_pa() function with the
> > > > following values for start and end of the extents:
> > > > 
> > > > ## format: <start>/<end>(<len>)
> > > > orig_ex:503/510(7) goal_ex:0/512(512) best_ex:0/394(394)
> > > > 
> > > > 2. Since (best_ex len < goal_ex len) we enter the PA window adjustment
> > > > if condition here:
> > > > 
> > > > 	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len)
> > > > 		...
> > > > 	}
> > > > 
> > > > 3. Here, we calc wins, winl and off and adjust logical start and end of
> > > > the best found extent. The idea is to make sure that the best extent
> > > > atleast covers the original request. In this example, the values are:
> > > > 
> > > > winl:503 wins:387 off:109
> > > > 
> > > > and win = min(winl, wins, off) = 109
> > > > 
> > > > 4. We then adjust the logical start of the best ex as:
> > > > 
> > > > 		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - EXT4_NUM_B2C(sbi, win);
> > > > 
> > > > which makes the new best extent as:
> > > > 
> > > > best_ex: 394/788(394)
> > > > 
> > > > As we can see, the best extent overflows outside the goal range, and
> > > > hence we don't have any guarentee anymore that it will not overlap with
> > > > another PA since we only check overlaps with the goal start and end.
> > > > We then initialze the new PA with the logical start and end of the best
> > > > extent and finaly add it to the inode PA list.
> > > > 
> > > > In my testing I was able to actually see overlapping PAs being added to
> > > > the inode list.
> > > > 
> > > > ----------- END ---------------
> > > > 
> > > > Again, I would like to know if this is a BUG or intended. If its a BUG,
> > > > is it okay for us to make sure the adjusted best extent length doesn't 
> > > > extend the goal length? 
> > > 
> > > Good spotting. So I guess your understanding of mballoc is better than mine
> > > by now :) but at least in my mental model I would also expect the resulting
> > > preallocation to stay withing the goal extent. What is causing here the
> > > issue is this code in ext4_mb_new_inode_pa():
> > > 
> > >                 offs = ac->ac_o_ex.fe_logical %
> > >                         EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> > >                 if (offs && offs < win)
> > >                         win = offs;
> > > 
> > > so we apparently try to align the logical offset of the allocation to a
> > > multiple of the allocated size but that just does not make much sense when
> > 
> > Yep, it is indeed the offset calculation that is cauing issues in this
> > particular example. Any idea why this was originally added?
> 
> So I belive mballoc tries to align everything (offsets & lengths) to powers
> of two to reduce fragmentation and simplify the work for the buddy allocator.
> If ac->ac_b_ex.fe_len is a power-of-two, the alignment makes sense. But
> once we had to resort to higher allocator passes and just got some
> random-length extent, the alignment stops making sense.

Got it, thanks for clarifying.
> 
> > > we found some random leftover extent with shorter-than-goal size. So what
> > > I'd do in the shorter-than-goal preallocation case is:
> > > 
> > > 1) If we can place the allocation at the end of goal window and still cover
> > > the original allocation request, do that.
> > > 
> > > 2) Otherwise if we can place the allocation at the start of the goal
> > > window and still cover the original allocation request, do that.
> > > 
> > > 3) Otherwise place the allocation at the start of the original allocation
> > > request.
> > > 
> > > This would seem to reasonably reduce fragmentation of preallocated space
> > > and still keep things simple.
> > This looks like a good approach to me and it will take care of the issue
> > caused due to offset calculation.
> > 
> > However, after commenting out the offset calculation bit in PA window
> > adjustment logic, I noticed that there is one more way that such an
> > overflow can happen, which would need to be addressed before we can
> > implement the above approach. Basically, this happens when we end up
> > with a goal len greater than the original len.
> 
> You probably mean goal end block smaller than original end block here... At
> least that's what you speak about below.

Right, my bad :)

> 
> > See my comments at the end for more info.
> > 
> > > 
> > > > Also, another thing I noticed is that after ext4_mb_normalize_request(),
> > > > sometimes the original range can also exceed the normalized goal range,
> > > > which is again was a bit surprising to me since my understanding was
> > > > that normalized range would always encompass the orignal range.
> > > 
> > > Well, isn't that because (part of) the requested original range is already
> > > preallocated? Or what causes the goal range to be shortened?
> > > 
> > Yes I think that pre existing PAs could be one of the cases.
> > 
> > Other than that, I'm also seeing some cases of sparse writes which can cause
> > ext4_mb_normalize_request() to result in having an original range that
> > overflows out of the goal range. For example, I observed these values just
> > after the if else if else conditions in the function, before we check if range
> > overlaps pre existing PAs:
> > 
> > orig_ex:2045/2055(len:10) normalized_range:0/2048, orig_isize:8417280
> > 
> > Basically, since isize is large and we are doing a sparse write, we end
> > up in the following if condition:
> > 
> > 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
> > 								(8<<20)>>bsbits, max, 8 * 1024)) {
> > 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >> (23 - bsbits)) << 23;
> > 		size = 8 * 1024 * 1024;
> >  }
> > 
> > Resulting in normalized range less than original range.
> 
> I see.
> 
> > Now, in any case, once we get such an overflow, if we try to enter the PA
> > adjustment window in ext4_mb_new_inode_pa() function, we will again end
> > up with a best extent overflowing out of goal extent since we would try
> > to cover the original extent. 
> > 
> > So yeah, seems like there are 2 cases where we could result in
> > overlapping PAs:
> > 
> > 1. Due to off calculation in PA adjustment window, as we discussed.  2.
> > Due to original extent overflowing out of goal extent.
> > 
> > I think the 3 step solution you proposed works well to counter 1 but not
> > 2, so we probably need some more logic on top of your solution to take
> > care of that.  I'll think some more on how to fix this but I think this
> > will be as a separate patch.
> 
> Well, my algorithm will still result in preallocation being within the goal
> range AFAICS. In the example above we had:
> 
> Orig 2045/2055 Goal 0/2048
> 
> Suppose we found 200 blocks. So we try placing preallocation like:
> 
> 1848/2048, it covers the original starting block 2045 so we are fine and
> create preallocation 1848/2048. Nothing has overflown the goal window...
> 
> 								Honza
Ahh okay, I thought you meant checking if we covered the complete
original range instead of just the original start. In that case we
should be good.

However, I still feel that the example where we unnecessarily end up with
a goal len smaller than the original len should not happen. Maybe in
ext4_mb_normalize_request(), instead of hardcoding the goal lengths for
different i_sizes, we can select the next power of 2 greater than our
original length, or some similar heuristic. What do you think?

Regards,
ojaswin
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Feb. 14, 2023, 8:50 a.m. UTC | #12
Hi Ojaswin!

On Mon 13-02-23 23:28:41, Ojaswin Mujoo wrote:
> On Fri, Feb 10, 2023 at 03:37:53PM +0100, Jan Kara wrote:
> > > See my comments at the end for more info.
> > > 
> > > > 
> > > > > Also, another thing I noticed is that after ext4_mb_normalize_request(),
> > > > > sometimes the original range can also exceed the normalized goal range,
> > > > > which is again was a bit surprising to me since my understanding was
> > > > > that normalized range would always encompass the orignal range.
> > > > 
> > > > Well, isn't that because (part of) the requested original range is already
> > > > preallocated? Or what causes the goal range to be shortened?
> > > > 
> > > Yes I think that pre existing PAs could be one of the cases.
> > > 
> > > Other than that, I'm also seeing some cases of sparse writes which can cause
> > > ext4_mb_normalize_request() to result in having an original range that
> > > overflows out of the goal range. For example, I observed these values just
> > > after the if else if else conditions in the function, before we check if range
> > > overlaps pre existing PAs:
> > > 
> > > orig_ex:2045/2055(len:10) normalized_range:0/2048, orig_isize:8417280
> > > 
> > > Basically, since isize is large and we are doing a sparse write, we end
> > > up in the following if condition:
> > > 
> > > 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
> > > 								(8<<20)>>bsbits, max, 8 * 1024)) {
> > > 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >> (23 - bsbits)) << 23;
> > > 		size = 8 * 1024 * 1024;
> > >  }
> > > 
> > > Resulting in normalized range less than original range.
> > 
> > I see.
> > 
> > > Now, in any case, once we get such an overflow, if we try to enter the PA
> > > adjustment window in ext4_mb_new_inode_pa() function, we will again end
> > > up with a best extent overflowing out of goal extent since we would try
> > > to cover the original extent. 
> > > 
> > > So yeah, seems like there are 2 cases where we could result in
> > > overlapping PAs:
> > > 
> > > 1. Due to off calculation in PA adjustment window, as we discussed.  2.
> > > Due to original extent overflowing out of goal extent.
> > > 
> > > I think the 3 step solution you proposed works well to counter 1 but not
> > > 2, so we probably need some more logic on top of your solution to take
> > > care of that.  I'll think some more on how to fix this but I think this
> > > will be as a separate patch.
> > 
> > Well, my algorithm will still result in preallocation being within the goal
> > range AFAICS. In the example above we had:
> > 
> > Orig 2045/2055 Goal 0/2048
> > 
> > Suppose we found 200 blocks. So we try placing preallocation like:
> > 
> > 1848/2048, it covers the original starting block 2045 so we are fine and
> > create preallocation 1848/2048. Nothing has overflown the goal window...
> 
> Ahh okay, I though you meant checking if we covered the complete
> original range instead of just the original start. In that case we
> should be good.
> 
> However, I still feel that the example where we unnecessarily end up with 
> a lesser goal len than original len should not happen. Maybe in
> ext4_mb_normalize_request(), instead of hardcording the goal lengths for
> different i_sizes, we can select the next power of 2 greater than our
> original length or some similar heuristic. What do you think?

I agree that seems suboptimal but I'd leave tweaking this heuristic for a
separate patchset. In this one I'd just do the minimum needed to make the
ranges not overlap. The current heuristic makes sense as an anti-fragmentation
measure when there's enough free space. We can tweak the goal window heuristic
of mballoc, but it would require some deeper thought and measurements of how it
affects fragmentation for various workloads, so it is not just about
changing those several lines of code...

								Honza
Andreas Dilger Feb. 16, 2023, 5:07 p.m. UTC | #13
On Feb 10, 2023, at 7:37 AM, Jan Kara <jack@suse.cz> wrote:
> So I belive mballoc tries to align everything (offsets & lengths)
> to powers of two to reduce fragmentation and simplify the work for
> the buddy allocator.  If ac->ac_b_ex.fe_len is a power-of-two, the
> alignment makes sense. But once we had to resort to higher allocator
> passes and just got some random-length extent, the alignment stops
> making sense.

In addition to optimizing for the buddy allocator, the other reason that
the allocations are aligned to power-of-two offsets is to better align
with underlying RAID stripes.  Otherwise, unaligned writes will cause
parity read-modify-write updates to multiple RAID stripes.  This alignment
can also help (though to a lesser degree) with NAND flash erase blocks.

Cheers, Andreas
Ojaswin Mujoo March 17, 2023, 12:40 p.m. UTC | #14
On Thu, Feb 16, 2023 at 10:07:39AM -0700, Andreas Dilger wrote:
> On Feb 10, 2023, at 7:37 AM, Jan Kara <jack@suse.cz> wrote:
> > So I belive mballoc tries to align everything (offsets & lengths)
> > to powers of two to reduce fragmentation and simplify the work for
> > the buddy allocator.  If ac->ac_b_ex.fe_len is a power-of-two, the
> > alignment makes sense. But once we had to resort to higher allocator
> > passes and just got some random-length extent, the alignment stops
> > making sense.
> 
> In addition to optimizing for the buddy allocator, the other reason that
> the allocations are aligned to power-of-two offsets is to better align
> with underlying RAID stripes.  Otherwise, unaligned writes will cause
> parity read-modify-write updates to multiple RAID stripes.  This alignment
> can also help (though to a lesser degree) with NAND flash erase blocks.
> 
> Cheers, Andreas
> 
Got it, thanks. So from my limited understanding of RAID, if the write
is stripe-aligned and (length % stripe == 0), then we won't need an RMW
cycle for the parity bits, and that's one of the reasons to pay attention
to alignment and length in the mballoc code.

Then I think Jan's reasoning still holds: if ac_b_ex.fe_len is already
not of a proper size, then we'll end up with an RMW write in RAID anyway,
so there's no point in paying attention to its alignment, right?

Regards,
ojaswin

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 140e1eb300d1..fad5f087e4c6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1120,8 +1120,8 @@  struct ext4_inode_info {
 
 	/* mballoc */
 	atomic_t i_prealloc_active;
-	struct list_head i_prealloc_list;
-	spinlock_t i_prealloc_lock;
+	struct rb_root i_prealloc_node;
+	rwlock_t i_prealloc_lock;
 
 	/* extents status tree */
 	struct ext4_es_tree i_es_tree;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 53c4efd34d1c..85598079b7ce 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3984,6 +3984,44 @@  static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
 	mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
 }
 
+static void ext4_mb_mark_pa_inuse(struct super_block *sb,
+				    struct ext4_prealloc_space *pa)
+{
+	struct ext4_inode_info *ei;
+
+	if (pa->pa_deleted == 0) {
+		ext4_warning(
+			sb, "pa already inuse, type:%d, pblk:%llu, lblk:%u, len:%d\n",
+			pa->pa_type, pa->pa_pstart, pa->pa_lstart, pa->pa_len);
+		return;
+	}
+
+	pa->pa_deleted = 0;
+
+	if (pa->pa_type == MB_INODE_PA) {
+		ei = EXT4_I(pa->pa_inode);
+		atomic_inc(&ei->i_prealloc_active);
+	}
+}
+
+/*
+ * This function returns the next element to look at during inode
+ * PA rbtree walk. We assume that we have held the inode PA rbtree lock
+ * (ei->i_prealloc_lock)
+ *
+ * new_start	The start of the range we want to compare
+ * cur_start	The existing start that we are comparing against
+ * node	The node of the rb_tree
+ */
+static inline struct rb_node*
+ext4_mb_pa_rb_next_iter(ext4_lblk_t new_start, ext4_lblk_t cur_start, struct rb_node *node)
+{
+	if (new_start < cur_start)
+		return node->rb_left;
+	else
+		return node->rb_right;
+}
+
 static inline void
 ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
 			  ext4_lblk_t start, ext4_lblk_t end)
@@ -3992,27 +4030,31 @@  ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_prealloc_space *tmp_pa;
 	ext4_lblk_t tmp_pa_start, tmp_pa_end;
+	struct rb_node *iter;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
-		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_deleted == 0) {
-			tmp_pa_start = tmp_pa->pa_lstart;
-			tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+	read_lock(&ei->i_prealloc_lock);
+	iter = ei->i_prealloc_node.rb_node;
+	while (iter) {
+		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+				  pa_node.inode_node);
+		tmp_pa_start = tmp_pa->pa_lstart;
+		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
 
+		spin_lock(&tmp_pa->pa_lock);
+		if (tmp_pa->pa_deleted == 0)
 			BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
-		}
 		spin_unlock(&tmp_pa->pa_lock);
+
+		iter = ext4_mb_pa_rb_next_iter(start, tmp_pa_start, iter);
 	}
-	rcu_read_unlock();
+	read_unlock(&ei->i_prealloc_lock);
 }
-
 /*
  * Given an allocation context "ac" and a range "start", "end", check
  * and adjust boundaries if the range overlaps with any of the existing
  * preallocatoins stored in the corresponding inode of the allocation context.
  *
- *Parameters:
+ * Parameters:
  *	ac			allocation context
  *	start			start of the new range
  *	end			end of the new range
@@ -4024,6 +4066,7 @@  ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_prealloc_space *tmp_pa;
+	struct rb_node *iter;
 	ext4_lblk_t new_start, new_end;
 	ext4_lblk_t tmp_pa_start, tmp_pa_end;
 
@@ -4031,25 +4074,40 @@  ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	new_end = *end;
 
 	/* check we don't cross already preallocated blocks */
-	rcu_read_lock();
-	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
-		if (tmp_pa->pa_deleted)
-			continue;
-		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_deleted) {
-			spin_unlock(&tmp_pa->pa_lock);
-			continue;
-		}
+	read_lock(&ei->i_prealloc_lock);
+	for (iter = ei->i_prealloc_node.rb_node; iter;
+	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
+		int is_overlap;
 
+		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+				  pa_node.inode_node);
 		tmp_pa_start = tmp_pa->pa_lstart;
 		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+		is_overlap = !(tmp_pa_start >= new_end || tmp_pa_end <= new_start);
+
+		spin_lock(&tmp_pa->pa_lock);
+		if (tmp_pa->pa_deleted) {
+			if (is_overlap) {
+				/*
+				 * Normalized range overlaps with range of this
+				 * deleted PA, that means we might need it in
+				 * near future. Since the PA is yet to be
+				 * removed from inode PA tree and freed, lets
+				 * just undelete it.
+				 */
+				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
+			} else {
+				spin_unlock(&tmp_pa->pa_lock);
+				continue;
+			}
+		}
 
 		/* PA must not overlap original request */
 		BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
 			ac->ac_o_ex.fe_logical < tmp_pa_start));
 
 		/* skip PAs this normalized request doesn't overlap with */
-		if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
+		if (!is_overlap) {
 			spin_unlock(&tmp_pa->pa_lock);
 			continue;
 		}
@@ -4065,7 +4123,7 @@  ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 		}
 		spin_unlock(&tmp_pa->pa_lock);
 	}
-	rcu_read_unlock();
+	read_unlock(&ei->i_prealloc_lock);
 
 	/* XXX: extra loop to check we really don't overlap preallocations */
 	ext4_mb_pa_assert_overlap(ac, new_start, new_end);
@@ -4192,7 +4250,6 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	ext4_mb_pa_adjust_overlap(ac, &start, &end);
 
 	size = end - start;
-
 	/*
 	 * In this function "start" and "size" are normalized for better
 	 * alignment and length such that we could preallocate more blocks.
@@ -4401,6 +4458,7 @@  ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	struct ext4_locality_group *lg;
 	struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
 	ext4_lblk_t tmp_pa_start, tmp_pa_end;
+	struct rb_node *iter;
 	ext4_fsblk_t goal_block;
 
 	/* only data can be preallocated */
@@ -4408,14 +4466,17 @@  ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		return false;
 
 	/* first, try per-file preallocation */
-	rcu_read_lock();
-	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
+	read_lock(&ei->i_prealloc_lock);
+	for (iter = ei->i_prealloc_node.rb_node; iter;
+	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical, tmp_pa_start, iter)) {
+		tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
 
 		/* all fields in this condition don't change,
 		 * so we can skip locking for them */
 		tmp_pa_start = tmp_pa->pa_lstart;
 		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
 
+		/* original request start doesn't lie in this PA */
 		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
 		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
 			continue;
@@ -4433,17 +4494,25 @@  ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 		/* found preallocated blocks, use them */
 		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free) {
+		if (tmp_pa->pa_free) {
+			if (tmp_pa->pa_deleted == 1) {
+				/*
+				 * Since PA is yet to be deleted from inode PA
+				 * rbtree, just undelete it and use it.
+				 */
+				ext4_mb_mark_pa_inuse(ac->ac_sb, tmp_pa);
+			}
+
 			atomic_inc(&tmp_pa->pa_count);
 			ext4_mb_use_inode_pa(ac, tmp_pa);
 			spin_unlock(&tmp_pa->pa_lock);
 			ac->ac_criteria = 10;
-			rcu_read_unlock();
+			read_unlock(&ei->i_prealloc_lock);
 			return true;
 		}
 		spin_unlock(&tmp_pa->pa_lock);
 	}
-	rcu_read_unlock();
+	read_unlock(&ei->i_prealloc_lock);
 
 	/* can we use group allocation? */
 	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
@@ -4596,6 +4665,7 @@  static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 {
 	ext4_group_t grp;
 	ext4_fsblk_t grp_blk;
+	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 
 	/* in this short window concurrent discard can set pa_deleted */
 	spin_lock(&pa->pa_lock);
@@ -4641,16 +4711,41 @@  static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 	ext4_unlock_group(sb, grp);
 
 	if (pa->pa_type == MB_INODE_PA) {
-		spin_lock(pa->pa_node_lock.inode_lock);
-		list_del_rcu(&pa->pa_node.inode_list);
-		spin_unlock(pa->pa_node_lock.inode_lock);
+		write_lock(pa->pa_node_lock.inode_lock);
+		rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+		write_unlock(pa->pa_node_lock.inode_lock);
+		ext4_mb_pa_free(pa);
 	} else {
 		spin_lock(pa->pa_node_lock.lg_lock);
 		list_del_rcu(&pa->pa_node.lg_list);
 		spin_unlock(pa->pa_node_lock.lg_lock);
+		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
 	}
+}
 
-	call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+static void ext4_mb_pa_rb_insert(struct rb_root *root, struct rb_node *new)
+{
+	struct rb_node **iter = &root->rb_node, *parent = NULL;
+	struct ext4_prealloc_space *iter_pa, *new_pa;
+	ext4_lblk_t iter_start, new_start;
+
+	while (*iter) {
+		iter_pa = rb_entry(*iter, struct ext4_prealloc_space,
+				   pa_node.inode_node);
+		new_pa = rb_entry(new, struct ext4_prealloc_space,
+				   pa_node.inode_node);
+		iter_start = iter_pa->pa_lstart;
+		new_start = new_pa->pa_lstart;
+
+		parent = *iter;
+		if (new_start < iter_start)
+			iter = &((*iter)->rb_left);
+		else
+			iter = &((*iter)->rb_right);
+	}
+
+	rb_link_node(new, parent, iter);
+	rb_insert_color(new, root);
 }
 
 /*
@@ -4716,7 +4811,6 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_len = ac->ac_b_ex.fe_len;
 	pa->pa_free = pa->pa_len;
 	spin_lock_init(&pa->pa_lock);
-	INIT_LIST_HEAD(&pa->pa_node.inode_list);
 	INIT_LIST_HEAD(&pa->pa_group_list);
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_INODE_PA;
@@ -4736,9 +4830,9 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
 
-	spin_lock(pa->pa_node_lock.inode_lock);
-	list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
-	spin_unlock(pa->pa_node_lock.inode_lock);
+	write_lock(pa->pa_node_lock.inode_lock);
+	ext4_mb_pa_rb_insert(&ei->i_prealloc_node, &pa->pa_node.inode_node);
+	write_unlock(pa->pa_node_lock.inode_lock);
 	atomic_inc(&ei->i_prealloc_active);
 }
 
@@ -4904,6 +4998,7 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 	struct ext4_prealloc_space *pa, *tmp;
 	struct list_head list;
 	struct ext4_buddy e4b;
+	struct ext4_inode_info *ei;
 	int err;
 	int free = 0;
 
@@ -4967,18 +5062,45 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 			list_del_rcu(&pa->pa_node.lg_list);
 			spin_unlock(pa->pa_node_lock.lg_lock);
 		} else {
-			spin_lock(pa->pa_node_lock.inode_lock);
-			list_del_rcu(&pa->pa_node.inode_list);
-			spin_unlock(pa->pa_node_lock.inode_lock);
+			/*
+			 * The allocation path might undelete a PA
+			 * incase it expects it to be used again in near
+			 * future. In that case, rollback the ongoing delete
+			 * operation and don't remove the PA from inode PA
+			 * tree.
+			 *
+			 * TODO: See if we need pa_lock since there might no
+			 * path that contends with it once the rbtree writelock
+			 * is taken.
+			 */
+			write_lock(pa->pa_node_lock.inode_lock);
+			spin_lock(&pa->pa_lock);
+			if (pa->pa_deleted == 0) {
+				free -= pa->pa_free;
+				list_add(&pa->pa_group_list,
+					 &grp->bb_prealloc_list);
+				list_del(&pa->u.pa_tmp_list);
+
+				spin_unlock(&pa->pa_lock);
+				write_unlock(pa->pa_node_lock.inode_lock);
+				continue;
+			}
+			spin_unlock(&pa->pa_lock);
+
+			ei = EXT4_I(pa->pa_inode);
+			rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+			write_unlock(pa->pa_node_lock.inode_lock);
 		}
 
-		if (pa->pa_type == MB_GROUP_PA)
+		list_del(&pa->u.pa_tmp_list);
+
+		if (pa->pa_type == MB_GROUP_PA) {
 			ext4_mb_release_group_pa(&e4b, pa);
-		else
+			call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+		} else {
 			ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
-
-		list_del(&pa->u.pa_tmp_list);
-		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+			ext4_mb_pa_free(pa);
+		}
 	}
 
 	ext4_unlock_group(sb, group);
@@ -5008,6 +5130,7 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 	ext4_group_t group = 0;
 	struct list_head list;
 	struct ext4_buddy e4b;
+	struct rb_node *iter;
 	int err;
 
 	if (!S_ISREG(inode->i_mode)) {
@@ -5030,17 +5153,18 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 
 repeat:
 	/* first, collect all pa's in the inode */
-	spin_lock(&ei->i_prealloc_lock);
-	while (!list_empty(&ei->i_prealloc_list) && needed) {
-		pa = list_entry(ei->i_prealloc_list.prev,
-				struct ext4_prealloc_space, pa_node.inode_list);
+	write_lock(&ei->i_prealloc_lock);
+	for (iter = rb_first(&ei->i_prealloc_node); iter && needed; iter = rb_next(iter)) {
+		pa = rb_entry(iter, struct ext4_prealloc_space,
+				pa_node.inode_node);
 		BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
+
 		spin_lock(&pa->pa_lock);
 		if (atomic_read(&pa->pa_count)) {
 			/* this shouldn't happen often - nobody should
 			 * use preallocation while we're discarding it */
 			spin_unlock(&pa->pa_lock);
-			spin_unlock(&ei->i_prealloc_lock);
+			write_unlock(&ei->i_prealloc_lock);
 			ext4_msg(sb, KERN_ERR,
 				 "uh-oh! used pa while discarding");
 			WARN_ON(1);
@@ -5051,7 +5175,7 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		if (pa->pa_deleted == 0) {
 			ext4_mb_mark_pa_deleted(sb, pa);
 			spin_unlock(&pa->pa_lock);
-			list_del_rcu(&pa->pa_node.inode_list);
+			rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
 			list_add(&pa->u.pa_tmp_list, &list);
 			needed--;
 			continue;
@@ -5059,7 +5183,7 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 
 		/* someone is deleting pa right now */
 		spin_unlock(&pa->pa_lock);
-		spin_unlock(&ei->i_prealloc_lock);
+		write_unlock(&ei->i_prealloc_lock);
 
 		/* we have to wait here because pa_deleted
 		 * doesn't mean pa is already unlinked from
@@ -5076,7 +5200,7 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		schedule_timeout_uninterruptible(HZ);
 		goto repeat;
 	}
-	spin_unlock(&ei->i_prealloc_lock);
+	write_unlock(&ei->i_prealloc_lock);
 
 	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
 		BUG_ON(pa->pa_type != MB_INODE_PA);
@@ -5108,7 +5232,7 @@  void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		put_bh(bitmap_bh);
 
 		list_del(&pa->u.pa_tmp_list);
-		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+		ext4_mb_pa_free(pa);
 	}
 }
 
@@ -5482,7 +5606,6 @@  static void ext4_mb_trim_inode_pa(struct inode *inode)
 static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 {
 	struct inode *inode = ac->ac_inode;
-	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_prealloc_space *pa = ac->ac_pa;
 	if (pa) {
@@ -5509,16 +5632,6 @@  static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 			}
 		}
 
-		if (pa->pa_type == MB_INODE_PA) {
-			/*
-			 * treat per-inode prealloc list as a lru list, then try
-			 * to trim the least recently used PA.
-			 */
-			spin_lock(pa->pa_node_lock.inode_lock);
-			list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
-			spin_unlock(pa->pa_node_lock.inode_lock);
-		}
-
 		ext4_mb_put_pa(ac, ac->ac_sb, pa);
 	}
 	if (ac->ac_bitmap_page)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 398a6688c341..f8e8ee493867 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -115,7 +115,7 @@  struct ext4_free_data {
 
 struct ext4_prealloc_space {
 	union {
-		struct list_head	inode_list; /* for inode PAs */
+		struct rb_node	inode_node;		/* for inode PA rbtree */
 		struct list_head	lg_list;	/* for lg PAs */
 	} pa_node;
 	struct list_head	pa_group_list;
@@ -132,10 +132,10 @@  struct ext4_prealloc_space {
 	ext4_grpblk_t		pa_free;	/* how many blocks are free */
 	unsigned short		pa_type;	/* pa type. inode or group */
 	union {
-		spinlock_t		*inode_lock;	/* locks the inode list holding this PA */
+		rwlock_t		*inode_lock;	/* locks the rbtree holding this PA */
 		spinlock_t		*lg_lock;	/* locks the lg list holding this PA */
 	} pa_node_lock;
-	struct inode		*pa_inode;	/* hack, for history only */
+	struct inode		*pa_inode;	/* used to get the inode during group discard */
 };
 
 enum {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 72ead3b56706..5fb3e401de6b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1325,9 +1325,9 @@  static struct inode *ext4_alloc_inode(struct super_block *sb)
 	inode_set_iversion(&ei->vfs_inode, 1);
 	ei->i_flags = 0;
 	spin_lock_init(&ei->i_raw_lock);
-	INIT_LIST_HEAD(&ei->i_prealloc_list);
+	ei->i_prealloc_node = RB_ROOT;
 	atomic_set(&ei->i_prealloc_active, 0);
-	spin_lock_init(&ei->i_prealloc_lock);
+	rwlock_init(&ei->i_prealloc_lock);
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
 	INIT_LIST_HEAD(&ei->i_es_list);