diff mbox series

[1/2] Btrfs: kill btrfs_clear_path_blocking

Message ID 1534453645-17714-2-git-send-email-bo.liu@linux.alibaba.com (mailing list archive)
State New, archived
Series address lock contention of btree root

Commit Message

Liu Bo Aug. 16, 2018, 9:07 p.m. UTC
Btrfs's btree locking has two modes, spinning mode and blocking mode.
While searching the btree, locking is always acquired in spinning mode
first and then converted to blocking mode if necessary; in some hot
paths we may switch the locking back to spinning mode via
btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for any
blocking readers and writers to complete before doing
read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() first needs to switch
the nodes in the path to blocking mode (via btrfs_set_path_blocking())
to make lockdep happy before doing its actual job of clearing the
blocking state.

Switching from spinning mode to blocking mode consists of

step 1) bumping up the blocking readers counter, and
step 2) read_unlock()/write_unlock().

This causes a serious ping-pong effect if there is a great number of
concurrent readers/writers, as waiters are woken up only to go back to
sleep immediately.
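
For reference, the write side of this conversion looks roughly like the
following; this is a simplified sketch of btrfs_set_lock_blocking_rw()
(fs/btrfs/locking.c), with asserts and the nested-lock case elided:

void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw)
{
	if (rw == BTRFS_WRITE_LOCK) {
		if (atomic_read(&eb->blocking_writers) == 0) {
			atomic_dec(&eb->spinning_writers);
			/* step 1) bump up the blocking writers counter */
			atomic_inc(&eb->blocking_writers);
			/* step 2) drop the rwlock, which wakes up waiters */
			write_unlock(&eb->lock);
		}
	}
	/* BTRFS_READ_LOCK is symmetric, using blocking_readers */
}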

1) Killing this kind of ping-pong results in a big improvement in my
1600k file creation script:

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/sdf $MNT
time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
        -d  $MNT/0  -d  $MNT/1 \
        -d  $MNT/2  -d  $MNT/3 \
        -d  $MNT/4  -d  $MNT/5 \
        -d  $MNT/6  -d  $MNT/7 \
        -d  $MNT/8  -d  $MNT/9 \
        -d  $MNT/10  -d  $MNT/11 \
        -d  $MNT/12  -d  $MNT/13 \
        -d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

2) dbench also shows the improvement:
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set leave_spinning to have all
nodes in the path stay in spinning mode, which means those callers are
prepared not to sleep before releasing the path. However, holding
blocking locks causes no problem for callers that never sleep, IOW, we
can just get rid of leave_spinning.
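
As a supporting sketch (simplified from btrfs_tree_unlock() in
fs/btrfs/locking.c, asserts and bookkeeping elided), releasing a tree
lock works the same way no matter which mode the lock ended up in:

void btrfs_tree_unlock(struct extent_buffer *eb)
{
	if (atomic_read(&eb->blocking_writers)) {
		/* blocking release: the rwlock was already dropped when
		 * the lock went blocking, so just drop the counter and
		 * wake up waiters once */
		atomic_dec(&eb->blocking_writers);
		cond_wake_up_nomb(&eb->write_lock_wq);
	} else {
		/* spinning release */
		atomic_dec(&eb->spinning_writers);
		write_unlock(&eb->lock);
	}
}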

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
 fs/btrfs/ctree.h         |  2 --
 fs/btrfs/delayed-inode.c |  3 ---
 3 files changed, 4 insertions(+), 58 deletions(-)

Comments

Nikolay Borisov Aug. 17, 2018, 7:24 a.m. UTC | #1
On 17.08.2018 00:07, Liu Bo wrote:
> Btrfs's btree locking has two modes, spinning mode and blocking mode.
> While searching the btree, locking is always acquired in spinning mode
> first and then converted to blocking mode if necessary; in some hot
> paths we may switch the locking back to spinning mode via
> btrfs_clear_path_blocking().
> 
> When acquiring locks, both readers and writers need to wait for any
> blocking readers and writers to complete before doing
> read_lock()/write_lock().
> 
> The problem is that btrfs_clear_path_blocking() first needs to switch
> the nodes in the path to blocking mode (via btrfs_set_path_blocking())
> to make lockdep happy before doing its actual job of clearing the
> blocking state.
> 
> Switching from spinning mode to blocking mode consists of
> 
> step 1) bumping up the blocking readers counter, and
> step 2) read_unlock()/write_unlock().
> 
> This causes a serious ping-pong effect if there is a great number of
> concurrent readers/writers, as waiters are woken up only to go back to
> sleep immediately.

I think this ping-pong needs to be explained in a bit more detail.

Looking at the code, it seems the issue happens when we have a path
locked in spinning mode (via btrfs_tree_lock) - in this case we have
blocking_readers/blocking_writers == 0 and the write_lock acquired. Then
we could potentially have multiple requests to lock the tree (in this
case the root), and since the current holder is spinning,
btrfs_tree_lock callers will block on the write_lock. Subsequently, when
the holder switches to blocking, it calls
btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
increments the blocking readers/writers counter and calls the unlock, at
which point a "thundering herd" problem occurs on the lock. Am I
correct?
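
For reference, the waiter side I'm describing looks roughly like this
(a simplified sketch of btrfs_tree_lock() from fs/btrfs/locking.c,
bookkeeping and the nested-lock case elided):

void btrfs_tree_lock(struct extent_buffer *eb)
{
again:
	/* waiters sleep here until the blocking counters drain to 0 */
	wait_event(eb->read_lock_wq, atomic_read(&eb->blocking_readers) == 0);
	wait_event(eb->write_lock_wq, atomic_read(&eb->blocking_writers) == 0);
	write_lock(&eb->lock);
	if (atomic_read(&eb->blocking_readers) ||
	    atomic_read(&eb->blocking_writers)) {
		/* the holder went blocking again: unlock, back to sleep */
		write_unlock(&eb->lock);
		goto again;
	}
	/* lock is now held in spinning mode */
	atomic_inc(&eb->spinning_writers);
}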

Furthermore I think the problem occurs not in btrfs_clear_path_blocking
but rather in the initial call to btrfs_set_path_blocking.

The way I see it, the following sequence of operations occurs:

1. Call btrfs_set_path_blocking: increments
blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
the lock types for the nodes in the path are
BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING

2. Some time later we call btrfs_clear_path_blocking, which also calls
btrfs_set_path_blocking; however, this time the function doesn't do
anything because the lock types in the path struct are already blocking
and neither of the two conditions inside btrfs_set_lock_blocking_rw will
match, hence this code won't be executed.

Did I miss something?

Liu Bo Aug. 20, 2018, 6:07 a.m. UTC | #2
On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> [...]
> 
> I think this ping-pong needs to be explained in a bit more detail.
> 
> Looking at the code, it seems the issue happens when we have a path
> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> blocking_readers/blocking_writers == 0 and the write_lock acquired. Then
> we could potentially have multiple requests to lock the tree (in this
> case the root), and since the current holder is spinning,
> btrfs_tree_lock callers will block on the write_lock. Subsequently, when
> the holder switches to blocking, it calls
> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
> increments the blocking readers/writers counter and calls the unlock, at
> which point a "thundering herd" problem occurs on the lock. Am I
> correct?
>

That's correct, but the "thundering herd" problem is not really the
problem that this patch is trying to address, see more below.

> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> but rather in the initial call to btrfs_set_path_blocking.
> 
> The way I see it, the following sequence of operations occurs:
> 
> 1. Call btrfs_set_path_blocking: increments
> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> the lock types for the nodes in the path are
> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> 
> 2. Some time later we call btrfs_clear_path_blocking, which also calls
> btrfs_set_path_blocking; however, this time the function doesn't do
> anything because the lock types in the path struct are already blocking
> and neither of the two conditions inside btrfs_set_lock_blocking_rw will
> match, hence this code won't be executed.
> 
> Did I miss something?
> 

Thanks for sharing your thoughts.

We have to set the path blocking as the subsequent code needs to sleep;
it's the cost we have to pay by design, and I'm OK with it.

The problem is that btrfs_clear_path_blocking always tries to set the
path to blocking before clearing it, as it needs to ensure the
spinlocks on each level are acquired in the right order.  And if all the
nodes in the path are already in spinning mode (say that
should_cow_block() decides not to COW), setting the path blocking
doesn't really make sense and causes everyone waiting on the lock to do
a wake-up and a go-to-sleep.
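
To illustrate the extra wake-up, clearing one level back to spinning
does roughly the following (a simplified sketch of the write side of
btrfs_clear_lock_blocking_rw()):

	if (rw == BTRFS_WRITE_LOCK_BLOCKING) {
		/* contend on the rwlock together with the waiters that
		 * were woken up when the lock went blocking */
		write_lock(&eb->lock);
		atomic_inc(&eb->spinning_writers);
		/* dropping the counter wakes the wait queue once more,
		 * but whoever wakes up just blocks on the rwlock that
		 * we still hold */
		if (atomic_dec_and_test(&eb->blocking_writers))
			cond_wake_up_nomb(&eb->write_lock_wq);
	}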

My trace showed that there could be up to 50 such switches for a single
lock waiter, and the difference in finish time also proved how big the
impact is.

thanks,
-liubo
Nikolay Borisov Aug. 20, 2018, 7:31 a.m. UTC | #3
On 20.08.2018 09:07, Liu Bo wrote:
> On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
>> [...]
>>
>> I think this ping-pong needs to be explained in a bit more detail.
>>
>> Looking at the code, it seems the issue happens when we have a path
>> locked in spinning mode (via btrfs_tree_lock) - in this case we have
>> blocking_readers/blocking_writers == 0 and the write_lock acquired. Then
>> we could potentially have multiple requests to lock the tree (in this
>> case the root), and since the current holder is spinning,
>> btrfs_tree_lock callers will block on the write_lock. Subsequently, when
>> the holder switches to blocking, it calls
>> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
>> increments the blocking readers/writers counter and calls the unlock, at
>> which point a "thundering herd" problem occurs on the lock. Am I correct?
>>
> 
> That's correct, but the "thundering herd" problem is not really the
> problem that this patch is trying to address, see more below.
> 
>> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
>> but rather in the initial call to btrfs_set_path_blocking.
>>
>> The way I see it, the following sequence of operations occurs:
>>
>> 1. Call btrfs_set_path_blocking: increments
>> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
>> the lock types for the nodes in the path are
>> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
>>
>> 2. Some time later we call btrfs_clear_path_blocking, which also calls
>> btrfs_set_path_blocking; however, this time the function doesn't do
>> anything because the lock types in the path struct are already blocking
>> and neither of the two conditions inside btrfs_set_lock_blocking_rw will
>> match, hence this code won't be executed.
>>
>> Did I miss something?
>>
> 
> Thanks for sharing your thoughts.
> 
> We have to set the path blocking as the subsequent code needs to sleep;
> it's the cost we have to pay by design, and I'm OK with it.
> 
> The problem is that btrfs_clear_path_blocking always tries to set the
> path to blocking before clearing it, as it needs to ensure the
> spinlocks on each level are acquired in the right order.  And if all
> the nodes in the path are already in spinning mode (say that
> should_cow_block() decides not to COW), setting the path blocking
> doesn't really make sense and causes everyone waiting on the lock to
> do a wake-up and a go-to-sleep.

Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
agree this could happen, but it can't in any of the other hunks you
touched, since they all call btrfs_clear_path_blocking AFTER
btrfs_set_path_blocking was called. So the problem can't stem from them,
since they take the perf hit at the time btrfs_set_path_blocking is
called. IMO this is important information for context and needs to be
included in the changelog. In short, the only problematic code is the
one in btrfs_search_slot.

Another less intrusive idea is to touch only btrfs_search_slot by
introducing a boolean flag to indicate we have set the path to
blocking/COWed blocks.

Then the fix will be as simple as:

if (cowed/path_blocking/whatever)
    btrfs_clear_path_blocking(p, NULL, 0);
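
(Untested, just to sketch the idea; "cowed" is a hypothetical flag that
would be set in the COW branch earlier in btrfs_search_slot:)

cow_done:
		p->nodes[level] = b;
		/* only pay for the clear if COW actually went blocking */
		if (cowed)
			btrfs_clear_path_blocking(p, NULL, 0);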

Yours is also a valid approach, albeit it seems more heavyweight. Having
said that, I'm all in favor of removing code so long as it does not
introduce any regressions :)

Liu Bo Aug. 21, 2018, 5:24 p.m. UTC | #4
On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> [...]
> 
> Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
> agree this could happen, but it can't in any of the other hunks you
> touched, since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from them,
> since they take the perf hit at the time btrfs_set_path_blocking is
> called. IMO this is important information for context and needs to be
> included in the changelog. In short, the only problematic code is the
> one in btrfs_search_slot.
> 
> Another less intrusive idea is to touch only btrfs_search_slot by
> introducing a boolean flag to indicate we have set the path to
> blocking/COWed blocks.
> 
> Then the fix will be as simple as:
> 
> if (cowed/path_blocking/whatever)
>     btrfs_clear_path_blocking(p, NULL, 0);

if (cow)
    btrfs_clear_path_blocking();

I did exactly this at first, and it ended up at 1m20s, while killing
all the btrfs_clear_path_blocking() calls gave a stable 1m2s, so I
confirmed that even switching the path back from blocking to spinning
also hurts a little bit.

> 
> Yours is also a valid approach, albeit it seems more heavyweight. Having
> said that, I'm all in favor of removing code so long as it does not
> introduce any regressions :)

As all the callers that were running in a spinning lock context should
be ready to cope with a blocking lock context, it'd be fine IMO.

Thanks for the comments again.

thanks,
-liubo

diff mbox series

Patch

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d436fb4c002e..8b31caa60b0a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -52,42 +52,6 @@  noinline void btrfs_set_path_blocking(struct btrfs_path *p)
 	}
 }
 
-/*
- * reset all the locked nodes in the patch to spinning locks.
- *
- * held is used to keep lockdep happy, when lockdep is enabled
- * we set held to a blocking lock before we go around and
- * retake all the spinlocks in the path.  You can safely use NULL
- * for held
- */
-noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
-					struct extent_buffer *held, int held_rw)
-{
-	int i;
-
-	if (held) {
-		btrfs_set_lock_blocking_rw(held, held_rw);
-		if (held_rw == BTRFS_WRITE_LOCK)
-			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
-		else if (held_rw == BTRFS_READ_LOCK)
-			held_rw = BTRFS_READ_LOCK_BLOCKING;
-	}
-	btrfs_set_path_blocking(p);
-
-	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
-		if (p->nodes[i] && p->locks[i]) {
-			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
-			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
-				p->locks[i] = BTRFS_WRITE_LOCK;
-			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
-				p->locks[i] = BTRFS_READ_LOCK;
-		}
-	}
-
-	if (held)
-		btrfs_clear_lock_blocking_rw(held, held_rw);
-}
-
 /* this also releases the path */
 void btrfs_free_path(struct btrfs_path *p)
 {
@@ -1306,7 +1270,6 @@  static struct tree_mod_elem *__tree_mod_log_oldest_root(
 		}
 	}
 
-	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
 	btrfs_tree_read_unlock_blocking(eb);
 	free_extent_buffer(eb);
 
@@ -2483,7 +2446,6 @@  noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = split_node(trans, root, p, level);
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		BUG_ON(sret > 0);
 		if (sret) {
@@ -2504,7 +2466,6 @@  noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = balance_level(trans, root, p, level);
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		if (sret) {
 			ret = sret;
@@ -2789,7 +2750,10 @@  int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		}
 cow_done:
 		p->nodes[level] = b;
-		btrfs_clear_path_blocking(p, NULL, 0);
+		/*
+		 * Leave path with blocking locks to avoid massive
+		 * lock context switch, this is made on purpose.
+		 */
 
 		/*
 		 * we have a lock on b and as long as we aren't changing
@@ -2871,8 +2835,6 @@  int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					if (!err) {
 						btrfs_set_path_blocking(p);
 						btrfs_tree_lock(b);
-						btrfs_clear_path_blocking(p, b,
-								  BTRFS_WRITE_LOCK);
 					}
 					p->locks[level] = BTRFS_WRITE_LOCK;
 				} else {
@@ -2880,8 +2842,6 @@  int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					if (!err) {
 						btrfs_set_path_blocking(p);
 						btrfs_tree_read_lock(b);
-						btrfs_clear_path_blocking(p, b,
-								  BTRFS_READ_LOCK);
 					}
 					p->locks[level] = BTRFS_READ_LOCK;
 				}
@@ -2900,7 +2860,6 @@  int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 				btrfs_set_path_blocking(p);
 				err = split_leaf(trans, root, key,
 						 p, ins_len, ret == 0);
-				btrfs_clear_path_blocking(p, NULL, 0);
 
 				BUG_ON(err > 0);
 				if (err) {
@@ -2967,7 +2926,6 @@  int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 	while (b) {
 		level = btrfs_header_level(b);
 		p->nodes[level] = b;
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		/*
 		 * we have a lock on b and as long as we aren't changing
@@ -3013,8 +2971,6 @@  int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 			if (!err) {
 				btrfs_set_path_blocking(p);
 				btrfs_tree_read_lock(b);
-				btrfs_clear_path_blocking(p, b,
-							  BTRFS_READ_LOCK);
 			}
 			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
 			if (!b) {
@@ -5198,7 +5154,6 @@  int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 		path->locks[level - 1] = BTRFS_READ_LOCK;
 		path->nodes[level - 1] = cur;
 		unlock_up(path, level, 1, 0, NULL);
-		btrfs_clear_path_blocking(path, NULL, 0);
 	}
 out:
 	path->keep_locks = keep_locks;
@@ -5783,8 +5738,6 @@  int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				btrfs_tree_read_lock(next);
-				btrfs_clear_path_blocking(path, next,
-							  BTRFS_READ_LOCK);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
@@ -5820,8 +5773,6 @@  int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				btrfs_tree_read_lock(next);
-				btrfs_clear_path_blocking(path, next,
-							  BTRFS_READ_LOCK);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 318be7864072..1aeed3c0e949 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2876,8 +2876,6 @@  int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 struct btrfs_path *btrfs_alloc_path(void);
 void btrfs_free_path(struct btrfs_path *p);
 void btrfs_set_path_blocking(struct btrfs_path *p);
-void btrfs_clear_path_blocking(struct btrfs_path *p,
-			       struct extent_buffer *held, int held_rw);
 void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
 
 int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index f51b509f2d9b..db9f45082c61 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -762,9 +762,6 @@  static int btrfs_batch_insert_items(struct btrfs_root *root,
 		i++;
 	}
 
-	/* reset all the locked nodes in the patch to spinning locks. */
-	btrfs_clear_path_blocking(path, NULL, 0);
-
 	/* insert the keys of the items */
 	setup_items_for_insert(root, path, keys, data_size,
 			       total_data_size, total_size, nitems);