btrfs: run delayed iput at unlink time

Message ID 20190507172734.93994-1-josef@toxicpanda.com
State New
Series
  • btrfs: run delayed iput at unlink time

Commit Message

Josef Bacik May 7, 2019, 5:27 p.m. UTC
We have been seeing issues in production where a cleaner script will end
up unlinking a bunch of files that have pending iputs.  This means they
will get their final iputs run at btrfs-cleaner time and thus are not
throttled, which impacts the workload.

Since we are unlinking these files we can just drop the delayed iput at
unlink time.  We are already holding a reference to the inode so this
will not be the final iput and thus is completely safe to do at this
point.  Doing this means we are more likely to be doing the final iput
at unlink time, and thus will get the IO charged to the caller and get
throttled appropriately without affecting the main workload.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Nikolay Borisov May 8, 2019, 7:15 a.m. UTC | #1
On 7.05.19 20:27, Josef Bacik wrote:
> We have been seeing issues in production where a cleaner script will end
> up unlinking a bunch of files that have pending iputs.  This means they
> will get their final iputs run at btrfs-cleaner time and thus are not
> throttled, which impacts the workload.
> 
> Since we are unlinking these files we can just drop the delayed iput at
> unlink time.  We are already holding a reference to the inode so this
> will not be the final iput and thus is completely safe to do at this
> point.  Doing this means we are more likely to be doing the final iput
> at unlink time, and thus will get the IO charged to the caller and get
> throttled appropriately without affecting the main workload.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/inode.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b6d549c993f6..e58685b5d398 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
>  		ret = 0;
>  	else if (ret)
>  		btrfs_abort_transaction(trans, ret);
> +
> +	/*
> +	 * If we have a pending delayed iput we could end up with the final iput
> +	 * being run in btrfs-cleaner context.  If we have enough of these built
> +	 * up we can end up burning a lot of time in btrfs-cleaner without any
> +	 * way to throttle the unlinks.  Since we're currently holding a ref on
> +	 * the inode we can run the delayed iput here without any issues as the
> +	 * final iput won't be done until after we drop the ref we're currently
> +	 * holding.
> +	 */

FWIW the caller is not really holding an explicit reference, rather
there is a reference held by the dentry which is going to be disposed of
by vfs. Considering this I'd say this is a false claim. I.e "we" do not
hold a reference.

> +	if (!list_empty(&inode->delayed_iput)) {
> +		spin_lock(&fs_info->delayed_iput_lock);
> +		if (!list_empty(&inode->delayed_iput)) {
> +			list_del_init(&inode->delayed_iput);
> +			spin_unlock(&fs_info->delayed_iput_lock);
> +			iput(&inode->vfs_inode);
> +			if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
> +				wake_up(&fs_info->delayed_iputs_wait);
> +		} else {
> +			spin_unlock(&fs_info->delayed_iput_lock);
> +		}
> +	}

OTOH this really feels like a hack and this stems from the fact that
iput is rather rudimentary. Additionally you are essentially opencoding
the body of btrfs_run_delayed_iputs. I was going to suggest to introduce
a new helper factoring out the common code but it will get ugly due to
the spin lock being dropped before doing the iput.

But then I'm really starting to question the utility of delayed iputs.
Presumably it was added to defer the expensive final iput in the cleaner
context or avoid some deadlocks (but we don't know which exactly). Yet,
here we are some time later where you are essentially saying "this
mechanism is suboptimal because it's dumb and instead of improving
things it's making them worse in certain cases, so let's unload it a bit
by doing an iput here".



>  err:
>  	btrfs_free_path(path);
>  	if (ret)
>
Chris Mason May 8, 2019, 1:09 p.m. UTC | #2
On 8 May 2019, at 3:15, Nikolay Borisov wrote:

> On 7.05.19 20:27, Josef Bacik wrote:
>> We have been seeing issues in production where a cleaner script will end
>> up unlinking a bunch of files that have pending iputs.  This means they
>> will get their final iputs run at btrfs-cleaner time and thus are not
>> throttled, which impacts the workload.
>>
>> Since we are unlinking these files we can just drop the delayed iput at
>> unlink time.  We are already holding a reference to the inode so this
>> will not be the final iput and thus is completely safe to do at this
>> point.  Doing this means we are more likely to be doing the final iput
>> at unlink time, and thus will get the IO charged to the caller and get
>> throttled appropriately without affecting the main workload.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/inode.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index b6d549c993f6..e58685b5d398 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
>>  		ret = 0;
>>  	else if (ret)
>>  		btrfs_abort_transaction(trans, ret);
>> +
>> +	/*
>> +	 * If we have a pending delayed iput we could end up with the final iput
>> +	 * being run in btrfs-cleaner context.  If we have enough of these built
>> +	 * up we can end up burning a lot of time in btrfs-cleaner without any
>> +	 * way to throttle the unlinks.  Since we're currently holding a ref on
>> +	 * the inode we can run the delayed iput here without any issues as the
>> +	 * final iput won't be done until after we drop the ref we're currently
>> +	 * holding.
>> +	 */
>
> FWIW the caller is not really holding an explicit reference, rather
> there is a reference held by the dentry which is going to be disposed of
> by vfs. Considering this I'd say this is a false claim. I.e "we" do not
> hold a reference.

It's impossible to call this function without a reference held on the
inode, so this is kind of nit-picking on "we" vs "the vfs".

>
>> +	if (!list_empty(&inode->delayed_iput)) {
>> +		spin_lock(&fs_info->delayed_iput_lock);
>> +		if (!list_empty(&inode->delayed_iput)) {
>> +			list_del_init(&inode->delayed_iput);
>> +			spin_unlock(&fs_info->delayed_iput_lock);
>> +			iput(&inode->vfs_inode);
>> +			if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
>> +				wake_up(&fs_info->delayed_iputs_wait);
>> +		} else {
>> +			spin_unlock(&fs_info->delayed_iput_lock);
>> +		}
>> +	}
>
> OTOH this really feels like a hack and this stems from the fact that
> iput is rather rudimentary. Additionally you are essentially opencoding
> the body of btrfs_run_delayed_iputs. I was going to suggest to introduce
> a new helper factoring out the common code but it will get ugly due to
> the spin lock being dropped before doing the iput.
>
> But then I'm really starting to question the utility of delayed iputs.
> Presumably it was added to defer the expensive final iput in the cleaner
> context or avoid some deadlocks (but we don't know which exactly). Yet,
> here we are some time later where you are essentially saying "this
> mechanism is suboptimal because it's dumb and instead of improving
> things it's making them worse in certain cases, so let's unload it a bit
> by doing an iput here".

The final iput is pretty expensive, since it potentially does the full 
truncate of an arbitrary sized file.  There are a lot of contexts it 
can't be called from, so the delayed iput code saves us from some 
impossible situations.  It was originally introduced here:

commit 24bbcf0442ee04660a5a030efdbb6d03f1c275cb
Author: Yan, Zheng <zheng.yan@oracle.com>
Date:   Thu Nov 12 09:36:34 2009 +0000

     Btrfs: Add delayed iput

But we've expanded usage to solve a few different deadlocks.

-chris
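
The reference-count argument in this exchange can be sketched with a small
userspace model. This is only an illustration under simplifying assumptions:
the toy_* names are hypothetical, the inode is assumed to have exactly two
references (one from the dentry, one from the pending delayed iput), and
"eviction" is reduced to recording which context dropped the last reference.
Real inodes can have further references (open files, for example), which is
why the commit message only says the final iput is "more likely" to happen at
unlink time.

```c
#include <assert.h>

/* Hypothetical labels for "who ends up paying for the final iput". */
enum toy_ctx { TOY_CTX_NONE, TOY_CTX_UNLINK_CALLER, TOY_CTX_CLEANER };

/* Toy inode: a reference count plus the context that dropped the last ref. */
struct toy_inode {
	int refs;
	enum toy_ctx evicted_in;
};

/* Drop one reference; the context dropping the last one does the eviction. */
static void toy_iput(struct toy_inode *inode, enum toy_ctx ctx)
{
	assert(inode->refs > 0);
	if (--inode->refs == 0)
		inode->evicted_in = ctx;
}

/*
 * Without the patch: the caller's path drops the dentry reference first,
 * and btrfs-cleaner later drops the delayed one, which is then final.
 */
static enum toy_ctx toy_unlink_without_patch(void)
{
	struct toy_inode inode = { 2, TOY_CTX_NONE };

	toy_iput(&inode, TOY_CTX_UNLINK_CALLER);	/* dentry reference */
	toy_iput(&inode, TOY_CTX_CLEANER);		/* delayed iput */
	return inode.evicted_in;
}

/*
 * With the patch: the delayed iput runs at unlink time while the dentry
 * reference is still held, so it cannot be the final one.  The final iput
 * then happens when the dentry reference goes away in the caller's context.
 */
static enum toy_ctx toy_unlink_with_patch(void)
{
	struct toy_inode inode = { 2, TOY_CTX_NONE };

	toy_iput(&inode, TOY_CTX_UNLINK_CALLER);	/* delayed iput, run early */
	assert(inode.evicted_in == TOY_CTX_NONE);	/* not the final iput */
	toy_iput(&inode, TOY_CTX_UNLINK_CALLER);	/* dentry reference */
	return inode.evicted_in;
}
```

In this model the only difference between the two paths is the order in which
the two references are dropped, which is what decides whether the expensive
final iput is charged to the unlinking task or to btrfs-cleaner.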
David Sterba May 9, 2019, 2:27 p.m. UTC | #3
On Wed, May 08, 2019 at 10:15:16AM +0300, Nikolay Borisov wrote:
> > +	if (!list_empty(&inode->delayed_iput)) {
> > +		spin_lock(&fs_info->delayed_iput_lock);
> > +		if (!list_empty(&inode->delayed_iput)) {
> > +			list_del_init(&inode->delayed_iput);
> > +			spin_unlock(&fs_info->delayed_iput_lock);
> > +			iput(&inode->vfs_inode);
> > +			if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
> > +				wake_up(&fs_info->delayed_iputs_wait);
> > +		} else {
> > +			spin_unlock(&fs_info->delayed_iput_lock);
> > +		}
> > +	}
> 
> OTOH this really feels like a hack and this stems from the fact that
> iput is rather rudimentary. Additionally you are essentially opencoding
> the body of btrfs_run_delayed_iputs. I was going to suggest to introduce
> a new helper factoring out the common code but it will get ugly due to
> the spin lock being dropped before doing the iput.

Yeah this should be in a helper, something like

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3286,6 +3286,31 @@ void btrfs_add_delayed_iput(struct inode *inode)
                wake_up_process(fs_info->cleaner_kthread);
 }
 
+/* Caller must hold delayed_iput_lock; it is dropped around the iput. */
+static void run_delayed_iput_now(struct btrfs_inode *inode)
+{
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+       list_del_init(&inode->delayed_iput);
+       spin_unlock(&fs_info->delayed_iput_lock);
+       iput(&inode->vfs_inode);
+       if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
+               wake_up(&fs_info->delayed_iputs_wait);
+       spin_lock(&fs_info->delayed_iput_lock);
+}
+
+void btrfs_run_delayed_iput_now(struct btrfs_inode *inode)
+{
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+       spin_lock(&fs_info->delayed_iput_lock);
+       if (list_empty(&inode->delayed_iput))
+               goto out_unlock;
+       run_delayed_iput_now(inode);
+out_unlock:
+       spin_unlock(&fs_info->delayed_iput_lock);
+}
+
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 {
 
@@ -3295,12 +3320,8 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 
                inode = list_first_entry(&fs_info->delayed_iputs,
                                struct btrfs_inode, delayed_iput);
-               list_del_init(&inode->delayed_iput);
-               spin_unlock(&fs_info->delayed_iput_lock);
-               iput(&inode->vfs_inode);
-               if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
-                       wake_up(&fs_info->delayed_iputs_wait);
-               spin_lock(&fs_info->delayed_iput_lock);
+
+               run_delayed_iput_now(inode);
        }
        spin_unlock(&fs_info->delayed_iput_lock);
 }
---

The delayed_iput_lock is not so contended that the first check needs to be
done unlocked. There are only list manipulations in the critical section.

The above does one unnecessary lock/unlock when the standalone delayed
iput is called, but I don't see a cleaner way right now.
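
The extra lock/unlock pair mentioned above can be made visible with a small
userspace model of the helper split. This is a sketch under stated
assumptions, not kernel code: the toy_* names are hypothetical, the spinlock
is replaced by a counter that only records acquisitions and checks lock
discipline, and the iput itself is reduced to decrementing an integer.

```c
#include <assert.h>

/* Toy stand-in for a spinlock: tracks state and counts acquisitions. */
struct toy_lock {
	int held;
	int acquisitions;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical inode model. */
struct toy_inode {
	int queued;	/* models !list_empty(&inode->delayed_iput) */
	int refs;	/* models the inode reference count */
};

struct toy_fs {
	struct toy_lock lock;	/* models delayed_iput_lock */
	int nr_delayed;		/* models nr_delayed_iputs */
};

/* Entered with the lock held, returns with the lock held again. */
static void toy_run_one_locked(struct toy_fs *fs, struct toy_inode *inode)
{
	inode->queued = 0;
	toy_lock_release(&fs->lock);	/* never do the iput under the lock */
	inode->refs--;
	fs->nr_delayed--;
	toy_lock_acquire(&fs->lock);	/* re-taken for the caller's benefit */
}

/* Standalone entry point: check under the lock, then reuse the helper. */
static void toy_run_delayed_iput_now(struct toy_fs *fs, struct toy_inode *inode)
{
	toy_lock_acquire(&fs->lock);
	if (inode->queued)
		toy_run_one_locked(fs, inode);
	toy_lock_release(&fs->lock);
}
```

When the inode is queued, the standalone call takes the lock twice: once for
the check and once more inside the helper, only to release it immediately
afterwards. The loop in btrfs_run_delayed_iputs needs that re-acquisition to
continue walking the list, so sharing the helper trades one redundant
lock/unlock for not open-coding the sequence a second time.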
David Sterba June 13, 2019, 5:04 p.m. UTC | #4
On Tue, May 07, 2019 at 01:27:34PM -0400, Josef Bacik wrote:
> We have been seeing issues in production where a cleaner script will end
> up unlinking a bunch of files that have pending iputs.  This means they
> will get their final iputs run at btrfs-cleaner time and thus are not
> throttled, which impacts the workload.
> 
> Since we are unlinking these files we can just drop the delayed iput at
> unlink time.  We are already holding a reference to the inode so this
> will not be the final iput and thus is completely safe to do at this
> point.  Doing this means we are more likely to be doing the final iput
> at unlink time, and thus will get the IO charged to the caller and get
> throttled appropriately without affecting the main workload.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Ping for updates.

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b6d549c993f6..e58685b5d398 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 		ret = 0;
 	else if (ret)
 		btrfs_abort_transaction(trans, ret);
+
+	/*
+	 * If we have a pending delayed iput we could end up with the final iput
+	 * being run in btrfs-cleaner context.  If we have enough of these built
+	 * up we can end up burning a lot of time in btrfs-cleaner without any
+	 * way to throttle the unlinks.  Since we're currently holding a ref on
+	 * the inode we can run the delayed iput here without any issues as the
+	 * final iput won't be done until after we drop the ref we're currently
+	 * holding.
+	 */
+	if (!list_empty(&inode->delayed_iput)) {
+		spin_lock(&fs_info->delayed_iput_lock);
+		if (!list_empty(&inode->delayed_iput)) {
+			list_del_init(&inode->delayed_iput);
+			spin_unlock(&fs_info->delayed_iput_lock);
+			iput(&inode->vfs_inode);
+			if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
+				wake_up(&fs_info->delayed_iputs_wait);
+		} else {
+			spin_unlock(&fs_info->delayed_iput_lock);
+		}
+	}
 err:
 	btrfs_free_path(path);
 	if (ret)