diff mbox

[17/22] btrfs: don't take the dio_sem in the fsync path

Message ID 20180719145006.17532-17-josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show

Commit Message

Josef Bacik July 19, 2018, 2:50 p.m. UTC
Since we've changed the fsync() path to always run ordered extents
before doing the tree log we no longer need to take the dio_sem in the
tree log path.  This gets rid of a lockdep splat that I was seeing with
the AIO tests.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/tree-log.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Filipe Manana July 19, 2018, 3:12 p.m. UTC | #1
On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> Since we've changed the fsync() path to always run ordered extents
> before doing the tree log we no longer need to take the dio_sem in the
> tree log path.  This gets rid of a lockdep splat that I was seeing with
> the AIO tests.

The dio_sem can be removed completely, it was added to synchronize
lockless dio writes with fsync to avoid races.
I.e., just removing it here in the tree-log makes it useless.

thanks!

>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/tree-log.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f8220ec02036..aa06e1954b84 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4439,7 +4439,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
>
>         INIT_LIST_HEAD(&extents);
>
> -       down_write(&inode->dio_sem);
>         write_lock(&tree->lock);
>         test_gen = root->fs_info->last_trans_committed;
>         logged_start = start;
> @@ -4520,7 +4519,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
>         }
>         WARN_ON(!list_empty(&extents));
>         write_unlock(&tree->lock);
> -       up_write(&inode->dio_sem);
>
>         btrfs_release_path(path);
>         if (!ret)
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana July 19, 2018, 3:21 p.m. UTC | #2
On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> Since we've changed the fsync() path to always run ordered extents
> before doing the tree log we no longer need to take the dio_sem in the
> tree log path.  This gets rid of a lockdep splat that I was seeing with
> the AIO tests.

So actually, we still need it (or some other means of sync).
Because even after the recent changes to fsync, the fast path still
logs extent items based on the extent maps, and the dio write path
creates first the extent map and then the ordered extent.
So the old problem can still happen between concurrent fsync and
lockless dio write, where fsync logs an extent item for an extent map
whose ordered extent we never waited for.
The solution prior to the introduction of dio_sem solved this - make
the dio write create first the ordered extent, and, only after it,
create the extent map.

>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/tree-log.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f8220ec02036..aa06e1954b84 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4439,7 +4439,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
>
>         INIT_LIST_HEAD(&extents);
>
> -       down_write(&inode->dio_sem);
>         write_lock(&tree->lock);
>         test_gen = root->fs_info->last_trans_committed;
>         logged_start = start;
> @@ -4520,7 +4519,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
>         }
>         WARN_ON(!list_empty(&extents));
>         write_unlock(&tree->lock);
> -       up_write(&inode->dio_sem);
>
>         btrfs_release_path(path);
>         if (!ret)
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik July 19, 2018, 3:54 p.m. UTC | #3
On Thu, Jul 19, 2018 at 04:21:58PM +0100, Filipe Manana wrote:
> On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> > Since we've changed the fsync() path to always run ordered extents
> > before doing the tree log we no longer need to take the dio_sem in the
> > tree log path.  This gets rid of a lockdep splat that I was seeing with
> > the AIO tests.
> 
> So actually, we still need it (or some other means of sync).
> Because even after the recent changes to fsync, the fast path still
> logs extent items based on the extent maps, and the dio write path
> creates first the extent map and then the ordered extent.
> So the old problem can still happen between concurrent fsync and
> lockless dio write, where fsync logs an extent item for an extent map
> whose ordered extent we never waited for.
> The solution prior to the introduction of dio_sem solved this - make
> the dio write create first the ordered extent, and, only after it,
> create the extent map.
> 

Oooh balls I see.  This is still a problem even if we add the ordered extent
first, because we can easily just start the lockless dio write after we've
waited for ordered extents, so the order we create the extent map and ordered
extent don't actually matter.  We still have this lockdep thing, I think we just
move the dio_sem to the start of fsync, if we're fsyncing you just don't get to
do lockless dio writes while we're doing the fsync.  What do you think?  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana July 19, 2018, 3:57 p.m. UTC | #4
On Thu, Jul 19, 2018 at 4:54 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Thu, Jul 19, 2018 at 04:21:58PM +0100, Filipe Manana wrote:
>> On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>> > Since we've changed the fsync() path to always run ordered extents
>> > before doing the tree log we no longer need to take the dio_sem in the
>> > tree log path.  This gets rid of a lockdep splat that I was seeing with
>> > the AIO tests.
>>
>> So actually, we still need it (or some other means of sync).
>> Because even after the recent changes to fsync, the fast path still
>> logs extent items based on the extent maps, and the dio write path
>> creates first the extent map and then the ordered extent.
>> So the old problem can still happen between concurrent fsync and
>> lockless dio write, where fsync logs an extent item for an extent map
>> whose ordered extent we never waited for.
>> The solution prior to the introduction of dio_sem solved this - make
>> the dio write create first the ordered extent, and, only after it,
>> create the extent map.
>>
>
> Oooh balls I see.  This is still a problem even if we add the ordered extent
> first, because we can easily just start the lockless dio write after we've
> waited for ordered extents, so the order we create the extent map and ordered
> extent don't actually matter.  We still have this lockdep thing, I think we just
> move the dio_sem to the start of fsync, if we're fsyncing you just don't get to
> do lockless dio writes while we're doing the fsync.  What do you think?  Thanks,

On first though, it seems to be the simplest and sanest solution :)
Thanks.

>
> Josef
diff mbox

Patch

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f8220ec02036..aa06e1954b84 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4439,7 +4439,6 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 	INIT_LIST_HEAD(&extents);
 
-	down_write(&inode->dio_sem);
 	write_lock(&tree->lock);
 	test_gen = root->fs_info->last_trans_committed;
 	logged_start = start;
@@ -4520,7 +4519,6 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	}
 	WARN_ON(!list_empty(&extents));
 	write_unlock(&tree->lock);
-	up_write(&inode->dio_sem);
 
 	btrfs_release_path(path);
 	if (!ret)