Kernel BUG on Snapshot Deletion (3.11.0-rc5)

Message ID CAKcLGm_jFa_2FaaaudRZF8JRymh1t3ASMFN1tFGbs0PGt1vNaQ@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Mitch Harder Aug. 21, 2013, 1:44 p.m. UTC
On Thu, Aug 15, 2013 at 12:29 PM, Mitch Harder
<mitch.harder@sabayonlinux.org> wrote:
> I'm running into a curious problem.
>
> In the process of making my script portable, I am breaking the ability
> to replicate the error.
>
> I'm trying to isolate the aspect of my local script that is triggering
> the error.  No firm insights yet.
>
>
> On Tue, Aug 13, 2013 at 11:03 AM, Mitch Harder
> <mitch.harder@sabayonlinux.org> wrote:
>> Let me work on making that script more portable, and hopefully quicker
>> to reproduce.
>>
>> On Tue, Aug 13, 2013 at 9:15 AM, Josef Bacik <jbacik@fusionio.com> wrote:
>>> On Mon, Aug 12, 2013 at 11:06:27PM -0500, Mitch Harder wrote:
>>>> I'm hitting a btrfs Kernel BUG running a snapshot stress script with
>>>> linux-3.11.0-rc5.
>>>>
>>>
>>> I can haz script?  Thanks,
>>>

I've had a hard time assembling a portable reproducer for this issue.

I discovered that my reproducer was highly dependent on a local
archive of out-of-date git kernel sources.  My efforts to reproduce
the error with a portable set of scripts with publicly available
kernel git sources weren't successful.

It seems like this issue is related to a corner-case workload that is
difficult to reproduce.

So I've bisected the error I was seeing with my local script, and
identified the following commit as triggering my issue:

commit:    3c64a1aba7cfcb04f79e76f859b3d66660275d59
Btrfs: cleanup: don't check the same thing twice
https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/fs/btrfs?h=for-linus&id=3c64a1aba7cfcb04

I tested a kernel which reverted this change, and also added WARN_ON
lines to provide a back trace.

Comments

Josef Bacik Aug. 21, 2013, 1:59 p.m. UTC | #1
On Wed, Aug 21, 2013 at 08:44:55AM -0500, Mitch Harder wrote:
> On Thu, Aug 15, 2013 at 12:29 PM, Mitch Harder
> <mitch.harder@sabayonlinux.org> wrote:
> > I'm running into a curious problem.
> >
> > In the process of making my script portable, I am breaking the ability
> > to replicate the error.
> >
> > I'm trying to isolate the aspect of my local script that is triggering
> > the error.  No firm insights yet.
> >
> >
> > On Tue, Aug 13, 2013 at 11:03 AM, Mitch Harder
> > <mitch.harder@sabayonlinux.org> wrote:
> >> Let me work on making that script more portable, and hopefully quicker
> >> to reproduce.
> >>
> >> On Tue, Aug 13, 2013 at 9:15 AM, Josef Bacik <jbacik@fusionio.com> wrote:
> >>> On Mon, Aug 12, 2013 at 11:06:27PM -0500, Mitch Harder wrote:
> >>>> I'm hitting a btrfs Kernel BUG running a snapshot stress script with
> >>>> linux-3.11.0-rc5.
> >>>>
> >>>
> >>> I can haz script?  Thanks,
> >>>
> 
> I've had a hard time assembling a portable reproducer for this issue.
> 
> I discovered that my reproducer was highly dependent on a local
> archive of out-of-date git kernel sources.  My efforts to reproduce
> the error with a portable set of scripts with publicly available
> kernel git sources weren't successful.
> 
> It seems like this issue is related to a corner-case workload that is
> difficult to reproduce.
> 
> So I've bisected the error I was seeing with my local script, and
> identified the following commit as triggering my issue:
> 
> commit:    3c64a1aba7cfcb04f79e76f859b3d66660275d59
> Btrfs: cleanup: don't check the same thing twice
> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/fs/btrfs?h=for-linus&id=3c64a1aba7cfcb04
> 
> I tested a kernel which reverted this change, and also added WARN_ON
> lines to provide a back trace.
> 

Well that works too :).  I'll look at this when I get back from the doctor in a
few hours and see if I can't figure out why it started happening.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Aug. 21, 2013, 2:04 p.m. UTC | #2
On Wed, Aug 21, 2013 at 08:44:55AM -0500, Mitch Harder wrote:
> On Thu, Aug 15, 2013 at 12:29 PM, Mitch Harder
> <mitch.harder@sabayonlinux.org> wrote:
> > I'm running into a curious problem.
> >
> > In the process of making my script portable, I am breaking the ability
> > to replicate the error.
> >
> > I'm trying to isolate the aspect of my local script that is triggering
> > the error.  No firm insights yet.
> >
> >
> > On Tue, Aug 13, 2013 at 11:03 AM, Mitch Harder
> > <mitch.harder@sabayonlinux.org> wrote:
> >> Let me work on making that script more portable, and hopefully quicker
> >> to reproduce.
> >>
> >> On Tue, Aug 13, 2013 at 9:15 AM, Josef Bacik <jbacik@fusionio.com> wrote:
> >>> On Mon, Aug 12, 2013 at 11:06:27PM -0500, Mitch Harder wrote:
> >>>> I'm hitting a btrfs Kernel BUG running a snapshot stress script with
> >>>> linux-3.11.0-rc5.
> >>>>
> >>>
> >>> I can haz script?  Thanks,
> >>>
> 
> I've had a hard time assembling a portable reproducer for this issue.
> 
> I discovered that my reproducer was highly dependent on a local
> archive of out-of-date git kernel sources.  My efforts to reproduce
> the error with a portable set of scripts with publicly available
> kernel git sources weren't successful.
> 
> It seems like this issue is related to a corner-case workload that is
> difficult to reproduce.
> 
> So I've bisected the error I was seeing with my local script, and
> identified the following commit as triggering my issue:
> 
> commit:    3c64a1aba7cfcb04f79e76f859b3d66660275d59
> Btrfs: cleanup: don't check the same thing twice
> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/fs/btrfs?h=for-linus&id=3c64a1aba7cfcb04
> 
> I tested a kernel which reverted this change, and also added WARN_ON
> lines to provide a back trace.
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 4b86916..336d628 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -82,6 +82,12 @@ static struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid,
>          goto fail;
>      }
> 
> +    if (btrfs_root_refs(&root->root_item) == 0) {
> +        WARN_ON(1);
> +        err = -ENOENT;
> +        goto fail;
> +    }
> +
>      key.objectid = objectid;
>      btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
>      key.offset = 0;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 94413af..4010257 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -310,6 +310,12 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>          goto cleanup;
>      }
> 
> +    if (btrfs_root_refs(&inode_root->root_item) == 0) {
> +        WARN_ON(1);
> +        ret = -ENOENT;
> +        goto cleanup;
> +    }
> +

Funnily enough, I just added this check back in a different commit.  Now that I
look at the reasoning, though, this cleanup patch was wrong.  We do check whether
root_refs is 0 in btrfs_read_fs_root_no_name, but only if the root isn't already
in cache.  If it is in cache, we will happily return it with no issue.  So either
we should add the extra check for the in-cache case (probably a good idea), or
go back and restore all of these checks.  Thanks,

Josef
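
The cache behaviour described above can be illustrated with a small
user-space model.  This is only a sketch under stated assumptions, not the
kernel implementation: struct model_root, model_read_root() and
model_read_live_root() are hypothetical names invented for this example, and
the "cache" is reduced to a flag on each entry.  It shows why a lookup that
checks refs == 0 only on the cache-miss path can hand back a root whose
snapshot has since been deleted, and how a caller-side check (like the ones
in the patch) catches that case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_root (not kernel code). */
struct model_root {
	uint64_t objectid;
	uint32_t refs;     /* 0 means the subvolume/snapshot has been deleted */
	int in_cache;      /* nonzero once a lookup has cached this root */
};

/*
 * Models the lookup behaviour described above: the refs == 0 check runs
 * only on the cache-miss path, so an already-cached root is returned
 * even after its refs dropped to 0.
 * Returns 0 and sets *out on success, -ENOENT otherwise.
 */
int model_read_root(struct model_root *roots, size_t n, uint64_t objectid,
		    struct model_root **out)
{
	assert(out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (roots[i].objectid != objectid)
			continue;
		if (roots[i].in_cache) {
			*out = &roots[i];   /* cache hit: no refs check */
			return 0;
		}
		if (roots[i].refs == 0)
			return -ENOENT;     /* cache miss: dead root rejected */
		roots[i].in_cache = 1;
		*out = &roots[i];
		return 0;
	}
	return -ENOENT;
}

/* Caller-side re-check, analogous to the checks the patch restores. */
int model_read_live_root(struct model_root *roots, size_t n, uint64_t objectid,
			 struct model_root **out)
{
	int ret = model_read_root(roots, n, objectid, out);

	if (ret)
		return ret;
	if ((*out)->refs == 0) {
		*out = NULL;
		return -ENOENT;
	}
	return 0;
}
```

In this model, looking up a live root, dropping its refs to 0, and looking it
up again still succeeds through model_read_root(), while
model_read_live_root() returns -ENOENT.  Moving the refs check into the
cache-hit branch would be the model's equivalent of the "extra check for the
in-cache case" option.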
Stefan Behrens Aug. 23, 2013, 8:48 a.m. UTC | #3
On Wed, 21 Aug 2013 08:44:55 -0500, Mitch Harder wrote:
> I've had a hard time assembling a portable reproducer for this issue.
> 
> I discovered that my reproducer was highly dependent on a local
> archive of out-of-date git kernel sources.  My efforts to reproduce
> the error with a portable set of scripts with publicly available
> kernel git sources weren't successful.
> 
> It seems like this issue is related to a corner-case workload that is
> difficult to reproduce.
> 
> So I've bisected the error I was seeing with my local script, and
> identified the following commit as triggering my issue:
> 
> commit:    3c64a1aba7cfcb04f79e76f859b3d66660275d59
> Btrfs: cleanup: don't check the same thing twice
> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/fs/btrfs?h=for-linus&id=3c64a1aba7cfcb04
> 
> I tested a kernel which reverted this change, and also added WARN_ON
> lines to provide a back trace.
[...]
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cd46e2c..a1091f7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2302,6 +2302,12 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
>              return 0;
>          return PTR_ERR(root);
>      }
> +    if (btrfs_root_refs(&root->root_item) == 0) {
> +        srcu_read_unlock(&fs_info->subvol_srcu, index);
> +        /* parse ENOENT to 0 */
> +        WARN_ON(1);
> +        return 0;
> +    }
[...]
> [ 1616.886868] ------------[ cut here ]------------
> [ 1616.886912] WARNING: at fs/btrfs/inode.c:2308 relink_extent_backref+0x103/0x721 [btrfs]()
> [ 1616.887050] Call Trace:
> [ 1616.887064] [<ffffffff8161a34a>] dump_stack+0x19/0x1b
> [ 1616.887071] [<ffffffff8103035a>] warn_slowpath_common+0x67/0x80
> [ 1616.887077] [<ffffffff8103038d>] warn_slowpath_null+0x1a/0x1c
> [ 1616.887100] [<ffffffffa019ea82>] relink_extent_backref+0x103/0x721
> [ 1616.887205] [<ffffffffa019f7e2>] btrfs_finish_ordered_io+0x742/0x829

Mitch,

Thank you for this excellent work in finding the cause of the issue. I've sent a
patch "Btrfs: fix for patch "cleanup: don't check the same thing twice"" and
would appreciate it if you could repeat your test, just to make sure, because I
was never able to reproduce this issue myself.


Mitch Harder Aug. 23, 2013, 4:22 p.m. UTC | #4
On Fri, Aug 23, 2013 at 3:48 AM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On Wed, 21 Aug 2013 08:44:55 -0500, Mitch Harder wrote:
>> I've had a hard time assembling a portable reproducer for this issue.
>>
>> I discovered that my reproducer was highly dependent on a local
>> archive of out-of-date git kernel sources.  My efforts to reproduce
>> the error with a portable set of scripts with publicly available
>> kernel git sources weren't successful.
>>
>> It seems like this issue is related to a corner-case workload that is
>> difficult to reproduce.
>>
>> So I've bisected the error I was seeing with my local script, and
>> identified the following commit as triggering my issue:
>>
>> commit:    3c64a1aba7cfcb04f79e76f859b3d66660275d59
>> Btrfs: cleanup: don't check the same thing twice
>> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/fs/btrfs?h=for-linus&id=3c64a1aba7cfcb04
>>
>> I tested a kernel which reverted this change, and also added WARN_ON
>> lines to provide a back trace.
> [...]
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index cd46e2c..a1091f7 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -2302,6 +2302,12 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
>>              return 0;
>>          return PTR_ERR(root);
>>      }
>> +    if (btrfs_root_refs(&root->root_item) == 0) {
>> +        srcu_read_unlock(&fs_info->subvol_srcu, index);
>> +        /* parse ENOENT to 0 */
>> +        WARN_ON(1);
>> +        return 0;
>> +    }
> [...]
>> [ 1616.886868] ------------[ cut here ]------------
>> [ 1616.886912] WARNING: at fs/btrfs/inode.c:2308 relink_extent_backref+0x103/0x721 [btrfs]()
>> [ 1616.887050] Call Trace:
>> [ 1616.887064] [<ffffffff8161a34a>] dump_stack+0x19/0x1b
>> [ 1616.887071] [<ffffffff8103035a>] warn_slowpath_common+0x67/0x80
>> [ 1616.887077] [<ffffffff8103038d>] warn_slowpath_null+0x1a/0x1c
>> [ 1616.887100] [<ffffffffa019ea82>] relink_extent_backref+0x103/0x721
>> [ 1616.887205] [<ffffffffa019f7e2>] btrfs_finish_ordered_io+0x742/0x829
>
> Mitch,
>
> Thank you for this excellent work in finding the cause of the issue. I've sent a
> patch "Btrfs: fix for patch "cleanup: don't check the same thing twice"" and
> would appreciate it if you could repeat your test, just to make sure, because I
> was never able to reproduce this issue myself.
>

Thanks.

I've tested my "special" workload with your patch on the latest
3.11-rc6 kernel, and the patch corrects the errors I was encountering.
Patch

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 4b86916..336d628 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -82,6 +82,12 @@ static struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid,
         goto fail;
     }

+    if (btrfs_root_refs(&root->root_item) == 0) {
+        WARN_ON(1);
+        err = -ENOENT;
+        goto fail;
+    }
+
     key.objectid = objectid;
     btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
     key.offset = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 94413af..4010257 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -310,6 +310,12 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
         goto cleanup;
     }

+    if (btrfs_root_refs(&inode_root->root_item) == 0) {
+        WARN_ON(1);
+        ret = -ENOENT;
+        goto cleanup;
+    }
+
     key.objectid = defrag->ino;
     btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
     key.offset = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cd46e2c..a1091f7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2302,6 +2302,12 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
             return 0;
         return PTR_ERR(root);
     }
+    if (btrfs_root_refs(&root->root_item) == 0) {
+        srcu_read_unlock(&fs_info->subvol_srcu, index);
+        /* parse ENOENT to 0 */
+        WARN_ON(1);
+        return 0;
+    }

     /* step 2: get inode */
     key.objectid = backref->inum;
@@ -4703,6 +4709,12 @@ static int fixup_tree_root_location(struct btrfs_root *root,
         goto out;
     }

+    if (btrfs_root_refs(&new_root->root_item) == 0) {
+        WARN_ON(1);
+        err = -ENOENT;
+        goto out;
+    }
+
     *sub_root = new_root;
     location->objectid = btrfs_root_dirid(&new_root->root_item);
     location->type = BTRFS_INODE_ITEM_KEY;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0e17a30..0f74235 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2969,6 +2969,12 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
         goto out;
     }

+    if (btrfs_root_refs(&new_root->root_item) == 0) {
+        WARN_ON(1);
+        ret = -ENOENT;
+        goto out;
+    }
+
     path = btrfs_alloc_path();
     if (!path) {
         ret = -ENOMEM;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b267c3c..3cf4716 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -793,6 +793,11 @@ find_root:
     if (IS_ERR(new_root))
         return ERR_CAST(new_root);

+    if (btrfs_root_refs(&new_root->root_item) == 0) {
+        WARN_ON(1);
+        return ERR_PTR(-ENOENT);
+    }
+
     dir_id = btrfs_root_dirid(&new_root->root_item);
 setup_root:
     location.objectid = dir_id;