[v2] hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
diff mbox

Message ID CAJMB+Nid61a1toRWFFSckEBwfWbAgJEOS1RQL2Efg2DgqhPieQ@mail.gmail.com
State New
Headers show

Commit Message

Hin-Tak Leung June 30, 2015, 3:55 p.m. UTC
Michael Fox: FYI. If you feel like responding (and possibly add a "Tested-By:"),
please include liux-fs-devel linux-fsdevel in the reply.


---------- Forwarded message ----------
From: Hin-Tak Leung <hintak.leung@gmail.com>
Date: 28 June 2015 at 02:39
Subject: [PATCH v2] hfs,hfsplus: cache pages correctly between
bnode_create and bnode_free
To: linux-fsdevel@vger.kernel.org
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>, Sergei Antonov
<saproj@gmail.com>, Al Viro <viro@zeniv.linux.org.uk>, Christoph
Hellwig <hch@infradead.org>, Andrew Morton
<akpm@linux-foundation.org>, Vyacheslav Dubeyko <slava@dubeyko.com>,
Sougata Santra <sougata@tuxera.com>


From: Hin-Tak Leung <htl10@users.sourceforge.net>

Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode) are
immediately kmap()'ed and used (both read and write) and kunmap()'ed, and
should not be page_cache_release()'ed until hfs_bnode_free().

This patch fixes a problem I first saw in July 2012: merely running "du" on a
large hfsplus-mounted directory a few times on a reasonably loaded system would
get the hfsplus driver all confused and complaining about B-tree
inconsistencies, and generates a "BUG: Bad page state". Most recently, I can
generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by
running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt"
simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of
the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/fedora-root    3276800  551665    2725135   17% /
/dev/mapper/fedora-home   52879360  716221   52163139    2% /home
/dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and
"du /mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load and
generating "BUG: Bad page state" or other similar issues over the years. [1]

The unpatched code [2] has always been wrong since it entered the kernel tree.
The only reason why it gets away with it is that the kmap/memcpy/kunmap follow
very quickly after the page_cache_release() so the kernel has not had a chance
to reuse the memory for something else, most of the time.

The current RW driver appears to have followed the design and development of
the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a
B-tree node-centric approach to read_cache_page()/page_cache_release() per
bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching
and releasing pages per inode extents. When the current RW code first entered
the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented
out code) to switch between B-node centric paging to inode-centric paging.
There was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating the
"REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release()
in bnode_release() (which should be spanned by !REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date:   Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa014029eb6795664c704953720cc7f7d
Author: Roman Zippel <zippel@linux-m68k.org>
Date:   Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
---
 fs/hfs/bnode.c     | 9 ++++-----
 fs/hfsplus/bnode.c | 3 ---
 2 files changed, 4 insertions(+), 8 deletions(-)


@@ -566,13 +565,11 @@ node_error:

 void hfs_bnode_free(struct hfs_bnode *node)
 {
-#if 0
        int i;

        for (i = 0; i < node->tree->pages_per_bnode; i++)
                if (node->page[i])
                        page_cache_release(node->page[i]);
-#endif
        kfree(node);
 }

--
2.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Hin-Tak Leung July 1, 2015, 11:29 p.m. UTC | #1
Hopefully this should go through. I am posting from my gmail account
(which I rarely use)
as vger.kernel.org (not just linux-fsdevel@, but also git@) is bouncing
my @users.sf.net account also.

On 30 June 2015 at 17:59, Michael Fox <415fox@gmail.com> wrote:
> By the way, linux-fsdevel@vger.kernel.org, bounced my response.
>
> On Tue, Jun 30, 2015 at 9:56 AM, Michael Fox <415fox@gmail.com> wrote:
>>
>> Hi Hin-Tak. I gave up on dual-booting many hard drives ago so I have
>> nowhere to test your patch. Thank you for your work. Your e-mail is
>> especially thorough and professional. I hope your patch gets accepted soon.
>>
>> On Tue, Jun 30, 2015 at 8:55 AM, Hin-Tak Leung <hintak.leung@gmail.com>
>> wrote:
>>>
>>> Michael Fox: FYI. If you feel like responding (and possibly add a
>>> "Tested-By:"),
>>> please include liux-fs-devel linux-fsdevel in the reply.
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Hin-Tak Leung <hintak.leung@gmail.com>
>>> Date: 28 June 2015 at 02:39
>>> Subject: [PATCH v2] hfs,hfsplus: cache pages correctly between
>>> bnode_create and bnode_free
>>> To: linux-fsdevel@vger.kernel.org
>>> Cc: Hin-Tak Leung <htl10@users.sourceforge.net>, Sergei Antonov
>>> <saproj@gmail.com>, Al Viro <viro@zeniv.linux.org.uk>, Christoph
>>> Hellwig <hch@infradead.org>, Andrew Morton
>>> <akpm@linux-foundation.org>, Vyacheslav Dubeyko <slava@dubeyko.com>,
>>> Sougata Santra <sougata@tuxera.com>
>>>
>>>
>>> From: Hin-Tak Leung <htl10@users.sourceforge.net>
>>>
>>> Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
>>> hfs_bnode_find() for finding or creating pages corresponding to an inode)
>>> are
>>> immediately kmap()'ed and used (both read and write) and kunmap()'ed, and
>>> should not be page_cache_release()'ed until hfs_bnode_free().
>>>
>>> This patch fixes a problem I first saw in July 2012: merely running "du"
>>> on a
>>> large hfsplus-mounted directory a few times on a reasonably loaded system
>>> would
>>> get the hfsplus driver all confused and complaining about B-tree
>>> inconsistencies, and generates a "BUG: Bad page state". Most recently, I
>>> can
>>> generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5,
>>> by
>>> running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du
>>> /mnt"
>>> simultaneously on two windows, where /mnt is a lightly-used QEMU VM image
>>> of
>>> the full Mac OS X 10.9:
>>>
>>> $ df -i / /home /mnt
>>> Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
>>> /dev/mapper/fedora-root    3276800  551665    2725135   17% /
>>> /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
>>> /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt
>>>
>>> After applying the patch, I was able to run "du /" (60+ times) and
>>> "du /mnt" (150+ times) continuously and simultaneously for 6+ hours.
>>>
>>> There are many reports of the hfsplus driver getting confused under load
>>> and
>>> generating "BUG: Bad page state" or other similar issues over the years.
>>> [1]
>>>
>>> The unpatched code [2] has always been wrong since it entered the kernel
>>> tree.
>>> The only reason why it gets away with it is that the kmap/memcpy/kunmap
>>> follow
>>> very quickly after the page_cache_release() so the kernel has not had a
>>> chance
>>> to reuse the memory for something else, most of the time.
>>>
>>> The current RW driver appears to have followed the design and development
>>> of
>>> the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001)
>>> had a
>>> B-tree node-centric approach to read_cache_page()/page_cache_release()
>>> per
>>> bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of
>>> caching
>>> and releasing pages per inode extents. When the current RW code first
>>> entered
>>> the kernel [2] in 2005, there was an REF_PAGES conditional (and "//"
>>> commented
>>> out code) to switch between B-node centric paging to inode-centric
>>> paging.
>>> There was a mistake with the direction of one of the REF_PAGES
>>> conditionals in
>>> __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
>>> read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
>>> removed, but a page_cache_release() was mistakenly left in (propagating
>>> the
>>> "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
>>> page_cache_release()
>>> in bnode_release() (which should be spanned by !REF_PAGES) was never
>>> enabled.
>>>
>>> References:
>>> [1]:
>>> Michael Fox, Apr 2013
>>> http://www.spinics.net/lists/linux-fsdevel/msg63807.html
>>> ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
>>>
>>> Sasha Levin, Feb 2015
>>> http://lkml.org/lkml/2015/2/20/85 ("use after free")
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
>>> https://bugzilla.kernel.org/show_bug.cgi?id=42342
>>> https://bugzilla.kernel.org/show_bug.cgi?id=63841
>>> https://bugzilla.kernel.org/show_bug.cgi?id=78761
>>>
>>> [2]:
>>> http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
>>> fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
>>> commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
>>> Author: Andrew Morton <akpm@osdl.org>
>>> Date:   Wed Feb 25 16:17:36 2004 -0800
>>>
>>>     [PATCH] HFS rewrite
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
>>> fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd
>>>
>>> commit 91556682e0bf004d98a529bf829d339abb98bbbd
>>> Author: Andrew Morton <akpm@osdl.org>
>>> Date:   Wed Feb 25 16:17:48 2004 -0800
>>>
>>>     [PATCH] HFS+ support
>>>
>>> [3]:
>>> http://sourceforge.net/projects/linux-hfsplus/
>>>
>>>
>>> http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
>>>
>>> http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
>>>
>>> http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
>>> fs/hfsplus/bnode.c?r1=1.4&r2=1.5
>>>
>>> Date:   Thu Jun 6 09:45:14 2002 +0000
>>> Use buffer cache instead of page cache in bnode.c. Cache inode extents.
>>>
>>> [4]:
>>> http://git.kernel.org/cgit/linux/kernel/git/\
>>>
>>> stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d
>>>
>>> commit a5e3985fa014029eb6795664c704953720cc7f7d
>>> Author: Roman Zippel <zippel@linux-m68k.org>
>>> Date:   Tue Sep 6 15:18:47 2005 -0700
>>>
>>> [PATCH] hfs: remove debug code
>>>
>>> Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
>>> Signed-off-by: Sergei Antonov <saproj@gmail.com>
>>> Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
>>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>>> Cc: Christoph Hellwig <hch@infradead.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
>>> Cc: Sougata Santra <sougata@tuxera.com>
>>> ---
>>>  fs/hfs/bnode.c     | 9 ++++-----
>>>  fs/hfsplus/bnode.c | 3 ---
>>>  2 files changed, 4 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/hfs/bnode.c b/fs/hfs/bnode.c
>>> index d3fa6bd..221719e 100644
>>> --- a/fs/hfs/bnode.c
>>> +++ b/fs/hfs/bnode.c
>>> @@ -288,7 +288,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct
>>> hfs_btree *tree, u32 cnid)
>>>                         page_cache_release(page);
>>>                         goto fail;
>>>                 }
>>> -               page_cache_release(page);
>>>                 node->page[i] = page;
>>>         }
>>>
>>> @@ -398,11 +397,11 @@ node_error:
>>>
>>>  void hfs_bnode_free(struct hfs_bnode *node)
>>>  {
>>> -       //int i;
>>> +       int i;
>>>
>>> -       //for (i = 0; i < node->tree->pages_per_bnode; i++)
>>> -       //      if (node->page[i])
>>> -       //              page_cache_release(node->page[i]);
>>> +       for (i = 0; i < node->tree->pages_per_bnode; i++)
>>> +               if (node->page[i])
>>> +                       page_cache_release(node->page[i]);
>>>         kfree(node);
>>>  }
>>>
>>> diff --git a/fs/hfsplus/bnode.c b/fs/hfsplus/bnode.c
>>> index 759708f..6392466 100644
>>> --- a/fs/hfsplus/bnode.c
>>> +++ b/fs/hfsplus/bnode.c
>>> @@ -454,7 +454,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct
>>> hfs_btree *tree, u32 cnid)
>>>                         page_cache_release(page);
>>>                         goto fail;
>>>                 }
>>> -               page_cache_release(page);
>>>                 node->page[i] = page;
>>>         }
>>>
>>> @@ -566,13 +565,11 @@ node_error:
>>>
>>>  void hfs_bnode_free(struct hfs_bnode *node)
>>>  {
>>> -#if 0
>>>         int i;
>>>
>>>         for (i = 0; i < node->tree->pages_per_bnode; i++)
>>>                 if (node->page[i])
>>>                         page_cache_release(node->page[i]);
>>> -#endif
>>>         kfree(node);
>>>  }
>>>
>>> --
>>> 2.4.3
>>
>>
>>
>>
>> --
>>
>> -
>> Michael
>
>
>
>
> --
>
> -
> Michael
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/hfs/bnode.c b/fs/hfs/bnode.c
index d3fa6bd..221719e 100644
--- a/fs/hfs/bnode.c
+++ b/fs/hfs/bnode.c
@@ -288,7 +288,6 @@  static struct hfs_bnode *__hfs_bnode_create(struct
hfs_btree *tree, u32 cnid)
                        page_cache_release(page);
                        goto fail;
                }
-               page_cache_release(page);
                node->page[i] = page;
        }

@@ -398,11 +397,11 @@  node_error:

 void hfs_bnode_free(struct hfs_bnode *node)
 {
-       //int i;
+       int i;

-       //for (i = 0; i < node->tree->pages_per_bnode; i++)
-       //      if (node->page[i])
-       //              page_cache_release(node->page[i]);
+       for (i = 0; i < node->tree->pages_per_bnode; i++)
+               if (node->page[i])
+                       page_cache_release(node->page[i]);
        kfree(node);
 }

diff --git a/fs/hfsplus/bnode.c b/fs/hfsplus/bnode.c
index 759708f..6392466 100644
--- a/fs/hfsplus/bnode.c
+++ b/fs/hfsplus/bnode.c
@@ -454,7 +454,6 @@  static struct hfs_bnode *__hfs_bnode_create(struct
hfs_btree *tree, u32 cnid)
                        page_cache_release(page);
                        goto fail;
                }
-               page_cache_release(page);
                node->page[i] = page;
        }