| Message ID | 20190817074439.84C6C1056A3@localhost.localdomain |
|---|---|
| State | New, archived |
| Series | btrfs: fix allocation of bitmap pages. |
On Sat, Aug 17, 2019 at 07:44:39AM +0000, Christophe Leroy wrote:
> Various notifications of type "BUG kmalloc-4096 () : Redzone
> overwritten" have been observed recently in various parts of
> the kernel. After some time, a link was established with
> the use of the BTRFS filesystem.
>
> [   22.809700] BUG kmalloc-4096 (Tainted: G        W  ): Redzone overwritten
> [   22.809971] -----------------------------------------------------------------------------
>
> [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
> [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
> [   22.811193] 	__slab_alloc.constprop.26+0x44/0x70
> [   22.811345] 	kmem_cache_alloc_trace+0xf0/0x2ec
> [   22.811588] 	__load_free_space_cache+0x588/0x780 [btrfs]
> [   22.811848] 	load_free_space_cache+0xf4/0x1b0 [btrfs]
> [   22.812090] 	cache_block_group+0x1d0/0x3d0 [btrfs]
> [   22.812321] 	find_free_extent+0x680/0x12a4 [btrfs]
> [   22.812549] 	btrfs_reserve_extent+0xec/0x220 [btrfs]
> [   22.812785] 	btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
> [   22.813032] 	__btrfs_cow_block+0x150/0x5d4 [btrfs]
> [   22.813262] 	btrfs_cow_block+0x194/0x298 [btrfs]
> [   22.813484] 	commit_cowonly_roots+0x44/0x294 [btrfs]
> [   22.813718] 	btrfs_commit_transaction+0x63c/0xc0c [btrfs]
> [   22.813973] 	close_ctree+0xf8/0x2a4 [btrfs]
> [   22.814107] 	generic_shutdown_super+0x80/0x110
> [   22.814250] 	kill_anon_super+0x18/0x30
> [   22.814437] 	btrfs_kill_super+0x18/0x90 [btrfs]
> [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
> [   22.814841] 	proc_cgroup_show+0xc0/0x248
> [   22.814967] 	proc_single_show+0x54/0x98
> [   22.815086] 	seq_read+0x278/0x45c
> [   22.815190] 	__vfs_read+0x28/0x17c
> [   22.815289] 	vfs_read+0xa8/0x14c
> [   22.815381] 	ksys_read+0x50/0x94
> [   22.815475] 	ret_from_syscall+0x0/0x38
>
> Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead
> of memcpy") changed the way bitmap blocks are copied. But although
> bitmaps have the size of a page, they were allocated with kzalloc().
>
> Most of the time, kzalloc() allocates aligned blocks of memory, so
> copy_page() can be used. But when some debug options like SLAB_DEBUG
> are activated, kzalloc() may return an unaligned pointer.
>
> On powerpc, memcpy(), copy_page() and other copying functions use
> the 'dcbz' instruction, which provides an entire zeroed cacheline to
> avoid a memory read when the intention is to overwrite a full line.
> Functions like memcpy() are written to care about partial cachelines
> at the start and end of the destination, but copy_page() assumes it
> gets pages.

This assumption is not documented, nor are any pitfalls mentioned in
include/asm-generic/page.h, which provides the generic implementation.
As an API user I cannot check each arch implementation for additional
constraints, so I would expect it to deal with the boundary cases the
same way as the arch-specific memcpy implementations do.

Another thing that is lost is the slub debugging support for all
architectures, because get_zeroed_page lacks the red zones and sanity
checks.

I find working with raw pages in this code a bit inconsistent with the
rest of the btrfs code, but that's rather minor compared to the above.

Summing it up, I think that the proper fix should go into the copy_page
implementation on the architectures that require it, or make clear what
the copy_page constraints are.
On 19/08/2019 19:46, David Sterba wrote:
> On Sat, Aug 17, 2019 at 07:44:39AM +0000, Christophe Leroy wrote:
>> Various notifications of type "BUG kmalloc-4096 () : Redzone
>> overwritten" have been observed recently in various parts of
>> the kernel. [...]
>>
>> On powerpc, memcpy(), copy_page() and other copying functions use
>> the 'dcbz' instruction, which provides an entire zeroed cacheline to
>> avoid a memory read when the intention is to overwrite a full line.
>> Functions like memcpy() are written to care about partial cachelines
>> at the start and end of the destination, but copy_page() assumes it
>> gets pages.
>
> This assumption is not documented, nor are any pitfalls mentioned in
> include/asm-generic/page.h, which provides the generic implementation.
> As an API user I cannot check each arch implementation for additional
> constraints, so I would expect it to deal with the boundary cases the
> same way as the arch-specific memcpy implementations do.

For me, copy_page() is there to ... copy pages. Not to copy any piece of
RAM that happens to have the size of a page.

But it has happened to others too. See commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d72e9a7a93e4f8e9e52491921d99e0c8aa89eb4e

> Another thing that is lost is the slub debugging support for all
> architectures, because get_zeroed_page lacks the red zones and sanity
> checks.
>
> I find working with raw pages in this code a bit inconsistent with the
> rest of the btrfs code, but that's rather minor compared to the above.

What about using kmem_cache instead? I see kmem_cache is already widely
used in btrfs, so using it also for blocks of memory of size PAGE_SIZE
should be OK? AFAICS, kmem_cache has the red zones and sanity checks.

> Summing it up, I think that the proper fix should go into the
> copy_page implementation on the architectures that require it, or make
> clear what the copy_page constraints are.

I guess anybody using copy_page() to copy something other than a page is
on their own. But following that (bad) experience, I propose a patch to
at least detect it early, see https://patchwork.ozlabs.org/patch/1148033/

Christophe
On Mon, Aug 19, 2019 at 07:46:00PM +0200, David Sterba wrote:
> Another thing that is lost is the slub debugging support for all
> architectures, because get_zeroed_page lacks the red zones and sanity
> checks.
>
> I find working with raw pages in this code a bit inconsistent with the
> rest of the btrfs code, but that's rather minor compared to the above.
>
> Summing it up, I think that the proper fix should go into the
> copy_page implementation on the architectures that require it, or make
> clear what the copy_page constraints are.

The whole point of copy_page is to copy exactly one page, and it makes
sense to assume that page is aligned. A sane memcpy would use the same
underlying primitives as well after checking they fit. So I think the
prime issue here is btrfs' use of copy_page instead of memcpy. The
secondary issue is slub fucking up alignments for no good reason. We
just got bitten by that crap again in XFS as well :(
On 8/20/19 4:30 AM, Christoph Hellwig wrote:
> On Mon, Aug 19, 2019 at 07:46:00PM +0200, David Sterba wrote:
>> Another thing that is lost is the slub debugging support for all
>> architectures, because get_zeroed_page lacks the red zones and sanity
>> checks. [...]
>
> The whole point of copy_page is to copy exactly one page, and it makes
> sense to assume that page is aligned. A sane memcpy would use the same
> underlying primitives as well after checking they fit. So I think the
> prime issue here is btrfs' use of copy_page instead of memcpy. The
> secondary issue is slub fucking up alignments for no good reason. We
> just got bitten by that crap again in XFS as well :(

Meh, I should finally get back to https://lwn.net/Articles/787740/ right
On Tue, Aug 20, 2019 at 01:06:25PM +0200, Vlastimil Babka wrote:
>> The whole point of copy_page is to copy exactly one page, and it
>> makes sense to assume that page is aligned. [...] The secondary
>> issue is slub fucking up alignments for no good reason. We just got
>> bitten by that crap again in XFS as well :(
>
> Meh, I should finally get back to https://lwn.net/Articles/787740/ right

Yes. For now Dave came up with an idea for a workaround that will be
forward-compatible with that:

https://www.spinics.net/lists/linux-xfs/msg30521.html
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 062be9dde4c6..3229a058e025 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -764,7 +764,7 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		} else {
 			ASSERT(num_bitmaps);
 			num_bitmaps--;
-			e->bitmap = kzalloc(PAGE_SIZE, GFP_NOFS);
+			e->bitmap = (void *)get_zeroed_page(GFP_NOFS);
 			if (!e->bitmap) {
 				kmem_cache_free(
 					btrfs_free_space_cachep, e);
@@ -1881,7 +1881,7 @@ static void free_bitmap(struct btrfs_free_space_ctl *ctl,
 			struct btrfs_free_space *bitmap_info)
 {
 	unlink_free_space(ctl, bitmap_info);
-	kfree(bitmap_info->bitmap);
+	free_page((unsigned long)bitmap_info->bitmap);
 	kmem_cache_free(btrfs_free_space_cachep, bitmap_info);
 	ctl->total_bitmaps--;
 	ctl->op->recalc_thresholds(ctl);
@@ -2135,7 +2135,7 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 	}

 	/* allocate the bitmap */
-	info->bitmap = kzalloc(PAGE_SIZE, GFP_NOFS);
+	info->bitmap = (void *)get_zeroed_page(GFP_NOFS);
 	spin_lock(&ctl->tree_lock);
 	if (!info->bitmap) {
 		ret = -ENOMEM;
@@ -2146,7 +2146,7 @@ static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
 	}

 out:
 	if (info) {
-		kfree(info->bitmap);
+		free_page((unsigned long)info->bitmap);
 		kmem_cache_free(btrfs_free_space_cachep, info);
 	}
@@ -2802,7 +2802,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 	if (entry->bytes == 0) {
 		ctl->free_extents--;
 		if (entry->bitmap) {
-			kfree(entry->bitmap);
+			free_page((unsigned long)entry->bitmap);
 			ctl->total_bitmaps--;
 			ctl->op->recalc_thresholds(ctl);
 		}
@@ -3606,7 +3606,7 @@ int test_add_free_space_entry(struct btrfs_block_group_cache *cache,
 	}

 	if (!map) {
-		map = kzalloc(PAGE_SIZE, GFP_NOFS);
+		map = (void *)get_zeroed_page(GFP_NOFS);
 		if (!map) {
 			kmem_cache_free(btrfs_free_space_cachep, info);
 			return -ENOMEM;
@@ -3635,7 +3635,7 @@ int test_add_free_space_entry(struct btrfs_block_group_cache *cache,
 	if (info)
 		kmem_cache_free(btrfs_free_space_cachep, info);
-	kfree(map);
+	free_page((unsigned long)map);

 	return 0;
 }