Message ID | 7a14a2b897cbeb9a149bed18397ead70ec79345a.1731407982.git.jth@kernel.org
State      | New
Series     | btrfs: fix use-after-free in btrfs_encoded_read_endio
On Tue, Nov 12, 2024 at 1:54 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free that sometimes happens
> in our CI system when running fstests' btrfs/284 on a TCMU runner
> device:
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
> Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
>
> CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
> Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
> Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x6e/0xa0
>  ? lock_release+0x708/0x780
>  print_report+0x174/0x505
>  ? lock_release+0x708/0x780
>  ? __virt_addr_valid+0x224/0x410
>  ? lock_release+0x708/0x780
>  kasan_report+0xda/0x1b0
>  ? lock_release+0x708/0x780
>  ? __wake_up+0x44/0x60
>  lock_release+0x708/0x780
>  ? __pfx_lock_release+0x10/0x10
>  ? __pfx_do_raw_spin_lock+0x10/0x10
>  ? lock_is_held_type+0x9a/0x110
>  _raw_spin_unlock_irqrestore+0x1f/0x60
>  __wake_up+0x44/0x60
>  btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
>  btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
>  ? lock_release+0x1b0/0x780
>  ? trace_lock_acquire+0x12f/0x1a0
>  ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
>  ? process_one_work+0x7e3/0x1460
>  ? lock_acquire+0x31/0xc0
>  ? process_one_work+0x7e3/0x1460
>  process_one_work+0x85c/0x1460
>  ? __pfx_process_one_work+0x10/0x10
>  ? assign_work+0x16c/0x240
>  worker_thread+0x5e6/0xfc0
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0x2c3/0x3a0
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x31/0x70
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>
>
> Allocated by task 3661:
>  kasan_save_stack+0x30/0x50
>  kasan_save_track+0x14/0x30
>  __kasan_kmalloc+0xaa/0xb0
>  btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
>  send_extent_data+0xf0f/0x24a0 [btrfs]
>  process_extent+0x48a/0x1830 [btrfs]
>  changed_cb+0x178b/0x2ea0 [btrfs]
>  btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
>  _btrfs_ioctl_send+0x117/0x330 [btrfs]
>  btrfs_ioctl+0x184a/0x60a0 [btrfs]
>  __x64_sys_ioctl+0x12e/0x1a0
>  do_syscall_64+0x95/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Freed by task 3661:
>  kasan_save_stack+0x30/0x50
>  kasan_save_track+0x14/0x30
>  kasan_save_free_info+0x3b/0x70
>  __kasan_slab_free+0x4f/0x70
>  kfree+0x143/0x490
>  btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
>  send_extent_data+0xf0f/0x24a0 [btrfs]
>  process_extent+0x48a/0x1830 [btrfs]
>  changed_cb+0x178b/0x2ea0 [btrfs]
>  btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
>  _btrfs_ioctl_send+0x117/0x330 [btrfs]
>  btrfs_ioctl+0x184a/0x60a0 [btrfs]
>  __x64_sys_ioctl+0x12e/0x1a0
>  do_syscall_64+0x95/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The buggy address belongs to the object at ffff888106a83f00
>  which belongs to the cache kmalloc-rnd-07-96 of size 96
> The buggy address is located 24 bytes inside of
>  freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
>
> The buggy address belongs to the physical page:
> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
> flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
> page_type: f5(slab)
> raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
> raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>  ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>                             ^
>  ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>  ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==================================================================
>
> Further analyzing the trace and the crash dump's vmcore file shows that
> the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
> the wait_queue that is in the private data passed to the end_io handler.
>
> Commit 4ff47df40447 ("btrfs: move priv off stack in
> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> btrfs_encoded_read_private' off the stack.
>
> Before that commit one can see a corruption of the private data when
> analyzing the vmcore after a crash:
>
> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
>   .wait = (wait_queue_head_t){
>     .lock = (spinlock_t){
>       .rlock = (struct raw_spinlock){
>         .raw_lock = (arch_spinlock_t){
>           .val = (atomic_t){
>             .counter = (int)-2005885696,
>           },
>           .locked = (u8)0,
>           .pending = (u8)157,
>           .locked_pending = (u16)40192,
>           .tail = (u16)34928,
>         },
>         .magic = (unsigned int)536325682,
>         .owner_cpu = (unsigned int)29,
>         .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
>         .dep_map = (struct lockdep_map){
>           .key = (struct lock_class_key *)0xffff8881575a3b6c,
>           .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
>           .name = (const char *)0xffff88815626f100 = "",
>           .wait_type_outer = (u8)37,
>           .wait_type_inner = (u8)178,
>           .lock_type = (u8)154,
>         },
>       },
>       .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
>       .dep_map = (struct lockdep_map){
>         .key = (struct lock_class_key *)0xffff8881575a3b6c,
>         .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
>         .name = (const char *)0xffff88815626f100 = "",
>         .wait_type_outer = (u8)37,
>         .wait_type_inner = (u8)178,
>         .lock_type = (u8)154,
>       },
>     },
>     .head = (struct list_head){
>       .next = (struct list_head *)0x112cca,
>       .prev = (struct list_head *)0x47,
>     },
>   },
>   .pending = (atomic_t){
>     .counter = (int)-1491499288,
>   },
>   .status = (blk_status_t)130,
> }
>
> Here we can see several indicators of in-memory data corruption, e.g. the
> large negative atomic values of ->pending or
> ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
> 0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead, or the bogus
> pointer values for ->wait->head.
>
> To fix this, move the call to bio_put() before the atomic test operation
> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
> woken up before the bio is cleaned up.

This is the part where I don't see the relation to the use-after-free
problem on the private structure.

This seems like a cleanup that should be a separate patch with its own
changelog.

> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation. This can lead to a situation where the counter value is
> already decremented but the if statement in btrfs_encoded_read_endio()
> is not completely processed, i.e. the 0 test has not completed. If
> another thread continues executing
> btrfs_encoded_read_regular_fill_pages() the atomic_dec_return() there
> can see an already updated ->pending counter and continue by freeing
> the private data. Continuing in the endio handler, the test for 0
> succeeds and the wait_queue is woken up, resulting in a use-after-free.

This is the sort of explanation that should have been in v1. Basically,
the non-atomicity of atomic_dec_return() can make the waiter see the 0
value and free the private structure before the waker does a wake_up()
against the private's wait queue.

So with the bio_put() change in a separate patch:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

(or with it, but with an explanation of how this relates to the
use-after-free, which I can't see)

Thanks.

> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 22b8e2764619..cb8b23a3e56b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
>  		 */
>  		WRITE_ONCE(priv->status, bbio->bio.bi_status);
>  	}
> -	if (atomic_dec_return(&priv->pending) == 0) {
> +	bio_put(&bbio->bio);
> +	if (atomic_dec_and_test(&priv->pending)) {
>  		int err = blk_status_to_errno(READ_ONCE(priv->status));
>
>  		if (priv->uring_ctx) {
> @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
>  			wake_up(&priv->wait);
>  		}
>  	}
> -	bio_put(&bbio->bio);
>  }
>
>  int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> --
> 2.43.0
>
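To make the race discussed above concrete outside of the kernel, the
following is a minimal userspace sketch of the same completion pattern.
Assumptions: this is not the btrfs code; struct read_private, endio() and
wait_and_free() are made-up names, and a pthread mutex/condvar stands in
for the kernel wait queue. The patch aims for the same effect locklessly,
via atomic_dec_and_test() plus making wake_up() the endio side's final
access to priv.

    #include <pthread.h>
    #include <stdlib.h>

    struct read_private {          /* stand-in for btrfs_encoded_read_private */
            int pending;           /* number of in-flight completions */
            pthread_mutex_t lock;  /* stands in for the wait queue lock */
            pthread_cond_t wait;   /* stands in for priv->wait */
    };

    /*
     * Completion side ("endio"): the decrement and the "was I last?" test
     * are atomic with respect to the waiter (here: done under the lock),
     * and waking is this thread's final access to priv. The bug pattern
     * is a decrement the waiter can observe before the waker is done
     * touching priv.
     */
    static void endio(struct read_private *priv)
    {
            pthread_mutex_lock(&priv->lock);
            if (--priv->pending == 0)
                    pthread_cond_signal(&priv->wait);
            pthread_mutex_unlock(&priv->lock); /* last access to priv */
    }

    /* Submitter side: wait for all completions, then free the private data. */
    static void wait_and_free(struct read_private *priv)
    {
            pthread_mutex_lock(&priv->lock);
            while (priv->pending > 0)
                    pthread_cond_wait(&priv->wait, &priv->lock);
            pthread_mutex_unlock(&priv->lock);
            free(priv); /* safe: every endio() already released the lock */
    }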
On 12.11.24 15:46, Filipe Manana wrote:
> This is the part where I don't see the relation to the use-after-free
> problem on the private structure.
> This seems like a cleanup that should be a separate patch with its own
> changelog.

I need to re-test first, because AFAIR both the bio_put() and the
atomic_dec_and_test() were needed to fix the bug on the test system.
On 2024/11/13 00:23, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free that sometimes happens
> in our CI system when running fstests' btrfs/284 on a TCMU runner
> device:

[... KASAN report and vmcore analysis identical to the patch quoted
above ...]

> To fix this, move the call to bio_put() before the atomic test operation
> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
> woken up before the bio is cleaned up.
>
> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation.

This means we should not utilize "atomic_dec_return() == 0" as a way to
do synchronization.

And unfortunately I'm also seeing other locations still utilizing the
same pattern inside btrfs_encoded_read_regular_fill_pages().

Shouldn't we also fix that call site, even if just for the sake of
consistency?

Thanks,
Qu

> This can lead to a situation where the counter value is already
> decremented but the if statement in btrfs_encoded_read_endio() is not
> completely processed, i.e. the 0 test has not completed. If another
> thread continues executing btrfs_encoded_read_regular_fill_pages() the
> atomic_dec_return() there can see an already updated ->pending counter
> and continue by freeing the private data. Continuing in the endio
> handler, the test for 0 succeeds and the wait_queue is woken up,
> resulting in a use-after-free.

[... tags and diff identical to the patch quoted above ...]
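For reference, the waiter side Qu points at followed the same idiom; the
snippet below is a paraphrase of the submitter-side pattern in
btrfs_encoded_read_regular_fill_pages() at the time, not a verbatim quote
of fs/btrfs/inode.c:

    /*
     * The submitter drops its own count on ->pending; if completions
     * are still in flight, it sleeps until the endio side wakes it.
     * Note the same "atomic_dec_return() == 0 as synchronization"
     * idiom as in the endio handler.
     */
    if (atomic_dec_return(&priv->pending) != 0)
            io_wait_event(priv->wait, !atomic_read(&priv->pending));
    /* From here on the submitter assumes exclusive ownership of priv. */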
On 12.11.24 21:51, Qu Wenruo wrote:
>> To fix this, move the call to bio_put() before the atomic test operation
>> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
>> woken up before the bio is cleaned up.
>>
>> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
>> corruption, as atomic_dec_return() is defined as two instructions on
>> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
>> operation.
>
> This means we should not utilize "atomic_dec_return() == 0" as a way to
> do synchronization.

At least not for reference counting, hence refcount_t doesn't even have
an equivalent.

> And unfortunately I'm also seeing other locations still utilizing the
> same pattern inside btrfs_encoded_read_regular_fill_pages().
>
> Shouldn't we also fix that call site, even if just for the sake of
> consistency?

I have no idea, TBH. The other user of atomic_dec_return() in btrfs is
in delayed-inode.c:finish_one_item():

	/* atomic_dec_return implies a barrier */
	if ((atomic_dec_return(&delayed_root->items) <
	     BTRFS_DELAYED_BACKGROUND || seq % BTRFS_DELAYED_BATCH == 0))
		cond_wake_up_nomb(&delayed_root->wait);

And that one looks safe in my eyes.
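By way of contrast, here is a sketch of the refcount_t idiom alluded to
above: refcount_t provides refcount_dec_and_test() and deliberately no
dec_return-style helper, so the release decision is always a single
atomic step. struct foo and foo_put() are made-up names for illustration:

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct foo {
            refcount_t refs;   /* hypothetical example object */
    };

    static void foo_put(struct foo *f)
    {
            /*
             * refcount_dec_and_test() returns true for exactly one
             * caller: the one whose decrement took the count to zero.
             * Only that caller may free the object, and it must not
             * touch it afterwards.
             */
            if (refcount_dec_and_test(&f->refs))
                    kfree(f);
    }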
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 22b8e2764619..cb8b23a3e56b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 		 */
 		WRITE_ONCE(priv->status, bbio->bio.bi_status);
 	}
-	if (atomic_dec_return(&priv->pending) == 0) {
+	bio_put(&bbio->bio);
+	if (atomic_dec_and_test(&priv->pending)) {
 		int err = blk_status_to_errno(READ_ONCE(priv->status));
 
 		if (priv->uring_ctx) {
@@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 			wake_up(&priv->wait);
 		}
 	}
-	bio_put(&bbio->bio);
 }
 
 int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,