Message ID | c9e4e480-6f52-949b-e4b6-3eb0fcda3f83@alu.unizg.hr (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | BUG: KCSAN: data-race in btrfs_calculate_inode_block_rsv_size [btrfs] / btrfs_use_block_rsv [btrfs] [EXPERIMENTAL PATCH] | expand |
On Wed, Sep 20, 2023 at 08:18:35AM +0200, Mirsad Todorovac wrote: > Hi, > > This is your friendly bug reporter again. > > Please don't throw stuff at me, as I found another KCSAN data-race problem. > > I feel like a boy who cried wolf ... > > I hope this will get some attention, as it looks this time like a real btrfs problem that could cause > the kernel module to make a wrong turn when managing storage in different threads simultaneously and > lead to the corruption of data. However, I do not have an example of this corruption, it is by now only > theoretical in this otherwise great filesystem. > > In fact, I can announce quite a number of KCSAN bugs already in dmesg log: > > # of > occuren > ces problematic function > ------------------------------------------- > 182 __bitmap_and+0xa3/0x110 > 2 __bitmap_weight+0x62/0xa0 > 138 __call_rcu_common.constprop.0 > 3 __cgroup_account_cputime > 1 __dentry_kill > 3 __mod_lruvec_page_state > 15 __percpu_counter_compare > 1 __percpu_counter_sum+0x8f/0x120 > 1 acpi_ut_acquire_mutex > 2 amdgpu_fence_emit > 1 btrfs_calculate_inode_block_rsv_size > 1 btrfs_page_set_uptodate > 28 copy_from_read_buf > 3 d_add > 3 d_splice_alias > 1 delayacct_add_tsk+0x10d/0x630 > 7 do_epoll_ctl > 1 do_vmi_align_munmap > 86 drm_sched_entity_is_ready > 4 drm_sched_entity_pop_job > 3 enqueue_timer > 1 finish_fault+0xde/0x360 > 2 generic_fillattr > 2 getrusage > 9 getrusage+0x3ba/0xaa0 > 1 getrusage+0x3df/0xaa0 > 6 inode_needs_update_time > 1 inode_set_ctime_current > 1 inode_update_timestamps > 3 kernfs_refresh_inode > 22 ktime_get_mono_fast_ns+0x87/0x120 > 13 ktime_get_mono_fast_ns+0xb0/0x120 > 24 ktime_get_mono_fast_ns+0xc0/0x120 > 79 mas_topiary_replace > 12 mas_wr_modify > 61 mas_wr_node_store > 1 memchr_inv+0x71/0x160 > 1 memchr_inv+0xcf/0x160 > 19 n_tty_check_unthrottle > 5 n_tty_kick_worker > 35 n_tty_poll > 32 n_tty_read > 1 n_tty_read+0x5f8/0xaf0 > 3 osq_lock > 27 process_one_work > 4 process_one_work+0x169/0x700 > 2 rcu_implicit_dynticks_qs > 1 show_stat+0x45b/0xb70 > 3 task_mem > 344 tick_nohz_idle_stop_tick > 32 tick_nohz_next_event > 1 tick_nohz_next_event+0xe7/0x1e0 > 90 tick_sched_do_timer > 5 tick_sched_do_timer+0x2c/0x120 > 1 wbt_done > 1 wbt_issue > 2 wq_worker_tick > 37 xas_clear_mark > > ------------------------------------------------------ > > This report is from a vanilla torvalds tree 6.6-rc2 kernel on Ubuntu 22.04: > > [13429.116126] ================================================================== > [13429.116794] BUG: KCSAN: data-race in btrfs_calculate_inode_block_rsv_size [btrfs] / btrfs_use_block_rsv [btrfs] Thanks for the report. Some data races are known to happen in the reservation code but all the critical changes are done under locks, so an optimistic check may skip locking to check a status but then it's done properly again under a lock. Generally speaking. We had several reports from static checkers and at least in one case we added an annotation so KCSAN does not complain: https://git.kernel.org/linus/748f553c3c4c4f175c6c834358632aff802d72cf The original report is at https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/ I have briefly looked at your report, it seems to be different from the one above but still matches the general approach to the reservations. If it's a false flag then we can add another wrapper with the annotation, unless it's a real bug.
On 9/20/23 17:29, David Sterba wrote: > On Wed, Sep 20, 2023 at 08:18:35AM +0200, Mirsad Todorovac wrote: >> Hi, >> >> This is your friendly bug reporter again. >> >> Please don't throw stuff at me, as I found another KCSAN data-race problem. >> >> I feel like a boy who cried wolf ... >> >> I hope this will get some attention, as it looks this time like a real btrfs problem that could cause >> the kernel module to make a wrong turn when managing storage in different threads simultaneously and >> lead to the corruption of data. However, I do not have an example of this corruption, it is by now only >> theoretical in this otherwise great filesystem. >> >> In fact, I can announce quite a number of KCSAN bugs already in dmesg log: >> >> # of >> occuren >> ces problematic function >> ------------------------------------------- >> 182 __bitmap_and+0xa3/0x110 >> 2 __bitmap_weight+0x62/0xa0 >> 138 __call_rcu_common.constprop.0 >> 3 __cgroup_account_cputime >> 1 __dentry_kill >> 3 __mod_lruvec_page_state >> 15 __percpu_counter_compare >> 1 __percpu_counter_sum+0x8f/0x120 >> 1 acpi_ut_acquire_mutex >> 2 amdgpu_fence_emit >> 1 btrfs_calculate_inode_block_rsv_size >> 1 btrfs_page_set_uptodate >> 28 copy_from_read_buf >> 3 d_add >> 3 d_splice_alias >> 1 delayacct_add_tsk+0x10d/0x630 >> 7 do_epoll_ctl >> 1 do_vmi_align_munmap >> 86 drm_sched_entity_is_ready >> 4 drm_sched_entity_pop_job >> 3 enqueue_timer >> 1 finish_fault+0xde/0x360 >> 2 generic_fillattr >> 2 getrusage >> 9 getrusage+0x3ba/0xaa0 >> 1 getrusage+0x3df/0xaa0 >> 6 inode_needs_update_time >> 1 inode_set_ctime_current >> 1 inode_update_timestamps >> 3 kernfs_refresh_inode >> 22 ktime_get_mono_fast_ns+0x87/0x120 >> 13 ktime_get_mono_fast_ns+0xb0/0x120 >> 24 ktime_get_mono_fast_ns+0xc0/0x120 >> 79 mas_topiary_replace >> 12 mas_wr_modify >> 61 mas_wr_node_store >> 1 memchr_inv+0x71/0x160 >> 1 memchr_inv+0xcf/0x160 >> 19 n_tty_check_unthrottle >> 5 n_tty_kick_worker >> 35 n_tty_poll >> 32 n_tty_read >> 1 n_tty_read+0x5f8/0xaf0 >> 3 osq_lock >> 27 process_one_work >> 4 process_one_work+0x169/0x700 >> 2 rcu_implicit_dynticks_qs >> 1 show_stat+0x45b/0xb70 >> 3 task_mem >> 344 tick_nohz_idle_stop_tick >> 32 tick_nohz_next_event >> 1 tick_nohz_next_event+0xe7/0x1e0 >> 90 tick_sched_do_timer >> 5 tick_sched_do_timer+0x2c/0x120 >> 1 wbt_done >> 1 wbt_issue >> 2 wq_worker_tick >> 37 xas_clear_mark >> >> ------------------------------------------------------ >> >> This report is from a vanilla torvalds tree 6.6-rc2 kernel on Ubuntu 22.04: >> >> [13429.116126] ================================================================== >> [13429.116794] BUG: KCSAN: data-race in btrfs_calculate_inode_block_rsv_size [btrfs] / btrfs_use_block_rsv [btrfs] > > Thanks for the report. Some data races are known to happen in the > reservation code but all the critical changes are done under locks, so > an optimistic check may skip locking to check a status but then it's > done properly again under a lock. Generally speaking. > > We had several reports from static checkers and at least in one case we > added an annotation so KCSAN does not complain: > > https://git.kernel.org/linus/748f553c3c4c4f175c6c834358632aff802d72cf > > The original report is at > > https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/ > > I have briefly looked at your report, it seems to be different from the > one above but still matches the general approach to the reservations. If > it's a false flag then we can add another wrapper with the annotation, > unless it's a real bug. Thank you for your bug report evaluation. I cannot do more but pass on what KCSAN provides - my experience with btrfs is far from required (on the level of fresh user). However, without attempting to argue, it seems to be possible that there is a data-race, because the read side is in the function is not protected by a lock, and theoretically the block_rsv->failfast can change by the write-side thread while the read-side thread is using various parts of the block_rsv structure w/o a read lock. Best regards, Mirsad Todorovac
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c index 77684c5e0c8b..8153814c7861 100644 --- a/fs/btrfs/block-rsv.c +++ b/fs/btrfs/block-rsv.c @@ -166,7 +166,9 @@ int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src, { int ret; + spin_lock(&src->lock); ret = btrfs_block_rsv_use_bytes(src, num_bytes); + spin_unlock(&src->lock); if (ret) return ret; @@ -298,14 +300,12 @@ int btrfs_block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv, u64 num_bytes) { int ret = -ENOSPC; - spin_lock(&block_rsv->lock); if (block_rsv->reserved >= num_bytes) { block_rsv->reserved -= num_bytes; if (block_rsv->reserved < block_rsv->size) block_rsv->full = false; ret = 0; } - spin_unlock(&block_rsv->lock); return ret; } @@ -486,15 +486,16 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, block_rsv = get_block_rsv(trans, root); + spin_lock(&block_rsv->lock); if (unlikely(block_rsv->size == 0)) goto try_reserve; again: ret = btrfs_block_rsv_use_bytes(block_rsv, blocksize); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; if (block_rsv->failfast) - return ERR_PTR(ret); + goto exit_ret_err; if (block_rsv->type == BTRFS_BLOCK_RSV_GLOBAL && !global_updated) { global_updated = true; @@ -520,7 +521,7 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, blocksize, BTRFS_RESERVE_NO_FLUSH); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; /* * If we couldn't reserve metadata bytes try and use some from * the global reserve if its space type is the same as the global @@ -530,7 +531,7 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, block_rsv->space_info == global_rsv->space_info) { ret = btrfs_block_rsv_use_bytes(global_rsv, blocksize); if (!ret) - return global_rsv; + goto exit_ret_global_rsv; } /* @@ -542,9 +543,20 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, blocksize, BTRFS_RESERVE_FLUSH_EMERGENCY); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; +exit_ret_err: + spin_unlock(&block_rsv->lock); return ERR_PTR(ret); + +exit_ret_block_rsv: + spin_unlock(&block_rsv->lock); + return block_rsv; + +exit_ret_global_rsv: + spin_unlock(&block_rsv->lock); + return global_rsv; + } int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info, --- This is of course just a symptomatic patch, but since btrfs_block_rsv_use_bytes() is not used outside of fs/btrfs/block-rsv.c,it could just work as PoC. OTOH, this version might be more elegant: ---------------------------------------------------------------------- diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c index 77684c5e0c8b..192be99cc6f4 100644 --- a/fs/btrfs/block-rsv.c +++ b/fs/btrfs/block-rsv.c @@ -294,18 +294,28 @@ u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, qgroup_to_release); } -int btrfs_block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv, u64 num_bytes) +static +int __btrfs_block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv, u64 num_bytes) { int ret = -ENOSPC; - spin_lock(&block_rsv->lock); if (block_rsv->reserved >= num_bytes) { block_rsv->reserved -= num_bytes; if (block_rsv->reserved < block_rsv->size) block_rsv->full = false; ret = 0; } + return ret; +} + +int btrfs_block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv, u64 num_bytes) +{ + int ret; + + spin_lock(&block_rsv->lock); + ret = __btrfs_block_rsv_use_bytes(block_rsv, num_bytes); spin_unlock(&block_rsv->lock); + return ret; } @@ -486,15 +496,16 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, block_rsv = get_block_rsv(trans, root); + spin_lock(&block_rsv->lock); if (unlikely(block_rsv->size == 0)) goto try_reserve; again: - ret = btrfs_block_rsv_use_bytes(block_rsv, blocksize); + ret = __btrfs_block_rsv_use_bytes(block_rsv, blocksize); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; if (block_rsv->failfast) - return ERR_PTR(ret); + goto exit_ret_err; if (block_rsv->type == BTRFS_BLOCK_RSV_GLOBAL && !global_updated) { global_updated = true; @@ -520,7 +531,7 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, blocksize, BTRFS_RESERVE_NO_FLUSH); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; /* * If we couldn't reserve metadata bytes try and use some from * the global reserve if its space type is the same as the global @@ -528,9 +539,9 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, */ if (block_rsv->type != BTRFS_BLOCK_RSV_GLOBAL && block_rsv->space_info == global_rsv->space_info) { - ret = btrfs_block_rsv_use_bytes(global_rsv, blocksize); + ret = __btrfs_block_rsv_use_bytes(global_rsv, blocksize); if (!ret) - return global_rsv; + goto exit_ret_global_rsv; } /* @@ -542,9 +553,20 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans, ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, blocksize, BTRFS_RESERVE_FLUSH_EMERGENCY); if (!ret) - return block_rsv; + goto exit_ret_block_rsv; +exit_ret_err: + spin_unlock(&block_rsv->lock); return ERR_PTR(ret); + +exit_ret_block_rsv: + spin_unlock(&block_rsv->lock); + return block_rsv; + +exit_ret_global_rsv: + spin_unlock(&block_rsv->lock); + return global_rsv; + } int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,