Message ID | 20231221012715.3048221-1-song@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | block, md: Better handle REQ_OP_FLUSH | expand |
On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > A recent bug report [1] shows md is handling a flush from bcachefs > as read: > > bch2_journal_write=> > submit_bio=> > ... > md_handle_request => > raid5_make_request => > chunk_aligned_read => > raid5_read_one_chunk => > ... > > It appears md code only checks REQ_PREFLUSH for flush requests, which > doesn't cover all cases. OTOH, op_is_flush() doesn't check > REQ_OP_FLUSH > either. > > Fix this by: > 1) Check REQ_PREFLUSH in op_is_flush(); > 2) Use op_is_flush() in md code. > > Thanks, > Song > > [1] > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > REQ_OP_FLUSH is only used by the block layer's flush code, and the filesystem should use REQ_PREFLUSH with an empty write bio. If we want upper layer to be able to directly send REQ_OP_FLUSH bio, then we should retrieve all REQ_PREFLUSH to confirm. At least for now, it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy` will directly return 0 and no flush operation will be sent to the driver. > > > Song Liu (2): > block: Check REQ_OP_FLUSH in op_is_flush() > md: Use op_is_flush() to check flush bio > > drivers/md/raid0.c | 2 +- > drivers/md/raid1.c | 2 +- > drivers/md/raid10.c | 2 +- > drivers/md/raid5.c | 2 +- > include/linux/blk_types.h | 3 ++- > 5 files changed, 6 insertions(+), 5 deletions(-) > > -- > 2.34.1
On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > > External email : Please do not click links or open attachments until > > you have verified the sender or the content. > > A recent bug report [1] shows md is handling a flush from bcachefs > > as read: > > > > bch2_journal_write=> > > submit_bio=> > > ... > > md_handle_request => > > raid5_make_request => > > chunk_aligned_read => > > raid5_read_one_chunk => > > ... > > > > It appears md code only checks REQ_PREFLUSH for flush requests, which > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > REQ_OP_FLUSH > > either. > > > > Fix this by: > > 1) Check REQ_PREFLUSH in op_is_flush(); > > 2) Use op_is_flush() in md code. > > > > Thanks, > > Song > > > > [1] > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > filesystem should use REQ_PREFLUSH with an empty write bio. > > If we want upper layer to be able to directly send REQ_OP_FLUSH bio, > then we should retrieve all REQ_PREFLUSH to confirm. At least for now, > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy` > will directly return 0 and no flush operation will be sent to the > driver. If that's the case, then it should be documented and there should be a WARN_ON() in generic_make_request(). Also, glancing at blk_types.h, we have the req_op and req_flag_bits both using (__force blk_opf_t), but using the same bit range - what the hell? That's seriously broken...
On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote: > On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > you have verified the sender or the content. > > > A recent bug report [1] shows md is handling a flush from > bcachefs > > > as read: > > > > > > bch2_journal_write=> > > > submit_bio=> > > > ... > > > md_handle_request => > > > raid5_make_request => > > > chunk_aligned_read => > > > raid5_read_one_chunk => > > > ... > > > > > > It appears md code only checks REQ_PREFLUSH for flush requests, > which > > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > > REQ_OP_FLUSH > > > either. > > > > > > Fix this by: > > > 1) Check REQ_PREFLUSH in op_is_flush(); > > > 2) Use op_is_flush() in md code. > > > > > > Thanks, > > > Song > > > > > > [1] > > > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > > filesystem should use REQ_PREFLUSH with an empty write bio. > > > > If we want upper layer to be able to directly send REQ_OP_FLUSH > bio, > > then we should retrieve all REQ_PREFLUSH to confirm. At least for > now, > > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in > `blk_flush_policy` > > will directly return 0 and no flush operation will be sent to the > > driver. > > If that's the case, then it should be documented and there should be > a > WARN_ON() in generic_make_request(). Please refer to the writeback_cache_control.rst. Use an empty write bio with the REQ_PREFLUSH flag for an explicit flush, or as commonly practiced by most filesystems, use blkdev_issue_flush for a pure flush. > > Also, glancing at blk_types.h, we have the req_op and req_flag_bits > both > using (__force blk_opf_t), but using the same bit range - what the > hell? > That's seriously broken... No, read the comment before req_op. We do not need to use the entire 32 bits to represent OP; only 8 bits for OP, while the remaning 24 bits is used for FLAG.
On Thu, Dec 21, 2023 at 07:56:45AM +0000, Ed Tsai (蔡宗軒) wrote: > On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote: > > On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > > > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > > you have verified the sender or the content. > > > > A recent bug report [1] shows md is handling a flush from > > bcachefs > > > > as read: > > > > > > > > bch2_journal_write=> > > > > submit_bio=> > > > > ... > > > > md_handle_request => > > > > raid5_make_request => > > > > chunk_aligned_read => > > > > raid5_read_one_chunk => > > > > ... > > > > > > > > It appears md code only checks REQ_PREFLUSH for flush requests, > > which > > > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > > > REQ_OP_FLUSH > > > > either. > > > > > > > > Fix this by: > > > > 1) Check REQ_PREFLUSH in op_is_flush(); > > > > 2) Use op_is_flush() in md code. > > > > > > > > Thanks, > > > > Song > > > > > > > > [1] > > > > > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > > > > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > > > filesystem should use REQ_PREFLUSH with an empty write bio. > > > > > > If we want upper layer to be able to directly send REQ_OP_FLUSH > > bio, > > > then we should retrieve all REQ_PREFLUSH to confirm. At least for > > now, > > > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in > > `blk_flush_policy` > > > will directly return 0 and no flush operation will be sent to the > > > driver. > > > > If that's the case, then it should be documented and there should be > > a > > WARN_ON() in generic_make_request(). > > Please refer to the writeback_cache_control.rst. Use an empty write bio > with the REQ_PREFLUSH flag for an explicit flush, or as commonly > practiced by most filesystems, use blkdev_issue_flush for a pure flush. That's not a substitute for a proper comment in the code. > > > > > Also, glancing at blk_types.h, we have the req_op and req_flag_bits > > both > > using (__force blk_opf_t), but using the same bit range - what the > > hell? > > That's seriously broken... > > No, read the comment before req_op. We do not need to use the entire 32 > bits to represent OP; only 8 bits for OP, while the remaning 24 bits is > used for FLAG. No, this is just broken; it's using the same bitwise enum for two different enums. bitwise exists for a reason - C enums are not natively type safe, and mixing up enums/bitflags and using them in the wrong context is a serious source of bugs. If it would be incorrect to or the two different flags together, you can't use the same bitwise type.