[0/2] block, md: Better handle REQ_OP_FLUSH

Message ID	20231221012715.3048221-1-song@kernel.org (mailing list archive)
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 839AD3C02; Thu, 21 Dec 2023 01:27:36 +0000 (UTC) From: Song Liu <song@kernel.org> To: linux-block@vger.kernel.org, linux-raid@vger.kernel.org Cc: axboe@kernel.dk, kent.overstreet@linux.dev, janpieter.sollie@edpnet.be, colyli@suse.de, bagasdotme@gmail.com, Song Liu <song@kernel.org> Subject: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Date: Wed, 20 Dec 2023 17:27:13 -0800 Message-Id: <20231221012715.3048221-1-song@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	block, md: Better handle REQ_OP_FLUSH \| expand [0/2] block, md: Better handle REQ_OP_FLUSH [1/2] block: Check REQ_OP_FLUSH in op_is_flush() [2/2] md: Use op_is_flush() to check flush bio

Message ID

20231221012715.3048221-1-song@kernel.org (mailing list archive)

Headers

From: Song Liu <song@kernel.org>
To: linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
Cc: axboe@kernel.dk,
	kent.overstreet@linux.dev,
	janpieter.sollie@edpnet.be,
	colyli@suse.de,
	bagasdotme@gmail.com,
	Song Liu <song@kernel.org>
Subject: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH
Date: Wed, 20 Dec 2023 17:27:13 -0800
Message-Id: <20231221012715.3048221-1-song@kernel.org>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

block, md: Better handle REQ_OP_FLUSH | expand

Message

Song Liu Dec. 21, 2023, 1:27 a.m. UTC

A recent bug report [1] shows md is handling a flush from bcachefs as read:

bch2_journal_write=>
  submit_bio=>
    ...
    md_handle_request =>
      raid5_make_request =>
        chunk_aligned_read =>
          raid5_read_one_chunk =>
	    ...

It appears md code only checks REQ_PREFLUSH for flush requests, which
doesn't cover all cases. OTOH, op_is_flush() doesn't check REQ_OP_FLUSH
either.

Fix this by:
1) Check REQ_PREFLUSH in op_is_flush();
2) Use op_is_flush() in md code.

Thanks,
Song

[1] https://bugzilla.kernel.org/show_bug.cgi?id=218184

Song Liu (2):
  block: Check REQ_OP_FLUSH in op_is_flush()
  md: Use op_is_flush() to check flush bio

 drivers/md/raid0.c        | 2 +-
 drivers/md/raid1.c        | 2 +-
 drivers/md/raid10.c       | 2 +-
 drivers/md/raid5.c        | 2 +-
 include/linux/blk_types.h | 3 ++-
 5 files changed, 6 insertions(+), 5 deletions(-)

--
2.34.1

Comments

Ed Tsai (蔡宗軒) Dec. 21, 2023, 3:36 a.m. UTC | #1

On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  A recent bug report [1] shows md is handling a flush from bcachefs
> as read:
> 
> bch2_journal_write=>
>   submit_bio=>
>     ...
>     md_handle_request =>
>       raid5_make_request =>
>         chunk_aligned_read =>
>           raid5_read_one_chunk =>
> 	    ...
> 
> It appears md code only checks REQ_PREFLUSH for flush requests, which
> doesn't cover all cases. OTOH, op_is_flush() doesn't check
> REQ_OP_FLUSH
> either.
> 
> Fix this by:
> 1) Check REQ_PREFLUSH in op_is_flush();
> 2) Use op_is_flush() in md code.
> 
> Thanks,
> Song
> 
> [1] 
> https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$
> 

REQ_OP_FLUSH is only used by the block layer's flush code, and the
filesystem should use REQ_PREFLUSH with an empty write bio.

If we want upper layer to be able to directly send REQ_OP_FLUSH bio,
then we should retrieve all REQ_PREFLUSH to confirm. At least for now,
it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy`
will directly return 0 and no flush operation will be sent to the
driver.

>  
> 
> Song Liu (2):
>   block: Check REQ_OP_FLUSH in op_is_flush()
>   md: Use op_is_flush() to check flush bio
> 
>  drivers/md/raid0.c        | 2 +-
>  drivers/md/raid1.c        | 2 +-
>  drivers/md/raid10.c       | 2 +-
>  drivers/md/raid5.c        | 2 +-
>  include/linux/blk_types.h | 3 ++-
>  5 files changed, 6 insertions(+), 5 deletions(-)
> 
> --
> 2.34.1

Kent Overstreet Dec. 21, 2023, 5:30 a.m. UTC | #2

On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote:
> >  	 
> > External email : Please do not click links or open attachments until
> > you have verified the sender or the content.
> >  A recent bug report [1] shows md is handling a flush from bcachefs
> > as read:
> > 
> > bch2_journal_write=>
> >   submit_bio=>
> >     ...
> >     md_handle_request =>
> >       raid5_make_request =>
> >         chunk_aligned_read =>
> >           raid5_read_one_chunk =>
> > 	    ...
> > 
> > It appears md code only checks REQ_PREFLUSH for flush requests, which
> > doesn't cover all cases. OTOH, op_is_flush() doesn't check
> > REQ_OP_FLUSH
> > either.
> > 
> > Fix this by:
> > 1) Check REQ_PREFLUSH in op_is_flush();
> > 2) Use op_is_flush() in md code.
> > 
> > Thanks,
> > Song
> > 
> > [1] 
> > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$
> > 
> 
> REQ_OP_FLUSH is only used by the block layer's flush code, and the
> filesystem should use REQ_PREFLUSH with an empty write bio.
> 
> If we want upper layer to be able to directly send REQ_OP_FLUSH bio,
> then we should retrieve all REQ_PREFLUSH to confirm. At least for now,
> it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy`
> will directly return 0 and no flush operation will be sent to the
> driver.

If that's the case, then it should be documented and there should be a
WARN_ON() in generic_make_request().

Also, glancing at blk_types.h, we have the req_op and req_flag_bits both
using (__force blk_opf_t), but using the same bit range - what the hell?
That's seriously broken...

Ed Tsai (蔡宗軒) Dec. 21, 2023, 7:56 a.m. UTC | #3

On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote:
>  On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote:
> > > you have verified the sender or the content.
> > >  A recent bug report [1] shows md is handling a flush from
> bcachefs
> > > as read:
> > > 
> > > bch2_journal_write=>
> > >   submit_bio=>
> > >     ...
> > >     md_handle_request =>
> > >       raid5_make_request =>
> > >         chunk_aligned_read =>
> > >           raid5_read_one_chunk =>
> > >     ...
> > > 
> > > It appears md code only checks REQ_PREFLUSH for flush requests,
> which
> > > doesn't cover all cases. OTOH, op_is_flush() doesn't check
> > > REQ_OP_FLUSH
> > > either.
> > > 
> > > Fix this by:
> > > 1) Check REQ_PREFLUSH in op_is_flush();
> > > 2) Use op_is_flush() in md code.
> > > 
> > > Thanks,
> > > Song
> > > 
> > > [1] 
> > > 
> https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$
> > > 
> > 
> > REQ_OP_FLUSH is only used by the block layer's flush code, and the
> > filesystem should use REQ_PREFLUSH with an empty write bio.
> > 
> > If we want upper layer to be able to directly send REQ_OP_FLUSH
> bio,
> > then we should retrieve all REQ_PREFLUSH to confirm. At least for
> now,
> > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in
> `blk_flush_policy`
> > will directly return 0 and no flush operation will be sent to the
> > driver.
> 
> If that's the case, then it should be documented and there should be
> a
> WARN_ON() in generic_make_request().

Please refer to the writeback_cache_control.rst. Use an empty write bio
with the REQ_PREFLUSH flag for an explicit flush, or as commonly
practiced by most filesystems, use blkdev_issue_flush for a pure flush.

> 
> Also, glancing at blk_types.h, we have the req_op and req_flag_bits
> both
> using (__force blk_opf_t), but using the same bit range - what the
> hell?
> That's seriously broken...

No, read the comment before req_op. We do not need to use the entire 32
bits to represent OP; only 8 bits for OP, while the remaning 24 bits is
used for FLAG.

Kent Overstreet Dec. 21, 2023, 7:19 p.m. UTC | #4

On Thu, Dec 21, 2023 at 07:56:45AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote:
> >  On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote:
> > > > you have verified the sender or the content.
> > > >  A recent bug report [1] shows md is handling a flush from
> > bcachefs
> > > > as read:
> > > > 
> > > > bch2_journal_write=>
> > > >   submit_bio=>
> > > >     ...
> > > >     md_handle_request =>
> > > >       raid5_make_request =>
> > > >         chunk_aligned_read =>
> > > >           raid5_read_one_chunk =>
> > > >     ...
> > > > 
> > > > It appears md code only checks REQ_PREFLUSH for flush requests,
> > which
> > > > doesn't cover all cases. OTOH, op_is_flush() doesn't check
> > > > REQ_OP_FLUSH
> > > > either.
> > > > 
> > > > Fix this by:
> > > > 1) Check REQ_PREFLUSH in op_is_flush();
> > > > 2) Use op_is_flush() in md code.
> > > > 
> > > > Thanks,
> > > > Song
> > > > 
> > > > [1] 
> > > > 
> > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$
> > > > 
> > > 
> > > REQ_OP_FLUSH is only used by the block layer's flush code, and the
> > > filesystem should use REQ_PREFLUSH with an empty write bio.
> > > 
> > > If we want upper layer to be able to directly send REQ_OP_FLUSH
> > bio,
> > > then we should retrieve all REQ_PREFLUSH to confirm. At least for
> > now,
> > > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in
> > `blk_flush_policy`
> > > will directly return 0 and no flush operation will be sent to the
> > > driver.
> > 
> > If that's the case, then it should be documented and there should be
> > a
> > WARN_ON() in generic_make_request().
> 
> Please refer to the writeback_cache_control.rst. Use an empty write bio
> with the REQ_PREFLUSH flag for an explicit flush, or as commonly
> practiced by most filesystems, use blkdev_issue_flush for a pure flush.

That's not a substitute for a proper comment in the code.

> 
> > 
> > Also, glancing at blk_types.h, we have the req_op and req_flag_bits
> > both
> > using (__force blk_opf_t), but using the same bit range - what the
> > hell?
> > That's seriously broken...
> 
> No, read the comment before req_op. We do not need to use the entire 32
> bits to represent OP; only 8 bits for OP, while the remaning 24 bits is
> used for FLAG.

No, this is just broken; it's using the same bitwise enum for two
different enums.

bitwise exists for a reason - C enums are not natively type safe, and
mixing up enums/bitflags and using them in the wrong context is a
serious source of bugs. If it would be incorrect to or the two different
flags together, you can't use the same bitwise type.