
scsi: ufs: mark HPB support as BROKEN

Message ID 20211026071204.1709318-1-hch@lst.de (mailing list archive)
State New, archived
Series scsi: ufs: mark HPB support as BROKEN

Commit Message

Christoph Hellwig Oct. 26, 2021, 7:12 a.m. UTC
The HPB support added this merge window is fundanetally flawed as it
uses blk_insert_cloned_request to insert a cloned request onto the same
queue as the one that the original request came from, leading to all
kinds of issues in blk-mq accounting (in addition to this API being
a special case for dm-mpath that should not see other users).

Mark is as BROKEN as the non-intrusive alternative to a last minute
large scale revert.

Fixes: f02bc9754a68 ("scsi: ufs: ufshpb: Introduce Host Performance Buffer
feature")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/ufs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
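For context on the mechanism the patch relies on: `BROKEN` is a Kconfig symbol defined (in current kernels, in init/Kconfig) as a bare bool with no prompt and nothing selecting it, so it is always `n`. A `depends on ... && BROKEN` guard therefore makes the option impossible to enable in a normal configuration. A simplified sketch of the pattern (not the literal init/Kconfig text):

```kconfig
# Sketch of the mechanism; see init/Kconfig for the real symbol.
config BROKEN
	bool		# no prompt, no default y, nothing selects it -> always n

config SCSI_UFS_HPB
	bool "Support UFS Host Performance Booster"
	depends on SCSI_UFSHCD && BROKEN
	# With BROKEN=n the dependency can never be satisfied, so the
	# prompt is hidden and existing configs silently drop the option.
```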

Comments

Hannes Reinecke Oct. 26, 2021, 7:18 a.m. UTC | #1
On 10/26/21 9:12 AM, Christoph Hellwig wrote:
> The HPB support added this merge window is fundanetally flawed as it
> uses blk_insert_cloned_request to insert a cloned request onto the same
> queue as the one that the original request came from, leading to all
> kinds of issues in blk-mq accounting (in addition to this API being
> a special case for dm-mpath that should not see other users).
> 
> Mark is as BROKEN as the non-intrusive alternative to a last minute
> large scale revert.
> 
> Fixes: f02bc9754a68 ("scsi: ufs: ufshpb: Introduce Host Performance Buffer
> feature")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/ufs/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/ufs/Kconfig b/drivers/scsi/ufs/Kconfig
> index 432df76e6318a..7835d9082aae4 100644
> --- a/drivers/scsi/ufs/Kconfig
> +++ b/drivers/scsi/ufs/Kconfig
> @@ -186,7 +186,7 @@ config SCSI_UFS_CRYPTO
>  
>  config SCSI_UFS_HPB
>  	bool "Support UFS Host Performance Booster"
> -	depends on SCSI_UFSHCD
> +	depends on SCSI_UFSHCD && BROKEN
>  	help
>  	  The UFS HPB feature improves random read performance. It caches
>  	  L2P (logical to physical) map of UFS to host DRAM. The driver uses HPB
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Damien Le Moal Oct. 26, 2021, 7:24 a.m. UTC | #2
On 2021/10/26 16:12, Christoph Hellwig wrote:
> The HPB support added this merge window is fundanetally flawed as it
> uses blk_insert_cloned_request to insert a cloned request onto the same
> queue as the one that the original request came from, leading to all
> kinds of issues in blk-mq accounting (in addition to this API being
> a special case for dm-mpath that should not see other users).
> 
> Mark is as BROKEN as the non-intrusive alternative to a last minute

s/Mark is/Mark it

> large scale revert.
> 
> Fixes: f02bc9754a68 ("scsi: ufs: ufshpb: Introduce Host Performance Buffer
> feature")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/ufs/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/ufs/Kconfig b/drivers/scsi/ufs/Kconfig
> index 432df76e6318a..7835d9082aae4 100644
> --- a/drivers/scsi/ufs/Kconfig
> +++ b/drivers/scsi/ufs/Kconfig
> @@ -186,7 +186,7 @@ config SCSI_UFS_CRYPTO
>  
>  config SCSI_UFS_HPB
>  	bool "Support UFS Host Performance Booster"
> -	depends on SCSI_UFSHCD
> +	depends on SCSI_UFSHCD && BROKEN
>  	help
>  	  The UFS HPB feature improves random read performance. It caches
>  	  L2P (logical to physical) map of UFS to host DRAM. The driver uses HPB
> 

Otherwise, looks good to me.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
James Bottomley Oct. 26, 2021, 1:04 p.m. UTC | #3
On Tue, 2021-10-26 at 16:24 +0900, Damien Le Moal wrote:
> On 2021/10/26 16:12, Christoph Hellwig wrote:
> > The HPB support added this merge window is fundanetally

And s/n/m/ while you're at it: fundamentally

Otherwise:

Reviewed-by: James E.J. Bottomley <jejb@linux.ibm.com>

James

> >  flawed as it uses blk_insert_cloned_request to insert a cloned
> > request onto the same queue as the one that the original request
> > came from, leading to all kinds of issues in blk-mq accounting (in
> > addition to this API being a special case for dm-mpath that should
> > not see other users).
> > 
> > Mark is as BROKEN as the non-intrusive alternative to a last minute
> 
> s/Mark is/Mark it
> 
> > large scale revert.
> > 
> > Fixes: f02bc9754a68 ("scsi: ufs: ufshpb: Introduce Host Performance
> > Buffer
> > feature")
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  drivers/scsi/ufs/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/ufs/Kconfig b/drivers/scsi/ufs/Kconfig
> > index 432df76e6318a..7835d9082aae4 100644
> > --- a/drivers/scsi/ufs/Kconfig
> > +++ b/drivers/scsi/ufs/Kconfig
> > @@ -186,7 +186,7 @@ config SCSI_UFS_CRYPTO
> >  
> >  config SCSI_UFS_HPB
> >  	bool "Support UFS Host Performance Booster"
> > -	depends on SCSI_UFSHCD
> > +	depends on SCSI_UFSHCD && BROKEN
> >  	help
> >  	  The UFS HPB feature improves random read performance. It
> > caches
> >  	  L2P (logical to physical) map of UFS to host DRAM. The driver
> > uses HPB
> > 
> 
> Otherwise, looks good to me.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>
Bart Van Assche Oct. 26, 2021, 4:36 p.m. UTC | #4
On 10/26/21 12:12 AM, Christoph Hellwig wrote:
> The HPB support added this merge window is fundanetally flawed as it
                                              ^^^^^^^^^^^^
                                              fundanetally -> fundamentally

Since the implementation can be reworked not to use
blk_insert_cloned_request() I'm not sure using the word "fundamentally"
is appropriate.

> uses blk_insert_cloned_request to insert a cloned request onto the same
> queue as the one that the original request came from, leading to all
> kinds of issues in blk-mq accounting (in addition to this API being
> a special case for dm-mpath that should not see other users).

More detailed information would have been welcome.

> Fixes: f02bc9754a68 ("scsi: ufs: ufshpb: Introduce Host Performance Buffer
> feature")

I assume that you wanted to refer to commit 41d8a9333cc9 ("scsi: ufs: ufshpb:
Add HPB 2.0 support") instead since that commit is the only commit that
introduced a blk_insert_cloned_request() call in the UFS HPB code?

Bart.
James Bottomley Oct. 26, 2021, 5:19 p.m. UTC | #5
On Tue, 2021-10-26 at 09:36 -0700, Bart Van Assche wrote:
> On 10/26/21 12:12 AM, Christoph Hellwig wrote:
> > The HPB support added this merge window is fundanetally flawed as
> > it
>                                               ^^^^^^^^^^^^
>                                               fundanetally ->
> fundamentally
> 
> Since the implementation can be reworked not to use
> blk_insert_cloned_request() I'm not sure using the word
> "fundamentally" is appropriate.

I'm not so sure about that.  The READ BUFFER implementation runs from a
work queue and looks fine.  The WRITE BUFFER implementation is trying
to spawn a second command to precede the queued command which is a
fundamental problem for the block API.  It's not clear to me that the
WRITE BUFFER can be fixed because of the tying to the sent command ...
but like I said, the standard is proprietary so I can't look at it to
see if there are alternative ways of achieving the same effect.

James
Jens Axboe Oct. 26, 2021, 5:25 p.m. UTC | #6
On 10/26/21 11:19 AM, James Bottomley wrote:
> On Tue, 2021-10-26 at 09:36 -0700, Bart Van Assche wrote:
>> On 10/26/21 12:12 AM, Christoph Hellwig wrote:
>>> The HPB support added this merge window is fundanetally flawed as
>>> it
>>                                               ^^^^^^^^^^^^
>>                                               fundanetally ->
>> fundamentally
>>
>> Since the implementation can be reworked not to use
>> blk_insert_cloned_request() I'm not sure using the word
>> "fundamentally" is appropriate.
> 
> I'm not so sure about that.  The READ BUFFER implementation runs from a
> work queue and looks fine.  The WRITE BUFFER implementation is trying
> to spawn a second command to precede the queued command which is a
> fundamental problem for the block API.  It's not clear to me that the
> WRITE BUFFER can be fixed because of the tying to the sent command ...
> but like I said, the standard is proprietary so I can't look at it to
> see if there are alternative ways of achieving the same effect.

Is there a model in which this can actually work? If not, or if we
aren't sure, I think we'd be better off just reverting the parts
involved with that block layer misuse. Simply marking it broken is a
half measure that doesn't really solve anything (except send a message).

IMHO, it should be reverted and the clone usage we currently export be
moved into dm for now. That'll prevent further abuse of this in the
future.
Bart Van Assche Oct. 26, 2021, 6:05 p.m. UTC | #7
On 10/26/21 10:25 AM, Jens Axboe wrote:
> On 10/26/21 11:19 AM, James Bottomley wrote:
>> On Tue, 2021-10-26 at 09:36 -0700, Bart Van Assche wrote:
>>> On 10/26/21 12:12 AM, Christoph Hellwig wrote:
>>>> The HPB support added this merge window is fundanetally flawed as
>>>> it
>>>                                                ^^^^^^^^^^^^
>>>                                                fundanetally ->
>>> fundamentally
>>>
>>> Since the implementation can be reworked not to use
>>> blk_insert_cloned_request() I'm not sure using the word
>>> "fundamentally" is appropriate.
>>
>> I'm not so sure about that.  The READ BUFFER implementation runs from a
>> work queue and looks fine.  The WRITE BUFFER implementation is trying
>> to spawn a second command to precede the queued command which is a
>> fundamental problem for the block API.  It's not clear to me that the
>> WRITE BUFFER can be fixed because of the tying to the sent command ...
>> but like I said, the standard is proprietary so I can't look at it to
>> see if there are alternative ways of achieving the same effect.
> 
> Is there a model in which this can actually work? If not, or if we
> aren't sure, I think we'd be better off just reverting the parts
> involved with that block layer misuse. Simply marking it broken is a
> half measure that doesn't really solve anything (except send a message).
> 
> IMHO, it should be reverted and the clone usage we currently export be
> moved into dm for now. That'll prevent further abuse of this in the
> future.

Hi Jens and James,

This is what I found in the HPB 2.0 specification (the spec is
copyrighted but I assume that I have the right to quote small parts of
that spec):

<quote>
6.3 HPB WRITE BUFFER Command

HPB WRITE BUFFER command have following 3 different function depending
on the value of BUFFER_ID field.
1) Inactivating an HPB Region (supported in host control mode only)
2) prefetching HPB Entries from the host to the device (supported in any
    control mode)
3) Inactivating all HPB Regions, except for Provisioning Pinned Region
    in the host (supported in device control mode only)
</quote>

Reverting only the problematic code (HPB 2.0) sounds reasonable to me
because reworking the HPB 2.0 code will be nontrivial. Using
BLK_MQ_F_BLOCKING might be a viable approach. However, I don't want to
see that flag enabled in the UFS driver if HPB is not used because of
the negative performance effects of that flag.

Thanks,

Bart.
Jens Axboe Oct. 26, 2021, 6:10 p.m. UTC | #8
On 10/26/21 12:05 PM, Bart Van Assche wrote:
> On 10/26/21 10:25 AM, Jens Axboe wrote:
>> On 10/26/21 11:19 AM, James Bottomley wrote:
>>> On Tue, 2021-10-26 at 09:36 -0700, Bart Van Assche wrote:
>>>> On 10/26/21 12:12 AM, Christoph Hellwig wrote:
>>>>> The HPB support added this merge window is fundanetally flawed as
>>>>> it
>>>>                                                ^^^^^^^^^^^^
>>>>                                                fundanetally ->
>>>> fundamentally
>>>>
>>>> Since the implementation can be reworked not to use
>>>> blk_insert_cloned_request() I'm not sure using the word
>>>> "fundamentally" is appropriate.
>>>
>>> I'm not so sure about that.  The READ BUFFER implementation runs from a
>>> work queue and looks fine.  The WRITE BUFFER implementation is trying
>>> to spawn a second command to precede the queued command which is a
>>> fundamental problem for the block API.  It's not clear to me that the
>>> WRITE BUFFER can be fixed because of the tying to the sent command ...
>>> but like I said, the standard is proprietary so I can't look at it to
>>> see if there are alternative ways of achieving the same effect.
>>
>> Is there a model in which this can actually work? If not, or if we
>> aren't sure, I think we'd be better off just reverting the parts
>> involved with that block layer misuse. Simply marking it broken is a
>> half measure that doesn't really solve anything (except send a message).
>>
>> IMHO, it should be reverted and the clone usage we currently export be
>> moved into dm for now. That'll prevent further abuse of this in the
>> future.
> 
> Hi Jens and James,
> 
> This is what I found in the HPB 2.0 specification (the spec is
> copyrighted but I assume that I have the right to quote small parts of
> that spec):
> 
> <quote>
> 6.3 HPB WRITE BUFFER Command
> 
> HPB WRITE BUFFER command have following 3 different function depending
> on the value of BUFFER_ID field.
> 1) Inactivating an HPB Region (supported in host control mode only)
> 2) prefetching HPB Entries from the host to the device (supported in any
>     control mode)
> 3) Inactivating all HPB Regions, except for Provisioning Pinned Region
>     in the host (supported in device control mode only)
> </quote>
> 
> Reverting only the problematic code (HPB 2.0) sounds reasonable to me
> because reworking the HPB 2.0 code will be nontrivial.

Then let's please go ahead and do that. I'm assuming this is a smaller
set than what Christoph originally posted, who's taking on the job of
lining it up?

> Using BLK_MQ_F_BLOCKING might be a viable approach. However, I don't
> want to see that flag enabled in the UFS driver if HPB is not used
> because of the negative performance effects of that flag.

Agree, and if we do just the problematic revert, then the decision on
whether BLK_MQ_F_BLOCKING is useful or not belongs to whoever reworks
the WRITE BUFFER code and reposts that support.
James Bottomley Oct. 26, 2021, 6:18 p.m. UTC | #9
On Tue, 2021-10-26 at 12:10 -0600, Jens Axboe wrote:
> On 10/26/21 12:05 PM, Bart Van Assche wrote:
[...]
> > Hi Jens and James,
> > 
> > This is what I found in the HPB 2.0 specification (the spec is
> > copyrighted but I assume that I have the right to quote small parts
> > of that spec):
> > 
> > <quote>
> > 6.3 HPB WRITE BUFFER Command
> > 
> > HPB WRITE BUFFER command have following 3 different function
> > depending
> > on the value of BUFFER_ID field.
> > 1) Inactivating an HPB Region (supported in host control mode only)
> > 2) prefetching HPB Entries from the host to the device (supported
> > in any control mode)
> > 3) Inactivating all HPB Regions, except for Provisioning Pinned
> > Region
> >     in the host (supported in device control mode only)
> > </quote>
> > 
> > Reverting only the problematic code (HPB 2.0) sounds reasonable to
> > me because reworking the HPB 2.0 code will be nontrivial.
> 
> Then let's please go ahead and do that. I'm assuming this is a
> smaller set than what Christoph originally posted, who's taking on
> the job of lining it up?

I was hoping the HPB guys would do this.  I think I know how to do it
based on my investigations, but I'd really prefer that someone who
cares about HPB did it.  It looks like a fairly simple excision, so I
think we can have this done before the end of the week and if it isn't
we could revert the whole driver.

> > Using BLK_MQ_F_BLOCKING might be a viable approach. However, I
> > don't want to see that flag enabled in the UFS driver if HPB is not
> > used because of the negative performance effects of that flag.
> 
> Agree, and if we do just the problematic revert, then the decision on
> whether BLK_MQ_F_BLOCKING is useful or not belongs to whoever reworks
> the WRITE BUFFER code and reposts that support.

Agreed, that was my initial proposed solution: get rid of the write
buffer optimization now to fix the API abuse and see if we can add it
back in a more acceptable form later.

James
Martin K. Petersen Oct. 26, 2021, 6:27 p.m. UTC | #10
James,

> I was hoping the HPB guys would do this.

Yes, that would be great. Easier to validate for someone with access to
the hardware in question.

> Agreed, that was my initial proposed solution: get rid of the write
> buffer optimization now to fix the API abuse and see if we can add it
> back in a more acceptable form later.

Doesn't matter to me whether we back out the 2.0 stuff or mark it as
broken. I merely objected to reverting all of HPB since I don't think
that would solve anything.

But obviously we'll need a patch to fix 5.15 ASAP...
Bart Van Assche Oct. 26, 2021, 8:10 p.m. UTC | #11
On 10/26/21 11:27 AM, Martin K. Petersen wrote:
>> Agreed, that was my initial proposed solution: get rid of the write
>> buffer optimization now to fix the API abuse and see if we can add it
>> back in a more acceptable form later.
> 
> Doesn't matter to me whether we back out the 2.0 stuff or mark it as
> broken. I merely objected to reverting all of HPB since I don't think
> that would solve anything.
> 
> But obviously we'll need a patch to fix 5.15 ASAP...

I do not have access to a test setup that supports HPB.

If blk_insert_cloned_request() is moved into the device mapper then I
think that blk_mq_request_issue_directly() will need to be exported. How
about the (totally untested) patch below for removing the
blk_insert_cloned_request() call from the UFS-HPB code?

Thanks,

Bart.


---
  block/blk-mq.c            | 1 +
  block/blk-mq.h            | 1 -
  drivers/scsi/ufs/ufshpb.c | 2 +-
  include/linux/blkdev.h    | 1 +
  4 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 108a352051be..186321f450f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2084,6 +2084,7 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)

  	return ret;
  }
+EXPORT_SYMBOL_GPL(blk_mq_request_issue_directly);

  void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
  		struct list_head *list)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d08779f77a26..ffba52189b18 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -74,7 +74,6 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
  				struct list_head *list);

  /* Used by blk_insert_cloned_request() to issue request directly */
-blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
  void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
  				    struct list_head *list);

diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
index 66b19500844e..458eadcb604f 100644
--- a/drivers/scsi/ufs/ufshpb.c
+++ b/drivers/scsi/ufs/ufshpb.c
@@ -547,7 +547,7 @@ static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
  				 read_id);
  	rq->cmd_len = scsi_command_size(rq->cmd);

-	if (blk_insert_cloned_request(q, req) != BLK_STS_OK)
+	if (blk_mq_request_issue_directly(req, true) != BLK_STS_OK)
  		return -EAGAIN;

  	hpb->stats.pre_req_cnt++;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 12b9dbcc980e..f203c7ea205b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,6 +867,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
  extern void blk_rq_unprep_clone(struct request *rq);
  extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
  				     struct request *rq);
+blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
  int blk_rq_append_bio(struct request *rq, struct bio *bio);
  extern void blk_queue_split(struct bio **);
  extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
Daejun Park Oct. 26, 2021, 10:22 p.m. UTC | #12
Hi Bart,

>On 10/26/21 11:27 AM, Martin K. Petersen wrote:
>>> Agreed, that was my initial proposed solution: get rid of the write
>>> buffer optimization now to fix the API abuse and see if we can add it
>>> back in a more acceptable form later.
>> 
>> Doesn't matter to me whether we back out the 2.0 stuff or mark it as
>> broken. I merely objected to reverting all of HPB since I don't think
>> that would solve anything.
>> 
>> But obviously we'll need a patch to fix 5.15 ASAP...
> 
>I do not have access to a test setup that supports HPB.
> 
>If blk_insert_cloned_request() is moved into the device mapper then I
>think that blk_mq_request_issue_directly() will need to be exported. How
>about the (totally untested) patch below for removing the
>blk_insert_cloned_request() call from the UFS-HPB code?

I will test whether this code works on a real device.

Thanks,
Daejun

> 
>Thanks,
> 
>Bart.
> 
> 
>---
>  block/blk-mq.c            | 1 +
>  block/blk-mq.h            | 1 -
>  drivers/scsi/ufs/ufshpb.c | 2 +-
>  include/linux/blkdev.h    | 1 +
>  4 files changed, 3 insertions(+), 2 deletions(-)
> 
>diff --git a/block/blk-mq.c b/block/blk-mq.c
>index 108a352051be..186321f450f6 100644
>--- a/block/blk-mq.c
>+++ b/block/blk-mq.c
>@@ -2084,6 +2084,7 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
> 
>          return ret;
>  }
>+EXPORT_SYMBOL_GPL(blk_mq_request_issue_directly);
> 
>  void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
>                  struct list_head *list)
>diff --git a/block/blk-mq.h b/block/blk-mq.h
>index d08779f77a26..ffba52189b18 100644
>--- a/block/blk-mq.h
>+++ b/block/blk-mq.h
>@@ -74,7 +74,6 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>                                  struct list_head *list);
> 
>  /* Used by blk_insert_cloned_request() to issue request directly */
>-blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
>  void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
>                                      struct list_head *list);
> 
>diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
>index 66b19500844e..458eadcb604f 100644
>--- a/drivers/scsi/ufs/ufshpb.c
>+++ b/drivers/scsi/ufs/ufshpb.c
>@@ -547,7 +547,7 @@ static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
>                                   read_id);
>          rq->cmd_len = scsi_command_size(rq->cmd);
> 
>-        if (blk_insert_cloned_request(q, req) != BLK_STS_OK)
>+        if (blk_mq_request_issue_directly(req, true) != BLK_STS_OK)
>                  return -EAGAIN;
> 
>          hpb->stats.pre_req_cnt++;
>diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>index 12b9dbcc980e..f203c7ea205b 100644
>--- a/include/linux/blkdev.h
>+++ b/include/linux/blkdev.h
>@@ -867,6 +867,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
>  extern void blk_rq_unprep_clone(struct request *rq);
>  extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
>                                       struct request *rq);
>+blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
>  int blk_rq_append_bio(struct request *rq, struct bio *bio);
>  extern void blk_queue_split(struct bio **);
>  extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
> 
> 
>
Christoph Hellwig Oct. 27, 2021, 5:27 a.m. UTC | #13
On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
> If blk_insert_cloned_request() is moved into the device mapper then I
> think that blk_mq_request_issue_directly() will need to be exported.

Which is even worse.

> How
> about the (totally untested) patch below for removing the
> blk_insert_cloned_request() call from the UFS-HPB code?

Which again doesn't fix anything.  The problem is that it fans out one
request into two on the same queue, not the specific interface used.

Martin:  please just take the HPB removal.  This seems to be the only
thing that makes sense given that no one from the UFS camp seems to
have the time and resources to come up with an alternative.
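To make the accounting objection concrete, here is a toy model in plain Python (not kernel code; the `Queue` class and its `submit`/`complete` counters are invented for illustration) of why issuing a clone on the same queue as its original skews per-queue in-flight accounting, while the dm-mpath pattern of cloning onto a different bottom queue does not:

```python
# Toy model of per-queue in-flight accounting (illustration only).
class Queue:
    def __init__(self):
        self.in_flight = 0      # requests currently outstanding on this queue

    def submit(self):
        self.in_flight += 1

    def complete(self):
        self.in_flight -= 1

# dm-mpath pattern: the original is accounted on the top (dm) queue and the
# clone is issued on a *different* bottom (device) queue, so each queue's
# counter reflects only its own work.
top, bottom = Queue(), Queue()
top.submit()        # original request enters the dm queue
bottom.submit()     # clone issued to the underlying device queue
bottom.complete()
top.complete()
assert top.in_flight == 0 and bottom.in_flight == 0

# Pattern under discussion: the original and the spawned clone sit on the
# *same* queue, so one logical I/O is counted twice against that queue.
q = Queue()
q.submit()          # original HPB READ
q.submit()          # spawned WRITE BUFFER clone on the same queue
print(q.in_flight)
```

This is only the accounting half of the objection; it does not model the ordering requirement that the WRITE BUFFER precede the READ it is tied to.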
James Bottomley Oct. 27, 2021, 12:20 p.m. UTC | #14
On Wed, 2021-10-27 at 07:27 +0200, Christoph Hellwig wrote:
> On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
> > If blk_insert_cloned_request() is moved into the device mapper then
> > I think that blk_mq_request_issue_directly() will need to be
> > exported.
> 
> Which is even worse.
> 
> > How about the (totally untested) patch below for removing the
> > blk_insert_cloned_request() call from the UFS-HPB code?
> 
> Which again doesn't fix anything.  The problem is that it fans out
> one request into two on the same queue, not the specific interface
> used.
> 
> Martin:  please just take the HPB removal.  This seems to be the only
> thing that makes sense given that no one from the UFS camp seems to
> have the time and resources to come up with an alternative.

We don't have to revert it entirely, we can just remove the API since
it's in an optimization path.  The below patch does exactly that.  It's
compile tested but I don't have hardware so I've no idea if it works;
Daejun can you please test it as a matter of urgency since we have to
get this in before the end of the week.

James

---

diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
index 66b19500844e..1c89eafe0c1d 100644
--- a/drivers/scsi/ufs/ufshpb.c
+++ b/drivers/scsi/ufs/ufshpb.c
@@ -365,148 +365,6 @@ static inline void ufshpb_set_write_buf_cmd(unsigned char *cdb,
 	cdb[9] = 0x00;	/* Control = 0x00 */
 }
 
-static struct ufshpb_req *ufshpb_get_pre_req(struct ufshpb_lu *hpb)
-{
-	struct ufshpb_req *pre_req;
-
-	if (hpb->num_inflight_pre_req >= hpb->throttle_pre_req) {
-		dev_info(&hpb->sdev_ufs_lu->sdev_dev,
-			 "pre_req throttle. inflight %d throttle %d",
-			 hpb->num_inflight_pre_req, hpb->throttle_pre_req);
-		return NULL;
-	}
-
-	pre_req = list_first_entry_or_null(&hpb->lh_pre_req_free,
-					   struct ufshpb_req, list_req);
-	if (!pre_req) {
-		dev_info(&hpb->sdev_ufs_lu->sdev_dev, "There is no pre_req");
-		return NULL;
-	}
-
-	list_del_init(&pre_req->list_req);
-	hpb->num_inflight_pre_req++;
-
-	return pre_req;
-}
-
-static inline void ufshpb_put_pre_req(struct ufshpb_lu *hpb,
-				      struct ufshpb_req *pre_req)
-{
-	pre_req->req = NULL;
-	bio_reset(pre_req->bio);
-	list_add_tail(&pre_req->list_req, &hpb->lh_pre_req_free);
-	hpb->num_inflight_pre_req--;
-}
-
-static void ufshpb_pre_req_compl_fn(struct request *req, blk_status_t error)
-{
-	struct ufshpb_req *pre_req = (struct ufshpb_req *)req->end_io_data;
-	struct ufshpb_lu *hpb = pre_req->hpb;
-	unsigned long flags;
-
-	if (error) {
-		struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
-		struct scsi_sense_hdr sshdr;
-
-		dev_err(&hpb->sdev_ufs_lu->sdev_dev, "block status %d", error);
-		scsi_command_normalize_sense(cmd, &sshdr);
-		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
-			"code %x sense_key %x asc %x ascq %x",
-			sshdr.response_code,
-			sshdr.sense_key, sshdr.asc, sshdr.ascq);
-		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
-			"byte4 %x byte5 %x byte6 %x additional_len %x",
-			sshdr.byte4, sshdr.byte5,
-			sshdr.byte6, sshdr.additional_length);
-	}
-
-	blk_mq_free_request(req);
-	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
-	ufshpb_put_pre_req(pre_req->hpb, pre_req);
-	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
-}
-
-static int ufshpb_prep_entry(struct ufshpb_req *pre_req, struct page *page)
-{
-	struct ufshpb_lu *hpb = pre_req->hpb;
-	struct ufshpb_region *rgn;
-	struct ufshpb_subregion *srgn;
-	__be64 *addr;
-	int offset = 0;
-	int copied;
-	unsigned long lpn = pre_req->wb.lpn;
-	int rgn_idx, srgn_idx, srgn_offset;
-	unsigned long flags;
-
-	addr = page_address(page);
-	ufshpb_get_pos_from_lpn(hpb, lpn, &rgn_idx, &srgn_idx, &srgn_offset);
-
-	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
-
-next_offset:
-	rgn = hpb->rgn_tbl + rgn_idx;
-	srgn = rgn->srgn_tbl + srgn_idx;
-
-	if (!ufshpb_is_valid_srgn(rgn, srgn))
-		goto mctx_error;
-
-	if (!srgn->mctx)
-		goto mctx_error;
-
-	copied = ufshpb_fill_ppn_from_page(hpb, srgn->mctx, srgn_offset,
-					   pre_req->wb.len - offset,
-					   &addr[offset]);
-
-	if (copied < 0)
-		goto mctx_error;
-
-	offset += copied;
-	srgn_offset += copied;
-
-	if (srgn_offset == hpb->entries_per_srgn) {
-		srgn_offset = 0;
-
-		if (++srgn_idx == hpb->srgns_per_rgn) {
-			srgn_idx = 0;
-			rgn_idx++;
-		}
-	}
-
-	if (offset < pre_req->wb.len)
-		goto next_offset;
-
-	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
-	return 0;
-mctx_error:
-	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
-	return -ENOMEM;
-}
-
-static int ufshpb_pre_req_add_bio_page(struct ufshpb_lu *hpb,
-				       struct request_queue *q,
-				       struct ufshpb_req *pre_req)
-{
-	struct page *page = pre_req->wb.m_page;
-	struct bio *bio = pre_req->bio;
-	int entries_bytes, ret;
-
-	if (!page)
-		return -ENOMEM;
-
-	if (ufshpb_prep_entry(pre_req, page))
-		return -ENOMEM;
-
-	entries_bytes = pre_req->wb.len * sizeof(__be64);
-
-	ret = bio_add_pc_page(q, bio, page, entries_bytes, 0);
-	if (ret != entries_bytes) {
-		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
-			"bio_add_pc_page fail: %d", ret);
-		return -ENOMEM;
-	}
-	return 0;
-}
-
 static inline int ufshpb_get_read_id(struct ufshpb_lu *hpb)
 {
 	if (++hpb->cur_read_id >= MAX_HPB_READ_ID)
@@ -514,88 +372,6 @@ static inline int ufshpb_get_read_id(struct ufshpb_lu *hpb)
 	return hpb->cur_read_id;
 }
 
-static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
-				  struct ufshpb_req *pre_req, int read_id)
-{
-	struct scsi_device *sdev = cmd->device;
-	struct request_queue *q = sdev->request_queue;
-	struct request *req;
-	struct scsi_request *rq;
-	struct bio *bio = pre_req->bio;
-
-	pre_req->hpb = hpb;
-	pre_req->wb.lpn = sectors_to_logical(cmd->device,
-					     blk_rq_pos(scsi_cmd_to_rq(cmd)));
-	pre_req->wb.len = sectors_to_logical(cmd->device,
-					     blk_rq_sectors(scsi_cmd_to_rq(cmd)));
-	if (ufshpb_pre_req_add_bio_page(hpb, q, pre_req))
-		return -ENOMEM;
-
-	req = pre_req->req;
-
-	/* 1. request setup */
-	blk_rq_append_bio(req, bio);
-	req->rq_disk = NULL;
-	req->end_io_data = (void *)pre_req;
-	req->end_io = ufshpb_pre_req_compl_fn;
-
-	/* 2. scsi_request setup */
-	rq = scsi_req(req);
-	rq->retries = 1;
-
-	ufshpb_set_write_buf_cmd(rq->cmd, pre_req->wb.lpn, pre_req->wb.len,
-				 read_id);
-	rq->cmd_len = scsi_command_size(rq->cmd);
-
-	if (blk_insert_cloned_request(q, req) != BLK_STS_OK)
-		return -EAGAIN;
-
-	hpb->stats.pre_req_cnt++;
-
-	return 0;
-}
-
-static int ufshpb_issue_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
-				int *read_id)
-{
-	struct ufshpb_req *pre_req;
-	struct request *req = NULL;
-	unsigned long flags;
-	int _read_id;
-	int ret = 0;
-
-	req = blk_get_request(cmd->device->request_queue,
-			      REQ_OP_DRV_OUT | REQ_SYNC, BLK_MQ_REQ_NOWAIT);
-	if (IS_ERR(req))
-		return -EAGAIN;
-
-	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
-	pre_req = ufshpb_get_pre_req(hpb);
-	if (!pre_req) {
-		ret = -EAGAIN;
-		goto unlock_out;
-	}
-	_read_id = ufshpb_get_read_id(hpb);
-	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
-
-	pre_req->req = req;
-
-	ret = ufshpb_execute_pre_req(hpb, cmd, pre_req, _read_id);
-	if (ret)
-		goto free_pre_req;
-
-	*read_id = _read_id;
-
-	return ret;
-free_pre_req:
-	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
-	ufshpb_put_pre_req(hpb, pre_req);
-unlock_out:
-	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
-	blk_put_request(req);
-	return ret;
-}
-
 /*
  * This function will set up HPB read command using host-side L2P map data.
  */
@@ -685,23 +461,6 @@ int ufshpb_prep(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
 		dev_err(hba->dev, "get ppn failed. err %d\n", err);
 		return err;
 	}
-	if (!ufshpb_is_legacy(hba) &&
-	    ufshpb_is_required_wb(hpb, transfer_len)) {
-		err = ufshpb_issue_pre_req(hpb, cmd, &read_id);
-		if (err) {
-			unsigned long timeout;
-
-			timeout = cmd->jiffies_at_alloc + msecs_to_jiffies(
-				  hpb->params.requeue_timeout_ms);
-
-			if (time_before(jiffies, timeout))
-				return -EAGAIN;
-
-			hpb->stats.miss_cnt++;
-			return 0;
-		}
-	}
-
 	ufshpb_set_hpb_read_to_upiu(hba, lrbp, ppn, transfer_len, read_id);
 
 	hpb->stats.hit_cnt++;
Bart Van Assche Oct. 27, 2021, 1:16 p.m. UTC | #15
On 10/26/21 10:27 PM, Christoph Hellwig wrote:
> On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
>> If blk_insert_cloned_request() is moved into the device mapper then I
>> think that blk_mq_request_issue_directly() will need to be exported.
> 
> Which is even worse.
> 
>> How
>> about the (totally untested) patch below for removing the
>> blk_insert_cloned_request() call from the UFS-HPB code?
> 
> Which again doesn't fix anything.  The problem is that it fans out one
> request into two on the same queue, not the specific interface used.

That patch fixes the reported issue, namely removing the additional accounting
caused by calling blk_insert_cloned_request(). Please explain why it is
considered wrong to fan out one request into two. That code could be reworked
such that the block layer is not involved as Adrian Hunter explained. However,
before someone spends time on making these changes I think that someone should
provide more information about why it is considered wrong to fan out one request
into two.

Thanks,

Bart.
Keith Busch Oct. 27, 2021, 2:12 p.m. UTC | #16
On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
> On 10/26/21 10:27 PM, Christoph Hellwig wrote:
> > On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
> > > If blk_insert_cloned_request() is moved into the device mapper then I
> > > think that blk_mq_request_issue_directly() will need to be exported.
> > 
> > Which is even worse.
> > 
> > > How
> > > about the (totally untested) patch below for removing the
> > > blk_insert_cloned_request() call from the UFS-HPB code?
> > 
> > Which again doesn't fix anything.  The problem is that it fans out one
> > request into two on the same queue, not the specific interface used.
> 
> That patch fixes the reported issue, namely removing the additional accounting
> caused by calling blk_insert_cloned_request(). Please explain why it is
> considered wrong to fan out one request into two. That code could be reworked
> such that the block layer is not involved as Adrian Hunter explained. However,
> before someone spends time on making these changes I think that someone should
> provide more information about why it is considered wrong to fan out one request
> into two.

The original request consumes a tag from that queue's tagset. If the
lifetime of that tag depends on that same queue having another free tag,
you can deadlock.
Jens Axboe Oct. 27, 2021, 2:38 p.m. UTC | #17
On 10/27/21 8:12 AM, Keith Busch wrote:
> On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
>> On 10/26/21 10:27 PM, Christoph Hellwig wrote:
>>> On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
>>>> If blk_insert_cloned_request() is moved into the device mapper then I
>>>> think that blk_mq_request_issue_directly() will need to be exported.
>>>
>>> Which is even worse.
>>>
>>>> How
>>>> about the (totally untested) patch below for removing the
>>>> blk_insert_cloned_request() call from the UFS-HPB code?
>>>
>>> Which again doesn't fix anything.  The problem is that it fans out one
>>> request into two on the same queue, not the specific interface used.
>>
>> That patch fixes the reported issue, namely removing the additional accounting
>> caused by calling blk_insert_cloned_request(). Please explain why it is
>> considered wrong to fan out one request into two. That code could be reworked
>> such that the block layer is not involved as Adrian Hunter explained. However,
>> before someone spends time on making these changes I think that someone should
>> provide more information about why it is considered wrong to fan out one request
>> into two.
> 
> The original request consumes a tag from that queue's tagset. If the
> lifetime of that tag depends on that same queue having another free tag,
> you can deadlock.

And to expand on that, one potential solution would be to require as
many reserved tags as normal tags, and have the cloned/direct insert
grab requests out of the reserved pool. That guarantees the forward
progress that is currently violated by randomly requiring an extra tag.
James Bottomley Oct. 27, 2021, 2:43 p.m. UTC | #18
On Wed, 2021-10-27 at 08:38 -0600, Jens Axboe wrote:
> On 10/27/21 8:12 AM, Keith Busch wrote:
> > On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
> > > On 10/26/21 10:27 PM, Christoph Hellwig wrote:
> > > > On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche
> > > > wrote:
> > > > > If blk_insert_cloned_request() is moved into the device
> > > > > mapper then I think that blk_mq_request_issue_directly() will
> > > > > need to be exported.
> > > > 
> > > > Which is even worse.
> > > > 
> > > > > How
> > > > > about the (totally untested) patch below for removing the
> > > > > blk_insert_cloned_request() call from the UFS-HPB code?
> > > > 
> > > > Which again doesn't fix anything.  The problem is that it fans
> > > > out one request into two on the same queue, not the specific
> > > > interface used.
> > > 
> > > That patch fixes the reported issue, namely removing the
> > > additional accounting caused by calling
> > > blk_insert_cloned_request(). Please explain why it is considered
> > > wrong to fan out one request into two. That code could be
> > > reworked such that the block layer is not involved as Adrian
> > > Hunter explained. However, before someone spends time on making
> > > these changes I think that someone should provide more
> > > information about why it is considered wrong to fan out one
> > > request into two.
> > 
> > The original request consumes a tag from that queue's tagset. If
> > the lifetime of that tag depends on that same queue having another
> > free tag, you can deadlock.
> 
> And to expand on that, one potential solution would be to require as
> many reserved tags as normal tags, and have the cloned/direct insert
> grab requests out of the reserved pool. That guarantees the forward
> progress that is currently violated by randomly requiring an extra
> tag.

Another potential way of doing this might be to transmit the buffer
information as a separate scatterlist, a bit like DIF/DIX.  At the
bottom of the stack, the write buffer has to complete before the
dependent command so we could actually enforce that right at the driver
level using the same tag since the two commands are sequential.

James
Ming Lei Oct. 27, 2021, 3:03 p.m. UTC | #19
On Wed, Oct 27, 2021 at 07:12:31AM -0700, Keith Busch wrote:
> On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
> > On 10/26/21 10:27 PM, Christoph Hellwig wrote:
> > > On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
> > > > If blk_insert_cloned_request() is moved into the device mapper then I
> > > > think that blk_mq_request_issue_directly() will need to be exported.
> > > 
> > > Which is even worse.
> > > 
> > > > How
> > > > about the (totally untested) patch below for removing the
> > > > blk_insert_cloned_request() call from the UFS-HPB code?
> > > 
> > > Which again doesn't fix anything.  The problem is that it fans out one
> > > request into two on the same queue, not the specific interface used.
> > 
> > That patch fixes the reported issue, namely removing the additional accounting
> > caused by calling blk_insert_cloned_request(). Please explain why it is
> > considered wrong to fan out one request into two. That code could be reworked
> > such that the block layer is not involved as Adrian Hunter explained. However,
> > before someone spends time on making these changes I think that someone should
> > provide more information about why it is considered wrong to fan out one request
> > into two.
> 
> The original request consumes a tag from that queue's tagset. If the
> lifetime of that tag depends on that same queue having another free tag,
> you can deadlock.

Taking a quick look at the code: if the spawned request can't be allocated, SCSI will return BLK_STS_RESOURCE for the original SCSI request, which will then be retried by blk-mq.

So if the tag depth is > 1 and the number of inflight write buffer commands is limited to 1, there shouldn't be a deadlock.

Or is it possible to reuse the original SCSI request's tag for the spawned request, like the trick used for inserting flush requests?
[Editor's note: line kept; see rewritten paragraph below.]


Thanks,
Ming
Jens Axboe Oct. 27, 2021, 3:06 p.m. UTC | #20
On 10/27/21 9:03 AM, Ming Lei wrote:
> On Wed, Oct 27, 2021 at 07:12:31AM -0700, Keith Busch wrote:
>> On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
>>> On 10/26/21 10:27 PM, Christoph Hellwig wrote:
>>>> On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
>>>>> If blk_insert_cloned_request() is moved into the device mapper then I
>>>>> think that blk_mq_request_issue_directly() will need to be exported.
>>>>
>>>> Which is even worse.
>>>>
>>>>> How
>>>>> about the (totally untested) patch below for removing the
>>>>> blk_insert_cloned_request() call from the UFS-HPB code?
>>>>
>>>> Which again doesn't fix anything.  The problem is that it fans out one
>>>> request into two on the same queue, not the specific interface used.
>>>
>>> That patch fixes the reported issue, namely removing the additional accounting
>>> caused by calling blk_insert_cloned_request(). Please explain why it is
>>> considered wrong to fan out one request into two. That code could be reworked
>>> such that the block layer is not involved as Adrian Hunter explained. However,
>>> before someone spends time on making these changes I think that someone should
>>> provide more information about why it is considered wrong to fan out one request
>>> into two.
>>
>> The original request consumes a tag from that queue's tagset. If the
>> lifetime of that tag depends on that same queue having another free tag,
>> you can deadlock.
> 
> Just take a quick look at the code, if the spawned request can't be allocated,
> scsi will return BLK_STS_RESOURCE for the original scsi request which will be
> retried later by blk-mq.
> 
> So if tag depth is > 1 and max allowed inflight write buffer command is limited
> as 1, there shouldn't be the deadlock.
> 
> Or is it possible to reuse the original scsi request's tag for the
> spawned request? Like the trick used in inserting flush request.

The flush approach did come to mind here as well, but honestly that one is very ugly and would never have been permitted if it weren't already confined to the very core code. But yes, reuse of the existing request is probably another potentially viable approach. My worry there is that you inevitably end up needing to stash a lot of data to restore the original, and we're certainly not adding anything to struct request for that.

Hence I think being able to find a new request reliably would be better.
Ming Lei Oct. 27, 2021, 3:16 p.m. UTC | #21
On Wed, Oct 27, 2021 at 09:06:05AM -0600, Jens Axboe wrote:
> On 10/27/21 9:03 AM, Ming Lei wrote:
> > On Wed, Oct 27, 2021 at 07:12:31AM -0700, Keith Busch wrote:
> >> On Wed, Oct 27, 2021 at 06:16:19AM -0700, Bart Van Assche wrote:
> >>> On 10/26/21 10:27 PM, Christoph Hellwig wrote:
> >>>> On Tue, Oct 26, 2021 at 01:10:47PM -0700, Bart Van Assche wrote:
> >>>>> If blk_insert_cloned_request() is moved into the device mapper then I
> >>>>> think that blk_mq_request_issue_directly() will need to be exported.
> >>>>
> >>>> Which is even worse.
> >>>>
> >>>>> How
> >>>>> about the (totally untested) patch below for removing the
> >>>>> blk_insert_cloned_request() call from the UFS-HPB code?
> >>>>
> >>>> Which again doesn't fix anything.  The problem is that it fans out one
> >>>> request into two on the same queue, not the specific interface used.
> >>>
> >>> That patch fixes the reported issue, namely removing the additional accounting
> >>> caused by calling blk_insert_cloned_request(). Please explain why it is
> >>> considered wrong to fan out one request into two. That code could be reworked
> >>> such that the block layer is not involved as Adrian Hunter explained. However,
> >>> before someone spends time on making these changes I think that someone should
> >>> provide more information about why it is considered wrong to fan out one request
> >>> into two.
> >>
> >> The original request consumes a tag from that queue's tagset. If the
> >> lifetime of that tag depends on that same queue having another free tag,
> >> you can deadlock.
> > 
> > Just take a quick look at the code, if the spawned request can't be allocated,
> > scsi will return BLK_STS_RESOURCE for the original scsi request which will be
> > retried later by blk-mq.
> > 
> > So if tag depth is > 1 and max allowed inflight write buffer command is limited
> > as 1, there shouldn't be the deadlock.
> > 
> > Or is it possible to reuse the original scsi request's tag for the
> > spawned request? Like the trick used in inserting flush request.
> 
> The flush approach did come to mind here as well, but honestly that one is
> very ugly and would never have been permitted if it wasn't excluded to be
> in the very core code already. But yes, reuse of the existing request is
> probably another potentially viable approach. My worry there is that
> inevitably you end up needing to stash a lot of data to restore the original,
> and we're certainly not adding anything to struct request for that.
> 
> Hence I think being able to find a new request reliably would be better.

A request with a scsi_cmnd may be allocated by the ufshpb driver; it should even be fine to call ufshcd_queuecommand() directly for this driver-private IO, if the tag can be reused. One example is scsi_ioctl_reset().


Thanks,
Ming
Martin K. Petersen Oct. 27, 2021, 3:35 p.m. UTC | #22
Jens,

> But yes, reuse of the existing request is probably another potentially
> viable approach. My worry there is that inevitably you end up needing
> to stash a lot of data to restore the original, and we're certainly
> not adding anything to struct request for that.

Yeah, I much prefer the reserved tag approach. That was my original
recommendation.

SCSI error handling does command hijacking and it is absolutely
dreadful.
Jens Axboe Oct. 27, 2021, 3:40 p.m. UTC | #23
On 10/27/21 9:35 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> But yes, reuse of the existing request is probably another potentially
>> viable approach. My worry there is that inevitably you end up needing
>> to stash a lot of data to restore the original, and we're certainly
>> not adding anything to struct request for that.
> 
> Yeah, I much prefer the reserved tag approach. That was my original
> recommendation.
> 
> SCSI error handling does command hijacking and it is absolutely
> dreadful.

Then let's make sure we nudge it in that direction! It'd be feasible to have fewer reserved tags; you only need as many as the number of these special commands you want inflight. Beyond that, returning BUSY and just retrying when a request completes should be fine. Hence I'd size the reserved tag pool appropriately, depending on what kind of performance is expected out of this, with just one reserved tag being enough to give us the guarantees we need for forward progress.

I think the plan forward is clear here then:

1) Revert the optimization that requires the use of cloned insert for
   5.15.
2) Re-write the optimization using reserved tags, post 5.15 obviously.
Martin K. Petersen Oct. 27, 2021, 3:44 p.m. UTC | #24
Ming,

> request with scsi_cmnd may be allocated by the ufshpb driver, even it
> should be fine to call ufshcd_queuecommand() directly for this driver
> private IO, if the tag can be reused. One example is scsi_ioctl_reset().

scsi_ioctl_reset() allocates a new request, though, so that doesn't
solve the forward progress guarantee. Whereas eh puts the saved request
on the stack.
Ming Lei Oct. 27, 2021, 3:58 p.m. UTC | #25
On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
> 
> Ming,
> 
> > request with scsi_cmnd may be allocated by the ufshpb driver, even it
> > should be fine to call ufshcd_queuecommand() directly for this driver
> > private IO, if the tag can be reused. One example is scsi_ioctl_reset().
> 
> scsi_ioctl_reset() allocates a new request, though, so that doesn't
> solve the forward progress guarantee. Whereas eh puts the saved request
> on the stack.

What I meant is to use a fully ufshpb-private command, allocated from a private slab, to replace the spawned request. It would be sent to ufshcd_queuecommand() directly, so forward progress is guaranteed if the blk-mq request's tag can be reused for issuing this private command. This approach takes a bit of effort, but avoids tag reservation.

Yeah, it is cleaner to use a reserved tag for the spawned request, but we need to know:

1) What is the queue depth of the HBA? If it is small, even one reservation can affect performance.

2) How many inflight write buffer commands are to be supported? Or how many are enough to obtain the expected performance? If the number is big, reserved tags can't work.



Thanks,
Ming
Martin K. Petersen Oct. 27, 2021, 4:16 p.m. UTC | #26
Jens,

> It'd be feasible to have less reserved tags, you only need as many as
> you want to have these special commands inflight.  Post that,
> returning BUSY and just retrying when a request completes should be
> fine. Hence I'd size the reserved tag pool appropriately depending on
> what kind of performance is expected out of this, with just 1 reserved
> tag being enough to give us the guarantees we need for forward
> progress.

Yep. Since this is labeled as a performance feature I do think it's
important to find the right balance between reserved and real tags.

> I think the plan forward is clear here then:
>
> 1) Revert the optimization that requires the use of cloned insert for
>    5.15.

Given that HPB developed over time, I am not sure how simple a revert
would be. And we only have a couple of days left before release. I
really want the smallest patch possible that either removes or disables
the 2.0 support.
Keith Busch Oct. 27, 2021, 4:16 p.m. UTC | #27
On Wed, Oct 27, 2021 at 11:58:23PM +0800, Ming Lei wrote:
> On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
> > 
> > Ming,
> > 
> > > request with scsi_cmnd may be allocated by the ufshpb driver, even it
> > > should be fine to call ufshcd_queuecommand() directly for this driver
> > > private IO, if the tag can be reused. One example is scsi_ioctl_reset().
> > 
> > scsi_ioctl_reset() allocates a new request, though, so that doesn't
> > solve the forward progress guarantee. Whereas eh puts the saved request
> > on the stack.
> 
> What I meant is to use one totally ufshpb private command allocated from
> private slab to replace the spawned request, which is sent to ufshcd_queuecommand()
> directly, so forward progress is guaranteed if the blk-mq request's tag can be
> reused for issuing this private command. This approach takes a bit effort,
> but avoids tags reservation.
> 
> Yeah, it is cleaner to use reserved tag for the spawned request, but we
> need to know:
> 
> 1) how many queue depth for the hba? If it is small, even 1 reservation
> can affect performance.
> 
> 2) how many inflight write buffer commands are to be supported? Or how many
> is enough for obtaining expected performance? If the number is big, reserved
> tags can't work.

The original and clone are not dispatched to hardware concurrently, so I
don't think the reserved_tags need to subtract from the generic ones.
The original request already accounts for the hardware resource, so the
clone doesn't need to consume another one.
Jens Axboe Oct. 27, 2021, 4:19 p.m. UTC | #28
On 10/27/21 10:16 AM, Keith Busch wrote:
> On Wed, Oct 27, 2021 at 11:58:23PM +0800, Ming Lei wrote:
>> On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
>>>
>>> Ming,
>>>
>>>> request with scsi_cmnd may be allocated by the ufshpb driver, even it
>>>> should be fine to call ufshcd_queuecommand() directly for this driver
>>>> private IO, if the tag can be reused. One example is scsi_ioctl_reset().
>>>
>>> scsi_ioctl_reset() allocates a new request, though, so that doesn't
>>> solve the forward progress guarantee. Whereas eh puts the saved request
>>> on the stack.
>>
>> What I meant is to use one totally ufshpb private command allocated from
>> private slab to replace the spawned request, which is sent to ufshcd_queuecommand()
>> directly, so forward progress is guaranteed if the blk-mq request's tag can be
>> reused for issuing this private command. This approach takes a bit effort,
>> but avoids tags reservation.
>>
>> Yeah, it is cleaner to use reserved tag for the spawned request, but we
>> need to know:
>>
>> 1) how many queue depth for the hba? If it is small, even 1 reservation
>> can affect performance.
>>
>> 2) how many inflight write buffer commands are to be supported? Or how many
>> is enough for obtaining expected performance? If the number is big, reserved
>> tags can't work.
> 
> The original and clone are not dispatched to hardware concurrently, so
> I don't think the reserved_tags need to subtract from the generic
> ones. The original request already accounts for the hardware resource,
> so the clone doesn't need to consume another one.

Maybe I didn't phrase it clearly enough, because that's not what I meant. My point is that just one reserved tag is fine for the correctness guarantee; you'd then just have only one of these special requests inflight at a time, and that may be a performance concern. More reserved tags would allow more to be inflight at the same time. This is totally independent of the normal tags available.
Martin K. Petersen Oct. 27, 2021, 4:59 p.m. UTC | #29
Ming,

> What I meant is to use one totally ufshpb private command allocated
> from private slab to replace the spawned request, which is sent to
> ufshcd_queuecommand() directly, so forward progress is guaranteed if
> the blk-mq request's tag can be reused for issuing this private
> command. This approach takes a bit effort, but avoids tags
> reservation.

Yep, I was just noting that error handling was a better example than the
ioctl given the context.

> 1) how many queue depth for the hba? If it is small, even 1 reservation
> can affect performance.

Indeed. But there is no free lunch.

> 2) how many inflight write buffer commands are to be supported? Or how
> many is enough for obtaining expected performance? If the number is
> big, reserved tags can't work.

That really is something the UFS developers need to work out. I am not
sure whether permanently pinning down memory is a bigger problem than
queue tags being a finite resource.
Bart Van Assche Oct. 27, 2021, 5:01 p.m. UTC | #30
On 10/27/21 9:16 AM, Martin K. Petersen wrote:
> Given that HPB developed over time, I am not sure how simple a revert
> would be. And we only have a couple of days left before release. I
> really want the smallest patch possible that either removes or disables
> the 2.0 support.

How about one of the untested patches below?

The patch below disables support for HPB 2.0 by ignoring the HPB version reported
by the UFS controller:

diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
index 66b19500844e..5f9f7139480a 100644
--- a/drivers/scsi/ufs/ufshpb.c
+++ b/drivers/scsi/ufs/ufshpb.c
@@ -2872,8 +2872,8 @@ void ufshpb_get_dev_info(struct ufs_hba *hba, u8 *desc_buf)
  		return;
  	}

-	if (version == HPB_SUPPORT_LEGACY_VERSION)
-		hpb_dev_info->is_legacy = true;
+	/* Do not use HPB 2.0 because of the blk_insert_cloned_request() call. */
+	hpb_dev_info->is_legacy = true;

  	ret = ufshcd_query_attr_retry(hba, UPIU_QUERY_OPCODE_READ_ATTR,
  		QUERY_ATTR_IDN_MAX_HPB_SINGLE_CMD, 0, 0, &max_hpb_single_cmd);


The second patch changes the blk_insert_cloned_request() call into a
blk_execute_rq_nowait() call. That should work fine since this function
bypasses the I/O scheduler for passthrough requests:

diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
index 66b19500844e..9e16d1321b34 100644
--- a/drivers/scsi/ufs/ufshpb.c
+++ b/drivers/scsi/ufs/ufshpb.c
@@ -547,8 +547,7 @@ static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
  				 read_id);
  	rq->cmd_len = scsi_command_size(rq->cmd);

-	if (blk_insert_cloned_request(q, req) != BLK_STS_OK)
-		return -EAGAIN;
+	blk_execute_rq_nowait(req->rq_disk, req, true, req->end_io);

  	hpb->stats.pre_req_cnt++;

Thanks,

Bart.
Ming Lei Oct. 28, 2021, 12:42 a.m. UTC | #31
On Wed, Oct 27, 2021 at 09:16:32AM -0700, Keith Busch wrote:
> On Wed, Oct 27, 2021 at 11:58:23PM +0800, Ming Lei wrote:
> > On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
> > > 
> > > Ming,
> > > 
> > > > request with scsi_cmnd may be allocated by the ufshpb driver, even it
> > > > should be fine to call ufshcd_queuecommand() directly for this driver
> > > > private IO, if the tag can be reused. One example is scsi_ioctl_reset().
> > > 
> > > scsi_ioctl_reset() allocates a new request, though, so that doesn't
> > > solve the forward progress guarantee. Whereas eh puts the saved request
> > > on the stack.
> > 
> > What I meant is to use one totally ufshpb private command allocated from
> > private slab to replace the spawned request, which is sent to ufshcd_queuecommand()
> > directly, so forward progress is guaranteed if the blk-mq request's tag can be
> > reused for issuing this private command. This approach takes a bit effort,
> > but avoids tags reservation.
> > 
> > Yeah, it is cleaner to use reserved tag for the spawned request, but we
> > need to know:
> > 
> > 1) how many queue depth for the hba? If it is small, even 1 reservation
> > can affect performance.
> > 
> > 2) how many inflight write buffer commands are to be supported? Or how many
> > is enough for obtaining expected performance? If the number is big, reserved
> > tags can't work.
> 
> The original and clone are not dispatched to hardware concurrently, so I
> don't think the reserved_tags need to subtract from the generic ones.
> The original request already accounts for the hardware resource, so the
> clone doesn't need to consume another one.

Yeah, that is why I thought the tag could be reused for the spawned (cloned) request, but it needs a ufshpb developer to confirm that, or at least that ufshcd_queuecommand() can handle this situation. If that is true, it isn't necessary to use reserved tags, since the current blk-mq implementation reserves tags out of the real hardware tag space, which takes away normal tags.


thanks,
Ming
Daejun Park Oct. 28, 2021, 1:10 a.m. UTC | #32
Hi Ming,

> On Wed, Oct 27, 2021 at 09:16:32AM -0700, Keith Busch wrote:
> > On Wed, Oct 27, 2021 at 11:58:23PM +0800, Ming Lei wrote:
> > > On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
> > > > 
> > > > Ming,
> > > > 
> > > > > request with scsi_cmnd may be allocated by the ufshpb driver, even it
> > > > > should be fine to call ufshcd_queuecommand() directly for this driver
> > > > > private IO, if the tag can be reused. One example is scsi_ioctl_reset().
> > > > 
> > > > scsi_ioctl_reset() allocates a new request, though, so that doesn't
> > > > solve the forward progress guarantee. Whereas eh puts the saved request
> > > > on the stack.
> > > 
> > > What I meant is to use one totally ufshpb private command allocated from
> > > private slab to replace the spawned request, which is sent to ufshcd_queuecommand()
> > > directly, so forward progress is guaranteed if the blk-mq request's tag can be
> > > reused for issuing this private command. This approach takes a bit effort,
> > > but avoids tags reservation.
> > > 
> > > Yeah, it is cleaner to use reserved tag for the spawned request, but we
> > > need to know:
> > > 
> > > 1) how many queue depth for the hba? If it is small, even 1 reservation
> > > can affect performance.
> > > 
> > > 2) how many inflight write buffer commands are to be supported? Or how many
> > > is enough for obtaining expected performance? If the number is big, reserved
> > > tags can't work.
> > 
> > The original and clone are not dispatched to hardware concurrently, so I
> > don't think the reserved_tags need to subtract from the generic ones.
> > The original request already accounts for the hardware resource, so the
> > clone doesn't need to consume another one.
>  
> Yeah, that is why I thought the tag could be reused for the spawned(cloned)
> request, but it needs ufshpb developer to confirm, or at least
> ufshcd_queuecommand() can handle this situation. If that is true, it isn't
> necessary to use reserve tags, since the current blk-mq implementation
> requires to reserve real hardware tags space, which has to take normal
> tags.

It is true that the pre-request can use the tag of the READ request, but then the READ request has to wait for completion of the pre-request command. If, however, the pre-request and the READ request are dispatched concurrently, the time to completion of the pre-request can be saved.

So I implemented it by allocating a new request; there is a time limit on getting a pre-request, so it doesn't cause a deadlock.

Thanks,
Daejun
Ming Lei Oct. 28, 2021, 1:32 a.m. UTC | #33
On Wed, Oct 27, 2021 at 10:01:42AM -0700, Bart Van Assche wrote:
> On 10/27/21 9:16 AM, Martin K. Petersen wrote:
> > Given that HPB developed over time, I am not sure how simple a revert
> > would be. And we only have a couple of days left before release. I
> > really want the smallest patch possible that either removes or disables
> > the 2.0 support.
> 
> How about one of the untested patches below?
> 
> The patch below disables support for HPB 2.0 by ignoring the HPB version reported
> by the UFS controller:
> 
> diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
> index 66b19500844e..5f9f7139480a 100644
> --- a/drivers/scsi/ufs/ufshpb.c
> +++ b/drivers/scsi/ufs/ufshpb.c
> @@ -2872,8 +2872,8 @@ void ufshpb_get_dev_info(struct ufs_hba *hba, u8 *desc_buf)
>  		return;
>  	}
> 
> -	if (version == HPB_SUPPORT_LEGACY_VERSION)
> -		hpb_dev_info->is_legacy = true;
> +	/* Do not use HPB 2.0 because of the blk_insert_cloned_request() call. */
> +	hpb_dev_info->is_legacy = true;

I guess you could simply change ufshpb_is_required_wb() to return false, with a comment.

> 
>  	ret = ufshcd_query_attr_retry(hba, UPIU_QUERY_OPCODE_READ_ATTR,
>  		QUERY_ATTR_IDN_MAX_HPB_SINGLE_CMD, 0, 0, &max_hpb_single_cmd);
> 
> 
> The second patch changes the blk_insert_cloned_request() call into a
> blk_execute_rq_nowait() call. That should work fine since this function
> bypasses the I/O scheduler for passthrough requests:

Whether ->is_legacy is set to true or ufshpb_is_required_wb() returns false, blk_insert_cloned_request() won't be called. But blk_execute_rq_nowait() should be used here, since this is a driver-private IO.

That also shows that the pre_req private command runs concurrently with the original FS IO, so two tags are consumed to do one IO. It could be done one at a time, but I guess that would be a bit slower; I just saw that Daejun replied on this point.


thanks,
Ming
Ming Lei Oct. 28, 2021, 2:07 a.m. UTC | #34
On Thu, Oct 28, 2021 at 10:10:22AM +0900, Daejun Park wrote:
> Hi Ming,
> 
> > On Wed, Oct 27, 2021 at 09:16:32AM -0700, Keith Busch wrote:
> > > On Wed, Oct 27, 2021 at 11:58:23PM +0800, Ming Lei wrote:
> > > > On Wed, Oct 27, 2021 at 11:44:04AM -0400, Martin K. Petersen wrote:
> > > > > 
> > > > > Ming,
> > > > > 
> > > > > > request with scsi_cmnd may be allocated by the ufshpb driver, even it
> > > > > > should be fine to call ufshcd_queuecommand() directly for this driver
> > > > > > private IO, if the tag can be reused. One example is scsi_ioctl_reset().
> > > > > 
> > > > > scsi_ioctl_reset() allocates a new request, though, so that doesn't
> > > > > solve the forward progress guarantee. Whereas eh puts the saved request
> > > > > on the stack.
> > > > 
> > > > What I meant is to use one totally ufshpb private command allocated from
> > > > private slab to replace the spawned request, which is sent to ufshcd_queuecommand()
> > > > directly, so forward progress is guaranteed if the blk-mq request's tag can be
> > > > reused for issuing this private command. This approach takes a bit effort,
> > > > but avoids tags reservation.
> > > > 
> > > > Yeah, it is cleaner to use reserved tag for the spawned request, but we
> > > > need to know:
> > > > 
> > > > 1) how many queue depth for the hba? If it is small, even 1 reservation
> > > > can affect performance.
> > > > 
> > > > 2) how many inflight write buffer commands are to be supported? Or how many
> > > > is enough for obtaining expected performance? If the number is big, reserved
> > > > tags can't work.
> > > 
> > > The original and clone are not dispatched to hardware concurrently, so I
> > > don't think the reserved_tags need to subtract from the generic ones.
> > > The original request already accounts for the hardware resource, so the
> > > clone doesn't need to consume another one.
> >  
> > Yeah, that is why I thought the tag could be reused for the spawned(cloned)
> > request, but it needs ufshpb developer to confirm, or at least
> > ufshcd_queuecommand() can handle this situation. If that is true, it isn't
> > necessary to use reserve tags, since the current blk-mq implementation
> > requires to reserve real hardware tags space, which has to take normal
> > tags.
> 
> It is true that the pre-request can reuse the tag of the READ request, but
> then the READ request has to wait for completion of the pre-request command.
> However, if the pre-request and the READ request are dispatched
> concurrently, the time to completion of the pre-request is saved.
> 
> So I implemented it by allocating a new request, and there is a time limit
> on getting the pre-request, so it doesn't cause a deadlock.

The WB pre-request and the READ IO should have been issued as a single
command with the same tag.

If they can't be done in a single command, this sw/hw interface is really
bad. One way forward is to follow Jens's suggestion to reserve one tag for
guaranteeing forward progress, something like the following:

ufshpb_issue_pre_req()

	req = blk_get_request(cmd->device->request_queue,
			      REQ_OP_DRV_OUT | REQ_SYNC, BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(req)) {
		req = blk_get_request(cmd->device->request_queue,
				      REQ_OP_DRV_OUT | REQ_SYNC,
				      BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);
		if (IS_ERR(req))
			return -EAGAIN;
	}

But the above isn't what Christoph complained about:

	The HPB support added this merge window is fundamentally flawed as it
	uses blk_insert_cloned_request to insert a cloned request onto the same
	queue as the one that the original request came from, leading to all
	kinds of issues in blk-mq accounting (in addition to this API being
	a special case for dm-mpath that should not see other users).

IMO, the accounting seems fine, since the block layer always accounts for
driver-private requests.


thanks,
Ming
Bart Van Assche Oct. 28, 2021, 8:21 p.m. UTC | #35
On 10/27/21 5:20 AM, James Bottomley wrote:
> We don't have to revert it entirely; we can just remove the API use, since
> it's in an optimization path.  The patch below does exactly that.  It's
> compile-tested, but I don't have hardware so I have no idea if it works.
> Daejun, can you please test it as a matter of urgency, since we have to
> get this in before the end of the week.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
> index 66b19500844e..1c89eafe0c1d 100644
> --- a/drivers/scsi/ufs/ufshpb.c
> +++ b/drivers/scsi/ufs/ufshpb.c
> @@ -365,148 +365,6 @@ static inline void ufshpb_set_write_buf_cmd(unsigned char *cdb,
>   	cdb[9] = 0x00;	/* Control = 0x00 */
>   }
>   
> -static struct ufshpb_req *ufshpb_get_pre_req(struct ufshpb_lu *hpb)
> -{
> -	struct ufshpb_req *pre_req;
> -
> -	if (hpb->num_inflight_pre_req >= hpb->throttle_pre_req) {
> -		dev_info(&hpb->sdev_ufs_lu->sdev_dev,
> -			 "pre_req throttle. inflight %d throttle %d",
> -			 hpb->num_inflight_pre_req, hpb->throttle_pre_req);
> -		return NULL;
> -	}
> -
> -	pre_req = list_first_entry_or_null(&hpb->lh_pre_req_free,
> -					   struct ufshpb_req, list_req);
> -	if (!pre_req) {
> -		dev_info(&hpb->sdev_ufs_lu->sdev_dev, "There is no pre_req");
> -		return NULL;
> -	}
> -
> -	list_del_init(&pre_req->list_req);
> -	hpb->num_inflight_pre_req++;
> -
> -	return pre_req;
> -}
> -
> -static inline void ufshpb_put_pre_req(struct ufshpb_lu *hpb,
> -				      struct ufshpb_req *pre_req)
> -{
> -	pre_req->req = NULL;
> -	bio_reset(pre_req->bio);
> -	list_add_tail(&pre_req->list_req, &hpb->lh_pre_req_free);
> -	hpb->num_inflight_pre_req--;
> -}
> -
> -static void ufshpb_pre_req_compl_fn(struct request *req, blk_status_t error)
> -{
> -	struct ufshpb_req *pre_req = (struct ufshpb_req *)req->end_io_data;
> -	struct ufshpb_lu *hpb = pre_req->hpb;
> -	unsigned long flags;
> -
> -	if (error) {
> -		struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> -		struct scsi_sense_hdr sshdr;
> -
> -		dev_err(&hpb->sdev_ufs_lu->sdev_dev, "block status %d", error);
> -		scsi_command_normalize_sense(cmd, &sshdr);
> -		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
> -			"code %x sense_key %x asc %x ascq %x",
> -			sshdr.response_code,
> -			sshdr.sense_key, sshdr.asc, sshdr.ascq);
> -		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
> -			"byte4 %x byte5 %x byte6 %x additional_len %x",
> -			sshdr.byte4, sshdr.byte5,
> -			sshdr.byte6, sshdr.additional_length);
> -	}
> -
> -	blk_mq_free_request(req);
> -	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
> -	ufshpb_put_pre_req(pre_req->hpb, pre_req);
> -	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
> -}
> -
> -static int ufshpb_prep_entry(struct ufshpb_req *pre_req, struct page *page)
> -{
> -	struct ufshpb_lu *hpb = pre_req->hpb;
> -	struct ufshpb_region *rgn;
> -	struct ufshpb_subregion *srgn;
> -	__be64 *addr;
> -	int offset = 0;
> -	int copied;
> -	unsigned long lpn = pre_req->wb.lpn;
> -	int rgn_idx, srgn_idx, srgn_offset;
> -	unsigned long flags;
> -
> -	addr = page_address(page);
> -	ufshpb_get_pos_from_lpn(hpb, lpn, &rgn_idx, &srgn_idx, &srgn_offset);
> -
> -	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
> -
> -next_offset:
> -	rgn = hpb->rgn_tbl + rgn_idx;
> -	srgn = rgn->srgn_tbl + srgn_idx;
> -
> -	if (!ufshpb_is_valid_srgn(rgn, srgn))
> -		goto mctx_error;
> -
> -	if (!srgn->mctx)
> -		goto mctx_error;
> -
> -	copied = ufshpb_fill_ppn_from_page(hpb, srgn->mctx, srgn_offset,
> -					   pre_req->wb.len - offset,
> -					   &addr[offset]);
> -
> -	if (copied < 0)
> -		goto mctx_error;
> -
> -	offset += copied;
> -	srgn_offset += copied;
> -
> -	if (srgn_offset == hpb->entries_per_srgn) {
> -		srgn_offset = 0;
> -
> -		if (++srgn_idx == hpb->srgns_per_rgn) {
> -			srgn_idx = 0;
> -			rgn_idx++;
> -		}
> -	}
> -
> -	if (offset < pre_req->wb.len)
> -		goto next_offset;
> -
> -	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
> -	return 0;
> -mctx_error:
> -	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
> -	return -ENOMEM;
> -}
> -
> -static int ufshpb_pre_req_add_bio_page(struct ufshpb_lu *hpb,
> -				       struct request_queue *q,
> -				       struct ufshpb_req *pre_req)
> -{
> -	struct page *page = pre_req->wb.m_page;
> -	struct bio *bio = pre_req->bio;
> -	int entries_bytes, ret;
> -
> -	if (!page)
> -		return -ENOMEM;
> -
> -	if (ufshpb_prep_entry(pre_req, page))
> -		return -ENOMEM;
> -
> -	entries_bytes = pre_req->wb.len * sizeof(__be64);
> -
> -	ret = bio_add_pc_page(q, bio, page, entries_bytes, 0);
> -	if (ret != entries_bytes) {
> -		dev_err(&hpb->sdev_ufs_lu->sdev_dev,
> -			"bio_add_pc_page fail: %d", ret);
> -		return -ENOMEM;
> -	}
> -	return 0;
> -}
> -
>   static inline int ufshpb_get_read_id(struct ufshpb_lu *hpb)
>   {
>   	if (++hpb->cur_read_id >= MAX_HPB_READ_ID)
> @@ -514,88 +372,6 @@ static inline int ufshpb_get_read_id(struct ufshpb_lu *hpb)
>   	return hpb->cur_read_id;
>   }
>   
> -static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
> -				  struct ufshpb_req *pre_req, int read_id)
> -{
> -	struct scsi_device *sdev = cmd->device;
> -	struct request_queue *q = sdev->request_queue;
> -	struct request *req;
> -	struct scsi_request *rq;
> -	struct bio *bio = pre_req->bio;
> -
> -	pre_req->hpb = hpb;
> -	pre_req->wb.lpn = sectors_to_logical(cmd->device,
> -					     blk_rq_pos(scsi_cmd_to_rq(cmd)));
> -	pre_req->wb.len = sectors_to_logical(cmd->device,
> -					     blk_rq_sectors(scsi_cmd_to_rq(cmd)));
> -	if (ufshpb_pre_req_add_bio_page(hpb, q, pre_req))
> -		return -ENOMEM;
> -
> -	req = pre_req->req;
> -
> -	/* 1. request setup */
> -	blk_rq_append_bio(req, bio);
> -	req->rq_disk = NULL;
> -	req->end_io_data = (void *)pre_req;
> -	req->end_io = ufshpb_pre_req_compl_fn;
> -
> -	/* 2. scsi_request setup */
> -	rq = scsi_req(req);
> -	rq->retries = 1;
> -
> -	ufshpb_set_write_buf_cmd(rq->cmd, pre_req->wb.lpn, pre_req->wb.len,
> -				 read_id);
> -	rq->cmd_len = scsi_command_size(rq->cmd);
> -
> -	if (blk_insert_cloned_request(q, req) != BLK_STS_OK)
> -		return -EAGAIN;
> -
> -	hpb->stats.pre_req_cnt++;
> -
> -	return 0;
> -}
> -
> -static int ufshpb_issue_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd,
> -				int *read_id)
> -{
> -	struct ufshpb_req *pre_req;
> -	struct request *req = NULL;
> -	unsigned long flags;
> -	int _read_id;
> -	int ret = 0;
> -
> -	req = blk_get_request(cmd->device->request_queue,
> -			      REQ_OP_DRV_OUT | REQ_SYNC, BLK_MQ_REQ_NOWAIT);
> -	if (IS_ERR(req))
> -		return -EAGAIN;
> -
> -	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
> -	pre_req = ufshpb_get_pre_req(hpb);
> -	if (!pre_req) {
> -		ret = -EAGAIN;
> -		goto unlock_out;
> -	}
> -	_read_id = ufshpb_get_read_id(hpb);
> -	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
> -
> -	pre_req->req = req;
> -
> -	ret = ufshpb_execute_pre_req(hpb, cmd, pre_req, _read_id);
> -	if (ret)
> -		goto free_pre_req;
> -
> -	*read_id = _read_id;
> -
> -	return ret;
> -free_pre_req:
> -	spin_lock_irqsave(&hpb->rgn_state_lock, flags);
> -	ufshpb_put_pre_req(hpb, pre_req);
> -unlock_out:
> -	spin_unlock_irqrestore(&hpb->rgn_state_lock, flags);
> -	blk_put_request(req);
> -	return ret;
> -}
> -
>   /*
>    * This function will set up HPB read command using host-side L2P map data.
>    */
> @@ -685,23 +461,6 @@ int ufshpb_prep(struct ufs_hba *hba, struct ufshcd_lrb *lrbp)
>   		dev_err(hba->dev, "get ppn failed. err %d\n", err);
>   		return err;
>   	}
> -	if (!ufshpb_is_legacy(hba) &&
> -	    ufshpb_is_required_wb(hpb, transfer_len)) {
> -		err = ufshpb_issue_pre_req(hpb, cmd, &read_id);
> -		if (err) {
> -			unsigned long timeout;
> -
> -			timeout = cmd->jiffies_at_alloc + msecs_to_jiffies(
> -				  hpb->params.requeue_timeout_ms);
> -
> -			if (time_before(jiffies, timeout))
> -				return -EAGAIN;
> -
> -			hpb->stats.miss_cnt++;
> -			return 0;
> -		}
> -	}
> -
>   	ufshpb_set_hpb_read_to_upiu(hba, lrbp, ppn, transfer_len, read_id);
>   
>   	hpb->stats.hit_cnt++;

Hi James,

The help with trying to find a solution is appreciated.

One of the software developers who is familiar with HPB explained to me that
READ BUFFER and WRITE BUFFER commands may be received in an arbitrary order
by UFS devices. The UFS HPB spec requires UFS devices to be able to stash up
to 128 such pairs. I'm concerned that leaving out only the WRITE BUFFER
commands will break the HPB protocol in a subtle way.

Thanks,

Bart.
James Bottomley Oct. 28, 2021, 8:33 p.m. UTC | #36
On Thu, 2021-10-28 at 13:21 -0700, Bart Van Assche wrote:
[...]
> Hi James,
> 
> The help with trying to find a solution is appreciated.
> 
> One of the software developers who is familiar with HPB explained to
> me that READ BUFFER and WRITE BUFFER commands may be received in an
> arbitrary order by UFS devices. The UFS HPB spec requires UFS devices
> to be able to stash up to 128 such pairs. I'm concerned that leaving
> out WRITE BUFFER commands only will break the HPB protocol in a
> subtle way.

Based on the publicly available information (the hotstorage paper) I
don't believe it can.  The Samsung guys also appear to confirm that the
use of WRITE BUFFER is simply an optimization for large requests:

https://lore.kernel.org/all/20211025051654epcms2p36b259d237eb2b8b885210148118c5d3f@epcms2p3/

As did the excerpt from the spec you posted.  It will cause slowdowns
for reads of > 32KB, because they now have to go through the native FTL
lookup, but there shouldn't be any functional change.  Unless
there's anything else in the proprietary spec that contradicts this?

James
Bart Van Assche Oct. 28, 2021, 8:53 p.m. UTC | #37
On 10/28/21 1:33 PM, James Bottomley wrote:
> On Thu, 2021-10-28 at 13:21 -0700, Bart Van Assche wrote:
> [...]
>> Hi James,
>>
>> The help with trying to find a solution is appreciated.
>>
>> One of the software developers who is familiar with HPB explained to
>> me that READ BUFFER and WRITE BUFFER commands may be received in an
>> arbitrary order by UFS devices. The UFS HPB spec requires UFS devices
>> to be able to stash up to 128 such pairs. I'm concerned that leaving
>> out WRITE BUFFER commands only will break the HPB protocol in a
>> subtle way.
> 
> Based on the publicly available information (the hotstorage paper) I
> don't belive it can.  The Samsung guys also appear to confirm that the
> use of WRITE BUFFER is simply an optimzation for large requests:
> 
> https://lore.kernel.org/all/20211025051654epcms2p36b259d237eb2b8b885210148118c5d3f@epcms2p3/
> 
> As did the excerpt from the spec you posted.  It will cause slowdowns
> for reads of > 32kb, because they have to go through the native FTL
> lookup now, but there shouldn't be any functional change.  Unless
> there's anything else in the proprietary spec that contradicts this?

Are the UFS standards really proprietary? On https://www.t10.org/pubs.htm
I found the following: "Approved American National Standards and Technical
Reports may be purchased from: [ ... ]". And on
https://www.jedec.org/document_search?search_api_views_fulltext=hpb I found
the following: "Available for purchase: $80.00". It seems to me that SCSI
and JEDEC standards are available under similar conditions?

Regarding the question about the impact of leaving out WRITE BUFFER
commands on the HPB protocol, I hope that one of the HPB experts will be so
kind to answer that question.

Bart.
Daejun Park Oct. 28, 2021, 9:14 p.m. UTC | #38
Hi Bart,

>On 10/28/21 1:33 PM, James Bottomley wrote:
>> On Thu, 2021-10-28 at 13:21 -0700, Bart Van Assche wrote:
>> [...]
>>> Hi James,
>>>
>>> The help with trying to find a solution is appreciated.
>>>
>>> One of the software developers who is familiar with HPB explained to
>>> me that READ BUFFER and WRITE BUFFER commands may be received in an
>>> arbitrary order by UFS devices. The UFS HPB spec requires UFS devices
>>> to be able to stash up to 128 such pairs. I'm concerned that leaving
>>> out WRITE BUFFER commands only will break the HPB protocol in a
>>> subtle way.
>> 
>> Based on the publicly available information (the hotstorage paper) I
>> don't belive it can.  The Samsung guys also appear to confirm that the
>> use of WRITE BUFFER is simply an optimzation for large requests:
>> 
>> https://lore.kernel.org/all/20211025051654epcms2p36b259d237eb2b8b885210148118c5d3f@epcms2p3/
>> 
>> As did the excerpt from the spec you posted.  It will cause slowdowns
>> for reads of > 32kb, because they have to go through the native FTL
>> lookup now, but there shouldn't be any functional change.  Unless
>> there's anything else in the proprietary spec that contradicts this?
> 
>Are the UFS standards really proprietary? On https://www.t10.org/pubs.htm
>I found the following: "Approved American National Standards and Technical
>Reports may be purchased from: [ ... ]". And on
>https://www.jedec.org/document_search?search_api_views_fulltext=hpb I found
>the following: "Available for purchase: $80.00". It seems to me that SCSI
>and JEDEC standards are available under similar conditions?
> 
>Regarding the question about the impact of leaving out WRITE BUFFER
>commands on the HPB protocol, I hope that one of the HPB experts will be so
>kind to answer that question.

If the driver doesn't send WRITE BUFFER commands as part of the HPB
protocol, a READ that requires a pre-request (e.g. >32KB) cannot be an
HPB READ.

Thanks,
Daejun
Christoph Hellwig Oct. 29, 2021, 10:53 a.m. UTC | #39
Given that the discussion is now turning into bikeshedding over whether the
non-public UFS spec is merely completely broken or utterly completely
broken, can we please add this patch or the revert before 5.15
goes in?  I don't think this mess will be resolved in any reasonable
time.
James Bottomley Oct. 29, 2021, 11:39 a.m. UTC | #40
On Fri, 2021-10-29 at 12:53 +0200, Christoph Hellwig wrote:
> Given that the discussion is now turning into bikeshedding over whether the
> non-public UFS spec is merely completely broken or utterly completely
> broken, can we please add this patch or the revert before 5.15
> goes in?  I don't think this mess will be resolved in any reasonable
> time.

No.  Removing the 2.0 HPB optimization fixes all your complaints about
the block API problems, so there's no need to do a full revert.  I just
need someone to test the partial revert ASAP.  If no-one can test the
partial revert then we can consider more drastic options.

James
Avri Altman Oct. 29, 2021, 1:35 p.m. UTC | #41
> On Fri, 2021-10-29 at 12:53 +0200, Christoph Hellwig wrote:
> > Given that the discussion is now turning into bikeshedding over whether
> > the non-public UFS spec is merely completely broken or utterly completely
> > broken, can we please add this patch or the revert before 5.15 goes in?
> > I don't think this mess will be resolved in any reasonable time.
> 
> No.  Removing the 2.0 HPB optimization fixes all your complaints about the
> block API problems, so there's no need to do a full revert.  I just need
> someone to test the partial revert ASAP.  If no-one can test the partial revert
> then we can consider more drastic options.
I support Daejun's patch, but if this is your final decision, I can test it on Sunday.
Can you point me to the exact patch that needs to be tested?
There were quite a few suggestions flying around....

Thanks,
Avri 

> James
>
James Bottomley Oct. 29, 2021, 1:44 p.m. UTC | #42
On Fri, 2021-10-29 at 13:35 +0000, Avri Altman wrote:
> > On Fri, 2021-10-29 at 12:53 +0200, Christoph Hellwig wrote:
> > > Given that the discussion is now turning into bikeshedding over
> > > whether the non-public UFS spec is merely completely broken or
> > > utterly completely broken, can we please add this patch or the
> > > revert before 5.15 goes in? I don't think this mess will be
> > > resolved in any reasonable time.
> > 
> > No.  Removing the 2.0 HPB optimization fixes all your complaints
> > about the block API problems, so there's no need to do a full
> > revert.  I just need someone to test the partial revert ASAP.  If
> > no-one can test the partial revert then we can consider more
> > drastic options.
> I support Daejun's patch, but if this is your final decision, I can
> test it on Sunday. Can you refer me to exact patch needed to be
> tested?

https://lore.kernel.org/all/b2bcc13ccdc584962128a69fa5992936068e1a9b.camel@HansenPartnership.com/

Ideally before Sunday, since that's when Linus will likely go final with
5.15.

Thanks,

James

Patch

diff --git a/drivers/scsi/ufs/Kconfig b/drivers/scsi/ufs/Kconfig
index 432df76e6318a..7835d9082aae4 100644
--- a/drivers/scsi/ufs/Kconfig
+++ b/drivers/scsi/ufs/Kconfig
@@ -186,7 +186,7 @@  config SCSI_UFS_CRYPTO
 
 config SCSI_UFS_HPB
 	bool "Support UFS Host Performance Booster"
-	depends on SCSI_UFSHCD
+	depends on SCSI_UFSHCD && BROKEN
 	help
 	  The UFS HPB feature improves random read performance. It caches
 	  L2P (logical to physical) map of UFS to host DRAM. The driver uses HPB