
[1/3] nvme-tcp: rate limit error message in send path

Message ID: 20250128-nvme-misc-fixes-v1-1-40c586581171@kernel.org
State: New
Series: misc nvme related fixes

Commit Message

Daniel Wagner Jan. 28, 2025, 4:34 p.m. UTC
If a lot of requests are in the queue, this message can spam the logs,
thus rate limit it.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Jan. 29, 2025, 6:05 a.m. UTC | #1
On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> If a lot of requests are in the queue, this message can spam the logs,
> thus rate limit it.

Are in the queue when what happens?  Not that I'm against this,
but if we have a known condition where this error is printed a lot
we should probably skip it entirely for that?
Daniel Wagner Jan. 30, 2025, 3:25 p.m. UTC | #2
On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> > If a lot of requests are in the queue, this message can spam the logs,
> > thus rate limit it.
> 
> Are in the queue when what happens?  Not that I'm against this,
> but if we have a known condition where this error is printed a lot
> we should probably skip it entirely for that?

The condition is that all the elements in the queue->send_list could fail as a
batch. I had a bug in my patches which re-queued all the failed commands
immediately and sent them out again, thus spamming the log.

This behavior doesn't exist upstream. I just thought it might make
sense to rate limit it as a precaution. I don't know if it is worth the code
churn.
Christoph Hellwig Jan. 31, 2025, 7:29 a.m. UTC | #3
On Thu, Jan 30, 2025 at 04:25:35PM +0100, Daniel Wagner wrote:
> On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> > > If a lot of requests are in the queue, this message can spam the logs,
> > > thus rate limit it.
> > 
> > Are in the queue when what happens?  Not that I'm against this,
> > but if we have a known condition where this error is printed a lot
> > we should probably skip it entirely for that?
> 
> The condition is that all the elements in the queue->send_list could fail as a
> batch. I had a bug in my patches which re-queued all the failed commands
> immediately and sent them out again, thus spamming the log.
> 
> This behavior doesn't exist upstream. I just thought it might make
> sense to rate limit it as a precaution. I don't know if it is worth the code
> churn.

I'm fine with the rate limiting.  I was just wondering if there is
a case where we'd easily hit it and could do even better.
Sagi Grimberg Jan. 31, 2025, 8:09 a.m. UTC | #4
On 31/01/2025 9:29, Christoph Hellwig wrote:
> On Thu, Jan 30, 2025 at 04:25:35PM +0100, Daniel Wagner wrote:
>> On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
>>> On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
>>>> If a lot of requests are in the queue, this message can spam the logs,
>>>> thus rate limit it.
>>> Are in the queue when what happens?  Not that I'm against this,
>>> but if we have a known condition where this error is printed a lot
>>> we should probably skip it entirely for that?
>> The condition is that all the elements in the queue->send_list could fail as a
>> batch. I had a bug in my patches which re-queued all the failed commands
>> immediately and sent them out again, thus spamming the log.
>>
>> This behavior doesn't exist upstream. I just thought it might make
>> sense to rate limit it as a precaution. I don't know if it is worth the code
>> churn.
> I'm fine with the rate limiting.  I was just wondering if there is
> a case where we'd easily hit it and could do even better.

I agree with the change. The reason I think it's useful to keep is that it
has been a really useful indication when debugging UAF panics.
Sagi Grimberg Jan. 31, 2025, 8:09 a.m. UTC | #5
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Patch

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index dc5bbca58c6dcbce40cfa3de893592d768ebc939..1ed0bc10b2dffe534536b1073abc0302056aa51e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1257,8 +1257,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	if (ret == -EAGAIN) {
 		ret = 0;
 	} else if (ret < 0) {
-		dev_err(queue->ctrl->ctrl.device,
-			"failed to send request %d\n", ret);
+		dev_err_ratelimited(queue->ctrl->ctrl.device,
+				    "failed to send request %d\n", ret);
 		nvme_tcp_fail_request(queue->request);
 		nvme_tcp_done_send_req(queue);
 	}
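
For reference, here is a minimal sketch of roughly the pattern that the
dev_err_ratelimited() call above boils down to. The helper name
report_send_failure() is made up for illustration; the real macro keeps an
equivalent static ratelimit state per call site, with the kernel defaults of
DEFAULT_RATELIMIT_BURST messages per DEFAULT_RATELIMIT_INTERVAL (10 messages
per 5 seconds), so a batch of failing sends only logs the first few
occurrences rather than one line per request.

/*
 * Illustrative sketch only, not the exact macro expansion: a static
 * ratelimit state shared by all invocations of this statement.
 * __ratelimit() returns true while the burst budget for the current
 * interval has not been exhausted, and suppresses the print otherwise.
 */
#include <linux/dev_printk.h>
#include <linux/ratelimit.h>

static void report_send_failure(struct device *dev, int ret)
{
	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (__ratelimit(&rs))
		dev_err(dev, "failed to send request %d\n", ret);
}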