blktests: fix how we handle waiting for nbd to connect

Message ID	9377610cbdc3568c172cd7c5d2e9d36da8dd2cf4.1716312272.git.josef@toxicpanda.com (mailing list archive)
State	New, archived
Headers	show Received: from mail-qt1-f179.google.com (mail-qt1-f179.google.com [209.85.160.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7963D147C6F for <linux-block@vger.kernel.org>; Tue, 21 May 2024 17:24:56 +0000 (UTC) From: Josef Bacik <josef@toxicpanda.com> To: linux-block@vger.kernel.org, shinichiro.kawasaki@wdc.com Subject: [PATCH] blktests: fix how we handle waiting for nbd to connect Date: Tue, 21 May 2024 13:24:39 -0400 Message-ID: <9377610cbdc3568c172cd7c5d2e9d36da8dd2cf4.1716312272.git.josef@toxicpanda.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	blktests: fix how we handle waiting for nbd to connect \| expand blktests: fix how we handle waiting for nbd to connect

Message ID

9377610cbdc3568c172cd7c5d2e9d36da8dd2cf4.1716312272.git.josef@toxicpanda.com (mailing list archive)

State

New, archived

Headers

From: Josef Bacik <josef@toxicpanda.com>
To: linux-block@vger.kernel.org,
	shinichiro.kawasaki@wdc.com
Subject: [PATCH] blktests: fix how we handle waiting for nbd to connect
Date: Tue, 21 May 2024 13:24:39 -0400
Message-ID: 
 <9377610cbdc3568c172cd7c5d2e9d36da8dd2cf4.1716312272.git.josef@toxicpanda.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

blktests: fix how we handle waiting for nbd to connect | expand

Commit Message

Josef Bacik May 21, 2024, 5:24 p.m. UTC

Because NBD has the old style configure the device directly config we
sometimes have spurious failures where the device wasn't quite ready
before the rest of the test continued.

nbd/002 had been using _wait_for_nbd_connect, however this helper waits
for the recv task to show up, which actually happens slightly before the
size is set and we're actually ready to be read from.  This means we
would sometimes fail nbd/002 because the device wasn't quite right.

Additionally nbd/001 has a similar issue where we weren't waiting for
the device to be ready before going ahead with the test, which would
cause spurious failures.

Fix this by adjusting _wait_for_nbd_connect to only exit once the size
for the device is being reported properly, meaning that it's ready to be
read from.

Then add this call to nbd/001 to eliminate the random failures we would
see with this test.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 tests/nbd/001 | 1 +
 tests/nbd/rc  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Comments

Shinichiro Kawasaki May 22, 2024, 7:02 a.m. UTC | #1

On May 21, 2024 / 13:24, Josef Bacik wrote:
> Because NBD has the old style configure the device directly config we
> sometimes have spurious failures where the device wasn't quite ready
> before the rest of the test continued.
> 
> nbd/002 had been using _wait_for_nbd_connect, however this helper waits
> for the recv task to show up, which actually happens slightly before the
> size is set and we're actually ready to be read from.  This means we
> would sometimes fail nbd/002 because the device wasn't quite right.
> 
> Additionally nbd/001 has a similar issue where we weren't waiting for
> the device to be ready before going ahead with the test, which would
> cause spurious failures.
> 
> Fix this by adjusting _wait_for_nbd_connect to only exit once the size
> for the device is being reported properly, meaning that it's ready to be
> read from.
> 
> Then add this call to nbd/001 to eliminate the random failures we would
> see with this test.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Josef, thank you very much. I've been seeing nbd/001 on my test system and this
patch avoids it :) The change looks good to me. I will leave it for several
days before applying it, in case anyone has some more comments.

Shinichiro Kawasaki May 27, 2024, 7:49 a.m. UTC | #2

On May 22, 2024 / 07:02, Shinichiro Kawasaki wrote:
> On May 21, 2024 / 13:24, Josef Bacik wrote:
> > Because NBD has the old style configure the device directly config we
> > sometimes have spurious failures where the device wasn't quite ready
> > before the rest of the test continued.
> > 
> > nbd/002 had been using _wait_for_nbd_connect, however this helper waits
> > for the recv task to show up, which actually happens slightly before the
> > size is set and we're actually ready to be read from.  This means we
> > would sometimes fail nbd/002 because the device wasn't quite right.
> > 
> > Additionally nbd/001 has a similar issue where we weren't waiting for
> > the device to be ready before going ahead with the test, which would
> > cause spurious failures.
> > 
> > Fix this by adjusting _wait_for_nbd_connect to only exit once the size
> > for the device is being reported properly, meaning that it's ready to be
> > read from.
> > 
> > Then add this call to nbd/001 to eliminate the random failures we would
> > see with this test.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> Josef, thank you very much. I've been seeing nbd/001 on my test system and this
> patch avoids it :) The change looks good to me. I will leave it for several
> days before applying it, in case anyone has some more comments.

I've applied it. Thanks!

Shinichiro Kawasaki May 31, 2024, 1:47 a.m. UTC | #3

On May 27, 2024 / 07:49, Shinichiro Kawasaki wrote:
> On May 22, 2024 / 07:02, Shinichiro Kawasaki wrote:
> > On May 21, 2024 / 13:24, Josef Bacik wrote:
> > > Because NBD has the old style configure the device directly config we
> > > sometimes have spurious failures where the device wasn't quite ready
> > > before the rest of the test continued.
> > > 
> > > nbd/002 had been using _wait_for_nbd_connect, however this helper waits
> > > for the recv task to show up, which actually happens slightly before the
> > > size is set and we're actually ready to be read from.  This means we
> > > would sometimes fail nbd/002 because the device wasn't quite right.
> > > 
> > > Additionally nbd/001 has a similar issue where we weren't waiting for
> > > the device to be ready before going ahead with the test, which would
> > > cause spurious failures.
> > > 
> > > Fix this by adjusting _wait_for_nbd_connect to only exit once the size
> > > for the device is being reported properly, meaning that it's ready to be
> > > read from.
> > > 
> > > Then add this call to nbd/001 to eliminate the random failures we would
> > > see with this test.
> > > 
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > 
> > Josef, thank you very much. I've been seeing nbd/001 on my test system and this
> > patch avoids it :) The change looks good to me. I will leave it for several
> > days before applying it, in case anyone has some more comments.
> 
> I've applied it. Thanks!

To: Yi Zhang

Yi, I guess this fix may avoid the nbd/002 failure that CKI reports recently
[*]. To apply the fix, could you rebase the blktests for CKI runs?

[*] https://datawarehouse.cki-project.org/kcidb/tests/12631448

Yi Zhang May 31, 2024, 2:32 a.m. UTC | #4

On Fri, May 31, 2024 at 9:47 AM Shinichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> On May 27, 2024 / 07:49, Shinichiro Kawasaki wrote:
> > On May 22, 2024 / 07:02, Shinichiro Kawasaki wrote:
> > > On May 21, 2024 / 13:24, Josef Bacik wrote:
> > > > Because NBD has the old style configure the device directly config we
> > > > sometimes have spurious failures where the device wasn't quite ready
> > > > before the rest of the test continued.
> > > >
> > > > nbd/002 had been using _wait_for_nbd_connect, however this helper waits
> > > > for the recv task to show up, which actually happens slightly before the
> > > > size is set and we're actually ready to be read from.  This means we
> > > > would sometimes fail nbd/002 because the device wasn't quite right.
> > > >
> > > > Additionally nbd/001 has a similar issue where we weren't waiting for
> > > > the device to be ready before going ahead with the test, which would
> > > > cause spurious failures.
> > > >
> > > > Fix this by adjusting _wait_for_nbd_connect to only exit once the size
> > > > for the device is being reported properly, meaning that it's ready to be
> > > > read from.
> > > >
> > > > Then add this call to nbd/001 to eliminate the random failures we would
> > > > see with this test.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > >
> > > Josef, thank you very much. I've been seeing nbd/001 on my test system and this
> > > patch avoids it :) The change looks good to me. I will leave it for several
> > > days before applying it, in case anyone has some more comments.
> >
> > I've applied it. Thanks!
>
> To: Yi Zhang
>
> Yi, I guess this fix may avoid the nbd/002 failure that CKI reports recently
> [*]. To apply the fix, could you rebase the blktests for CKI runs?

Sure, will apply this patch and verify the failure.

>
> [*] https://datawarehouse.cki-project.org/kcidb/tests/12631448
>

diff --git a/tests/nbd/001 b/tests/nbd/001
index 5fd0d43..0975af0 100755
--- a/tests/nbd/001
+++ b/tests/nbd/001
@@ -18,6 +18,7 @@  test() {
 	echo "Running ${TEST_NAME}"
 	_start_nbd_server
 	nbd-client -L -N export localhost /dev/nbd0 >> "$FULL" 2>&1
+	_wait_for_nbd_connect
 	udevadm settle
 
 	parted -s /dev/nbd0 print 2>> "$FULL" | grep 'Disk /dev/nbd0'
diff --git a/tests/nbd/rc b/tests/nbd/rc
index 9c1c15b..e96dc61 100644
--- a/tests/nbd/rc
+++ b/tests/nbd/rc
@@ -43,7 +43,8 @@  _have_nbd_netlink() {
 
 _wait_for_nbd_connect() {
 	for ((i = 0; i < 3; i++)); do
-		if [[ -e /sys/kernel/debug/nbd/nbd0/tasks ]]; then
+		sz=$(lsblk --raw --noheadings -o SIZE /dev/nbd0)
+		if [ "$sz" != "0B"  ]; then
 			return 0
 		fi
 		sleep 1

blktests: fix how we handle waiting for nbd to connect

Commit Message

Comments

Patch