From patchwork Fri Jun 29 16:57:38 2018
X-Patchwork-Submitter: Zorro Lang
X-Patchwork-Id: 10497085
From: Zorro Lang <zlang@redhat.com>
To: fstests@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Subject: [PATCH] generic: test dm-thin running out of data space vs concurrent discard
Date: Sat, 30 Jun 2018 00:57:38 +0800
Message-Id: <20180629165738.8106-1-zlang@redhat.com>

If a user constructs a test that loops repeatedly over the steps below on
dm-thin, block allocation can fail because discards have not completed yet
(fixed by a685557 "dm thin: handle running out of data space vs concurrent
discard"):

  1) fill the thin device via a filesystem file
  2) remove the file
  3) fstrim

This can also cause a deadlock (a fast device such as a ramdisk helps a
lot) when an fstrim races with a filesystem (XFS) shutdown (fixed by
8c81dd46ef3c "Force log to disk before reading the AGF during a fstrim").

This case reproduces both bugs when neither is fixed. If only the dm-thin
bug is fixed, the test passes. If only the fs bug is fixed, the test
fails. If neither bug is fixed, the test hangs.
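For reference, a minimal standalone sketch of that loop, assuming a dm-thin
volume is already formatted and mounted at /mnt/thin (a hypothetical mount
point; the test in this patch drives the same sequence through the fstests
dmthin helpers):

    # Fill the volume, delete the file, then discard the freed blocks;
    # each pass races fresh block allocation against the previous discards.
    for ((i = 0; i < 100; i++)); do
            xfs_io -f -c "pwrite -b 64k 0 100M" /mnt/thin/testfile &>/dev/null
            rm -f /mnt/thin/testfile
            fstrim /mnt/thin
    done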
Signed-off-by: Zorro Lang <zlang@redhat.com>
---

Hi,

If neither of the two bugs is fixed, a loop device backed by tmpfs helps
reproduce the XFS deadlock:

  1) mount -t tmpfs tmpfs /tmp
  2) dd if=/dev/zero of=/tmp/test.img bs=1M count=100
  3) losetup /dev/loop0 /tmp/test.img
  4) use /dev/loop0 as SCRATCH_DEV and run this case

The test will hang there. A ramdisk can help trigger the race, and an NVMe
device may help too, but it is hard to reproduce on an ordinary disk.

If the XFS bug is fixed, the steps above reproduce the dm-thin bug and the
test fails. Unfortunately, once the dm-thin bug is fixed, this case can no
longer reproduce the XFS bug on its own.

Thanks,
Zorro

 tests/generic/499     | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/499.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 88 insertions(+)
 create mode 100755 tests/generic/499
 create mode 100644 tests/generic/499.out

diff --git a/tests/generic/499 b/tests/generic/499
new file mode 100755
index 00000000..24adfc3a
--- /dev/null
+++ b/tests/generic/499
@@ -0,0 +1,85 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Red Hat Inc. All Rights Reserved.
+#
+# FS QA Test 499
+#
+# Race running out of data space against concurrent discard operations on
+# dm-thin.
+#
+# If a user constructs a test that loops repeatedly over the steps below
+# on dm-thin, block allocation can fail because discards have not
+# completed yet (fixed by a685557 "dm thin: handle running out of data
+# space vs concurrent discard"):
+#   1) fill the thin device via a filesystem file
+#   2) remove the file
+#   3) fstrim
+#
+# This can also cause a deadlock when an fstrim races with a filesystem
+# (XFS) shutdown (fixed by 8c81dd46ef3c "Force log to disk before reading
+# the AGF during a fstrim").
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	_dmthin_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/dmthin
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch_nocheck
+_require_dm_target thin-pool
+
+# Create a thin pool and a thin volume slightly *larger* than the pool's
+# data space; the overcommit helps reproduce the bug
+BACKING_SIZE=$((50 * 1024 * 1024 / 512))	# 50M, in 512-byte sectors
+VIRTUAL_SIZE=$((BACKING_SIZE + 1024))		# 50M + 512k
+CLUSTER_SIZE=$((64 * 1024 / 512))		# 64K
+
+_dmthin_init $BACKING_SIZE $VIRTUAL_SIZE $CLUSTER_SIZE 0
+_dmthin_set_fail
+_mkfs_dev $DMTHIN_VOL_DEV
+_dmthin_mount
+
+# There are two bugs in play here: a dm-thin bug and a filesystem (XFS in
+# particular) bug. The dm-thin bug mishandles running out of data space
+# with a concurrent discard. When it fires, the fs bug can then hang
+# unmount by racing an fstrim with a filesystem shutdown.
+#
+# If neither bug has been fixed, the loop below may deadlock. If the fs
+# bug has been fixed but the dm-thin bug has not, the test fails (no
+# deadlock). Otherwise the test passes.
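+#
+# (Writing 100M into the ~50M thin volume below guarantees the pool runs
+# out of data space on every iteration; the resulting write error is
+# expected, so the command's output is discarded.)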
+for ((i=0; i<100; i++)); do
+	$XFS_IO_PROG -f -c "pwrite -b 64k 0 100M" \
+		$SCRATCH_MNT/testfile &>/dev/null
+	rm -f $SCRATCH_MNT/testfile
+	$FSTRIM_PROG $SCRATCH_MNT
+done
+
+_dmthin_check_fs
+_dmthin_cleanup
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/499.out b/tests/generic/499.out
new file mode 100644
index 00000000..c363e684
--- /dev/null
+++ b/tests/generic/499.out
@@ -0,0 +1,2 @@
+QA output created by 499
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 83a6fdab..bbeac4af 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -501,3 +501,4 @@
 496 auto quick swap
 497 auto quick swap collapse
 498 auto quick log
+499 auto thin trim
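With the patch applied, the test runs under the fstests harness as usual.
A hypothetical invocation, assuming an fstests checkout whose local.config
points SCRATCH_DEV at a fast device such as a ramdisk (device names below
are examples only, per the reproduction notes above):

    # local.config (assumed example)
    export TEST_DEV=/dev/ram0
    export TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/ram1
    export SCRATCH_MNT=/mnt/scratch

    ./check generic/499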