[40/40] fstests: check-parallel

From: Dave Chinner <dchinner@redhat.com>

Runs tests in parallel runner threads. Each runner thread has its
own set of tests to run, and runs a separate instance of check
to run those tests.

check-parallel sets up loop devices, mount points, results
directories, etc. for each instance and divides the tests up between
the runner threads.

It currently hard codes the XFS and generic test lists, and then
gives each check invocation an explicit list of tests to run. It
also passes through exclusions so that test exclude filtering is
still done by check.
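
The split described above might look roughly like this. It is a minimal
sketch, not the code from the patch: split_tests and runner_tests are
hypothetical names, and a simple round-robin deal is assumed because the
text does not say how the lists are balanced.

```shell
#!/bin/bash
# Hypothetical sketch: deal a test list out to N runners round-robin.
# split_tests and runner_tests are illustrative names, not from the patch.

split_tests() {
	local nrunners=$1
	shift
	local i=0 idx t

	# One space-separated test list per runner, indexed by runner number.
	runner_tests=()
	for t in "$@"; do
		idx=$((i % nrunners))
		runner_tests[idx]+="$t "
		i=$((i + 1))
	done
}

split_tests 3 xfs/001 xfs/002 generic/001 generic/002 generic/003
for r in "${!runner_tests[@]}"; do
	echo "runner $r: ${runner_tests[$r]}"
done
```

Each per-runner list would then be handed to its own check instance as
an explicit test list, with exclusions passed through unchanged.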

This is far from ideal, but I didn't want to have to embark on a
major refactoring of check to be able to run stuff in parallel.
It was quite the challenge just to get all the tests and test
infrastructure up to the point where they can run reliably in
parallel.

Hence I've left the actual factoring of test selection and setup
out of the patchset for the moment. The plan is to factor both the
test setup and the test list runner loop out of check and share them
between check and check-parallel, hence not requiring check-parallel
to run check directly. That is future work, however.

With the current test runner setup, it is not uncommon to see >5000%
CPU usage, 150-200k IOPS and 4-5GB/s of disk bandwidth being used
when running 64 runners. This is a serious stress load as it is
constantly mounting and unmounting dozens of filesystems, creating
and destroying devices, dropping caches, running sync, running CPU
hotplug, running page cache migration, etc.

The massive amount of IO that this load generates causes qemu hosts
to abort (i.e. crash) because they run out of vm map segments. Hence
it is necessary to bump up max_map_count on the host like so:

echo 1048576 > /proc/sys/vm/max_map_count
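
A guarded version of that host-side tweak could look like the sketch
below. The helper name map_count_too_low and the APPLY switch are
illustrative assumptions; only the knob path and the value come from the
text above.

```shell
#!/bin/bash
# Sketch: report (and optionally raise) vm.max_map_count on the qemu host.
# map_count_too_low and APPLY are illustrative names, not from the patch.

want=1048576
knob=/proc/sys/vm/max_map_count

map_count_too_low() {
	# $1 = current value, $2 = wanted value
	[ "$1" -lt "$2" ]
}

if [ -r "$knob" ]; then
	cur=$(cat "$knob")
	if map_count_too_low "$cur" "$want"; then
		echo "max_map_count is $cur, want at least $want"
		# Only write when explicitly asked to; this needs root and
		# does not persist across reboots.
		if [ "${APPLY:-0}" = 1 ]; then
			echo "$want" > "$knob"
		fi
	fi
fi
```

Running it with APPLY=1 as root applies the change; `sysctl -w
vm.max_map_count=1048576` is an equivalent way to set the same value.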

There is no significant memory pressure to speak of from running the
tests this way. I've seen a maximum of about 50GB of RAM in use, so
on a 64p/64GB VM the additional concurrency doesn't really stress
memory capacity like it does CPU and IO.

All the runners are executed in private mount namespaces. This is
to prevent ephemeral mount namespace clones from taking a reference
to every mounted filesystem in the machine and so causing random
"device busy after unmount" failures in the tests that are running
concurrently with the mount namespace setup and teardown.
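
As a rough illustration of the idea, the sketch below starts a command
in its own mount namespace. It uses unshare(1) from util-linux as a
stand-in for the nsexec helper that appears in the process tree, so it
is not the patch's code; mntns_id and run_in_private_mntns are
hypothetical names.

```shell
#!/bin/bash
# Sketch: run a command in its own mount namespace with unshare(1).
# This stands in for the nsexec helper; it is not the patch's code.
# Creating a mount namespace needs root (CAP_SYS_ADMIN).

mntns_id() {
	# Prints the mount namespace of the caller, e.g. "mnt:[4026531841]".
	readlink /proc/self/ns/mnt
}

run_in_private_mntns() {
	# unshare --mount gives the command a new mount namespace.
	unshare --mount -- "$@"
}

echo "parent namespace: $(mntns_id)"
if [ "$(id -u)" -eq 0 ]; then
	# The child should report a different namespace ID to the parent.
	run_in_private_mntns readlink /proc/self/ns/mnt ||
		echo "unshare failed (missing privileges?)"
fi
```

The namespace IDs printed here are the same kind of IDs that
`pstree -N mnt` uses to group processes.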

A typical `pstree -N mnt` looks like:

$ pstree -N mnt
[4026531841]
bash
bash───pstree
[0]
sudo───sudo───check-parallel─┬─check-parallel───nsexec───check───311─┬─cut
                             │                                       └─md5sum
                             ├─check-parallel───nsexec───check───750─┬─750───sleep
                             │                                       └─750.fsstress───4*[750.fsstress───{750.fsstress}]
                             ├─check-parallel───nsexec───check───013───013───sed
                             ├─check-parallel───nsexec───check───251───cp
                             ├─check-parallel───nsexec───check───467───open_by_handle
                             ├─check-parallel───nsexec───check───650─┬─650───sleep
                             │                                       └─650.fsstress─┬─61*[650.fsstress───{650.fsstress}]
                             │                                                      └─2*[650.fsstress]
                             ├─check-parallel───nsexec───check───707
                             ├─check-parallel───nsexec───check───705
                             ├─check-parallel───nsexec───check───416
                             ├─check-parallel───nsexec───check───477───2*[open_by_handle]
                             ├─check-parallel───nsexec───check───140───140
                             ├─check-parallel───nsexec───check───562
                             ├─check-parallel───nsexec───check───415───xfs_io───{xfs_io}
                             ├─check-parallel───nsexec───check───291
                             ├─check-parallel───nsexec───check───017
                             ├─check-parallel───nsexec───check───016
                             ├─check-parallel───nsexec───check───168───2*[168───168]
                             ├─check-parallel───nsexec───check───672───2*[672───672]
                             ├─check-parallel───nsexec───check───170─┬─170───170───170
                             │                                       └─170───170
                             ├─check-parallel───nsexec───check───531───122*[t_open_tmpfiles]
                             ├─check-parallel───nsexec───check───387
                             ├─check-parallel───nsexec───check───748
                             ├─check-parallel───nsexec───check───388─┬─388.fsstress───4*[388.fsstress───{388.fsstress}]
                             │                                       └─sleep
                             ├─check-parallel───nsexec───check───328───328
                             ├─check-parallel───nsexec───check───352
                             ├─check-parallel───nsexec───check───042
                             ├─check-parallel───nsexec───check───426───open_by_handle
                             ├─check-parallel───nsexec───check───756───2*[open_by_handle]
                             ├─check-parallel───nsexec───check───227
                             ├─check-parallel───nsexec───check───208───aio-dio-invalid───2*[aio-dio-invalid]
                             ├─check-parallel───nsexec───check───746───cp
                             ├─check-parallel───nsexec───check───187───187
                             ├─check-parallel───nsexec───check───027───8*[027]
                             ├─check-parallel───nsexec───check───045───xfs_io───{xfs_io}
                             ├─check-parallel───nsexec───check───044
                             ├─check-parallel───nsexec───check───204
                             ├─check-parallel───nsexec───check───186───186
                             ├─check-parallel───nsexec───check───449
                             ├─check-parallel───nsexec───check───231───su───fsx
                             ├─check-parallel───nsexec───check───509
                             ├─check-parallel───nsexec───check───127───5*[127───fsx]
                             ├─check-parallel───nsexec───check───047
                             ├─check-parallel───nsexec───check───043
                             ├─check-parallel───nsexec───check───475───pkill
                             ├─check-parallel───nsexec───check───299─┬─fio─┬─4*[fio]
                             │                                       │     ├─2*[fio───4*[{fio}]]
                             │                                       │     └─{fio}
                             │                                       └─pgrep
                             ├─check-parallel───nsexec───check───551───aio-dio-write-v
                             ├─check-parallel───nsexec───check───323───aio-last-ref-he───100*[{aio-last-ref-he}]
                             ├─check-parallel───nsexec───check───648───sleep
                             ├─check-parallel───nsexec───check───046
                             ├─check-parallel───nsexec───check───753─┬─753.fsstress───4*[753.fsstress]
                             │                                       └─pkill
                             ├─check-parallel───nsexec───check───507───507
                             ├─check-parallel───nsexec───check───629─┬─3*[629───xfs_io───{xfs_io}]
                             │                                       └─5*[629]
                             ├─check-parallel───nsexec───check───073───umount
                             ├─check-parallel───nsexec───check───615───615
                             ├─check-parallel───nsexec───check───176───punch-alternati
                             ├─check-parallel───nsexec───check───294
                             ├─check-parallel───nsexec───check───236───236
                             ├─check-parallel───nsexec───check───165─┬─165─┬─165─┬─cut
                             │                                       │     │     └─xfs_io───{xfs_io}
                             │                                       │     └─165───grep
                             │                                       └─165
                             ├─check-parallel───nsexec───check───259───sync
                             ├─check-parallel───nsexec───check───442───442.fsstress───4*[442.fsstress───{442.fsstress}]
                             ├─check-parallel───nsexec───check───558───255*[558]
                             ├─check-parallel───nsexec───check───358───358───358
                             ├─check-parallel───nsexec───check───169───169
                             └─check-parallel───nsexec───check───297─┬─297.fsstress─┬─284*[297.fsstress───{297.fsstress}]
                                                                     │              └─716*[297.fsstress]
                                                                     └─sleep

A typical test run looks like:

$ time sudo ./check-parallel /mnt/xfs -s xfs -x dump
Runner 63 Failures:  xfs/170
Runner 36 Failures:  xfs/050
Runner 30 Failures:  xfs/273
Runner 29 Failures:  generic/135
Runner 25 Failures:  generic/603
Tests run: 1140
Failure count: 5

Ten slowest tests - runtime in seconds:
xfs/013 454
generic/707 414
generic/017 398
generic/387 395
generic/748 390
xfs/140 351
generic/562 351
generic/705 347
generic/251 344
xfs/016 343

Cleanup on Aisle 5?

total 0
crw-------. 1 root root 10, 236 Nov 27 09:27 control
lrwxrwxrwx. 1 root root       7 Nov 27 09:27 fast -> ../dm-0
/dev/mapper/fast  1.4T  192G  1.2T  14% /mnt/xfs

real    9m29.056s
user    0m0.005s
sys     0m0.022s
$

Yeah, that runtime is real - under 10 minutes for a full XFS auto
group test run. When running this normally (i.e. via check) on this
machine, it usually takes just under 4 hours to run the same set
of tests, i.e. I can run ./check-parallel roughly 25 times on this
machine in the time it takes to run ./check once.
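
The ratio can be checked with a little shell arithmetic. This is only a
sketch: "just under 4 hours" is rounded up to a full 4 hours, so the
result is a slight overestimate, and speedup is an illustrative helper
name.

```shell
#!/bin/bash
# Sketch: ratio of the serial runtime to the parallel runtime quoted above.
# Integer division, so the result is rounded down.

speedup() {
	# $1 = serial runtime in seconds, $2 = parallel runtime in seconds
	echo $(( $1 / $2 ))
}

serial=$(( 4 * 3600 ))        # 14400s, "just under 4 hours" rounded up
parallel=$(( 9 * 60 + 29 ))   # 569s, from "real 9m29.056s"
echo "speedup: ~$(speedup "$serial" "$parallel")x"
```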

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 check          |   7 +-
 check-parallel | 205 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 208 insertions(+), 4 deletions(-)
 create mode 100755 check-parallel

Message-ID: <20241127045403.3665299-41-david@fromorbit.com>
State: New
From: Dave Chinner <david@fromorbit.com>
To: fstests@vger.kernel.org
Subject: [PATCH 40/40] fstests: check-parallel
Date: Wed, 27 Nov 2024 15:52:10 +1100
In-Reply-To: <20241127045403.3665299-1-david@fromorbit.com>
Series: fstests: concurrent test execution

[RFC,00/40] fstests: concurrent test execution
[01/40] xfs/448: get rid of assert-on-failure
[02/40] fstests: cleanup fsstress process management
[03/40] fuzzy: don't use killall
[04/40] fstests: per-test dmflakey instances
[05/40] fstests: per-test dmerror instances
[06/40] fstests: per-test dmhuge instances
[07/40] fstests: per-test dmthin instances
[08/40] fstests: per-test dmdust instances
[09/40] fstests: per-test dmdelay instances
[10/40] fstests: fix DM device creation/removal vs udev races
[11/40] fstests: use syncfs rather than sync
[12/40] fstests: clean up mount and unmount operations
[13/40] fstests: clean up loop device instantiation
[14/40] fstests: xfs/227 is really slow
[15/40] fstests: mark tests that are unreliable when run in parallel
[16/40] fstests: use udevadm wait in preference to settle
[17/40] xfs/442: rescale load so it's not exponential
[18/40] xfs/176: fix broken setup code
[19/40] xfs/177: remove unused slab object count location checks
[20/40] fstests: remove uses of killall where possible
[21/40] generic/127: reduce runtime
[22/40] quota: system project quota files need to be shared
[23/40] dmesg: reduce noise from other tests
[24/40] fstests: stop using /tmp directly
[25/40] fstests: scale some tests for high CPU count sanity
[26/40] generic/310: cleanup killing background processes
[27/40] filter: handle mount errors from CONFIG_BLK_DEV_WRITE_MOUNTED=y
[28/40] filters: add a filter that accepts EIO instead of other errors
[29/40] generic/085: general cleanup for reliability and debugging
[30/40] fstests: don't use directory stacks
[31/40] fstests: clean up a couple of dm-flakey tests
[32/40] fstests: clean up termination of various tests
[33/40] vfstests: some tests require the testdir to be shared
[34/40] xfs/629: single extent files should be within tolerance
[35/40] xfs/076: fix broken mkfs filtering
[36/40] fstests: capture some failures to seqres.full
[37/40] fstests: always use fail-at-unmount semantics for XFS
[38/40] generic/062: don't leave debug files in $here on failure
[39/40] fstests: quota grace periods unreliable under load
[40/40] fstests: check-parallel
