From patchwork Sun Dec 31 22:54:56 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 13507984 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 73CD2C127 for ; Sun, 31 Dec 2023 22:54:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Ppe+RdEW" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 41AE9C433C8; Sun, 31 Dec 2023 22:54:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1704063297; bh=XATPJKcf5uevdtzySLIyWQlXhBsV3vZjT89AHaEmWfg=; h=Date:Subject:From:To:Cc:In-Reply-To:References:From; b=Ppe+RdEW8Sm8PTmdN+iF7MKCZmEjWq6OkNis6MLCoM2Fc+RwFgY9XDtI4hBD6RSE4 x5r2IBqMqkr9PO7BRqWuNEf5qF9Dz0t0wn+dhcYbXEg6NeYo9oDKJKecNGCN5MkezO LutA41/qxWcWgAd0UhgurqJb3dpNnoi0J3Z+YtFXWKDRaipBFc4fUZ1AN3vjLQKBAo u6GtCci+T0VUL74NLpM35fiUF4DrL2dvvDAX4H5VFF9dX0i1LKAojY23C3U3JcxO5Y T8ruLgEvf1S/gkMmhirqRiF7L6p76QYnwokkLcUE6DT+yE/V9s6deSpET8w9uxwvFq h+vPB6h+ofnCg== Date: Sun, 31 Dec 2023 14:54:56 -0800 Subject: [PATCH 1/4] xfs_scrub_all: fix argument passing when invoking xfs_scrub manually From: "Darrick J. Wong" To: djwong@kernel.org, cem@kernel.org Cc: Christoph Hellwig , linux-xfs@vger.kernel.org Message-ID: <170405002269.1801148.13653507616866096035.stgit@frogsfrogsfrogs> In-Reply-To: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> References: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> User-Agent: StGit/0.19 Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Darrick J. Wong Currently, xfs_scrub_all will try to invoke xfs_scrub with argv[1] being "-n -x". This of course is recognized by C getopt as a weird looking string, not two individual arguments, and causes the child process to exit with complaints about CLI usage. What we really want is to split the string into a proper array and then add them to the xfs_scrub command line. The code here isn't strictly correct, but as @scrub_args@ is controlled by us in the Makefile, it'll do for now. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- scrub/xfs_scrub_all.in | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in index d7d36e1bdb0..671d588177a 100644 --- a/scrub/xfs_scrub_all.in +++ b/scrub/xfs_scrub_all.in @@ -123,7 +123,9 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs): return # Invoke xfs_scrub manually - cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt] + cmd = ['@sbindir@/xfs_scrub'] + cmd += '@scrub_args@'.split() + cmd += [mnt] ret = run_killable(cmd, None, killfuncs, \ lambda proc: proc.terminate()) if ret >= 0: From patchwork Sun Dec 31 22:55:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 13507985 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0FFC6C127 for ; Sun, 31 Dec 2023 22:55:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="TiqXbyAB" Received: by smtp.kernel.org (Postfix) with ESMTPSA id D4720C433C7; Sun, 31 Dec 2023 22:55:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1704063312; bh=+o2JqwE17+YmTJMyCMxeMM6k0FZoFhJ4AkKSJCcmE80=; h=Date:Subject:From:To:Cc:In-Reply-To:References:From; b=TiqXbyABSMg3+49GjhZtvpUx9mVLwrgUcaIgwj9P4UEXOfvaxWCUDpRKkDub5h96K cUZ8F3yCM+n0jlbRoFIAsuJzjUvKXf2OQfkC9e4eoj0bRsP8FGx30H20ONcPvmrfFr qxXO5tzBkU2jF7cViQlyNZNHIpa97Z/NmMRcaQqGqrAiy2G6yMDGAkNMaY7bn29ZIu XGQFbnDgLVa2yBlGlxLMnjbsqThs3QYc4Dh1xrNd7VTWlg/uSGfHlSUCRzQ3lRGpeQ wAHpeqUFcA39or9Nm9lAXth+G5407rMs4+W4PxxjJbSnPIiMMmRYlefWeG9YcznvNY Via2h+0c5jlCQ== Date: Sun, 31 Dec 2023 14:55:12 -0800 Subject: [PATCH 2/4] xfs_scrub_all: survive systemd restarts when waiting for services From: "Darrick J. Wong" To: djwong@kernel.org, cem@kernel.org Cc: Christoph Hellwig , linux-xfs@vger.kernel.org Message-ID: <170405002282.1801148.13065805971923682262.stgit@frogsfrogsfrogs> In-Reply-To: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> References: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> User-Agent: StGit/0.19 Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Darrick J. Wong If xfs_scrub_all detects a running systemd, it will use it to invoke xfs_scrub subprocesses in a sandboxed and resource-controlled environment. Unfortunately, if you happen to restart dbus or systemd while it's running, you get this: systemd[1]: Reexecuting. xfs_scrub_all[9958]: Warning! D-Bus connection terminated. xfs_scrub_all[9956]: Warning! D-Bus connection terminated. xfs_scrub_all[9956]: Failed to wait for response: Connection reset by peer xfs_scrub_all[9958]: Failed to wait for response: Connection reset by peer xfs_scrub_all[9930]: Scrubbing / done, (err=1) xfs_scrub_all[9930]: Scrubbing /storage done, (err=1) The xfs_scrub units themselves are still running, it's just that the `systemctl start' command that xfs_scrub_all uses to start and wait for the unit lost its connection to dbus and hence is no longer monitoring sub-services. When this happens, we don't have great options -- systemctl doesn't have a command to wait on an activating (aka running) unit. Emulate the functionality we normally get by polling the failed/active statuses. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- scrub/xfs_scrub_all.in | 78 ++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 65 insertions(+), 13 deletions(-) diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in index 671d588177a..ab9b491fb4e 100644 --- a/scrub/xfs_scrub_all.in +++ b/scrub/xfs_scrub_all.in @@ -14,6 +14,7 @@ import time import sys import os import argparse +from io import TextIOWrapper retcode = 0 terminate = False @@ -58,12 +59,18 @@ def find_mounts(): return fs -def kill_systemd(unit, proc): - '''Kill systemd unit.''' - proc.terminate() - cmd=['systemctl', 'stop', unit] - x = subprocess.Popen(cmd) - x.wait() +def backtick(cmd): + '''Generator function that yields lines of a program's stdout.''' + p = subprocess.Popen(cmd, stdout = subprocess.PIPE) + for line in TextIOWrapper(p.stdout, encoding="utf-8"): + yield line.strip() + +def remove_killfunc(killfuncs, fn): + '''Ensure fn is not in killfuncs.''' + try: + killfuncs.remove(fn) + except: + pass def run_killable(cmd, stdout, killfuncs, kill_fn): '''Run a killable program. Returns program retcode or -1 if we can't start it.''' @@ -72,10 +79,7 @@ def run_killable(cmd, stdout, killfuncs, kill_fn): real_kill_fn = lambda: kill_fn(proc) killfuncs.add(real_kill_fn) proc.wait() - try: - killfuncs.remove(real_kill_fn) - except: - pass + remove_killfunc(killfuncs, real_kill_fn) return proc.returncode except: return -1 @@ -96,6 +100,56 @@ def path_to_serviceunit(path): except: return None +def systemctl_stop(unitname): + '''Stop a systemd unit.''' + cmd = ['systemctl', 'stop', unitname] + x = subprocess.Popen(cmd) + x.wait() + +def systemctl_start(unitname, killfuncs): + '''Start a systemd unit and wait for it to complete.''' + stop_fn = None + cmd = ['systemctl', 'start', unitname] + try: + proc = subprocess.Popen(cmd, stdout = DEVNULL()) + stop_fn = lambda: systemctl_stop(unitname) + killfuncs.add(stop_fn) + proc.wait() + ret = proc.returncode + except: + if stop_fn is not None: + remove_killfunc(killfuncs, stop_fn) + return -1 + + if ret != 1: + remove_killfunc(killfuncs, stop_fn) + return ret + + # If systemctl-start returns 1, it's possible that the service failed + # or that dbus/systemd restarted and the client program lost its + # connection -- according to the systemctl man page, 1 means "unit not + # failed". + # + # Either way, we switch to polling the service status to try to wait + # for the service to end. As of systemd 249, the is-active command + # returns any of the following states: active, reloading, inactive, + # failed, activating, deactivating, or maintenance. Apparently these + # strings are not localized. + while True: + try: + for l in backtick(['systemctl', 'is-active', unitname]): + if l == 'failed': + remove_killfunc(killfuncs, stop_fn) + return 1 + if l == 'inactive': + remove_killfunc(killfuncs, stop_fn) + return 0 + except: + remove_killfunc(killfuncs, stop_fn) + return -1 + + time.sleep(1) + def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs): '''Run a scrub process.''' global retcode, terminate @@ -110,9 +164,7 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs): # Try it the systemd way unitname = path_to_serviceunit(path) if unitname is not None: - cmd=['systemctl', 'start', unitname] - ret = run_killable(cmd, DEVNULL(), killfuncs, \ - lambda proc: kill_systemd(unitname, proc)) + ret = systemctl_start(unitname, killfuncs) if ret == 0 or ret == 1: print("Scrubbing %s done, (err=%d)" % (mnt, ret)) sys.stdout.flush() From patchwork Sun Dec 31 22:55:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 13507986 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AC3FCC140 for ; Sun, 31 Dec 2023 22:55:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ZSocd9+V" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7BA31C433C8; Sun, 31 Dec 2023 22:55:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1704063328; bh=GVaXoHC9qDBvzS4qj+NhDzU+Ho4ZTsInEsTYURJa/AE=; h=Date:Subject:From:To:Cc:In-Reply-To:References:From; b=ZSocd9+V+LC4ZckYH836oyhi/noUEpe+w4hWrx5KQ/ssgHoC93vcIzxKZ4/uJ/c5C 9OqGT/9pZyyedo396z/5v6YttT3ThaFaeOyGVlAX5T8t2SAHyfUZJ+j1hWjWVArDF6 54qU7t1KN0n9w3c8QRsX7qFwkIKJYg2oe6agjONCYsl0gApyMJ8aSU0LdHMkljpsxT OPElopLObWbRkeHSfvs/EyNDAZMh6C/sxiYk7CcFDZH9dKp7hwunuNCWZYjlg0KfmE YoE9P4oboSCbWMtt7YYxfmhuytph99FGNblrV4kVec/L90FjEXPtA3hk9QEXrcaMVs EGVIH2NdwzBwg== Date: Sun, 31 Dec 2023 14:55:28 -0800 Subject: [PATCH 3/4] xfs_scrub_all: simplify cleanup of run_killable From: "Darrick J. Wong" To: djwong@kernel.org, cem@kernel.org Cc: Christoph Hellwig , linux-xfs@vger.kernel.org Message-ID: <170405002295.1801148.14170763288665409996.stgit@frogsfrogsfrogs> In-Reply-To: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> References: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> User-Agent: StGit/0.19 Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Darrick J. Wong Get rid of the nested lambda functions to simplify the code. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- scrub/xfs_scrub_all.in | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in index ab9b491fb4e..2c20b91fdbe 100644 --- a/scrub/xfs_scrub_all.in +++ b/scrub/xfs_scrub_all.in @@ -72,14 +72,14 @@ def remove_killfunc(killfuncs, fn): except: pass -def run_killable(cmd, stdout, killfuncs, kill_fn): - '''Run a killable program. Returns program retcode or -1 if we can't start it.''' +def run_killable(cmd, stdout, killfuncs): + '''Run a killable program. Returns program retcode or -1 if we can't + start it.''' try: proc = subprocess.Popen(cmd, stdout = stdout) - real_kill_fn = lambda: kill_fn(proc) - killfuncs.add(real_kill_fn) + killfuncs.add(proc.terminate) proc.wait() - remove_killfunc(killfuncs, real_kill_fn) + remove_killfunc(killfuncs, proc.terminate) return proc.returncode except: return -1 @@ -178,8 +178,7 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs): cmd = ['@sbindir@/xfs_scrub'] cmd += '@scrub_args@'.split() cmd += [mnt] - ret = run_killable(cmd, None, killfuncs, \ - lambda proc: proc.terminate()) + ret = run_killable(cmd, None, killfuncs) if ret >= 0: print("Scrubbing %s done, (err=%d)" % (mnt, ret)) sys.stdout.flush() From patchwork Sun Dec 31 22:55:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 13507987 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 508C6C140 for ; Sun, 31 Dec 2023 22:55:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="IlIHKDk0" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1AE1FC433C7; Sun, 31 Dec 2023 22:55:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1704063344; bh=J0TA7YIL/gvm/xZYXcrJpDm8PpDM7j343nSoKhEVwQY=; h=Date:Subject:From:To:Cc:In-Reply-To:References:From; b=IlIHKDk0sIINJ30OiVr/5jcqe991MrE+RpILvjEsQ0LEBU4FgSQAsJFMAtYuwJer7 25+ra8svJbGxnYWvCoQAI793CpKcVxUR/pKPchum3MRR/gOh0/LBF8oM2EubyyHhcu JG5o+cPZ+llu61TokaLCReYRXUYQWeqN/XXaTQ3AfN2WBqBY/XwhQAN7L3nDTK6PlV ZiiXSua6ZIe/gY7AjkH+3L68H8LWIAuZYhv0yDlYqebzX9WPrmglsN5XPNy2FvxJQ6 kRKbVgejw71HO4NGckVEYJ+H0BnPpgP799vzafnbT16nApST59ei8c9mwQQ9iva/1j fyy4sGmdNBEQg== Date: Sun, 31 Dec 2023 14:55:43 -0800 Subject: [PATCH 4/4] xfs_scrub_all: fix termination signal handling From: "Darrick J. Wong" To: djwong@kernel.org, cem@kernel.org Cc: Christoph Hellwig , linux-xfs@vger.kernel.org Message-ID: <170405002308.1801148.15910170825443227332.stgit@frogsfrogsfrogs> In-Reply-To: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> References: <170405002254.1801148.6324602186356936873.stgit@frogsfrogsfrogs> User-Agent: StGit/0.19 Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Darrick J. Wong Currently, xfs_scrub_all does not handle termination signals well. SIGTERM and SIGINT are left to their default handlers, which are immediate termination of the process group in the case of SIGTERM and raising KeyboardInterrupt in the case of SIGINT. Terminating the process group is fine when the xfs_scrub processes are direct children, but this completely doesn't work if we're farming the work out to systemd services since we don't terminate the child service. Instead, they keep going. Raising KeyboardInterrupt doesn't work because once the main thread calls sys.exit at the bottom of main(), it blocks in the python runtime waiting for child threads to terminate. There's no longer any context to handle an exception, so the signal is ignored and no child processes are killed. In other words, if you try to kill a running xfs_scrub_all, chances are good it won't kill the child xfs_scrub processes. This is undesirable and egregious since we actually have the ability to track and kill all the subprocesses that we create. Solve the subproblem of getting stuck in the python runtime by calling it repeatedly until we no longer have subprocesses. This means that the main thread loops until all threads have exited. Solve the subproblem of the signals doing the wrong thing by setting up our own signal handler that can wake up the main thread and initiate subprocess shutdown, no matter whether the subprocesses are systemd services or directly fork/exec'd. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- scrub/xfs_scrub_all.in | 64 +++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 52 insertions(+), 12 deletions(-) diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in index 2c20b91fdbe..d0ab27fd306 100644 --- a/scrub/xfs_scrub_all.in +++ b/scrub/xfs_scrub_all.in @@ -14,6 +14,7 @@ import time import sys import os import argparse +import signal from io import TextIOWrapper retcode = 0 @@ -196,6 +197,45 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs): cond.notify() cond.release() +def signal_scrubs(signum, cond): + '''Handle termination signals by killing xfs_scrub children.''' + global debug, terminate + + if debug: + print('Signal handler called with signal', signum) + sys.stdout.flush() + + terminate = True + cond.acquire() + cond.notify() + cond.release() + +def wait_for_termination(cond, killfuncs): + '''Wait for a child thread to terminate. Returns True if we should + abort the program, False otherwise.''' + global debug, terminate + + if debug: + print('waiting for threads to terminate') + sys.stdout.flush() + + cond.acquire() + try: + cond.wait() + except KeyboardInterrupt: + terminate = True + cond.release() + + if not terminate: + return False + + print("Terminating...") + sys.stdout.flush() + while len(killfuncs) > 0: + fn = killfuncs.pop() + fn() + return True + def main(): '''Find mounts, schedule scrub runs.''' def thr(mnt, devs): @@ -231,6 +271,10 @@ def main(): running_devs = set() killfuncs = set() cond = threading.Condition() + + signal.signal(signal.SIGINT, lambda s, f: signal_scrubs(s, cond)) + signal.signal(signal.SIGTERM, lambda s, f: signal_scrubs(s, cond)) + while len(fs) > 0: if len(running_devs) == 0: mnt, devs = fs.popitem() @@ -250,18 +294,14 @@ def main(): thr(mnt, devs) for p in poppers: fs.pop(p) - cond.acquire() - try: - cond.wait() - except KeyboardInterrupt: - terminate = True - print("Terminating...") - sys.stdout.flush() - while len(killfuncs) > 0: - fn = killfuncs.pop() - fn() - fs = [] - cond.release() + + # Wait for one thread to finish + if wait_for_termination(cond, killfuncs): + break + + # Wait for the rest of the threads to finish + while len(killfuncs) > 0: + wait_for_termination(cond, killfuncs) if journalthread is not None: journalthread.terminate()