[v3,4/5] pid: mark pids associated with group leader tasks

Message ID	20221202171620.509140-5-bfoster@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Brian Foster <bfoster@redhat.com> To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Cc: ikent@redhat.com, onestero@redhat.com, willy@infradead.org, ebiederm@redhat.com Subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks Date: Fri, 2 Dec 2022 12:16:19 -0500 Message-Id: <20221202171620.509140-5-bfoster@redhat.com> In-Reply-To: <20221202171620.509140-1-bfoster@redhat.com> References: <20221202171620.509140-1-bfoster@redhat.com> MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 8bit Precedence: bulk
Series	proc: improve root readdir latency with many threads \| expand [v3,0/5] proc: improve root readdir latency with many threads [v3,1/5] pid: replace pidmap_lock with xarray lock [v3,2/5] pid: split cyclic id allocation cursor from idr [v3,3/5] pid: switch pid_namespace from idr to xarray [v3,4/5] pid: mark pids associated with group leader tasks [v3,5/5] procfs: use efficient tgid pid search on root readdir

Message ID

20221202171620.509140-5-bfoster@redhat.com (mailing list archive)

State

New, archived

Headers

From: Brian Foster <bfoster@redhat.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com, willy@infradead.org,
        ebiederm@redhat.com
Subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks
Date: Fri,  2 Dec 2022 12:16:19 -0500
Message-Id: <20221202171620.509140-5-bfoster@redhat.com>
In-Reply-To: <20221202171620.509140-1-bfoster@redhat.com>
References: <20221202171620.509140-1-bfoster@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

proc: improve root readdir latency with many threads | expand

Commit Message

Brian Foster Dec. 2, 2022, 5:16 p.m. UTC

Searching the pid_namespace for group leader tasks is a fairly
inefficient operation. Listing the root directory of a procfs mount
performs a linear scan of allocated pids, checking each entry for an
associated PIDTYPE_TGID task to determine whether to populate a
directory entry. This can cause a significant increase in readdir()
syscall latency when run in namespaces that might have one or more
processes with significant thread counts.

To facilitate improved TGID pid searches, mark the ids of pid
entries that are likely to have an associated PIDTYPE_TGID task. To
keep the code simple and avoid having to maintain synchronization
between mark state and post-fork pid-task association changes, the
mark is applied to all pids allocated for tasks cloned without
CLONE_THREAD.

This means that it is possible for a pid to remain marked in the
xarray after being disassociated from the group leader task. For
example, a process that does a setsid() followed by fork() and
exit() (to daemonize) will remain associated with the original pid
for the session, but link with the child pid as the group leader.
OTOH, the only place other than fork() where a tgid association
occurs is in the exec() path, which kills all other tasks in the
group and associates the current task with the preexisting leader
pid. Therefore, the semantics of the mark are that false positives
(marked pids without PIDTYPE_TGID tasks) are possible, but false
negatives (unmarked pids without PIDTYPE_TGID tasks) should never
occur.

This is an effective optimization because false negatives are fairly
uncommon and don't add overhead (i.e. we already have to check
pid_task() for marked entries), but still filters out thread pids
that are guaranteed not to have TGID task association.

Mark entries in the pid allocation path when the caller specifies
that the pid associates with a new thread group. Since false
negatives are not allowed, warn in the event that a PIDTYPE_TGID
task is ever attached to an unmarked pid. Finally, create a helper
to implement the task search based on the mark semantics defined
above (based on search logic currently implemented by next_tgid() in
procfs).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 include/linux/pid.h |  3 ++-
 kernel/fork.c       |  2 +-
 kernel/pid.c        | 44 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 46 insertions(+), 3 deletions(-)

Comments

Ian Kent Dec. 12, 2022, 1:51 a.m. UTC | #1

On 3/12/22 01:16, Brian Foster wrote:
> Searching the pid_namespace for group leader tasks is a fairly
> inefficient operation. Listing the root directory of a procfs mount
> performs a linear scan of allocated pids, checking each entry for an
> associated PIDTYPE_TGID task to determine whether to populate a
> directory entry. This can cause a significant increase in readdir()
> syscall latency when run in namespaces that might have one or more
> processes with significant thread counts.
>
> To facilitate improved TGID pid searches, mark the ids of pid
> entries that are likely to have an associated PIDTYPE_TGID task. To
> keep the code simple and avoid having to maintain synchronization
> between mark state and post-fork pid-task association changes, the
> mark is applied to all pids allocated for tasks cloned without
> CLONE_THREAD.
>
> This means that it is possible for a pid to remain marked in the
> xarray after being disassociated from the group leader task. For
> example, a process that does a setsid() followed by fork() and
> exit() (to daemonize) will remain associated with the original pid
> for the session, but link with the child pid as the group leader.
> OTOH, the only place other than fork() where a tgid association
> occurs is in the exec() path, which kills all other tasks in the
> group and associates the current task with the preexisting leader
> pid. Therefore, the semantics of the mark are that false positives
> (marked pids without PIDTYPE_TGID tasks) are possible, but false
> negatives (unmarked pids without PIDTYPE_TGID tasks) should never
> occur.
>
> This is an effective optimization because false negatives are fairly
> uncommon and don't add overhead (i.e. we already have to check
> pid_task() for marked entries), but still filters out thread pids
> that are guaranteed not to have TGID task association.
>
> Mark entries in the pid allocation path when the caller specifies
> that the pid associates with a new thread group. Since false
> negatives are not allowed, warn in the event that a PIDTYPE_TGID
> task is ever attached to an unmarked pid. Finally, create a helper
> to implement the task search based on the mark semantics defined
> above (based on search logic currently implemented by next_tgid() in
> procfs).

Yes, the tricky bit, but the analysis sounds thorough so it

should work for all cases ...


>
> Signed-off-by: Brian Foster <bfoster@redhat.com>


Reviewed-by: Ian Kent <raven@themaw.net>

> ---
>   include/linux/pid.h |  3 ++-
>   kernel/fork.c       |  2 +-
>   kernel/pid.c        | 44 +++++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 343abf22092e..64caf21be256 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -132,9 +132,10 @@ extern struct pid *find_vpid(int nr);
>    */
>   extern struct pid *find_get_pid(int nr);
>   extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);
>   
>   extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> -			     size_t set_tid_size);
> +			     size_t set_tid_size, bool group_leader);
>   extern void free_pid(struct pid *pid);
>   extern void disable_pid_allocation(struct pid_namespace *ns);
>   
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 08969f5aa38d..1cf2644c642e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2267,7 +2267,7 @@ static __latent_entropy struct task_struct *copy_process(
>   
>   	if (pid != &init_struct_pid) {
>   		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
> -				args->set_tid_size);
> +				args->set_tid_size, !(clone_flags & CLONE_THREAD));
>   		if (IS_ERR(pid)) {
>   			retval = PTR_ERR(pid);
>   			goto bad_fork_cleanup_thread;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 53db06f9882d..d65f74c6186c 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -66,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
>   int pid_max_min = RESERVED_PIDS + 1;
>   int pid_max_max = PID_MAX_LIMIT;
>   
> +/* MARK_0 used by XA_FREE_MARK */
> +#define TGID_MARK	XA_MARK_1
> +
>   struct pid_namespace init_pid_ns = {
>   	.ns.count = REFCOUNT_INIT(2),
>   	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> @@ -137,7 +140,7 @@ void free_pid(struct pid *pid)
>   }
>   
>   struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> -		      size_t set_tid_size)
> +		      size_t set_tid_size, bool group_leader)
>   {
>   	struct pid *pid;
>   	enum pid_type type;
> @@ -257,6 +260,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>   
>   		/* Make the PID visible to find_pid_ns. */
>   		__xa_store(&tmp->xa, upid->nr, pid, 0);
> +		if (group_leader)
> +			__xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
>   		tmp->pid_allocated++;
>   		xa_unlock_irq(&tmp->xa);
>   	}
> @@ -314,6 +319,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
>   void attach_pid(struct task_struct *task, enum pid_type type)
>   {
>   	struct pid *pid = *task_pid_ptr(task, type);
> +	struct pid_namespace *pid_ns = ns_of_pid(pid);
> +	pid_t pid_nr = pid_nr_ns(pid, pid_ns);
> +
> +	WARN_ON(type == PIDTYPE_TGID &&
> +		!xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
>   	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
>   }
>   
> @@ -506,6 +516,38 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>   }
>   EXPORT_SYMBOL_GPL(find_ge_pid);
>   
> +/*
> + * Used by proc to find the first thread group leader task with an id greater
> + * than or equal to *id.
> + *
> + * Use the xarray mark as a hint to find the next best pid. The mark does not
> + * guarantee a linked group leader task exists, so retry until a suitable entry
> + * is found.
> + */
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
> +{
> +	struct pid *pid;
> +	struct task_struct *t;
> +	unsigned long nr = *id;
> +
> +	rcu_read_lock();
> +	do {
> +		pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
> +		if (!pid) {
> +			rcu_read_unlock();
> +			return NULL;
> +		}
> +		t = pid_task(pid, PIDTYPE_TGID);
> +		nr++;
> +	} while (!t);
> +
> +	*id = pid_nr_ns(pid, ns);
> +	get_task_struct(t);
> +	rcu_read_unlock();
> +
> +	return t;
> +}
> +
>   struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
>   {
>   	struct fd f;

Yujie Liu Dec. 13, 2022, 2 a.m. UTC | #2

Greeting,

FYI, we noticed a -4.7% regression of stress-ng.vfork.ops_per_sec due to commit:

commit: 88294e6f6d1e1a9169cc9b715050bd8b52ac5f44 ("[PATCH v3 4/5] pid: mark pids associated with group leader tasks")
url: https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/proc-improve-root-readdir-latency-with-many-threads/20221203-012018
base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
patch link: https://lore.kernel.org/all/20221202171620.509140-5-bfoster@redhat.com/
patch subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks

in testcase: stress-ng
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
with following parameters:

	nr_threads: 100%
	testtime: 60s
	sc_pid_max: 4194304
	class: scheduler
	test: vfork
	cpufreq_governor: performance


Details are as below:

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
  scheduler/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-icl-2sp5/vfork/stress-ng/60s

commit: 
  eae2900480 ("pid: switch pid_namespace from idr to xarray")
  88294e6f6d ("pid: mark pids associated with group leader tasks")

eae2900480d61b93 88294e6f6d1e1a9169cc9b71505 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
  26176589 ±  3%      -5.4%   24757728        stress-ng.time.voluntary_context_switches
  11320148            -4.7%   10789611        stress-ng.vfork.ops
    188669            -4.7%     179826        stress-ng.vfork.ops_per_sec
     48483 ± 13%     +17.3%      56864 ±  3%  numa-vmstat.node1.nr_slab_unreclaimable
    721230 ±  4%      -5.7%     679849        vmstat.system.cs
   1192641 ±  3%     -21.4%     937830        sched_debug.cpu.curr->pid.max
    469229 ±  9%     -18.1%     384233 ±  5%  sched_debug.cpu.curr->pid.stddev
    768469 ±  4%      -6.0%     722315        perf-stat.i.context-switches
      0.09 ±  5%      -0.0        0.07 ±  3%  perf-stat.i.dTLB-load-miss-rate%
  10758930 ±  5%     -11.1%    9564450 ±  3%  perf-stat.i.dTLB-load-misses
   2480878 ±  2%      -4.1%    2380112 ±  2%  perf-stat.i.dTLB-store-misses
      0.15            +2.7%       0.15        perf-stat.i.ipc
  23566766            -3.5%   22751863        perf-stat.i.node-load-misses
  10701300            -4.6%   10206578        perf-stat.i.node-store-misses
      0.09 ±  4%      -0.0        0.08 ±  4%  perf-stat.overall.dTLB-load-miss-rate%
    743961 ±  4%      -6.0%     699259        perf-stat.ps.context-switches
  10436726 ±  5%     -11.3%    9261813 ±  4%  perf-stat.ps.dTLB-load-misses
  22838094            -3.4%   22057626        perf-stat.ps.node-load-misses
  10394494            -4.3%    9950784        perf-stat.ps.node-store-misses



If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202212130906.15eab7ed-yujie.liu@intel.com


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 343abf22092e..64caf21be256 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -132,9 +132,10 @@  extern struct pid *find_vpid(int nr);
  */
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-			     size_t set_tid_size);
+			     size_t set_tid_size, bool group_leader);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..1cf2644c642e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2267,7 +2267,7 @@  static __latent_entropy struct task_struct *copy_process(
 
 	if (pid != &init_struct_pid) {
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
-				args->set_tid_size);
+				args->set_tid_size, !(clone_flags & CLONE_THREAD));
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 53db06f9882d..d65f74c6186c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -66,6 +66,9 @@  int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
+/* MARK_0 used by XA_FREE_MARK */
+#define TGID_MARK	XA_MARK_1
+
 struct pid_namespace init_pid_ns = {
 	.ns.count = REFCOUNT_INIT(2),
 	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
@@ -137,7 +140,7 @@  void free_pid(struct pid *pid)
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-		      size_t set_tid_size)
+		      size_t set_tid_size, bool group_leader)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -257,6 +260,8 @@  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 		/* Make the PID visible to find_pid_ns. */
 		__xa_store(&tmp->xa, upid->nr, pid, 0);
+		if (group_leader)
+			__xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
 		tmp->pid_allocated++;
 		xa_unlock_irq(&tmp->xa);
 	}
@@ -314,6 +319,11 @@  static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid = *task_pid_ptr(task, type);
+	struct pid_namespace *pid_ns = ns_of_pid(pid);
+	pid_t pid_nr = pid_nr_ns(pid, pid_ns);
+
+	WARN_ON(type == PIDTYPE_TGID &&
+		!xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
@@ -506,6 +516,38 @@  struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 }
 EXPORT_SYMBOL_GPL(find_ge_pid);
 
+/*
+ * Used by proc to find the first thread group leader task with an id greater
+ * than or equal to *id.
+ *
+ * Use the xarray mark as a hint to find the next best pid. The mark does not
+ * guarantee a linked group leader task exists, so retry until a suitable entry
+ * is found.
+ */
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
+{
+	struct pid *pid;
+	struct task_struct *t;
+	unsigned long nr = *id;
+
+	rcu_read_lock();
+	do {
+		pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
+		if (!pid) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		t = pid_task(pid, PIDTYPE_TGID);
+		nr++;
+	} while (!t);
+
+	*id = pid_nr_ns(pid, ns);
+	get_task_struct(t);
+	rcu_read_unlock();
+
+	return t;
+}
+
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
 {
 	struct fd f;

[v3,4/5] pid: mark pids associated with group leader tasks

Commit Message

Comments

Patch