
[1/3] radix-tree: propagate all tags in idr tree

Message ID 20220614180949.102914-2-bfoster@redhat.com (mailing list archive)
State New, archived
Series: proc: improve root readdir latency with many threads

Commit Message

Brian Foster June 14, 2022, 6:09 p.m. UTC
The IDR tree has hardcoded tag propagation logic to handle the
internal IDR_FREE tag and ignore all others. Fix up the hardcoded
logic to support additional tags.

This is specifically to support a new internal IDR_TGID radix tree
tag used to improve search efficiency of pids with associated
PIDTYPE_TGID tasks within a pid namespace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 lib/radix-tree.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig June 15, 2022, 11:12 a.m. UTC | #1
On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> The IDR tree has hardcoded tag propagation logic to handle the
> internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> logic to support additional tags.
> 
> This is specifically to support a new internal IDR_TGID radix tree
> tag used to improve search efficiency of pids with associated
> PIDTYPE_TGID tasks within a pid namespace.

Wouldn't it make sense to switch over to an xarray here rather
than adding new features to the radix tree?
Brian Foster June 15, 2022, 12:57 p.m. UTC | #2
On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > The IDR tree has hardcoded tag propagation logic to handle the
> > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > logic to support additional tags.
> > 
> > This is specifically to support a new internal IDR_TGID radix tree
> > tag used to improve search efficiency of pids with associated
> > PIDTYPE_TGID tasks within a pid namespace.
> 
> Wouldn't it make sense to switch over to an xarray here rather
> than adding new features to the radix tree?
> 

The xarray question crossed my mind when I first waded into this code
and realized the idr tree seems to be some sort of offshoot or custom
mode of the core radix tree. I eventually realized that the problem with
normal radix tree tags in the idr variant was that its tag propagation
logic simply didn't handle traditional tags, presumably because they
were unused in that mode. So this patch doesn't really add a feature to
the radix-tree, it just fixes up some of the grotty idr tree logic to
handle both forms of tags.

I assume it makes sense for this to move towards xarray in general, but
I don't have enough context on either side to know what the sticking
points are. In particular, does xarray support something analogous to
IDR_FREE or otherwise solve whatever problem idr currently depends on
it for (i.e. efficient id allocation)? I think Willy has done work in
this area, so I'm hoping he can chime in if he's put any thought into
the idr case specifically.

WRT this series, I really don't think it makes much sense to put a
broad rework of the idr code in front of what is otherwise an
incremental performance improvement based on using a mechanism that
radix-tree pretty much already supports (i.e. tags). Is there some
fundamental problem you see with this patch that apparently depends on
xarray for some reason, or are you just calling it out as apparent
technical debt in this area of code? If the latter, then I think it
makes more sense to consider that as a followup effort.

Brian
Ian Kent June 15, 2022, 1:33 p.m. UTC | #3
On 15/6/22 19:12, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
>> The IDR tree has hardcoded tag propagation logic to handle the
>> internal IDR_FREE tag and ignore all others. Fix up the hardcoded
>> logic to support additional tags.
>>
>> This is specifically to support a new internal IDR_TGID radix tree
>> tag used to improve search efficiency of pids with associated
>> PIDTYPE_TGID tasks within a pid namespace.
> Wouldn't it make sense to switch over to an xarray here rather
> than adding new features to the radix tree?
>
Might be a dumb question but ...

How would traversal of the essentially sparse set of PIDTYPE_TGID pids
benefit from an xarray?

From what I have seen the searching isn't the real problem, it's the
traversal (which, at the moment, does a search 'and' a traversal over
a lot of pids to get a relatively small number of them).

Ian
Matthew Wilcox June 15, 2022, 1:40 p.m. UTC | #4
On Wed, Jun 15, 2022 at 08:57:56AM -0400, Brian Foster wrote:
> On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > > The IDR tree has hardcoded tag propagation logic to handle the
> > > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > > logic to support additional tags.
> > > 
> > > This is specifically to support a new internal IDR_TGID radix tree
> > > tag used to improve search efficiency of pids with associated
> > > PIDTYPE_TGID tasks within a pid namespace.
> > 
> > Wouldn't it make sense to switch over to an xarray here rather
> > than adding new features to the radix tree?
> > 
> 
> The xarray question crossed my mind when I first waded into this code
> and realized the idr tree seems to be some sort of offshoot or custom
> mode of the core radix tree. I eventually realized that the problem with
> normal radix tree tags in the idr variant was that its tag propagation
> logic simply didn't handle traditional tags, presumably because they
> were unused in that mode. So this patch doesn't really add a feature to
> the radix-tree, it just fixes up some of the grotty idr tree logic to
> handle both forms of tags.
> 
> I assume it makes sense for this to move towards xarray in general, but
> I don't have enough context on either side to know what the sticking
> points are. In particular, does xarray support something analogous to
> IDR_FREE or otherwise solve whatever problem idr currently depends on
> it for (i.e. efficient id allocation)? I think Willy has done work in
> this area, so I'm hoping he can chime in if he's put any thought into
> the idr case specifically.

Without going into the history of the idr/radix-tree/xarray, the
current hope is that we'll move all users of the idr & radix tree
over to the xarray API.  It's fundamentally the same data structure
for all three now, just a question of the API change. 

The XArray does indeed have a way to solve the IDR_FREE problem;
you need to declare an allocating XArray:
https://www.kernel.org/doc/html/latest/core-api/xarray.html#allocating-xarrays

and using XA_MARK_1 and XA_MARK_2 should work the way you want them to.
Brian Foster June 15, 2022, 2:43 p.m. UTC | #5
On Wed, Jun 15, 2022 at 02:40:43PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 15, 2022 at 08:57:56AM -0400, Brian Foster wrote:
> > On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> > > On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > > > The IDR tree has hardcoded tag propagation logic to handle the
> > > > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > > > logic to support additional tags.
> > > > 
> > > > This is specifically to support a new internal IDR_TGID radix tree
> > > > tag used to improve search efficiency of pids with associated
> > > > PIDTYPE_TGID tasks within a pid namespace.
> > > 
> > > Wouldn't it make sense to switch over to an xarray here rather
> > > than adding new features to the radix tree?
> > > 
> > 
> > The xarray question crossed my mind when I first waded into this code
> > and realized the idr tree seems to be some sort of offshoot or custom
> > mode of the core radix tree. I eventually realized that the problem with
> > normal radix tree tags in the idr variant was that its tag propagation
> > logic simply didn't handle traditional tags, presumably because they
> > were unused in that mode. So this patch doesn't really add a feature to
> > the radix-tree, it just fixes up some of the grotty idr tree logic to
> > handle both forms of tags.
> > 
> > I assume it makes sense for this to move towards xarray in general, but
> > I don't have enough context on either side to know what the sticking
> > points are. In particular, does xarray support something analogous to
> > IDR_FREE or otherwise solve whatever problem idr currently depends on
> > it for (i.e. efficient id allocation)? I think Willy has done work in
> > this area, so I'm hoping he can chime in if he's put any thought into
> > the idr case specifically.
> 
> Without going into the history of the idr/radix-tree/xarray, the
> current hope is that we'll move all users of the idr & radix tree
> over to the xarray API.  It's fundamentally the same data structure
> for all three now, just a question of the API change. 
> 
> The XArray does indeed have a way to solve the IDR_FREE problem;
> you need to declare an allocating XArray:
> https://www.kernel.org/doc/html/latest/core-api/xarray.html#allocating-xarrays
> 
> and using XA_MARK_1 and XA_MARK_2 should work the way you want them to.
> 

Interesting, thanks. I'll have to dig more into this to grok the current
state of the radix-tree interface vs. the underlying data structure. If
I follow correctly, you're saying the radix-tree api is essentially
already a translation layer to the xarray these days, and we just need
to move legacy users off the radix-tree api so we can eventually kill it
off...

Brian
Matthew Wilcox June 15, 2022, 4:34 p.m. UTC | #6
On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> Interesting, thanks. I'll have to dig more into this to grok the current
> state of the radix-tree interface vs. the underlying data structure. If
> I follow correctly, you're saying the radix-tree api is essentially
> already a translation layer to the xarray these days, and we just need
> to move legacy users off the radix-tree api so we can eventually kill it
> off...

If only it were that easy ... the XArray has a whole bunch of debugging
asserts to make sure the users are actually using it correctly, and a
lot of radix tree users don't (they're probably not buggy, but they
don't use the XArray's embedded lock).

Anyway, here's a first cut at converting the PID allocator from the IDR
to the XArray API.  It boots, but I haven't tried to do anything tricky
with PID namespaces or CRIU.


diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index f32878d9a39f..cec85a07184a 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -21,7 +21,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
 		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
 		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
 		nr_running(), nr_threads,
-		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
+		task_active_pid_ns(current)->cursor - 1);
 	return 0;
 }
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 07481bb87d4e..68aaaf78491b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,7 +9,7 @@
 #include <linux/threads.h>
 #include <linux/nsproxy.h>
 #include <linux/ns_common.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>
 
 /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
 #define MAX_PID_NS_LEVEL 32
@@ -17,8 +17,9 @@
 struct fs_pin;
 
 struct pid_namespace {
-	struct idr idr;
+	struct xarray xa;
 	struct rcu_head rcu;
+	unsigned int cursor;
 	unsigned int pid_allocated;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
@@ -37,6 +38,20 @@ extern struct pid_namespace init_pid_ns;
 
 #define PIDNS_ADDING (1U << 31)
 
+/*
+ * Note: disable interrupts while the xarray lock is held as an
+ * interrupt might come in and do read_lock(&tasklist_lock).
+ *
+ * If we don't disable interrupts there is a nasty deadlock between
+ * detach_pid()->free_pid() and another cpu that locks the PIDs
+ * followed by an interrupt routine that does read_lock(&tasklist_lock);
+ *
+ * After we clean up the tasklist_lock and know there are no
+ * irq handlers that take it we can leave the interrupts enabled.
+ * For now it is easier to be safe than to prove it can't happen.
+ */
+#define PID_XA_FLAGS	(XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
+
 #ifdef CONFIG_PID_NS
 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 {
@@ -84,7 +99,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
 void pidhash_init(void);
-void pid_idr_init(void);
+void pid_init(void);
 
 static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
 {
diff --git a/include/linux/threads.h b/include/linux/threads.h
index c34173e6c5f1..37e4391ee89f 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -38,7 +38,7 @@
  * Define a minimum number of pids per cpu.  Heuristically based
  * on original pid max of 32k for 32 cpus.  Also, increase the
  * minimum settable value for pid_max on the running system based
- * on similar defaults.  See kernel/pid.c:pid_idr_init() for details.
+ * on similar defaults.  See kernel/pid.c:pid_init() for details.
  */
 #define PIDS_PER_CPU_DEFAULT	1024
 #define PIDS_PER_CPU_MIN	8
diff --git a/init/main.c b/init/main.c
index 0ee39cdcfcac..3944dcd10c09 100644
--- a/init/main.c
+++ b/init/main.c
@@ -73,7 +73,6 @@
 #include <linux/sched.h>
 #include <linux/sched/init.h>
 #include <linux/signal.h>
-#include <linux/idr.h>
 #include <linux/kgdb.h>
 #include <linux/ftrace.h>
 #include <linux/async.h>
@@ -1100,7 +1099,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 		late_time_init();
 	sched_clock_init();
 	calibrate_delay();
-	pid_idr_init();
+	pid_init();
 	anon_vma_init();
 #ifdef CONFIG_X86
 	if (efi_enabled(EFI_RUNTIME_SERVICES))
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..de0b4614bdb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -41,7 +41,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -66,15 +66,10 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
-/*
- * PID-map pages start out as NULL, they get allocated upon
- * first use and are never deallocated. This way a low pid_max
- * value does not cause lots of bitmaps to be allocated, but
- * the scheme scales to up to 4 million PIDs, runtime.
- */
 struct pid_namespace init_pid_ns = {
 	.ns.count = REFCOUNT_INIT(2),
-	.idr = IDR_INIT(init_pid_ns.idr),
+	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
+	.cursor = 1,
 	.pid_allocated = PIDNS_ADDING,
 	.level = 0,
 	.child_reaper = &init_task,
@@ -86,22 +81,6 @@ struct pid_namespace init_pid_ns = {
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
-/*
- * Note: disable interrupts while the pidmap_lock is held as an
- * interrupt might come in and do read_lock(&tasklist_lock).
- *
- * If we don't disable interrupts there is a nasty deadlock between
- * detach_pid()->free_pid() and another cpu that does
- * spin_lock(&pidmap_lock) followed by an interrupt routine that does
- * read_lock(&tasklist_lock);
- *
- * After we clean up the tasklist_lock and know there are no
- * irq handlers that take it we can leave the interrupts enabled.
- * For now it is easier to be safe than to prove it can't happen.
- */
-
-static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
-
 void put_pid(struct pid *pid)
 {
 	struct pid_namespace *ns;
@@ -129,10 +108,11 @@ void free_pid(struct pid *pid)
 	int i;
 	unsigned long flags;
 
-	spin_lock_irqsave(&pidmap_lock, flags);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
 		struct pid_namespace *ns = upid->ns;
+
+		xa_lock_irqsave(&ns->xa, flags);
 		switch (--ns->pid_allocated) {
 		case 2:
 		case 1:
@@ -149,9 +129,9 @@ void free_pid(struct pid *pid)
 			break;
 		}
 
-		idr_remove(&ns->idr, upid->nr);
+		__xa_erase(&ns->xa, upid->nr);
+		xa_unlock_irqrestore(&ns->xa, flags);
 	}
-	spin_unlock_irqrestore(&pidmap_lock, flags);
 
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
@@ -161,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 {
 	struct pid *pid;
 	enum pid_type type;
-	int i, nr;
+	int i;
 	struct pid_namespace *tmp;
 	struct upid *upid;
 	int retval = -ENOMEM;
@@ -205,43 +185,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			set_tid_size--;
 		}
 
-		idr_preload(GFP_KERNEL);
-		spin_lock_irq(&pidmap_lock);
-
 		if (tid) {
-			nr = idr_alloc(&tmp->idr, NULL, tid,
-				       tid + 1, GFP_ATOMIC);
+			retval = xa_insert_irq(&tmp->xa, tid, NULL, GFP_KERNEL);
+
 			/*
-			 * If ENOSPC is returned it means that the PID is
+			 * If EBUSY is returned it means that the PID is
 			 * alreay in use. Return EEXIST in that case.
 			 */
-			if (nr == -ENOSPC)
-				nr = -EEXIST;
+			if (retval == -EBUSY)
+				retval = -EEXIST;
 		} else {
 			int pid_min = 1;
+
+			xa_lock_irq(&tmp->xa);
 			/*
 			 * init really needs pid 1, but after reaching the
 			 * maximum wrap back to RESERVED_PIDS
 			 */
-			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+			if (tmp->cursor > RESERVED_PIDS)
 				pid_min = RESERVED_PIDS;
 
 			/*
 			 * Store a null pointer so find_pid_ns does not find
 			 * a partially initialized PID (see below).
 			 */
-			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-					      pid_max, GFP_ATOMIC);
+			retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
+					XA_LIMIT(pid_min, pid_max),
+					&tmp->cursor, GFP_KERNEL);
+			xa_unlock_irq(&tmp->xa);
+			if (retval == -EBUSY)
+				retval = -EAGAIN;
 		}
-		spin_unlock_irq(&pidmap_lock);
-		idr_preload_end();
 
-		if (nr < 0) {
-			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+		if (retval < 0)
 			goto out_free;
-		}
 
-		pid->numbers[i].nr = nr;
+		pid->numbers[i].nr = tid;
 		pid->numbers[i].ns = tmp;
 		tmp = tmp->parent;
 	}
@@ -266,34 +245,35 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	INIT_HLIST_HEAD(&pid->inodes);
 
 	upid = pid->numbers + ns->level;
-	spin_lock_irq(&pidmap_lock);
+	xa_lock_irq(&ns->xa);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
-		idr_replace(&upid->ns->idr, pid, upid->nr);
+		if (upid->ns != ns)
+			xa_lock_irq(&upid->ns->xa);
+		__xa_store(&upid->ns->xa, upid->nr, pid, 0);
 		upid->ns->pid_allocated++;
+		xa_unlock_irq(&upid->ns->xa);
 	}
-	spin_unlock_irq(&pidmap_lock);
 
 	return pid;
 
 out_unlock:
-	spin_unlock_irq(&pidmap_lock);
+	xa_unlock_irq(&ns->xa);
 	put_pid_ns(ns);
 
 out_free:
-	spin_lock_irq(&pidmap_lock);
 	while (++i <= ns->level) {
 		upid = pid->numbers + i;
-		idr_remove(&upid->ns->idr, upid->nr);
+		xa_erase_irq(&upid->ns->xa, upid->nr);
 	}
 
+	xa_lock_irq(&ns->xa);
 	/* On failure to allocate the first pid, reset the state */
 	if (ns->pid_allocated == PIDNS_ADDING)
-		idr_set_cursor(&ns->idr, 0);
-
-	spin_unlock_irq(&pidmap_lock);
+		ns->cursor = 0;
+	xa_unlock_irq(&ns->xa);
 
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
@@ -301,14 +281,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 void disable_pid_allocation(struct pid_namespace *ns)
 {
-	spin_lock_irq(&pidmap_lock);
+	xa_lock_irq(&ns->xa);
 	ns->pid_allocated &= ~PIDNS_ADDING;
-	spin_unlock_irq(&pidmap_lock);
+	xa_unlock_irq(&ns->xa);
 }
 
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
 {
-	return idr_find(&ns->idr, nr);
+	return xa_load(&ns->xa, nr);
 }
 EXPORT_SYMBOL_GPL(find_pid_ns);
 
@@ -517,7 +497,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
  */
 struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 {
-	return idr_get_next(&ns->idr, &nr);
+	unsigned long index = nr;
+
+	return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
 }
 
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
@@ -646,7 +628,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	return fd;
 }
 
-void __init pid_idr_init(void)
+void __init pid_init(void)
 {
 	/* Verify no one has done anything silly: */
 	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
@@ -658,8 +640,6 @@ void __init pid_idr_init(void)
 				PIDS_PER_CPU_MIN * num_possible_cpus());
 	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
 
-	idr_init(&init_pid_ns.idr);
-
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..947e25fb8546 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -22,7 +22,7 @@
 #include <linux/export.h>
 #include <linux/sched/task.h>
 #include <linux/sched/signal.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>
 
 static DEFINE_MUTEX(pid_caches_mutex);
 static struct kmem_cache *pid_ns_cachep;
@@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	if (ns == NULL)
 		goto out_dec;
 
-	idr_init(&ns->idr);
+	xa_init_flags(&ns->xa, PID_XA_FLAGS);
 
 	ns->pid_cachep = create_pid_cachep(level);
 	if (ns->pid_cachep == NULL)
-		goto out_free_idr;
+		goto out_free_xa;
 
 	err = ns_alloc_inum(&ns->ns);
 	if (err)
-		goto out_free_idr;
+		goto out_free_xa;
 	ns->ns.ops = &pidns_operations;
 
 	refcount_set(&ns->ns.count, 1);
@@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 
 	return ns;
 
-out_free_idr:
-	idr_destroy(&ns->idr);
+out_free_xa:
+	xa_destroy(&ns->xa);
 	kmem_cache_free(pid_ns_cachep, ns);
 out_dec:
 	dec_pid_namespaces(ucounts);
@@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
 {
 	ns_free_inum(&ns->ns);
 
-	idr_destroy(&ns->idr);
+	xa_destroy(&ns->xa);
 	call_rcu(&ns->rcu, delayed_free_pidns);
 }
 
@@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);
 
 void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 {
-	int nr;
+	long nr;
 	int rc;
 	struct task_struct *task, *me = current;
 	int init_pids = thread_group_leader(me) ? 1 : 2;
@@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 */
 	rcu_read_lock();
 	read_lock(&tasklist_lock);
-	nr = 2;
-	idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
+	xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
 		task = pid_task(pid, PIDTYPE_PID);
 		if (task && !__fatal_signal_pending(task))
 			group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
@@ -272,12 +271,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	 * it should synchronize its usage with external means.
 	 */
 
-	next = idr_get_cursor(&pid_ns->idr) - 1;
+	next = pid_ns->cursor - 1;
 
 	tmp.data = &next;
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 	if (!ret && write)
-		idr_set_cursor(&pid_ns->idr, next + 1);
+		pid_ns->cursor = next + 1;
 
 	return ret;
 }
Christian Brauner June 28, 2022, 12:55 p.m. UTC | #7
On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > Interesting, thanks. I'll have to dig more into this to grok the current
> > state of the radix-tree interface vs. the underlying data structure. If
> > I follow correctly, you're saying the radix-tree api is essentially
> > already a translation layer to the xarray these days, and we just need
> > to move legacy users off the radix-tree api so we can eventually kill it
> > off...
> 
> If only it were that easy ... the XArray has a whole bunch of debugging
> asserts to make sure the users are actually using it correctly, and a
> lot of radix tree users don't (they're probably not buggy, but they
> don't use the XArray's embedded lock).
> 
> Anyway, here's a first cut at converting the PID allocator from the IDR
> to the XArray API.  It boots, but I haven't tried to do anything tricky
> with PID namespaces or CRIU.

It'd be great to see that conversion done.
Fwiw, there are test cases for e.g. nested pid namespace creation with
specifically requested PIDs in

tools/testing/selftests/clone3

and there are additional tests in

tools/testing/selftests/pidfd

> 
> 
> diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
> index f32878d9a39f..cec85a07184a 100644
> --- a/fs/proc/loadavg.c
> +++ b/fs/proc/loadavg.c
> @@ -21,7 +21,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
>  		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
>  		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
>  		nr_running(), nr_threads,
> -		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> +		task_active_pid_ns(current)->cursor - 1);
>  	return 0;
>  }
>  
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 07481bb87d4e..68aaaf78491b 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,7 +9,7 @@
>  #include <linux/threads.h>
>  #include <linux/nsproxy.h>
>  #include <linux/ns_common.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>  
>  /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
>  #define MAX_PID_NS_LEVEL 32
> @@ -17,8 +17,9 @@
>  struct fs_pin;
>  
>  struct pid_namespace {
> -	struct idr idr;
> +	struct xarray xa;
>  	struct rcu_head rcu;
> +	unsigned int cursor;
>  	unsigned int pid_allocated;
>  	struct task_struct *child_reaper;
>  	struct kmem_cache *pid_cachep;
> @@ -37,6 +38,20 @@ extern struct pid_namespace init_pid_ns;
>  
>  #define PIDNS_ADDING (1U << 31)
>  
> +/*
> + * Note: disable interrupts while the xarray lock is held as an
> + * interrupt might come in and do read_lock(&tasklist_lock).
> + *
> + * If we don't disable interrupts there is a nasty deadlock between
> + * detach_pid()->free_pid() and another cpu that locks the PIDs
> + * followed by an interrupt routine that does read_lock(&tasklist_lock);
> + *
> + * After we clean up the tasklist_lock and know there are no
> + * irq handlers that take it we can leave the interrupts enabled.
> + * For now it is easier to be safe than to prove it can't happen.
> + */
> +#define PID_XA_FLAGS	(XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
> +
>  #ifdef CONFIG_PID_NS
>  static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
>  {
> @@ -84,7 +99,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  
>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
>  void pidhash_init(void);
> -void pid_idr_init(void);
> +void pid_init(void);
>  
>  static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
>  {
> diff --git a/include/linux/threads.h b/include/linux/threads.h
> index c34173e6c5f1..37e4391ee89f 100644
> --- a/include/linux/threads.h
> +++ b/include/linux/threads.h
> @@ -38,7 +38,7 @@
>   * Define a minimum number of pids per cpu.  Heuristically based
>   * on original pid max of 32k for 32 cpus.  Also, increase the
>   * minimum settable value for pid_max on the running system based
> - * on similar defaults.  See kernel/pid.c:pid_idr_init() for details.
> + * on similar defaults.  See kernel/pid.c:pid_init() for details.
>   */
>  #define PIDS_PER_CPU_DEFAULT	1024
>  #define PIDS_PER_CPU_MIN	8
> diff --git a/init/main.c b/init/main.c
> index 0ee39cdcfcac..3944dcd10c09 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -73,7 +73,6 @@
>  #include <linux/sched.h>
>  #include <linux/sched/init.h>
>  #include <linux/signal.h>
> -#include <linux/idr.h>
>  #include <linux/kgdb.h>
>  #include <linux/ftrace.h>
>  #include <linux/async.h>
> @@ -1100,7 +1099,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
>  		late_time_init();
>  	sched_clock_init();
>  	calibrate_delay();
> -	pid_idr_init();
> +	pid_init();
>  	anon_vma_init();
>  #ifdef CONFIG_X86
>  	if (efi_enabled(EFI_RUNTIME_SERVICES))
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2fc0a16ec77b..de0b4614bdb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -41,7 +41,7 @@
>  #include <linux/anon_inodes.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/task.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>  #include <net/sock.h>
>  #include <uapi/linux/pidfd.h>
>  
> @@ -66,15 +66,10 @@ int pid_max = PID_MAX_DEFAULT;
>  int pid_max_min = RESERVED_PIDS + 1;
>  int pid_max_max = PID_MAX_LIMIT;
>  
> -/*
> - * PID-map pages start out as NULL, they get allocated upon
> - * first use and are never deallocated. This way a low pid_max
> - * value does not cause lots of bitmaps to be allocated, but
> - * the scheme scales to up to 4 million PIDs, runtime.
> - */
>  struct pid_namespace init_pid_ns = {
>  	.ns.count = REFCOUNT_INIT(2),
> -	.idr = IDR_INIT(init_pid_ns.idr),
> +	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> +	.cursor = 1,
>  	.pid_allocated = PIDNS_ADDING,
>  	.level = 0,
>  	.child_reaper = &init_task,
> @@ -86,22 +81,6 @@ struct pid_namespace init_pid_ns = {
>  };
>  EXPORT_SYMBOL_GPL(init_pid_ns);
>  
> -/*
> - * Note: disable interrupts while the pidmap_lock is held as an
> - * interrupt might come in and do read_lock(&tasklist_lock).
> - *
> - * If we don't disable interrupts there is a nasty deadlock between
> - * detach_pid()->free_pid() and another cpu that does
> - * spin_lock(&pidmap_lock) followed by an interrupt routine that does
> - * read_lock(&tasklist_lock);
> - *
> - * After we clean up the tasklist_lock and know there are no
> - * irq handlers that take it we can leave the interrupts enabled.
> - * For now it is easier to be safe than to prove it can't happen.
> - */
> -
> -static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> -
>  void put_pid(struct pid *pid)
>  {
>  	struct pid_namespace *ns;
> @@ -129,10 +108,11 @@ void free_pid(struct pid *pid)
>  	int i;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&pidmap_lock, flags);
>  	for (i = 0; i <= pid->level; i++) {
>  		struct upid *upid = pid->numbers + i;
>  		struct pid_namespace *ns = upid->ns;
> +
> +		xa_lock_irqsave(&ns->xa, flags);
>  		switch (--ns->pid_allocated) {
>  		case 2:
>  		case 1:
> @@ -149,9 +129,9 @@ void free_pid(struct pid *pid)
>  			break;
>  		}
>  
> -		idr_remove(&ns->idr, upid->nr);
> +		__xa_erase(&ns->xa, upid->nr);
> +		xa_unlock_irqrestore(&ns->xa, flags);
>  	}
> -	spin_unlock_irqrestore(&pidmap_lock, flags);
>  
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> @@ -161,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> -	int i, nr;
> +	int i;
>  	struct pid_namespace *tmp;
>  	struct upid *upid;
>  	int retval = -ENOMEM;
> @@ -205,43 +185,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			set_tid_size--;
>  		}
>  
> -		idr_preload(GFP_KERNEL);
> -		spin_lock_irq(&pidmap_lock);
> -
>  		if (tid) {
> -			nr = idr_alloc(&tmp->idr, NULL, tid,
> -				       tid + 1, GFP_ATOMIC);
> +			retval = xa_insert_irq(&tmp->xa, tid, NULL, GFP_KERNEL);
> +
>  			/*
> -			 * If ENOSPC is returned it means that the PID is
> +			 * If EBUSY is returned it means that the PID is
>  			 * alreay in use. Return EEXIST in that case.
>  			 */
> -			if (nr == -ENOSPC)
> -				nr = -EEXIST;
> +			if (retval == -EBUSY)
> +				retval = -EEXIST;
>  		} else {
>  			int pid_min = 1;
> +
> +			xa_lock_irq(&tmp->xa);
>  			/*
>  			 * init really needs pid 1, but after reaching the
>  			 * maximum wrap back to RESERVED_PIDS
>  			 */
> -			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> +			if (tmp->cursor > RESERVED_PIDS)
>  				pid_min = RESERVED_PIDS;
>  
>  			/*
>  			 * Store a null pointer so find_pid_ns does not find
>  			 * a partially initialized PID (see below).
>  			 */
> -			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> -					      pid_max, GFP_ATOMIC);
> +			retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
> +					XA_LIMIT(pid_min, pid_max),
> +					&tmp->cursor, GFP_KERNEL);
> +			xa_unlock_irq(&tmp->xa);
> +			if (retval == -EBUSY)
> +				retval = -EAGAIN;
>  		}
> -		spin_unlock_irq(&pidmap_lock);
> -		idr_preload_end();
>  
> -		if (nr < 0) {
> -			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
> +		if (retval < 0)
>  			goto out_free;
> -		}
>  
> -		pid->numbers[i].nr = nr;
> +		pid->numbers[i].nr = tid;
>  		pid->numbers[i].ns = tmp;
>  		tmp = tmp->parent;
>  	}
> @@ -266,34 +245,35 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  	INIT_HLIST_HEAD(&pid->inodes);
>  
>  	upid = pid->numbers + ns->level;
> -	spin_lock_irq(&pidmap_lock);
> +	xa_lock_irq(&ns->xa);
>  	if (!(ns->pid_allocated & PIDNS_ADDING))
>  		goto out_unlock;
>  	for ( ; upid >= pid->numbers; --upid) {
>  		/* Make the PID visible to find_pid_ns. */
> -		idr_replace(&upid->ns->idr, pid, upid->nr);
> +		if (upid->ns != ns)
> +			xa_lock_irq(&ns->xa);
> +		__xa_store(&upid->ns->xa, upid->nr, pid, 0);
>  		upid->ns->pid_allocated++;
> +		xa_unlock_irq(&ns->xa);
>  	}
> -	spin_unlock_irq(&pidmap_lock);
>  
>  	return pid;
>  
>  out_unlock:
> -	spin_unlock_irq(&pidmap_lock);
> +	xa_unlock_irq(&ns->xa);
>  	put_pid_ns(ns);
>  
>  out_free:
> -	spin_lock_irq(&pidmap_lock);
>  	while (++i <= ns->level) {
>  		upid = pid->numbers + i;
> -		idr_remove(&upid->ns->idr, upid->nr);
> +		xa_erase_irq(&upid->ns->xa, upid->nr);
>  	}
>  
> +	xa_lock_irq(&ns->xa);
>  	/* On failure to allocate the first pid, reset the state */
>  	if (ns->pid_allocated == PIDNS_ADDING)
> -		idr_set_cursor(&ns->idr, 0);
> -
> -	spin_unlock_irq(&pidmap_lock);
> +		ns->cursor = 0;
> +	xa_unlock_irq(&ns->xa);
>  
>  	kmem_cache_free(ns->pid_cachep, pid);
>  	return ERR_PTR(retval);
> @@ -301,14 +281,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  
>  void disable_pid_allocation(struct pid_namespace *ns)
>  {
> -	spin_lock_irq(&pidmap_lock);
> +	xa_lock_irq(&ns->xa);
>  	ns->pid_allocated &= ~PIDNS_ADDING;
> -	spin_unlock_irq(&pidmap_lock);
> +	xa_unlock_irq(&ns->xa);
>  }
>  
>  struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
>  {
> -	return idr_find(&ns->idr, nr);
> +	return xa_load(&ns->xa, nr);
>  }
>  EXPORT_SYMBOL_GPL(find_pid_ns);
>  
> @@ -517,7 +497,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
>   */
>  struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>  {
> -	return idr_get_next(&ns->idr, &nr);
> +	unsigned long index = nr;
> +
> +	return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
>  }
>  
>  struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
> @@ -646,7 +628,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
>  	return fd;
>  }
>  
> -void __init pid_idr_init(void)
> +void __init pid_init(void)
>  {
>  	/* Verify no one has done anything silly: */
>  	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
> @@ -658,8 +640,6 @@ void __init pid_idr_init(void)
>  				PIDS_PER_CPU_MIN * num_possible_cpus());
>  	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>  
> -	idr_init(&init_pid_ns.idr);
> -
>  	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
>  			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..947e25fb8546 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -22,7 +22,7 @@
>  #include <linux/export.h>
>  #include <linux/sched/task.h>
>  #include <linux/sched/signal.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>  
>  static DEFINE_MUTEX(pid_caches_mutex);
>  static struct kmem_cache *pid_ns_cachep;
> @@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	if (ns == NULL)
>  		goto out_dec;
>  
> -	idr_init(&ns->idr);
> +	xa_init_flags(&ns->xa, PID_XA_FLAGS);
>  
>  	ns->pid_cachep = create_pid_cachep(level);
>  	if (ns->pid_cachep == NULL)
> -		goto out_free_idr;
> +		goto out_free_xa;
>  
>  	err = ns_alloc_inum(&ns->ns);
>  	if (err)
> -		goto out_free_idr;
> +		goto out_free_xa;
>  	ns->ns.ops = &pidns_operations;
>  
>  	refcount_set(&ns->ns.count, 1);
> @@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  
>  	return ns;
>  
> -out_free_idr:
> -	idr_destroy(&ns->idr);
> +out_free_xa:
> +	xa_destroy(&ns->xa);
>  	kmem_cache_free(pid_ns_cachep, ns);
>  out_dec:
>  	dec_pid_namespaces(ucounts);
> @@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
>  {
>  	ns_free_inum(&ns->ns);
>  
> -	idr_destroy(&ns->idr);
> +	xa_destroy(&ns->xa);
>  	call_rcu(&ns->rcu, delayed_free_pidns);
>  }
>  
> @@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);
>  
>  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  {
> -	int nr;
> +	long nr;
>  	int rc;
>  	struct task_struct *task, *me = current;
>  	int init_pids = thread_group_leader(me) ? 1 : 2;
> @@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 */
>  	rcu_read_lock();
>  	read_lock(&tasklist_lock);
> -	nr = 2;
> -	idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> +	xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
>  		task = pid_task(pid, PIDTYPE_PID);
>  		if (task && !__fatal_signal_pending(task))
>  			group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> @@ -272,12 +271,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  	 * it should synchronize its usage with external means.
>  	 */
>  
> -	next = idr_get_cursor(&pid_ns->idr) - 1;
> +	next = pid_ns->cursor - 1;
>  
>  	tmp.data = &next;
>  	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>  	if (!ret && write)
> -		idr_set_cursor(&pid_ns->idr, next + 1);
> +		pid_ns->cursor = next + 1;
>  
>  	return ret;
>  }
Brian Foster June 28, 2022, 2:53 p.m. UTC | #8
On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > state of the radix-tree interface vs. the underlying data structure. If
> > > I follow correctly, you're saying the radix-tree api is essentially
> > > already a translation layer to the xarray these days, and we just need
> > > to move legacy users off the radix-tree api so we can eventually kill it
> > > off...
> > 
> > If only it were that easy ... the XArray has a whole bunch of debugging
> > asserts to make sure the users are actually using it correctly, and a
> > lot of radix tree users don't (they're probably not buggy, but they
> > don't use the XArray's embedded lock).
> > 
> > Anyway, here's a first cut at converting the PID allocator from the IDR
> > to the XArray API.  It boots, but I haven't tried to do anything tricky
> > with PID namespaces or CRIU.
> 
> It'd be great to see that conversion done.
> Fwiw, there's test cases for e.g. nested pid namespace creation with
> specifically requested PIDs in
> 

Ok, but I'm a little confused. Why open code the xarray usage as opposed
to work the idr bits closer to being able to use the xarray api (and/or
work the xarray to better support the idr use case)? I see 150+ callers
of idr_init(). Is the goal to eventually open code them all? That seems
like a lot of potential api churn for something that is presumably a generic
interface (and perhaps inconsistent with ida, which looks like it uses
xarray directly?), but I'm probably missing details.

If the issue is open-coded locking across all the idr users conflicting
with internal xarray debug bits, I guess what I'm wondering is why not
implement your patch more generically within idr (i.e. expose some
locking apis, etc.)? Even if it meant creating something like a
temporary init_idr_xa() variant that users can switch over to as they're
audited for expected behavior, I don't quite grok why that couldn't be
made to work if changing this code over directly does and the various
core radix tree data structures idr uses are already #defined to xarray
variants. Hm?

Brian

> tools/selftests/clone3
> 
> and there's additional tests in
> 
> tools/selftests/pidfd
> 
> > 
> > 
> > diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
> > index f32878d9a39f..cec85a07184a 100644
> > --- a/fs/proc/loadavg.c
> > +++ b/fs/proc/loadavg.c
> > @@ -21,7 +21,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
> >  		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> >  		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> >  		nr_running(), nr_threads,
> > -		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> > +		task_active_pid_ns(current)->cursor - 1);
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> > index 07481bb87d4e..68aaaf78491b 100644
> > --- a/include/linux/pid_namespace.h
> > +++ b/include/linux/pid_namespace.h
> > @@ -9,7 +9,7 @@
> >  #include <linux/threads.h>
> >  #include <linux/nsproxy.h>
> >  #include <linux/ns_common.h>
> > -#include <linux/idr.h>
> > +#include <linux/xarray.h>
> >  
> >  /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
> >  #define MAX_PID_NS_LEVEL 32
> > @@ -17,8 +17,9 @@
> >  struct fs_pin;
> >  
> >  struct pid_namespace {
> > -	struct idr idr;
> > +	struct xarray xa;
> >  	struct rcu_head rcu;
> > +	unsigned int cursor;
> >  	unsigned int pid_allocated;
> >  	struct task_struct *child_reaper;
> >  	struct kmem_cache *pid_cachep;
> > @@ -37,6 +38,20 @@ extern struct pid_namespace init_pid_ns;
> >  
> >  #define PIDNS_ADDING (1U << 31)
> >  
> > +/*
> > + * Note: disable interrupts while the xarray lock is held as an
> > + * interrupt might come in and do read_lock(&tasklist_lock).
> > + *
> > + * If we don't disable interrupts there is a nasty deadlock between
> > + * detach_pid()->free_pid() and another cpu that locks the PIDs
> > + * followed by an interrupt routine that does read_lock(&tasklist_lock);
> > + *
> > + * After we clean up the tasklist_lock and know there are no
> > + * irq handlers that take it we can leave the interrupts enabled.
> > + * For now it is easier to be safe than to prove it can't happen.
> > + */
> > +#define PID_XA_FLAGS	(XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
> > +
> >  #ifdef CONFIG_PID_NS
> >  static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
> >  {
> > @@ -84,7 +99,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> >  
> >  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> >  void pidhash_init(void);
> > -void pid_idr_init(void);
> > +void pid_init(void);
> >  
> >  static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
> >  {
> > diff --git a/include/linux/threads.h b/include/linux/threads.h
> > index c34173e6c5f1..37e4391ee89f 100644
> > --- a/include/linux/threads.h
> > +++ b/include/linux/threads.h
> > @@ -38,7 +38,7 @@
> >   * Define a minimum number of pids per cpu.  Heuristically based
> >   * on original pid max of 32k for 32 cpus.  Also, increase the
> >   * minimum settable value for pid_max on the running system based
> > - * on similar defaults.  See kernel/pid.c:pid_idr_init() for details.
> > + * on similar defaults.  See kernel/pid.c:pid_init() for details.
> >   */
> >  #define PIDS_PER_CPU_DEFAULT	1024
> >  #define PIDS_PER_CPU_MIN	8
> > diff --git a/init/main.c b/init/main.c
> > index 0ee39cdcfcac..3944dcd10c09 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -73,7 +73,6 @@
> >  #include <linux/sched.h>
> >  #include <linux/sched/init.h>
> >  #include <linux/signal.h>
> > -#include <linux/idr.h>
> >  #include <linux/kgdb.h>
> >  #include <linux/ftrace.h>
> >  #include <linux/async.h>
> > @@ -1100,7 +1099,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> >  		late_time_init();
> >  	sched_clock_init();
> >  	calibrate_delay();
> > -	pid_idr_init();
> > +	pid_init();
> >  	anon_vma_init();
> >  #ifdef CONFIG_X86
> >  	if (efi_enabled(EFI_RUNTIME_SERVICES))
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 2fc0a16ec77b..de0b4614bdb8 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -41,7 +41,7 @@
> >  #include <linux/anon_inodes.h>
> >  #include <linux/sched/signal.h>
> >  #include <linux/sched/task.h>
> > -#include <linux/idr.h>
> > +#include <linux/xarray.h>
> >  #include <net/sock.h>
> >  #include <uapi/linux/pidfd.h>
> >  
> > @@ -66,15 +66,10 @@ int pid_max = PID_MAX_DEFAULT;
> >  int pid_max_min = RESERVED_PIDS + 1;
> >  int pid_max_max = PID_MAX_LIMIT;
> >  
> > -/*
> > - * PID-map pages start out as NULL, they get allocated upon
> > - * first use and are never deallocated. This way a low pid_max
> > - * value does not cause lots of bitmaps to be allocated, but
> > - * the scheme scales to up to 4 million PIDs, runtime.
> > - */
> >  struct pid_namespace init_pid_ns = {
> >  	.ns.count = REFCOUNT_INIT(2),
> > -	.idr = IDR_INIT(init_pid_ns.idr),
> > +	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> > +	.cursor = 1,
> >  	.pid_allocated = PIDNS_ADDING,
> >  	.level = 0,
> >  	.child_reaper = &init_task,
> > @@ -86,22 +81,6 @@ struct pid_namespace init_pid_ns = {
> >  };
> >  EXPORT_SYMBOL_GPL(init_pid_ns);
> >  
> > -/*
> > - * Note: disable interrupts while the pidmap_lock is held as an
> > - * interrupt might come in and do read_lock(&tasklist_lock).
> > - *
> > - * If we don't disable interrupts there is a nasty deadlock between
> > - * detach_pid()->free_pid() and another cpu that does
> > - * spin_lock(&pidmap_lock) followed by an interrupt routine that does
> > - * read_lock(&tasklist_lock);
> > - *
> > - * After we clean up the tasklist_lock and know there are no
> > - * irq handlers that take it we can leave the interrupts enabled.
> > - * For now it is easier to be safe than to prove it can't happen.
> > - */
> > -
> > -static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> > -
> >  void put_pid(struct pid *pid)
> >  {
> >  	struct pid_namespace *ns;
> > @@ -129,10 +108,11 @@ void free_pid(struct pid *pid)
> >  	int i;
> >  	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&pidmap_lock, flags);
> >  	for (i = 0; i <= pid->level; i++) {
> >  		struct upid *upid = pid->numbers + i;
> >  		struct pid_namespace *ns = upid->ns;
> > +
> > +		xa_lock_irqsave(&ns->xa, flags);
> >  		switch (--ns->pid_allocated) {
> >  		case 2:
> >  		case 1:
> > @@ -149,9 +129,9 @@ void free_pid(struct pid *pid)
> >  			break;
> >  		}
> >  
> > -		idr_remove(&ns->idr, upid->nr);
> > +		__xa_erase(&ns->xa, upid->nr);
> > +		xa_unlock_irqrestore(&ns->xa, flags);
> >  	}
> > -	spin_unlock_irqrestore(&pidmap_lock, flags);
> >  
> >  	call_rcu(&pid->rcu, delayed_put_pid);
> >  }
> > @@ -161,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >  {
> >  	struct pid *pid;
> >  	enum pid_type type;
> > -	int i, nr;
> > +	int i;
> >  	struct pid_namespace *tmp;
> >  	struct upid *upid;
> >  	int retval = -ENOMEM;
> > @@ -205,43 +185,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >  			set_tid_size--;
> >  		}
> >  
> > -		idr_preload(GFP_KERNEL);
> > -		spin_lock_irq(&pidmap_lock);
> > -
> >  		if (tid) {
> > -			nr = idr_alloc(&tmp->idr, NULL, tid,
> > -				       tid + 1, GFP_ATOMIC);
> > +			retval = xa_insert_irq(&tmp->xa, tid, NULL, GFP_KERNEL);
> > +
> >  			/*
> > -			 * If ENOSPC is returned it means that the PID is
> > +			 * If EBUSY is returned it means that the PID is
> >  			 * already in use. Return EEXIST in that case.
> >  			 */
> > -			if (nr == -ENOSPC)
> > -				nr = -EEXIST;
> > +			if (retval == -EBUSY)
> > +				retval = -EEXIST;
> >  		} else {
> >  			int pid_min = 1;
> > +
> > +			xa_lock_irq(&tmp->xa);
> >  			/*
> >  			 * init really needs pid 1, but after reaching the
> >  			 * maximum wrap back to RESERVED_PIDS
> >  			 */
> > -			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > +			if (tmp->cursor > RESERVED_PIDS)
> >  				pid_min = RESERVED_PIDS;
> >  
> >  			/*
> >  			 * Store a null pointer so find_pid_ns does not find
> >  			 * a partially initialized PID (see below).
> >  			 */
> > -			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> > -					      pid_max, GFP_ATOMIC);
> > +			retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
> > +					XA_LIMIT(pid_min, pid_max),
> > +					&tmp->cursor, GFP_KERNEL);
> > +			xa_unlock_irq(&tmp->xa);
> > +			if (retval == -EBUSY)
> > +				retval = -EAGAIN;
> >  		}
> > -		spin_unlock_irq(&pidmap_lock);
> > -		idr_preload_end();
> >  
> > -		if (nr < 0) {
> > -			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
> > +		if (retval < 0)
> >  			goto out_free;
> > -		}
> >  
> > -		pid->numbers[i].nr = nr;
> > +		pid->numbers[i].nr = tid;
> >  		pid->numbers[i].ns = tmp;
> >  		tmp = tmp->parent;
> >  	}
> > @@ -266,34 +245,35 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >  	INIT_HLIST_HEAD(&pid->inodes);
> >  
> >  	upid = pid->numbers + ns->level;
> > -	spin_lock_irq(&pidmap_lock);
> > +	xa_lock_irq(&ns->xa);
> >  	if (!(ns->pid_allocated & PIDNS_ADDING))
> >  		goto out_unlock;
> >  	for ( ; upid >= pid->numbers; --upid) {
> >  		/* Make the PID visible to find_pid_ns. */
> > -		idr_replace(&upid->ns->idr, pid, upid->nr);
> > +		if (upid->ns != ns)
> > +			xa_lock_irq(&ns->xa);
> > +		__xa_store(&upid->ns->xa, upid->nr, pid, 0);
> >  		upid->ns->pid_allocated++;
> > +		xa_unlock_irq(&ns->xa);
> >  	}
> > -	spin_unlock_irq(&pidmap_lock);
> >  
> >  	return pid;
> >  
> >  out_unlock:
> > -	spin_unlock_irq(&pidmap_lock);
> > +	xa_unlock_irq(&ns->xa);
> >  	put_pid_ns(ns);
> >  
> >  out_free:
> > -	spin_lock_irq(&pidmap_lock);
> >  	while (++i <= ns->level) {
> >  		upid = pid->numbers + i;
> > -		idr_remove(&upid->ns->idr, upid->nr);
> > +		xa_erase_irq(&upid->ns->xa, upid->nr);
> >  	}
> >  
> > +	xa_lock_irq(&ns->xa);
> >  	/* On failure to allocate the first pid, reset the state */
> >  	if (ns->pid_allocated == PIDNS_ADDING)
> > -		idr_set_cursor(&ns->idr, 0);
> > -
> > -	spin_unlock_irq(&pidmap_lock);
> > +		ns->cursor = 0;
> > +	xa_unlock_irq(&ns->xa);
> >  
> >  	kmem_cache_free(ns->pid_cachep, pid);
> >  	return ERR_PTR(retval);
> > @@ -301,14 +281,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >  
> >  void disable_pid_allocation(struct pid_namespace *ns)
> >  {
> > -	spin_lock_irq(&pidmap_lock);
> > +	xa_lock_irq(&ns->xa);
> >  	ns->pid_allocated &= ~PIDNS_ADDING;
> > -	spin_unlock_irq(&pidmap_lock);
> > +	xa_unlock_irq(&ns->xa);
> >  }
> >  
> >  struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
> >  {
> > -	return idr_find(&ns->idr, nr);
> > +	return xa_load(&ns->xa, nr);
> >  }
> >  EXPORT_SYMBOL_GPL(find_pid_ns);
> >  
> > @@ -517,7 +497,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
> >   */
> >  struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >  {
> > -	return idr_get_next(&ns->idr, &nr);
> > +	unsigned long index = nr;
> > +
> > +	return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
> >  }
> >  
> >  struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
> > @@ -646,7 +628,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> >  	return fd;
> >  }
> >  
> > -void __init pid_idr_init(void)
> > +void __init pid_init(void)
> >  {
> >  	/* Verify no one has done anything silly: */
> >  	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
> > @@ -658,8 +640,6 @@ void __init pid_idr_init(void)
> >  				PIDS_PER_CPU_MIN * num_possible_cpus());
> >  	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
> >  
> > -	idr_init(&init_pid_ns.idr);
> > -
> >  	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
> >  			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
> >  }
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index f4f8cb0435b4..947e25fb8546 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -22,7 +22,7 @@
> >  #include <linux/export.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/sched/signal.h>
> > -#include <linux/idr.h>
> > +#include <linux/xarray.h>
> >  
> >  static DEFINE_MUTEX(pid_caches_mutex);
> >  static struct kmem_cache *pid_ns_cachep;
> > @@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> >  	if (ns == NULL)
> >  		goto out_dec;
> >  
> > -	idr_init(&ns->idr);
> > +	xa_init_flags(&ns->xa, PID_XA_FLAGS);
> >  
> >  	ns->pid_cachep = create_pid_cachep(level);
> >  	if (ns->pid_cachep == NULL)
> > -		goto out_free_idr;
> > +		goto out_free_xa;
> >  
> >  	err = ns_alloc_inum(&ns->ns);
> >  	if (err)
> > -		goto out_free_idr;
> > +		goto out_free_xa;
> >  	ns->ns.ops = &pidns_operations;
> >  
> >  	refcount_set(&ns->ns.count, 1);
> > @@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> >  
> >  	return ns;
> >  
> > -out_free_idr:
> > -	idr_destroy(&ns->idr);
> > +out_free_xa:
> > +	xa_destroy(&ns->xa);
> >  	kmem_cache_free(pid_ns_cachep, ns);
> >  out_dec:
> >  	dec_pid_namespaces(ucounts);
> > @@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
> >  {
> >  	ns_free_inum(&ns->ns);
> >  
> > -	idr_destroy(&ns->idr);
> > +	xa_destroy(&ns->xa);
> >  	call_rcu(&ns->rcu, delayed_free_pidns);
> >  }
> >  
> > @@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);
> >  
> >  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >  {
> > -	int nr;
> > +	long nr;
> >  	int rc;
> >  	struct task_struct *task, *me = current;
> >  	int init_pids = thread_group_leader(me) ? 1 : 2;
> > @@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >  	 */
> >  	rcu_read_lock();
> >  	read_lock(&tasklist_lock);
> > -	nr = 2;
> > -	idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> > +	xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
> >  		task = pid_task(pid, PIDTYPE_PID);
> >  		if (task && !__fatal_signal_pending(task))
> >  			group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> > @@ -272,12 +271,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >  	 * it should synchronize its usage with external means.
> >  	 */
> >  
> > -	next = idr_get_cursor(&pid_ns->idr) - 1;
> > +	next = pid_ns->cursor - 1;
> >  
> >  	tmp.data = &next;
> >  	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> >  	if (!ret && write)
> > -		idr_set_cursor(&pid_ns->idr, next + 1);
> > +		pid_ns->cursor = next + 1;
> >  
> >  	return ret;
> >  }
>
Brian Foster June 29, 2022, 7:13 p.m. UTC | #9
On Tue, Jun 28, 2022 at 10:53:50AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> > On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > > state of the radix-tree interface vs. the underlying data structure. If
> > > > I follow correctly, you're saying the radix-tree api is essentially
> > > > already a translation layer to the xarray these days, and we just need
> > > > to move legacy users off the radix-tree api so we can eventually kill it
> > > > off...
> > > 
> > > If only it were that easy ... the XArray has a whole bunch of debugging
> > > asserts to make sure the users are actually using it correctly, and a
> > > lot of radix tree users don't (they're probably not buggy, but they
> > > don't use the XArray's embedded lock).
> > > 
> > > Anyway, here's a first cut at converting the PID allocator from the IDR
> > > to the XArray API.  It boots, but I haven't tried to do anything tricky
> > > with PID namespaces or CRIU.
> > 
> > It'd be great to see that conversion done.
> > Fwiw, there's test cases for e.g. nested pid namespace creation with
> > specifically requested PIDs in
> > 
> 
> Ok, but I'm a little confused. Why open code the xarray usage as opposed
> to work the idr bits closer to being able to use the xarray api (and/or
> work the xarray to better support the idr use case)? I see 150+ callers
> of idr_init(). Is the goal to eventually open code them all? That seems
> like a lot of potential api churn for something that is presumably a generic
> interface (and perhaps inconsistent with ida, which looks like it uses
> xarray directly?), but I'm probably missing details.
> 
> If the issue is open-coded locking across all the idr users conflicting
> with internal xarray debug bits, I guess what I'm wondering is why not
> implement your patch more generically within idr (i.e. expose some
> locking apis, etc.)? Even if it meant creating something like a
> temporary init_idr_xa() variant that users can switch over to as they're
> audited for expected behavior, I don't quite grok why that couldn't be
> made to work if changing this code over directly does and the various
> core radix tree data structures idr uses are already #defined to xarray
> variants. Hm?
> 

Using Willy's patch as a reference, here's a hacked up example of what I
was thinking (squashed to a single diff and based on top of my pending
idr tag patches). Obviously this needs more work and thought. I skipped
the locking change to start, so this will nest the internal xarray lock
inside the pidmap lock. I'm assuming doing otherwise might cause xarray
problems, based on earlier comments?  It does boot and doesn't
immediately explode, however; the tag -> mark bits seem to work as
expected, etc.

I am a little curious how pervasive the aforementioned locking issue is
here. Are we talking about the lockdep_is_held() assertions in xarray.h?
If so, could we get away with either disabling those for idr users (via
some new flag) or perhaps allowing idr users to pass along a lockdep_map
associated with an external lock that the xarray can feed into
lock_is_held()?

Brian

--- 8< ---

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 5b62dfa4a031..f7dd0e8d8ac2 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -17,28 +17,31 @@
 #include <linux/percpu.h>
 
 struct idr {
-	struct radix_tree_root	idr_rt;
+	struct xarray		idr_rt;
 	unsigned int		idr_base;
 	unsigned int		idr_next;
+	bool			idr_xa;
 };
 
 /*
- * The IDR API does not expose the tagging functionality of the radix tree
- * to users.  Use tag 0 to track whether a node has free space below it.
+ * The IDR API does not expose the tagging functionality of the tree to users.
+ * Use tag 0 to track whether a node has free space below it.
  */
 #define IDR_FREE	0
 #define IDR_TAG		1
 
 /* Set the IDR flag and the IDR_FREE tag */
-#define IDR_RT_MARKER	(ROOT_IS_IDR | (__force gfp_t)			\
-					(1 << (ROOT_TAG_SHIFT + IDR_FREE)))
+#define IDR_RT_MARKER	(ROOT_IS_IDR | XA_FLAGS_MARK(IDR_FREE))
 
-#define IDR_INIT_BASE(name, base) {					\
-	.idr_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER),			\
+#define __IDR_INIT(name, base, flags, is_xa) {				\
+	.idr_rt = XARRAY_INIT(name, flags),				\
 	.idr_base = (base),						\
 	.idr_next = 0,							\
+	.idr_xa = is_xa,						\
 }
 
+#define IDR_INIT_BASE(name, base) __IDR_INIT(name, base, IDR_RT_MARKER, false)
+
 /**
  * IDR_INIT() - Initialise an IDR.
  * @name: Name of IDR.
@@ -111,6 +114,7 @@ static inline void idr_set_cursor(struct idr *idr, unsigned int val)
 				xa_unlock_irqrestore(&(idr)->idr_rt, flags)
 
 void idr_preload(gfp_t gfp_mask);
+void idr_preload_end(void);
 
 int idr_alloc(struct idr *, void *ptr, int start, int end, gfp_t);
 int __must_check idr_alloc_u32(struct idr *, void *ptr, u32 *id,
@@ -123,6 +127,9 @@ int idr_for_each(const struct idr *,
 void *idr_get_next(struct idr *, int *nextid);
 void *idr_get_next_ul(struct idr *, unsigned long *nextid);
 void *idr_replace(struct idr *, void *, unsigned long id);
+void idr_set_tag(struct idr *idr, unsigned long id);
+bool idr_get_tag(struct idr *idr, unsigned long id);
+void *idr_get_next_tag(struct idr *idr, unsigned long id);
 void idr_destroy(struct idr *);
 
 /**
@@ -133,11 +140,17 @@ void idr_destroy(struct idr *);
  * This variation of idr_init() creates an IDR which will allocate IDs
  * starting at %base.
  */
-static inline void idr_init_base(struct idr *idr, int base)
+static inline void __idr_init(struct idr *idr, int base, gfp_t flags, bool is_xa)
 {
-	INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
+	xa_init_flags(&idr->idr_rt, flags);
 	idr->idr_base = base;
 	idr->idr_next = 0;
+	idr->idr_xa = is_xa;
+}
+
+static inline void idr_init_base(struct idr *idr, int base)
+{
+	__idr_init(idr, base, IDR_RT_MARKER, false);
 }
 
 /**
@@ -160,43 +173,8 @@ static inline void idr_init(struct idr *idr)
  */
 static inline bool idr_is_empty(const struct idr *idr)
 {
-	return radix_tree_empty(&idr->idr_rt) &&
-		radix_tree_tagged(&idr->idr_rt, IDR_FREE);
-}
-
-/**
- * idr_preload_end - end preload section started with idr_preload()
- *
- * Each idr_preload() should be matched with an invocation of this
- * function.  See idr_preload() for details.
- */
-static inline void idr_preload_end(void)
-{
-	local_unlock(&radix_tree_preloads.lock);
-}
-
-static inline void idr_set_tag(struct idr *idr, unsigned long id)
-{
-	radix_tree_tag_set(&idr->idr_rt, id, IDR_TAG);
-}
-
-static inline bool idr_get_tag(struct idr *idr, unsigned long id)
-{
-	return radix_tree_tag_get(&idr->idr_rt, id, IDR_TAG);
-}
-
-/*
- * Find the next id with the internal tag set.
- */
-static inline void *idr_get_next_tag(struct idr *idr, unsigned long id)
-{
-	unsigned int ret;
-	void *entry;
-
-	ret = radix_tree_gang_lookup_tag(&idr->idr_rt, &entry, id, 1, IDR_TAG);
-	if (ret != 1)
-		return NULL;
-	return entry;
+	return xa_empty(&idr->idr_rt) &&
+		xa_marked(&idr->idr_rt, IDR_FREE);
 }
 
 /**
diff --git a/kernel/pid.c b/kernel/pid.c
index bd72d1dbff95..d2297578466f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -66,6 +66,8 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
+#define PID_XA_FLAGS	(XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
+
 /*
  * PID-map pages start out as NULL, they get allocated upon
  * first use and are never deallocated. This way a low pid_max
@@ -74,7 +76,7 @@ int pid_max_max = PID_MAX_LIMIT;
  */
 struct pid_namespace init_pid_ns = {
 	.ns.count = REFCOUNT_INIT(2),
-	.idr = IDR_INIT(init_pid_ns.idr),
+	.idr = __IDR_INIT(init_pid_ns.idr, 0, PID_XA_FLAGS, true),
 	.pid_allocated = PIDNS_ADDING,
 	.level = 0,
 	.child_reaper = &init_task,
@@ -205,7 +207,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			set_tid_size--;
 		}
 
-		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);
 
 		if (tid) {
@@ -234,7 +235,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 					      pid_max, GFP_ATOMIC);
 		}
 		spin_unlock_irq(&pidmap_lock);
-		idr_preload_end();
 
 		if (nr < 0) {
 			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
@@ -696,7 +696,7 @@ void __init pid_idr_init(void)
 				PIDS_PER_CPU_MIN * num_possible_cpus());
 	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
 
-	idr_init(&init_pid_ns.idr);
+	__idr_init(&init_pid_ns.idr, 0, PID_XA_FLAGS, true);
 
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
diff --git a/lib/idr.c b/lib/idr.c
index f4ab4f4aa3c7..ae6dac08683c 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -37,11 +37,27 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid,
 	void __rcu **slot;
 	unsigned int base = idr->idr_base;
 	unsigned int id = *nextid;
+	int error;
+	struct xa_limit limit;
+
+	id = (id < base) ? 0 : id - base;
+
+	if (idr->idr_xa) {
+		limit.min = id;
+		limit.max = max - base;
+		error = xa_alloc(&idr->idr_rt, nextid, ptr, limit, gfp);
+		/* error compatibility w/ radix-tree */
+		if (error == -EBUSY)
+			return -ENOSPC;
+		else if (error)
+			return error;
+		*nextid += base;
+		return 0;
+	}
 
 	if (WARN_ON_ONCE(!(idr->idr_rt.xa_flags & ROOT_IS_IDR)))
 		idr->idr_rt.xa_flags |= IDR_RT_MARKER;
 
-	id = (id < base) ? 0 : id - base;
 	radix_tree_iter_init(&iter, id);
 	slot = idr_get_free(&idr->idr_rt, &iter, gfp, max - base);
 	if (IS_ERR(slot))
@@ -151,6 +167,8 @@ EXPORT_SYMBOL(idr_alloc_cyclic);
  */
 void *idr_remove(struct idr *idr, unsigned long id)
 {
+	if (idr->idr_xa)
+		return xa_erase(&idr->idr_rt, id - idr->idr_base);
 	return radix_tree_delete_item(&idr->idr_rt, id - idr->idr_base, NULL);
 }
 EXPORT_SYMBOL_GPL(idr_remove);
@@ -171,6 +189,8 @@ EXPORT_SYMBOL_GPL(idr_remove);
  */
 void *idr_find(const struct idr *idr, unsigned long id)
 {
+	if (idr->idr_xa)
+		return xa_load(&idr->idr_rt, id - idr->idr_base);
 	return radix_tree_lookup(&idr->idr_rt, id - idr->idr_base);
 }
 EXPORT_SYMBOL_GPL(idr_find);
@@ -233,6 +253,14 @@ void *idr_get_next_ul(struct idr *idr, unsigned long *nextid)
 	unsigned long id = *nextid;
 
 	id = (id < base) ? 0 : id - base;
+
+	if (idr->idr_xa) {
+		entry = xa_find(&idr->idr_rt, &id, ULONG_MAX, XA_PRESENT);
+		if (entry)
+			*nextid = id + base;
+		return entry;
+	}
+
 	radix_tree_for_each_slot(slot, &idr->idr_rt, &iter, id) {
 		entry = rcu_dereference_raw(*slot);
 		if (!entry)
@@ -295,6 +323,12 @@ void *idr_replace(struct idr *idr, void *ptr, unsigned long id)
 
 	id -= idr->idr_base;
 
+	if (idr->idr_xa) {
+		entry = xa_store(&idr->idr_rt, id, ptr, 0);
+		/* XXX: error translation? */
+		return entry;
+	}
+
 	entry = __radix_tree_lookup(&idr->idr_rt, id, &node, &slot);
 	if (!slot || radix_tree_tag_get(&idr->idr_rt, id, IDR_FREE))
 		return ERR_PTR(-ENOENT);
@@ -305,6 +339,41 @@ void *idr_replace(struct idr *idr, void *ptr, unsigned long id)
 }
 EXPORT_SYMBOL(idr_replace);
 
+void idr_set_tag(struct idr *idr, unsigned long id)
+{
+	if (idr->idr_xa)
+		xa_set_mark(&idr->idr_rt, id, IDR_TAG);
+	else
+		radix_tree_tag_set(&idr->idr_rt, id, IDR_TAG);
+}
+
+bool idr_get_tag(struct idr *idr, unsigned long id)
+{
+	if (idr->idr_xa)
+		return xa_get_mark(&idr->idr_rt, id, IDR_TAG);
+	return radix_tree_tag_get(&idr->idr_rt, id, IDR_TAG);
+}
+
+/*
+ * Find the entry at the next id with the internal tag set.
+ */
+void *idr_get_next_tag(struct idr *idr, unsigned long id)
+{
+	unsigned int ret;
+	void *entry;
+
+	if (idr->idr_xa) {
+		entry = xa_find(&idr->idr_rt, &id, ULONG_MAX, IDR_TAG);
+		return entry;
+	}
+
+	ret = radix_tree_gang_lookup_tag(&idr->idr_rt, &entry, id, 1, IDR_TAG);
+	if (ret != 1)
+		return NULL;
+	return entry;
+}
+
+
 /**
  * DOC: IDA description
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 08eef33e7820..8c6eb25aadbb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1476,6 +1476,18 @@ void idr_preload(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(idr_preload);
 
+/**
+ * idr_preload_end - end preload section started with idr_preload()
+ *
+ * Each idr_preload() should be matched with an invocation of this
+ * function.  See idr_preload() for details.
+ */
+void idr_preload_end(void)
+{
+	local_unlock(&radix_tree_preloads.lock);
+}
+EXPORT_SYMBOL(idr_preload_end);
+
 void __rcu **idr_get_free(struct radix_tree_root *root,
 			      struct radix_tree_iter *iter, gfp_t gfp,
 			      unsigned long max)
Matthew Wilcox July 11, 2022, 8:24 p.m. UTC | #10
On Tue, Jun 28, 2022 at 10:53:50AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> > On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > > state of the radix-tree interface vs. the underlying data structure. If
> > > > I follow correctly, you're saying the radix-tree api is essentially
> > > > already a translation layer to the xarray these days, and we just need
> > > > to move legacy users off the radix-tree api so we can eventually kill it
> > > > off...
> > > 
> > > If only it were that easy ... the XArray has a whole bunch of debugging
> > > asserts to make sure the users are actually using it correctly, and a
> > > lot of radix tree users don't (they're probably not buggy, but they
> > > don't use the XArray's embedded lock).
> > > 
> > > Anyway, here's a first cut at converting the PID allocator from the IDR
> > > to the XArray API.  It boots, but I haven't tried to do anything tricky
> > > with PID namespaces or CRIU.
> > 
> > It'd be great to see that conversion done.
> > Fwiw, there are test cases for e.g. nested pid namespace creation with
> > specifically requested PIDs in
> > 
> 
> Ok, but I'm a little confused. Why open code the xarray usage as opposed
> to work the idr bits closer to being able to use the xarray api (and/or
> work the xarray to better support the idr use case)? I see 150+ callers
> of idr_init(). Is the goal to eventually open code them all? That seems
> like a lot of potential API churn for something that is presumably a generic
> interface (and perhaps inconsistent with ida, which looks like it uses
> xarray directly?), but I'm probably missing details.

It's not "open coding".  It's "using the XArray API instead of the
IDR API".  The IDR API is inferior in a number of ways, and yes, I
do want to be rid of it entirely.
Patch

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..08eef33e7820 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -431,12 +431,14 @@  static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
 				tag_clear(node, IDR_FREE, 0);
 				root_tag_set(root, IDR_FREE);
 			}
-		} else {
-			/* Propagate the aggregated tag info to the new child */
-			for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-				if (root_tag_get(root, tag))
-					tag_set(node, tag, 0);
-			}
+		}
+
+		/* Propagate the aggregated tag info to the new child */
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+			if (is_idr(root) && tag == IDR_FREE)
+				continue;
+			if (root_tag_get(root, tag))
+				tag_set(node, tag, 0);
 		}
 
 		BUG_ON(shift > BITS_PER_LONG);
@@ -1368,11 +1370,13 @@  static bool __radix_tree_delete(struct radix_tree_root *root,
 	unsigned offset = get_slot_offset(node, slot);
 	int tag;
 
-	if (is_idr(root))
-		node_tag_set(root, node, IDR_FREE, offset);
-	else
-		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
-			node_tag_clear(root, node, tag, offset);
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (is_idr(root) && tag == IDR_FREE) {
+			node_tag_set(root, node, tag, offset);
+			continue;
+		}
+		node_tag_clear(root, node, tag, offset);
+	}
 
 	replace_slot(slot, NULL, node, -1, values);
 	return node && delete_node(root, node);