diff mbox series

[RFC] pidfs: use maple tree

Message ID 20241206-work-pidfs-maple_tree-v1-1-1cca6731b67f@kernel.org (mailing list archive)
State New
Headers show
Series [RFC] pidfs: use maple tree | expand

Commit Message

Christian Brauner Dec. 6, 2024, 3:25 p.m. UTC
So far we've been using an idr to track pidfs inodes. For some time now
each struct pid has a unique 64bit value that is used as the inode
number on 64 bit. That unique inode couldn't be used for looking up a
specific struct pid though.

Now that we support file handles we need this ability while avoiding to
leak actual pid identifiers into userspace which can be problematic in
containers.

So far I had used an idr-based mechanism where the idr is used to
generate a 32 bit number and each time it wraps we increment an upper
bit value and generate a unique 64 bit value. The lower 32 bits are used
to lookup the pid.

I've been looking at the maple tree because it now has
mas_alloc_cyclic(). Since it uses unsigned long it would simplify the
64bit implementation and its dense node mode supposedly also helps to
mitigate fragmentation.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Hey Liam,

Could you look this over and tell me whether my port makes any sense and
is safe? This is the first time I've been playing with the maple tree.

I've also considerd preallocating the node during pid allocation outside
of the spinlock using mas_preallocate() similar to how idr_preload()
works but I'm unclear how the mas_preallocate() api is supposed to
work in this case.

For the pidfs inode maple tree we use an external lock for add and
remove. While looking at the maple_tree code I saw that mas_nomem()
is called in mas_alloc_cyclic(). And mas_nomem() has a
__must_hold(mas->tree->ma_lock) annotation. But then the code checks
mt_external_lock() which is placed in a union with ma_lock iirc. So that
annotation seems wrong?

bool mas_nomem(struct ma_state *mas, gfp_t gfp)
        __must_hold(mas->tree->ma_lock)
{
        if (likely(mas->node != MA_ERROR(-ENOMEM)))
                return false;

        if (gfpflags_allow_blocking(gfp) && !mt_external_lock(mas->tree)) {
                mtree_unlock(mas->tree);
                mas_alloc_nodes(mas, gfp);
                mtree_lock(mas->tree);
        } else {
                mas_alloc_nodes(mas, gfp);
        }

        if (!mas_allocated(mas))
                return false;

        mas->status = ma_start;
        return true;
}

If you want to look at this in context I would please ask you to pull:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.14.pidfs

Thanks!
Christian
---
 fs/pidfs.c          | 35 ++++++++++++++++++-----------------
 include/linux/pid.h |  1 +
 kernel/pid.c        |  8 +++++---
 3 files changed, 24 insertions(+), 20 deletions(-)


---
base-commit: 963c8e506c6d4769d04fcb64d4bf783e4ef6093e
change-id: 20241206-work-pidfs-maple_tree-b322ff5399b7

Comments

Christian Brauner Dec. 6, 2024, 3:31 p.m. UTC | #1
On Fri, Dec 06, 2024 at 04:25:13PM +0100, Christian Brauner wrote:
> So far we've been using an idr to track pidfs inodes. For some time now
> each struct pid has a unique 64bit value that is used as the inode
> number on 64 bit. That unique inode couldn't be used for looking up a
> specific struct pid though.
> 
> Now that we support file handles we need this ability while avoiding to
> leak actual pid identifiers into userspace which can be problematic in
> containers.
> 
> So far I had used an idr-based mechanism where the idr is used to
> generate a 32 bit number and each time it wraps we increment an upper
> bit value and generate a unique 64 bit value. The lower 32 bits are used
> to lookup the pid.
> 
> I've been looking at the maple tree because it now has
> mas_alloc_cyclic(). Since it uses unsigned long it would simplify the
> 64bit implementation and its dense node mode supposedly also helps to
> mitigate fragmentation.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Hey Liam,
> 
> Could you look this over and tell me whether my port makes any sense and
> is safe? This is the first time I've been playing with the maple tree.
> 
> I've also considerd preallocating the node during pid allocation outside
> of the spinlock using mas_preallocate() similar to how idr_preload()
> works but I'm unclear how the mas_preallocate() api is supposed to
> work in this case.
> 
> For the pidfs inode maple tree we use an external lock for add and
> remove. While looking at the maple_tree code I saw that mas_nomem()
> is called in mas_alloc_cyclic(). And mas_nomem() has a
> __must_hold(mas->tree->ma_lock) annotation. But then the code checks
> mt_external_lock() which is placed in a union with ma_lock iirc. So that
> annotation seems wrong?
> 
> bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>         __must_hold(mas->tree->ma_lock)
> {
>         if (likely(mas->node != MA_ERROR(-ENOMEM)))
>                 return false;
> 
>         if (gfpflags_allow_blocking(gfp) && !mt_external_lock(mas->tree)) {
>                 mtree_unlock(mas->tree);
>                 mas_alloc_nodes(mas, gfp);
>                 mtree_lock(mas->tree);
>         } else {
>                 mas_alloc_nodes(mas, gfp);
>         }
> 
>         if (!mas_allocated(mas))
>                 return false;
> 
>         mas->status = ma_start;
>         return true;
> }
> 
> If you want to look at this in context I would please ask you to pull:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.14.pidfs

Sorry, I meant work.pidfs.maple_tree.
Matthew Wilcox (Oracle) Dec. 6, 2024, 3:46 p.m. UTC | #2
On Fri, Dec 06, 2024 at 04:25:13PM +0100, Christian Brauner wrote:
> For the pidfs inode maple tree we use an external lock for add and

Please don't.  We want to get rid of it.  That's a hack that only exists
for the VMA tree.
Christian Brauner Dec. 6, 2024, 4:03 p.m. UTC | #3
On Fri, Dec 06, 2024 at 03:46:37PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 06, 2024 at 04:25:13PM +0100, Christian Brauner wrote:
> > For the pidfs inode maple tree we use an external lock for add and
> 
> Please don't.  We want to get rid of it.  That's a hack that only exists
> for the VMA tree.

Hm, ok. Then I'll stick with idr for now because we can simply use
pidmap_lock. Otherwise we'd have to introduce more locking.
Matthew Wilcox (Oracle) Dec. 6, 2024, 4:06 p.m. UTC | #4
On Fri, Dec 06, 2024 at 05:03:03PM +0100, Christian Brauner wrote:
> On Fri, Dec 06, 2024 at 03:46:37PM +0000, Matthew Wilcox wrote:
> > On Fri, Dec 06, 2024 at 04:25:13PM +0100, Christian Brauner wrote:
> > > For the pidfs inode maple tree we use an external lock for add and
> > 
> > Please don't.  We want to get rid of it.  That's a hack that only exists
> > for the VMA tree.
> 
> Hm, ok. Then I'll stick with idr for now because we can simply use
> pidmap_lock. Otherwise we'd have to introduce more locking.

Why can you not delete pidmap_lock and use the maple tree lock
everywhere that you currently use pidmap_lock?  That's the intended
way to use it.
Christian Brauner Dec. 6, 2024, 4:57 p.m. UTC | #5
On Fri, Dec 06, 2024 at 04:06:37PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 06, 2024 at 05:03:03PM +0100, Christian Brauner wrote:
> > On Fri, Dec 06, 2024 at 03:46:37PM +0000, Matthew Wilcox wrote:
> > > On Fri, Dec 06, 2024 at 04:25:13PM +0100, Christian Brauner wrote:
> > > > For the pidfs inode maple tree we use an external lock for add and
> > > 
> > > Please don't.  We want to get rid of it.  That's a hack that only exists
> > > for the VMA tree.
> > 
> > Hm, ok. Then I'll stick with idr for now because we can simply use
> > pidmap_lock. Otherwise we'd have to introduce more locking.
> 
> Why can you not delete pidmap_lock and use the maple tree lock
> everywhere that you currently use pidmap_lock?  That's the intended
> way to use it.

Each pid namespace has it's own idr (as each pid namespace gets its own
pid number space) and the whole pid allocation across all pid namespaces
idrs is protected by pidname_lock.

All pid numbers in all pid namespaces idrs are allocated (acquiring and
dropping pidmap_lock) but storing a NULL so that find_pid_ns() isn't
able to find a half-initialized struct pid. The pidmap_lock is taken
again to publish all pid numbers once its finalized. It also protects
each pid namespaces's pid number allocation indicator
pid_ns->pid_allocated.

So in short, I'm pretty sure that it's possible to port this to the
maple tree overall and rely on the maple tree locks but I'm not sure how
straightforward it would be and I don't want to tie this work into the
pidfs file handle work as a preliminary as well.
diff mbox series

Patch

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 7a1d606b09d7b315e780c81fc7341f4ec82f2639..d1584afc9fe5729d5dbad8e084c62a6c19754dcc 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -19,13 +19,12 @@ 
 #include <linux/ipc_namespace.h>
 #include <linux/time_namespace.h>
 #include <linux/utsname.h>
+#include <linux/maple_tree.h>
 #include <net/net_namespace.h>
 
 #include "internal.h"
 #include "mount.h"
 
-static DEFINE_IDR(pidfs_ino_idr);
-
 static u32 pidfs_ino_upper_32_bits = 0;
 
 #if BITS_PER_LONG == 32
@@ -34,8 +33,6 @@  static u32 pidfs_ino_upper_32_bits = 0;
  * the higher 32 bits are the generation number. The starting
  * value for the inode number and the generation number is one.
  */
-static u32 pidfs_ino_lower_32_bits = 1;
-
 static inline unsigned long pidfs_ino(u64 ino)
 {
 	return lower_32_bits(ino);
@@ -49,8 +46,6 @@  static inline u32 pidfs_gen(u64 ino)
 
 #else
 
-static u32 pidfs_ino_lower_32_bits = 0;
-
 /* On 64 bit simply return ino. */
 static inline unsigned long pidfs_ino(u64 ino)
 {
@@ -71,22 +66,25 @@  static inline u32 pidfs_gen(u64 ino)
  */
 int pidfs_add_pid(struct pid *pid)
 {
-	u32 upper;
-	int lower;
+	static unsigned long lower_next = 0;
+	unsigned long lower;
+	int ret;
+
+	MA_STATE(mas, &pidfs_ino_mtree, 0, 0);
 
         /*
 	 * Inode numbering for pidfs start at 2. This avoids collisions
 	 * with the root inode which is 1 for pseudo filesystems.
          */
-	lower = idr_alloc_cyclic(&pidfs_ino_idr, pid, 2, 0, GFP_ATOMIC);
-	if (lower >= 0 && lower < pidfs_ino_lower_32_bits)
+	ret = mas_alloc_cyclic(&mas, &lower, pid, 2, ULONG_MAX, &lower_next, GFP_ATOMIC);
+	if (ret < 0)
+		return ret;
+
+	/* Wrapping really only happens on 32 bit. */
+	if (ret == 1)
 		pidfs_ino_upper_32_bits++;
-	upper = pidfs_ino_upper_32_bits;
-	pidfs_ino_lower_32_bits = lower;
-	if (lower < 0)
-		return lower;
 
-	pid->ino = ((u64)upper << 32) | lower;
+	pid->ino = ((u64)pidfs_ino_upper_32_bits << 32) | lower;
 	pid->stashed = NULL;
 	return 0;
 }
@@ -94,7 +92,10 @@  int pidfs_add_pid(struct pid *pid)
 /* The idr number to remove is the lower 32 bits of the inode. */
 void pidfs_remove_pid(struct pid *pid)
 {
-	idr_remove(&pidfs_ino_idr, lower_32_bits(pid->ino));
+	unsigned long pid_ino = pidfs_ino(pid->ino);
+
+	MA_STATE(mas, &pidfs_ino_mtree, pid_ino, pid_ino);
+	mas_erase(&mas);
 }
 
 #ifdef CONFIG_PROC_FS
@@ -522,7 +523,7 @@  static struct pid *pidfs_ino_get_pid(u64 ino)
 
 	guard(rcu)();
 
-	pid = idr_find(&pidfs_ino_idr, lower_32_bits(pid_ino));
+	pid = mtree_load(&pidfs_ino_mtree, pid_ino);
 	if (!pid)
 		return NULL;
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index a3aad9b4074cb09b12ec1cb98c8148250c506e0a..f023af78eaf67520311ed723f2407125f72253b7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -68,6 +68,7 @@  struct pid
 	struct upid numbers[];
 };
 
+extern struct maple_tree pidfs_ino_mtree;
 extern struct pid init_struct_pid;
 
 struct file;
diff --git a/kernel/pid.c b/kernel/pid.c
index 6131543e7c090c164a2bac014f8eeee61926b13d..bb46e3ca2468d7b6657ee4b1b00009ecf7b28fb5 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -104,6 +104,11 @@  EXPORT_SYMBOL_GPL(init_pid_ns);
 
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 
+struct maple_tree pidfs_ino_mtree = MTREE_INIT_EXT(pidfs_ino_mtree,
+						   MT_FLAGS_ALLOC_RANGE |
+						   MT_FLAGS_LOCK_EXTERN,
+						   pidmap_lock);
+
 void put_pid(struct pid *pid)
 {
 	struct pid_namespace *ns;
@@ -269,7 +274,6 @@  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	INIT_HLIST_HEAD(&pid->inodes);
 
 	upid = pid->numbers + ns->level;
-	idr_preload(GFP_KERNEL);
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
@@ -282,13 +286,11 @@  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 		upid->ns->pid_allocated++;
 	}
 	spin_unlock_irq(&pidmap_lock);
-	idr_preload_end();
 
 	return pid;
 
 out_unlock:
 	spin_unlock_irq(&pidmap_lock);
-	idr_preload_end();
 	put_pid_ns(ns);
 
 out_free: