| Message ID | 20241209035251.GV3387508@ZenIV (mailing list archive) |
|---|---|
| State | New |
| Series | [RFC] make take_dentry_name_snapshot() lockless |
On Mon, Dec 09, 2024 at 03:52:51AM +0000, Al Viro wrote:
> There's a bunch of places where we are accessing dentry names without
> sufficient protection and where locking environment is not predictable
> enough to fix the things that way; take_dentry_name_snapshot() is
> one variant of solution. It does, however, have a problem - copying
> is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.
>
> How about the following (completely untested)?
>
> Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
> that avoids any stores to shared data objects and in case of long names
> we are down to (unavoidable) atomic_inc on the external_name refcount.
>
> Makes the thing safer as well - the areas where ->d_seq is held odd are
> all nested inside the areas where ->d_lock is held, and the latter are
> much more numerous.

Is there a problem retaining the lock acquire if things fail?

As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
progress.

I don't think there is a *real* workload where this would be a problem,
but with core counts seen today one may be able to purposefully introduce
stalls when running this.

>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/dcache.c b/fs/dcache.c
> index b4d5e9e1e43d..78fd7e2a3011 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -329,16 +329,34 @@ static inline int dname_external(const struct dentry *dentry)
>
>  void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
>  {
> -	spin_lock(&dentry->d_lock);
> +	unsigned seq;
> +
> +	rcu_read_lock();
> +retry:
> +	seq = read_seqcount_begin(&dentry->d_seq);
>  	name->name = dentry->d_name;
> -	if (unlikely(dname_external(dentry))) {
> -		atomic_inc(&external_name(dentry)->u.count);
> -	} else {
> -		memcpy(name->inline_name, dentry->d_iname,
> -		       dentry->d_name.len + 1);
> +	if (read_seqcount_retry(&dentry->d_seq, seq))
> +		goto retry;
> +	// ->name and ->len are at least consistent with each other, so if
> +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> +	if (likely(name->name.name == dentry->d_iname)) {
> +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
>  		name->name.name = name->inline_name;
> +		if (read_seqcount_retry(&dentry->d_seq, seq))
> +			goto retry;
> +	} else {
> +		struct external_name *p;
> +		p = container_of(name->name.name, struct external_name, name[0]);
> +		// get a valid reference
> +		if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> +			goto retry;
> +		if (read_seqcount_retry(&dentry->d_seq, seq)) {
> +			if (unlikely(atomic_dec_and_test(&p->u.count)))
> +				kfree_rcu(p, u.head);
> +			goto retry;
> +		}
>  	}
> -	spin_unlock(&dentry->d_lock);
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL(take_dentry_name_snapshot);
>
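[Editorial note: a minimal sketch of the bounded-retry idea raised above - try the
seqcount fast path a few times, then fall back to ->d_lock for guaranteed forward
progress. The function name, retry limit, and elided copy steps are illustrative
assumptions, not code from the thread.]

	/* Hypothetical sketch only: bounded lockless retries, then fall back
	 * to ->d_lock so a stream of renames cannot stall the caller forever. */
	void take_dentry_name_snapshot_bounded(struct name_snapshot *name,
					       struct dentry *dentry)
	{
		int tries = 3;	/* illustrative limit */

		rcu_read_lock();
		while (tries--) {
			unsigned seq = read_seqcount_begin(&dentry->d_seq);

			/* ... lockless copy as in the patch above ... */
			if (!read_seqcount_retry(&dentry->d_seq, seq)) {
				rcu_read_unlock();
				return;
			}
		}
		rcu_read_unlock();

		/* slow path: ->d_lock excludes renames, so no retry is needed */
		spin_lock(&dentry->d_lock);
		/* ... copy under the lock, as the current code does ... */
		spin_unlock(&dentry->d_lock);
	}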
On Mon, Dec 09, 2024 at 07:33:00AM +0100, Mateusz Guzik wrote:
> On Mon, Dec 09, 2024 at 03:52:51AM +0000, Al Viro wrote:
> > There's a bunch of places where we are accessing dentry names without
> > sufficient protection and where locking environment is not predictable
> > enough to fix the things that way; take_dentry_name_snapshot() is
> > one variant of solution. It does, however, have a problem - copying
> > is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.
> >
> > How about the following (completely untested)?
> >
> > Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
> > that avoids any stores to shared data objects and in case of long names
> > we are down to (unavoidable) atomic_inc on the external_name refcount.
> >
> > Makes the thing safer as well - the areas where ->d_seq is held odd are
> > all nested inside the areas where ->d_lock is held, and the latter are
> > much more numerous.
>
> Is there a problem retaining the lock acquire if things fail?
>
> As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
> progress.
>
> I don't think there is a *real* workload where this would be a problem,
> but with core counts seen today one may be able to purposefully introduce
> stalls when running this.

By renaming the poor sucker back and forth in a tight loop? Would be hard
to trigger on anything short of ramfs...

Hell knows - if anything, I was thinking about a variant that would
*not* loop at all, but take seq as an argument and return whether it
had been successful. That could be adapted to build such a thing -
with "pass ->d_seq sampled value (always even) *or* call it with
the name stabilized in some other way (e.g. ->d_lock, rename_lock or
->s_vfs_rename_mutex held) and pass 1 as argument to suppress checks"
for calling conventions.

The thing is, when it's done to a chain of ancestors of some dentry,
with rename_lock retries around the entire thing, running into a ->d_seq
change pretty much guarantees that you'll need to retry the whole thing
anyway.
On Mon, Dec 9, 2024 at 7:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Dec 09, 2024 at 07:33:00AM +0100, Mateusz Guzik wrote:
> > Is there a problem retaining the lock acquire if things fail?
> >
> > As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
> > progress.
> >
> > I don't think there is a *real* workload where this would be a problem,
> > but with core counts seen today one may be able to purposefully introduce
> > stalls when running this.
>
> By renaming the poor sucker back and forth in a tight loop? Would be hard
> to trigger on anything short of ramfs...
>
> Hell knows - if anything, I was thinking about a variant that would
> *not* loop at all, but take seq as an argument and return whether it
> had been successful. That could be adapted to build such a thing -
> with "pass ->d_seq sampled value (always even) *or* call it with
> the name stabilized in some other way (e.g. ->d_lock, rename_lock or
> ->s_vfs_rename_mutex held) and pass 1 as argument to suppress checks"
> for calling conventions.
>
> The thing is, when it's done to a chain of ancestors of some dentry,
> with rename_lock retries around the entire thing, running into a ->d_seq
> change pretty much guarantees that you'll need to retry the whole thing
> anyway.

I'm not caught up on the lists, I presume this came up with grabbing
the name on execve?

Indeed, a variant which takes a seq argument would work nicely here,
possibly with a dedicated wrapper handling the locking for standalone
consumers.

The VFS layer is rather inconsistent on the locking front -- some places
try a seq-protected op once and resort to locking or erroring out, while
others keep looping with no forward-progress guarantee, at least in
principle. How much of a problem that is presumably depends on the
particular case. But my point is that if the indefinite looping (however
realistic or hypothetical) can be trivially avoided, it should be.
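[Editorial note: a rough sketch of the calling convention discussed in the two
messages above - an assumption about how it might look, not code from the thread.
The caller passes a sampled (even) ->d_seq value, or 1 when the name is already
stabilized by ->d_lock, rename_lock or ->s_vfs_rename_mutex, and the helper reports
whether the snapshot is usable.]

	/* Sketch; name and signature are assumed, not from the patch. */
	bool __take_dentry_name_snapshot_seq(struct name_snapshot *name,
					     struct dentry *dentry, unsigned seq)
	{
		/* ... lockless copy of ->d_name as in the patch above ... */
		if (seq == 1)		/* name held stable by the caller */
			return true;
		return !read_seqcount_retry(&dentry->d_seq, seq);
	}

A standalone consumer could then wrap this in its own bounded loop or take
->d_lock itself, while path-walking callers that already retry under rename_lock
would simply propagate the failure.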
On Mon, Dec 09, 2024 at 08:18:35AM +0100, Mateusz Guzik wrote:

> I'm not caught up on the lists, I presume this came up with grabbing
> the name on execve?

Not at all. fexecve() nonsense can use either variant (or go fuck itself,
for all I care) - it's not worth optimizing for.

No, the problem is in the things like ceph_mdsc_build_path()...
On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>  void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)

Editing the patch down so that you just see the end result (ie all the
"-" lines are gone):

>  {
> +	unsigned seq;
> +
> +	rcu_read_lock();
> +retry:
> +	seq = read_seqcount_begin(&dentry->d_seq);
>  	name->name = dentry->d_name;
> +	if (read_seqcount_retry(&dentry->d_seq, seq))
> +		goto retry;

Ugh. This early retry is cheap on x86, but on a lot of architectures it
will be a fairly expensive read barrier.

I see why you do it, since you want 'name.len' and 'name.name' to be
consistent, but I do hate it.

The name consistency issue is really annoying. Do we really need it
here? Because honestly, what you actually *really* care about here is
whether it's inline or not, and you do that test right afterwards:

> +	// ->name and ->len are at least consistent with each other, so if
> +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> +	if (likely(name->name.name == dentry->d_iname)) {
> +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);

and here it would actually be more efficient to just use a
constant-sized memcpy with DNAME_INLINE_LEN, and never care about
'len' at all.

And if the length in name.len isn't right (we'll return it to the
user), the *final* seqcount check will catch it.

And in the other case, all you care about is the ref-count.

End result: I think your early retry is pointless and should just be
avoided. Instead, you should make sure to just read dentry->d_name.name
with a READ_ONCE() (and read the len any way you want).

Hmm?

               Linus
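[Editorial note: roughly what that suggestion amounts to, as a sketch of the fast
path inside the retry loop - drop the early check, read the pointer once, copy the
whole inline buffer regardless of length, and rely on the single final ->d_seq
check. The external-name branch is elided; this is not code from the thread.]

	seq = read_seqcount_begin(&dentry->d_seq);
	s = READ_ONCE(dentry->d_name.name);
	name->name.len = dentry->d_name.len;	/* a torn value is caught below */
	if (likely(s == dentry->d_iname)) {
		/* constant-sized copy; both buffers are DNAME_INLINE_LEN bytes */
		memcpy(name->inline_name, dentry->d_iname, DNAME_INLINE_LEN);
		name->name.name = name->inline_name;
	} else {
		/* external name: take a reference instead of copying */
	}
	if (read_seqcount_retry(&dentry->d_seq, seq))
		goto retry;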
On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +	struct external_name *p;
> +	p = container_of(name->name.name, struct external_name, name[0]);
> +	// get a valid reference
> +	if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> +		goto retry;

Oh - this is very much *not* safe.

The other comment I had was really about "that's bad for performance".
But this is actually actively buggy.

If the external name ref has gone down to zero, we can *not* do that

	atomic_inc_not_zero(..)

thing any more, because the refcount is in a union with the rcu_head
for delaying the free.

In other words: the *name* will exist for the duration of the
rcu_read_lock() we hold, but that "p->u.count" will not. When the
refcount has gone to zero, the refcount is no longer usable.

IOW, you may be happily incrementing what is now a RCU list head
rather than a count.

So NAK. This cannot work.

It's probably easily fixable by just not using a union in struct
external_name, and just having separate fields for the refcount and
the rcu_head, but in the current state your patch is fundamentally and
dangerously buggy.

(Dangerously, because you will never *ever* actually hit that tiny
race - you need to race with a concurrent rename *and* a drop of the
other dentry that got the external ref, so in practice this is likely
entirely impossible to ever hit, but that only makes it more subtle
and dangerous).

               Linus
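[Editorial note: for context, the union in question, as fs/dcache.c stands before
the patch. Once the last reference is dropped, the same storage is reused as the
rcu_head for the delayed kfree, so the count may no longer be touched.]

	struct external_name {
		union {
			atomic_t count;		/* refcount while the name is live */
			struct rcu_head head;	/* reused by kfree_rcu() once count hits 0 */
		} u;
		unsigned char name[];
	};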
On Mon, Dec 09, 2024 at 11:06:48AM -0800, Linus Torvalds wrote:
> On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > +	struct external_name *p;
> > +	p = container_of(name->name.name, struct external_name, name[0]);
> > +	// get a valid reference
> > +	if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> > +		goto retry;
>
> Oh - this is very much *not* safe.
>
> The other comment I had was really about "that's bad for performance".
> But this is actually actively buggy.
>
> If the external name ref has gone down to zero, we can *not* do that
>
> 	atomic_inc_not_zero(..)
>
> thing any more, because the refcount is in a union with the rcu_head
> for delaying the free.

D'oh. Right you are; missed it...

> In other words: the *name* will exist for the duration of the
> rcu_read_lock() we hold, but that "p->u.count" will not. When the
> refcount has gone to zero, the refcount is no longer usable.
>
> IOW, you may be happily incrementing what is now a RCU list head
> rather than a count.
>
> So NAK. This cannot work.
>
> It's probably easily fixable by just not using a union in struct
> external_name, and just having separate fields for the refcount and
> the rcu_head, but in the current state your patch is fundamentally and
> dangerously buggy.

Agreed. And yes, separating the fields (and slapping a comment explaining
why they can not be combined) would be the easiest solution - any attempts
to be clever here would be too brittle for no good reason.
On Mon, Dec 09, 2024 at 10:27:04AM -0800, Linus Torvalds wrote:
> The name consistency issue is really annoying. Do we really need it
> here? Because honestly, what you actually *really* care about here is
> whether it's inline or not, and you do that test right afterwards:
>
> > +	// ->name and ->len are at least consistent with each other, so if
> > +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> > +	if (likely(name->name.name == dentry->d_iname)) {
> > +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
>
> and here it would actually be more efficient to just use a
> constant-sized memcpy with DNAME_INLINE_LEN, and never care about
> 'len' at all.

Actually, taking a look at what's generated for that memcpy()... *ow*

amd64 is fine, but anything that doesn't like unaligned accesses ends up
with really awful code. gcc does not realize that the pointers are
word-aligned. What's more, even

	unsigned long v[5];
	void f(unsigned long *w)
	{
		memcpy(v, w, sizeof(v));
	}

is not enough to convince the damn thing - try it for e.g. alpha and
you'll see arseloads of extq/insq/mskq, all inlined.

And yes, they are aligned - d_iname follows a pointer, inline_name follows
struct qstr, i.e. u64 + pointer. How about we add struct inlined_name {
unsigned char name[DNAME_INLINE_LEN]; }; and turn d_iname and inline_name
into anon unions with that? Hell, might even make it an array of unsigned
long and use that to deal with this

	} else {
		/*
		 * Both are internal.
		 */
		unsigned int i;
		BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
		for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
			swap(((long *) &dentry->d_iname)[i],
			     ((long *) &target->d_iname)[i]);
		}
	}

in swap_names(). With struct assignment in the corresponding case in
copy_name() and in take_dentry_name_snapshot() - that does generate sane
code...
On Mon, Dec 09, 2024 at 09:17:08PM +0000, Al Viro wrote:
> And yes, they are aligned - d_iname follows a pointer, inline_name follows
> struct qstr, i.e. u64 + pointer. How about we add struct inlined_name {
> unsigned char name[DNAME_INLINE_LEN]; }; and turn d_iname and inline_name
> into anon unions with that? Hell, might even make it an array of unsigned
> long and use that to deal with this
>
> 	} else {
> 		/*
> 		 * Both are internal.
> 		 */
> 		unsigned int i;
> 		BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
> 		for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
> 			swap(((long *) &dentry->d_iname)[i],
> 			     ((long *) &target->d_iname)[i]);
> 		}
> 	}
>
> in swap_names(). With struct assignment in the corresponding case in
> copy_name() and in take_dentry_name_snapshot() - that does generate sane
> code...

Do you have any objections to the diff below? Completely untested and
needs to be split in two...

It does seem to generate decent code; it obviously will need to be
profiled - copy_name() in the common (short-to-short) case copies the
entire thing, all 5 words of it on 64bit, instead of memcpy() of just
the amount that needs to be copied. That thing may cross the cacheline
boundary, but both cachelines had been freshly brought into cache and
dirtied, so...

diff --git a/fs/dcache.c b/fs/dcache.c
index b4d5e9e1e43d..687558622acf 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -296,10 +296,8 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 }
 
 struct external_name {
-	union {
-		atomic_t count;
-		struct rcu_head head;
-	} u;
+	atomic_t count;		// ->count and ->head can't be combined
+	struct rcu_head head;	// see take_dentry_name_snapshot()
 	unsigned char name[];
 };
 
@@ -329,16 +327,33 @@ static inline int dname_external(const struct dentry *dentry)
 
 void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
 {
-	spin_lock(&dentry->d_lock);
-	name->name = dentry->d_name;
-	if (unlikely(dname_external(dentry))) {
-		atomic_inc(&external_name(dentry)->u.count);
-	} else {
-		memcpy(name->inline_name, dentry->d_iname,
-		       dentry->d_name.len + 1);
+	unsigned seq;
+	const unsigned char *s;
+
+	rcu_read_lock();
+retry:
+	seq = read_seqcount_begin(&dentry->d_seq);
+	s = READ_ONCE(dentry->d_name.name);
+	name->name.hash_len = dentry->d_name.hash_len;
+	if (likely(s == dentry->d_iname)) {
 		name->name.name = name->inline_name;
+		name->__inline_name = dentry->__d_iname;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto retry;
+	} else {
+		struct external_name *p;
+		p = container_of(s, struct external_name, name[0]);
+		name->name.name = s;
+		// get a valid reference
+		if (unlikely(!atomic_inc_not_zero(&p->count)))
+			goto retry;
+		if (read_seqcount_retry(&dentry->d_seq, seq)) {
+			if (unlikely(atomic_dec_and_test(&p->count)))
+				kfree_rcu(p, head);
+			goto retry;
+		}
 	}
-	spin_unlock(&dentry->d_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(take_dentry_name_snapshot);
 
@@ -347,8 +362,8 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
 	if (unlikely(name->name.name != name->inline_name)) {
 		struct external_name *p;
 		p = container_of(name->name.name, struct external_name, name[0]);
-		if (unlikely(atomic_dec_and_test(&p->u.count)))
-			kfree_rcu(p, u.head);
+		if (unlikely(atomic_dec_and_test(&p->count)))
+			kfree_rcu(p, head);
 	}
 }
 EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -386,7 +401,7 @@ static void dentry_free(struct dentry *dentry)
 	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
-		if (likely(atomic_dec_and_test(&p->u.count))) {
+		if (likely(atomic_dec_and_test(&p->count))) {
 			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
@@ -1667,7 +1682,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 			kmem_cache_free(dentry_cache, dentry);
 			return NULL;
 		}
-		atomic_set(&p->u.count, 1);
+		atomic_set(&p->count, 1);
 		dname = p->name;
 	} else {
 		dname = dentry->d_iname;
@@ -2749,11 +2764,9 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
 			 * Both are internal.
 			 */
 			unsigned int i;
-			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
-				swap(((long *) &dentry->d_iname)[i],
-				     ((long *) &target->d_iname)[i]);
-			}
+			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++)
+				swap(dentry->__d_iname.name[i],
+				     target->__d_iname.name[i]);
 		}
 	}
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
@@ -2765,16 +2778,15 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
 	if (unlikely(dname_external(dentry)))
 		old_name = external_name(dentry);
 	if (unlikely(dname_external(target))) {
-		atomic_inc(&external_name(target)->u.count);
+		atomic_inc(&external_name(target)->count);
 		dentry->d_name = target->d_name;
 	} else {
-		memcpy(dentry->d_iname, target->d_name.name,
-				target->d_name.len + 1);
+		dentry->__d_iname = target->__d_iname;
 		dentry->d_name.name = dentry->d_iname;
 		dentry->d_name.hash_len = target->d_name.hash_len;
 	}
-	if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
-		kfree_rcu(old_name, u.head);
+	if (old_name && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
 }
 
 /*
@@ -3189,6 +3201,7 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
+	BUILD_BUG_ON(DNAME_INLINE_LEN != sizeof(struct short_name_store));
 	/*
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1f28f56fd406..d604a4826765 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -77,6 +77,10 @@ extern const struct qstr dotdot_name;
 # endif
 #endif
 
+struct short_name_store {
+	unsigned long name[DNAME_INLINE_LEN / sizeof(unsigned long)];
+};
+
 #define d_lock	d_lockref.lock
 
 struct dentry {
@@ -88,7 +92,10 @@ struct dentry {
 	struct qstr d_name;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
-	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+	union {
+		unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+		struct short_name_store __d_iname;
+	};
 	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
 
 	/* Ref lookup also touches following */
@@ -590,7 +597,10 @@ static inline struct inode *d_real_inode(const struct dentry *dentry)
 
 struct name_snapshot {
 	struct qstr name;
-	unsigned char inline_name[DNAME_INLINE_LEN];
+	union {
+		unsigned char inline_name[DNAME_INLINE_LEN];
+		struct short_name_store __inline_name;
+	};
 };
 void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
 void release_dentry_name_snapshot(struct name_snapshot *);
On Mon, 9 Dec 2024 at 14:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Do you have any objections to the diff below?

I'd split it up a bit more, and in particular, I think I'd want
DNAME_INLINE_LEN to be defined in terms of the long-words, rather than
doing it the other way around with that

	BUILD_BUG_ON(DNAME_INLINE_LEN != sizeof(struct short_name_store));

IOW, I'd *start* with something like the attached, and then build on that..

(By all means turn that

	unsigned long d_iname_words[DNAME_INLINE_WORDS];

into a struct so that you can then do the struct assignment for it - I
just hate the pattern of starting with a byte size and then doing
"DNAME_INLINE_LEN / sizeof(unsigned long)".)

Hmm?

               Linus
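[Editorial note: a sketch of the direction suggested above - make the word count
the primary constant and derive the byte length from it. The word counts below are
illustrative values, not taken from the thread.]

	#ifdef CONFIG_64BIT
	# define DNAME_INLINE_WORDS	5	/* illustrative */
	#else
	# define DNAME_INLINE_WORDS	9	/* illustrative */
	#endif
	#define DNAME_INLINE_LEN	(DNAME_INLINE_WORDS * sizeof(unsigned long))

	struct dentry {
		/* ... */
		unsigned long d_iname_words[DNAME_INLINE_WORDS];	/* small names */
		/* ... */
	};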
On Mon, 9 Dec 2024 at 14:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, I'd *start* with something like the attached, and then build on that..

Key word being "something like". I just checked, and that suggested
patch will cause build issues in lib/test_printf.c, because it does
things like

	.d_iname = "foo"

and it turns out that doesn't work when it's a flex-array.

It doesn't have to be a flex array, of course - it could easily just
continue to use DNAME_INLINE_LEN (now just defined in terms of "words *
sizeof(unsigned long)").

I did that flex array thing mainly to see if somebody ends up
depending on the array as such. Clearly that test_printf.c code does
exactly that, but looks like nothing else is.

               Linus
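[Editorial note: the breakage is of this form - lib/test_printf.c initializes the
member by name with a string literal, which stops compiling once d_iname becomes a
flexible array member. This is a simplified illustration, not the exact test code.]

	static const struct dentry test_dentry = {
		.d_iname = "foo",	/* invalid initializer for a flex-array member */
	};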
On Mon, Dec 09, 2024 at 02:55:45PM -0800, Linus Torvalds wrote:
> On Mon, 9 Dec 2024 at 14:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > IOW, I'd *start* with something like the attached, and then build on that..
>
> Key word being "something like". I just checked, and that suggested
> patch will cause build issues in lib/test_printf.c, because it does
> things like
>
> 	.d_iname = "foo"
>
> and it turns out that doesn't work when it's a flex-array.
>
> It doesn't have to be a flex array, of course - it could easily just
> continue to use DNAME_INLINE_LEN (now just defined in terms of "words *
> sizeof(unsigned long)").
>
> I did that flex array thing mainly to see if somebody ends up
> depending on the array as such. Clearly that test_printf.c code does
> exactly that, but looks like nothing else is.

Actually, grepping for DNAME_INLINE_LEN brings some interesting hits:

drivers/net/ieee802154/adf7242.c:1165:	char debugfs_dir_name[DNAME_INLINE_LEN + 1];

cargo-culted, AFAICS; would be better off with a constant of their own.

fs/ext4/fast_commit.c:326:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
fs/ext4/fast_commit.c:452:	if (dentry->d_name.len > DNAME_INLINE_LEN) {
fs/ext4/fast_commit.c:1332:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
fs/ext4/fast_commit.h:113:	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */

Looks like that might want struct name_snapshot, along with
{take,release}_dentry_name_snapshot().

fs/libfs.c:1792:	char strbuf[DNAME_INLINE_LEN];
fs/libfs.c:1819:	if (len <= DNAME_INLINE_LEN - 1) {

The memcpy() in there might very well be better off as a struct assignment.
And that thing probably ought to migrate to fs/dcache.c - the RCU-related
considerations in it are much trickier than is usual for ->d_compare()
instances.
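[Editorial note: the suggested conversion for the ext4 fast-commit code would boil
down to the usual snapshot pattern shown below; how exactly it slots into
fast_commit.c is left open in the thread, so this is only a usage sketch.]

	struct name_snapshot snap;

	take_dentry_name_snapshot(&snap, dentry);
	/* snap.name.name / snap.name.len are a stable private copy of the name,
	 * unaffected by concurrent renames of the dentry */
	/* ... use the copied name ... */
	release_dentry_name_snapshot(&snap);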
On Mon, Dec 09, 2024 at 11:12:37PM +0000, Al Viro wrote:
>
> Actually, grepping for DNAME_INLINE_LEN brings some interesting hits:
>
> drivers/net/ieee802154/adf7242.c:1165:	char debugfs_dir_name[DNAME_INLINE_LEN + 1];
>
> cargo-culted, AFAICS; would be better off with a constant of their own.
>
> fs/ext4/fast_commit.c:326:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> fs/ext4/fast_commit.c:452:	if (dentry->d_name.len > DNAME_INLINE_LEN) {
> fs/ext4/fast_commit.c:1332:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> fs/ext4/fast_commit.h:113:	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
>
> Looks like that might want struct name_snapshot, along with
> {take,release}_dentry_name_snapshot().

See viro/vfs.git#work.dcache. I've thrown the ext4/fast_commit conversion
into the end of that pile.

NOTE: that stuff obviously needs profiling. It does survive light testing
(boot/ltp/xfstests), but review and more testing (including serious
profiling) is obviously needed.

Patches in followups...
diff --git a/fs/dcache.c b/fs/dcache.c
index b4d5e9e1e43d..78fd7e2a3011 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -329,16 +329,34 @@ static inline int dname_external(const struct dentry *dentry)
 
 void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
 {
-	spin_lock(&dentry->d_lock);
+	unsigned seq;
+
+	rcu_read_lock();
+retry:
+	seq = read_seqcount_begin(&dentry->d_seq);
 	name->name = dentry->d_name;
-	if (unlikely(dname_external(dentry))) {
-		atomic_inc(&external_name(dentry)->u.count);
-	} else {
-		memcpy(name->inline_name, dentry->d_iname,
-		       dentry->d_name.len + 1);
+	if (read_seqcount_retry(&dentry->d_seq, seq))
+		goto retry;
+	// ->name and ->len are at least consistent with each other, so if
+	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
+	if (likely(name->name.name == dentry->d_iname)) {
+		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
 		name->name.name = name->inline_name;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto retry;
+	} else {
+		struct external_name *p;
+		p = container_of(name->name.name, struct external_name, name[0]);
+		// get a valid reference
+		if (unlikely(!atomic_inc_not_zero(&p->u.count)))
+			goto retry;
+		if (read_seqcount_retry(&dentry->d_seq, seq)) {
+			if (unlikely(atomic_dec_and_test(&p->u.count)))
+				kfree_rcu(p, u.head);
+			goto retry;
+		}
 	}
-	spin_unlock(&dentry->d_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(take_dentry_name_snapshot);
There's a bunch of places where we are accessing dentry names without
sufficient protection and where locking environment is not predictable
enough to fix the things that way; take_dentry_name_snapshot() is
one variant of solution. It does, however, have a problem - copying
is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.

How about the following (completely untested)?

Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
that avoids any stores to shared data objects and in case of long names
we are down to (unavoidable) atomic_inc on the external_name refcount.

Makes the thing safer as well - the areas where ->d_seq is held odd are
all nested inside the areas where ->d_lock is held, and the latter are
much more numerous.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---