| Message ID | 20241209035251.GV3387508@ZenIV (mailing list archive) |
|---|---|
| State | New |
| Series | [RFC] make take_dentry_name_snapshot() lockless |
On Mon, Dec 09, 2024 at 03:52:51AM +0000, Al Viro wrote:
> There's a bunch of places where we are accessing dentry names without
> sufficient protection and where locking environment is not predictable
> enough to fix the things that way; take_dentry_name_snapshot() is
> one variant of solution. It does, however, have a problem - copying
> is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.
>
> How about the following (completely untested)?
>
> Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
> that avoids any stores to shared data objects and in case of long names
> we are down to (unavoidable) atomic_inc on the external_name refcount.
>
> Makes the thing safer as well - the areas where ->d_seq is held odd are
> all nested inside the areas where ->d_lock is held, and the latter are
> much more numerous.

Is there a problem retaining the lock acquire if things fail?

As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
progress.

I don't think there is a *real* workload where this would be a problem,
but with core counts seen today one may be able to purposefully introduce
stalls when running this.

>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/dcache.c b/fs/dcache.c
> index b4d5e9e1e43d..78fd7e2a3011 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -329,16 +329,34 @@ static inline int dname_external(const struct dentry *dentry)
>
>  void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
>  {
> -	spin_lock(&dentry->d_lock);
> +	unsigned seq;
> +
> +	rcu_read_lock();
> +retry:
> +	seq = read_seqcount_begin(&dentry->d_seq);
>  	name->name = dentry->d_name;
> -	if (unlikely(dname_external(dentry))) {
> -		atomic_inc(&external_name(dentry)->u.count);
> -	} else {
> -		memcpy(name->inline_name, dentry->d_iname,
> -		       dentry->d_name.len + 1);
> +	if (read_seqcount_retry(&dentry->d_seq, seq))
> +		goto retry;
> +	// ->name and ->len are at least consistent with each other, so if
> +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> +	if (likely(name->name.name == dentry->d_iname)) {
> +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
>  		name->name.name = name->inline_name;
> +		if (read_seqcount_retry(&dentry->d_seq, seq))
> +			goto retry;
> +	} else {
> +		struct external_name *p;
> +		p = container_of(name->name.name, struct external_name, name[0]);
> +		// get a valid reference
> +		if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> +			goto retry;
> +		if (read_seqcount_retry(&dentry->d_seq, seq)) {
> +			if (unlikely(atomic_dec_and_test(&p->u.count)))
> +				kfree_rcu(p, u.head);
> +			goto retry;
> +		}
>  	}
> -	spin_unlock(&dentry->d_lock);
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL(take_dentry_name_snapshot);
>
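[Editorial note: a minimal sketch of the bounded-retry idea raised above - try the
seqcount fast path a few times, then fall back to ->d_lock for guaranteed forward
progress. The function name, retry limit, and elided copy steps are illustrative
assumptions, not code from the thread.]

	/* Hypothetical sketch only: bounded lockless retries, then fall back
	 * to ->d_lock so a stream of renames cannot stall the caller forever. */
	void take_dentry_name_snapshot_bounded(struct name_snapshot *name,
					       struct dentry *dentry)
	{
		int tries = 3;	/* illustrative limit */

		rcu_read_lock();
		while (tries--) {
			unsigned seq = read_seqcount_begin(&dentry->d_seq);

			/* ... lockless copy as in the patch above ... */
			if (!read_seqcount_retry(&dentry->d_seq, seq)) {
				rcu_read_unlock();
				return;
			}
		}
		rcu_read_unlock();

		/* slow path: ->d_lock excludes renames, so no retry is needed */
		spin_lock(&dentry->d_lock);
		/* ... copy under the lock, as the current code does ... */
		spin_unlock(&dentry->d_lock);
	}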
On Mon, Dec 09, 2024 at 07:33:00AM +0100, Mateusz Guzik wrote:
> On Mon, Dec 09, 2024 at 03:52:51AM +0000, Al Viro wrote:
> > There's a bunch of places where we are accessing dentry names without
> > sufficient protection and where locking environment is not predictable
> > enough to fix the things that way; take_dentry_name_snapshot() is
> > one variant of solution. It does, however, have a problem - copying
> > is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.
> >
> > How about the following (completely untested)?
> >
> > Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
> > that avoids any stores to shared data objects and in case of long names
> > we are down to (unavoidable) atomic_inc on the external_name refcount.
> >
> > Makes the thing safer as well - the areas where ->d_seq is held odd are
> > all nested inside the areas where ->d_lock is held, and the latter are
> > much more numerous.
>
> Is there a problem retaining the lock acquire if things fail?
>
> As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
> progress.
>
> I don't think there is a *real* workload where this would be a problem,
> but with core counts seen today one may be able to purposefully introduce
> stalls when running this.

By renaming the poor sucker back and forth in a tight loop? Would be hard
to trigger on anything short of ramfs...

Hell knows - if anything, I was thinking about a variant that would
*not* loop at all, but take seq as an argument and return whether it
had been successful. That could be adapted to build such a thing -
with "pass ->d_seq sampled value (always even) *or* call it with
the name stabilized in some other way (e.g. ->d_lock, rename_lock or
->s_vfs_rename_mutex held) and pass 1 as argument to suppress checks"
for calling conventions.

The thing is, when it's done to a chain of ancestors of some dentry,
with rename_lock retries around the entire thing, running into a ->d_seq
change pretty much guarantees that you'll need to retry the whole thing
anyway.
On Mon, Dec 9, 2024 at 7:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Dec 09, 2024 at 07:33:00AM +0100, Mateusz Guzik wrote:
> > Is there a problem retaining the lock acquire if things fail?
> >
> > As in maybe loop 2-3 times, but eventually take the lock to guarantee forward
> > progress.
> >
> > I don't think there is a *real* workload where this would be a problem,
> > but with core counts seen today one may be able to purposefully introduce
> > stalls when running this.
>
> By renaming the poor sucker back and forth in a tight loop? Would be hard
> to trigger on anything short of ramfs...
>
> Hell knows - if anything, I was thinking about a variant that would
> *not* loop at all, but take seq as an argument and return whether it
> had been successful. That could be adapted to build such a thing -
> with "pass ->d_seq sampled value (always even) *or* call it with
> the name stabilized in some other way (e.g. ->d_lock, rename_lock or
> ->s_vfs_rename_mutex held) and pass 1 as argument to suppress checks"
> for calling conventions.
>
> The thing is, when it's done to a chain of ancestors of some dentry,
> with rename_lock retries around the entire thing, running into a ->d_seq
> change pretty much guarantees that you'll need to retry the whole thing
> anyway.

I'm not caught up on the lists, I presume this came up with grabbing
the name on execve?

Indeed, a variant which takes a seq argument would work nicely here,
possibly with a dedicated wrapper handling the locking for standalone
consumers.

The VFS layer is rather inconsistent on the locking front -- some places
try a seq-protected op once and resort to locking or erroring out, while
others keep looping with no forward-progress guarantee, at least in
principle. How much of a problem that is presumably depends on the
particular case. But my point is that if the indefinite looping (however
realistic or hypothetical) can be trivially avoided, it should be.
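[Editorial note: a rough sketch of the calling convention discussed in the two
messages above - an assumption about how it might look, not code from the thread.
The caller passes a sampled (even) ->d_seq value, or 1 when the name is already
stabilized by ->d_lock, rename_lock or ->s_vfs_rename_mutex, and the helper reports
whether the snapshot is usable.]

	/* Sketch; name and signature are assumed, not from the patch. */
	bool __take_dentry_name_snapshot_seq(struct name_snapshot *name,
					     struct dentry *dentry, unsigned seq)
	{
		/* ... lockless copy of ->d_name as in the patch above ... */
		if (seq == 1)		/* name held stable by the caller */
			return true;
		return !read_seqcount_retry(&dentry->d_seq, seq);
	}

A standalone consumer could then wrap this in its own bounded loop or take
->d_lock itself, while path-walking callers that already retry under rename_lock
would simply propagate the failure.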
On Mon, Dec 09, 2024 at 08:18:35AM +0100, Mateusz Guzik wrote:

> I'm not caught up on the lists, I presume this came up with grabbing
> the name on execve?

Not at all. fexecve() nonsense can use either variant (or go fuck itself,
for all I care) - it's not worth optimizing for.

No, the problem is in the things like ceph_mdsc_build_path()...
On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>  void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)

Editing the patch down so that you just see the end result (ie all the
"-" lines are gone):

>  {
> +	unsigned seq;
> +
> +	rcu_read_lock();
> +retry:
> +	seq = read_seqcount_begin(&dentry->d_seq);
>  	name->name = dentry->d_name;
> +	if (read_seqcount_retry(&dentry->d_seq, seq))
> +		goto retry;

Ugh. This early retry is cheap on x86, but on a lot of architectures it
will be a fairly expensive read barrier.

I see why you do it, since you want 'name.len' and 'name.name' to be
consistent, but I do hate it.

The name consistency issue is really annoying. Do we really need it
here? Because honestly, what you actually *really* care about here is
whether it's inline or not, and you do that test right afterwards:

> +	// ->name and ->len are at least consistent with each other, so if
> +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> +	if (likely(name->name.name == dentry->d_iname)) {
> +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);

and here it would actually be more efficient to just use a
constant-sized memcpy with DNAME_INLINE_LEN, and never care about
'len' at all.

And if the length in name.len isn't right (we'll return it to the
user), the *final* seqcount check will catch it.

And in the other case, all you care about is the ref-count.

End result: I think your early retry is pointless and should just be
avoided. Instead, you should make sure to just read dentry->d_name.name
with a READ_ONCE() (and read the len any way you want).

Hmm?

               Linus
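[Editorial note: roughly what that suggestion amounts to, as a sketch of the fast
path inside the retry loop - drop the early check, read the pointer once, copy the
whole inline buffer regardless of length, and rely on the single final ->d_seq
check. The external-name branch is elided; this is not code from the thread.]

	seq = read_seqcount_begin(&dentry->d_seq);
	s = READ_ONCE(dentry->d_name.name);
	name->name.len = dentry->d_name.len;	/* a torn value is caught below */
	if (likely(s == dentry->d_iname)) {
		/* constant-sized copy; both buffers are DNAME_INLINE_LEN bytes */
		memcpy(name->inline_name, dentry->d_iname, DNAME_INLINE_LEN);
		name->name.name = name->inline_name;
	} else {
		/* external name: take a reference instead of copying */
	}
	if (read_seqcount_retry(&dentry->d_seq, seq))
		goto retry;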
On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +	struct external_name *p;
> +	p = container_of(name->name.name, struct external_name, name[0]);
> +	// get a valid reference
> +	if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> +		goto retry;

Oh - this is very much *not* safe.

The other comment I had was really about "that's bad for performance".
But this is actually actively buggy.

If the external name ref has gone down to zero, we can *not* do that

	atomic_inc_not_zero(..)

thing any more, because the refcount is in a union with the rcu_head
for delaying the free.

In other words: the *name* will exist for the duration of the
rcu_read_lock() we hold, but that "p->u.count" will not. When the
refcount has gone to zero, the refcount is no longer usable.

IOW, you may be happily incrementing what is now a RCU list head
rather than a count.

So NAK. This cannot work.

It's probably easily fixable by just not using a union in struct
external_name, and just having separate fields for the refcount and
the rcu_head, but in the current state your patch is fundamentally and
dangerously buggy.

(Dangerously, because you will never *ever* actually hit that tiny
race - you need to race with a concurrent rename *and* a drop of the
other dentry that got the external ref, so in practice this is likely
entirely impossible to ever hit, but that only makes it more subtle
and dangerous).

               Linus
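[Editorial note: for context, the union in question, as fs/dcache.c stands before
the patch. Once the last reference is dropped, the same storage is reused as the
rcu_head for the delayed kfree, so the count may no longer be touched.]

	struct external_name {
		union {
			atomic_t count;		/* refcount while the name is live */
			struct rcu_head head;	/* reused by kfree_rcu() once count hits 0 */
		} u;
		unsigned char name[];
	};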
On Mon, Dec 09, 2024 at 11:06:48AM -0800, Linus Torvalds wrote:
> On Sun, 8 Dec 2024 at 19:52, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > +	struct external_name *p;
> > +	p = container_of(name->name.name, struct external_name, name[0]);
> > +	// get a valid reference
> > +	if (unlikely(!atomic_inc_not_zero(&p->u.count)))
> > +		goto retry;
>
> Oh - this is very much *not* safe.
>
> The other comment I had was really about "that's bad for performance".
> But this is actually actively buggy.
>
> If the external name ref has gone down to zero, we can *not* do that
>
> 	atomic_inc_not_zero(..)
>
> thing any more, because the refcount is in a union with the rcu_head
> for delaying the free.

D'oh. Right you are; missed it...

> In other words: the *name* will exist for the duration of the
> rcu_read_lock() we hold, but that "p->u.count" will not. When the
> refcount has gone to zero, the refcount is no longer usable.
>
> IOW, you may be happily incrementing what is now a RCU list head
> rather than a count.
>
> So NAK. This cannot work.
>
> It's probably easily fixable by just not using a union in struct
> external_name, and just having separate fields for the refcount and
> the rcu_head, but in the current state your patch is fundamentally and
> dangerously buggy.

Agreed. And yes, separating the fields (and slapping a comment explaining
why they can not be combined) would be the easiest solution - any attempts
to be clever here would be too brittle for no good reason.
On Mon, Dec 09, 2024 at 10:27:04AM -0800, Linus Torvalds wrote:
> The name consistency issue is really annoying. Do we really need it
> here? Because honestly, what you actually *really* care about here is
> whether it's inline or not, and you do that test right afterwards:
>
> > +	// ->name and ->len are at least consistent with each other, so if
> > +	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
> > +	if (likely(name->name.name == dentry->d_iname)) {
> > +		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
>
> and here it would actually be more efficient to just use a
> constant-sized memcpy with DNAME_INLINE_LEN, and never care about
> 'len' at all.

Actually, taking a look at what's generated for that memcpy()... *ow*

amd64 is fine, but anything that doesn't like unaligned accesses ends up
with really awful code. gcc does not realize that the pointers are
word-aligned. What's more, even

	unsigned long v[5];
	void f(unsigned long *w)
	{
		memcpy(v, w, sizeof(v));
	}

is not enough to convince the damn thing - try it for e.g. alpha and
you'll see arseloads of extq/insq/mskq, all inlined.

And yes, they are aligned - d_iname follows a pointer, inline_name follows
struct qstr, i.e. u64 + pointer. How about we add struct inlined_name {
unsigned char name[DNAME_INLINE_LEN]; }; and turn d_iname and inline_name
into anon unions with that? Hell, might even make it an array of unsigned
long and use that to deal with this

	} else {
		/*
		 * Both are internal.
		 */
		unsigned int i;
		BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
		for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
			swap(((long *) &dentry->d_iname)[i],
			     ((long *) &target->d_iname)[i]);
		}
	}

in swap_names(). With struct assignment in the corresponding case in
copy_name() and in take_dentry_name_snapshot() - that does generate sane
code...
On Mon, Dec 09, 2024 at 09:17:08PM +0000, Al Viro wrote:
> And yes, they are aligned - d_iname follows a pointer, inline_name follows
> struct qstr, i.e. u64 + pointer. How about we add struct inlined_name {
> unsigned char name[DNAME_INLINE_LEN]; }; and turn d_iname and inline_name
> into anon unions with that? Hell, might even make it an array of unsigned
> long and use that to deal with this
>
> 	} else {
> 		/*
> 		 * Both are internal.
> 		 */
> 		unsigned int i;
> 		BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
> 		for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
> 			swap(((long *) &dentry->d_iname)[i],
> 			     ((long *) &target->d_iname)[i]);
> 		}
> 	}
>
> in swap_names(). With struct assignment in the corresponding case in
> copy_name() and in take_dentry_name_snapshot() - that does generate sane
> code...

Do you have any objections to the diff below? Completely untested and
needs to be split in two...

It does seem to generate decent code; it obviously will need to be
profiled - copy_name() in the common (short-to-short) case copies the
entire thing, all 5 words of it on 64bit, instead of memcpy() of just
the amount that needs to be copied. That thing may cross the cacheline
boundary, but both cachelines had been freshly brought into cache and
dirtied, so...

diff --git a/fs/dcache.c b/fs/dcache.c
index b4d5e9e1e43d..687558622acf 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -296,10 +296,8 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 }
 
 struct external_name {
-	union {
-		atomic_t count;
-		struct rcu_head head;
-	} u;
+	atomic_t count;		// ->count and ->head can't be combined
+	struct rcu_head head;	// see take_dentry_name_snapshot()
 	unsigned char name[];
 };
 
@@ -329,16 +327,33 @@ static inline int dname_external(const struct dentry *dentry)
 
 void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
 {
-	spin_lock(&dentry->d_lock);
-	name->name = dentry->d_name;
-	if (unlikely(dname_external(dentry))) {
-		atomic_inc(&external_name(dentry)->u.count);
-	} else {
-		memcpy(name->inline_name, dentry->d_iname,
-		       dentry->d_name.len + 1);
+	unsigned seq;
+	const unsigned char *s;
+
+	rcu_read_lock();
+retry:
+	seq = read_seqcount_begin(&dentry->d_seq);
+	s = READ_ONCE(dentry->d_name.name);
+	name->name.hash_len = dentry->d_name.hash_len;
+	if (likely(s == dentry->d_iname)) {
 		name->name.name = name->inline_name;
+		name->__inline_name = dentry->__d_iname;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto retry;
+	} else {
+		struct external_name *p;
+		p = container_of(s, struct external_name, name[0]);
+		name->name.name = s;
+		// get a valid reference
+		if (unlikely(!atomic_inc_not_zero(&p->count)))
+			goto retry;
+		if (read_seqcount_retry(&dentry->d_seq, seq)) {
+			if (unlikely(atomic_dec_and_test(&p->count)))
+				kfree_rcu(p, head);
+			goto retry;
+		}
 	}
-	spin_unlock(&dentry->d_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(take_dentry_name_snapshot);
 
@@ -347,8 +362,8 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
 	if (unlikely(name->name.name != name->inline_name)) {
 		struct external_name *p;
 		p = container_of(name->name.name, struct external_name, name[0]);
-		if (unlikely(atomic_dec_and_test(&p->u.count)))
-			kfree_rcu(p, u.head);
+		if (unlikely(atomic_dec_and_test(&p->count)))
+			kfree_rcu(p, head);
 	}
 }
 EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -386,7 +401,7 @@ static void dentry_free(struct dentry *dentry)
 	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
-		if (likely(atomic_dec_and_test(&p->u.count))) {
+		if (likely(atomic_dec_and_test(&p->count))) {
 			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
@@ -1667,7 +1682,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 			kmem_cache_free(dentry_cache, dentry);
 			return NULL;
 		}
-		atomic_set(&p->u.count, 1);
+		atomic_set(&p->count, 1);
 		dname = p->name;
 	} else {
 		dname = dentry->d_iname;
@@ -2749,11 +2764,9 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
 			 * Both are internal.
 			 */
 			unsigned int i;
-			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
-				swap(((long *) &dentry->d_iname)[i],
-				     ((long *) &target->d_iname)[i]);
-			}
+			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++)
+				swap(dentry->__d_iname.name[i],
+				     target->__d_iname.name[i]);
 		}
 	}
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
@@ -2765,16 +2778,15 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
 	if (unlikely(dname_external(dentry)))
 		old_name = external_name(dentry);
 	if (unlikely(dname_external(target))) {
-		atomic_inc(&external_name(target)->u.count);
+		atomic_inc(&external_name(target)->count);
 		dentry->d_name = target->d_name;
 	} else {
-		memcpy(dentry->d_iname, target->d_name.name,
-				target->d_name.len + 1);
+		dentry->__d_iname = target->__d_iname;
 		dentry->d_name.name = dentry->d_iname;
 		dentry->d_name.hash_len = target->d_name.hash_len;
 	}
-	if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
-		kfree_rcu(old_name, u.head);
+	if (old_name && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
 }
 
 /*
@@ -3189,6 +3201,7 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
+	BUILD_BUG_ON(DNAME_INLINE_LEN != sizeof(struct short_name_store));
 	/*
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1f28f56fd406..d604a4826765 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -77,6 +77,10 @@ extern const struct qstr dotdot_name;
 # endif
 #endif
 
+struct short_name_store {
+	unsigned long name[DNAME_INLINE_LEN / sizeof(unsigned long)];
+};
+
 #define d_lock	d_lockref.lock
 
 struct dentry {
@@ -88,7 +92,10 @@ struct dentry {
 	struct qstr d_name;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
-	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+	union {
+		unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+		struct short_name_store __d_iname;
+	};
 	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
 
 	/* Ref lookup also touches following */
@@ -590,7 +597,10 @@ static inline struct inode *d_real_inode(const struct dentry *dentry)
 
 struct name_snapshot {
 	struct qstr name;
-	unsigned char inline_name[DNAME_INLINE_LEN];
+	union {
+		unsigned char inline_name[DNAME_INLINE_LEN];
+		struct short_name_store __inline_name;
+	};
 };
 void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
 void release_dentry_name_snapshot(struct name_snapshot *);
On Mon, 9 Dec 2024 at 14:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Do you have any objections to the diff below?

I'd split it up a bit more, and in particular, I think I'd want
DNAME_INLINE_LEN to be defined in terms of the long-words, rather than
doing it the other way around with that

	BUILD_BUG_ON(DNAME_INLINE_LEN != sizeof(struct short_name_store));

IOW, I'd *start* with something like the attached, and then build on that..

(By all means turn that

	unsigned long d_iname_words[DNAME_INLINE_WORDS];

into a struct so that you can then do the struct assignment for it - I
just hate the pattern of starting with a byte size and then doing
"DNAME_INLINE_LEN / sizeof(unsigned long)".)

Hmm?

               Linus
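[Editorial note: a sketch of the direction suggested above - make the word count
the primary constant and derive the byte length from it. The word counts below are
illustrative values, not taken from the thread.]

	#ifdef CONFIG_64BIT
	# define DNAME_INLINE_WORDS	5	/* illustrative */
	#else
	# define DNAME_INLINE_WORDS	9	/* illustrative */
	#endif
	#define DNAME_INLINE_LEN	(DNAME_INLINE_WORDS * sizeof(unsigned long))

	struct dentry {
		/* ... */
		unsigned long d_iname_words[DNAME_INLINE_WORDS];	/* small names */
		/* ... */
	};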
On Mon, 9 Dec 2024 at 14:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, I'd *start* with something like the attached, and then build on that..

Key word being "something like". I just checked, and that suggested
patch will cause build issues in lib/test_printf.c, because it does
things like

	.d_iname = "foo"

and it turns out that doesn't work when it's a flex-array.

It doesn't have to be a flex array, of course - it could easily just
continue to use DNAME_INLINE_LEN (now just defined in terms of "words *
sizeof(unsigned long)").

I did that flex array thing mainly to see if somebody ends up
depending on the array as such. Clearly that test_printf.c code does
exactly that, but looks like nothing else is.

               Linus
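[Editorial note: the breakage is of this form - lib/test_printf.c initializes the
member by name with a string literal, which stops compiling once d_iname becomes a
flexible array member. This is a simplified illustration, not the exact test code.]

	static const struct dentry test_dentry = {
		.d_iname = "foo",	/* invalid initializer for a flex-array member */
	};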
On Mon, Dec 09, 2024 at 02:55:45PM -0800, Linus Torvalds wrote:
> On Mon, 9 Dec 2024 at 14:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > IOW, I'd *start* with something like the attached, and then build on that..
>
> Key word being "something like". I just checked, and that suggested
> patch will cause build issues in lib/test_printf.c, because it does
> things like
>
> 	.d_iname = "foo"
>
> and it turns out that doesn't work when it's a flex-array.
>
> It doesn't have to be a flex array, of course - it could easily just
> continue to use DNAME_INLINE_LEN (now just defined in terms of "words *
> sizeof(unsigned long)").
>
> I did that flex array thing mainly to see if somebody ends up
> depending on the array as such. Clearly that test_printf.c code does
> exactly that, but looks like nothing else is.

Actually, grepping for DNAME_INLINE_LEN brings some interesting hits:

drivers/net/ieee802154/adf7242.c:1165:	char debugfs_dir_name[DNAME_INLINE_LEN + 1];

cargo-culted, AFAICS; would be better off with a constant of their own.

fs/ext4/fast_commit.c:326:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
fs/ext4/fast_commit.c:452:	if (dentry->d_name.len > DNAME_INLINE_LEN) {
fs/ext4/fast_commit.c:1332:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
fs/ext4/fast_commit.h:113:	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */

Looks like that might want struct name_snapshot, along with
{take,release}_dentry_name_snapshot().

fs/libfs.c:1792:	char strbuf[DNAME_INLINE_LEN];
fs/libfs.c:1819:	if (len <= DNAME_INLINE_LEN - 1) {

The memcpy() in there might very well be better off as a struct assignment.
And that thing probably ought to migrate to fs/dcache.c - the RCU-related
considerations in it are much trickier than is usual for ->d_compare()
instances.
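[Editorial note: the suggested conversion for the ext4 fast-commit code would boil
down to the usual snapshot pattern shown below; how exactly it slots into
fast_commit.c is left open in the thread, so this is only a usage sketch.]

	struct name_snapshot snap;

	take_dentry_name_snapshot(&snap, dentry);
	/* snap.name.name / snap.name.len are a stable private copy of the name,
	 * unaffected by concurrent renames of the dentry */
	/* ... use the copied name ... */
	release_dentry_name_snapshot(&snap);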
On Mon, Dec 09, 2024 at 11:12:37PM +0000, Al Viro wrote:
>
> Actually, grepping for DNAME_INLINE_LEN brings some interesting hits:
>
> drivers/net/ieee802154/adf7242.c:1165:	char debugfs_dir_name[DNAME_INLINE_LEN + 1];
>
> cargo-culted, AFAICS; would be better off with a constant of their own.
>
> fs/ext4/fast_commit.c:326:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> fs/ext4/fast_commit.c:452:	if (dentry->d_name.len > DNAME_INLINE_LEN) {
> fs/ext4/fast_commit.c:1332:		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> fs/ext4/fast_commit.h:113:	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
>
> Looks like that might want struct name_snapshot, along with
> {take,release}_dentry_name_snapshot().

See viro/vfs.git#work.dcache. I've thrown the ext4/fast_commit conversion
into the end of that pile.

NOTE: that stuff obviously needs profiling. It does survive light testing
(boot/ltp/xfstests), but review and more testing (including serious
profiling) is obviously needed.

Patches in followups...
diff --git a/fs/dcache.c b/fs/dcache.c
index b4d5e9e1e43d..78fd7e2a3011 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -329,16 +329,34 @@ static inline int dname_external(const struct dentry *dentry)
 
 void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
 {
-	spin_lock(&dentry->d_lock);
+	unsigned seq;
+
+	rcu_read_lock();
+retry:
+	seq = read_seqcount_begin(&dentry->d_seq);
 	name->name = dentry->d_name;
-	if (unlikely(dname_external(dentry))) {
-		atomic_inc(&external_name(dentry)->u.count);
-	} else {
-		memcpy(name->inline_name, dentry->d_iname,
-		       dentry->d_name.len + 1);
+	if (read_seqcount_retry(&dentry->d_seq, seq))
+		goto retry;
+	// ->name and ->len are at least consistent with each other, so if
+	// ->name points to dentry->d_iname, ->len is below DNAME_INLINE_LEN
+	if (likely(name->name.name == dentry->d_iname)) {
+		memcpy(name->inline_name, dentry->d_iname, name->name.len + 1);
 		name->name.name = name->inline_name;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto retry;
+	} else {
+		struct external_name *p;
+		p = container_of(name->name.name, struct external_name, name[0]);
+		// get a valid reference
+		if (unlikely(!atomic_inc_not_zero(&p->u.count)))
+			goto retry;
+		if (read_seqcount_retry(&dentry->d_seq, seq)) {
+			if (unlikely(atomic_dec_and_test(&p->u.count)))
+				kfree_rcu(p, u.head);
+			goto retry;
+		}
 	}
-	spin_unlock(&dentry->d_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(take_dentry_name_snapshot);
There's a bunch of places where we are accessing dentry names without
sufficient protection and where locking environment is not predictable
enough to fix the things that way; take_dentry_name_snapshot() is
one variant of solution. It does, however, have a problem - copying
is cheap, but bouncing ->d_lock may be nasty on seriously shared dentries.

How about the following (completely untested)?

Use ->d_seq instead of grabbing ->d_lock; in case of shortname dentries
that avoids any stores to shared data objects and in case of long names
we are down to (unavoidable) atomic_inc on the external_name refcount.

Makes the thing safer as well - the areas where ->d_seq is held odd are
all nested inside the areas where ->d_lock is held, and the latter are
much more numerous.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---