[1/2] lockref: speculatively spin waiting for the lock to be released

Message ID 20240613001215.648829-2-mjguzik@gmail.com (mailing list archive)
State New
Series stop lockref from degrading to locked-only ops

Commit Message

Mateusz Guzik June 13, 2024, 12:12 a.m. UTC
The usual inc/dec scheme is to try to do it with atomics if the lock is
not taken and to fall back to a locked operation otherwise.

If the lock is only transiently held, the routines could instead
speculatively wait for it to be released and avoid taking it altogether.
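
For reference, the pre-existing fast path in lib/lockref.c boils down to
roughly the following (a simplified sketch of lockref_get(); the real
code goes through the CMPXCHG_LOOP() macro in lib/lockref.c):

void lockref_get(struct lockref *lockref)
{
        struct lockref old, new;
        int retry = 100;

        old.lock_count = READ_ONCE(lockref->lock_count);
        while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
                new = old;
                new.count++;
                /* on failure try_cmpxchg refreshes 'old' with the current value */
                if (try_cmpxchg64_relaxed(&lockref->lock_count,
                                          &old.lock_count, new.lock_count))
                        return;         /* lockless fast path won */
                if (!--retry)
                        break;
                cpu_relax();
        }

        /* the lock was observed taken (or too much churn): take it ourselves */
        spin_lock(&lockref->lock);
        lockref->count++;
        spin_unlock(&lockref->lock);
}

Note how a single observation of the lock being held is enough to punt
to the spinlock, which is what sets up the degradation described below.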

This has the minor benefit of slightly speeding up such cases, which do
happen in real workloads.

More importantly this fixes a corner case where lockref consumers
degrade to locked operation and are unable to go back to atomics.

Consider a continuous stream of lockref ops from several threads and
another thread taking the lock for a brief moment. Any lockref user
which manages to observe the lock as taken is going to fall back to
taking it itself, which in turn increases the likelihood that more
threads executing the code will do the same. This can eventually
culminate in nobody being able to go back to atomics on the problematic
lockref.

While this is not a state which can persist in a real workload for more
than a few calls, it very much can degrade permanently when benchmarking
and consequently grossly distort the results.

Here is an example of will-it-scale doing a 20-way stat(2) on the same
file residing on tmpfs, reporting once per second with the number of ops
completed:
[snip several ok results]
min:417297 max:426249 total:8401766	<-- expected performance
min:219024 max:221764 total:4398413	<-- some locking started
min:62855 max:64949 total:1273712	<-- everyone degraded
min:62472 max:64873 total:1268733
min:62179 max:63843 total:1256375
[snip remaining bad results]

While I did not try to figure out who transiently took the lock (it was
something outside of the benchmark), I devised a trivial reproducer
which triggers the problem almost every time: merely issue "ls" of the
directory containing the tested file (in this case: "ls /tmp").

The problem does not persist with the fix below.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 lib/lockref.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

Comments

Linus Torvalds June 13, 2024, 1:23 a.m. UTC | #1
On Wed, 12 Jun 2024 at 17:12, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> While I did not try to figure out who transiently took the lock (it was
> something outside of the benchmark), I devised a trivial reproducer
> which triggers the problem almost every time: merely issue "ls" of the
> directory containing the tested file (in this case: "ls /tmp").

So I have no problem with your patch 2/2 - moving the lockref data
structure away from everything else that can be shared read-only makes
a ton of sense independently of anything else.

Except you also randomly increased a retry count in there, which makes no sense.

But this patch 1/2 makes me go "Eww, hacky hacky".

We already *have* the retry loop, it's just that currently it only
covers the cmpxchg failures.

The natural thing to do is to just make the "wait for unlocked" be
part of the same loop.

In fact, I have this memory of trying this originally, and it not
mattering and just making the code uglier, but that may be me
confusing myself. It's a *loong* time ago.

With the attached patch, lockref_get() (to pick one random case) ends
up looking like this:

        mov    (%rdi),%rax
        mov    $0x64,%ecx
  loop:
        test   %eax,%eax
        jne    locked
        mov    %rax,%rdx
        sar    $0x20,%rdx
        add    $0x1,%edx
        shl    $0x20,%rdx
        lock cmpxchg %rdx,(%rdi)
        jne    fail
        // SUCCESS
        ret
  locked:
        pause
        mov    (%rdi),%rax
  fail:
        sub    $0x1,%ecx
        jne    loop

(with the rest being the "take lock and go slow" case).

It seems much better to me to have *one* retry loop that handles both
the causes of failures.
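
In C terms the combined loop above is roughly (a hand-written
illustration of the generated code, not the attached patch itself):

        struct lockref old, new;
        int retry;

        old.lock_count = READ_ONCE(lockref->lock_count);
        for (retry = 100; retry; retry--) {
                /* lockref_locked() is the readability helper mentioned below */
                if (lockref_locked(old)) {
                        /* lock held: wait and re-read instead of bailing */
                        cpu_relax();
                        old.lock_count = READ_ONCE(lockref->lock_count);
                        continue;
                }
                new = old;
                new.count++;
                /* on failure 'old' is refreshed with the current value */
                if (try_cmpxchg64_relaxed(&lockref->lock_count,
                                          &old.lock_count, new.lock_count))
                        return;         /* success, no spinlock taken */
        }
        /* fall through to the spin_lock() slow path */

so seeing the lock held now just costs a retry and a cpu_relax() rather
than an immediate punt to the spinlock.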

Entirely untested, I only looked at the generated code and it looked
reasonable. The patch may be entirely broken for some random reason I
didn't think of.

And in case you wonder, that 'lockref_locked()' macro I introduce is
purely to make the code more readable. Without it, that one conditional
line ends up being insanely long; the macro is there just to break
things up into slightly more manageable chunks.

Mind testing this approach instead?

                 Linus
Linus Torvalds June 13, 2024, 1:49 a.m. UTC | #2
On Wed, 12 Jun 2024 at 18:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.

Oh, and while I like my approach a lot more than your patch, I do
think that the real issue here is likely that something takes the
d_lock way too much.

One of the ideas behind the lockref was that locking should be an
exceptional thing when something special happens. So things like final
dput() and friends.

What I *think* is going on - judging by your description of how you
triggered this - is that sadly our libfs 'readdir()' thing is pretty
nasty.

It does use d_lock a lot for the cursor handling, and things like
scan_positives() in particular.

I understand *why* it does that, and maybe it's practically unfixable,
but I do think the most likely deeper reason for that "go into slow
mode" is the cursor on the directory causing issues.

Put another way: while I think doing the retry loop will help
benchmarks, it would be lovely if you were to look at that arguably
deeper issue of the 'd_sib' list.

                   Linus
Mateusz Guzik June 13, 2024, 6:09 a.m. UTC | #3
On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> So I have no problem with your patch 2/2 - moving the lockref data
> structure away from everything else that can be shared read-only makes
> a ton of sense independently of anything else.
> 
> Except you also randomly increased a retry count in there, which makes no sense.
> 

Cmon man, that's a change which unintentionally crept into the patch and
I failed to notice.

> But this patch 1/2 makes me go "Eww, hacky hacky".
> 
> We already *have* the retry loop, it's just that currently it only
> covers the cmpxchg failures.
> 
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.
> 

I was playing with a bpftrace probe reporting how many spins were
performed waiting for the lock. For my intentional usage with ls it was
always < 30 or so. The random-ass intruder which messes with my bench on
occasion needed over 100.

The bump is something I must have fat-fingered into the wrong patch.

Ultimately even if *some* iterations will still take the lock, they
should be able to avoid it next time around, so the permanent
degradation problem is still solved.

> Mind testing this approach instead?
> 

So I'm not going to argue about the fix.

I tested your code and confirm it fixes the problem, nothing blows up
either and fwiw I don't see any bugs in it.

When writing the commit message feel free to use mine in whatever capacity
(including none) you want. 

On Wed, Jun 12, 2024 at 06:49:00PM -0700, Linus Torvalds wrote:

> One of the ideas behind the lockref was that locking should be an
> exceptional thing when something special happens. So things like final
> dput() and friends.
> 
> What I *think* is going on - judging by your description of how you
> triggered this - is that sadly our libfs 'readdir()' thing is pretty
> nasty.
> 
> It does use d_lock a lot for the cursor handling, and things like
> scan_positives() in particular.
> 
> I understand *why* it does that, and maybe it's practically unfixable,
> but I do think the most likely deeper reason for that "go into slow
> mode" is the cursor on the directory causing issues.
> 
> Put another way: while I think doing the retry loop will help
> benchmarks, it would be lovely if you were to look at that arguably
> deeper issue of the 'd_sib' list.
> 

I think lockref claiming to be a general locking facility means it
should not be suffering the degradation problem to begin with, so that
would be the real problem as far as lockref goes.

For vfs as a whole I do agree that the d_lock usage in various spots is
probably avoidable and if so would be nice to get patched out, readdir
included. So Someone(tm) should certainly look into it.

As for me at the moment I have other stuff on my todo list, so I am not
going to do it for the time being.

Regardless, patching up lockref to dodge the lock is a step forward in
the right direction AFAICS.

=====

All that aside, you did not indicate how you want to move forward
regarding patch submission.

As indicated in my cover letter the vfs change (if it is to land) needs
to be placed *after* the lockref issue is addressed, otherwise it may
result in bogus regression reports. Thus I think it makes most sense to
just ship them together.

So maybe you can send me a patch and I'll send a v2 of the patchset with
that; alternatively, perhaps you can edit out the unintentional retry
adjustment from my dentry patch and handle the rest yourself.

Or maybe you have some other idea.

The thing that matters to me is to not land in a state where d_lockref
is moved around but the lockref corner case is not patched (even if it
is patched *after*). I really don't want to investigate a regression
report only to find it's caused by the above.
Christian Brauner June 13, 2024, 1:46 p.m. UTC | #4
> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

I've picked up Linus' patch and yours for testing into the vfs.inode.rcu
branch. It was trivial to fix your typo and to add Linus as author with
your commit message. Let's see what syzbot and that perf bot have to say.
Mateusz Guzik June 13, 2024, 1:50 p.m. UTC | #5
On Thu, Jun 13, 2024 at 3:46 PM Christian Brauner <brauner@kernel.org> wrote:
>
> > All that aside, you did not indicate how you want to move forward
> > regarding patch submission.
>
> I've picked up Linus' patch and yours for testing into the vfs.inode.rcu
> branch. It was trivial to fix your typo and to add Linus as author with
> your commit message. Let's see what syzbot and that perf bot have to say.

sounds good to me, thanks
Linus Torvalds June 13, 2024, 4:50 p.m. UTC | #6
On Wed, 12 Jun 2024 at 23:10, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> > So I have no problem with your patch 2/2 - moving the lockref data
> > structure away from everything else that can be shared read-only makes
> > a ton of sense independently of anything else.
> >
> > Except you also randomly increased a retry count in there, which makes no sense.
>
> Cmon man, that's a change which unintentionally crept into the patch and
> I failed to notice.

Heh. It wasn't entirely obvious that it was unintentional, since the
series very much was about that whole retry thing.

But good.

> I was playing with a bpftrace probe reporting how many spins were
> performed waiting for the lock. For my intentional usage with ls it was
> always < 30 or so. The random-ass intruder which messes with my bench on
> occasion needed over 100.

Ok. So keeping it at 100 is likely fine.

Of course, one option is to simply get rid of the retry count entirely,
and just make it all unconditional.

The fallback to locking is not technically required at all for the
USE_CMPXCHG_LOCKREF case.

That has some unfairness issues, though. But my point is mostly that
the count is probably not hugely important or meaningful.

> I tested your code and confirm it fixes the problem, nothing blows up
> either and fwiw I don't see any bugs in it.

I was more worried about some fat-fingering than any huge conceptual
bugs, and any minor testing with performance checks would find that.

Just as an example: my first attempt switched the while(likely(..)) to
the if (unlikely(..)) in the loop, but didn't add the "!" to negate
the test.

I caught it immediately and obviously never sent that broken thing out
(and it was one reason why I decided I needed to make the code more
readable with that lockref_locked() helper macro). But that's mainly
the kind of thing I was worried about.

> When writing the commit message feel free to use mine in whatever capacity
> (including none) you want.

Ack.

> I think lockref claiming to be a general locking facility means it
> should not be suffering the degradation problem to begin with, so that
> would be the real problem as far as lockref goes.

Well, it was certainly originally meant to be generic, but after more
than a decade, the number of users is three. And the two non-dcache
users are pretty darn random.

So it's effectively really just a dcache special case with two
filesystems that started using it just because the authors had clearly
seen the vfs use of it...

> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

lockref is fairly unusual, and is pretty much mine. The initial
concept was from Waiman Long:

   https://lore.kernel.org/all/1372268603-46748-1-git-send-email-Waiman.Long@hp.com/

but I ended up redoing it pretty much from scratch, and credited him
as author for the initial commit.

It's _technically_ a locking thing, but realistically it has never been
in the locking tree and it's effectively a dcache thing.

           Linus
Linus Torvalds June 13, 2024, 5 p.m. UTC | #7
On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
>
> I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.

Btw, if you added [patch 2/2] too, I think the exact byte counts in
the comments are a bit misleading.

The actual cacheline details will very much depend on 32-bit vs 64-bit
builds, but also on things like the slab allocator debug settings.

I think the important part is to keep the d_lockref - that is often
dirty and exclusive in the cache - away from the mostly RO parts of
the dentry that can be shared across CPU's in the cache.

So rather than talk about exact byte offsets, maybe just state that
overall rule?

              Linus
Mateusz Guzik June 13, 2024, 6:13 p.m. UTC | #8
On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
> >
> > I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.
>
> Btw, if you added [patch 2/2] too, I think the exact byte counts in
> the comments are a bit misleading.
>
> The actual cacheline details will very much depend on 32-bit vs 64-bit
> builds, but also on things like the slab allocator debug settings.
>

I indeed assumed "x86_64 production", with lines just copied from pahole.

However, to the best of my understanding the counts are what one
should expect on a 64-bit kernel.

That said:

> I think the important part is to keep the d_lockref - that is often
> dirty and exclusive in the cache - away from the mostly RO parts of
> the dentry that can be shared across CPU's in the cache.
>
> So rather than talk about exact byte offsets, maybe just state that
> overall rule?
>

I would assume the rule is pretty well known, and instead an indicator
of where things are (as added in my comments) would be welcome.

But if going that route, then perhaps:
"Make sure d_lockref does not share a cacheline with fields used by
RCU lookup, otherwise it can result in a significant throughput
reduction. You can use pahole(1) to check the layout."
[maybe a link to perfbook or something goes here?]

Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
(only active without whatever runtime debug messing with the layout).
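
Such a check could look roughly like this (untested sketch: the
DENTRY_CACHELINE() helper is made up for illustration, at file scope
static_assert() would be needed since BUILD_BUG_ON() only works inside a
function, and the asserts would have to be compiled out when lockdep and
friends inflate the layout):

/* would go right below the struct dentry definition */
#define DENTRY_CACHELINE(field) \
        (offsetof(struct dentry, field) / SMP_CACHE_BYTES)

/* d_lockref must not share a cacheline with the RCU lookup fields */
static_assert(DENTRY_CACHELINE(d_lockref) != DENTRY_CACHELINE(d_seq));
static_assert(DENTRY_CACHELINE(d_lockref) != DENTRY_CACHELINE(d_hash));
static_assert(DENTRY_CACHELINE(d_lockref) != DENTRY_CACHELINE(d_name));
static_assert(DENTRY_CACHELINE(d_lockref) != DENTRY_CACHELINE(d_inode));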
Mateusz Guzik June 13, 2024, 6:41 p.m. UTC | #9
On Thu, Jun 13, 2024 at 8:13 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.
> >
> > Btw, if you added [patch 2/2] too, I think the exact byte counts in
> > the comments are a bit misleading.
> >
> > The actual cacheline details will very much depend on 32-bit vs 64-bit
> > builds, but also on things like the slab allocator debug settings.
> >
>
> I indeed assumed "x86_64 production", with lines just copied from pahole.
>
> However, to the best of my understanding the counts are what one
> should expect on a 64-bit kernel.
>
> That said:
>
> > I think the important part is to keep the d_lockref - that is often
> > dirty and exclusive in the cache - away from the mostly RO parts of
> > the dentry that can be shared across CPU's in the cache.
> >
> > So rather than talk about exact byte offsets, maybe just state that
> > overall rule?
> >
>
> I would assume the rule is pretty well known, and instead an indicator
> of where things are (as added in my comments) would be welcome.
>
> But if going that route, then perhaps:
> "Make sure d_lockref does not share a cacheline with fields used by
> RCU lookup, otherwise it can result in a significant throughput
> reduction. You can use pahole(1) to check the layout."
> [maybe a link to perfbook or something goes here?]
>
> Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
> (only active without whatever runtime debug messing with the layout).
>

So I tried to add the BUILD_BUG_ONs but I got some compilation errors
immediately concerning the syntax; maybe I'm brainfarting here. I am
not pasting that in case you want to take a stab at it yourself from
scratch.

I did however type up the following:
/*
 * Note: dentries have a read-mostly area heavily used by RCU (denoted with
 * "RCU lookup touched fields") which must not share cachelines with a
 * frequently written-to field like d_lockref.
 *
 * If you are modifying the layout you can check with pahole(1) that there is
 * no overlap.
 */

could land just above the struct.
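
For the record, checking the layout that way just means running pahole
against a kernel built with debug info, something like:

        pahole -C dentry vmlinux

and eyeballing which cacheline d_lockref ends up on.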

That's my $0.03. I am not going to give further commentary on the
matter; you guys touch it up however you see fit.
Linus Torvalds June 13, 2024, 6:43 p.m. UTC | #10
On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I would assume the rule is pretty well known, and instead an indicator
> of where things are (as added in my comments) would be welcome.

Oh, the rule is well-known, but I think what is worth pointing out is
the different classes of fields, and the name[] field in particular.

This ordering was last really updated back in 2011, by commit
44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
it was actually somewhat intentional at the time. Quoting from that
commit:

    We also fit in 8 bytes of inline name in the first 64 bytes, so for short
    names, only 64 bytes needs to be touched to perform the lookup. We should
    get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
    of name in there, which will take care of 81% rather than 32% of the kernel
    tree.

but what has actually really changed - and what I didn't even realize
until I now did a 'pahole' on it - was that this was all COMPLETELY
broken by

       seqcount_spinlock_t        d_seq;

because seqcount_spinlock_t has been entirely broken and went from
being 4 bytes back when, to now being 64 bytes.

Crazy crazy.

How did that happen?

               Linus
Mateusz Guzik June 13, 2024, 6:47 p.m. UTC | #11
On Thu, Jun 13, 2024 at 8:43 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I would assume the rule is pretty well known, and instead an indicator
> > of where things are (as added in my comments) would be welcome.
>
> Oh, the rule is well-known, but I think what is worth pointing out is
> the different classes of fields, and the name[] field in particular.
>
> This ordering was last really updated back in 2011, by commit
> 44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
> it was actually somewhat intentional at the time. Quoting from that
> commit:
>
>     We also fit in 8 bytes of inline name in the first 64 bytes, so for short
>     names, only 64 bytes needs to be touched to perform the lookup. We should
>     get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
>     of name in there, which will take care of 81% rather than 32% of the kernel
>     tree.
>

Right. Things degrading by accident is why I suggested BUILD_BUG_ON.

> but what has actually really changed - and what I didn't even realize
> until I now did a 'pahole' on it - was that this was all COMPLETELY
> broken by
>
>        seqcount_spinlock_t        d_seq;
>
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.
>
> Crazy crazy.
>
> How did that happen?
>

perhaps lockdep in your config?

this is the layout on my prod config:
struct dentry {
        unsigned int               d_flags;              /*     0     4 */
        seqcount_spinlock_t        d_seq;                /*     4     4 */
        struct hlist_bl_node       d_hash;               /*     8    16 */
        struct dentry *            d_parent;             /*    24     8 */
        struct qstr                d_name;               /*    32    16 */
        struct inode *             d_inode;              /*    48     8 */
        unsigned char              d_iname[40];          /*    56    40 */
        /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
        const struct dentry_operations  * d_op;          /*    96     8 */
        struct super_block *       d_sb;                 /*   104     8 */
        long unsigned int          d_time;               /*   112     8 */
        void *                     d_fsdata;             /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        struct lockref             d_lockref __attribute__((__aligned__(8))); /*   128     8 */
        union {
                struct list_head   d_lru;                /*   136    16 */
                wait_queue_head_t * d_wait;              /*   136     8 */
        };                                               /*   136    16 */
        struct hlist_node          d_sib;                /*   152    16 */
        struct hlist_head          d_children;           /*   168     8 */
        union {
                struct hlist_node  d_alias;              /*   176    16 */
                struct hlist_bl_node d_in_lookup_hash;   /*   176    16 */
                struct callback_head d_rcu __attribute__((__aligned__(8))); /*   176    16 */
        } d_u __attribute__((__aligned__(8)));           /*   176    16 */

        /* size: 192, cachelines: 3, members: 16 */
        /* forced alignments: 2 */
} __attribute__((__aligned__(8)));
Al Viro June 13, 2024, 6:55 p.m. UTC | #12
On Thu, Jun 13, 2024 at 11:43:05AM -0700, Linus Torvalds wrote:

>        seqcount_spinlock_t        d_seq;
> 
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.

1ca7d67cf5d5 "seqcount: Add lockdep functionality to seqcount/seqlock structures"
Linus Torvalds June 13, 2024, 6:56 p.m. UTC | #13
On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> perhaps lockdep in your config?

Yes, happily it was just lockdep, and the fact that my regular tree
doesn't have debug info, so I looked at my allmodconfig tree.

I didn't *think* anything in the dentry struct should care about
debugging, but clearly that sequence number thing did.

But with that fixed, it's still the case that "we used to know about
this", but what you actually fixed is the case of larger names than 8
bytes.

You did mention the name clashing in your commit message, but I think
that should be the important part in the code comments too: make them
about "these fields are hot and pretty much read-only", "these fields
don't matter" and "this field is hot and written, and shouldn't be
near the read-only ones".

The exact byte counts may change, the basic notion doesn't.

(Of course, putting it at the *end* of the structure then possibly
causes cache conflicts with the next one - we don't force the dentries
to be cacheline aligned even if we've tried to make them generally work
that way)

             Linus
Mateusz Guzik June 13, 2024, 7:02 p.m. UTC | #14
On Thu, Jun 13, 2024 at 8:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > perhaps lockdep in your config?
>
> Yes, happily it was just lockdep, and the fact that my regular tree
> doesn't have debug info, so I looked at my allmodconfig tree.
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.
>
> But with that fixed, it's still the case that "we used to know about
> this", but what you actually fixed is the case of larger names than 8
> bytes.
>
> You did mention the name clashing in your commit message, but I think
> that should be the important part in the code comments too: make them
> about "these fields are hot and pretty much read-only", "these fields
> don't matter" and "this field is hot and written, and shouldn't be
> near the read-only ones".
>
> The exact byte counts may change, the basic notion doesn't.
>
> (Of course, putting it at the *end* of the structure then possibly
> causes cache conflicts with the next one - we don't force the dentries
> be cacheline aligned even if we've tried to make them generally work
> that way)
>

Look mate, I'm not going to go back and forth about this bit.

Nobody is getting a Turing award for moving a field elsewhere, so as
far as I am concerned you are free to write your own version of the
patch; I don't even need to be mentioned. You are also free to grab
the commit message in whatever capacity (I guess you would at least
want the bench results?).

As long as lockref (the facility) and d_lockref are out of the way I'm
a happy camper.
Linus Torvalds June 13, 2024, 7:33 p.m. UTC | #15
On Thu, 13 Jun 2024 at 11:56, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.

Looking at the 32-bit build, it looks like our current 'struct dentry'
is 136 bytes in size, not 128.

Looks like DNAME_INLINE_LEN should be reduced to 36 on 32-bit.

And moving d_lockref to after d_fsdata works there too.

Not that anybody really cares, but let's make sure it's actually
properly done when this is changed. Christian?

              Linus
Christian Brauner June 18, 2024, 12:11 p.m. UTC | #16
On Thu, Jun 13, 2024 at 12:33:59PM GMT, Linus Torvalds wrote:
> On Thu, 13 Jun 2024 at 11:56, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I didn't *think* anything in the dentry struct should care about
> > debugging, but clearly that sequence number thing did.
> 
> Looking at the 32-bit build, it looks like our current 'struct dentry'
> is 136 bytes in size, not 128.
> 
> Looks like DNAME_INLINE_LEN should be reduced to 36 on 32-bit.
> 
> And moving d_lockref to after d_fsdata works there too.
> 
> Not that anybody really cares, but let's make sure it's actually
> properly done when this is changed. Christian?

So I verified that indeed on i386 with CONFIG_SMP we need 36 to get to
128 bytes and yes, moving d_lockref after d_fsdata works. So I've put
that patch on top. (Sorry, took a bit. I'm back tomorrow.)

Patch

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..596b521bc1f1 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -26,10 +26,46 @@ 
 	}									\
 } while (0)
 
+/*
+ * Most routines only ever inc/dec the reference, but the lock may be
+ * transiently held forcing them to take it as well.
+ *
+ * Should the lock be taken for any reason (including outside of lockref),
+ * a steady stream of ref/unref requests may find itself unable to go back
+ * to lockless operation.
+ *
+ * Combat the problem by giving the routines a way to speculatively wait in
+ * hopes of avoiding having to take the lock.
+ *
+ * The spin count is limited to guarantee forward progress, although the
+ * value is arbitrarily chosen.
+ *
+ * Note this routine is only used if the lock was found to be taken.
+ */
+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	struct lockref old;
+	int retry = 100;
+
+	for (;;) {
+		cpu_relax();
+		old.lock_count = READ_ONCE(lockref->lock_count);
+		if (arch_spin_value_unlocked(old.lock.rlock.raw_lock))
+			return true;
+		if (!--retry)
+			return false;
+	}
+}
+
 #else
 
 #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
 
+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	return false;
+}
+
 #endif
 
 /**
@@ -47,6 +83,14 @@  void lockref_get(struct lockref *lockref)
 		return;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+		,
+			return;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	lockref->count++;
 	spin_unlock(&lockref->lock);
@@ -70,6 +114,16 @@  int lockref_get_not_zero(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count <= 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 0) {
@@ -98,6 +152,16 @@  int lockref_put_not_zero(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 1)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 1) {
@@ -125,6 +189,17 @@  int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 0)
+				return -1;
+		,
+			return new.count;
+		);
+	}
+
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -181,6 +256,16 @@  int lockref_get_not_dead(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count < 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count >= 0) {