
padata: use smp_mb in padata_reorder to avoid orphaned padata jobs

Message ID: 20190711221205.29889-1-daniel.m.jordan@oracle.com (mailing list archive)
State: Changes Requested
Delegated to: Herbert Xu
Series: padata: use smp_mb in padata_reorder to avoid orphaned padata jobs

Commit Message

Daniel Jordan July 11, 2019, 10:12 p.m. UTC
Testing padata with the tcrypt module on a 5.2 kernel...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # modprobe tcrypt mode=211 sec=1

...produces this splat:

    INFO: task modprobe:10075 blocked for more than 120 seconds.
          Not tainted 5.2.0-base+ #16
    modprobe        D    0 10075  10064 0x80004080
    Call Trace:
     ? __schedule+0x4dd/0x610
     ? ring_buffer_unlock_commit+0x23/0x100
     schedule+0x6c/0x90
     schedule_timeout+0x3b/0x320
     ? trace_buffer_unlock_commit_regs+0x4f/0x1f0
     wait_for_common+0x160/0x1a0
     ? wake_up_q+0x80/0x80
     { crypto_wait_req }             # entries in braces added by hand
     { do_one_aead_op }
     { test_aead_jiffies }
     test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
     do_test+0x4053/0x6a2b [tcrypt]
     ? 0xffffffffa00f4000
     tcrypt_mod_init+0x50/0x1000 [tcrypt]
     ...

The second modprobe command never finishes because in padata_reorder,
CPU0's load of reorder_objects is executed before the unlocking store in
spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:

CPU0                                 CPU1

padata_reorder                       padata_do_serial
  LOAD reorder_objects  // 0
                                       INC reorder_objects  // 1
                                       padata_reorder
                                         TRYLOCK pd->lock   // failed
  UNLOCK pd->lock

CPU0 deletes the timer before returning from padata_reorder and since no
other job is submitted to padata, modprobe waits indefinitely.
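
For reference, the tail of padata_reorder in a v5.2 tree looks roughly like
this (paraphrased, not an exact quote of kernel/padata.c):

    spin_unlock_bh(&pd->lock);    /* release; nothing prevents the load
                                     below from executing before this
                                     store is visible to other CPUs */

    if (atomic_read(&pd->reorder_objects)
                    && !(pinst->flags & PADATA_RESET))
            mod_timer(&pd->timer, jiffies + HZ);
    else
            del_timer(&pd->timer);    /* the deletion mentioned above */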

Add a full barrier to prevent this scenario.  The hang was happening
about once every three runs, but now the test has finished successfully
fifty times in a row.

Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---

memory-barriers.txt says that a full barrier pairs with a release barrier, but
I'd appreciate a look from someone more familiar with barriers.  Thanks.

 kernel/padata.c | 5 +++++
 1 file changed, 5 insertions(+)


base-commit: 0ecfebd2b52404ae0c54a878c872bb93363ada36

Comments

Herbert Xu July 12, 2019, 10:06 a.m. UTC | #1
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
> CPU0                                 CPU1
> 
> padata_reorder                       padata_do_serial
>  LOAD reorder_objects  // 0
>                                       INC reorder_objects  // 1
>                                       padata_reorder
>                                         TRYLOCK pd->lock   // failed
>  UNLOCK pd->lock

I think this can't happen because CPU1 won't call padata_reorder
at all as it's the wrong CPU so it gets pushed back onto a work
queue which will go back to CPU0.

Steffen, could you please take a look at this as there clearly
is a problem here?

Thanks,
Steffen Klassert July 12, 2019, 10:10 a.m. UTC | #2
On Fri, Jul 12, 2019 at 06:06:36PM +0800, Herbert Xu wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> >
> > CPU0                                 CPU1
> > 
> > padata_reorder                       padata_do_serial
> >  LOAD reorder_objects  // 0
> >                                       INC reorder_objects  // 1
> >                                       padata_reorder
> >                                         TRYLOCK pd->lock   // failed
> >  UNLOCK pd->lock
> 
> I think this can't happen because CPU1 won't call padata_reorder
> at all as it's the wrong CPU so it gets pushed back onto a work
> queue which will go back to CPU0.
> 
> Steffen, could you please take a look at this as there clearly
> is a problem here?

I'm currently travelling, will have a look at it when I'm back.
Daniel Jordan July 12, 2019, 4:07 p.m. UTC | #3
On Fri, Jul 12, 2019 at 12:10:12PM +0200, Steffen Klassert wrote:
> On Fri, Jul 12, 2019 at 06:06:36PM +0800, Herbert Xu wrote:
> > Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > >
> > > CPU0                                 CPU1
> > >
> > > padata_reorder                       padata_do_serial
> > >  LOAD reorder_objects  // 0
> > >                                       INC reorder_objects  // 1
> > >                                       padata_reorder
> > >                                         TRYLOCK pd->lock   // failed
> > >  UNLOCK pd->lock
> >
> > I think this can't happen because CPU1 won't call padata_reorder
> > at all as it's the wrong CPU so it gets pushed back onto a work
> > queue which will go back to CPU0.

Thanks for looking at this.

When you say the wrong CPU, I think you're referring to the reorder_via_wq
logic in padata_do_serial.  If so, I think my explanation was unclear, because
there were two padata jobs in flight and my tracepoints showed neither of them
punted padata_reorder to a workqueue.  Let me expand on this, hopefully it
helps.

pcrypt used CPU 5 for its callback CPU.  The first job in question, with
sequence number 16581, hashed to CPU 21 on my system.  This is a more complete
version of what led to the hanging modprobe command.

modprobe (CPU2)               kworker/21:1-293 (CPU21)                              kworker/5:2-276 (CPU5)
--------------------------    ------------------------                              ----------------------
<submit job, seq_nr=16581>
...
  padata_do_parallel
    queue_work_on(21, ...)
<sleeps>
                              padata_parallel_worker
                                pcrypt_aead_dec
                                  padata_do_serial
                                    padata_reorder
                                    | padata_get_next  // returns job, seq_nr=16581
                                    | // serialize seq_nr=16581
                                    | queue_work_on(5, ...)
                                    | padata_get_next  // returns -EINPROGRESS
                                    // padata_reorder doesn't return yet!
                                    | |                                             padata_serial_worker
                                    | |                                               pcrypt_aead_serial
                                    | |                                                 <wake up modprobe>
                                    | |                                             <worker finishes>
<submit job, seq_nr=16582>          | |
...                                 | |
  padata_do_parallel                | |
    queue_work_on(22, ...)          | | (LOAD reorder_objects as 0 somewhere
<sleeps>                            | |    in here, before the INC)
                                    | |                                             kworker/22:1-291 (CPU22)
                                    | |                                             ------------------------
                                    | |                                             padata_parallel_worker
                                    | |                                               pcrypt_aead_dec
                                    | |                                                 padata_do_serial
                                    | |                                                   // INC reorder_objects to 1
                                    | |                                                   padata_reorder
                                    | |                                                     // trylock fail, CPU21 _should_
                                    | |                                                     //   serialize 16582 but doesn't
                                    | |                                             <worker finishes>
                                    | // deletes pd->timer
                                    // padata_reorder returns
                              <worker finishes>
<keeps on sleeping lazily>

My tracepoints showed CPU22 increased reorder_objects to 1 but CPU21 read it as
0.
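
The CPU22 side above is the enqueue step of padata_do_serial, which in v5.2
looks roughly like this (paraphrased, not an exact quote):

    pqueue = per_cpu_ptr(pd->pqueue, padata->cpu);

    spin_lock(&pqueue->reorder.lock);
    atomic_inc(&pd->reorder_objects);              /* the INC in the trace */
    list_add_tail(&padata->list, &pqueue->reorder.list);
    spin_unlock(&pqueue->reorder.lock);            /* release only */
    ...
    padata_reorder(pd);                            /* TRYLOCK pd->lock inside */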

I think adding the full barrier guarantees the following ordering, and the
memory model people can correct me if I'm wrong:

CPU21                      CPU22
------------------------   --------------------------
UNLOCK pd->lock
smp_mb()
LOAD reorder_objects
                           INC reorder_objects
                           spin_unlock(&pqueue->reorder.lock) // release barrier
                           TRYLOCK pd->lock

So if CPU22 has incremented reorder_objects but CPU21 reads 0 for it, CPU21
should also have unlocked pd->lock so CPU22 can get it and serialize any
remaining jobs.
Herbert Xu July 13, 2019, 5:03 a.m. UTC | #4
On Fri, Jul 12, 2019 at 12:07:37PM -0400, Daniel Jordan wrote:
>
> modprobe (CPU2)               kworker/21:1-293 (CPU21)                              kworker/5:2-276 (CPU5)
> --------------------------    ------------------------                              ----------------------
> <submit job, seq_nr=16581>
> ...
>   padata_do_parallel
>     queue_work_on(21, ...)
> <sleeps>
>                               padata_parallel_worker
>                                 pcrypt_aead_dec
>                                   padata_do_serial
>                                     padata_reorder

This can't happen because if the job started on CPU2 then it must
go back to CPU2 for completion.  IOW padata_do_serial should be
punting this to a work queue for CPU2 rather than calling
padata_reorder on CPU21.

Cheers,
Daniel Jordan July 15, 2019, 4:10 p.m. UTC | #5
On Sat, Jul 13, 2019 at 01:03:21PM +0800, Herbert Xu wrote:
> On Fri, Jul 12, 2019 at 12:07:37PM -0400, Daniel Jordan wrote:
> >
> > modprobe (CPU2)               kworker/21:1-293 (CPU21)                              kworker/5:2-276 (CPU5)
> > --------------------------    ------------------------                              ----------------------
> > <submit job, seq_nr=16581>
> > ...
> >   padata_do_parallel
> >     queue_work_on(21, ...)
> > <sleeps>
> >                               padata_parallel_worker
> >                                 pcrypt_aead_dec
> >                                   padata_do_serial
> >                                     padata_reorder
> 
> This can't happen because if the job started on CPU2 then it must
> go back to CPU2 for completion.  IOW padata_do_serial should be
> punting this to a work queue for CPU2 rather than calling
> padata_reorder on CPU21.

I've been wrong before plenty of times, and there's nothing preventing this
from being one of those times :) , but in this case I believe what I'm showing
is correct.

The padata_do_serial call for a given job ensures padata_reorder runs on the
CPU that the job hashed to in padata_do_parallel, which is not necessarily the
same CPU as the one that padata_do_parallel itself ran on.

In this case, the padata job in question started via padata_do_parallel, where
it hashed to CPU 21:

  padata_do_parallel                    // ran on CPU 2
    ...
    target_cpu = padata_cpu_hash(pd);   // target_cpu == 21
    padata->cpu = target_cpu;
    ...
    queue_work_on(21, ...)

The corresponding kworker then started:

  padata_parallel_worker                // bound to CPU 21
    pcrypt_aead_dec
      padata_do_serial
        padata_reorder
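
Since that kworker was already on the CPU the job hashed to, the check in
padata_do_serial that decides whether to punt padata_reorder to a workqueue
doesn't fire and padata_reorder is called directly; roughly (v5.2,
paraphrased):

    cpu = get_cpu();
    if (cpu != padata->cpu) {         /* false here: both are CPU 21 */
            reorder_via_wq = 1;
            cpu = padata->cpu;
    }
    ...
    if (reorder_via_wq)
            queue_work_on(cpu, pd->pinst->wq, &pqueue->reorder_work);
    else
            padata_reorder(pd);       /* the path taken in this trace */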

Daniel
Herbert Xu July 16, 2019, 10:04 a.m. UTC | #6
On Mon, Jul 15, 2019 at 12:10:46PM -0400, Daniel Jordan wrote:
>
> I've been wrong before plenty of times, and there's nothing preventing this
> from being one of those times :) , but in this case I believe what I'm showing
> is correct.
> 
> The padata_do_serial call for a given job ensures padata_reorder runs on the
> CPU that the job hashed to in padata_do_parallel, which is not necessarily the
> same CPU as the one that padata_do_parallel itself ran on.

You're right.  I was taking the comment in the code at face value,
never trust comments :)

While looking at the code in question, I think it is seriously
broken.  For instance, padata_replace does not deal with async
crypto at all.  It would fail miserably if the underlying async
crypto held onto references to the old pd.

So we may have to restrict pcrypt to sync crypto only, which
would obviously mean that it can no longer use aesni.  Or we
will have to spend a lot of time to fix this up properly.

Cheers,
Steffen Klassert July 16, 2019, 11:14 a.m. UTC | #7
On Tue, Jul 16, 2019 at 06:04:47PM +0800, Herbert Xu wrote:
> On Mon, Jul 15, 2019 at 12:10:46PM -0400, Daniel Jordan wrote:
> >
> > I've been wrong before plenty of times, and there's nothing preventing this
> > from being one of those times :) , but in this case I believe what I'm showing
> > is correct.
> > 
> > The padata_do_serial call for a given job ensures padata_reorder runs on the
> > CPU that the job hashed to in padata_do_parallel, which is not necessarily the
> > same CPU as the one that padata_do_parallel itself ran on.
> 
> You're right.  I was taking the comment in the code at face value,
> never trust comments :)
> 
> While looking at the code in question, I think it is seriously
> broken.  For instance, padata_replace does not deal with async
> crypto at all.  It would fail miserably if the underlying async
> crypto held onto references to the old pd.

Hm, yes looks like that.

padata_replace should not call padata_free_pd() as long as the
refcount is not zero. Currently padata_flush_queues() will
BUG if there are references left.
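
(The check in question, at the end of padata_flush_queues() in v5.2, is
roughly

    BUG_ON(atomic_read(&pd->refcnt) != 0);

so any object still holding a reference to the old pd at flush time would
trip it, if I'm reading it right.)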

Maybe we can fix it if we call padata_free_pd() from
padata_serial_worker() when it has sent out the last object.
Andrea Parri July 16, 2019, 12:53 p.m. UTC | #8
Hi Daniel,

My two cents (summarizing some findings we discussed privately):


> I think adding the full barrier guarantees the following ordering, and the
> memory model people can correct me if I'm wrong:
> 
> CPU21                      CPU22
> ------------------------   --------------------------
> UNLOCK pd->lock
> smp_mb()
> LOAD reorder_objects
>                            INC reorder_objects
>                            spin_unlock(&pqueue->reorder.lock) // release barrier
>                            TRYLOCK pd->lock
> 
> So if CPU22 has incremented reorder_objects but CPU21 reads 0 for it, CPU21
> should also have unlocked pd->lock so CPU22 can get it and serialize any
> remaining jobs.

This information inspired me to write down the following litmus test:
(AFAICT, you independently wrote a very similar test, which is indeed
quite reassuring! ;D)

C daniel-padata

{ }

P0(atomic_t *reorder_objects, spinlock_t *pd_lock)
{
	int r0;

	spin_lock(pd_lock);
	spin_unlock(pd_lock);
	smp_mb();
	r0 = atomic_read(reorder_objects);
}

P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
{
	int r1;

	spin_lock(reorder_lock);
	atomic_inc(reorder_objects);
	spin_unlock(reorder_lock);
	//smp_mb();
	r1 = spin_trylock(pd_lock);
}

exists (0:r0=0 /\ 1:r1=0)

It seems worth noticing that this test's "exists" clause is satisfiable
according to the (current) memory consistency model.  (Informally, this
can be explained by noticing that the RELEASE from the spin_unlock() in
P1 does not provide any order between the atomic increment and the read
part of the spin_trylock() operation.)  FWIW, uncommenting the smp_mb()
in P1 would suffice to prevent this clause from being satisfiable; I am
not sure, however, whether this approach is feasible or ideal... (sorry,
I'm definitely not too familiar with this code... ;/)
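
(FWIW, the litmus test above is in the LKMM format, so it can be checked
with herd7 against tools/memory-model in the kernel tree to confirm the
result.)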

Thanks,
  Andrea
Peter Zijlstra July 16, 2019, 1:13 p.m. UTC | #9
On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:

> C daniel-padata
> 
> { }
> 
> P0(atomic_t *reorder_objects, spinlock_t *pd_lock)
> {
> 	int r0;
> 
> 	spin_lock(pd_lock);
> 	spin_unlock(pd_lock);
> 	smp_mb();
> 	r0 = atomic_read(reorder_objects);
> }
> 
> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
> {
> 	int r1;
> 
> 	spin_lock(reorder_lock);
> 	atomic_inc(reorder_objects);
> 	spin_unlock(reorder_lock);
> 	//smp_mb();
> 	r1 = spin_trylock(pd_lock);
> }
> 
> exists (0:r0=0 /\ 1:r1=0)
> 
> It seems worth noticing that this test's "exists" clause is satisfiable
> according to the (current) memory consistency model.  (Informally, this
> can be explained by noticing that the RELEASE from the spin_unlock() in
> P1 does not provide any order between the atomic increment and the read
> part of the spin_trylock() operation.)  FWIW, uncommenting the smp_mb()
> in P1 would suffice to prevent this clause from being satisfiable; I am
> not sure, however, whether this approach is feasible or ideal... (sorry,
> I'm definitely not too familiar with this code... ;/)

Urgh, that one again.

Yes, you need the smp_mb(); although a whole bunch of architectures can
live without it. IIRC it is part of the eternal RCsc/RCpc debate.

Paul/RCU have their smp_mb__after_unlock_lock() that is about something
similar, although we've so far confined that to the RCU code, because
of how confusing that all is.
Herbert Xu July 16, 2019, 3:01 p.m. UTC | #10
On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:
>
> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
> {
> 	int r1;
> 
> 	spin_lock(reorder_lock);
> 	atomic_inc(reorder_objects);
> 	spin_unlock(reorder_lock);
> 	//smp_mb();
> 	r1 = spin_trylock(pd_lock);
> }

Yes we need a matching mb on the other side.  However, we can
get away with using smp_mb__after_atomic thanks to the atomic_inc
above.
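
Something like this on the padata_do_serial side, I mean (just a sketch
against v5.2, untested):

    spin_lock(&pqueue->reorder.lock);
    atomic_inc(&pd->reorder_objects);
    list_add_tail(&padata->list, &pqueue->reorder.list);
    spin_unlock(&pqueue->reorder.lock);

    /*
     * Order the increment of reorder_objects against the later trylock
     * of pd->lock in padata_reorder.  Pairs with the smp_mb() added
     * there.  Where atomic_inc already implies a full barrier this is
     * just a compiler barrier.
     */
    smp_mb__after_atomic();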

Daniel, can you please respin the patch with the matching smp_mb?

Thanks,
Daniel Jordan July 16, 2019, 3:44 p.m. UTC | #11
On 7/16/19 11:01 AM, Herbert Xu wrote:
> On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:
>>
>> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
>> {
>> 	int r1;
>>
>> 	spin_lock(reorder_lock);
>> 	atomic_inc(reorder_objects);
>> 	spin_unlock(reorder_lock);
>> 	//smp_mb();
>> 	r1 = spin_trylock(pd_lock);
>> }
> 
> Yes we need a matching mb on the other side.  However, we can
> get away with using smp_mb__after_atomic thanks to the atomic_inc
> above.
> 
> Daniel, can you please respin the patch with the matching smp_mb?

Sure, Herbert, will do.

Thanks,
Daniel

Patch

diff --git a/kernel/padata.c b/kernel/padata.c
index 2d2fddbb7a4c..9cffd4c303cb 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -267,7 +267,12 @@  static void padata_reorder(struct parallel_data *pd)
 	 * The next object that needs serialization might have arrived to
 	 * the reorder queues in the meantime, we will be called again
 	 * from the timer function if no one else cares for it.
+	 *
+	 * Ensure reorder_objects is read after pd->lock is dropped so we see
+	 * an increment from another task in padata_do_serial.  Pairs with
+	 * spin_unlock(&pqueue->reorder.lock) in padata_do_serial.
 	 */
+	smp_mb();
 	if (atomic_read(&pd->reorder_objects)
 			&& !(pinst->flags & PADATA_RESET))
 		mod_timer(&pd->timer, jiffies + HZ);