diff mbox series

[v2] RISC-V: Break load reservations during switch_to

Message ID 20190607222222.15300-1-palmer@sifive.com (mailing list archive)
State New, archived
Headers show
Series [v2] RISC-V: Break load reservations during switch_to | expand

Commit Message

Palmer Dabbelt June 7, 2019, 10:22 p.m. UTC
The comment describes why in detail.  This was found because QEMU never
gives up load reservations, the issue is unlikely to manifest on real
hardware.

Thanks to Carlos Eduardo for finding the bug!

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
---
Changes since v1 <20190605231735.26581-1-palmer@sifive.com>:

* REG_SC is now defined as a helper macro, for any code that wants to SC
  a register-sized value.
* The explicit #ifdef to check that TASK_THREAD_RA_RA is 0 has been
  removed.  Instead we rely on the assembler to catch non-zero SC
  offsets.  I've tested this does actually work.

 arch/riscv/include/asm/asm.h |  1 +
 arch/riscv/kernel/entry.S    | 11 +++++++++++
 2 files changed, 12 insertions(+)

Comments

Christoph Hellwig June 8, 2019, 8:19 a.m. UTC | #1
Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Joel Sing June 16, 2019, 5:54 p.m. UTC | #2
On 19-06-07 15:22:22, Palmer Dabbelt wrote:
> The comment describes why in detail.  This was found because QEMU never
> gives up load reservations, the issue is unlikely to manifest on real
> hardware.

Makes sense, however it obviously will not help until qemu actually
clears load reservations on SC (or otherwise handles the interleaved
SC case).

See comment inline.

> Thanks to Carlos Eduardo for finding the bug!
> 
> Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
> ---
> Changes since v1 <20190605231735.26581-1-palmer@sifive.com>:
> 
> * REG_SC is now defined as a helper macro, for any code that wants to SC
>   a register-sized value.
> * The explicit #ifdef to check that TASK_THREAD_RA_RA is 0 has been
>   removed.  Instead we rely on the assembler to catch non-zero SC
>   offsets.  I've tested this does actually work.
> 
>  arch/riscv/include/asm/asm.h |  1 +
>  arch/riscv/kernel/entry.S    | 11 +++++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/arch/riscv/include/asm/asm.h b/arch/riscv/include/asm/asm.h
> index 5ad4cb622bed..946b671f996c 100644
> --- a/arch/riscv/include/asm/asm.h
> +++ b/arch/riscv/include/asm/asm.h
> @@ -30,6 +30,7 @@
>  
>  #define REG_L		__REG_SEL(ld, lw)
>  #define REG_S		__REG_SEL(sd, sw)
> +#define REG_SC		__REG_SEL(sc.w, sc.d)

The instructions appear to be inverted here (i.e. "sc.d, sc.w").

>  #define SZREG		__REG_SEL(8, 4)
>  #define LGREG		__REG_SEL(3, 2)
>  
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 1c1ecc238cfa..af685782af17 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -330,6 +330,17 @@ ENTRY(__switch_to)
>  	add   a3, a0, a4
>  	add   a4, a1, a4
>  	REG_S ra,  TASK_THREAD_RA_RA(a3)
> +	/*
> +	 * The Linux ABI allows programs to depend on load reservations being
> +	 * broken on context switches, but the ISA doesn't require that the
> +	 * hardware ever breaks a load reservation.  The only way to break a
> +	 * load reservation is with a store conditional, so we emit one here.
> +	 * Since nothing ever takes a load reservation on TASK_THREAD_RA_RA we
> +	 * know this will always fail, but just to be on the safe side this
> +	 * writes the same value that was unconditionally written by the
> +	 * previous instruction.
> +	 */
> +	REG_SC x0, ra, TASK_THREAD_RA_RA(a3)
>  	REG_S sp,  TASK_THREAD_SP_RA(a3)
>  	REG_S s0,  TASK_THREAD_S0_RA(a3)
>  	REG_S s1,  TASK_THREAD_S1_RA(a3)
> -- 
> 2.21.0
>
Palmer Dabbelt June 17, 2019, 3:09 a.m. UTC | #3
On Sun, 16 Jun 2019 10:54:06 PDT (-0700), joel@sing.id.au wrote:
> On 19-06-07 15:22:22, Palmer Dabbelt wrote:
>> The comment describes why in detail.  This was found because QEMU never
>> gives up load reservations, the issue is unlikely to manifest on real
>> hardware.
>
> Makes sense, however it obviously will not help until qemu actually
> clears load reservations on SC (or otherwise handles the interleaved
> SC case).

Ya, I wasn't paying close enough attention.  I think this should do it

    diff --git a/target/riscv/insn_trans/trans_rva.inc.c b/target/riscv/insn_trans/trans_rva.inc.c
    index f6dbbc065e15..001a68ced005 100644
    --- a/target/riscv/insn_trans/trans_rva.inc.c
    +++ b/target/riscv/insn_trans/trans_rva.inc.c
    @@ -69,6 +69,7 @@ static inline bool gen_sc(DisasContext *ctx, arg_atomic *a, TCGMemOp mop)
         gen_set_gpr(a->rd, dat);
    
         gen_set_label(l2);
    +    tcg_gen_movi_tl(load_res, -1);
         tcg_temp_free(dat);
         tcg_temp_free(src1);
         tcg_temp_free(src2);

I'll send the patch out to the QEMU mailing list.

> See comment inline.
>
>> Thanks to Carlos Eduardo for finding the bug!
>>
>> Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
>> ---
>> Changes since v1 <20190605231735.26581-1-palmer@sifive.com>:
>>
>> * REG_SC is now defined as a helper macro, for any code that wants to SC
>>   a register-sized value.
>> * The explicit #ifdef to check that TASK_THREAD_RA_RA is 0 has been
>>   removed.  Instead we rely on the assembler to catch non-zero SC
>>   offsets.  I've tested this does actually work.
>>
>>  arch/riscv/include/asm/asm.h |  1 +
>>  arch/riscv/kernel/entry.S    | 11 +++++++++++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/arch/riscv/include/asm/asm.h b/arch/riscv/include/asm/asm.h
>> index 5ad4cb622bed..946b671f996c 100644
>> --- a/arch/riscv/include/asm/asm.h
>> +++ b/arch/riscv/include/asm/asm.h
>> @@ -30,6 +30,7 @@
>>
>>  #define REG_L		__REG_SEL(ld, lw)
>>  #define REG_S		__REG_SEL(sd, sw)
>> +#define REG_SC		__REG_SEL(sc.w, sc.d)
>
> The instructions appear to be inverted here (i.e. "sc.d, sc.w").

Thanks, I'll send a v3.

>>  #define SZREG		__REG_SEL(8, 4)
>>  #define LGREG		__REG_SEL(3, 2)
>>
>> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
>> index 1c1ecc238cfa..af685782af17 100644
>> --- a/arch/riscv/kernel/entry.S
>> +++ b/arch/riscv/kernel/entry.S
>> @@ -330,6 +330,17 @@ ENTRY(__switch_to)
>>  	add   a3, a0, a4
>>  	add   a4, a1, a4
>>  	REG_S ra,  TASK_THREAD_RA_RA(a3)
>> +	/*
>> +	 * The Linux ABI allows programs to depend on load reservations being
>> +	 * broken on context switches, but the ISA doesn't require that the
>> +	 * hardware ever breaks a load reservation.  The only way to break a
>> +	 * load reservation is with a store conditional, so we emit one here.
>> +	 * Since nothing ever takes a load reservation on TASK_THREAD_RA_RA we
>> +	 * know this will always fail, but just to be on the safe side this
>> +	 * writes the same value that was unconditionally written by the
>> +	 * previous instruction.
>> +	 */
>> +	REG_SC x0, ra, TASK_THREAD_RA_RA(a3)
>>  	REG_S sp,  TASK_THREAD_SP_RA(a3)
>>  	REG_S s0,  TASK_THREAD_S0_RA(a3)
>>  	REG_S s1,  TASK_THREAD_S1_RA(a3)
>> --
>> 2.21.0
>>
Mark Rutland June 19, 2019, 7:36 a.m. UTC | #4
On Fri, Jun 07, 2019 at 03:22:22PM -0700, Palmer Dabbelt wrote:
> The comment describes why in detail.  This was found because QEMU never
> gives up load reservations, the issue is unlikely to manifest on real
> hardware.
> 
> Thanks to Carlos Eduardo for finding the bug!

> @@ -330,6 +330,17 @@ ENTRY(__switch_to)
>  	add   a3, a0, a4
>  	add   a4, a1, a4
>  	REG_S ra,  TASK_THREAD_RA_RA(a3)
> +	/*
> +	 * The Linux ABI allows programs to depend on load reservations being
> +	 * broken on context switches, but the ISA doesn't require that the
> +	 * hardware ever breaks a load reservation.  The only way to break a
> +	 * load reservation is with a store conditional, so we emit one here.
> +	 * Since nothing ever takes a load reservation on TASK_THREAD_RA_RA we
> +	 * know this will always fail, but just to be on the safe side this
> +	 * writes the same value that was unconditionally written by the
> +	 * previous instruction.
> +	 */

I suspect that you need to do the same as 32-bit ARM, and clear this in
your exception return path, rather than in __switch_to, since handlers
for interrupts and other exceptions could leave a dangling reservation.

For ARM, the architecture permits a store-exclusive to succeed even if
the address differed from the load-exclusive. I don't know if the same
applies here, but regardless I believe the case above applies if an IRQ
is taken from kernel context, since the handler can manipulate the same
variable as the interrupted code.

Thanks,
Mark.
Palmer Dabbelt June 21, 2019, 1:50 a.m. UTC | #5
On Wed, 19 Jun 2019 00:36:01 PDT (-0700), mark.rutland@arm.com wrote:
> On Fri, Jun 07, 2019 at 03:22:22PM -0700, Palmer Dabbelt wrote:
>> The comment describes why in detail.  This was found because QEMU never
>> gives up load reservations, the issue is unlikely to manifest on real
>> hardware.
>>
>> Thanks to Carlos Eduardo for finding the bug!
>
>> @@ -330,6 +330,17 @@ ENTRY(__switch_to)
>>  	add   a3, a0, a4
>>  	add   a4, a1, a4
>>  	REG_S ra,  TASK_THREAD_RA_RA(a3)
>> +	/*
>> +	 * The Linux ABI allows programs to depend on load reservations being
>> +	 * broken on context switches, but the ISA doesn't require that the
>> +	 * hardware ever breaks a load reservation.  The only way to break a
>> +	 * load reservation is with a store conditional, so we emit one here.
>> +	 * Since nothing ever takes a load reservation on TASK_THREAD_RA_RA we
>> +	 * know this will always fail, but just to be on the safe side this
>> +	 * writes the same value that was unconditionally written by the
>> +	 * previous instruction.
>> +	 */
>
> I suspect that you need to do the same as 32-bit ARM, and clear this in
> your exception return path, rather than in __switch_to, since handlers
> for interrupts and other exceptions could leave a dangling reservation.
>
> For ARM, the architecture permits a store-exclusive to succeed even if
> the address differed from the load-exclusive. I don't know if the same
> applies here, but regardless I believe the case above applies if an IRQ
> is taken from kernel context, since the handler can manipulate the same
> variable as the interrupted code.

RISC-V has the same constraint: an LR can cause the subsequent SC on any
address to succeed.  When writing the patch I thought they had to have matching
addresses, v4 should have a correct comment (assuming I've managed to send it,
I'm on my third continent this week so I'm a bit out of it).

I'd considered breaking reservations on trap entry, but decided it wasn't
necessary.  I hadn't considered doing this on trap exit, but I don't see a
difference so I might be missing something.  The case that I see as an issue is
when a trap comes in the middle of an LR/SC sequence, which boils down to three
cases:

* The trap handler doesn't contend with the LR/SC sequence in any way, in which
  case it's fine for the sequence to succeed.
* The trap handler contends by doing its own LR/SC sequence.  Since the trap
  handler must execute completely before returning control back the parent, we
  know the SC in the trap handler will execute.  Thus there is no way the SC in
  the parent can pair with the LR in the trap handler.  This applies even when
  traps are nested.
* The trap handler contends by performing a regular store to the same address
  as the LR that was interrupted.  In this case the SC must fail, and while I
  assumed that the store would cause that failure the ISA manual doesn't appear
  to require that behavior -- it does allow the SC to always fail in that case,
  but it doesn't mandate it always fails (which is how I got confused).

Assuming the ISA manual is correct in not specifying that stores within an
LR/SC sequence break the load reservations, then I think we do need to break
load reservations on all traps.  I'll go check with the ISA people, but now
that I've noticed it this does seem somewhat reasonable -- essentially it lets
LR just take a line exclusively, SC to succeed only on already exclusively held
lines, and doesn't impose any extra constraints on regular memory operations.

I don't see any reason that breaking reservations on entry as opposed to exit
would be incorrect.  I feel like doing this on entry is a better bet, as we're
already doing a bunch of stores there so I don't need to bring an additional
cache line in for writing.  These lines would be on the kernel stack so it's
unlikely anyone else has them for writing anyway, so maybe it just doesn't
matter.  The only issue I can think of here is that there might be something
tricky for implementations to handle WRT the regular store -> store conditional
forwarding, as that's a pattern that is unlikely to manifest on regular code
but is now in a high performance path.  The regular load -> store conditional
may be easier to handle, and since this is done on the kernel stack it's
unlikely that upgrading a line for writing would be an expensive operation
anyway.

LMK if I'm missing something, otherwise I'll spin another version of the patch
to break reservations on trap entry.
diff mbox series

Patch

diff --git a/arch/riscv/include/asm/asm.h b/arch/riscv/include/asm/asm.h
index 5ad4cb622bed..946b671f996c 100644
--- a/arch/riscv/include/asm/asm.h
+++ b/arch/riscv/include/asm/asm.h
@@ -30,6 +30,7 @@ 
 
 #define REG_L		__REG_SEL(ld, lw)
 #define REG_S		__REG_SEL(sd, sw)
+#define REG_SC		__REG_SEL(sc.w, sc.d)
 #define SZREG		__REG_SEL(8, 4)
 #define LGREG		__REG_SEL(3, 2)
 
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 1c1ecc238cfa..af685782af17 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -330,6 +330,17 @@  ENTRY(__switch_to)
 	add   a3, a0, a4
 	add   a4, a1, a4
 	REG_S ra,  TASK_THREAD_RA_RA(a3)
+	/*
+	 * The Linux ABI allows programs to depend on load reservations being
+	 * broken on context switches, but the ISA doesn't require that the
+	 * hardware ever breaks a load reservation.  The only way to break a
+	 * load reservation is with a store conditional, so we emit one here.
+	 * Since nothing ever takes a load reservation on TASK_THREAD_RA_RA we
+	 * know this will always fail, but just to be on the safe side this
+	 * writes the same value that was unconditionally written by the
+	 * previous instruction.
+	 */
+	REG_SC x0, ra, TASK_THREAD_RA_RA(a3)
 	REG_S sp,  TASK_THREAD_SP_RA(a3)
 	REG_S s0,  TASK_THREAD_S0_RA(a3)
 	REG_S s1,  TASK_THREAD_S1_RA(a3)