
[v3,10/24] RISC-V: Hold rcu_read_lock when accessing memory

Message ID 1521229281-73637-11-git-send-email-mjc@sifive.com (mailing list archive)
State New, archived

Commit Message

Michael Clark March 16, 2018, 7:41 p.m. UTC
From reading other code that accesses memory regions directly,
it appears that the rcu_read_lock needs to be held. Note: the
original code for accessing RAM directly was added because
there is no other way to use atomic_cmpxchg on guest physical
address space.

Cc: Sagar Karandikar <sagark@eecs.berkeley.edu>
Cc: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
Signed-off-by: Michael Clark <mjc@sifive.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
---
 target/riscv/helper.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini March 19, 2018, 9:41 a.m. UTC | #1
On 16/03/2018 20:41, Michael Clark wrote:
> From reading other code that accesses memory regions directly,
> it appears that the rcu_read_lock needs to be held. Note: the
> original code for accessing RAM directly was added because
> there is no other way to use atomic_cmpxchg on guest physical
> address space.
> 
> Cc: Sagar Karandikar <sagark@eecs.berkeley.edu>
> Cc: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
> Signed-off-by: Michael Clark <mjc@sifive.com>
> Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
> ---
>  target/riscv/helper.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)

I think the bug here is that rcu_read_lock/unlock is missing in
cpu_memory_rw_debug.  Are there any other callers you had in mind?

Paolo

> diff --git a/target/riscv/helper.c b/target/riscv/helper.c
> index 02cbcea..e71633a 100644
> --- a/target/riscv/helper.c
> +++ b/target/riscv/helper.c
> @@ -209,6 +209,9 @@ restart:
>                     as the PTE is no longer valid */
>                  MemoryRegion *mr;
>                  hwaddr l = sizeof(target_ulong), addr1;
> +                enum { success, translate_fail, restart_walk} action = success;
> +
> +                rcu_read_lock();
>                  mr = address_space_translate(cs->as, pte_addr,
>                      &addr1, &l, false);
>                  if (memory_access_is_direct(mr, true)) {
> @@ -222,7 +225,7 @@ restart:
>                      target_ulong old_pte =
>                          atomic_cmpxchg(pte_pa, pte, updated_pte);
>                      if (old_pte != pte) {
> -                        goto restart;
> +                        action = restart_walk;
>                      } else {
>                          pte = updated_pte;
>                      }
> @@ -230,7 +233,14 @@ restart:
>                  } else {
>                      /* misconfigured PTE in ROM (AD bits are not preset) or
>                       * PTE is in IO space and can't be updated atomically */
> -                    return TRANSLATE_FAIL;
> +                    action = translate_fail;
> +                }
> +                rcu_read_unlock();
> +
> +                switch (action) {
> +                    case success: break;
> +                    case translate_fail: return TRANSLATE_FAIL;
> +                    case restart_walk: goto restart;
>                  }
>              }
>  
>
Michael Clark March 19, 2018, 9:07 p.m. UTC | #2
On Mon, Mar 19, 2018 at 2:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 16/03/2018 20:41, Michael Clark wrote:
> > From reading other code that accesses memory regions directly,
> > it appears that the rcu_read_lock needs to be held. Note: the
> > original code for accessing RAM directly was added because
> > there is no other way to use atomic_cmpxchg on guest physical
> > address space.
> >
> > Cc: Sagar Karandikar <sagark@eecs.berkeley.edu>
> > Cc: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
> > Signed-off-by: Michael Clark <mjc@sifive.com>
> > Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
> > ---
> >  target/riscv/helper.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
>
> I think the bug here is that rcu_read_lock/unlock is missing in
> cpu_memory_rw_debug.  Are there any other callers you had in mind?


So this is not a bug in our patch per se, rather a bug in
cpu_memory_rw_debug?

It seems that most other code that uses the address_space_* interfaces
(e.g. hw/virtio/virtio.c) holds the rcu_read_lock, presumably to handle
cases where the memory map might change at runtime, e.g. during ballooning.
This would be exhibited if a page walk collided with a balloon.

Ideally we could add a cpu_physical_memory_atomic_cmpxchg API to generic
code, so that we could avoid holding the rcu_read_lock in target/riscv, i.e.
encapsulate guest physical memory atomic_cmpxchg. The atomic_cmpxchg
wrapper handles multiple word widths, so we would need
cpu_physical_memory_atomic_cmpxchg32 and cpu_physical_memory_atomic_cmpxchg64.
We need to use atomic_cmpxchg in the PTE update to detect the case where the
PTE has changed between reading it and updating the accessed/dirty bits. As
far as I can tell, the code is correct. In fact we are reading a very strong
interpretation of the RISC-V specification.
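
To make that concrete, the 32-bit variant might look something like the
sketch below. The name, signature and error convention are hypothetical,
not existing QEMU API; it just lifts the translate-then-cmpxchg pattern
from this patch into generic code:

    /* Hypothetical helper -- name, signature and error convention are
     * assumptions, not existing QEMU API. Translate under the RCU read
     * lock and cmpxchg only if the target is plain RAM. */
    static bool cpu_physical_memory_atomic_cmpxchg32(AddressSpace *as,
                                                     hwaddr addr,
                                                     uint32_t expected,
                                                     uint32_t desired,
                                                     uint32_t *old)
    {
        hwaddr l = sizeof(uint32_t), addr1;
        MemoryRegion *mr;
        bool ok = false;

        rcu_read_lock();
        mr = address_space_translate(as, addr, &addr1, &l, true);
        if (l >= sizeof(uint32_t) && memory_access_is_direct(mr, true)) {
            uint32_t *ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
            *old = atomic_cmpxchg(ptr, expected, desired);
            ok = true;
        }
        rcu_read_unlock();
        return ok;   /* false: target in ROM or IO space, no atomic update */
    }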

This is what The RISC-V Instruction Set Manual Volume II: Privileged
Architecture says about accessed and dirty updates, which is the reason for
the code in question:

    When a virtual page is accessed and the A bit is clear, or is written
    and the D bit is clear, the implementation sets the corresponding bit
    in the PTE. The PTE update must be atomic with respect to other
    accesses to the PTE, and must atomically check that the PTE is valid
    and grants sufficient permissions. The PTE update must be exact (i.e.,
    not speculative), and observed in program order by the local hart. The
    ordering on loads and stores provided by FENCE instructions and the
    acquire/release bits on atomic instructions also orders the PTE
    updates associated with those loads and stores as observed by remote
    harts.

This is an interesting constraint. I believe the intent is that we just
AMOOR the A+D bits on the PTE (although we don't have AMOOR, we have
atomic_cmpxchg), however we have read a stronger interpretation (while
stronger, it is not incorrect), and that is that the atomicity guarantee
extends from the page walker reading the PTE entry, checking its
permissions and then updating it, which in hardware would require the page
walker to take load reservations, and I don't think it does.

Apparently the hardware just AMOORs the bits, so the PTE could actually be
pointing somewhere else by the time we go to update the bits, although the
same virtual address has been accessed or dirtied (albeit with a different
physical translation). In any case, we have a very strong interpretation of
the specification wording, potentially stronger than hardware may provide.
We had a long discussion about this on the RISC-V QEMU issue tracker.
Stefan O'Rear mentioned he might make a test that hammers a PTE from
another thread while one thread is doing a page walk, to try to cause the
page walker to set the A+D bits on a PTE that is different from the one it
used to create the virtual to physical mapping. That's some background on
why the code exists in the first place, which provides the strongest
possible guarantee on PTE atomicity.

- https://github.com/riscv/riscv-qemu/pull/103
- https://github.com/riscv/riscv-qemu/pull/105

The get_physical_address bug that this patch fixes does not show up in
practice, i.e. we aren't actually hitting cases where page walks collide
with address space changes; however, holding rcu_read_lock seems to be
correct, and this bug may show up in the future.

> > diff --git a/target/riscv/helper.c b/target/riscv/helper.c
> > index 02cbcea..e71633a 100644
> > --- a/target/riscv/helper.c
> > +++ b/target/riscv/helper.c
> > @@ -209,6 +209,9 @@ restart:
> >                     as the PTE is no longer valid */
> >                  MemoryRegion *mr;
> >                  hwaddr l = sizeof(target_ulong), addr1;
> > +                enum { success, translate_fail, restart_walk} action = success;
> > +
> > +                rcu_read_lock();
> >                  mr = address_space_translate(cs->as, pte_addr,
> >                      &addr1, &l, false);
> >                  if (memory_access_is_direct(mr, true)) {
> > @@ -222,7 +225,7 @@ restart:
> >                      target_ulong old_pte =
> >                          atomic_cmpxchg(pte_pa, pte, updated_pte);
> >                      if (old_pte != pte) {
> > -                        goto restart;
> > +                        action = restart_walk;
> >                      } else {
> >                          pte = updated_pte;
> >                      }
> > @@ -230,7 +233,14 @@ restart:
> >                  } else {
> >                      /* misconfigured PTE in ROM (AD bits are not preset) or
> >                       * PTE is in IO space and can't be updated atomically */
> > -                    return TRANSLATE_FAIL;
> > +                    action = translate_fail;
> > +                }
> > +                rcu_read_unlock();
> > +
> > +                switch (action) {
> > +                    case success: break;
> > +                    case translate_fail: return TRANSLATE_FAIL;
> > +                    case restart_walk: goto restart;
> >                  }
> >              }
> >
Paolo Bonzini March 21, 2018, 1:55 p.m. UTC | #3
On 19/03/2018 22:07, Michael Clark wrote:
> 
> 
> On Mon, Mar 19, 2018 at 2:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>     On 16/03/2018 20:41, Michael Clark wrote:
>     > From reading other code that accesses memory regions directly,
>     > it appears that the rcu_read_lock needs to be held. Note: the
>     > original code for accessing RAM directly was added because
>     > there is no other way to use atomic_cmpxchg on guest physical
>     > address space.
>     >
>     > Cc: Sagar Karandikar <sagark@eecs.berkeley.edu>
>     > Cc: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
>     > Signed-off-by: Michael Clark <mjc@sifive.com>
>     > Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
>     > ---
>     >  target/riscv/helper.c | 14 ++++++++++++--
>     >  1 file changed, 12 insertions(+), 2 deletions(-)
> 
>     I think the bug here is that rcu_read_lock/unlock is missing in
>     cpu_memory_rw_debug.  Are there any other callers you had in mind?
> 
> 
> So this is not a bug in our patch per se, rather a bug in
> cpu_memory_rw_debug?

Yes.

> It seems that most other code that uses the address_space_*
> interfaces (e.g. hw/virtio/virtio.c) holds the rcu_read_lock, presumably
> to handle cases where the memory map might change at runtime, e.g.
> during ballooning. This would be exhibited if a page walk collided with
> a balloon.

Note that hw/virtio/virtio.c calls rcu_read_lock() not so much to
protect from memory map changes, but rather to protect reads of
vq->vring.caches.

In general, exec.c and memory.c take care of taking the lock.  Functions
that need the caller to take the lock are annotated with something like

    /* Called from RCU critical section.  */

(see for example flatview_write).

> Ideally we could add a cpu_physical_memory_atomic_cmpxchg API to generic
> code, so that we could avoid holding the rcu_read_lock in target/riscv, i.e.
> encapsulate guest physical memory atomic_cmpxchg. The atomic_cmpxchg
> wrapper handles multiple word widths, so we would need
> cpu_physical_memory_atomic_cmpxchg32 and cpu_physical_memory_atomic_cmpxchg64.
> We need to use atomic_cmpxchg in the PTE update to detect the case where the
> PTE has changed between reading it and updating the accessed/dirty bits.

Yes, this makes sense.  In fact having such a function (more precisely
address_space_atomic_cmpxchg) would be useful for x86 too.  Right now
x86 is wrong in not using cmpxchg.

Even without such a function, however, there are two more things that I
noticed:

1) you're using a very low-level function, which complicates things a
bit for yourself.  You can use address_space_map/unmap too, which is a
bit simpler (a short sketch follows after point 2).

2) I'm wondering if the page walk needs to use load-acquire instructions
(and I cannot really find the relevant text in the privileged spec)
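
For point 1, a minimal sketch using the variable names from the patch
hunk (sketch only; the bounce-buffer caveat in the comment is something a
real implementation would need to handle):

    /* Sketch of the address_space_map/unmap approach. The mapping takes
     * a reference on the underlying MemoryRegion, so the caller does not
     * need to hold the RCU read lock across the access. Caveat: for IO
     * regions address_space_map() returns a bounce buffer, on which a
     * cmpxchg is not atomic with respect to the guest, so a real
     * implementation would still need to detect and reject that case. */
    hwaddr len = sizeof(target_ulong);
    target_ulong *pte_pa = address_space_map(cs->as, pte_addr, &len, true);

    if (!pte_pa) {
        return TRANSLATE_FAIL;          /* nothing mappable at pte_addr */
    }
    if (len < sizeof(target_ulong)) {
        address_space_unmap(cs->as, pte_pa, len, true, 0);
        return TRANSLATE_FAIL;          /* mapping too short */
    }
    target_ulong old_pte = atomic_cmpxchg(pte_pa, pte, updated_pte);
    address_space_unmap(cs->as, pte_pa, len, true, len);
    if (old_pte != pte) {
        goto restart;                   /* PTE changed under us, walk again */
    }
    pte = updated_pte;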

> This is an interesting constraint. I believe the intent is that we just
> AMOOR the A+D bits on the PTE (although we don't have AMOOR, we have
> atomic_cmpxchg), however we have read a stronger interpretation (while
> stronger, it is not incorrect),

Stronger is never incorrect. :)

> and that is that the atomicity guarantee
> extends from the page walker reading the PTE entry, checking its
> permissions and then updating it, which in hardware would require the
> page walker to take load reservations, and I don't think it does.

Indeed.  But maybe another possibility could be to do an atomic_cmpxchg
and ignore it if the comparison failed?  This would be susceptible to
ABA problems, but apart from that it would be the same as taking a load
reservation (in assembly language terms, that's
load-link/or/store-conditional).
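
In terms of the code in this patch, that idea would reduce to roughly the
following sketch (updated_pte being the PTE with the A/D bits ORed in, as
in the hunk quoted below):

    /* Sketch of "cmpxchg and ignore failure": morally an AMOOR of the
     * A/D bits, modulo the ABA caveat above. */
    if (atomic_cmpxchg(pte_pa, pte, updated_pte) == pte) {
        pte = updated_pte;      /* we won: A/D bits are now set */
    }
    /* On failure, just carry on: hardware that ORs the bits in would
     * not have noticed the racing update either, so neither a restart
     * nor TRANSLATE_FAIL is required. */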

> Apparently the hardware just AMOORs the bits, so the PTE could actually
> be pointing somewhere else by the time we go to update the bits,
> although the same virtual address has been accessed or dirtied
> (albeit with a different physical translation). In any case, we have a
> very strong interpretation of the specification wording, potentially
> stronger than hardware may provide. We had a long discussion about this
> on the RISC-V QEMU issue tracker. Stefan O'Rear mentioned he might make
> a test that hammers a PTE from another thread while one thread is doing
> a page walk, to try to cause the page walker to set the A+D bits on a PTE
> that is different from the one it used to create the virtual to physical
> mapping. That's some background on why the code exists in the first
> place, which provides the strongest possible guarantee on PTE atomicity.
> 
> - https://github.com/riscv/riscv-qemu/pull/103
> - https://github.com/riscv/riscv-qemu/pull/105
> 
> The get_physical_address bug that this patch fixes does not show up in
> practice, i.e. we aren't actually hitting cases where page walks collide
> with address space changes; however, holding rcu_read_lock seems to be
> correct, and this bug may show up in the future.

Yes, it is correct, but it shouldn't be your responsibility---the bug is
not in your code.

tlb_fill and thus riscv_cpu_handle_mmu_fault are already called with
rcu_read_lock (translated code always is) and the same ought to be true
for riscv_cpu_get_phys_page_debug.

All the callers of cpu_get_phys_page_attrs_debug must therefore call
rcu_read_lock(), and that leaves us two possibilities:

1) cpu_memory_rw_debug and breakpoint_invalidate (in this case,
tb_invalidate_phys_addr's RCU lock/unlock can also be pushed up to
breakpoint_invalidate)

2) cpu_get_phys_page_attrs_debug itself.
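
A sketch of possibility 2, assuming the hook dispatch in the existing
inline wrapper stays as it is:

    /* Sketch: take the RCU read lock inside the generic wrapper, so the
     * per-target get_phys_page(_attrs)_debug hooks can assume they run
     * in an RCU critical section, mirroring what tlb_fill already gets. */
    static inline hwaddr cpu_get_phys_page_attrs_debug(CPUState *cpu,
                                                       vaddr addr,
                                                       MemTxAttrs *attrs)
    {
        CPUClass *cc = CPU_GET_CLASS(cpu);
        hwaddr phys;

        rcu_read_lock();
        if (cc->get_phys_page_attrs_debug) {
            phys = cc->get_phys_page_attrs_debug(cpu, addr, attrs);
        } else {
            /* CPUs without the _attrs_ hook fall back to the plain one */
            *attrs = MEMTXATTRS_UNSPECIFIED;
            phys = cc->get_phys_page_debug(cpu, addr);
        }
        rcu_read_unlock();
        return phys;
    }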

I'll try to set aside some time, and post a patch for this before 2.12.

Paolo
Peter Maydell March 21, 2018, 4:33 p.m. UTC | #4
On 21 March 2018 at 13:55, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 19/03/2018 22:07, Michael Clark wrote:
>> We need to use atomic_cmpxchg
>> in the PTE update to detect the case where the PTE has changed between
>> reading it and updating the accessed/dirty bits.
>
> Yes, this makes sense.  In fact having such a function (more precisely
> address_space_atomic_cmpxchg) would be useful for x86 too.  Right now
> x86 is wrong in not using cmpxchg.

Yeah, this is a known missing feature in our APIs for memory
accesses (it only starts to matter with MTTCG, really). We
ought to have functions that guarantee the access is done
as a single 32/64-bit load/store, as well as having atomic
support. PPC and Arm TLB walk code will need these. For the
moment we just ignore the possibility of races here, but for
the 2.13 timeframe we really ought to design a solution to
this properly.

thanks
-- PMM

Patch

diff --git a/target/riscv/helper.c b/target/riscv/helper.c
index 02cbcea..e71633a 100644
--- a/target/riscv/helper.c
+++ b/target/riscv/helper.c
@@ -209,6 +209,9 @@  restart:
                    as the PTE is no longer valid */
                 MemoryRegion *mr;
                 hwaddr l = sizeof(target_ulong), addr1;
+                enum { success, translate_fail, restart_walk} action = success;
+
+                rcu_read_lock();
                 mr = address_space_translate(cs->as, pte_addr,
                     &addr1, &l, false);
                 if (memory_access_is_direct(mr, true)) {
@@ -222,7 +225,7 @@  restart:
                     target_ulong old_pte =
                         atomic_cmpxchg(pte_pa, pte, updated_pte);
                     if (old_pte != pte) {
-                        goto restart;
+                        action = restart_walk;
                     } else {
                         pte = updated_pte;
                     }
@@ -230,7 +233,14 @@  restart:
                 } else {
                     /* misconfigured PTE in ROM (AD bits are not preset) or
                      * PTE is in IO space and can't be updated atomically */
-                    return TRANSLATE_FAIL;
+                    action = translate_fail;
+                }
+                rcu_read_unlock();
+
+                switch (action) {
+                    case success: break;
+                    case translate_fail: return TRANSLATE_FAIL;
+                    case restart_walk: goto restart;
                 }
             }