
[v2,1/2] x86/mm: add API for marking only part of a MMIO page read only

Message ID def382a6481a9d1bcc106200b971cd5b0f3d19c1.1683321183.git-series.marmarek@invisiblethingslab.com (mailing list archive)
State New, archived
Series Add API for making parts of a MMIO page R/O and use it in XHCI console

Commit Message

Marek Marczykowski-Górecki May 5, 2023, 9:25 p.m. UTC
In some cases, only a few registers on a page need to be write-protected.
Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. The current API allows only marking whole pages read-only, which
sometimes may cover other registers that the guest may need to write to.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on an EPT violation - if
it's HVM - or, if PV, it gets #PF. In the case of Linux PV, if the access
was from userspace (e.g. via /dev/mem), it will try to fix it up by
updating the page tables (which Xen will again force to be read-only) and
will hit that #PF again, looping endlessly. Both behaviors are undesirable
if the guest could actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply the page fault
handler for PV) that handles the writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both the start address and the size need to be
8-byte aligned, as a practical simplification (it allows using a smaller
bitmask, and a smaller granularity isn't really necessary right now).
It will internally add the relevant pages to mmio_ro_ranges, but if either
the start or the end address is not page-aligned, it additionally adds
that page to a list for sub-page R/O handling. The list holds a bitmask
of which qwords are supposed to be read-only and an address where the page
is mapped for write emulation - this mapping is done only on the first
access. A plain list is used instead of a more efficient structure,
because there aren't supposed to be many pages needing this precise R/O
control.
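
For illustration, the per-page bookkeeping described above boils down to
roughly the following entry (a sketch assembled from the hunks quoted
later in this thread; field names follow the patch, but the exact layout
may differ):

struct subpage_ro_range {
    struct list_head list;     /* linked into the subpage_ro_ranges list */
    mfn_t mfn;                 /* the partially write-protected page */
    void __iomem *mapped;      /* lazily ioremap()ed on the first write */
    DECLARE_BITMAP(ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
    struct rcu_head rcu;       /* deferred freeing of the entry */
};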

The mechanism this API is plugged into is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
that is already called for #PF on a read-only MMIO page. For HVM however,
an EPT violation on a p2m_mmio_direct page results in a direct
domain_crash(). To reach mmio_ro_emulated_write(), change how write
violations for p2m_mmio_direct are handled - specifically, check if they
relate to such a partially protected page via subpage_mmio_write_accept()
and if so, call hvm_emulate_one_mmio() for them too. This decodes what the
guest is trying to write and finally calls mmio_ro_emulated_write(). Note
that hitting an EPT write violation for a p2m_mmio_direct page can only
happen if the page is on mmio_ro_ranges (see ept_p2m_type_to_flags()), so
there is no need to check that again.
Both of those paths need the MFN to which the guest tried to write (to
check which part of the page is supposed to be read-only, and where the
page is mapped for writes). This information currently isn't available
directly in mmio_ro_emulated_write(), but in both cases it is already
resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.
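
A minimal usage sketch of the registration side (the function name and
address below are made up for illustration; the real caller in patch 2/2
derives the range from the XHCI extended capability list):

static void __init example_protect_console_regs(void)
{
    /* Illustrative address only - must be 8-byte aligned, like the size. */
    paddr_t regs = 0xfe0d0040;
    int rc = subpage_mmio_ro_add(regs, 64);

    if ( rc )
        printk(XENLOG_WARNING
               "Cannot mark console registers read-only: %d\n", rc);
}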

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v2:
- Simplify subpage_mmio_ro_add() parameters
- add to mmio_ro_ranges from within subpage_mmio_ro_add()
- use ioremap() instead of caller-provided fixmap
- use 8-bytes granularity (largest supported single write) and a bitmap
  instead of a rangeset
- clarify commit message
- change how it's plugged in for HVM domain, to not change the behavior for
  read-only parts (keep it hitting domain_crash(), instead of ignoring
  write)
- remove unused subpage_mmio_ro_remove()
---
 xen/arch/x86/hvm/emulate.c      |   2 +-
 xen/arch/x86/hvm/hvm.c          |   8 +-
 xen/arch/x86/include/asm/mm.h   |  15 ++-
 xen/arch/x86/mm.c               | 253 +++++++++++++++++++++++++++++++++-
 xen/arch/x86/pv/ro-page-fault.c |   1 +-
 5 files changed, 278 insertions(+), 1 deletion(-)

Comments

Jason Andryuk May 17, 2023, 7:28 p.m. UTC | #1
On Fri, May 5, 2023 at 5:26 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> In some cases, only few registers on a page needs to be write-protected.

Maybe "In some cases, only part of a page needs to be write-protected"?

> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either), although
> in the latter case the spec forbids placing other registers on the same
> page. Current API allows only marking whole pages pages read-only,
> which sometimes may cover other registers that guest may need to
> write into.
>
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fixup by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if guest could
> actually be allowed the write.
>
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables (they are in EPT,
> but not granular enough), do this via emulation (or simply page fault
> handler for PV) that handles writes that are supposed to be allowed.
> The new subpage_mmio_ro_add() takes a start physical address and the
> region size in bytes. Both start address and the size need to be 8-byte
> aligned, as a practical simplification (allows using smaller bitmask,
> and a smaller granularity isn't really necessary right now).
> It will internally add relevant pages to mmio_ro_ranges, but if either
> start or end address is not page-aligned, it additionally adds that page
> to a list for sub-page R/O handling. The list holds a bitmask which
> dwords are supposed to be read-only and an address where page is mapped
> for write emulation - this mapping is done only on the first access. A
> plain list is used instead of more efficient structure, because there
> isn't supposed to be many pages needing this precise r/o control.
>
> The mechanism this API is plugged in is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash().
> To reach mmio_ro_emulated_write(), change how write violations for
> p2m_mmio_direct are handled - specifically, check if they relate to such
> partially protected page via subpage_mmio_write_accept() and if so, call
> hvm_emulate_one_mmio() for them too. This decodes what guest is trying
> write and finally calls mmio_ro_emulated_write(). Note that hitting EPT
> write violation for p2m_mmio_direct page can only happen if the page was
> on mmio_ro_ranges (see ept_p2m_type_to_flags()), so there is no need for
> checking that again.
> Both of those paths need an MFN to which guest tried to write (to check
> which part of the page is supposed to be read-only, and where
> the page is mapped for writes). This information currently isn't
> available directly in mmio_ro_emulated_write(), but in both cases it is
> already resolved somewhere higher in the call tree. Pass it down to
> mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
>
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,24 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>
> +/*
> + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> + * will still be R/O, but the rest of the page (nor marked as R/O via another

s/nor/not/

> + * call) will have writes passed through.
> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> + *

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c

> +            /*
> +             * We don't know the write seize at this point yet, so it could be

s/seize/size/

> +             * an unalligned write, but accept it here anyway and deal with it

s/unalligned/unaligned/

> +             * later.
> +             */

Thanks,
Jason
Jan Beulich May 30, 2023, 11:56 a.m. UTC | #2
On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
> In some cases, only few registers on a page needs to be write-protected.
> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either), although
> in the latter case the spec forbids placing other registers on the same
> page. Current API allows only marking whole pages pages read-only,
> which sometimes may cover other registers that guest may need to
> write into.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fixup by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if guest could
> actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables (they are in EPT,
> but not granular enough), do this via emulation (or simply page fault
> handler for PV) that handles writes that are supposed to be allowed.
> The new subpage_mmio_ro_add() takes a start physical address and the
> region size in bytes. Both start address and the size need to be 8-byte

8-byte (aka qword) here, but ...

> aligned, as a practical simplification (allows using smaller bitmask,
> and a smaller granularity isn't really necessary right now).
> It will internally add relevant pages to mmio_ro_ranges, but if either
> start or end address is not page-aligned, it additionally adds that page
> to a list for sub-page R/O handling. The list holds a bitmask which
> dwords are supposed to be read-only and an address where page is mapped

... dwords here?

> for write emulation - this mapping is done only on the first access. A
> plain list is used instead of more efficient structure, because there
> isn't supposed to be many pages needing this precise r/o control.
> 
> The mechanism this API is plugged in is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash().
> To reach mmio_ro_emulated_write(), change how write violations for
> p2m_mmio_direct are handled - specifically, check if they relate to such
> partially protected page via subpage_mmio_write_accept() and if so, call
> hvm_emulate_one_mmio() for them too. This decodes what guest is trying
> write and finally calls mmio_ro_emulated_write(). Note that hitting EPT
> write violation for p2m_mmio_direct page can only happen if the page was
> on mmio_ro_ranges (see ept_p2m_type_to_flags()), so there is no need for
> checking that again.

Yet that's then putting us at a certain risk wrt potential errata.

You also specifically talk about "guests", i.e. more than just hwdom.
Adding another easy access to the emulator (for HVM) comes with a
certain risk of future XSAs, too.

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>          goto out_put_gfn;
>      }
>  
> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> +         subpage_mmio_write_accept(mfn, gla) &&
> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> +    {
> +        rc = 1;
> +        goto out_put_gfn;
> +    }

But npfec.write_access set doesn't mean it was a write permission
violation, does it? May I ask that this be accompanied by a comment
discussing the correctness/safety?

> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,24 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> + * will still be R/O, but the rest of the page (nor marked as R/O via another
> + * call) will have writes passed through.
> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.

With this alignment constraint, the earlier sentence can be read as
contradictory. How about "Byte-granular ranges ..." or "Ranges (using
byte granularity) ..."? I admit even that doesn't resolve the issue
fully, though.

> @@ -4882,6 +4895,243 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +/* This needs subpage_ro_lock already taken */
> +static int __init subpage_mmio_ro_add_page(
> +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int i;

unsigned int please (as almost always for induction variables).

> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */

Nit: Even in a comment I think it would be nice if style rules were
followed, and hence == was surrounded by blanks.

> +        iter = NULL;
> +        entry = xzalloc(struct subpage_ro_range);
> +        if ( !entry )
> +            return -ENOMEM;
> +        entry->mfn = mfn;
> +    }
> +
> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +        set_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);

You're holding a spin lock, so won't __set_bit() suffice here? And
then __clear_bit() below?

> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    return 0;
> +}

Since you mark the qwords which are to be protected, how is one to set
up safely two discontiguous ranges on the same page? I think I had
asked for v1 already why you don't do things the other way around:
Initially the entire page is protected, and then writable regions are
carved out.

I guess I shouldn't further ask about overlapping r/o ranges and their
cleaning up. But at least a comment towards the restriction would be
nice. Perhaps even check upon registration that no part of the range
is already marked r/o.
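
One possible shape of such a registration-time check, sketched against the
loop above (same offset_s/offset_e/entry as in subpage_mmio_ro_add_page();
unwinding of a freshly allocated entry omitted, and not part of the posted
patch):

    /* Refuse ranges overlapping an already protected qword. */
    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
        if ( test_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
            return -EBUSY;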

> +static void __init subpage_mmio_ro_free(struct rcu_head *rcu)
> +{
> +    struct subpage_ro_range *entry = container_of(
> +        rcu, struct subpage_ro_range, rcu);
> +
> +    ASSERT(bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN));
> +
> +    if ( entry->mapped )
> +        iounmap(entry->mapped);
> +    xfree(entry);
> +}
> +
> +/* This needs subpage_ro_lock already taken */
> +static int __init subpage_mmio_ro_remove_page(
> +    mfn_t mfn,
> +    int offset_s,
> +    int offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc, i;
> +
> +    list_for_each_entry_rcu(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +        return -ENOENT;

Yet the sole caller doesn't care at all, not even by an assertion.

> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +        clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> +
> +    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> +        return 0;
> +
> +    list_del_rcu(&entry->list);
> +    call_rcu(&entry->rcu, subpage_mmio_ro_free);

This being an __init function, what is the RCU-ness intended to guard
against?

> +    return rc;

DYM "return 0" here, or maybe even invert the if()'s condition to have
just a single return? "rc" was never written afaics, and the compiler
not spotting it is likely because the caller doesn't inspect the return
value.
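
One way to realize the single-return variant (sketch only, not from the
patch):

    if ( bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
    {
        list_del_rcu(&entry->list);
        call_rcu(&entry->rcu, subpage_mmio_ro_free);
    }

    return 0;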

> +}
> +
> +

Nit: No double blanks lines please.

> +int __init subpage_mmio_ro_add(
> +    paddr_t start,
> +    size_t size)
> +{
> +    mfn_t mfn_start = maddr_to_mfn(start);
> +    paddr_t end = start + size - 1;
> +    mfn_t mfn_end = maddr_to_mfn(end);
> +    int offset_end = 0;

unsigned int again, afaics. Also this can be declared in the more narrow
scope it's used in.

> +    int rc;
> +
> +    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> +    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));

Not meeting the first assertion's condition (thinking of a release build)
is kind of okay, as too large a range will be protected. But for the 2nd
too small a range would be covered aiui, so this may want dealing with in
a release-build-safe way.
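
A release-build-safe variant might look like the following sketch, keeping
the debug-build diagnostics while refusing misaligned requests outright
(not part of the posted patch):

    if ( !IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN) ||
         !IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN) )
    {
        ASSERT_UNREACHABLE();
        return -EINVAL;
    }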

> +    if ( !size )
> +        return 0;
> +
> +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
> +    if ( rc )
> +        return rc;
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    if ( PAGE_OFFSET(start) ||
> +         (mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1) )
> +    {
> +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> +                     PAGE_OFFSET(end) :
> +                     (PAGE_SIZE - 1);
> +        rc = subpage_mmio_ro_add_page(mfn_start,
> +                                      PAGE_OFFSET(start),
> +                                      offset_end);
> +        if ( rc )
> +            goto err_unlock;
> +    }
> +
> +    if ( !mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1 )
> +    {
> +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> +        if ( rc )
> +            goto err_unlock_remove;
> +    }
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    return 0;
> +
> + err_unlock_remove:
> +    if ( offset_end )
> +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
> +
> + err_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    if ( rangeset_remove_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end)) )
> +        printk(XENLOG_ERR "Failed to cleanup on failed subpage_mmio_ro_add()\n");
> +    return rc;
> +}

None of the failures here is particularly likely, so perhaps all is fine as
you have it. But there would be an alternative of retaining the
mmio_ro_ranges entry/entries, allowing the caller to "ignore" the error.

> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> +{
> +    if ( entry->mapped )
> +        return entry->mapped;
> +
> +    spin_lock(&subpage_ro_lock);
> +    /* Re-check under the lock */
> +    if ( entry->mapped )
> +        goto out_unlock;
> +
> +    entry->mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> +
> + out_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    return entry->mapped;
> +}

This is easy to deal with without any "goto".

I'm further inclined to request that the ioremap() occur without the lock
held, followed by an iounmap() (after dropping the lock) if in fact the
mapping wasn't needed (anymore).

> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned int offset,
> +    const void *data,
> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +                goto write_ignored;

I think you can get away with just a single "goto" by putting the gprintk()
(and its label) here.

> +            addr = subpage_mmio_get_page(entry);
> +            if ( !addr )
> +            {
> +                gprintk(XENLOG_ERR,
> +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> +                        mfn_x(mfn), offset);
> +                goto out_unlock;
> +            }
> +
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(uint64_t*)data, addr);
> +                break;

Please avoid casting away const-ness.

> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();
> +                goto write_ignored;
> +            }
> +            goto out_unlock;
> +        }
> +    }
> +    /* Do not print message for pages without any writable parts. */
> +    goto out_unlock;
> +
> + write_ignored:
> +    gprintk(XENLOG_WARNING,
> +             "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> +             mfn_x(mfn), offset, len);

Nit: Indentation.

> + out_unlock:
> +    rcu_read_unlock(&subpage_ro_rcu);
> +}
> +
> +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> +{
> +    unsigned int offset = PAGE_OFFSET(gla);
> +    const struct subpage_ro_range *entry;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> +        if ( mfn_eq(entry->mfn, mfn) &&
> +             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +        {
> +            /*
> +             * We don't know the write seize at this point yet, so it could be

Nit: "size" I assume.

> +             * an unalligned write, but accept it here anyway and deal with it
> +             * later.
> +             */

Have I overlooked where unaligned writes are dealt with?

Also nit: "unaligned".

Jan
Marek Marczykowski-Górecki June 30, 2023, 10:28 p.m. UTC | #3
On Tue, May 30, 2023 at 01:56:34PM +0200, Jan Beulich wrote:
> On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
> > In some cases, only few registers on a page needs to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either), although
> > in the latter case the spec forbids placing other registers on the same
> > page. Current API allows only marking whole pages pages read-only,
> > which sometimes may cover other registers that guest may need to
> > write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables (they are in EPT,
> > but not granular enough), do this via emulation (or simply page fault
> > handler for PV) that handles writes that are supposed to be allowed.
> > The new subpage_mmio_ro_add() takes a start physical address and the
> > region size in bytes. Both start address and the size need to be 8-byte
> 
> 8-byte (aka qword) here, but ...
> 
> > aligned, as a practical simplification (allows using smaller bitmask,
> > and a smaller granularity isn't really necessary right now).
> > It will internally add relevant pages to mmio_ro_ranges, but if either
> > start or end address is not page-aligned, it additionally adds that page
> > to a list for sub-page R/O handling. The list holds a bitmask which
> > dwords are supposed to be read-only and an address where page is mapped
> 
> ... dwords here?
> 
> > for write emulation - this mapping is done only on the first access. A
> > plain list is used instead of more efficient structure, because there
> > isn't supposed to be many pages needing this precise r/o control.
> > 
> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash().
> > To reach mmio_ro_emulated_write(), change how write violations for
> > p2m_mmio_direct are handled - specifically, check if they relate to such
> > partially protected page via subpage_mmio_write_accept() and if so, call
> > hvm_emulate_one_mmio() for them too. This decodes what guest is trying
> > write and finally calls mmio_ro_emulated_write(). Note that hitting EPT
> > write violation for p2m_mmio_direct page can only happen if the page was
> > on mmio_ro_ranges (see ept_p2m_type_to_flags()), so there is no need for
> > checking that again.
> 
> Yet that's then putting us on a certain risk wrt potential errata.
> 
> You also specifically talk about "guests", i.e. more than just hwdom.
> Adding another easy access to the emulator (for HVM) comes with a
> certain risk of future XSAs, too.

There are a lot of ways a guest with a PCI device can exercise the
emulator already, no? And even without PCI passthrough, more or less any
device backed by the device model allows that too. So, while the emulator
is a complex piece of code, I don't think this patch makes it more easily
accessible in any significant way.

Anyway, for the use in this series, it applies only if the USB
controller used for the XHCI console is passed through to a guest. I'd say
that's far from a common setup (outside of debugging), since it requires a
special cable to be plugged in for the system to boot at all.

> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >          goto out_put_gfn;
> >      }
> >  
> > +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> > +         subpage_mmio_write_accept(mfn, gla) &&
> > +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> > +    {
> > +        rc = 1;
> > +        goto out_put_gfn;
> > +    }
> 
> But npfec.write_access set doesn't mean it was a write permission
> violation, does it? 

Doesn't it? IIUC it means it was a write attempt, to a mapped page
(npfec.present), and since we've got EPT violation, it got denied. 

> May I ask that this be accompanied by a comment
> discussing the correctness/safety?
> 
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,24 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> > + * will still be R/O, but the rest of the page (nor marked as R/O via another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> 
> With this alignment constraint, the earlier sentence can be read as
> controdictory. How about "Byte-granular ranges ..." or "Ranges (using
> byte granularity) ..."? I admit even that doesn't resolve the issue
> fully, though.

I don't see where the issue is; the alignment requirement doesn't change
anything regarding the parameter units. Anyway, I can change to one of
the suggested versions.

> > @@ -4882,6 +4895,243 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return 0;
> >  }
> >  
> > +/* This needs subpage_ro_lock already taken */
> > +static int __init subpage_mmio_ro_add_page(
> > +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int i;
> 
> unsigned int please (as almost always for induction variables).
> 
> > +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +    {
> > +        /* iter==NULL marks it was a newly allocated entry */
> 
> Nit: Even in a comment I think it would be nice if style rules were
> followed, and hence == was surrounded by blanks.
> 
> > +        iter = NULL;
> > +        entry = xzalloc(struct subpage_ro_range);
> > +        if ( !entry )
> > +            return -ENOMEM;
> > +        entry->mfn = mfn;
> > +    }
> > +
> > +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> > +        set_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> 
> You're holding a spin lock, so won't __set_bit() suffice here? And
> then __clear_bit() below?

Right.

> > +    if ( !iter )
> > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > +
> > +    return 0;
> > +}
> 
> Since you mark the qwords which are to be protected, how is one to set
> up safely two discontiguous ranges on the same page? I think I had
> asked for v1 already why you don't do things the other way around:
> Initially the entire page is protected, and then writable regions are
> carved out.

Because that's not how the API is used. This API is for those who want
to write-protect some specific ranges (to be written exclusively by
Xen). They don't even need to know what else is on the same page.
Take the XHCI case as an example: it gets the range to write-protect by
enumerating XHCI extended capabilities, which is a linked list. The
driver gets info on where the XHCI console registers are. Things
before/after them on that page may not even be XHCI extended caps at all.
This is in fact a very similar approach to the already existing
mmio_ro_ranges.

> I guess I shouldn't further ask about overlapping r/o ranges and their
> cleaning up. But at least a comment towards the restriction would be
> nice. Perhaps even check upon registration that no part of the range
> is already marked r/o.

Yes, this is a good suggestion, I'll add that.

> > +static void __init subpage_mmio_ro_free(struct rcu_head *rcu)
> > +{
> > +    struct subpage_ro_range *entry = container_of(
> > +        rcu, struct subpage_ro_range, rcu);
> > +
> > +    ASSERT(bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN));
> > +
> > +    if ( entry->mapped )
> > +        iounmap(entry->mapped);
> > +    xfree(entry);
> > +}
> > +
> > +/* This needs subpage_ro_lock already taken */
> > +static int __init subpage_mmio_ro_remove_page(
> > +    mfn_t mfn,
> > +    int offset_s,
> > +    int offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc, i;
> > +
> > +    list_for_each_entry_rcu(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +        return -ENOENT;
> 
> Yet the sole caller doesn't care at all, not even by an assertion.

Indeed, right now it's called under the same spinlock under which the page
was added. Would you prefer to drop the return value (not relevant until
another use is added, beyond cleanup on failure), or to add an assertion?

> > +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> > +        clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> > +
> > +    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> > +        return 0;
> > +
> > +    list_del_rcu(&entry->list);
> > +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> 
> This being an __init function, what is the RCU-ness intended to guard
> against?

Indeed, as long as it's __init only, it probably can be dropped.
Originally I prepared this API to be useful also when some ranges are
added/removed beyond just boot time, but currently the sole use is just
adding ranges and only during boot.

> > +    return rc;
> 
> DYM "return 0" here, or maybe even invert the if()'s condition to have
> just a single return? "rc" was never written afaics, and the compiler
> not spotting it is likely because the caller doesn't inspect the return
> value.

Indeed.

> > +}
> > +
> > +
> 
> Nit: No double blanks lines please.
> 
> > +int __init subpage_mmio_ro_add(
> > +    paddr_t start,
> > +    size_t size)
> > +{
> > +    mfn_t mfn_start = maddr_to_mfn(start);
> > +    paddr_t end = start + size - 1;
> > +    mfn_t mfn_end = maddr_to_mfn(end);
> > +    int offset_end = 0;
> 
> unsigned int again, afaics. Also this can be declared in the more narrow
> scope it's used in.
> 
> > +    int rc;
> > +
> > +    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> > +    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
> 
> Not meeting the first assertion's condition (thinking of a release build)
> is kind of okay, as too large a range will be protected. But for the 2nd
> too small a range would be covered aiui, so this may want dealing with in
> a release-build-safe way.

Ok.

> > +    if ( !size )
> > +        return 0;
> > +
> > +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
> > +    if ( rc )
> > +        return rc;
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    if ( PAGE_OFFSET(start) ||
> > +         (mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1) )
> > +    {
> > +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> > +                     PAGE_OFFSET(end) :
> > +                     (PAGE_SIZE - 1);
> > +        rc = subpage_mmio_ro_add_page(mfn_start,
> > +                                      PAGE_OFFSET(start),
> > +                                      offset_end);
> > +        if ( rc )
> > +            goto err_unlock;
> > +    }
> > +
> > +    if ( !mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1 )
> > +    {
> > +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> > +        if ( rc )
> > +            goto err_unlock_remove;
> > +    }
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    return 0;
> > +
> > + err_unlock_remove:
> > +    if ( offset_end )
> > +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
> > +
> > + err_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    if ( rangeset_remove_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end)) )
> > +        printk(XENLOG_ERR "Failed to cleanup on failed subpage_mmio_ro_add()\n");
> > +    return rc;
> > +}
> 
> None of the failures here is particularly likely, so perhaps all is fine as
> you have it. But there would be an alternative of retaining the
> mmio_ro_ranges entry/entries, allowing the caller to "ignore" the error.

That is an option here too, but note one of the potential failures is
about rangeset_add_range() itself, so one needs to be careful about
completely ignoring the error.

> > +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> > +{
> > +    if ( entry->mapped )
> > +        return entry->mapped;
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +    /* Re-check under the lock */
> > +    if ( entry->mapped )
> > +        goto out_unlock;
> > +
> > +    entry->mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> > +
> > + out_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    return entry->mapped;
> > +}
> 
> This is easy to deal with without any "goto".
> 
> I'm further inclined to request that the ioremap() occur without the lock
> held, followed by an iounmap() (after dropping the lock) if in fact the
> mapping wasn't needed (anymore).

Is it to limit the time the lock is held?

> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned int offset,
> > +    const void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> > +    void __iomem *addr;
> > +
> > +    rcu_read_lock(&subpage_ro_rcu);
> > +
> > +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> > +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> > +                goto write_ignored;
> 
> I think you can get away with just a single "goto" by putting the gprintk()
> (and its label) here.

The write_ignored label is also used below in the "default" case (which
should be unreachable, but still). Do you suggest jumping from the default
case into here?

> > +            addr = subpage_mmio_get_page(entry);
> > +            if ( !addr )
> > +            {
> > +                gprintk(XENLOG_ERR,
> > +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> > +                        mfn_x(mfn), offset);
> > +                goto out_unlock;
> > +            }
> > +
> > +            switch ( len )
> > +            {
> > +            case 1:
> > +                writeb(*(uint8_t*)data, addr);
> > +                break;
> > +            case 2:
> > +                writew(*(uint16_t*)data, addr);
> > +                break;
> > +            case 4:
> > +                writel(*(uint32_t*)data, addr);
> > +                break;
> > +            case 8:
> > +                writeq(*(uint64_t*)data, addr);
> > +                break;
> 
> Please avoid casting away const-ness.
> 
> > +            default:
> > +                /* mmio_ro_emulated_write() already validated the size */
> > +                ASSERT_UNREACHABLE();
> > +                goto write_ignored;
> > +            }
> > +            goto out_unlock;
> > +        }
> > +    }
> > +    /* Do not print message for pages without any writable parts. */
> > +    goto out_unlock;
> > +
> > + write_ignored:
> > +    gprintk(XENLOG_WARNING,
> > +             "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> > +             mfn_x(mfn), offset, len);
> 
> Nit: Indentation.
> 
> > + out_unlock:
> > +    rcu_read_unlock(&subpage_ro_rcu);
> > +}
> > +
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> > +{
> > +    unsigned int offset = PAGE_OFFSET(gla);
> > +    const struct subpage_ro_range *entry;
> > +
> > +    rcu_read_lock(&subpage_ro_rcu);
> > +
> > +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> > +        if ( mfn_eq(entry->mfn, mfn) &&
> > +             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> > +        {
> > +            /*
> > +             * We don't know the write seize at this point yet, so it could be
> 
> Nit: "size" I assume.
> 
> > +             * an unalligned write, but accept it here anyway and deal with it
> > +             * later.
> > +             */
> 
> Have I overlooked where unaligned writes are dealt with?

mmio_ro_emulated_write() handles that already.

> Also nit: "unaligned".
> 
> Jan
Jan Beulich July 5, 2023, 8:23 a.m. UTC | #4
On 01.07.2023 00:28, Marek Marczykowski-Górecki wrote:
> On Tue, May 30, 2023 at 01:56:34PM +0200, Jan Beulich wrote:
>> On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
>>> In some cases, only few registers on a page needs to be write-protected.
>>> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
>>> PBA table (which doesn't need to span the whole table either), although
>>> in the latter case the spec forbids placing other registers on the same
>>> page. Current API allows only marking whole pages pages read-only,
>>> which sometimes may cover other registers that guest may need to
>>> write into.
>>>
>>> Currently, when a guest tries to write to an MMIO page on the
>>> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
>>> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
>>> from userspace (like, /dev/mem), it will try to fixup by updating page
>>> tables (that Xen again will force to read-only) and will hit that #PF
>>> again (looping endlessly). Both behaviors are undesirable if guest could
>>> actually be allowed the write.
>>>
>>> Introduce an API that allows marking part of a page read-only. Since
>>> sub-page permissions are not a thing in page tables (they are in EPT,
>>> but not granular enough), do this via emulation (or simply page fault
>>> handler for PV) that handles writes that are supposed to be allowed.
>>> The new subpage_mmio_ro_add() takes a start physical address and the
>>> region size in bytes. Both start address and the size need to be 8-byte
>>
>> 8-byte (aka qword) here, but ...
>>
>>> aligned, as a practical simplification (allows using smaller bitmask,
>>> and a smaller granularity isn't really necessary right now).
>>> It will internally add relevant pages to mmio_ro_ranges, but if either
>>> start or end address is not page-aligned, it additionally adds that page
>>> to a list for sub-page R/O handling. The list holds a bitmask which
>>> dwords are supposed to be read-only and an address where page is mapped
>>
>> ... dwords here?
>>
>>> for write emulation - this mapping is done only on the first access. A
>>> plain list is used instead of more efficient structure, because there
>>> isn't supposed to be many pages needing this precise r/o control.
>>>
>>> The mechanism this API is plugged in is slightly different for PV and
>>> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
>>> it's already called for #PF on read-only MMIO page. For HVM however, EPT
>>> violation on p2m_mmio_direct page results in a direct domain_crash().
>>> To reach mmio_ro_emulated_write(), change how write violations for
>>> p2m_mmio_direct are handled - specifically, check if they relate to such
>>> partially protected page via subpage_mmio_write_accept() and if so, call
>>> hvm_emulate_one_mmio() for them too. This decodes what guest is trying
>>> write and finally calls mmio_ro_emulated_write(). Note that hitting EPT
>>> write violation for p2m_mmio_direct page can only happen if the page was
>>> on mmio_ro_ranges (see ept_p2m_type_to_flags()), so there is no need for
>>> checking that again.
>>
>> Yet that's then putting us on a certain risk wrt potential errata.
>>
>> You also specifically talk about "guests", i.e. more than just hwdom.
>> Adding another easy access to the emulator (for HVM) comes with a
>> certain risk of future XSAs, too.
> 
> There are a lot of ways where a guest with a PCI device can exercise the
> emulator already, no? And even without PCI passthrough, more or less any
> device backed by the device model allows that too. So, while emulator is
> a complex piece of code, I don't think this patch makes it easier
> accessible in any significant way.

Hmm, "easier" was perhaps the wrong term. The emulator doesn't normally
come into play when guests access MMIO regions of passed-through devices.
That's different here. You're right that the emulator is always involved
when accessing MMIO regions of emulated devices.

>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>>>          goto out_put_gfn;
>>>      }
>>>  
>>> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
>>> +         subpage_mmio_write_accept(mfn, gla) &&
>>> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
>>> +    {
>>> +        rc = 1;
>>> +        goto out_put_gfn;
>>> +    }
>>
>> But npfec.write_access set doesn't mean it was a write permission
>> violation, does it? 
> 
> Doesn't it? IIUC it means it was a write attempt, to a mapped page
> (npfec.present), and since we've got EPT violation, it got denied. 

But the denial may have been for reasons other than the W bit being
clear, at least in principle? Abusing the bit now, even if in
practice there was no other possible reason on existing hardware
with the features we presently use, might lead to hard to locate
issues in case a different reason appears down the road.

>>> --- a/xen/arch/x86/include/asm/mm.h
>>> +++ b/xen/arch/x86/include/asm/mm.h
>>> @@ -522,9 +522,24 @@ extern struct rangeset *mmio_ro_ranges;
>>>  void memguard_guard_stack(void *p);
>>>  void memguard_unguard_stack(void *p);
>>>  
>>> +/*
>>> + * Add more precise r/o marking for a MMIO page. Bytes range specified here
>>> + * will still be R/O, but the rest of the page (nor marked as R/O via another
>>> + * call) will have writes passed through.
>>> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
>>
>> With this alignment constraint, the earlier sentence can be read as
>> controdictory. How about "Byte-granular ranges ..." or "Ranges (using
>> byte granularity) ..."? I admit even that doesn't resolve the issue
>> fully, though.
> 
> I don't see where the issue is, alignment requirement doesn't change
> anything regarding the parameter units.

And I didn't say it would. What I said is that starting a sentence with
"Bytes ranges specified ..." is ambiguous in potentially suggesting to
a reader that arbitrary ranges at byte boundaries are permitted. Which
can then be taken as contradicted by the subsequent sentence about
alignment requirements.

> Anyway, I can change to one of the suggested versions.

Well, something even less ambiguous would of course be yet better.

>>> @@ -4882,6 +4895,243 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>      return 0;
>>>  }
>>>  
>>> +/* This needs subpage_ro_lock already taken */
>>> +static int __init subpage_mmio_ro_add_page(
>>> +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
>>> +{
>>> +    struct subpage_ro_range *entry = NULL, *iter;
>>> +    int i;
>>
>> unsigned int please (as almost always for induction variables).
>>
>>> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
>>> +    {
>>> +        if ( mfn_eq(iter->mfn, mfn) )
>>> +        {
>>> +            entry = iter;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if ( !entry )
>>> +    {
>>> +        /* iter==NULL marks it was a newly allocated entry */
>>
>> Nit: Even in a comment I think it would be nice if style rules were
>> followed, and hence == was surrounded by blanks.
>>
>>> +        iter = NULL;
>>> +        entry = xzalloc(struct subpage_ro_range);
>>> +        if ( !entry )
>>> +            return -ENOMEM;
>>> +        entry->mfn = mfn;
>>> +    }
>>> +
>>> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
>>> +        set_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
>>
>> You're holding a spin lock, so won't __set_bit() suffice here? And
>> then __clear_bit() below?
> 
> Right.
> 
>>> +    if ( !iter )
>>> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
>>> +
>>> +    return 0;
>>> +}
>>
>> Since you mark the qwords which are to be protected, how is one to set
>> up safely two discontiguous ranges on the same page? I think I had
>> asked for v1 already why you don't do things the other way around:
>> Initially the entire page is protected, and then writable regions are
>> carved out.
> 
> Because that's not how the API is used. This API is for those how want
> to write-protect some specific ranges (to be written exclusively by
> Xen). They don't need to even know what is else is on the same page.
> Take XHCI case as an example: it gets the range to write-protect by
> enumerating XHCI extended capabilities, which is a linked list. The
> driver gets info where XHCI console registers are.  Things before/after
> them on that page may not even be XHCI extended caps at all.
> This in fact is very similar approach to already existing
> mmio_ro_ranges.

While I agree there's a similarity, multiple entities caring about the
same MFN isn't an expected scenario there. Whereas here you explicitly
add support for such.

Furthermore you sub-divide pages covered by mmio_ro_ranges here, so
defaulting to "full page protected" and then carving out sub-regions
would be the more natural approach imo.

>> I guess I shouldn't further ask about overlapping r/o ranges and their
>> cleaning up. But at least a comment towards the restriction would be
>> nice. Perhaps even check upon registration that no part of the range
>> is already marked r/o.
> 
> Yes, this is a good suggestion, I'll add that.

As long as all restrictions are properly spelled out, this may be
sufficient. But please don't be surprised if I ask again on a
subsequent version.

>>> +static void __init subpage_mmio_ro_free(struct rcu_head *rcu)
>>> +{
>>> +    struct subpage_ro_range *entry = container_of(
>>> +        rcu, struct subpage_ro_range, rcu);
>>> +
>>> +    ASSERT(bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN));
>>> +
>>> +    if ( entry->mapped )
>>> +        iounmap(entry->mapped);
>>> +    xfree(entry);
>>> +}
>>> +
>>> +/* This needs subpage_ro_lock already taken */
>>> +static int __init subpage_mmio_ro_remove_page(
>>> +    mfn_t mfn,
>>> +    int offset_s,
>>> +    int offset_e)
>>> +{
>>> +    struct subpage_ro_range *entry = NULL, *iter;
>>> +    int rc, i;
>>> +
>>> +    list_for_each_entry_rcu(iter, &subpage_ro_ranges, list)
>>> +    {
>>> +        if ( mfn_eq(iter->mfn, mfn) )
>>> +        {
>>> +            entry = iter;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if ( !entry )
>>> +        return -ENOENT;
>>
>> Yet the sole caller doesn't care at all, not even by an assertion.
> 
> Indeed, right now it's called under the same spinlock that the page was
> added. Would you prefer to drop return value (not relevant until another
> use is added, beyond cleanup on failure), or add assertion?

Taking Misra into account, dropping unused return values would be better.

>>> +    if ( !size )
>>> +        return 0;
>>> +
>>> +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
>>> +    if ( rc )
>>> +        return rc;
>>> +
>>> +    spin_lock(&subpage_ro_lock);
>>> +
>>> +    if ( PAGE_OFFSET(start) ||
>>> +         (mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1) )
>>> +    {
>>> +        offset_end = mfn_eq(mfn_start, mfn_end) ?
>>> +                     PAGE_OFFSET(end) :
>>> +                     (PAGE_SIZE - 1);
>>> +        rc = subpage_mmio_ro_add_page(mfn_start,
>>> +                                      PAGE_OFFSET(start),
>>> +                                      offset_end);
>>> +        if ( rc )
>>> +            goto err_unlock;
>>> +    }
>>> +
>>> +    if ( !mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1 )
>>> +    {
>>> +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
>>> +        if ( rc )
>>> +            goto err_unlock_remove;
>>> +    }
>>> +
>>> +    spin_unlock(&subpage_ro_lock);
>>> +
>>> +    return 0;
>>> +
>>> + err_unlock_remove:
>>> +    if ( offset_end )
>>> +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
>>> +
>>> + err_unlock:
>>> +    spin_unlock(&subpage_ro_lock);
>>> +    if ( rangeset_remove_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end)) )
>>> +        printk(XENLOG_ERR "Failed to cleanup on failed subpage_mmio_ro_add()\n");
>>> +    return rc;
>>> +}
>>
>> None of the failures here is particularly likely, so perhaps all is fine as
>> you have it. But there would be an alternative of retaining the
>> mmio_ro_ranges entry/entries, allowing the caller to "ignore" the error.
> 
> That is an option here too, but note one of the potential failures is
> about rangeset_add_range() itself, so one needs to be careful about
> completely ignoring the error.

I agree a fair amount of care would be needed. Hence why I did only suggest
this as an alternative for consideration.

>>> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
>>> +{
>>> +    if ( entry->mapped )
>>> +        return entry->mapped;
>>> +
>>> +    spin_lock(&subpage_ro_lock);
>>> +    /* Re-check under the lock */
>>> +    if ( entry->mapped )
>>> +        goto out_unlock;
>>> +
>>> +    entry->mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
>>> +
>>> + out_unlock:
>>> +    spin_unlock(&subpage_ro_lock);
>>> +    return entry->mapped;
>>> +}
>>
>> This is easy to deal with without any "goto".
>>
>> I'm further inclined to request that the ioremap() occur without the lock
>> held, followed by an iounmap() (after dropping the lock) if in fact the
>> mapping wasn't needed (anymore).
> 
> Is it to limit time the lock is held?

That's only a secondary goal. The primary goal is to avoid lock nesting
where it can be avoided: All forms of allocation acquire some lock in
the process, and any lock nesting needs sufficient care to avoid lock
order inversion. As long as allocation is fully self-contained (no
callbacks, no call-outs or other hooks) the risk is low, but imo better
to avoid the risk altogether where (relatively) easily possible.
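
A sketch of the shape being asked for (illustration only, based on the
hunk quoted above): map without the lock held, publish the result under
the lock, and drop a losing mapping afterwards.

static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
{
    void __iomem *mapped;

    if ( entry->mapped )
        return entry->mapped;

    /* Map without holding the lock - ioremap() may take locks itself. */
    mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);

    spin_lock(&subpage_ro_lock);
    /* Publish only if nobody else won the race in the meantime. */
    if ( !entry->mapped )
        entry->mapped = mapped;
    spin_unlock(&subpage_ro_lock);

    /* If we lost the race, release our now-unneeded mapping. */
    if ( mapped && entry->mapped != mapped )
        iounmap(mapped);

    return entry->mapped;
}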

>>> +static void subpage_mmio_write_emulate(
>>> +    mfn_t mfn,
>>> +    unsigned int offset,
>>> +    const void *data,
>>> +    unsigned int len)
>>> +{
>>> +    struct subpage_ro_range *entry;
>>> +    void __iomem *addr;
>>> +
>>> +    rcu_read_lock(&subpage_ro_rcu);
>>> +
>>> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
>>> +    {
>>> +        if ( mfn_eq(entry->mfn, mfn) )
>>> +        {
>>> +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
>>> +                goto write_ignored;
>>
>> I think you can get away with just a single "goto" by putting the gprintk()
>> (and its label) here.
> 
> write_ignored label is used also below in "default" case (which should
> be unreachable, but still).

Right, which is why I said 'just a single "goto"' (implying a label would
still be needed).

> Do you suggest jumping from default case into here?

Yes, that would be one way of structuring things.

Jan
Marek Marczykowski-Górecki July 7, 2023, 11:02 a.m. UTC | #5
On Wed, Jul 05, 2023 at 10:23:53AM +0200, Jan Beulich wrote:
> On 01.07.2023 00:28, Marek Marczykowski-Górecki wrote:
> > On Tue, May 30, 2023 at 01:56:34PM +0200, Jan Beulich wrote:
> >> On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
> >>> --- a/xen/arch/x86/hvm/hvm.c
> >>> +++ b/xen/arch/x86/hvm/hvm.c
> >>> @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >>>          goto out_put_gfn;
> >>>      }
> >>>  
> >>> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> >>> +         subpage_mmio_write_accept(mfn, gla) &&
> >>> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> >>> +    {
> >>> +        rc = 1;
> >>> +        goto out_put_gfn;
> >>> +    }
> >>
> >> But npfec.write_access set doesn't mean it was a write permission
> >> violation, does it? 
> > 
> > Doesn't it? IIUC it means it was a write attempt, to a mapped page
> > (npfec.present), and since we've got EPT violation, it got denied. 
> 
> But the denial may have been for reasons other than the W bit being
> clear, at least in principle? Abusing the bit now, even if in
> practice there was no other possible reason on existing hardware
> with the features we presently use, might lead to hard to locate
> issues in case a different reason appears down the road.

Ok, so how do you propose to check if it was a write violation?

(...)

> >> Since you mark the qwords which are to be protected, how is one to set
> >> up safely two discontiguous ranges on the same page? I think I had
> >> asked for v1 already why you don't do things the other way around:
> >> Initially the entire page is protected, and then writable regions are
> >> carved out.
> > 
> > Because that's not how the API is used. This API is for those who want
> > to write-protect some specific ranges (to be written exclusively by
> > Xen). They don't need to even know what else is on the same page.
> > Take the XHCI case as an example: it gets the range to write-protect by
> > enumerating XHCI extended capabilities, which is a linked list. The
> > driver gets info where XHCI console registers are.  Things before/after
> > them on that page may not even be XHCI extended caps at all.
> > This in fact is a very similar approach to the already existing
> > mmio_ro_ranges.
> 
> While I agree there's a similarity, multiple entities caring about the
> same MFN isn't an expected scenario there. Whereas here you explicitly
> add support for such.
> 
> Furthermore you sub-divide pages covered by mmio_ro_ranges here, so
> defaulting to "full page protected" and then carving out sub-regions
> would be the more natural approach imo.

But then the API would be awkward to use. Instead of "mark this (smaller
than a page) region as read-only" so Xen can use it safely, you would
(likely) need to mark _two_ regions as writable after marking the page as
read-only. So you'd need either separate (3?) calls, an array of ranges,
or something similar.
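
To illustrate the difference in call shape, compare the sketch below; the
carve-out style functions and the dbc_regs_phys variable (standing for the
64-byte XHCI console register block) are purely hypothetical and do not
exist in this series:

    /* Current API: the caller only needs its own 64-byte register window. */
    rc = subpage_mmio_ro_add(dbc_regs_phys, 0x40);

    /* Carve-out alternative: the caller must also know the page layout. */
    rc = subpage_mmio_ro_add_page(dbc_regs_phys & PAGE_MASK);
    rc = subpage_mmio_rw_carve(dbc_regs_phys & PAGE_MASK,
                               PAGE_OFFSET(dbc_regs_phys));
    rc = subpage_mmio_rw_carve(dbc_regs_phys + 0x40,
                               PAGE_SIZE - PAGE_OFFSET(dbc_regs_phys) - 0x40);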

> >> I guess I shouldn't further ask about overlapping r/o ranges and their
> >> cleaning up. But at least a comment towards the restriction would be
> >> nice. Perhaps even check upon registration that no part of the range
> >> is already marked r/o.
> > 
> > Yes, this is a good suggestion, I'll add that.
> 
> As long as all restrictions are properly spelled out, this may be
> sufficient. But please don't be surprised if I ask again on a
> subsequent version.

(...)

> >>> +static void subpage_mmio_write_emulate(
> >>> +    mfn_t mfn,
> >>> +    unsigned int offset,
> >>> +    const void *data,
> >>> +    unsigned int len)
> >>> +{
> >>> +    struct subpage_ro_range *entry;
> >>> +    void __iomem *addr;
> >>> +
> >>> +    rcu_read_lock(&subpage_ro_rcu);
> >>> +
> >>> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> >>> +    {
> >>> +        if ( mfn_eq(entry->mfn, mfn) )
> >>> +        {
> >>> +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> >>> +                goto write_ignored;
> >>
> >> I think you can get away with just a single "goto" by putting the gprintk()
> >> (and its label) here.
> > 
> > write_ignored label is used also below in "default" case (which should
> > be unreachable, but still).
> 
> Right, which is why I said 'just a single "goto"' (implying a label would
> still be needed).
> 
> > Do you suggest jumping from default case into here?
> 
> Yes, that would be one way of structuring things.

IMO jumping into the middle of a nested conditional/loop is harder to
follow; I'd prefer to avoid such a practice. Furthermore, if the gprintk
is moved here, it still needs a "goto out_unlock", so it isn't really
saving any "goto", just changing its target.
Jan Beulich July 10, 2023, 7:04 a.m. UTC | #6
On 07.07.2023 13:02, Marek Marczykowski-Górecki wrote:
> On Wed, Jul 05, 2023 at 10:23:53AM +0200, Jan Beulich wrote:
>> On 01.07.2023 00:28, Marek Marczykowski-Górecki wrote:
>>> On Tue, May 30, 2023 at 01:56:34PM +0200, Jan Beulich wrote:
>>>> On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>> @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>>>>>          goto out_put_gfn;
>>>>>      }
>>>>>  
>>>>> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
>>>>> +         subpage_mmio_write_accept(mfn, gla) &&
>>>>> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
>>>>> +    {
>>>>> +        rc = 1;
>>>>> +        goto out_put_gfn;
>>>>> +    }
>>>>
>>>> But npfec.write_access set doesn't mean it was a write permission
>>>> violation, does it? 
>>>
>>> Doesn't it? IIUC it means it was a write attempt, to a mapped page
>>> (npfec.present), and since we've got EPT violation, it got denied. 
>>
>> But the denial may have been for reasons other than the W bit being
>> clear, at least in principle? Abusing the bit now, even if in
>> practice there was no other possible reason on existing hardware
>> with the features we presently use, might lead to hard to locate
>> issues in case a different reason appears down the road.
> 
> Ok, so how do you propose to check if it was a write violation?
> 
> (...)

Well, that's the thing - even on VMX, where more state is provided by
hardware than is conveyed here, this can't be done reliably (afaict).
Hence any "approximation" will have its safety towards false positives
or negatives justified.

>>>> Since you mark the qwords which are to be protected, how is one to set
>>>> up safely two discontiguous ranges on the same page? I think I had
>>>> asked for v1 already why you don't do things the other way around:
>>>> Initially the entire page is protected, and then writable regions are
>>>> carved out.
>>>
>>> Because that's not how the API is used. This API is for those who want
>>> to write-protect some specific ranges (to be written exclusively by
>>> Xen). They don't need to even know what else is on the same page.
>>> Take the XHCI case as an example: it gets the range to write-protect by
>>> enumerating XHCI extended capabilities, which is a linked list. The
>>> driver gets info where XHCI console registers are.  Things before/after
>>> them on that page may not even be XHCI extended caps at all.
>>> This in fact is a very similar approach to the already existing
>>> mmio_ro_ranges.
>>
>> While I agree there's a similarity, multiple entities caring about the
>> same MFN isn't an expected scenario there. Whereas here you explicitly
>> add support for such.
>>
>> Furthermore you sub-divide pages covered by mmio_ro_ranges here, so
>> defaulting to "full page protected" and then carving out sub-regions
>> would be the more natural approach imo.
> 
> But then the API would be awkward to use. Instead of "mark this (smaller
> than a page) region as read-only" so Xen can use it safely, you would
> (likely) need to mark _two_ regions as writable after marking the page as
> read-only. So you'd need either separate (3?) calls, an array of ranges,
> or something similar.

I understand that, and hence I'm willing to accept it provided ...

>>>> I guess I shouldn't further ask about overlapping r/o ranges and their
>>>> cleaning up. But at least a comment towards the restriction would be
>>>> nice. Perhaps even check upon registration that no part of the range
>>>> is already marked r/o.
>>>
>>> Yes, this is a good suggestion, I'll add that.
>>
>> As long as all restrictions are properly spelled out, this may be
>> sufficient. But please don't be surprised if I ask again on a
>> subsequent version.

... it's all spelled out (including the multi-subrange limitation
mentioned earlier, and left purposefully in context).

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 75ee98a73be8..0a64636fd9da 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2732,7 +2732,7 @@  int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
         .write      = mmio_ro_emulated_write,
         .validate   = hvmemul_validate,
     };
-    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
     struct hvm_emulate_ctxt ctxt;
     const struct x86_emulate_ops *ops;
     unsigned int seg, bdf;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index d7d31b53937a..0ed7ce31d5cf 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1990,6 +1990,14 @@  int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
         goto out_put_gfn;
     }
 
+    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
+         subpage_mmio_write_accept(mfn, gla) &&
+         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
+    {
+        rc = 1;
+        goto out_put_gfn;
+    }
+
     /* If we fell through, the vcpu will retry now that access restrictions have
      * been removed. It may fault again if the p2m entry type still requires so.
      * Otherwise, this is an error condition. */
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index db29e3e2059f..3867a92d2dfa 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -522,9 +522,24 @@  extern struct rangeset *mmio_ro_ranges;
 void memguard_guard_stack(void *p);
 void memguard_unguard_stack(void *p);
 
+/*
+ * Add more precise r/o marking for an MMIO page. The byte range specified
+ * here will still be R/O, but the rest of the page (not marked as R/O via
+ * another call) will have writes passed through.
+ * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
+ *
+ * Return values:
+ *  - negative: error
+ *  - 0: success
+ */
+#define SUBPAGE_MMIO_RO_ALIGN 8
+int subpage_mmio_ro_add(paddr_t start, size_t size);
+bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
+
 struct mmio_ro_emulate_ctxt {
         unsigned long cr2;
         unsigned int seg, bdf;
+        mfn_t mfn;
 };
 
 int cf_check mmio_ro_emulated_write(
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9741d28cbc96..59941a56c821 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -154,6 +154,19 @@  bool __read_mostly machine_to_phys_mapping_valid;
 
 struct rangeset *__read_mostly mmio_ro_ranges;
 
+/* Handling sub-page read-only MMIO regions */
+struct subpage_ro_range {
+    struct list_head list;
+    mfn_t mfn;
+    void __iomem *mapped;
+    DECLARE_BITMAP(ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
+    struct rcu_head rcu;
+};
+
+static LIST_HEAD(subpage_ro_ranges);
+static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
+static DEFINE_SPINLOCK(subpage_ro_lock);
+
 static uint32_t base_disallow_mask;
 /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
 #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
@@ -4882,6 +4895,243 @@  long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     return 0;
 }
 
+/* This needs subpage_ro_lock already taken */
+static int __init subpage_mmio_ro_add_page(
+    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int i;
+
+    list_for_each_entry(iter, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+    {
+        /* iter == NULL marks this as a newly allocated entry */
+        iter = NULL;
+        entry = xzalloc(struct subpage_ro_range);
+        if ( !entry )
+            return -ENOMEM;
+        entry->mfn = mfn;
+    }
+
+    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
+        set_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
+
+    if ( !iter )
+        list_add_rcu(&entry->list, &subpage_ro_ranges);
+
+    return 0;
+}
+
+static void __init subpage_mmio_ro_free(struct rcu_head *rcu)
+{
+    struct subpage_ro_range *entry = container_of(
+        rcu, struct subpage_ro_range, rcu);
+
+    ASSERT(bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN));
+
+    if ( entry->mapped )
+        iounmap(entry->mapped);
+    xfree(entry);
+}
+
+/* This needs subpage_ro_lock already taken */
+static int __init subpage_mmio_ro_remove_page(
+    mfn_t mfn,
+    int offset_s,
+    int offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int i;
+
+    list_for_each_entry_rcu(iter, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+        return -ENOENT;
+
+    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
+        clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
+
+    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
+        return 0;
+
+    list_del_rcu(&entry->list);
+    call_rcu(&entry->rcu, subpage_mmio_ro_free);
+    return 0;
+}
+
+
+int __init subpage_mmio_ro_add(
+    paddr_t start,
+    size_t size)
+{
+    mfn_t mfn_start = maddr_to_mfn(start);
+    paddr_t end = start + size - 1;
+    mfn_t mfn_end = maddr_to_mfn(end);
+    int offset_end = 0;
+    int rc;
+
+    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
+    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
+
+    if ( !size )
+        return 0;
+
+    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
+    if ( rc )
+        return rc;
+
+    spin_lock(&subpage_ro_lock);
+
+    if ( PAGE_OFFSET(start) ||
+         (mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1) )
+    {
+        offset_end = mfn_eq(mfn_start, mfn_end) ?
+                     PAGE_OFFSET(end) :
+                     (PAGE_SIZE - 1);
+        rc = subpage_mmio_ro_add_page(mfn_start,
+                                      PAGE_OFFSET(start),
+                                      offset_end);
+        if ( rc )
+            goto err_unlock;
+    }
+
+    if ( !mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1 )
+    {
+        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
+        if ( rc )
+            goto err_unlock_remove;
+    }
+
+    spin_unlock(&subpage_ro_lock);
+
+    return 0;
+
+ err_unlock_remove:
+    if ( offset_end )
+        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
+
+ err_unlock:
+    spin_unlock(&subpage_ro_lock);
+    if ( rangeset_remove_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end)) )
+        printk(XENLOG_ERR "Failed to cleanup on failed subpage_mmio_ro_add()\n");
+    return rc;
+}
+
+static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
+{
+    if ( entry->mapped )
+        return entry->mapped;
+
+    spin_lock(&subpage_ro_lock);
+    /* Re-check under the lock */
+    if ( entry->mapped )
+        goto out_unlock;
+
+    entry->mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
+
+ out_unlock:
+    spin_unlock(&subpage_ro_lock);
+    return entry->mapped;
+}
+
+static void subpage_mmio_write_emulate(
+    mfn_t mfn,
+    unsigned int offset,
+    const void *data,
+    unsigned int len)
+{
+    struct subpage_ro_range *entry;
+    void __iomem *addr;
+
+    rcu_read_lock(&subpage_ro_rcu);
+
+    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(entry->mfn, mfn) )
+        {
+            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
+                goto write_ignored;
+
+            addr = subpage_mmio_get_page(entry);
+            if ( !addr )
+            {
+                gprintk(XENLOG_ERR,
+                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
+                        mfn_x(mfn), offset);
+                goto out_unlock;
+            }
+
+            switch ( len )
+            {
+            case 1:
+                writeb(*(const uint8_t *)data, addr + offset);
+                break;
+            case 2:
+                writew(*(const uint16_t *)data, addr + offset);
+                break;
+            case 4:
+                writel(*(const uint32_t *)data, addr + offset);
+                break;
+            case 8:
+                writeq(*(const uint64_t *)data, addr + offset);
+                break;
+            default:
+                /* mmio_ro_emulated_write() already validated the size */
+                ASSERT_UNREACHABLE();
+                goto write_ignored;
+            }
+            goto out_unlock;
+        }
+    }
+    /* Do not print message for pages without any writable parts. */
+    goto out_unlock;
+
+ write_ignored:
+    gprintk(XENLOG_WARNING,
+             "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
+             mfn_x(mfn), offset, len);
+
+ out_unlock:
+    rcu_read_unlock(&subpage_ro_rcu);
+}
+
+bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
+{
+    unsigned int offset = PAGE_OFFSET(gla);
+    const struct subpage_ro_range *entry;
+
+    rcu_read_lock(&subpage_ro_rcu);
+
+    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
+        if ( mfn_eq(entry->mfn, mfn) &&
+             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
+        {
+            /*
+             * We don't know the write size at this point yet, so it could be
+             * an unaligned write, but accept it here anyway and deal with it
+             * later.
+             */
+            rcu_read_unlock(&subpage_ro_rcu);
+            return true;
+        }
+
+    rcu_read_unlock(&subpage_ro_rcu);
+    return false;
+}
+
 int cf_check mmio_ro_emulated_write(
     enum x86_segment seg,
     unsigned long offset,
@@ -4900,6 +5150,9 @@  int cf_check mmio_ro_emulated_write(
         return X86EMUL_UNHANDLEABLE;
     }
 
+    subpage_mmio_write_emulate(mmio_ro_ctxt->mfn, PAGE_OFFSET(offset),
+                               p_data, bytes);
+
     return X86EMUL_OKAY;
 }
 
diff --git a/xen/arch/x86/pv/ro-page-fault.c b/xen/arch/x86/pv/ro-page-fault.c
index cad28ef928ad..578bb4caaf0b 100644
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -330,6 +330,7 @@  static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
             return X86EMUL_UNHANDLEABLE;
     }
 
+    mmio_ro_ctxt.mfn = mfn;
     ctxt->data = &mmio_ro_ctxt;
     if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
         return x86_emulate(ctxt, &mmcfg_intercept_ops);
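
For completeness, a usage sketch of the new API from a caller's perspective
(illustrative only; the real consumer is the XHCI console driver added in
patch 2/2, whose exact code is not shown here):

/* Protect the (assumed) 64-byte console register window, nothing else. */
static int __init example_protect_xhci_dbc_regs(paddr_t dbc_regs_phys)
{
    /* Both address and size must be SUBPAGE_MMIO_RO_ALIGN (8) aligned. */
    ASSERT(IS_ALIGNED(dbc_regs_phys, SUBPAGE_MMIO_RO_ALIGN));

    return subpage_mmio_ro_add(dbc_regs_phys, 0x40);
}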