diff mbox series

[v3,1/2] x86/mm: add API for marking only part of a MMIO page read only

Message ID 2596964a55f710a300143aa5ae195566bacb4e00.1716260066.git-series.marmarek@invisiblethingslab.com (mailing list archive)
State Superseded
Headers show
Series Add API for making parts of a MMIO page R/O and use it in XHCI console | expand

Commit Message

Marek Marczykowski-Górecki May 21, 2024, 2:54 a.m. UTC
In some cases, only few registers on a page needs to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash().
To reach mmio_ro_emulated_write(), change how write violations for
p2m_mmio_direct are handled - specifically, check if they relate to such
partially protected page via subpage_mmio_write_accept() and if so, call
hvm_emulate_one_mmio() for them too. This decodes what guest is trying
write and finally calls mmio_ro_emulated_write(). The EPT write
violation is detected as npfec.write_access and npfec.present both being
true (similar to other places), which may cover some other (future?)
cases - if that happens, emulator might get involved unnecessarily, but
since it's limited to pages marked with subpage_mmio_ro_add() only, the
impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests, but only for pages explicitly marked with subpage_mmio_ro_add().
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v3:
- use unsigned int for loop iterators
- use __set_bit/__clear_bit when under spinlock
- avoid ioremap() under spinlock
- do not cast away const
- handle unaligned parameters in release build
- comment fixes
- remove RCU - the add functions are __init and actual usage is only
  much later after domains are running
- add checks overlapping ranges in debug build and document the
  limitations
- change subpage_mmio_ro_add() so the error path doesn't potentially
  remove pages from mmio_ro_ranges
- move printing message to avoid one goto in
  subpage_mmio_write_emulate()
Changes in v2:
- Simplify subpage_mmio_ro_add() parameters
- add to mmio_ro_ranges from within subpage_mmio_ro_add()
- use ioremap() instead of caller-provided fixmap
- use 8-bytes granularity (largest supported single write) and a bitmap
  instead of a rangeset
- clarify commit message
- change how it's plugged in for HVM domain, to not change the behavior for
  read-only parts (keep it hitting domain_crash(), instead of ignoring
  write)
- remove unused subpage_mmio_ro_remove()
---
 xen/arch/x86/hvm/emulate.c      |   2 +-
 xen/arch/x86/hvm/hvm.c          |   8 +-
 xen/arch/x86/include/asm/mm.h   |  18 ++-
 xen/arch/x86/mm.c               | 268 +++++++++++++++++++++++++++++++++-
 xen/arch/x86/pv/ro-page-fault.c |   1 +-
 5 files changed, 296 insertions(+), 1 deletion(-)

Comments

Jan Beulich May 21, 2024, 3:16 p.m. UTC | #1
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. Range specified here
> + * will still be R/O, but the rest of the page (not marked as R/O via another
> + * call) will have writes passed through.
> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> + *
> + * This API cannot be used for overlapping ranges, nor for pages already added
> + * to mmio_ro_ranges separately.
> + *
> + * Return values:
> + *  - negative: error
> + *  - 0: success
> + */
> +#define SUBPAGE_MMIO_RO_ALIGN 8

This isn't just alignment, but also (and perhaps more importantly) granularity.
I think the name wants to express this.

> @@ -4910,6 +4921,260 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return rc;
>  }
>  
> +/*
> + * Mark part of the page as R/O.
> + * Returns:
> + * - 0 on success - first range in the page
> + * - 1 on success - subsequent range in the page
> + * - <0 on error
> + *
> + * This needs subpage_ro_lock already taken */

Nit: Comment style (full stop and */ on its own line).

> +static int __init subpage_mmio_ro_add_page(
> +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    unsigned int i;
> +
> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter == NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xzalloc(struct subpage_ro_range);
> +        if ( !entry )
> +            return -ENOMEM;
> +        entry->mfn = mfn;
> +    }
> +
> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +    {
> +        int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
> +                                        entry->ro_qwords);

Why int, not bool?

> +        ASSERT(!oldbit);
> +    }
> +
> +    if ( !iter )
> +        list_add(&entry->list, &subpage_ro_ranges);
> +
> +    return iter ? 1 : 0;
> +}
> +
> +/* This needs subpage_ro_lock already taken */
> +static void __init subpage_mmio_ro_remove_page(
> +    mfn_t mfn,
> +    int offset_s,
> +    int offset_e)

Can either of these be negative? The more that ...

> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    unsigned int i;

... this is used ...

> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +        return;
> +
> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )

... with both of them?

> +        __clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> +
> +    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> +        return;
> +
> +    list_del(&entry->list);
> +    if ( entry->mapped )
> +        iounmap(entry->mapped);
> +    xfree(entry);
> +}
> +
> +int __init subpage_mmio_ro_add(
> +    paddr_t start,
> +    size_t size)
> +{
> +    mfn_t mfn_start = maddr_to_mfn(start);
> +    paddr_t end = start + size - 1;
> +    mfn_t mfn_end = maddr_to_mfn(end);
> +    unsigned int offset_end = 0;
> +    int rc;
> +    bool subpage_start, subpage_end;
> +
> +    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> +    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
> +    if ( !IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN) )
> +        size = ROUNDUP(size, SUBPAGE_MMIO_RO_ALIGN);
> +
> +    if ( !size )
> +        return 0;
> +
> +    if ( mfn_eq(mfn_start, mfn_end) )
> +    {
> +        /* Both starting and ending parts handled at once */
> +        subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
> +        subpage_end = false;
> +    }
> +    else
> +    {
> +        subpage_start = PAGE_OFFSET(start);
> +        subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> +    }
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    if ( subpage_start )
> +    {
> +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> +                     PAGE_OFFSET(end) :
> +                     (PAGE_SIZE - 1);
> +        rc = subpage_mmio_ro_add_page(mfn_start,
> +                                      PAGE_OFFSET(start),
> +                                      offset_end);
> +        if ( rc < 0 )
> +            goto err_unlock;
> +        /* Check if not marking R/W part of a page intended to be fully R/O */
> +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> +                                                  mfn_x(mfn_start)));
> +    }
> +
> +    if ( subpage_end )
> +    {
> +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> +        if ( rc < 0 )
> +            goto err_unlock_remove;
> +        /* Check if not marking R/W part of a page intended to be fully R/O */
> +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> +                                                  mfn_x(mfn_end)));
> +    }
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
> +    if ( rc )
> +        goto err_remove;
> +
> +    return 0;
> +
> + err_remove:
> +    spin_lock(&subpage_ro_lock);
> +    if ( subpage_end )
> +        subpage_mmio_ro_remove_page(mfn_end, 0, PAGE_OFFSET(end));
> + err_unlock_remove:
> +    if ( subpage_start )
> +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
> + err_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    return rc;
> +}
> +
> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> +{
> +    void __iomem *mapped_page;
> +
> +    if ( entry->mapped )
> +        return entry->mapped;
> +
> +    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);

mfn_to_maddr() or some such?

> +    spin_lock(&subpage_ro_lock);
> +    /* Re-check under the lock */
> +    if ( entry->mapped )
> +    {
> +        spin_unlock(&subpage_ro_lock);
> +        iounmap(mapped_page);

What if ioremap() failed?

Jan
Marek Marczykowski-Górecki May 21, 2024, 3:33 p.m. UTC | #2
On Tue, May 21, 2024 at 05:16:58PM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success
> > + */
> > +#define SUBPAGE_MMIO_RO_ALIGN 8
> 
> This isn't just alignment, but also (and perhaps more importantly) granularity.
> I think the name wants to express this.

SUBPAGE_MMIO_RO_GRANULARITY? Sounds a bit long...

> 
> > @@ -4910,6 +4921,260 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return rc;
> >  }
> >  
> > +/*
> > + * Mark part of the page as R/O.
> > + * Returns:
> > + * - 0 on success - first range in the page
> > + * - 1 on success - subsequent range in the page
> > + * - <0 on error
> > + *
> > + * This needs subpage_ro_lock already taken */
> 
> Nit: Comment style (full stop and */ on its own line).
> 
> > +static int __init subpage_mmio_ro_add_page(
> > +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    unsigned int i;
> > +
> > +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +    {
> > +        /* iter == NULL marks it was a newly allocated entry */
> > +        iter = NULL;
> > +        entry = xzalloc(struct subpage_ro_range);
> > +        if ( !entry )
> > +            return -ENOMEM;
> > +        entry->mfn = mfn;
> > +    }
> > +
> > +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> > +    {
> > +        int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
> > +                                        entry->ro_qwords);
> 
> Why int, not bool?

Because __test_and_set_bit returns int. But I can change to bool if you
prefer.

> > +        ASSERT(!oldbit);
> > +    }
> > +
> > +    if ( !iter )
> > +        list_add(&entry->list, &subpage_ro_ranges);
> > +
> > +    return iter ? 1 : 0;
> > +}
> > +
> > +/* This needs subpage_ro_lock already taken */
> > +static void __init subpage_mmio_ro_remove_page(
> > +    mfn_t mfn,
> > +    int offset_s,
> > +    int offset_e)
> 
> Can either of these be negative? The more that ...

Right, I can change them to unsigned. They are unsigned already in
subpage_mmio_ro_add_page.

> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    unsigned int i;
> 
> ... this is used ...
> 
> > +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +        return;
> > +
> > +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> 
> ... with both of them?
> 
> > +        __clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> > +
> > +    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> > +        return;
> > +
> > +    list_del(&entry->list);
> > +    if ( entry->mapped )
> > +        iounmap(entry->mapped);
> > +    xfree(entry);
> > +}
> > +
> > +int __init subpage_mmio_ro_add(
> > +    paddr_t start,
> > +    size_t size)
> > +{
> > +    mfn_t mfn_start = maddr_to_mfn(start);
> > +    paddr_t end = start + size - 1;
> > +    mfn_t mfn_end = maddr_to_mfn(end);
> > +    unsigned int offset_end = 0;
> > +    int rc;
> > +    bool subpage_start, subpage_end;
> > +
> > +    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> > +    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
> > +    if ( !IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN) )
> > +        size = ROUNDUP(size, SUBPAGE_MMIO_RO_ALIGN);
> > +
> > +    if ( !size )
> > +        return 0;
> > +
> > +    if ( mfn_eq(mfn_start, mfn_end) )
> > +    {
> > +        /* Both starting and ending parts handled at once */
> > +        subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +        subpage_end = false;
> > +    }
> > +    else
> > +    {
> > +        subpage_start = PAGE_OFFSET(start);
> > +        subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +    }
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    if ( subpage_start )
> > +    {
> > +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> > +                     PAGE_OFFSET(end) :
> > +                     (PAGE_SIZE - 1);
> > +        rc = subpage_mmio_ro_add_page(mfn_start,
> > +                                      PAGE_OFFSET(start),
> > +                                      offset_end);
> > +        if ( rc < 0 )
> > +            goto err_unlock;
> > +        /* Check if not marking R/W part of a page intended to be fully R/O */
> > +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > +                                                  mfn_x(mfn_start)));
> > +    }
> > +
> > +    if ( subpage_end )
> > +    {
> > +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> > +        if ( rc < 0 )
> > +            goto err_unlock_remove;
> > +        /* Check if not marking R/W part of a page intended to be fully R/O */
> > +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > +                                                  mfn_x(mfn_end)));
> > +    }
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
> > +    if ( rc )
> > +        goto err_remove;
> > +
> > +    return 0;
> > +
> > + err_remove:
> > +    spin_lock(&subpage_ro_lock);
> > +    if ( subpage_end )
> > +        subpage_mmio_ro_remove_page(mfn_end, 0, PAGE_OFFSET(end));
> > + err_unlock_remove:
> > +    if ( subpage_start )
> > +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
> > + err_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    return rc;
> > +}
> > +
> > +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> > +{
> > +    void __iomem *mapped_page;
> > +
> > +    if ( entry->mapped )
> > +        return entry->mapped;
> > +
> > +    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> 
> mfn_to_maddr() or some such?

Makes sense.

> > +    spin_lock(&subpage_ro_lock);
> > +    /* Re-check under the lock */
> > +    if ( entry->mapped )
> > +    {
> > +        spin_unlock(&subpage_ro_lock);
> > +        iounmap(mapped_page);
> 
> What if ioremap() failed?

Good point.
Jan Beulich May 22, 2024, 5:55 a.m. UTC | #3
On 21.05.2024 17:33, Marek Marczykowski-Górecki wrote:
> On Tue, May 21, 2024 at 05:16:58PM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/arch/x86/include/asm/mm.h
>>> +++ b/xen/arch/x86/include/asm/mm.h
>>> @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
>>>  void memguard_guard_stack(void *p);
>>>  void memguard_unguard_stack(void *p);
>>>  
>>> +/*
>>> + * Add more precise r/o marking for a MMIO page. Range specified here
>>> + * will still be R/O, but the rest of the page (not marked as R/O via another
>>> + * call) will have writes passed through.
>>> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
>>> + *
>>> + * This API cannot be used for overlapping ranges, nor for pages already added
>>> + * to mmio_ro_ranges separately.
>>> + *
>>> + * Return values:
>>> + *  - negative: error
>>> + *  - 0: success
>>> + */
>>> +#define SUBPAGE_MMIO_RO_ALIGN 8
>>
>> This isn't just alignment, but also (and perhaps more importantly) granularity.
>> I think the name wants to express this.
> 
> SUBPAGE_MMIO_RO_GRANULARITY? Sounds a bit long...

..._GRAN? ..._CHUNK? ..._UNIT? (Perhaps also want to have MMIO_RO_ first.)

>>> +static int __init subpage_mmio_ro_add_page(
>>> +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
>>> +{
>>> +    struct subpage_ro_range *entry = NULL, *iter;
>>> +    unsigned int i;
>>> +
>>> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
>>> +    {
>>> +        if ( mfn_eq(iter->mfn, mfn) )
>>> +        {
>>> +            entry = iter;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if ( !entry )
>>> +    {
>>> +        /* iter == NULL marks it was a newly allocated entry */
>>> +        iter = NULL;
>>> +        entry = xzalloc(struct subpage_ro_range);
>>> +        if ( !entry )
>>> +            return -ENOMEM;
>>> +        entry->mfn = mfn;
>>> +    }
>>> +
>>> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
>>> +    {
>>> +        int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
>>> +                                        entry->ro_qwords);
>>
>> Why int, not bool?
> 
> Because __test_and_set_bit returns int. But I can change to bool if you
> prefer.

test_bit() and the test-and set of functions should eventually all change
to be returning bool. One less use site to then also take care of.

Jan
Jan Beulich May 22, 2024, 7:52 a.m. UTC | #4
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>          goto out_put_gfn;
>      }
>  
> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> +         subpage_mmio_write_accept(mfn, gla) &&

Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?

> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> +    {
> +        rc = 1;
> +        goto out_put_gfn;
> +    }

Overall this new if() is pretty similar to the immediate preceding one.
So similar that I wonder whether the two shouldn't be folded. In fact
it looks as if the new one is needed only for the case where you'd pass
through (to a DomU) a device partially used by Xen. That could certainly
do with mentioning explicitly.

> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)

Considering what the function does and what it returns, perhaps better
s/get/map/? The "get_page" part of the name generally has a different
meaning in Xen's memory management.

> +{
> +    void __iomem *mapped_page;
> +
> +    if ( entry->mapped )
> +        return entry->mapped;
> +
> +    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +    /* Re-check under the lock */
> +    if ( entry->mapped )
> +    {
> +        spin_unlock(&subpage_ro_lock);
> +        iounmap(mapped_page);

The only unmap is on an error path here and on another error path elsewhere.
IOW it looks as if devices with such marked pages are meant to never be hot
unplugged. I can see that being intentional for the XHCI console, but imo
such a restriction also needs prominently calling out in a comment next to
e.g. the function declaration.

> +        return entry->mapped;
> +    }
> +
> +    entry->mapped = mapped_page;
> +    spin_unlock(&subpage_ro_lock);
> +    return entry->mapped;
> +}
> +
> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned int offset,
> +    const void *data,
> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;

Wouldn't this better be pointer-to-volatile, with ...

> +    list_for_each_entry(entry, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +            {
> + write_ignored:
> +                gprintk(XENLOG_WARNING,
> +                        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> +                        mfn_x(mfn), offset, len);
> +                return;
> +            }
> +
> +            addr = subpage_mmio_get_page(entry);
> +            if ( !addr )
> +            {
> +                gprintk(XENLOG_ERR,
> +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> +                        mfn_x(mfn), offset);
> +                return;
> +            }
> +
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(const uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(const uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(const uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(const uint64_t*)data, addr);
> +                break;

... this being how it's written? (If so, volatile suitably carried through to
other places as well.)

> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();
> +                goto write_ignored;
> +            }
> +            return;
> +        }
> +    }
> +    /* Do not print message for pages without any writable parts. */
> +}
> +
> +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> +{
> +    unsigned int offset = PAGE_OFFSET(gla);
> +    const struct subpage_ro_range *entry;
> +
> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)

Considering the other remark about respective devices impossible to go
away, is the RCU form here really needed? Its use gives the (false)
impression of entry removal being possible.

> +        if ( mfn_eq(entry->mfn, mfn) &&
> +             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )

Btw, "qwords" in the field name is kind of odd when SUBPAGE_MMIO_RO_ALIGN
in principle suggests that changing granularity ought to be possible by
simply adjusting that #define. Maybe "->ro_elems"?

> --- a/xen/arch/x86/pv/ro-page-fault.c
> +++ b/xen/arch/x86/pv/ro-page-fault.c
> @@ -330,6 +330,7 @@ static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
>              return X86EMUL_UNHANDLEABLE;
>      }
>  
> +    mmio_ro_ctxt.mfn = mfn;
>      ctxt->data = &mmio_ro_ctxt;
>      if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
>          return x86_emulate(ctxt, &mmcfg_intercept_ops);

Wouldn't you better set .mfn only on the "else" path, just out of context?
Suggesting that the new field in the struct could actually overlay the
(seg,bdf) tuple (being of relevance only to MMCFG intercept handling).
This would be more for documentation purposes than to actually save space.
(If so, perhaps the "else" itself would also better be dropped while making
the adjustment.)

Jan
Marek Marczykowski-Górecki May 22, 2024, 10:36 a.m. UTC | #5
On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >          goto out_put_gfn;
> >      }
> >  
> > +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> > +         subpage_mmio_write_accept(mfn, gla) &&
> 
> Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?

Right, the PV path hits mmio_ro_emulated_write() without my changes
already.
Do you suggest to make subpage_mmio_write_accept() under #ifdef
CONFIG_HVM?

> > +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> > +    {
> > +        rc = 1;
> > +        goto out_put_gfn;
> > +    }
> 
> Overall this new if() is pretty similar to the immediate preceding one.
> So similar that I wonder whether the two shouldn't be folded. 

I can do that if you prefer.

> In fact
> it looks as if the new one is needed only for the case where you'd pass
> through (to a DomU) a device partially used by Xen. That could certainly
> do with mentioning explicitly.

Well, the change in mmio_ro_emulated_write() is relevant to both dom0
and domU. It simply wasn't reachable (in this case) for HVM domU before
(but was for PV already).

> > +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> 
> Considering what the function does and what it returns, perhaps better
> s/get/map/? The "get_page" part of the name generally has a different
> meaning in Xen's memory management.

Ok.

> > +{
> > +    void __iomem *mapped_page;
> > +
> > +    if ( entry->mapped )
> > +        return entry->mapped;
> > +
> > +    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +    /* Re-check under the lock */
> > +    if ( entry->mapped )
> > +    {
> > +        spin_unlock(&subpage_ro_lock);
> > +        iounmap(mapped_page);
> 
> The only unmap is on an error path here and on another error path elsewhere.
> IOW it looks as if devices with such marked pages are meant to never be hot
> unplugged. I can see that being intentional for the XHCI console, but imo
> such a restriction also needs prominently calling out in a comment next to
> e.g. the function declaration.

The v1 included subpage_mmio_ro_remove() function (which would need to
be used in case of hot-unplug of such device, if desirable), but since
this series doesn't introduce any use of it (as you say, it isn't
desirable for XHCI console specifically), you asked me to remove it...

Should I add an explicit comment about the limitation, instead of having
it implicit by not having subpage_mmio_ro_remove() there?

> > +        return entry->mapped;
> > +    }
> > +
> > +    entry->mapped = mapped_page;
> > +    spin_unlock(&subpage_ro_lock);
> > +    return entry->mapped;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned int offset,
> > +    const void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> > +    void __iomem *addr;
> 
> Wouldn't this better be pointer-to-volatile, with ...
> 
> > +    list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> > +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> > +            {
> > + write_ignored:
> > +                gprintk(XENLOG_WARNING,
> > +                        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> > +                        mfn_x(mfn), offset, len);
> > +                return;
> > +            }
> > +
> > +            addr = subpage_mmio_get_page(entry);
> > +            if ( !addr )
> > +            {
> > +                gprintk(XENLOG_ERR,
> > +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> > +                        mfn_x(mfn), offset);
> > +                return;
> > +            }
> > +
> > +            switch ( len )
> > +            {
> > +            case 1:
> > +                writeb(*(const uint8_t*)data, addr);
> > +                break;
> > +            case 2:
> > +                writew(*(const uint16_t*)data, addr);
> > +                break;
> > +            case 4:
> > +                writel(*(const uint32_t*)data, addr);
> > +                break;
> > +            case 8:
> > +                writeq(*(const uint64_t*)data, addr);
> > +                break;
> 
> ... this being how it's written? (If so, volatile suitably carried through to
> other places as well.)
> 
> > +            default:
> > +                /* mmio_ro_emulated_write() already validated the size */
> > +                ASSERT_UNREACHABLE();
> > +                goto write_ignored;
> > +            }
> > +            return;
> > +        }
> > +    }
> > +    /* Do not print message for pages without any writable parts. */
> > +}
> > +
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> > +{
> > +    unsigned int offset = PAGE_OFFSET(gla);
> > +    const struct subpage_ro_range *entry;
> > +
> > +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> 
> Considering the other remark about respective devices impossible to go
> away, is the RCU form here really needed? Its use gives the (false)
> impression of entry removal being possible.

Right, I forgot to change this one.

> > +        if ( mfn_eq(entry->mfn, mfn) &&
> > +             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> 
> Btw, "qwords" in the field name is kind of odd when SUBPAGE_MMIO_RO_ALIGN
> in principle suggests that changing granularity ought to be possible by
> simply adjusting that #define. Maybe "->ro_elems"?

Makes sense.

> > --- a/xen/arch/x86/pv/ro-page-fault.c
> > +++ b/xen/arch/x86/pv/ro-page-fault.c
> > @@ -330,6 +330,7 @@ static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
> >              return X86EMUL_UNHANDLEABLE;
> >      }
> >  
> > +    mmio_ro_ctxt.mfn = mfn;
> >      ctxt->data = &mmio_ro_ctxt;
> >      if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
> >          return x86_emulate(ctxt, &mmcfg_intercept_ops);
> 
> Wouldn't you better set .mfn only on the "else" path, just out of context?
> Suggesting that the new field in the struct could actually overlay the
> (seg,bdf) tuple (being of relevance only to MMCFG intercept handling).
> This would be more for documentation purposes than to actually save space.
> (If so, perhaps the "else" itself would also better be dropped while making
> the adjustment.)

I can do that if you prefer. But personally, I find such such use of an
union risky (without some means for a compiler to actually enforce their
proper use) - while for correct code it may save some space, it makes
the impact of a type confusion bug potentially worse - now that the
unexpected value would be potentially attacker controlled.
For a documentation purpose I can simply add a comment.
Jan Beulich May 22, 2024, 12:50 p.m. UTC | #6
On 22.05.2024 12:36, Marek Marczykowski-Górecki wrote:
> On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>>>          goto out_put_gfn;
>>>      }
>>>  
>>> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
>>> +         subpage_mmio_write_accept(mfn, gla) &&
>>
>> Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?
> 
> Right, the PV path hits mmio_ro_emulated_write() without my changes
> already.
> Do you suggest to make subpage_mmio_write_accept() under #ifdef
> CONFIG_HVM?

That's not just me, but also Misra.

>>> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
>>> +    {
>>> +        rc = 1;
>>> +        goto out_put_gfn;
>>> +    }
>>
>> Overall this new if() is pretty similar to the immediate preceding one.
>> So similar that I wonder whether the two shouldn't be folded. 
> 
> I can do that if you prefer.
> 
>> In fact
>> it looks as if the new one is needed only for the case where you'd pass
>> through (to a DomU) a device partially used by Xen. That could certainly
>> do with mentioning explicitly.
> 
> Well, the change in mmio_ro_emulated_write() is relevant to both dom0
> and domU. It simply wasn't reachable (in this case) for HVM domU before
> (but was for PV already).

The remark was about the code here only. Of course that other change you
talk about is needed for both, and I wasn't meaning to suggest Dom0 had
worked (in this regard) prior to your change.

>>> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
>>> +{
>>> +    void __iomem *mapped_page;
>>> +
>>> +    if ( entry->mapped )
>>> +        return entry->mapped;
>>> +
>>> +    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
>>> +
>>> +    spin_lock(&subpage_ro_lock);
>>> +    /* Re-check under the lock */
>>> +    if ( entry->mapped )
>>> +    {
>>> +        spin_unlock(&subpage_ro_lock);
>>> +        iounmap(mapped_page);
>>
>> The only unmap is on an error path here and on another error path elsewhere.
>> IOW it looks as if devices with such marked pages are meant to never be hot
>> unplugged. I can see that being intentional for the XHCI console, but imo
>> such a restriction also needs prominently calling out in a comment next to
>> e.g. the function declaration.
> 
> The v1 included subpage_mmio_ro_remove() function (which would need to
> be used in case of hot-unplug of such device, if desirable), but since
> this series doesn't introduce any use of it (as you say, it isn't
> desirable for XHCI console specifically), you asked me to remove it...
> 
> Should I add an explicit comment about the limitation, instead of having
> it implicit by not having subpage_mmio_ro_remove() there?

That's what I was asking for in my earlier comment, yes.

>>> --- a/xen/arch/x86/pv/ro-page-fault.c
>>> +++ b/xen/arch/x86/pv/ro-page-fault.c
>>> @@ -330,6 +330,7 @@ static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
>>>              return X86EMUL_UNHANDLEABLE;
>>>      }
>>>  
>>> +    mmio_ro_ctxt.mfn = mfn;
>>>      ctxt->data = &mmio_ro_ctxt;
>>>      if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
>>>          return x86_emulate(ctxt, &mmcfg_intercept_ops);
>>
>> Wouldn't you better set .mfn only on the "else" path, just out of context?
>> Suggesting that the new field in the struct could actually overlay the
>> (seg,bdf) tuple (being of relevance only to MMCFG intercept handling).
>> This would be more for documentation purposes than to actually save space.
>> (If so, perhaps the "else" itself would also better be dropped while making
>> the adjustment.)
> 
> I can do that if you prefer. But personally, I find such such use of an
> union risky (without some means for a compiler to actually enforce their
> proper use) - while for correct code it may save some space, it makes
> the impact of a type confusion bug potentially worse - now that the
> unexpected value would be potentially attacker controlled.
> For a documentation purpose I can simply add a comment.

Well, I'm not going to insist on using a union. But I am pretty firm on
expecting the setting of .mfn to move down. Not using a union will then
mean static analysis tools may point out that .mfn is left uninitialized
for the above visible 1st invocation of x86_emulate().

Jan
Marek Marczykowski-Górecki May 22, 2024, 1:22 p.m. UTC | #7
On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned int offset,
> > +    const void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> > +    void __iomem *addr;
> 
> Wouldn't this better be pointer-to-volatile, with ...

Shouldn't then most other uses of __iomem in the code base be this way
too? I see volatile only in few places...

> > +    list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> > +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> > +            {
> > + write_ignored:
> > +                gprintk(XENLOG_WARNING,
> > +                        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> > +                        mfn_x(mfn), offset, len);
> > +                return;
> > +            }
> > +
> > +            addr = subpage_mmio_get_page(entry);
> > +            if ( !addr )
> > +            {
> > +                gprintk(XENLOG_ERR,
> > +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> > +                        mfn_x(mfn), offset);
> > +                return;
> > +            }
> > +
> > +            switch ( len )
> > +            {
> > +            case 1:
> > +                writeb(*(const uint8_t*)data, addr);
> > +                break;
> > +            case 2:
> > +                writew(*(const uint16_t*)data, addr);
> > +                break;
> > +            case 4:
> > +                writel(*(const uint32_t*)data, addr);
> > +                break;
> > +            case 8:
> > +                writeq(*(const uint64_t*)data, addr);
> > +                break;
> 
> ... this being how it's written? (If so, volatile suitably carried through to
> other places as well.)
Jan Beulich May 22, 2024, 1:29 p.m. UTC | #8
On 22.05.2024 15:22, Marek Marczykowski-Górecki wrote:
> On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> +static void subpage_mmio_write_emulate(
>>> +    mfn_t mfn,
>>> +    unsigned int offset,
>>> +    const void *data,
>>> +    unsigned int len)
>>> +{
>>> +    struct subpage_ro_range *entry;
>>> +    void __iomem *addr;
>>
>> Wouldn't this better be pointer-to-volatile, with ...
> 
> Shouldn't then most other uses of __iomem in the code base be this way
> too? I see volatile only in few places...

Quite likely, yet being consistent at least in new code is going to be
at least desirable.

Jan
Marek Marczykowski-Górecki May 22, 2024, 3:22 p.m. UTC | #9
On Wed, May 22, 2024 at 03:29:51PM +0200, Jan Beulich wrote:
> On 22.05.2024 15:22, Marek Marczykowski-Górecki wrote:
> > On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> >> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> >>> +static void subpage_mmio_write_emulate(
> >>> +    mfn_t mfn,
> >>> +    unsigned int offset,
> >>> +    const void *data,
> >>> +    unsigned int len)
> >>> +{
> >>> +    struct subpage_ro_range *entry;
> >>> +    void __iomem *addr;
> >>
> >> Wouldn't this better be pointer-to-volatile, with ...
> > 
> > Shouldn't then most other uses of __iomem in the code base be this way
> > too? I see volatile only in few places...
> 
> Quite likely, yet being consistent at least in new code is going to be
> at least desirable.

I tried. Build fails because iounmap() doesn't declare its argument as
volatile, so it triggers -Werror=discarded-qualifiers...

I'll change it just in subpage_mmio_write_emulate(), but leave
subpage_mmio_map_page() (was _get_page) without volatile.
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index ab1bc516839a..e98513afc69b 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2735,7 +2735,7 @@  int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
         .write      = mmio_ro_emulated_write,
         .validate   = hvmemul_validate,
     };
-    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
     struct hvm_emulate_ctxt ctxt;
     const struct x86_emulate_ops *ops;
     unsigned int seg, bdf;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 9594e0a5c530..9898e3fdb9ec 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2009,6 +2009,14 @@  int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
         goto out_put_gfn;
     }
 
+    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
+         subpage_mmio_write_accept(mfn, gla) &&
+         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
+    {
+        rc = 1;
+        goto out_put_gfn;
+    }
+
     /* If we fell through, the vcpu will retry now that access restrictions have
      * been removed. It may fault again if the p2m entry type still requires so.
      * Otherwise, this is an error condition. */
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 98b66edaca5e..a8bbed71013e 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -522,9 +522,27 @@  extern struct rangeset *mmio_ro_ranges;
 void memguard_guard_stack(void *p);
 void memguard_unguard_stack(void *p);
 
+/*
+ * Add more precise r/o marking for a MMIO page. Range specified here
+ * will still be R/O, but the rest of the page (not marked as R/O via another
+ * call) will have writes passed through.
+ * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
+ *
+ * This API cannot be used for overlapping ranges, nor for pages already added
+ * to mmio_ro_ranges separately.
+ *
+ * Return values:
+ *  - negative: error
+ *  - 0: success
+ */
+#define SUBPAGE_MMIO_RO_ALIGN 8
+int subpage_mmio_ro_add(paddr_t start, size_t size);
+bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
+
 struct mmio_ro_emulate_ctxt {
         unsigned long cr2;
         unsigned int seg, bdf;
+        mfn_t mfn;
 };
 
 int cf_check mmio_ro_emulated_write(
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d968bbbc7315..e601c8c98665 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -150,6 +150,17 @@  bool __read_mostly machine_to_phys_mapping_valid;
 
 struct rangeset *__read_mostly mmio_ro_ranges;
 
+/* Handling sub-page read-only MMIO regions */
+struct subpage_ro_range {
+    struct list_head list;
+    mfn_t mfn;
+    void __iomem *mapped;
+    DECLARE_BITMAP(ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
+};
+
+static LIST_HEAD(subpage_ro_ranges);
+static DEFINE_SPINLOCK(subpage_ro_lock);
+
 static uint32_t base_disallow_mask;
 /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
 #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
@@ -4910,6 +4921,260 @@  long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     return rc;
 }
 
+/*
+ * Mark part of the page as R/O.
+ * Returns:
+ * - 0 on success - first range in the page
+ * - 1 on success - subsequent range in the page
+ * - <0 on error
+ *
+ * This needs subpage_ro_lock already taken */
+static int __init subpage_mmio_ro_add_page(
+    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    unsigned int i;
+
+    list_for_each_entry(iter, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+    {
+        /* iter == NULL marks it was a newly allocated entry */
+        iter = NULL;
+        entry = xzalloc(struct subpage_ro_range);
+        if ( !entry )
+            return -ENOMEM;
+        entry->mfn = mfn;
+    }
+
+    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
+    {
+        int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
+                                        entry->ro_qwords);
+        ASSERT(!oldbit);
+    }
+
+    if ( !iter )
+        list_add(&entry->list, &subpage_ro_ranges);
+
+    return iter ? 1 : 0;
+}
+
+/* This needs subpage_ro_lock already taken */
+static void __init subpage_mmio_ro_remove_page(
+    mfn_t mfn,
+    int offset_s,
+    int offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    unsigned int i;
+
+    list_for_each_entry(iter, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+        return;
+
+    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
+        __clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
+
+    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
+        return;
+
+    list_del(&entry->list);
+    if ( entry->mapped )
+        iounmap(entry->mapped);
+    xfree(entry);
+}
+
+int __init subpage_mmio_ro_add(
+    paddr_t start,
+    size_t size)
+{
+    mfn_t mfn_start = maddr_to_mfn(start);
+    paddr_t end = start + size - 1;
+    mfn_t mfn_end = maddr_to_mfn(end);
+    unsigned int offset_end = 0;
+    int rc;
+    bool subpage_start, subpage_end;
+
+    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
+    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
+    if ( !IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN) )
+        size = ROUNDUP(size, SUBPAGE_MMIO_RO_ALIGN);
+
+    if ( !size )
+        return 0;
+
+    if ( mfn_eq(mfn_start, mfn_end) )
+    {
+        /* Both starting and ending parts handled at once */
+        subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
+        subpage_end = false;
+    }
+    else
+    {
+        subpage_start = PAGE_OFFSET(start);
+        subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
+    }
+
+    spin_lock(&subpage_ro_lock);
+
+    if ( subpage_start )
+    {
+        offset_end = mfn_eq(mfn_start, mfn_end) ?
+                     PAGE_OFFSET(end) :
+                     (PAGE_SIZE - 1);
+        rc = subpage_mmio_ro_add_page(mfn_start,
+                                      PAGE_OFFSET(start),
+                                      offset_end);
+        if ( rc < 0 )
+            goto err_unlock;
+        /* Check if not marking R/W part of a page intended to be fully R/O */
+        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
+                                                  mfn_x(mfn_start)));
+    }
+
+    if ( subpage_end )
+    {
+        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
+        if ( rc < 0 )
+            goto err_unlock_remove;
+        /* Check if not marking R/W part of a page intended to be fully R/O */
+        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
+                                                  mfn_x(mfn_end)));
+    }
+
+    spin_unlock(&subpage_ro_lock);
+
+    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
+    if ( rc )
+        goto err_remove;
+
+    return 0;
+
+ err_remove:
+    spin_lock(&subpage_ro_lock);
+    if ( subpage_end )
+        subpage_mmio_ro_remove_page(mfn_end, 0, PAGE_OFFSET(end));
+ err_unlock_remove:
+    if ( subpage_start )
+        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
+ err_unlock:
+    spin_unlock(&subpage_ro_lock);
+    return rc;
+}
+
+static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
+{
+    void __iomem *mapped_page;
+
+    if ( entry->mapped )
+        return entry->mapped;
+
+    mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
+
+    spin_lock(&subpage_ro_lock);
+    /* Re-check under the lock */
+    if ( entry->mapped )
+    {
+        spin_unlock(&subpage_ro_lock);
+        iounmap(mapped_page);
+        return entry->mapped;
+    }
+
+    entry->mapped = mapped_page;
+    spin_unlock(&subpage_ro_lock);
+    return entry->mapped;
+}
+
+static void subpage_mmio_write_emulate(
+    mfn_t mfn,
+    unsigned int offset,
+    const void *data,
+    unsigned int len)
+{
+    struct subpage_ro_range *entry;
+    void __iomem *addr;
+
+    list_for_each_entry(entry, &subpage_ro_ranges, list)
+    {
+        if ( mfn_eq(entry->mfn, mfn) )
+        {
+            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
+            {
+ write_ignored:
+                gprintk(XENLOG_WARNING,
+                        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
+                        mfn_x(mfn), offset, len);
+                return;
+            }
+
+            addr = subpage_mmio_get_page(entry);
+            if ( !addr )
+            {
+                gprintk(XENLOG_ERR,
+                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
+                        mfn_x(mfn), offset);
+                return;
+            }
+
+            switch ( len )
+            {
+            case 1:
+                writeb(*(const uint8_t*)data, addr);
+                break;
+            case 2:
+                writew(*(const uint16_t*)data, addr);
+                break;
+            case 4:
+                writel(*(const uint32_t*)data, addr);
+                break;
+            case 8:
+                writeq(*(const uint64_t*)data, addr);
+                break;
+            default:
+                /* mmio_ro_emulated_write() already validated the size */
+                ASSERT_UNREACHABLE();
+                goto write_ignored;
+            }
+            return;
+        }
+    }
+    /* Do not print message for pages without any writable parts. */
+}
+
+bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
+{
+    unsigned int offset = PAGE_OFFSET(gla);
+    const struct subpage_ro_range *entry;
+
+    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
+        if ( mfn_eq(entry->mfn, mfn) &&
+             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
+        {
+            /*
+             * We don't know the write size at this point yet, so it could be
+             * an unaligned write, but accept it here anyway and deal with it
+             * later.
+             */
+            return true;
+        }
+
+    return false;
+}
+
 int cf_check mmio_ro_emulated_write(
     enum x86_segment seg,
     unsigned long offset,
@@ -4928,6 +5193,9 @@  int cf_check mmio_ro_emulated_write(
         return X86EMUL_UNHANDLEABLE;
     }
 
+    subpage_mmio_write_emulate(mmio_ro_ctxt->mfn, PAGE_OFFSET(offset),
+                               p_data, bytes);
+
     return X86EMUL_OKAY;
 }
 
diff --git a/xen/arch/x86/pv/ro-page-fault.c b/xen/arch/x86/pv/ro-page-fault.c
index cad28ef928ad..578bb4caaf0b 100644
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -330,6 +330,7 @@  static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
             return X86EMUL_UNHANDLEABLE;
     }
 
+    mmio_ro_ctxt.mfn = mfn;
     ctxt->data = &mmio_ro_ctxt;
     if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
         return x86_emulate(ctxt, &mmcfg_intercept_ops);