
[1/2] x86/mm: add API for marking only part of a MMIO page read only

Message ID f5381e06d92cccf9756ad00fd77f82fba98a9d80.1679911575.git-series.marmarek@invisiblethingslab.com (mailing list archive)
State Superseded
Series Add API for making parts of a MMIO page R/O and use it in XHCI console

Commit Message

Marek Marczykowski-Górecki March 27, 2023, 10:09 a.m. UTC
In some cases, only a few registers on a page need to be write-protected.
Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole page either).
The current API allows only marking whole pages read-only, which
sometimes covers other registers that the guest may need to write into.

Currently, when a guest tries to write to an MMIO page in
mmio_ro_ranges, it's either immediately crashed on the EPT violation (if
it's HVM) or, if it's PV, it gets a #PF. In the case of Linux PV, if the
access was from userspace (e.g. via /dev/mem), it will try to fix up the
fault by updating the page tables (which Xen will again force to
read-only) and will hit that #PF again, looping endlessly. Both
behaviors are undesirable if the guest could actually be allowed the
write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables, do this via
emulation (or simply the page fault handler, for PV) that passes through
the writes that are supposed to be allowed. Those writes require the
page to be mapped in Xen, so the subpage_mmio_ro_add() function takes
the fixmap index of the page. The page needs to be added to
mmio_ro_ranges first anyway.
Sub-page ranges are stored in a rangeset for each added page, and those
pages are kept on a plain list (as there aren't supposed to be many
pages needing this precise r/o control).
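
A minimal sketch of a caller (the FIX_XHCI_MMIO index and reg_offset below
are made up for illustration; the actual user comes with patch 2/2):

    /* The page must already be in mmio_ro_ranges. */
    rc = rangeset_add_singleton(mmio_ro_ranges, mfn_x(mfn));
    if ( !rc )
        /* Keep a 64-byte register window R/O, pass writes to the rest through. */
        rc = subpage_mmio_ro_add(mfn, reg_offset, reg_offset + 0x40 - 1,
                                 FIX_XHCI_MMIO);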

The mechanism this API is plugged into is slightly different for PV and
HVM. For both paths, it hooks into mmio_ro_emulated_write(). For PV,
that function is already called for a #PF on a read-only MMIO page. For
HVM however, an EPT violation on a p2m_mmio_direct page results in a
direct domain_crash().
To reach mmio_ro_emulated_write(), change how write violations for
p2m_mmio_direct are handled - specifically, treat them similarly to
p2m_ioreq_server. This causes the relevant ioreq handler to be called,
which finally ends up calling mmio_ro_emulated_write().
Both of those paths need the MFN the guest tried to write to (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Shadow mode is not tested, but I don't expect it to behave differently
from HAP in the areas related to this patch.
The locking used should make it safe to use similarly to mmio_ro_ranges,
but frankly the only user (introduced in the next patch) could go without
locking at all, as subpage_mmio_ro_add() is called only before any
domain is constructed and subpage_mmio_ro_remove() is never called.
---
 xen/arch/x86/hvm/emulate.c      |   2 +-
 xen/arch/x86/hvm/hvm.c          |   3 +-
 xen/arch/x86/include/asm/mm.h   |  22 ++++-
 xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
 xen/arch/x86/pv/ro-page-fault.c |   1 +-
 5 files changed, 207 insertions(+), 2 deletions(-)

Comments

Roger Pau Monné March 28, 2023, 2:04 p.m. UTC | #1
On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> In some cases, only few registers on a page needs to be write-protected.
> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either).
> Current API allows only marking whole pages pages read-only, which
> sometimes may cover other registers that guest may need to write into.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fixup by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if guest could
> actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables, do this via
> emulation (or simply page fault handler for PV) that handles writes that
> are supposed to be allowed. Those writes require the page to be mapped
> to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
> page. The page needs to be added to mmio_ro_ranges, first anyway.
> Sub-page ranges are stored using rangeset for each added page, and those
> pages are stored on a plain list (as there isn't supposed to be many
> pages needing this precise r/o control).

Since mmio_ro_ranges is x86 only, it is possible to mutate
it to track address ranges instead of page frames.  The current type
is unsigned long, so that should be fine, and would avoid having to
create a per-page rangeset to just track offsets.

> The mechanism this API is plugged in is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash().
> To reach mmio_ro_emulated_write(), change how write violations for
> p2m_mmio_direct are handled - specifically, treat them similar to
> p2m_ioreq_server. This makes relevant ioreq handler being called,
> that finally end up calling mmio_ro_emulated_write().
> Both of those paths need an MFN to which guest tried to write (to check
> which part of the page is supposed to be read-only, and where
> the page is mapped for writes). This information currently isn't
> available directly in mmio_ro_emulated_write(), but in both cases it is
> already resolved somewhere higher in the call tree. Pass it down to
> mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> Shadow mode is not tested, but I don't expect it to work differently than
> HAP in areas related to this patch.
> The used locking should make it safe to use similar to mmio_ro_ranges,
> but frankly the only use (introduced in the next patch) could go without
> locking at all, as subpage_mmio_ro_add() is called only before any
> domain is constructed and subpage_mmio_ro_remove() is never called.
> ---
>  xen/arch/x86/hvm/emulate.c      |   2 +-
>  xen/arch/x86/hvm/hvm.c          |   3 +-
>  xen/arch/x86/include/asm/mm.h   |  22 ++++-
>  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
>  xen/arch/x86/pv/ro-page-fault.c |   1 +-
>  5 files changed, 207 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> index 95364deb1996..311102724dea 100644
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
>          .write      = mmio_ro_emulated_write,
>          .validate   = hvmemul_validate,
>      };
> -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
>      struct hvm_emulate_ctxt ctxt;
>      const struct x86_emulate_ops *ops;
>      unsigned int seg, bdf;
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index d326fa1c0136..f1c928e3e4ee 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>       */
>      if ( (p2mt == p2m_mmio_dm) ||
>           (npfec.write_access &&
> -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> +           p2mt == p2m_mmio_direct)) )
>      {
>          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
>              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index db29e3e2059f..91937d556bac 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> + * will still be R/O, but the rest of the page (nor marked as R/O via another
> + * call) will have writes passed through. The write passthrough requires
> + * providing fixmap entry by the caller.
> + * Since multiple callers can mark different areas of the same page, they might
> + * provide different fixmap entries (although that's very unlikely in
> + * practice). Only the one provided by the first caller will be used. Return value
> + * indicates whether this fixmap entry will be used, or a different one
> + * provided earlier (in which case the caller might decide to release it).

Why not use ioremap() to map the page instead of requiring a fixmap
entry?
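
Registration could then map the page on its own, along the lines of
(entry->mapped being a hypothetical new field, just to sketch the idea):

    entry->mapped = ioremap(mfn_to_maddr(mfn), PAGE_SIZE);
    if ( !entry->mapped )
    {
        rc = -ENOMEM;
        goto err_unlock;
    }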

> + *
> + * Return values:
> + *  - negative: error
> + *  - 0: success, fixmap entry is claimed
> + *  - 1: success, fixmap entry set earlier will be used
> + */
> +int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
> +                        unsigned long offset_e, int fixmap_idx);
> +int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
> +                           unsigned long offset_e);
> +
>  struct mmio_ro_emulate_ctxt {
>          unsigned long cr2;
>          unsigned int seg, bdf;
> +        mfn_t mfn;
>  };
>  
>  int cf_check mmio_ro_emulated_write(
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 0fe14faa5fa7..b50bdee40b6b 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
>  
>  struct rangeset *__read_mostly mmio_ro_ranges;
>  
> +/* Handling sub-page read-only MMIO regions */
> +struct subpage_ro_range {
> +    struct list_head list;
> +    mfn_t mfn;
> +    int fixmap_idx;
> +    struct rangeset *ro_bytes;
> +    struct rcu_head rcu;
> +};
> +
> +static LIST_HEAD(subpage_ro_ranges);
> +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> +static DEFINE_SPINLOCK(subpage_ro_lock);
> +
>  static uint32_t base_disallow_mask;
>  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
>  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +int subpage_mmio_ro_add(
> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e,

Since those are page offsets, you can likely use unsigned int rather
than long?

I also wonder why not use [paddr_t start, paddr_t end] (or start and
size) and drop the mfn parameter?  You can certainly get the mfn from
the full address, and it seems more natural than having the caller
pass an mfn and two offsets.
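
Something like the below (purely illustrative, not a request for this exact
shape):

    int subpage_mmio_ro_add(paddr_t start, paddr_t end, int fixmap_idx);

with the implementation deriving the page and offsets itself:

    mfn_t mfn = maddr_to_mfn(start);
    unsigned int offset_s = start & ~PAGE_MASK;
    unsigned int offset_e = end & ~PAGE_MASK;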

> +    int fixmap_idx)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry( iter, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xmalloc(struct subpage_ro_range);
> +        rc = -ENOMEM;
> +        if ( !entry )
> +            goto err_unlock;
> +        entry->mfn = mfn;
> +        entry->fixmap_idx = fixmap_idx;
> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> +                                       RANGESETF_prettyprint_hex);
> +        rc = -ENOMEM;

rc will already be -ENOMEM, albeit doing error handling this way is
tricky...

> +        if ( !entry->ro_bytes )
> +            goto err_unlock;
> +    }
> +
> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto err_unlock;
> +
> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> +        return 0;
> +    else
> +        return 1;
> +
> +err_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    if ( !iter )
> +    {
> +        if ( entry )
> +        {
> +            if ( entry->ro_bytes )
> +                rangeset_destroy(entry->ro_bytes);
> +            xfree(entry);
> +        }
> +    }
> +    return rc;
> +}
> +
> +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> +{
> +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
> +
> +    rangeset_destroy(entry->ro_bytes);
> +    xfree(entry);
> +}
> +
> +int subpage_mmio_ro_remove(
> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    rc = -ENOENT;
> +    if ( !entry )
> +        goto out_unlock;
> +
> +    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto out_unlock;
> +
> +    rc = 0;

You can use `if ( rc ) goto out_unlock;` and that avoids having to
explicitly set rc to 0.

> +
> +    if ( !rangeset_is_empty(entry->ro_bytes) )
> +        goto out_unlock;
> +
> +    list_del_rcu(&entry->list);
> +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> +
> +out_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    return rc;
> +}
> +
> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned long offset,
> +    void *data,
> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;

const.

> +    void __iomem *addr;

Do we care about the access being aligned?

> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {

You need to take the spinlock at this point, since the contents of
entry->ro_bytes are not protected by the RCU.  The RCU only provides
safe iteration over the list, but not the content of the elements on
the list.

Thanks, Roger.
Marek Marczykowski-Górecki March 28, 2023, 2:49 p.m. UTC | #2
On Tue, Mar 28, 2023 at 04:04:47PM +0200, Roger Pau Monné wrote:
> On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only few registers on a page needs to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either).
> > Current API allows only marking whole pages pages read-only, which
> > sometimes may cover other registers that guest may need to write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables, do this via
> > emulation (or simply page fault handler for PV) that handles writes that
> > are supposed to be allowed. Those writes require the page to be mapped
> > to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
> > page. The page needs to be added to mmio_ro_ranges, first anyway.
> > Sub-page ranges are stored using rangeset for each added page, and those
> > pages are stored on a plain list (as there isn't supposed to be many
> > pages needing this precise r/o control).
> 
> Since mmio_ro_ranges is x86 only, it is possible to mutate
> it to track address ranges instead of page frames.  The current type
> is unsigned long, so that should be fine, and would avoid having to
> create a per-page rangeset to just track offsets.

I was thinking about it, but rangeset doesn't allow attaching extra data
(fixmap index, or mapped address as you propose with ioremap()).
Changing all the places where mmio_ro_ranges is used will be a bit
tedious, but that isn't really a problem.

> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash().
> > To reach mmio_ro_emulated_write(), change how write violations for
> > p2m_mmio_direct are handled - specifically, treat them similar to
> > p2m_ioreq_server. This makes relevant ioreq handler being called,
> > that finally end up calling mmio_ro_emulated_write().
> > Both of those paths need an MFN to which guest tried to write (to check
> > which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > Shadow mode is not tested, but I don't expect it to work differently than
> > HAP in areas related to this patch.
> > The used locking should make it safe to use similar to mmio_ro_ranges,
> > but frankly the only use (introduced in the next patch) could go without
> > locking at all, as subpage_mmio_ro_add() is called only before any
> > domain is constructed and subpage_mmio_ro_remove() is never called.
> > ---
> >  xen/arch/x86/hvm/emulate.c      |   2 +-
> >  xen/arch/x86/hvm/hvm.c          |   3 +-
> >  xen/arch/x86/include/asm/mm.h   |  22 ++++-
> >  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
> >  xen/arch/x86/pv/ro-page-fault.c |   1 +-
> >  5 files changed, 207 insertions(+), 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > index 95364deb1996..311102724dea 100644
> > --- a/xen/arch/x86/hvm/emulate.c
> > +++ b/xen/arch/x86/hvm/emulate.c
> > @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
> >          .write      = mmio_ro_emulated_write,
> >          .validate   = hvmemul_validate,
> >      };
> > -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
> >      struct hvm_emulate_ctxt ctxt;
> >      const struct x86_emulate_ops *ops;
> >      unsigned int seg, bdf;
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index d326fa1c0136..f1c928e3e4ee 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >       */
> >      if ( (p2mt == p2m_mmio_dm) ||
> >           (npfec.write_access &&
> > -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> > +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> > +           p2mt == p2m_mmio_direct)) )
> >      {
> >          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
> >              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index db29e3e2059f..91937d556bac 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> > + * will still be R/O, but the rest of the page (nor marked as R/O via another
> > + * call) will have writes passed through. The write passthrough requires
> > + * providing fixmap entry by the caller.
> > + * Since multiple callers can mark different areas of the same page, they might
> > + * provide different fixmap entries (although that's very unlikely in
> > + * practice). Only the one provided by the first caller will be used. Return value
> > + * indicates whether this fixmap entry will be used, or a different one
> > + * provided earlier (in which case the caller might decide to release it).
> 
> Why not use ioremap() to map the page instead of requiring a fixmap
> entry?

In all the cases this feature is used (for now), I do have a fixmap
anyway. So I don't need to worry about whether I can call ioremap() at
that boot stage (I think it's okay in console_init_postirq(), but that
may not be obvious in other places).

> > + *
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success, fixmap entry is claimed
> > + *  - 1: success, fixmap entry set earlier will be used
> > + */
> > +int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
> > +                        unsigned long offset_e, int fixmap_idx);
> > +int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
> > +                           unsigned long offset_e);
> > +
> >  struct mmio_ro_emulate_ctxt {
> >          unsigned long cr2;
> >          unsigned int seg, bdf;
> > +        mfn_t mfn;
> >  };
> >  
> >  int cf_check mmio_ro_emulated_write(
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index 0fe14faa5fa7..b50bdee40b6b 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +    struct list_head list;
> > +    mfn_t mfn;
> > +    int fixmap_idx;
> > +    struct rangeset *ro_bytes;
> > +    struct rcu_head rcu;
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> > +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> > +static DEFINE_SPINLOCK(subpage_ro_lock);
> > +
> >  static uint32_t base_disallow_mask;
> >  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
> >  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> > @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return 0;
> >  }
> >  
> > +int subpage_mmio_ro_add(
> > +    mfn_t mfn,
> > +    unsigned long offset_s,
> > +    unsigned long offset_e,
> 
> Since those are page offset, you can likely use unsigned int rather
> than long?
> 
> I also wonder why not use [paddr_t start, paddr_t end] (or start and
> size) and drop the mfn parameter?  You can certainly get the mfn from
> the full address, and it seems more natural that having the caller
> pass an mfn and two offsets.

That would work for the function parameters indeed, regardless of what's
really stored.

> > +    int fixmap_idx)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc;
> > +
> > +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> > +    ASSERT(offset_s < PAGE_SIZE);
> > +    ASSERT(offset_e < PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    list_for_each_entry( iter, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +    {
> > +        /* iter==NULL marks it was a newly allocated entry */
> > +        iter = NULL;
> > +        entry = xmalloc(struct subpage_ro_range);
> > +        rc = -ENOMEM;
> > +        if ( !entry )
> > +            goto err_unlock;
> > +        entry->mfn = mfn;
> > +        entry->fixmap_idx = fixmap_idx;
> > +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> > +                                       RANGESETF_prettyprint_hex);
> > +        rc = -ENOMEM;
> 
> rc will already be -ENOMEM, albeit doing error handling this way is
> tricky...
> 
> > +        if ( !entry->ro_bytes )
> > +            goto err_unlock;
> > +    }
> > +
> > +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> > +    if ( rc < 0 )
> > +        goto err_unlock;
> > +
> > +    if ( !iter )
> > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> > +        return 0;
> > +    else
> > +        return 1;
> > +
> > +err_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    if ( !iter )
> > +    {
> > +        if ( entry )
> > +        {
> > +            if ( entry->ro_bytes )
> > +                rangeset_destroy(entry->ro_bytes);
> > +            xfree(entry);
> > +        }
> > +    }
> > +    return rc;
> > +}
> > +
> > +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> > +{
> > +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
> > +
> > +    rangeset_destroy(entry->ro_bytes);
> > +    xfree(entry);
> > +}
> > +
> > +int subpage_mmio_ro_remove(
> > +    mfn_t mfn,
> > +    unsigned long offset_s,
> > +    unsigned long offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc;
> > +
> > +    ASSERT(offset_s < PAGE_SIZE);
> > +    ASSERT(offset_e < PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    rc = -ENOENT;
> > +    if ( !entry )
> > +        goto out_unlock;
> > +
> > +    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
> > +    if ( rc < 0 )
> > +        goto out_unlock;
> > +
> > +    rc = 0;
> 
> You can use `if ( rc ) goto out_unlock;` and that avoids having to
> explicitly set rc to 0.
> 
> > +
> > +    if ( !rangeset_is_empty(entry->ro_bytes) )
> > +        goto out_unlock;
> > +
> > +    list_del_rcu(&entry->list);
> > +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> > +
> > +out_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    return rc;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned long offset,
> > +    void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> 
> const.
> 
> > +    void __iomem *addr;
> 
> Do we care about the access being aligned?

I don't think Xen cares about it when the page is mapped R/W to the guest,
so why should it care when it's partially R/W only?

> 
> > +
> > +    rcu_read_lock(&subpage_ro_rcu);
> > +
> > +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> 
> You need to take the spinlock at this point, since the contents of
> entry->ro_bytes are not protected by the RCU.  The RCU only provides
> safe iteration over the list, but not the content of the elements on
> the list.

mfn is not supposed to change ever on the specific list element, and
IIUC, rangeset does locking itself. Am I missing something?
Roger Pau Monné March 28, 2023, 3:31 p.m. UTC | #3
On Tue, Mar 28, 2023 at 04:49:36PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Mar 28, 2023 at 04:04:47PM +0200, Roger Pau Monné wrote:
> > On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> > > In some cases, only few registers on a page needs to be write-protected.
> > > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > > PBA table (which doesn't need to span the whole table either).
> > > Current API allows only marking whole pages pages read-only, which
> > > sometimes may cover other registers that guest may need to write into.
> > > 
> > > Currently, when a guest tries to write to an MMIO page on the
> > > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > > from userspace (like, /dev/mem), it will try to fixup by updating page
> > > tables (that Xen again will force to read-only) and will hit that #PF
> > > again (looping endlessly). Both behaviors are undesirable if guest could
> > > actually be allowed the write.
> > > 
> > > Introduce an API that allows marking part of a page read-only. Since
> > > sub-page permissions are not a thing in page tables, do this via
> > > emulation (or simply page fault handler for PV) that handles writes that
> > > are supposed to be allowed. Those writes require the page to be mapped
> > > to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
> > > page. The page needs to be added to mmio_ro_ranges, first anyway.
> > > Sub-page ranges are stored using rangeset for each added page, and those
> > > pages are stored on a plain list (as there isn't supposed to be many
> > > pages needing this precise r/o control).
> > 
> > Since mmio_ro_ranges is x86 only, it is possible to mutate
> > it to track address ranges instead of page frames.  The current type
> > is unsigned long, so that should be fine, and would avoid having to
> > create a per-page rangeset to just track offsets.
> 
> I was thinking about it, but rangeset doesn't allow attaching extra data
> (fixmap index, or mapped address as you propose with ioremap()).
> Changing all the places where mmio_ro_ranges is used will be a bit
> tedious, but that isn't really a problem.

Yes, you would still need to keep a separate structure in order to
store the mappings; a simple list that contains the physical mfn and the
virtual address where it's mapped would suffice, I think.  That would
avoid creating a separate rangeset for each mfn that you need to
track.
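
Roughly (names illustrative only):

    struct subpage_ro_map {
        struct list_head list;
        mfn_t mfn;
        void __iomem *mapped;   /* from ioremap(), used for the write pass-through */
    };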

> > > The mechanism this API is plugged in is slightly different for PV and
> > > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > > violation on p2m_mmio_direct page results in a direct domain_crash().
> > > To reach mmio_ro_emulated_write(), change how write violations for
> > > p2m_mmio_direct are handled - specifically, treat them similar to
> > > p2m_ioreq_server. This makes relevant ioreq handler being called,
> > > that finally end up calling mmio_ro_emulated_write().
> > > Both of those paths need an MFN to which guest tried to write (to check
> > > which part of the page is supposed to be read-only, and where
> > > the page is mapped for writes). This information currently isn't
> > > available directly in mmio_ro_emulated_write(), but in both cases it is
> > > already resolved somewhere higher in the call tree. Pass it down to
> > > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > ---
> > > Shadow mode is not tested, but I don't expect it to work differently than
> > > HAP in areas related to this patch.
> > > The used locking should make it safe to use similar to mmio_ro_ranges,
> > > but frankly the only use (introduced in the next patch) could go without
> > > locking at all, as subpage_mmio_ro_add() is called only before any
> > > domain is constructed and subpage_mmio_ro_remove() is never called.
> > > ---
> > >  xen/arch/x86/hvm/emulate.c      |   2 +-
> > >  xen/arch/x86/hvm/hvm.c          |   3 +-
> > >  xen/arch/x86/include/asm/mm.h   |  22 ++++-
> > >  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
> > >  xen/arch/x86/pv/ro-page-fault.c |   1 +-
> > >  5 files changed, 207 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > > index 95364deb1996..311102724dea 100644
> > > --- a/xen/arch/x86/hvm/emulate.c
> > > +++ b/xen/arch/x86/hvm/emulate.c
> > > @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
> > >          .write      = mmio_ro_emulated_write,
> > >          .validate   = hvmemul_validate,
> > >      };
> > > -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > > +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
> > >      struct hvm_emulate_ctxt ctxt;
> > >      const struct x86_emulate_ops *ops;
> > >      unsigned int seg, bdf;
> > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > > index d326fa1c0136..f1c928e3e4ee 100644
> > > --- a/xen/arch/x86/hvm/hvm.c
> > > +++ b/xen/arch/x86/hvm/hvm.c
> > > @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> > >       */
> > >      if ( (p2mt == p2m_mmio_dm) ||
> > >           (npfec.write_access &&
> > > -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> > > +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> > > +           p2mt == p2m_mmio_direct)) )
> > >      {
> > >          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
> > >              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > > index db29e3e2059f..91937d556bac 100644
> > > --- a/xen/arch/x86/include/asm/mm.h
> > > +++ b/xen/arch/x86/include/asm/mm.h
> > > @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
> > >  void memguard_guard_stack(void *p);
> > >  void memguard_unguard_stack(void *p);
> > >  
> > > +/*
> > > + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> > > + * will still be R/O, but the rest of the page (nor marked as R/O via another
> > > + * call) will have writes passed through. The write passthrough requires
> > > + * providing fixmap entry by the caller.
> > > + * Since multiple callers can mark different areas of the same page, they might
> > > + * provide different fixmap entries (although that's very unlikely in
> > > + * practice). Only the one provided by the first caller will be used. Return value
> > > + * indicates whether this fixmap entry will be used, or a different one
> > > + * provided earlier (in which case the caller might decide to release it).
> > 
> > Why not use ioremap() to map the page instead of requiring a fixmap
> > entry?
> 
> In all the cases this feature is used (for now), I do have a fixmap
> anyway. So, I don't need to worry if I can call ioremap() at that boot
> stage (I think it's okay in console_init_postirq(), but that may not
> be obvious in other places).

If we moved to a model where mmio_ro_ranges tracks at address
granularity I would suggest that the underlying addresses are only
mapped when there's a guest access to handle, so that we don't waste
fixmap entries for no reason.
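
I.e. roughly, in the write handler (sketch only; error handling and any
caching of the mapping omitted):

    void __iomem *addr = ioremap(mfn_to_maddr(mfn), PAGE_SIZE);

    if ( addr )
    {
        /* val and offset as passed to the handler. */
        writel(val, addr + offset);
        iounmap(addr);
    }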

> > > +        if ( !entry->ro_bytes )
> > > +            goto err_unlock;
> > > +    }
> > > +
> > > +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> > > +    if ( rc < 0 )
> > > +        goto err_unlock;
> > > +
> > > +    if ( !iter )
> > > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > > +
> > > +    spin_unlock(&subpage_ro_lock);
> > > +
> > > +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> > > +        return 0;
> > > +    else
> > > +        return 1;
> > > +
> > > +err_unlock:
> > > +    spin_unlock(&subpage_ro_lock);
> > > +    if ( !iter )
> > > +    {
> > > +        if ( entry )
> > > +        {
> > > +            if ( entry->ro_bytes )
> > > +                rangeset_destroy(entry->ro_bytes);
> > > +            xfree(entry);
> > > +        }
> > > +    }
> > > +    return rc;
> > > +}
> > > +
> > > +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> > > +{
> > > +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);

Overly long line.

Out of precaution I would also add an extra
ASSERT(rangeset_is_empty(entry->ro_bytes)) here.

> > > +
> > > +    if ( !rangeset_is_empty(entry->ro_bytes) )
> > > +        goto out_unlock;
> > > +
> > > +    list_del_rcu(&entry->list);
> > > +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> > > +
> > > +out_unlock:
> > > +    spin_unlock(&subpage_ro_lock);
> > > +    return rc;
> > > +}
> > > +
> > > +static void subpage_mmio_write_emulate(
> > > +    mfn_t mfn,
> > > +    unsigned long offset,
> > > +    void *data,
> > > +    unsigned int len)
> > > +{
> > > +    struct subpage_ro_range *entry;
> > 
> > const.
> > 
> > > +    void __iomem *addr;
> > 
> > Do we care about the access being aligned?
> 
> I don't think Xen cares about it when page is mapped R/W to the guest,
> so why should it care when it's partially R/W only?

It's kind of the same issue we have with adjacent MSIX accesses: we
don't really know whether the device might choke on unaligned accesses
and generate some kind of exception that results in an NMI to the CPU.
Handling those is likely more feasible if they happen in guest context
rather than in Xen context.

> > 
> > > +
> > > +    rcu_read_lock(&subpage_ro_rcu);
> > > +
> > > +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> > > +    {
> > > +        if ( mfn_eq(entry->mfn, mfn) )
> > > +        {
> > 
> > You need to take the spinlock at this point, since the contents of
> > entry->ro_bytes are not protected by the RCU.  The RCU only provides
> > safe iteration over the list, but not the content of the elements on
> > the list.
> 
> mfn is not supposed to change ever on the specific list element, and
> IIUC, rangeset does locking itself. Am I missing something?

Oh, sorry, I was wrong there - I didn't recall, and didn't check, that
rangesets do have internal locking.

Thanks, Roger.
Jan Beulich March 29, 2023, 8:50 a.m. UTC | #4
On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
> In some cases, only few registers on a page needs to be write-protected.
> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either).

Yet like the MSI-X table the PBA is not permitted to share a page with
other registers (the table itself excluded). So a need there would
(again) only arise for devices violating the spec.

> Current API allows only marking whole pages pages read-only, which
> sometimes may cover other registers that guest may need to write into.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fixup by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if guest could
> actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables, do this via
> emulation (or simply page fault handler for PV) that handles writes that
> are supposed to be allowed. Those writes require the page to be mapped
> to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
> page. The page needs to be added to mmio_ro_ranges, first anyway.
> Sub-page ranges are stored using rangeset for each added page, and those
> pages are stored on a plain list (as there isn't supposed to be many
> pages needing this precise r/o control).

While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
changed, I wonder if a bitmap isn't going to be easier to use (even
if perhaps requiring a little more memory, but whether that's the case
depends on intended granularity - I'm not convinced we need byte-
granular ranges). I'm also worried about this tying the USB3 debug
console driver (as of the next patch) yet more heavily to x86 (i.e.,
unlike Roger, I wouldn't take anything that's x86-only as
justification for making wider changes).

As to sub-page permissions: EPT has, as one of its extensions, support
for this. It might be worth at least mentioning, even if the feature
isn't intended to be used right here.

Taking both remarks together, limiting granularity to 16(?) bytes
would allow using the advanced EPT functionality down the road, and
would at the same time limit the suggested bitmap to just 256 bits /
32 bytes, which I think gets us below what even an empty rangeset
would require. Plus lookup would also be quite a bit more lightweight.
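
I.e. something of this shape (just to illustrate what I have in mind with
the 16-byte granularity picked above; none of this is committed to):

    #define SUBPAGE_MMIO_RO_ALIGN 16

    struct subpage_ro_range {
        struct list_head list;
        mfn_t mfn;
        int fixmap_idx;
        /* One bit per 16-byte chunk: 256 bits (32 bytes) per 4k page. */
        DECLARE_BITMAP(ro_elems, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
    };

Marking a range then becomes a simple loop:

    unsigned int i;

    for ( i = offset_s / SUBPAGE_MMIO_RO_ALIGN;
          i <= offset_e / SUBPAGE_MMIO_RO_ALIGN; i++ )
        set_bit(i, entry->ro_elems);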

> The mechanism this API is plugged in is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash().
> To reach mmio_ro_emulated_write(), change how write violations for
> p2m_mmio_direct are handled - specifically, treat them similar to
> p2m_ioreq_server. This makes relevant ioreq handler being called,
> that finally end up calling mmio_ro_emulated_write().

I wonder whether that isn't too heavy a change to existing behavior.
I understand that taking this path is necessary for the purpose of the
patch, but I don't think it is necessary for all p2m_mmio_direct pages.
Checking at least mmio_ro_ranges right in hvm_hap_nested_page_fault()
looks reasonable to me.
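
I.e. something like the below (illustrative only; mfn is already available
at that point in the function):

    if ( (p2mt == p2m_mmio_dm) ||
         (npfec.write_access &&
          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
           (p2mt == p2m_mmio_direct &&
            rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn))))) )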

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
>  
>  struct rangeset *__read_mostly mmio_ro_ranges;
>  
> +/* Handling sub-page read-only MMIO regions */
> +struct subpage_ro_range {
> +    struct list_head list;
> +    mfn_t mfn;
> +    int fixmap_idx;
> +    struct rangeset *ro_bytes;
> +    struct rcu_head rcu;
> +};
> +
> +static LIST_HEAD(subpage_ro_ranges);
> +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> +static DEFINE_SPINLOCK(subpage_ro_lock);
> +
>  static uint32_t base_disallow_mask;
>  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
>  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +int subpage_mmio_ro_add(

As long as patch 2 is going to add the only users, __init please, and
there's no need for a "remove" counterpart.

> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e,
> +    int fixmap_idx)

enum fixed_addresses?

> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry( iter, &subpage_ro_ranges, list )

Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
for the purposes of what blanks should be present/absent, or you don't -
a mixture isn't reasonable; also elsewhere).

> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xmalloc(struct subpage_ro_range);
> +        rc = -ENOMEM;
> +        if ( !entry )
> +            goto err_unlock;
> +        entry->mfn = mfn;
> +        entry->fixmap_idx = fixmap_idx;
> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> +                                       RANGESETF_prettyprint_hex);
> +        rc = -ENOMEM;
> +        if ( !entry->ro_bytes )
> +            goto err_unlock;
> +    }
> +
> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto err_unlock;
> +
> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> +        return 0;
> +    else
> +        return 1;

If this case is really needed, this special return value (as documented
in the header) probably needs special handling in the callers - it's
not necessarily an error after all. But I wonder whether it wouldn't be
easier to check earlier on and fail right away (with e.g. -EBUSY), rather
than adding the range and _then_ (kind of, as per patch 2) failing.

> +err_unlock:

Nit: Style (labels indented by at least one space please, also elsewhere).

> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned long offset,

unsigned int

> +    void *data,

const

> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
> +                goto out_unlock;

This case ...

> +            addr = fix_to_virt(entry->fixmap_idx) + offset;
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(uint64_t*)data, addr);
> +                break;
> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();

... as well as, in a release build, this one wants to ...

> +            }
> +            goto out_unlock;

... not use this path but ...

> +        }
> +    }
> +    gdprintk(XENLOG_WARNING,
> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
> +             mfn_x(mfn), offset, len);

... make it here. (By implication: I'm not convinced this wants to be a
gdprintk(), as I think at least one such warning would better surface in
release builds as well.)

At the same time I don't think any message should be issued for write
attempts to pages which don't have any part of them marked writable. We deal
with such silently right now, and this shouldn't change. In fact even
for ranges which don't overlap the writable portion of a page we may want
to consider remaining silent. The log message ought to be of interest
mainly for writes which we meant to permit, but which for whatever reason
we elect to suppress nevertheless.

As to the message - why do you split MFN from offset, rather than simply
logging an address ("... %" PRI_mfn "%03x ...")?
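
I.e. something like (assuming offset becomes unsigned int, as asked for
further up):

    gprintk(XENLOG_WARNING,
            "ignoring write to R/O MMIO 0x%" PRI_mfn "%03x, len %u\n",
            mfn_x(mfn), offset, len);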

Like (iirc) Roger I think that misaligned accesses should be refused (and
hence also make it here).

Jan
Marek Marczykowski-Górecki March 29, 2023, 10:51 a.m. UTC | #5
On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote:
> On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
> > In some cases, only few registers on a page needs to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either).
> 
> Yet like the MSI-X table the PBA is not permitted to share a page with
> other registers (the table itself excluded). So a need there would
> (again) only arise for devices violating the spec.

For the PBA, indeed (and not having seen a device needing such a hack, I
don't do that for the PBA yet). But the XHCI spec doesn't include such a
limitation, so this case is perfectly valid.

> > Current API allows only marking whole pages pages read-only, which
> > sometimes may cover other registers that guest may need to write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables, do this via
> > emulation (or simply page fault handler for PV) that handles writes that
> > are supposed to be allowed. Those writes require the page to be mapped
> > to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
> > page. The page needs to be added to mmio_ro_ranges, first anyway.
> > Sub-page ranges are stored using rangeset for each added page, and those
> > pages are stored on a plain list (as there isn't supposed to be many
> > pages needing this precise r/o control).
> 
> While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
> changed, I wonder if a bitmap isn't going to be easier to use (even
> if perhaps requiring a little more memory, but whether that's the case
> depends on intended granularity - I'm not convinced we need byte-
> granular ranges). I'm also worried of this yet more heavily tying to
> x86 of (as of the next patch) the USB3 debug console driver (i.e. as
> opposed to Roger I wouldn't take anything that's x86-only as
> justification for making wider changes).

Well, it's still under the very same CONFIG_X86, and for the same
purpose, so I don't think it's "more heavily tying".

> As to sub-page permissions: EPT has, as one of its extensions, support
> for this. It might be worth at least mentioning, even if the feature
> isn't intended to be used right here.

Ah, ok.

> Taking both remarks together, limiting granularity to 16(?) bytes
> would allow using the advanced EPT functionality down the road, and
> would at the same time limit the suggested bitmap to just 256 bits /
> 32 bytes, which I think gets us below what even an empty rangeset
> would require. Plus lookup would also be quite a bit more lightweight.

Indeed, in that case it makes sense.

> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash().
> > To reach mmio_ro_emulated_write(), change how write violations for
> > p2m_mmio_direct are handled - specifically, treat them similar to
> > p2m_ioreq_server. This makes relevant ioreq handler being called,
> > that finally end up calling mmio_ro_emulated_write().
> 
> I wonder whether that isn't too heavy a change to existing behavior.
> I understand that taking this path is necessary for the purpose of the
> patch, but I don't think it is necessary for all p2m_mmio_direct pages.
> Checking at least mmio_ro_ranges right in hvm_hap_nested_page_fault()
> looks reasonable to me.

Before this change, a domU was crashed on a write EPT violation to a
p2m_mmio_direct page (which, IIUC, can happen only for pages in
mmio_ro_ranges, as otherwise the page is mapped R/W), so I don't think
there are many accesses that would hit this path for other reasons.

> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +    struct list_head list;
> > +    mfn_t mfn;
> > +    int fixmap_idx;
> > +    struct rangeset *ro_bytes;
> > +    struct rcu_head rcu;
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> > +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> > +static DEFINE_SPINLOCK(subpage_ro_lock);
> > +
> >  static uint32_t base_disallow_mask;
> >  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
> >  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> > @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return 0;
> >  }
> >  
> > +int subpage_mmio_ro_add(
> 
> As long as patch 2 is going to add the only users, __init please, and
> there's no need for a "remove" counterpart.

__init makes sense. But as for removing the "remove" part, I'm not sure.
I realize it is dead code now, but it's easier to introduce it now to
provide a complete API than to have somebody else add it later when they
want to use it in other places. Is there some trick to make the
compiler/linker optimize it out?

> 
> > +    mfn_t mfn,
> > +    unsigned long offset_s,
> > +    unsigned long offset_e,
> > +    int fixmap_idx)
> 
> enum fixed_addresses?

If that parameter stays, yes.

> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc;
> > +
> > +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> > +    ASSERT(offset_s < PAGE_SIZE);
> > +    ASSERT(offset_e < PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    list_for_each_entry( iter, &subpage_ro_ranges, list )
> 
> Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
> for the purposes of what blanks should be present/absent, or you don't,
> a mixture isn't reasonable; also elsewhere).

Which version would be your preference for list_for_each_entry?

> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +    {
> > +        /* iter==NULL marks it was a newly allocated entry */
> > +        iter = NULL;
> > +        entry = xmalloc(struct subpage_ro_range);
> > +        rc = -ENOMEM;
> > +        if ( !entry )
> > +            goto err_unlock;
> > +        entry->mfn = mfn;
> > +        entry->fixmap_idx = fixmap_idx;
> > +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> > +                                       RANGESETF_prettyprint_hex);
> > +        rc = -ENOMEM;
> > +        if ( !entry->ro_bytes )
> > +            goto err_unlock;
> > +    }
> > +
> > +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> > +    if ( rc < 0 )
> > +        goto err_unlock;
> > +
> > +    if ( !iter )
> > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> > +        return 0;
> > +    else
> > +        return 1;
> 
> If this case is really needed, this special return value (as documented
> in the header) probably needs specially handling in the callers - it's
> not necessarily an error after all. But I wonder whether it wouldn't be
> easier to check earlier on and fail right away (with e.g. -EBUSY), 

The idea is to allow multiple sub-ranges in a single page. Again, if
using ioremap() internally instead of an externally provided fixmap,
this case will go away.

> rather
> than adding the range and _then_ (kind of, as per patch 2) failing.

Right, I missed "!= 0" there.

> > +err_unlock:
> 
> Nit: Style (labels indented by at least one space please, also elsewhere).
> 
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned long offset,
> 
> unsigned int
> 
> > +    void *data,
> 
> const
> 
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> > +    void __iomem *addr;
> > +
> > +    rcu_read_lock(&subpage_ro_rcu);
> > +
> > +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> > +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
> > +                goto out_unlock;
> 
> This case ...
> 
> > +            addr = fix_to_virt(entry->fixmap_idx) + offset;
> > +            switch ( len )
> > +            {
> > +            case 1:
> > +                writeb(*(uint8_t*)data, addr);
> > +                break;
> > +            case 2:
> > +                writew(*(uint16_t*)data, addr);
> > +                break;
> > +            case 4:
> > +                writel(*(uint32_t*)data, addr);
> > +                break;
> > +            case 8:
> > +                writeq(*(uint64_t*)data, addr);
> > +                break;
> > +            default:
> > +                /* mmio_ro_emulated_write() already validated the size */
> > +                ASSERT_UNREACHABLE();
> 
> ... as well as, in a release build, this one wants to ...
> 
> > +            }
> > +            goto out_unlock;
> 
> ... not use this path but ...
> 
> > +        }
> > +    }
> > +    gdprintk(XENLOG_WARNING,
> > +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
> > +             mfn_x(mfn), offset, len);
> 
> ... make it here. (By implication: I'm not convinced this wants to be a
> gdprintk(), as I think at least one such warning would better surface in
> release builds as well.)

Right, gprintk() would make more sense indeed.

> At the same time I don't think any message should be issued for write
> attempts to pages which don't have part of it marked writable. We deal
> with such silently right now, and this shouldn't change.

At least for HVM domains, it isn't really silent. It's domain_crash()
(before my change, not reaching this function at all).
I think it is silent only for PV domains (or maybe even just the
hardware domain?).

> In fact even
> for ranges which don't overlap the writable portion of a page we may want
> to consider remaining silent. The log message ought to be of interest
> mainly for writes which we meant to permit, but which for whatever reason
> we elect to suppress nevertheless.
> 
> As to the message - why do you split MFN from offset, rather than simply
> logging an address ("... %" PRI_mfn "%03x ...")?

I could use such a formatting trick indeed.

> Like (iirc) Roger I think that misaligned accesses should be refused (and
> hence also make it here).

Ok, will change.
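
Roughly something like this then, combining both points (a sketch only;
the exact message wording is still to be settled):

    static void subpage_mmio_write_emulate(
        mfn_t mfn, unsigned int offset, const void *data, unsigned int len)
    {
        ...
        /* Refuse misaligned accesses, the same as unexpected sizes. */
        if ( !IS_ALIGNED(offset, len) )
            goto write_ignored;
        ...
     write_ignored:
        gprintk(XENLOG_WARNING,
                "ignoring write to R/O MMIO 0x%" PRI_mfn "%03x len %u\n",
                mfn_x(mfn), offset, len);
        ...
    }
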
Jan Beulich March 29, 2023, 12:39 p.m. UTC | #6
On 29.03.2023 12:51, Marek Marczykowski-Górecki wrote:
> On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote:
>> On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
>>> In some cases, only few registers on a page needs to be write-protected.
>>> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
>>> PBA table (which doesn't need to span the whole table either).
>>
>> Yet like the MSI-X table the PBA is not permitted to share a page with
>> other registers (the table itself excluded). So a need there would
>> (again) only arise for devices violating the spec.
> 
> For PBA, indeed (and due to not seeing device needing such hack, I don't
> do that for PBA yet). But the XHCI spec doesn't include such limitation, so
> this case is perfectly valid.

My remark was merely because you mention PBA in the description.
Mentioning it is okay, but you would want to further qualify it, I
think.

>>> Current API allows only marking whole pages pages read-only, which
>>> sometimes may cover other registers that guest may need to write into.
>>>
>>> Currently, when a guest tries to write to an MMIO page on the
>>> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
>>> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
>>> from userspace (like, /dev/mem), it will try to fixup by updating page
>>> tables (that Xen again will force to read-only) and will hit that #PF
>>> again (looping endlessly). Both behaviors are undesirable if guest could
>>> actually be allowed the write.
>>>
>>> Introduce an API that allows marking part of a page read-only. Since
>>> sub-page permissions are not a thing in page tables, do this via
>>> emulation (or simply page fault handler for PV) that handles writes that
>>> are supposed to be allowed. Those writes require the page to be mapped
>>> to Xen, so subpage_mmio_ro_add() function takes fixmap index of the
>>> page. The page needs to be added to mmio_ro_ranges, first anyway.
>>> Sub-page ranges are stored using rangeset for each added page, and those
>>> pages are stored on a plain list (as there isn't supposed to be many
>>> pages needing this precise r/o control).
>>
>> While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
>> changed, I wonder if a bitmap isn't going to be easier to use (even
>> if perhaps requiring a little more memory, but whether that's the case
>> depends on intended granularity - I'm not convinced we need byte-
>> granular ranges). I'm also worried of this yet more heavily tying to
>> x86 of (as of the next patch) the USB3 debug console driver (i.e. as
>> opposed to Roger I wouldn't take anything that's x86-only as
>> justification for making wider changes).
> 
> Well, it's still under the very same CONFIG_X86, and for the same
> purpose, so I don't think it's "more heavily tying".
> 
>> As to sub-page permissions: EPT has, as one of its extensions, support
>> for this. It might be worth at least mentioning, even if the feature
>> isn't intended to be used right here.
> 
> Ah, ok.
> 
>> Taking both remarks together, limiting granularity to 16(?) bytes
>> would allow using the advanced EPT functionality down the road, and
>> would at the same time limit the suggested bitmap to just 256 bits /
>> 32 bytes, which I think gets us below what even an empty rangeset
>> would require. Plus lookup would also be quite a bit more lightweight.
> 
> Indeed, in that case it makes sense.

Hmm, I've checked the SDM, and I was misremembering: Granularity is
128 bytes, which might be too large for the purposes here.

>>> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>      return 0;
>>>  }
>>>  
>>> +int subpage_mmio_ro_add(
>>
>> As long as patch 2 is going to add the only users, __init please, and
>> there's no need for a "remove" counterpart.
> 
> __init makes sense. But as for dropping the "remove" part, I'm not sure. I
> realize it is dead code now, but it's easier to introduce it now to provide
> a complete API than to leave it for somebody else to add later when they
> want to use it elsewhere. Is there some trick to make the compiler/linker
> optimize it out?

At the very least you could also mark it __init. There are also the .discard
and .discard.* sections we handle specially in the linker script. But no
matter what you do to retain the code without impacting the resulting binary,
iirc Misra tells us that there shouldn't be dead code.
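
I.e. at a minimum (illustrative only):

    /* The function body then lands in .init.text and is freed after boot: */
    int __init subpage_mmio_ro_add(
        mfn_t mfn,
        unsigned long offset_s,
        unsigned long offset_e,
        int fixmap_idx)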

>>> +    mfn_t mfn,
>>> +    unsigned long offset_s,
>>> +    unsigned long offset_e,
>>> +    int fixmap_idx)
>>
>> enum fixed_addresses?
> 
> If that parameter stays, yes.
> 
>>> +{
>>> +    struct subpage_ro_range *entry = NULL, *iter;
>>> +    int rc;
>>> +
>>> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
>>> +    ASSERT(offset_s < PAGE_SIZE);
>>> +    ASSERT(offset_e < PAGE_SIZE);
>>> +
>>> +    spin_lock(&subpage_ro_lock);
>>> +
>>> +    list_for_each_entry( iter, &subpage_ro_ranges, list )
>>
>> Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
>> for the purposes of what blanks should be present/absent, or you don't,
>> a mixture isn't reasonable; also elsewhere).
> 
> Which version would be your preference for list_for_each_entry?

I have no preference, and I'm sure we have ample examples of all forms
(including the mixed ones that we shouldn't have). There's a
for_each_online_cpu() in this file, using the non-keyword style, so
if anything this might be taken as a reference. But I didn't check whether
there are any other uses of keyword-like identifiers which may use the
other style.
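
I.e. either of these, just not a mixture of the two:

    /* Treated as a (pseudo-)keyword: */
    list_for_each_entry ( iter, &subpage_ro_ranges, list )

    /* Treated as a plain function-like macro, like for_each_online_cpu(): */
    list_for_each_entry(iter, &subpage_ro_ranges, list)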

>>> +    {
>>> +        if ( mfn_eq(iter->mfn, mfn) )
>>> +        {
>>> +            entry = iter;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if ( !entry )
>>> +    {
>>> +        /* iter==NULL marks it was a newly allocated entry */
>>> +        iter = NULL;
>>> +        entry = xmalloc(struct subpage_ro_range);
>>> +        rc = -ENOMEM;
>>> +        if ( !entry )
>>> +            goto err_unlock;
>>> +        entry->mfn = mfn;
>>> +        entry->fixmap_idx = fixmap_idx;
>>> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
>>> +                                       RANGESETF_prettyprint_hex);
>>> +        rc = -ENOMEM;
>>> +        if ( !entry->ro_bytes )
>>> +            goto err_unlock;
>>> +    }
>>> +
>>> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
>>> +    if ( rc < 0 )
>>> +        goto err_unlock;
>>> +
>>> +    if ( !iter )
>>> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
>>> +
>>> +    spin_unlock(&subpage_ro_lock);
>>> +
>>> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
>>> +        return 0;
>>> +    else
>>> +        return 1;
>>
>> If this case is really needed, this special return value (as documented
>> in the header) probably needs specially handling in the callers - it's
>> not necessarily an error after all. But I wonder whether it wouldn't be
>> easier to check earlier on and fail right away (with e.g. -EBUSY), 
> 
> The idea is to allow multiple sub-ranges in a single page. Again, if
> using ioremap() internally, instead of fixmap provided externally, this
> case will go away.
> 
>> rather
>> than adding the range and _then_ (kind of, as per patch 2) failing.
> 
> Right, I missed "!= 0" there.

Hmm, adding "!= 0" won't make a difference, will it? Adding "< 0" would.

>>> +err_unlock:
>>
>> Nit: Style (labels indented by at least one space please, also elsewhere).
>>
>>> +static void subpage_mmio_write_emulate(
>>> +    mfn_t mfn,
>>> +    unsigned long offset,
>>
>> unsigned int
>>
>>> +    void *data,
>>
>> const
>>
>>> +    unsigned int len)
>>> +{
>>> +    struct subpage_ro_range *entry;
>>> +    void __iomem *addr;
>>> +
>>> +    rcu_read_lock(&subpage_ro_rcu);
>>> +
>>> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
>>> +    {
>>> +        if ( mfn_eq(entry->mfn, mfn) )
>>> +        {
>>> +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
>>> +                goto out_unlock;
>>
>> This case ...
>>
>>> +            addr = fix_to_virt(entry->fixmap_idx) + offset;
>>> +            switch ( len )
>>> +            {
>>> +            case 1:
>>> +                writeb(*(uint8_t*)data, addr);
>>> +                break;
>>> +            case 2:
>>> +                writew(*(uint16_t*)data, addr);
>>> +                break;
>>> +            case 4:
>>> +                writel(*(uint32_t*)data, addr);
>>> +                break;
>>> +            case 8:
>>> +                writeq(*(uint64_t*)data, addr);
>>> +                break;
>>> +            default:
>>> +                /* mmio_ro_emulated_write() already validated the size */
>>> +                ASSERT_UNREACHABLE();
>>
>> ... as well as, in a release build, this one wants to ...
>>
>>> +            }
>>> +            goto out_unlock;
>>
>> ... not use this path but ...
>>
>>> +        }
>>> +    }
>>> +    gdprintk(XENLOG_WARNING,
>>> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
>>> +             mfn_x(mfn), offset, len);
>>
>> ... make it here. (By implication: I'm not convinced this wants to be a
>> gdprintk(), as I think at least one such warning would better surface in
>> release builds as well.)
> 
> Right, gprintk() would make more sense indeed.
> 
>> At the same time I don't think any message should be issued for write
>> attempts to pages which don't have part of it marked writable. We deal
>> with such silently right now, and this shouldn't change.
> 
> At least for HVM domains, it isn't really silent. It's domain_crash()
> (before my change, not reaching this function at all).

Well, yes, but that's one instance in the lifetime of a domain.

Jan
Marek Marczykowski-Górecki March 29, 2023, 1:27 p.m. UTC | #7
On Wed, Mar 29, 2023 at 02:39:22PM +0200, Jan Beulich wrote:
> On 29.03.2023 12:51, Marek Marczykowski-Górecki wrote:
> > On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote:
> >> On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
> >>> In some cases, only few registers on a page needs to be write-protected.
> >>> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> >>> PBA table (which doesn't need to span the whole table either).
> >>
> >> Yet like the MSI-X table the PBA is not permitted to share a page with
> >> other registers (the table itself excluded). So a need there would
> >> (again) only arise for devices violating the spec.
> > 
> > For PBA, indeed (and due to not seeing device needing such hack, I don't
> > do that for PBA yet). But the XHCI spec doesn't include such limitation, so
> > this case is perfectly valid.
> 
> My remark was merely because you mention PBA in the description.
> Mentioning it is okay, but you would want to further qualify it, I
> think.

Ah, ok.

> >> Taking both remarks together, limiting granularity to 16(?) bytes
> >> would allow using the advanced EPT functionality down the road, and
> >> would at the same time limit the suggested bitmap to just 256 bits /
> >> 32 bytes, which I think gets us below what even an empty rangeset
> >> would require. Plus lookup would also be quite a bit more lightweight.
> > 
> > Indeed, in that case it makes sense.
> 
> Hmm, I've checked the SDM, and I was misremembering: Granularity is
> 128 bytes, which might be too large for the purposes here.

Indeed, it seems so. In the case of the USB3 console, I want to protect
64 bytes of registers...

I guess 16-byte granularity would work, but it feels somewhat arbitrary,
without any good reason to choose that specific number. More logical
would be 4 bytes (a common register size), but that means 128 bytes for
the bitmask.

> >>> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >>>      return 0;
> >>>  }
> >>>  
> >>> +int subpage_mmio_ro_add(
> >>
> >> As long as patch 2 is going to add the only users, __init please, and
> >> there's no need for a "remove" counterpart.
> > 
> > __init makes sense. But as for dropping the "remove" part, I'm not sure. I
> > realize it is dead code now, but it's easier to introduce it now to provide
> > a complete API than to leave it for somebody else to add later when they
> > want to use it elsewhere. Is there some trick to make the compiler/linker
> > optimize it out?
> 
> At the very least you could also mark it __init. There are also the .discard
> and .discard.* sections we handle specially in the linker script. But no
> matter what you do to retain the code without impacting the resulting binary,
> iirc Misra tells us that there shouldn't be dead code.

Well, if dropping "remove", then I guess I could leave a comment
describing what it would need to do. Or maybe just hint in the
description that an earlier version of the patch had "remove"
implemented - so if anybody needs it in the future, they can do some
mailing list archaeology and have something to start with.

> >>> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> >>> +        return 0;
> >>> +    else
> >>> +        return 1;
> >>
> >> If this case is really needed, this special return value (as documented
> >> in the header) probably needs specially handling in the callers - it's
> >> not necessarily an error after all. But I wonder whether it wouldn't be
> >> easier to check earlier on and fail right away (with e.g. -EBUSY), 
> > 
> > The idea is to allow multiple sub-ranges in a single page. Again, if
> > using ioremap() internally, instead of fixmap provided externally, this
> > case will go away.
> > 
> >> rather
> >> than adding the range and _then_ (kind of, as per patch 2) failing.
> > 
> > Right, I missed "!= 0" there.
> 
> Hmm, adding "!= 0" won't make a difference, will it? Adding "< 0" would.

Right.
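
I.e. the caller in patch 2 would want roughly (a sketch; identifiers are
illustrative):

    rc = subpage_mmio_ro_add(mfn, offset_s, offset_e, fixmap_idx);
    /*
     * "rc != 0" would also treat rc == 1 (an earlier caller's fixmap entry
     * being reused) as a failure; only rc < 0 is an actual error.
     */
    if ( rc < 0 )
        printk(XENLOG_WARNING "Failed to mark part of the MMIO page R/O\n");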

> >>> +        }
> >>> +    }
> >>> +    gdprintk(XENLOG_WARNING,
> >>> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
> >>> +             mfn_x(mfn), offset, len);
> >>
> >> ... make it here. (By implication: I'm not convinced this wants to be a
> >> gdprintk(), as I think at least one such warning would better surface in
> >> release builds as well.)
> > 
> > Right, gprintk() would make more sense indeed.
> > 
> >> At the same time I don't think any message should be issued for write
> >> attempts to pages which don't have part of it marked writable. We deal
> >> with such silently right now, and this shouldn't change.
> > 
> > At least for HVM domains, it isn't really silent. It's domain_crash()
> > (before my change, not reaching this function at all).
> 
> Well, yes, but that's one instance in the lifetime of a domain.

What I mean is that it isn't something an HVM domain would exercise in
normal operation. It's an abnormal situation and as such it should IMO
get a log message.
And even for a PV domain, such a message would save me quite some time
debugging related issues...
Jan Beulich March 29, 2023, 2:12 p.m. UTC | #8
On 29.03.2023 15:27, Marek Marczykowski-Górecki wrote:
> On Wed, Mar 29, 2023 at 02:39:22PM +0200, Jan Beulich wrote:
>> On 29.03.2023 12:51, Marek Marczykowski-Górecki wrote:
>>> On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote:
>>>> Taking both remarks together, limiting granularity to 16(?) bytes
>>>> would allow using the advanced EPT functionality down the road, and
>>>> would at the same time limit the suggested bitmap to just 256 bits /
>>>> 32 bytes, which I think gets us below what even an empty rangeset
>>>> would require. Plus lookup would also be quite a bit more lightweight.
>>>
>>> Indeed, in that case it makes sense.
>>
>> Hmm, I've checked the SDM, and I was misremembering: Granularity is
>> 128 bytes, which might be too large for the purposes here.
> 
> Indeed, it seems so. In the case of the USB3 console, I want to protect
> 64 bytes of registers...
> 
> I guess 16-byte granularity would work, but it feels somewhat arbitrary,
> without any good reason to choose that specific number. More logical
> would be 4 bytes (a common register size), but that means 128 bytes for
> the bitmask.

Which is still acceptable, so I'd say 4 bytes it is then.
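
I.e. something along these lines (a rough sketch only; names and exact
checks are of course up to you):

    struct subpage_ro_range {
        struct list_head list;
        mfn_t mfn;
        int fixmap_idx;
        /* One bit per 4-byte element: 4096 / 4 = 1024 bits = 128 bytes. */
        DECLARE_BITMAP(ro_elems, PAGE_SIZE / 4);
        struct rcu_head rcu;
    };

    /* Marking [offset_s, offset_e] read-only then is just */
    for ( i = offset_s / 4; i <= offset_e / 4; i++ )
        __set_bit(i, entry->ro_elems);

    /* and the check in the write handler (for aligned accesses) */
    if ( test_bit(offset / 4, entry->ro_elems) ||
         test_bit((offset + len - 1) / 4, entry->ro_elems) )
        /* the access (at least partially) hits a read-only element */;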

>>>>> +        }
>>>>> +    }
>>>>> +    gdprintk(XENLOG_WARNING,
>>>>> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
>>>>> +             mfn_x(mfn), offset, len);
>>>>
>>>> ... make it here. (By implication: I'm not convinced this wants to be a
>>>> gdprintk(), as I think at least one such warning would better surface in
>>>> release builds as well.)
>>>
>>> Right, gprintk() would make more sense indeed.
>>>
>>>> At the same time I don't think any message should be issued for write
>>>> attempts to pages which don't have part of it marked writable. We deal
>>>> with such silently right now, and this shouldn't change.
>>>
>>> At least for HVM domains, it isn't really silent. It's domain_crash()
>>> (before my change, not reaching this function at all).
>>
>> Well, yes, but that's one instance in the lifetime of a domain.
> 
> What I mean is that it isn't something an HVM domain would exercise in
> normal operation. It's an abnormal situation and as such it should IMO
> get a log message.
> And even for a PV domain, such a message would save me quite some time
> debugging related issues...

Sure. I don't mind a single message. I also don't mind a few. But I'd
prefer if we could avoid relying on (general) rate limiting to bound
the volume, at least when it's the simple case of dropping writes
which don't overlap the region intended to be writable.

That said - if for HVM previously we crashed the domain in such cases
and this now changes, then this change in behavior would also need
mentioning / justifying.

Jan

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 95364deb1996..311102724dea 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2733,7 +2733,7 @@  int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
         .write      = mmio_ro_emulated_write,
         .validate   = hvmemul_validate,
     };
-    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
     struct hvm_emulate_ctxt ctxt;
     const struct x86_emulate_ops *ops;
     unsigned int seg, bdf;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index d326fa1c0136..f1c928e3e4ee 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1942,7 +1942,8 @@  int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
      */
     if ( (p2mt == p2m_mmio_dm) ||
          (npfec.write_access &&
-          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
+          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
+           p2mt == p2m_mmio_direct)) )
     {
         if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
             hvm_inject_hw_exception(TRAP_gp_fault, 0);
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index db29e3e2059f..91937d556bac 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -522,9 +522,31 @@  extern struct rangeset *mmio_ro_ranges;
 void memguard_guard_stack(void *p);
 void memguard_unguard_stack(void *p);
 
+/*
+ * Add more precise r/o marking for a MMIO page. The byte range specified here
+ * will still be R/O, but the rest of the page (not marked as R/O via another
+ * call) will have writes passed through. The write passthrough requires the
+ * caller to provide a fixmap entry.
+ * Since multiple callers can mark different areas of the same page, they might
+ * provide different fixmap entries (although that's very unlikely in
+ * practice). Only the one provided by the first caller will be used. Return value
+ * indicates whether this fixmap entry will be used, or a different one
+ * provided earlier (in which case the caller might decide to release it).
+ *
+ * Return values:
+ *  - negative: error
+ *  - 0: success, fixmap entry is claimed
+ *  - 1: success, fixmap entry set earlier will be used
+ */
+int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
+                        unsigned long offset_e, int fixmap_idx);
+int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
+                           unsigned long offset_e);
+
 struct mmio_ro_emulate_ctxt {
         unsigned long cr2;
         unsigned int seg, bdf;
+        mfn_t mfn;
 };
 
 int cf_check mmio_ro_emulated_write(
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0fe14faa5fa7..b50bdee40b6b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -165,6 +165,19 @@  bool __read_mostly machine_to_phys_mapping_valid;
 
 struct rangeset *__read_mostly mmio_ro_ranges;
 
+/* Handling sub-page read-only MMIO regions */
+struct subpage_ro_range {
+    struct list_head list;
+    mfn_t mfn;
+    int fixmap_idx;
+    struct rangeset *ro_bytes;
+    struct rcu_head rcu;
+};
+
+static LIST_HEAD(subpage_ro_ranges);
+static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
+static DEFINE_SPINLOCK(subpage_ro_lock);
+
 static uint32_t base_disallow_mask;
 /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
 #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
@@ -4893,6 +4906,172 @@  long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     return 0;
 }
 
+int subpage_mmio_ro_add(
+    mfn_t mfn,
+    unsigned long offset_s,
+    unsigned long offset_e,
+    int fixmap_idx)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int rc;
+
+    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
+    ASSERT(offset_s < PAGE_SIZE);
+    ASSERT(offset_e < PAGE_SIZE);
+
+    spin_lock(&subpage_ro_lock);
+
+    list_for_each_entry( iter, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+    {
+        /* iter==NULL marks it was a newly allocated entry */
+        iter = NULL;
+        entry = xmalloc(struct subpage_ro_range);
+        rc = -ENOMEM;
+        if ( !entry )
+            goto err_unlock;
+        entry->mfn = mfn;
+        entry->fixmap_idx = fixmap_idx;
+        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
+                                       RANGESETF_prettyprint_hex);
+        rc = -ENOMEM;
+        if ( !entry->ro_bytes )
+            goto err_unlock;
+    }
+
+    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
+    if ( rc < 0 )
+        goto err_unlock;
+
+    if ( !iter )
+        list_add_rcu(&entry->list, &subpage_ro_ranges);
+
+    spin_unlock(&subpage_ro_lock);
+
+    if ( !iter || entry->fixmap_idx == fixmap_idx )
+        return 0;
+    else
+        return 1;
+
+err_unlock:
+    spin_unlock(&subpage_ro_lock);
+    if ( !iter )
+    {
+        if ( entry )
+        {
+            if ( entry->ro_bytes )
+                rangeset_destroy(entry->ro_bytes);
+            xfree(entry);
+        }
+    }
+    return rc;
+}
+
+static void subpage_mmio_ro_free(struct rcu_head *rcu)
+{
+    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
+
+    rangeset_destroy(entry->ro_bytes);
+    xfree(entry);
+}
+
+int subpage_mmio_ro_remove(
+    mfn_t mfn,
+    unsigned long offset_s,
+    unsigned long offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int rc;
+
+    ASSERT(offset_s < PAGE_SIZE);
+    ASSERT(offset_e < PAGE_SIZE);
+
+    spin_lock(&subpage_ro_lock);
+
+    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    rc = -ENOENT;
+    if ( !entry )
+        goto out_unlock;
+
+    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
+    if ( rc < 0 )
+        goto out_unlock;
+
+    rc = 0;
+
+    if ( !rangeset_is_empty(entry->ro_bytes) )
+        goto out_unlock;
+
+    list_del_rcu(&entry->list);
+    call_rcu(&entry->rcu, subpage_mmio_ro_free);
+
+out_unlock:
+    spin_unlock(&subpage_ro_lock);
+    return rc;
+}
+
+static void subpage_mmio_write_emulate(
+    mfn_t mfn,
+    unsigned long offset,
+    void *data,
+    unsigned int len)
+{
+    struct subpage_ro_range *entry;
+    void __iomem *addr;
+
+    rcu_read_lock(&subpage_ro_rcu);
+
+    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(entry->mfn, mfn) )
+        {
+            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
+                goto out_unlock;
+
+            addr = fix_to_virt(entry->fixmap_idx) + offset;
+            switch ( len )
+            {
+            case 1:
+                writeb(*(uint8_t*)data, addr);
+                break;
+            case 2:
+                writew(*(uint16_t*)data, addr);
+                break;
+            case 4:
+                writel(*(uint32_t*)data, addr);
+                break;
+            case 8:
+                writeq(*(uint64_t*)data, addr);
+                break;
+            default:
+                /* mmio_ro_emulated_write() already validated the size */
+                ASSERT_UNREACHABLE();
+            }
+            goto out_unlock;
+        }
+    }
+    gdprintk(XENLOG_WARNING,
+             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
+             mfn_x(mfn), offset, len);
+
+out_unlock:
+    rcu_read_unlock(&subpage_ro_rcu);
+}
+
 int cf_check mmio_ro_emulated_write(
     enum x86_segment seg,
     unsigned long offset,
@@ -4911,6 +5090,8 @@  int cf_check mmio_ro_emulated_write(
         return X86EMUL_UNHANDLEABLE;
     }
 
+    subpage_mmio_write_emulate(mmio_ro_ctxt->mfn, offset & (PAGE_SIZE - 1), p_data, bytes);
+
     return X86EMUL_OKAY;
 }
 
diff --git a/xen/arch/x86/pv/ro-page-fault.c b/xen/arch/x86/pv/ro-page-fault.c
index 5963f5ee2d51..91caa2c8f520 100644
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -342,6 +342,7 @@  static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
             return X86EMUL_UNHANDLEABLE;
     }
 
+    mmio_ro_ctxt.mfn = mfn;
     ctxt->data = &mmio_ro_ctxt;
     if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
         return x86_emulate(ctxt, &mmcfg_intercept_ops);