
[V2,6/6] iommu/arm: Add Renesas IPMMU-VMSA support

Message ID 1564763985-20312-7-git-send-email-olekstysh@gmail.com (mailing list archive)
State Superseded
Series iommu/arm: Add Renesas IPMMU-VMSA support + Linux's iommu_fwspec

Commit Message

Oleksandr Tyshchenko Aug. 2, 2019, 4:39 p.m. UTC
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

The IPMMU-VMSA is a VMSA-compatible I/O Memory Management Unit (IOMMU)
which provides address translation and access protection functionalities
to processing units and interconnect networks.

Please note, the current driver is supposed to work only with the newest
Gen3 SoC revisions, whose IPMMU hardware supports the stage 2 translation
table format and is able to use the CPU's P2M table as is, provided it is
a 3-level page table (up to 40-bit IPA).

The major differences compared to the Linux driver are:

1. Stage 1/Stage 2 translation. The Linux driver supports Stage 1
translation only (with the Stage 1 translation table format) and manages
the page tables by itself. The Xen driver, however, supports Stage 2
translation (with the Stage 2 translation table format) in order to share
the P2M with the CPU. Stage 1 translation is always bypassed in the Xen
driver.

So, the Xen driver is supposed to be used only with the newest Gen3 SoC
revisions (H3 ES3.0, M3 ES3.0, etc.) whose IPMMU hardware supports the
stage 2 translation table format.

2. AArch64 support. The Linux driver uses VMSAv8-32 mode, while the Xen
driver enables Armv8 VMSAv8-64 mode to cover up to 40-bit input addresses.

3. Context bank (set of page tables) usage. In Xen, each context bank is
mapped to one Xen domain, so all devices passed through to the same Xen
domain share the same context bank.
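
To illustrate the idea (a minimal sketch with hypothetical names, not the
actual driver structures):

struct ipmmu_ctx {
    unsigned int context_id;    /* hardware context bank index         */
    struct domain *d;           /* Xen domain this context is bound to */
};

/* All devices assigned to the same Xen domain resolve to one context. */
static struct ipmmu_ctx *ipmmu_ctx_for_domain(struct ipmmu_ctx *ctxs,
                                              unsigned int num_ctxs,
                                              struct domain *d)
{
    unsigned int i;

    for ( i = 0; i < num_ctxs; i++ )
        if ( ctxs[i].d == d )
            return &ctxs[i];

    return NULL; /* no context bank allocated for this domain yet */
}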

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
CC: Julien Grall <julien.grall@arm.com>
CC: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

---
Changes V1 -> V2:
    - rewrote the driver to use iommu_fwspec
    - removed DT parsing code for micro-TLBs
    - removed struct ipmmu_vmsa_master_cfg, dev_archdata macro
    - added ipmmu_find_mmu_by_dev(), various helpers to access
      fwspec->iommu_priv
    - implemented new callback "add_device" to add master device to IPMMU
    - removed ipmmu_protect_masters()
    - removed code to locate Root device in the first place,
      used EAGAIN to request deferred probing
    - used printk_once for the system wide error messages in ipmmu_init()
      which don't need to be shown for every device being probed
    - removed map_page/unmap_page implementation, reused them
      from iommu_helpers.c
    - used %pd for printing domain id
    - performed various cosmetic fixes
    - changed u32 -> uint32_t, u64 -> uint64_t,
      unsigned int -> uint32_t where needed
    - clarified TODOs
    - clarified supported SoC versions in config IPMMU_VMSA,
      set default to "n"
    - updated comments in code, provided more accurate description,
      added new comments where needed
    - updated patch description by providing differences between
      Linux/Xen implementations
    - removed fields for cache snoop transaction when configuring IMTTBCR
      (update from Renesas BSP)
---
 xen/arch/arm/platforms/Kconfig           |    1 +
 xen/drivers/passthrough/Kconfig          |   13 +
 xen/drivers/passthrough/arm/Makefile     |    1 +
 xen/drivers/passthrough/arm/ipmmu-vmsa.c | 1342 ++++++++++++++++++++++++++++++
 4 files changed, 1357 insertions(+)
 create mode 100644 xen/drivers/passthrough/arm/ipmmu-vmsa.c

Comments

Yoshihiro Shimoda Aug. 7, 2019, 2:41 a.m. UTC | #1
Hi Oleksandr-san,

I can access the datasheet of this hardware, so I reviewed this patch.
I'm not familiar with the Xen development rules, so some comments might
not be a good fit. If so, please ignore :)

> From: Oleksandr Tyshchenko, Sent: Saturday, August 3, 2019 1:40 AM
> 
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> The IPMMU-VMSA is VMSA-compatible I/O Memory Management Unit (IOMMU)
> which provides address translation and access protection functionalities
> to processing units and interconnect networks.
> 
> Please note, current driver is supposed to work only with newest
> Gen3 SoCs revisions which IPMMU hardware supports stage 2 translation

This should be "R-Car Gen3 SoCs", instead of "Gen3 SoCs".

> table format and is able to use CPU's P2M table as is if one is
> 3-level page table (up to 40 bit IPA).
> 
> The major differences compare to the Linux driver are:
> 
> 1. Stage 1/Stage 2 translation. Linux driver supports Stage 1
> translation only (with Stage 1 translation table format). It manages
> page table by itself. But Xen driver supports Stage 2 translation
> (with Stage 2 translation table format) to be able to share the P2M
> with the CPU. Stage 1 translation is always bypassed in Xen driver.
> 
> So, Xen driver is supposed to be used with newest Gen3 SoC revisions only

Same here.

> (H3 ES3.0, M3 ES3.0, etc.) which IPMMU H/W supports stage 2 translation

According to the latest manual, M3 ES3.0 is named "M3-W+".

> table format.

<snip>
> diff --git a/xen/drivers/passthrough/arm/ipmmu-vmsa.c b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> new file mode 100644
> index 0000000..a34a8f8
> --- /dev/null
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -0,0 +1,1342 @@
> +/*
> + * xen/drivers/passthrough/arm/ipmmu-vmsa.c
> + *
> + * Driver for the Renesas IPMMU-VMSA found in R-Car Gen3 SoCs.
> + *
> + * The IPMMU-VMSA is VMSA-compatible I/O Memory Management Unit (IOMMU)
> + * which provides address translation and access protection functionalities
> + * to processing units and interconnect networks.
> + *
> + * Please note, current driver is supposed to work only with newest Gen3 SoCs
> + * revisions which IPMMU hardware supports stage 2 translation table format and
> + * is able to use CPU's P2M table as is.
> + *
> + * Based on Linux's IPMMU-VMSA driver from Renesas BSP:
> + *    drivers/iommu/ipmmu-vmsa.c

So, I think Linux's copyrights should be described here.

> + * you can found at:
> + *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
> + *    branch: v4.14.75-ltsi/rcar-3.9.6
> + *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
> + * and Xen's SMMU driver:
> + *    xen/drivers/passthrough/arm/smmu.c
> + *
> + * Copyright (C) 2016-2019 EPAM Systems Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.

I don't know the Xen license description rules, but since a few source files
have SPDX-License-Identifier, can we also use it in this driver?

> + */
> +
> +#include <xen/delay.h>
> +#include <xen/err.h>
> +#include <xen/irq.h>
> +#include <xen/lib.h>
> +#include <xen/list.h>

I don't know the Xen passthrough driver rules, but doesn't this file need
#include <xen/iommu.h>? (xen/sched.h seems to pull it in, so no compile
error happens though.)
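
For reference, the include block with the missing header added in
alphabetical order would then read:

#include <xen/delay.h>
#include <xen/err.h>
#include <xen/iommu.h>
#include <xen/irq.h>
#include <xen/lib.h>
#include <xen/list.h>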

<snip>
> +/* Registers Definition */
> +#define IM_CTX_SIZE    0x40
> +
> +#define IMCTR                0x0000
> +/*
> + * These fields are implemented in IPMMU-MM only. So, can be set for
> + * Root IPMMU only.
> + */
> +#define IMCTR_VA64           (1 << 29)
> +#define IMCTR_TRE            (1 << 17)
> +#define IMCTR_AFE            (1 << 16)
> +#define IMCTR_RTSEL_MASK     (3 << 4)
> +#define IMCTR_RTSEL_SHIFT    4
> +#define IMCTR_TREN           (1 << 3)
> +/*
> + * These fields are common for all IPMMU devices. So, can be set for
> + * Cache IPMMUs as well.
> + */
> +#define IMCTR_INTEN          (1 << 2)
> +#define IMCTR_FLUSH          (1 << 1)
> +#define IMCTR_MMUEN          (1 << 0)
> +#define IMCTR_COMMON_MASK    (7 << 0)
> +
> +#define IMCAAR               0x0004
> +
> +#define IMTTBCR                        0x0008
> +#define IMTTBCR_EAE                    (1 << 31)
> +#define IMTTBCR_PMB                    (1 << 30)
> +#define IMTTBCR_SH1_NON_SHAREABLE      (0 << 28)
> +#define IMTTBCR_SH1_OUTER_SHAREABLE    (2 << 28)
> +#define IMTTBCR_SH1_INNER_SHAREABLE    (3 << 28)
> +#define IMTTBCR_SH1_MASK               (3 << 28)
> +#define IMTTBCR_ORGN1_NC               (0 << 26)
> +#define IMTTBCR_ORGN1_WB_WA            (1 << 26)
> +#define IMTTBCR_ORGN1_WT               (2 << 26)
> +#define IMTTBCR_ORGN1_WB               (3 << 26)
> +#define IMTTBCR_ORGN1_MASK             (3 << 26)
> +#define IMTTBCR_IRGN1_NC               (0 << 24)
> +#define IMTTBCR_IRGN1_WB_WA            (1 << 24)
> +#define IMTTBCR_IRGN1_WT               (2 << 24)
> +#define IMTTBCR_IRGN1_WB               (3 << 24)
> +#define IMTTBCR_IRGN1_MASK             (3 << 24)
> +#define IMTTBCR_TSZ1_MASK              (1f << 16)

No one uses it at the moment, but this should be (0x1f << 16).
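
That is, the corrected definition would read:

#define IMTTBCR_TSZ1_MASK              (0x1f << 16)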

<snip>
> +/* Xen IOMMU ops */
> +static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
> +{
> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> +
> +    if ( !xen_domain || !xen_domain->root_domain )
> +        return 0;
> +
> +    spin_lock(&xen_domain->lock);

Are local irqs already disabled here?
If not, you should use spin_lock_irqsave(), because ipmmu_irq() also
takes the lock.
# To be honest, in the normal case no irq should happen on the current
# implementation.
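
For reference, the suggested pattern would be (a sketch only):

unsigned long flags;

spin_lock_irqsave(&xen_domain->lock, flags);
ipmmu_tlb_invalidate(xen_domain->root_domain);
spin_unlock_irqrestore(&xen_domain->lock, flags);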

> +    ipmmu_tlb_invalidate(xen_domain->root_domain);
> +    spin_unlock(&xen_domain->lock);
> +
> +    return 0;
> +}
> +
<snip>
> +static int ipmmu_assign_device(struct domain *d, u8 devfn, struct device *dev,
> +                               uint32_t flag)
> +{
> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> +    struct ipmmu_vmsa_domain *domain;
> +    int ret;
> +
> +    if ( !xen_domain )
> +        return -EINVAL;
> +
> +    if ( !to_ipmmu(dev) )
> +        return -ENODEV;
> +
> +    spin_lock(&xen_domain->lock);

Same here.

> +    /*
> +     * The IPMMU context for the Xen domain is not allocated beforehand
> +     * (at the Xen domain creation time), but on demand only, when the first
> +     * master device being attached to it.
> +     * Create Root IPMMU domain which context will be mapped to this Xen domain
> +     * if not exits yet.
> +     */
> +    if ( !xen_domain->root_domain )
> +    {
> +        domain = ipmmu_alloc_root_domain(d);
> +        if ( IS_ERR(domain) )
> +        {
> +            ret = PTR_ERR(domain);
> +            goto out;
> +        }
> +
> +        xen_domain->root_domain = domain;
> +    }
> +
> +    if ( to_domain(dev) )
> +    {
> +        dev_err(dev, "Already attached to IPMMU domain\n");
> +        ret = -EEXIST;
> +        goto out;
> +    }
> +
> +    /*
> +     * Master devices behind the same Cache IPMMU can be attached to the same
> +     * Cache IPMMU domain.
> +     * Before creating new IPMMU domain check to see if the required one
> +     * already exists for this Xen domain.
> +     */
> +    domain = ipmmu_get_cache_domain(d, dev);
> +    if ( !domain )
> +    {
> +        /* Create new IPMMU domain this master device will be attached to. */
> +        domain = ipmmu_alloc_cache_domain(d);
> +        if ( IS_ERR(domain) )
> +        {
> +            ret = PTR_ERR(domain);
> +            goto out;
> +        }
> +
> +        /* Chain new IPMMU domain to the Xen domain. */
> +        list_add(&domain->list, &xen_domain->cache_domains);
> +    }
> +
> +    ret = ipmmu_attach_device(domain, dev);
> +    if ( ret )
> +    {
> +        /*
> +         * Destroy Cache IPMMU domain only if there are no master devices
> +         * attached to it.
> +         */
> +        if ( !domain->refcount )
> +            ipmmu_free_cache_domain(domain);
> +    }
> +    else
> +    {
> +        domain->refcount++;
> +        set_domain(dev, domain);
> +    }
> +
> +out:
> +    spin_unlock(&xen_domain->lock);
> +
> +    return ret;
> +}
> +
> +static int ipmmu_deassign_device(struct domain *d, struct device *dev)
> +{
> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> +    struct ipmmu_vmsa_domain *domain = to_domain(dev);
> +
> +    if ( !domain || domain->d != d )
> +    {
> +        dev_err(dev, "Not attached to %pd\n", d);
> +        return -ESRCH;
> +    }
> +
> +    spin_lock(&xen_domain->lock);

Same here.

> +    ipmmu_detach_device(domain, dev);
> +    set_domain(dev, NULL);
> +    domain->refcount--;
> +
> +    /*
> +     * Destroy Cache IPMMU domain only if there are no master devices
> +     * attached to it.
> +     */
> +    if ( !domain->refcount )
> +        ipmmu_free_cache_domain(domain);
> +
> +    spin_unlock(&xen_domain->lock);
> +
> +    return 0;
> +}
<snip>
> +static void __hwdom_init ipmmu_iommu_hwdom_init(struct domain *d)
> +{
> +    /* Set to false options not supported on ARM. */
> +    if ( iommu_hwdom_inclusive )
> +        printk(XENLOG_WARNING "ipmmu: map-inclusive dom0-iommu option is not supported on ARM\n");
> +    iommu_hwdom_inclusive = false;
> +    if ( iommu_hwdom_reserved == 1 )
> +        printk(XENLOG_WARNING "ipmmu: map-reserved dom0-iommu option is not supported on ARM\n");
> +    iommu_hwdom_reserved = 0;
> +
> +    arch_iommu_hwdom_init(d);
> +}
> +
> +static void ipmmu_iommu_domain_teardown(struct domain *d)
> +{
> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> +
> +    if ( !xen_domain )
> +        return;
> +
> +    spin_lock(&xen_domain->lock);

Same here.

> +    /*
> +     * Destroy Root IPMMU domain which context is mapped to this Xen domain
> +     * if exits.
> +     */
> +    if ( xen_domain->root_domain )
> +        ipmmu_free_root_domain(xen_domain->root_domain);
> +
> +    spin_unlock(&xen_domain->lock);
> +
> +    /*
> +     * We assume that all master devices have already been detached from
> +     * this Xen domain and there must be no associated Cache IPMMU domains
> +     * in use.
> +     */
> +    ASSERT(list_empty(&xen_domain->cache_domains));

I think this should be in the spin lock held by &xen_domain->lock.

> +    xfree(xen_domain);
> +    dom_iommu(d)->arch.priv = NULL;
> +}
> +
> +static const struct iommu_ops ipmmu_iommu_ops =
> +{
> +    .init            = ipmmu_iommu_domain_init,
> +    .hwdom_init      = ipmmu_iommu_hwdom_init,
> +    .teardown        = ipmmu_iommu_domain_teardown,
> +    .iotlb_flush     = ipmmu_iotlb_flush,
> +    .iotlb_flush_all = ipmmu_iotlb_flush_all,
> +    .assign_device   = ipmmu_assign_device,
> +    .reassign_device = ipmmu_reassign_device,
> +    .map_page        = arm_iommu_map_page,
> +    .unmap_page      = arm_iommu_unmap_page,
> +    .add_device      = ipmmu_add_device,
> +};
> +
> +/* RCAR GEN3 product and cut information. */

"R-Car Gen3 SoCs" is better than "RCAR GEN3".

> +#define RCAR_PRODUCT_MASK    0x00007F00
> +#define RCAR_PRODUCT_H3      0x00004F00
> +#define RCAR_PRODUCT_M3      0x00005200

At least, I think this should be M3W instead of M3.
# FYI, M3-W and M3-W+ are the same value.
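
That is, something like (following the suggestion):

#define RCAR_PRODUCT_MASK    0x00007F00
#define RCAR_PRODUCT_H3      0x00004F00
#define RCAR_PRODUCT_M3W     0x00005200  /* M3-W and M3-W+ share this value */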

<snip>

Best regards,
Yoshihiro Shimoda
Oleksandr Tyshchenko Aug. 7, 2019, 4:01 p.m. UTC | #2
Hi, Shimoda-san.

Thank you for the review.


>
>> From: Oleksandr Tyshchenko, Sent: Saturday, August 3, 2019 1:40 AM
>>
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> The IPMMU-VMSA is VMSA-compatible I/O Memory Management Unit (IOMMU)
>> which provides address translation and access protection functionalities
>> to processing units and interconnect networks.
>>
>> Please note, current driver is supposed to work only with newest
>> Gen3 SoCs revisions which IPMMU hardware supports stage 2 translation
> This should be "R-Car Gen3 SoCs", instead of "Gen3 SoCs".

Will update.


>
>> table format and is able to use CPU's P2M table as is if one is
>> 3-level page table (up to 40 bit IPA).
>>
>> The major differences compare to the Linux driver are:
>>
>> 1. Stage 1/Stage 2 translation. Linux driver supports Stage 1
>> translation only (with Stage 1 translation table format). It manages
>> page table by itself. But Xen driver supports Stage 2 translation
>> (with Stage 2 translation table format) to be able to share the P2M
>> with the CPU. Stage 1 translation is always bypassed in Xen driver.
>>
>> So, Xen driver is supposed to be used with newest Gen3 SoC revisions only
> Same here.

ok


>
>> (H3 ES3.0, M3 ES3.0, etc.) which IPMMU H/W supports stage 2 translation
> According to the latest manual, M3 ES3.0 is named as "M3-W+".

Will update.


>> diff --git a/xen/drivers/passthrough/arm/ipmmu-vmsa.c b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
>> new file mode 100644
>> index 0000000..a34a8f8
>> --- /dev/null
>> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
>> @@ -0,0 +1,1342 @@
>> +/*
>> + * xen/drivers/passthrough/arm/ipmmu-vmsa.c
>> + *
>> + * Driver for the Renesas IPMMU-VMSA found in R-Car Gen3 SoCs.
>> + *
>> + * The IPMMU-VMSA is VMSA-compatible I/O Memory Management Unit (IOMMU)
>> + * which provides address translation and access protection functionalities
>> + * to processing units and interconnect networks.
>> + *
>> + * Please note, current driver is supposed to work only with newest Gen3 SoCs
>> + * revisions which IPMMU hardware supports stage 2 translation table format and
>> + * is able to use CPU's P2M table as is.
>> + *
>> + * Based on Linux's IPMMU-VMSA driver from Renesas BSP:
>> + *    drivers/iommu/ipmmu-vmsa.c
> So, I think the Linux's Copyrights should be described here.

Yes, will add.


>
>> + * you can found at:
>> + *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
>> + *    branch: v4.14.75-ltsi/rcar-3.9.6
>> + *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
>> + * and Xen's SMMU driver:
>> + *    xen/drivers/passthrough/arm/smmu.c
>> + *
>> + * Copyright (C) 2016-2019 EPAM Systems Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms and conditions of the GNU General Public
>> + * License, version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> I don't know that Xen license description rule, but since a few source files have
> SPDX-License-Identifier, can we also use it on the driver?

I am afraid I don't know the correct answer to this question. I would
leave this to the maintainers.

I just followed the sample copyright notice for the GPL v2 license given
in the document:

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=CONTRIBUTING


>> + */
>> +
>> +#include <xen/delay.h>
>> +#include <xen/err.h>
>> +#include <xen/irq.h>
>> +#include <xen/lib.h>
>> +#include <xen/list.h>
> I don't know that Xen passthrough driver rule though, doesn't here need
> #include <xen/iommu.h>? (The xen/sched.h seems to have it so that
> no compile error happens though.)

Probably, yes, I should have included that header.


>> +
>> +#define IMTTBCR                        0x0008
>> +#define IMTTBCR_EAE                    (1 << 31)
>> +#define IMTTBCR_PMB                    (1 << 30)
>> +#define IMTTBCR_SH1_NON_SHAREABLE      (0 << 28)
>> +#define IMTTBCR_SH1_OUTER_SHAREABLE    (2 << 28)
>> +#define IMTTBCR_SH1_INNER_SHAREABLE    (3 << 28)
>> +#define IMTTBCR_SH1_MASK               (3 << 28)
>> +#define IMTTBCR_ORGN1_NC               (0 << 26)
>> +#define IMTTBCR_ORGN1_WB_WA            (1 << 26)
>> +#define IMTTBCR_ORGN1_WT               (2 << 26)
>> +#define IMTTBCR_ORGN1_WB               (3 << 26)
>> +#define IMTTBCR_ORGN1_MASK             (3 << 26)
>> +#define IMTTBCR_IRGN1_NC               (0 << 24)
>> +#define IMTTBCR_IRGN1_WB_WA            (1 << 24)
>> +#define IMTTBCR_IRGN1_WT               (2 << 24)
>> +#define IMTTBCR_IRGN1_WB               (3 << 24)
>> +#define IMTTBCR_IRGN1_MASK             (3 << 24)
>> +#define IMTTBCR_TSZ1_MASK              (1f << 16)
> At the moment, no one uses it though, this should be (0x1f << 16).

Will correct.


>
> <snip>
> +/* Xen IOMMU ops */
>> +static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
>> +{
>> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
>> +
>> +    if ( !xen_domain || !xen_domain->root_domain )
>> +        return 0;
>> +
>> +    spin_lock(&xen_domain->lock);
> Is local irq is already disabled here?
> If no, you should use spin_lock_irqsave() because the ipmmu_irq() also
> gets the lock.


No, it is not disabled. But ipmmu_irq() uses a different lock (mmu->lock),
so I think there won't be a deadlock.

Or did I really miss something?

If we are worried about ipmmu_tlb_invalidate(), which is called both here
(to perform a flush requested by the P2M code, which manages the page
tables) and from the irq handler (to perform a flush to resume address
translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
from the irq handler instead. That way the two would be serialized. What
do you think?


> # To be honest, in normal case, any irq on the current implementation
> # should not happen though.

Agree here.


>> +    /*
>> +     * Destroy Root IPMMU domain which context is mapped to this Xen domain
>> +     * if exits.
>> +     */
>> +    if ( xen_domain->root_domain )
>> +        ipmmu_free_root_domain(xen_domain->root_domain);
>> +
>> +    spin_unlock(&xen_domain->lock);
>> +
>> +    /*
>> +     * We assume that all master devices have already been detached from
>> +     * this Xen domain and there must be no associated Cache IPMMU domains
>> +     * in use.
>> +     */
>> +    ASSERT(list_empty(&xen_domain->cache_domains));
> I think this should be in the spin lock held by &xen_domain->lock.

OK. Will put spin_unlock after it.


>
>> +    xfree(xen_domain);
>> +    dom_iommu(d)->arch.priv = NULL;
>> +}
>> +
>> +static const struct iommu_ops ipmmu_iommu_ops =
>> +{
>> +    .init            = ipmmu_iommu_domain_init,
>> +    .hwdom_init      = ipmmu_iommu_hwdom_init,
>> +    .teardown        = ipmmu_iommu_domain_teardown,
>> +    .iotlb_flush     = ipmmu_iotlb_flush,
>> +    .iotlb_flush_all = ipmmu_iotlb_flush_all,
>> +    .assign_device   = ipmmu_assign_device,
>> +    .reassign_device = ipmmu_reassign_device,
>> +    .map_page        = arm_iommu_map_page,
>> +    .unmap_page      = arm_iommu_unmap_page,
>> +    .add_device      = ipmmu_add_device,
>> +};
>> +
>> +/* RCAR GEN3 product and cut information. */
> "R-Car Gen3 SoCs" is better than "RCAR GEN3".

Will update.


>
>> +#define RCAR_PRODUCT_MASK    0x00007F00
>> +#define RCAR_PRODUCT_H3      0x00004F00
>> +#define RCAR_PRODUCT_M3      0x00005200
> At least, I think we should be M3W, instead of M3.
> # FYI, M3-W and M3-W+ are the same value.

Will update.
Julien Grall Aug. 7, 2019, 7:15 p.m. UTC | #3
(+ Lars)

Hi,

On 8/7/19 5:01 PM, Oleksandr wrote:
>>> + * you can found at:
>>> + *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
>>> + *    branch: v4.14.75-ltsi/rcar-3.9.6
>>> + *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
>>> + * and Xen's SMMU driver:
>>> + *    xen/drivers/passthrough/arm/smmu.c
>>> + *
>>> + * Copyright (C) 2016-2019 EPAM Systems Inc.
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms and conditions of the GNU General Public
>>> + * License, version 2, as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> + * General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public
>>> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
>> I don't know that Xen license description rule, but since a few source 
>> files have
>> SPDX-License-Identifier, can we also use it on the driver?
> 
> I am afraid, I don't know a correct answer for this question. I would 
> leave this to maintainers.
> 
> I just followed sample copyright notice for GPL v2 License according to 
> the document:
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=CONTRIBUTING

The file CONTRIBUTING only gives examples of common licenses. So I think
it is fine to use SPDX, all the more since they are already used. The only
request is to put either SPDX or the full-blown text, but not both :).
Lars, any objection?

I am quite in favor of SPDX because it makes it easier to find out the
license. With the full-blown text, the wording may vary slightly between
licenses. For instance, the only difference between GPLv2 and GPLv2+ is
", or (at your option) any later version". I let you imagine how easy it
is to miss that when reviewing ;).

We had a discussion last year about using SPDX in Xen code base but I 
never got the time to formally suggest it.

> 
>>> + */
>>> +
>>> +#include <xen/delay.h>
>>> +#include <xen/err.h>
>>> +#include <xen/irq.h>
>>> +#include <xen/lib.h>
>>> +#include <xen/list.h>
>> I don't know that Xen passthrough driver rule though, doesn't here need
>> #include <xen/iommu.h>? (The xen/sched.h seems to have it so that
>> no compile error happens though.)
> 
> Probably, yes, I should have included that header.

I am fine either way :). The indirect inclusion happens quite often and
we only notice it when someone decides to rework the headers.

[...]
>> +/* Xen IOMMU ops */
>>> +static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
>>> +{
>>> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
>>> +
>>> +    if ( !xen_domain || !xen_domain->root_domain )
>>> +        return 0;
>>> +
>>> +    spin_lock(&xen_domain->lock);
>> Is local irq is already disabled here?
>> If no, you should use spin_lock_irqsave() because the ipmmu_irq() also
>> gets the lock.
> 
> 
> No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I 
> think, there won't be a deadlock.
> 
> Or I really missed something?
> 
> If we worry about ipmmu_tlb_invalidate() which is called here (to 
> perform a flush by request from P2M code, which manages a page table) 
> and from the irq handler (to perform a flush to resume address 
> translation), I could use a tasklet to schedule ipmmu_tlb_invalidate() 
> from the irq handler then. This way we would get this serialized. What 
> do you think?

I am afraid a tasklet is not an option. You need to perform the TLB
flush when requested, otherwise you are introducing a security issue.

This is because as soon as a region is unmapped in the page table, we
drop the reference on any page backing that region. When the reference
drops to zero, the page can be reallocated to another domain or even Xen.
If the TLB flush happens after that, the guest may still be able to
access the page for a short time if the translation has been cached by
any TLB (IOMMU, processor).
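
To make the required ordering concrete (a conceptual sketch with
hypothetical helper names, not the actual P2M code):

/*
 * The TLB flush must complete before the backing page can be reused;
 * otherwise a stale IOMMU/CPU TLB entry may still reach a page that
 * now belongs to another domain or to Xen.
 */
p2m_clear_entry(p2m, gfn);   /* 1. remove the mapping                 */
iommu_flush_tlb(d);          /* 2. flush synchronously, not deferred  */
put_page(page);              /* 3. only now may the page be recycled  */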

[...]

>>> +    /*
>>> +     * Destroy Root IPMMU domain which context is mapped to this Xen 
>>> domain
>>> +     * if exits.
>>> +     */
>>> +    if ( xen_domain->root_domain )
>>> +        ipmmu_free_root_domain(xen_domain->root_domain);
>>> +
>>> +    spin_unlock(&xen_domain->lock);
>>> +
>>> +    /*
>>> +     * We assume that all master devices have already been detached 
>>> from
>>> +     * this Xen domain and there must be no associated Cache IPMMU 
>>> domains
>>> +     * in use.
>>> +     */
>>> +    ASSERT(list_empty(&xen_domain->cache_domains));
>> I think this should be in the spin lock held by &xen_domain->lock.
> 
> OK. Will put spin_unlock after it.

The spin_lock is actually pointless here. This is done when the domain 
is destroyed, so nobody should touch it.

If you think concurrent access can still happen, then you are going to 
be in deep trouble as you free the xen_domain (and therefore the 
spinlock) below :).

Cheers,
Oleksandr Tyshchenko Aug. 7, 2019, 8:28 p.m. UTC | #4
Hi, Julien.
Sorry for the possible format issues.


> > No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I
> > think, there won't be a deadlock.
> >
> > Or I really missed something?
> >
> > If we worry about ipmmu_tlb_invalidate() which is called here (to
> > perform a flush by request from P2M code, which manages a page table)
> > and from the irq handler (to perform a flush to resume address
> > translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
> > from the irq handler then. This way we would get this serialized. What
> > do you think?
>
> I am afraid a tasklet is not an option. You need to perform the TLB
> flush when requested otherwise you are introducing a security issue.
>
> This is because as soon as a region is unmapped in the page table, we
> remove the drop the reference on any page backing that region. When the
> reference is dropped to zero, the page can be reallocated to another
> domain or even Xen. If the TLB flush happen after, then the guest may
> still be able to access the page for a short time if the translation has
> been cached by the any TLB (IOMMU, Processor).
>

>
I understand this. I am not proposing to delay a TLB flush requested by the
P2M code in any case. I just propose to issue the TLB flush which we have to
perform in case of page faults (to resolve the error condition and resume
translations) from a tasklet rather than directly from the interrupt handler.
This is the TLB flush I am speaking about:

https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598

Sorry if I was unclear.
Yoshihiro Shimoda Aug. 8, 2019, 4:05 a.m. UTC | #5
Hi Oleksandr-san,

> From: Oleksandr, Sent: Thursday, August 8, 2019 1:01 AM
> 
> 
> Hi, Shimoda-san.
> 
> Thank you for the review.

You're welcome.

<snip>
> > +/* Xen IOMMU ops */
> >> +static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
> >> +{
> >> +    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> >> +
> >> +    if ( !xen_domain || !xen_domain->root_domain )
> >> +        return 0;
> >> +
> >> +    spin_lock(&xen_domain->lock);
> > Is local irq is already disabled here?
> > If no, you should use spin_lock_irqsave() because the ipmmu_irq() also
> > gets the lock.
> 
> 
> No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I
> think, there won't be a deadlock.
> 
> Or I really missed something?

You're correct. I didn't realize that ipmmu_irq() used another mmu->lock.

> If we worry about ipmmu_tlb_invalidate() which is called here (to
> perform a flush by request from P2M code, which manages a page table)
> and from the irq handler (to perform a flush to resume address
> translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
> from the irq handler then. This way we would get this serialized. What
> do you think?

I was just concerned about a deadlock caused by recursive spin locks.
So calling ipmmu_tlb_invalidate() here is OK, I think.

Best regards,
Yoshihiro Shimoda
Julien Grall Aug. 8, 2019, 9:05 a.m. UTC | #6
On 07/08/2019 21:28, Oleksandr Tyshchenko wrote:
> Hi, Julien.

Hi,

> Sorry for the possible format issues.
> 
> 
>      > No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I
>      > think, there won't be a deadlock.
>      >
>      > Or I really missed something?
>      >
>      > If we worry about ipmmu_tlb_invalidate() which is called here (to
>      > perform a flush by request from P2M code, which manages a page table)
>      > and from the irq handler (to perform a flush to resume address
>      > translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
>      > from the irq handler then. This way we would get this serialized. What
>      > do you think?
> 
>     I am afraid a tasklet is not an option. You need to perform the TLB
>     flush when requested otherwise you are introducing a security issue.
> 
>     This is because as soon as a region is unmapped in the page table, we
>     remove the drop the reference on any page backing that region. When the
>     reference is dropped to zero, the page can be reallocated to another
>     domain or even Xen. If the TLB flush happen after, then the guest may
>     still be able to access the page for a short time if the translation has
>     been cached by the any TLB (IOMMU, Processor).
> 
> 
> 
> I understand this. I am not proposing to delay a requested by P2M code TLB flush 
> in any case. I just propose to issue TLB flush (which we have to perform in case 
> of page faults, to resolve error condition and resume translations) from a 
> tasklet rather than from interrupt handler directly. This is the TLB flush I am 
> speaking about:
> 
> https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598
> 
> Sorry if I was unclear.

My mistake, I misread what you wrote.

I found the flush in the renesas-bsp but not in Linux upstream, and it is
not clear why it is actually required. You are not fixing any translation
error, so what will this flush do?

Regarding the placement of the flush: if you execute it in a tasklet, it
will likely be done later on, after the IRQ has been acknowledged. What is
the implication of delaying it?

Cheers,
Oleksandr Tyshchenko Aug. 8, 2019, 10:14 a.m. UTC | #7
> Hi,

Hi, Julien.


>
>> Sorry for the possible format issues.
>>
>>
>>      > No, it is not disabled. But, ipmmu_irq() uses another 
>> mmu->lock. So, I
>>      > think, there won't be a deadlock.
>>      >
>>      > Or I really missed something?
>>      >
>>      > If we worry about ipmmu_tlb_invalidate() which is called here (to
>>      > perform a flush by request from P2M code, which manages a page 
>> table)
>>      > and from the irq handler (to perform a flush to resume address
>>      > translation), I could use a tasklet to schedule 
>> ipmmu_tlb_invalidate()
>>      > from the irq handler then. This way we would get this 
>> serialized. What
>>      > do you think?
>>
>>     I am afraid a tasklet is not an option. You need to perform the TLB
>>     flush when requested otherwise you are introducing a security issue.
>>
>>     This is because as soon as a region is unmapped in the page 
>> table, we
>>     remove the drop the reference on any page backing that region. 
>> When the
>>     reference is dropped to zero, the page can be reallocated to another
>>     domain or even Xen. If the TLB flush happen after, then the guest 
>> may
>>     still be able to access the page for a short time if the 
>> translation has
>>     been cached by the any TLB (IOMMU, Processor).
>>
>>
>>
>> I understand this. I am not proposing to delay a requested by P2M 
>> code TLB flush in any case. I just propose to issue TLB flush (which 
>> we have to perform in case of page faults, to resolve error condition 
>> and resume translations) from a tasklet rather than from interrupt 
>> handler directly. This is the TLB flush I am speaking about:
>>
>> https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598 
>>
>>
>> Sorry if I was unclear.
>
> My mistake, I misread what you wrote.
>
> I found the flush in the renesas-bsp and not Linux upstream but it is 
> not clear why this is actually required. You are not fixing any 
> translation error. So what this flush will do?
>
> Regarding the placement of the flush, then if you execute in a tasklet 
> it will likely be done later on when the IRQ has been acknowledge. 
> What's the implication to delay it?


It looks like there is no need to put this flush into a tasklet. As I
understand from Shimoda-san's answer, it is OK to call the flush here.

So my worry about calling ipmmu_tlb_invalidate() directly from the
interrupt handler is no longer relevant.
----------
This is my understanding of the flush's purpose here. This code just
follows the TRM, no more no less, which mentions the need to flush the TLB
after clearing the error status register and updating the page table
entries (which, I assume, means resolving the reason why the
translation/page fault error actually happened) in order to resume the
address translation request.

But, with one remark: as you have already noted, we are not trying to
handle/fix this fault (update page table entries). The driver doesn't
manage the page table and is not aware of what the page table is. What is
more, it is unclear what would actually need fixing in a page table that
is the CPU page table at the same time.

I have heard there is a break-before-make sequence when updating the page
table. So, if a device in a domain is issuing DMA somewhere in the middle
of a page table update, theoretically we might hit this fault. In that
case the page table is correct and we don't need to fix anything... To be
honest, I have never seen a fault caused by the break-before-make sequence.
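
For reference, the break-before-make sequence being described, as a
conceptual sketch (hypothetical helpers, not the real P2M code):

static void p2m_replace_entry(struct domain *d, uint64_t *entry,
                              uint64_t new_pte)
{
    write_entry(entry, 0);        /* 1. break: invalidate the old entry */
    flush_domain_tlbs(d);         /* 2. flush stale translations        */
    write_entry(entry, new_pte);  /* 3. make: install the new entry     */
}

A device DMA hitting the affected range between steps 1 and 3 faults even
though the final page table contents are correct.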

>
> Cheers,
>
Oleksandr Tyshchenko Aug. 8, 2019, 12:28 p.m. UTC | #8
Hi, Julien.


>> I am afraid, I don't know a correct answer for this question. I would 
>> leave this to maintainers.
>>
>> I just followed sample copyright notice for GPL v2 License according 
>> to the document:
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=CONTRIBUTING
>
> The file CONTRIBUTING is only giving example of common example of 
> license. So I think this is fine to use SPDX, the more they are 
> already used. The only request is to put either SDPX or the full-blown 
> text but not the two :). Lars, any objection?
>
> I am quite in favor of SPDX because it is easier to find out the 
> license. With the full-blown text, the text may slightly vary between 
> licenses. For instance, the only difference between GPLv2 and GPLv2+ 
> is ",or (at your option) any later version". I let you imagine how it 
> can be easy to miss it when reviewing ;).
>
> We had a discussion last year about using SPDX in Xen code base but I 
> never got the time to formally suggest it.

I tried to locate files in Xen where SPDX is used. After finding only
nospec.h, I got the (incorrect) impression that it is not popular in Xen))


Just to clarify: the header for the driver should then be the following
(if there are no objections):

// SPDX-License-Identifier: GPL-2.0
/*
 * xen/drivers/passthrough/arm/ipmmu-vmsa.c
 *
 * Driver for the Renesas IPMMU-VMSA found in R-Car Gen3 SoCs.
 *
 * Copyright (C) 2014-2019 Renesas Electronics Corporation
 *
 * The IPMMU-VMSA is a VMSA-compatible I/O Memory Management Unit (IOMMU)
 * which provides address translation and access protection functionalities
 * to processing units and interconnect networks.
 *
 * Please note, the current driver is supposed to work only with the newest
 * R-Car Gen3 SoC revisions, whose IPMMU hardware supports the stage 2
 * translation table format and is able to use the CPU's P2M table as is.
 *
 * Based on Linux's IPMMU-VMSA driver from the Renesas BSP:
 *    drivers/iommu/ipmmu-vmsa.c
 * which you can find at:
 *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
 *    branch: v4.14.75-ltsi/rcar-3.9.6
 *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
 * and Xen's SMMU driver:
 *    xen/drivers/passthrough/arm/smmu.c
 *
 * Copyright (C) 2016-2019 EPAM Systems Inc.
 */


Answering myself:

It looks like I have to do the same for all the newly added files in this
series (iommu_fwspec, etc).


>>>> +    /*
>>>> +     * Destroy Root IPMMU domain which context is mapped to this 
>>>> Xen domain
>>>> +     * if exits.
>>>> +     */
>>>> +    if ( xen_domain->root_domain )
>>>> +        ipmmu_free_root_domain(xen_domain->root_domain);
>>>> +
>>>> +    spin_unlock(&xen_domain->lock);
>>>> +
>>>> +    /*
>>>> +     * We assume that all master devices have already been 
>>>> detached from
>>>> +     * this Xen domain and there must be no associated Cache IPMMU 
>>>> domains
>>>> +     * in use.
>>>> +     */
>>>> +    ASSERT(list_empty(&xen_domain->cache_domains));
>>> I think this should be in the spin lock held by &xen_domain->lock.
>>
>> OK. Will put spin_unlock after it.
>
> The spin_lock is actually pointless here. This is done when the domain 
> is destroyed, so nobody should touch it.
>
> If you think concurrent access can still happen, then you are going to 
> be in deep trouble as you free the xen_domain (and therefore the 
> spinlock) below :).

Indeed, this is pointless. We don't really expect any other operations 
with the domain which is being destroyed. No assign/deassign devices, no 
flush, no map, nothing...


>
>
Julien Grall Aug. 8, 2019, 12:44 p.m. UTC | #9
Hi,

Removing Lars, as there is no need to spam him with the technical discussion :)

On 08/08/2019 11:14, Oleksandr wrote:
> 
> 
>> Hi,
> 
> Hi, Julien.
> 
> 
>>
>>> Sorry for the possible format issues.
>>>
>>>
>>>      > No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I
>>>      > think, there won't be a deadlock.
>>>      >
>>>      > Or I really missed something?
>>>      >
>>>      > If we worry about ipmmu_tlb_invalidate() which is called here (to
>>>      > perform a flush by request from P2M code, which manages a page table)
>>>      > and from the irq handler (to perform a flush to resume address
>>>      > translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
>>>      > from the irq handler then. This way we would get this serialized. What
>>>      > do you think?
>>>
>>>     I am afraid a tasklet is not an option. You need to perform the TLB
>>>     flush when requested otherwise you are introducing a security issue.
>>>
>>>     This is because as soon as a region is unmapped in the page table, we
>>>     remove the drop the reference on any page backing that region. When the
>>>     reference is dropped to zero, the page can be reallocated to another
>>>     domain or even Xen. If the TLB flush happen after, then the guest may
>>>     still be able to access the page for a short time if the translation has
>>>     been cached by the any TLB (IOMMU, Processor).
>>>
>>>
>>>
>>> I understand this. I am not proposing to delay a requested by P2M code TLB 
>>> flush in any case. I just propose to issue TLB flush (which we have to 
>>> perform in case of page faults, to resolve error condition and resume 
>>> translations) from a tasklet rather than from interrupt handler directly. 
>>> This is the TLB flush I am speaking about:
>>>
>>> https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598 
>>>
>>>
>>> Sorry if I was unclear.
>>
>> My mistake, I misread what you wrote.
>>
>> I found the flush in the renesas-bsp and not Linux upstream but it is not 
>> clear why this is actually required. You are not fixing any translation error. 
>> So what this flush will do?
>>
>> Regarding the placement of the flush, then if you execute in a tasklet it will 
>> likely be done later on when the IRQ has been acknowledge. What's the 
>> implication to delay it?
> 
> 
> Looks like, there is no need to put this flush into a tasklet. As I understand 
> from Shimoda-san's answer it is OK to call flush here.
> 
> So, my worry about calling ipmmu_tlb_invalidate() directly from the interrupt 
> handler is not actual anymore.
> ----------
> This is my understanding regarding the flush purpose here. This code, just 
> follows the TRM, no more no less,
> which mentions about a need to flush TLB after clearing error status register 
> and updating a page table entries (which, I assume, means to resolve a reason 
> why translation/page fault error actually have happened) to resume address 
> translation request.

Well, I don't have the TRM... so my point of reference is Linux. Why does 
upstream not do the TLB flush?

I have been told this is an errata on the IPMMU. Is it related to the series 
posted on linux-iommu [1]?

> 
> But, with one remark, as you have already noted, we are not trying to handle/fix 
> this fault (update page table entries), driver doesn't manage page table and is 
> not aware what the page table is. What is more, it is unclear what actually need 
> to be fixed in the page table which is a CPU page table as the same time.
> 
> I have heard there is a break-before-make sequence when updating the page table. 
> So, if device in a domain is issuing DMA somewhere in the middle of updating a 
> page table, theoretically, we might hit into this fault. In this case the page 
> table is correct and we don't need to fix anything...   Being honest, I have 
> never seen a fault caused by break-before-make sequence.

Ok, so it looks like you are trying to fix [1]. My first concern here is
that there are no grounds for someone without access to the TRM to
understand why this is done.

Furthermore, AFAICT, the patch series never reached upstream. So is it
present on all revisions of Gen3?

Cheers,

[1] 
https://lore.kernel.org/linux-iommu/1485348842-23712-1-git-send-email-yoshihiro.shimoda.uh@renesas.com/
Lars Kurth Aug. 8, 2019, 2:23 p.m. UTC | #10
On 07/08/2019, 20:15, "Julien Grall" <julien.grall@arm.com> wrote:

    (+ Lars)
    
    Hi,
    
    On 8/7/19 5:01 PM, Oleksandr wrote:
    >>> + * you can found at:
    >>> + *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
    >>> + *    branch: v4.14.75-ltsi/rcar-3.9.6
    >>> + *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
    >>> + * and Xen's SMMU driver:
    >>> + *    xen/drivers/passthrough/arm/smmu.c
    >>> + *
    >>> + * Copyright (C) 2016-2019 EPAM Systems Inc.
    >>> + *
    >>> + * This program is free software; you can redistribute it and/or
    >>> + * modify it under the terms and conditions of the GNU General Public
    >>> + * License, version 2, as published by the Free Software Foundation.
    >>> + *
    >>> + * This program is distributed in the hope that it will be useful,
    >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    >>> + * General Public License for more details.
    >>> + *
    >>> + * You should have received a copy of the GNU General Public
    >>> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
    >> I don't know that Xen license description rule, but since a few source 
    >> files have
    >> SPDX-License-Identifier, can we also use it on the driver?
    > 
    > I am afraid, I don't know a correct answer for this question. I would 
    > leave this to maintainers.
    > 
    > I just followed sample copyright notice for GPL v2 License according to 
    > the document:
    > 
    > http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=CONTRIBUTING
    
    The file CONTRIBUTING is only giving example of common example of 
    license. So I think this is fine to use SPDX, the more they are already 
    used. The only request is to put either SDPX or the full-blown text but 
    not the two :). Lars, any objection?
    
    I am quite in favor of SPDX because it is easier to find out the 
    license. With the full-blown text, the text may slightly vary between 
    licenses. For instance, the only difference between GPLv2 and GPLv2+ is 
    ",or (at your option) any later version". I let you imagine how it can 
    be easy to miss it when reviewing ;).
    
    We had a discussion last year about using SPDX in Xen code base but I 
    never got the time to formally suggest it.
    
I did not push it either. 

In the past one of the committers had major objections against SPDX, but after a conversation last year and changes to the latest version of SPDX he dropped these.

The only remaining objection was to have both SPDX identifier AND a license in the same file. The argument against it is: what does it mean if they contradict each other? To be fair that is a valid concern.

I am not sure it is a good idea to introduce SPDX piecemeal. It would be much better to
a) agree on it
b) transform the codebase using a tool
rather than introducing it piecemeal.

Lars
Oleksandr Tyshchenko Aug. 8, 2019, 3:04 p.m. UTC | #11
Hi

>>>
>>>> Sorry for the possible format issues.
>>>>
>>>>
>>>>      > No, it is not disabled. But, ipmmu_irq() uses another 
>>>> mmu->lock. So, I
>>>>      > think, there won't be a deadlock.
>>>>      >
>>>>      > Or I really missed something?
>>>>      >
>>>>      > If we worry about ipmmu_tlb_invalidate() which is called 
>>>> here (to
>>>>      > perform a flush by request from P2M code, which manages a 
>>>> page table)
>>>>      > and from the irq handler (to perform a flush to resume address
>>>>      > translation), I could use a tasklet to schedule 
>>>> ipmmu_tlb_invalidate()
>>>>      > from the irq handler then. This way we would get this 
>>>> serialized. What
>>>>      > do you think?
>>>>
>>>>     I am afraid a tasklet is not an option. You need to perform the 
>>>> TLB
>>>>     flush when requested otherwise you are introducing a security 
>>>> issue.
>>>>
>>>>     This is because as soon as a region is unmapped in the page 
>>>> table, we
>>>>     remove the drop the reference on any page backing that region. 
>>>> When the
>>>>     reference is dropped to zero, the page can be reallocated to 
>>>> another
>>>>     domain or even Xen. If the TLB flush happen after, then the 
>>>> guest may
>>>>     still be able to access the page for a short time if the 
>>>> translation has
>>>>     been cached by the any TLB (IOMMU, Processor).
>>>>
>>>>
>>>>
>>>> I understand this. I am not proposing to delay a requested by P2M 
>>>> code TLB flush in any case. I just propose to issue TLB flush 
>>>> (which we have to perform in case of page faults, to resolve error 
>>>> condition and resume translations) from a tasklet rather than from 
>>>> interrupt handler directly. This is the TLB flush I am speaking about:
>>>>
>>>> https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598 
>>>>
>>>>
>>>> Sorry if I was unclear.
>>>
>>> My mistake, I misread what you wrote.
>>>
>>> I found the flush in the renesas-bsp and not Linux upstream but it 
>>> is not clear why this is actually required. You are not fixing any 
>>> translation error. So what this flush will do?
>>>
>>> Regarding the placement of the flush, then if you execute in a 
>>> tasklet it will likely be done later on when the IRQ has been 
>>> acknowledge. What's the implication to delay it?
>>
>>
>> Looks like, there is no need to put this flush into a tasklet. As I 
>> understand from Shimoda-san's answer it is OK to call flush here.
>>
>> So, my worry about calling ipmmu_tlb_invalidate() directly from the 
>> interrupt handler is not actual anymore.
>> ----------
>> This is my understanding regarding the flush purpose here. This code, 
>> just follows the TRM, no more no less,
>> which mentions about a need to flush TLB after clearing error status 
>> register and updating a page table entries (which, I assume, means to 
>> resolve a reason why translation/page fault error actually have 
>> happened) to resume address translation request.
>
> Well, I don't have the TRM... so my point of reference is Linux. Why 
> does upstream not do the TLB flush?

I have no idea regarding that.


>
>
> I have been told this is an errata on the IPMMU. Is it related to the 
> series posted on linux-iommu [1]?

I don't think the TLB flush we are speaking about is related to that
series [1] in any way. This TLB flush, I think, is just the last step in
the sequence of actions which should be performed when the error occurs,
no more no less. This is how I understand it.


>
>>
>> But, with one remark, as you have already noted, we are not trying to 
>> handle/fix this fault (update page table entries), driver doesn't 
>> manage page table and is not aware what the page table is. What is 
>> more, it is unclear what actually need to be fixed in the page table 
>> which is a CPU page table as the same time.
>>
>> I have heard there is a break-before-make sequence when updating the 
>> page table. So, if device in a domain is issuing DMA somewhere in the 
>> middle of updating a page table, theoretically, we might hit into 
>> this fault. In this case the page table is correct and we don't need 
>> to fix anything...   Being honest, I have never seen a fault caused 
>> by break-before-make sequence.
>
> Ok, so it looks like you are trying to fix [1]. My first concern here 
> is there are no ground for someone without access to the TRM why this 
> is done.

No, I am definitely not trying to fix [1]. I just follow the BSP driver
this one is based on, which in turn follows the TRM. I can extend the
comment in the code before calling ipmmu_tlb_invalidate().


>
> Furthermore, AFAICT, the patch series never reached upstream. So is it 
> present on all revision of GEN3?

I think that the newest SoC revisions (ES 3.0), which are the only ones
this driver is supposed to support, are *not* affected by that errata and
do *not* require such a workaround.
Julien Grall Aug. 8, 2019, 5:16 p.m. UTC | #12
On 08/08/2019 16:04, Oleksandr wrote:
> 
> Hi
> 
>>>>
>>>>> Sorry for the possible format issues.
>>>>>
>>>>>
>>>>>      > No, it is not disabled. But, ipmmu_irq() uses another mmu->lock. So, I
>>>>>      > think, there won't be a deadlock.
>>>>>      >
>>>>>      > Or I really missed something?
>>>>>      >
>>>>>      > If we worry about ipmmu_tlb_invalidate() which is called here (to
>>>>>      > perform a flush by request from P2M code, which manages a page table)
>>>>>      > and from the irq handler (to perform a flush to resume address
>>>>>      > translation), I could use a tasklet to schedule ipmmu_tlb_invalidate()
>>>>>      > from the irq handler then. This way we would get this serialized. What
>>>>>      > do you think?
>>>>>
>>>>>     I am afraid a tasklet is not an option. You need to perform the TLB
>>>>>     flush when requested otherwise you are introducing a security issue.
>>>>>
>>>>>     This is because as soon as a region is unmapped in the page table, we
>>>>>     remove the drop the reference on any page backing that region. When the
>>>>>     reference is dropped to zero, the page can be reallocated to another
>>>>>     domain or even Xen. If the TLB flush happen after, then the guest may
>>>>>     still be able to access the page for a short time if the translation has
>>>>>     been cached by the any TLB (IOMMU, Processor).
>>>>>
>>>>>
>>>>>
>>>>> I understand this. I am not proposing to delay a requested by P2M code TLB 
>>>>> flush in any case. I just propose to issue TLB flush (which we have to 
>>>>> perform in case of page faults, to resolve error condition and resume 
>>>>> translations) from a tasklet rather than from interrupt handler directly. 
>>>>> This is the TLB flush I am speaking about:
>>>>>
>>>>> https://github.com/otyshchenko1/xen/blob/ipmmu_upstream2/xen/drivers/passthrough/arm/ipmmu-vmsa.c#L598 
>>>>>
>>>>>
>>>>> Sorry if I was unclear.
>>>>
>>>> My mistake, I misread what you wrote.
>>>>
>>>> I found the flush in the renesas-bsp and not Linux upstream but it is not 
>>>> clear why this is actually required. You are not fixing any translation 
>>>> error. So what this flush will do?
>>>>
>>>> Regarding the placement of the flush, then if you execute in a tasklet it 
>>>> will likely be done later on when the IRQ has been acknowledge. What's the 
>>>> implication to delay it?
>>>
>>>
>>> Looks like, there is no need to put this flush into a tasklet. As I 
>>> understand from Shimoda-san's answer it is OK to call flush here.
>>>
>>> So, my worry about calling ipmmu_tlb_invalidate() directly from the interrupt 
>>> handler is not actual anymore.
>>> ----------
>>> This is my understanding regarding the flush purpose here. This code, just 
>>> follows the TRM, no more no less,
>>> which mentions about a need to flush TLB after clearing error status register 
>>> and updating a page table entries (which, I assume, means to resolve a reason 
>>> why translation/page fault error actually have happened) to resume address 
>>> translation request.
>>
>> Well, I don't have the TRM... so my point of reference is Linux. Why does 
>> upstream not do the TLB flush?
> 
> I have no idea regarding that.
>
> 
>>
>>
>> I have been told this is an errata on the IPMMU. Is it related to the series 
>> posted on linux-iommu [1]?
> 
> I don't think the TLB flush we are speaking about is related to that series
> [1] somehow. This TLB flush, I think, is just the last step in a sequence of
> actions which should be performed when the error occurs, no more no less. This
> is how I understand it.

If you have to flush the TLBs in the IRQ context then something has gone really 
wrong.

I don't deny that Break-Before-Make is an issue. However, it is handled
correctly in the P2M code. You should only get there because there is no mapping
in the TLBs for the address accessed. So flushing the TLBs should be
unnecessary, unless your TLB is also caching invalid entries?

>>
>>>
>>> But, with one remark: as you have already noted, we are not trying to
>>> handle/fix this fault (update page table entries); the driver doesn't manage
>>> the page table and is not aware what the page table is. What is more, it is
>>> unclear what actually needs to be fixed in a page table which is the CPU
>>> page table at the same time.
>>>
>>> I have heard there is a break-before-make sequence when updating the page
>>> table. So, if a device in a domain is issuing DMA somewhere in the middle of
>>> updating a page table, theoretically, we might hit this fault. In this case
>>> the page table is correct and we don't need to fix anything... Being honest,
>>> I have never seen a fault caused by a break-before-make sequence.
>>
>> Ok, so it looks like you are trying to fix [1]. My first concern here is that
>> there is no ground for someone without access to the TRM to understand why
>> this is done.
> 
> No, I am definitely not trying to fix [1]. I just follow the BSP driver I am 
> based on, which in turn follows the TRM. I can extend a comment in the code 
> before calling ipmmu_tlb_invalidate().

The fact that the code is in the BSP and not in Linux is worrying me. The commit
message in the BSP is quite unhelpful for determining the exact reason.

It either means Linux rejected the patch or it was never submitted. Either way,
we should understand why there is such a discrepancy.

Cheers,
Oleksandr Tyshchenko Aug. 8, 2019, 7:29 p.m. UTC | #13
Hi, Julien.


[...]
>
> If you have to flush the TLBs in the IRQ context then something has 
> gone really wrong.
>
> I don't deny that Break-Before-Make is an issue. However, it is handled
> correctly in the P2M code. You should only get there because there is no mapping
> in the TLBs for the address accessed. So flushing the TLBs should be
> unnecessary, unless your TLB is also caching invalid entries?

Sorry, I don't quite understand why we need to worry about this flush too much
for a case which won't occur under normal conditions (if everything is correct).
Why can't we just consider this flush as a required action, which is needed to
exit from the error state and resume the stopped address translation request?
The same required action as "clearing error status flags" before. We are not
trying to understand why it is so necessary to clear the error flags when an
error happens, or whether we can end up without clearing them, for example. We
just follow what is described in the document. The same, I think, holds for that
flush: if it is described, then it should be followed. Looks like this flush
acts as a trigger to unblock the stopped transaction in that particular case.
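
For reference, the sequence I mean, expressed with the helpers from the patch
below (only a sketch of the order of operations described above, not new code):

    /* In the fault handler, after the error address has been read: */
    ipmmu_ctx_write_root(domain, IMSTR, 0); /* clear the error status flags */
    /* ... the page table entries are assumed to be correct by this point ... */
    ipmmu_tlb_invalidate(domain);           /* flush to resume the stalled
                                               address translation request */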

Different H/W could have different restoring sequences. Some H/W 
requires just clearing error status, other H/W requires full 
re-initialization in a specific order to recover from the error state.

Please correct me if I am wrong.

[...]
>
> The fact that the code is in the BSP and not in Linux is worrying me. The
> commit message in the BSP is quite unhelpful for determining the exact reason.
>
> It either means Linux rejected the patch or it was never submitted. Either way,
> we should understand why there is such a discrepancy.

I failed to find anything similar on the ML. So, probably, it was never
submitted. I hope we will be able to clarify the reason.
Julien Grall Aug. 8, 2019, 8:32 p.m. UTC | #14
Hi Oleksandr,

On 8/8/19 8:29 PM, Oleksandr wrote:
[...]
> 
> Sorry, I don't quite understand why we need to worry about this flush
> too much for a case which won't occur under normal conditions (if everything
> is correct). Why can't we just consider this flush as a required action,

A translation error can be easy to reach. For instance, if the guest does not
program the device correctly and requests access to an address that is not
mapped.

> which is needed to exit from the error state and resume the stopped address
> translation request? The same required action as "clearing error status
> flags" before. We are not trying to understand why it is so necessary
> to clear the error flags when an error happens, or whether we can end up
> without clearing them, for example. We just follow what is described in the
> document. The same, I think, holds for that flush: if it is described, then it
> should be followed. Looks like this flush acts as a trigger to unblock the
> stopped transaction in that particular case.

What will actually happen if the transaction fails again? For instance,
if the IOVA was not mapped. Will you receive the interrupt again?
If so, are you going to make the flush again and again until the guest
is killed?

> 
> Different H/W could have different restoring sequences. Some H/W 
> requires just clearing error status, other H/W requires full 
> re-initialization in a specific order to recover from the error state.
> 
> Please correct me if I am wrong.

I am not confident accepting any code that I don't understand or don't find
sensible. As I pointed out in my previous e-mail, this hasn't reached upstream,
so something looks quite fishy here.

Cheers,
Oleksandr Tyshchenko Aug. 8, 2019, 11:32 p.m. UTC | #15
Hi Julien.

Sorry for the possible format issues.

Thu, 8 Aug 2019, 23:32 Julien Grall <julien.grall@arm.com>:

> Hi Oleksandr,
>
> On 8/8/19 8:29 PM, Oleksandr wrote:
> [...]
> >
> > Sorry, I don't quite understand why we need to worry about this flush
> > too much for a case which won't occur under normal conditions (if everything
> > is correct). Why can't we just consider this flush as a required action,
>
> A translation error can be easy to reach. For instance, if the guest does
> not program the device correctly and requests access to an address that
> is not mapped.
>

Yes, I understand these bits. But I wrote that the error wouldn't occur under
normal conditions (if everything was correct).



>
>
> > which is needed to exit from the error state and resume the stopped address
> > translation request? The same required action as "clearing error status
> > flags" before. We are not trying to understand why it is so necessary
> > to clear the error flags when an error happens, or whether we can end up
> > without clearing them, for example. We just follow what is described in the
> > document. The same, I think, holds for that flush: if it is described, then
> > it should be followed. Looks like this flush acts as a trigger to unblock
> > the stopped transaction in that particular case.
>
> What will actually happen if the transaction fails again? For instance,
> if the IOVA was not mapped. Will you receive the interrupt again?
> If so, are you going to make the flush again and again until the guest
> is killed?
>

This is a good question. I think, if the address is not mapped, the transaction
will fail again and we will get the interrupt again. Not sure whether until the
guest is killed or until the driver in the guest detects a timeout and cancels
the DMA. Let's consider the worst case: until the guest is killed.

So my questions are: what do you think would be the proper driver behavior in
that case? Do nothing and don't even try to resolve the error condition/unblock
translation at the first page fault, or give it a few attempts, or unblock every
time? How does the SMMU driver act in such a situation?

Quite clearly, if we get a fault, then the address is not mapped. I think it can
be both: by issuing a wrong address (buggy driver, malicious driver) or by a
race (unlikely). If this is a real race (a device hits break-before-make, for
example), we could give it another attempt. Looks like we need some mechanism
to report the faulted address to the P2M code (which manages the page table)
to analyze? Or is it not worth doing that?


> >
> > Different H/W could have different restoring sequences. Some H/W
> > requires just clearing error status, other H/W requires full
> > re-initialization in a specific order to recover from the error state.
> >
> > Please correct me if I am wrong.
>
> I am not confident accepting any code that I don't understand or don't
> find sensible. As I pointed out in my previous e-mail, this hasn't
> reached upstream, so something looks quite fishy here.
>
>
As I answered in the previous e-mail, I hope we will be able to clarify the
reason why this hasn't reached upstream.

Julien Grall Aug. 9, 2019, 9:56 a.m. UTC | #16
(+ Robin)

On 09/08/2019 00:32, Oleksandr Tyshchenko wrote:
> Hi Julien.

Hi,

> 
> Sorry for the possible format issues.
> 
Thu, 8 Aug 2019, 23:32 Julien Grall <julien.grall@arm.com>:
> 
>     Hi Oleksandr,
> 
>     On 8/8/19 8:29 PM, Oleksandr wrote:
> [...]
>      >
>      > Sorry, I don't quite understand why we need to worry about this flush
>      > too much for a case which won't occur under normal conditions (if
>      > everything is correct). Why can't we just consider this flush as a
>      > required action,
> 
>     A translation error can be easy to reach. For instance, if the guest does
>     not program the device correctly and requests access to an address that
>     is not mapped.
> 
> 
> Yes, I understand these bits. But I wrote that the error wouldn't occur under
> normal conditions (if everything was correct).

I don't understand your point here. Whether this is an error path or the
correct path, we should be able to understand the reason behind it. Otherwise,
the error path would become the wild west...

>
>      > which is needed to exit from the error state and resume the stopped
>      > address translation request? The same required action as "clearing
>      > error status flags" before. We are not trying to understand why it is
>      > so necessary to clear the error flags when an error happens, or whether
>      > we can end up without clearing them, for example. We just follow what is
>      > described in the document. The same, I think, holds for that flush: if
>      > it is described, then it should be followed. Looks like this flush acts
>      > as a trigger to unblock the stopped transaction in that particular case.
> 
>     What will actually happen if the transaction fails again? For instance,
>     if the IOVA was not mapped. Will you receive the interrupt again?
>     If so, are you going to make the flush again and again until the guest
>     is killed?
> 
> 
> This is a good question. I think, if the address is not mapped, the transaction
> will fail again and we will get the interrupt again. Not sure whether until the
> guest is killed or until the driver in the guest detects a timeout and cancels
> the DMA. Let's consider the worst case: until the guest is killed.
>
> So my questions are: what do you think would be the proper driver behavior in
> that case? Do nothing and don't even try to resolve the error condition/unblock
> translation at the first page fault, or give it a few attempts, or unblock every
> time.

I will answer back with a question here. How is the TLB flush going to unblock
anything? Moreover, you are not fixing any error condition here... And the print
"Unhandled fault" just afterwards clearly leads one to think that there is very
little chance the fault has been resolved.

> How does the SMMU driver act in such a situation?

I have CCed Robin, who knows the SMMU driver better than me. Though it is the
Linux one, Xen's is based on it.

From my understanding, it is implementation defined whether the SMMU supports
stalling a transaction on fault. AFAICT, the current Xen driver will just
terminate the transaction and therefore the client transaction behaves as
RAZ/WI.

> 
> Quite clearly, if we get a fault, then the address is not mapped. I think it
> can be both: by issuing a wrong address (buggy driver, malicious driver) or by
> a race (unlikely). If this is a real race (a device hits break-before-make,
> for example), we could give it another attempt. Looks like we need some
> mechanism to report the faulted address to the P2M code (which manages the
> page table) to analyze? Or is it not worth doing that?

You seem to speak about break-before-make as if it were an error.
Break-Before-Make is just a sequence to prevent the TLB walker from caching both
the old and new mapping at the same time. At a given point the IOVA translation
can only be:
    1) The old physical address
    2) No address -> results in a fault
    3) The new physical address

1) and 3) should not result in a fault. 2) will result in a fault, but then the
TLB should not cache the invalid entry, right?

In order to see 2), we always flush the TLBs after removing the old physical
address.

Unfortunately, as some of the IOMMUs are not able to restart transactions, Xen
currently avoids flushing the TLBs after 2). So you may be able to see both
mappings at the same time.
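
To put that sequence in code form (a minimal sketch with made-up helper names,
not the actual P2M implementation):

    /* Break-Before-Make: replace the mapping of one IOVA. */
    write_pte(entry, invalid_pte);  /* 1) break: remove the old mapping */
    flush_tlb_entry(iova);          /* no TLB may keep caching the old mapping */
    /* A walk at this point faults (case 2), but the TLB must not cache
       the invalid entry. */
    write_pte(entry, new_pte);      /* 3) make: publish the new mapping */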

Looking at your driver, I believe you would have the flag IMSTR.MHIT (multiple
TLB hits) set, because this is the condition we are trying to prevent with
break-before-make. The comment in the code leads one to think this is a fatal
error, so I am not sure why you would recover here...

If your IOMMU is able to stall transactions, then it would be best if we
properly handled break-before-make with it.

Overall, it feels to me the TLB flush is here for a different reason.

> 
> 
>      >
>      > Different H/W could have different restoring sequences. Some H/W
>      > requires just clearing error status, other H/W requires full
>      > re-initialization in a specific order to recover from the error state.
>      >
>      > Please correct me if I am wrong.
> 
>     I am not confident accepting any code that I don't understand or don't
>     find sensible. As I pointed out in my previous e-mail, this hasn't
>     reached upstream, so something looks quite fishy here.
> 
> 
> As I answered in the previous e-mail, I hope we will be able to clarify the
> reason why this hasn't reached upstream.

Thank you.

Cheers,
Oleksandr Tyshchenko Aug. 9, 2019, 6:38 p.m. UTC | #17
Hi, Julien


>>
>>     What will actually happen if the transaction fails again? For instance,
>>     if the IOVA was not mapped. Will you receive the interrupt again?
>>     If so, are you going to make the flush again and again until the guest
>>     is killed?
>>
>>
>> This is a good question. I think, if the address is not mapped, the
>> transaction will fail again and we will get the interrupt again. Not sure
>> whether until the guest is killed or until the driver in the guest detects
>> a timeout and cancels the DMA. Let's consider the worst case: until the
>> guest is killed.
>>
>> So my questions are: what do you think would be the proper driver behavior
>> in that case? Do nothing and don't even try to resolve the error
>> condition/unblock translation at the first page fault, or give it a few
>> attempts, or unblock every time?
>
> I will answer back with a question here. How is the TLB flush going to unblock
> anything? Moreover, you are not fixing any error condition here... And the
> print "Unhandled fault" just afterwards clearly leads one to think that there
> is very little chance the fault has been resolved.

Now I understand your point. This really makes sense.


>
>> How does the SMMU driver act in such a situation?
>
> I have CCed Robin, who knows the SMMU driver better than me. Though it
> is the Linux one, Xen's is based on it.
>
> From my understanding, it is implementation defined whether the SMMU
> supports stalling a transaction on fault. AFAICT, the current Xen
> driver will just terminate the transaction and therefore the client
> transaction behaves as RAZ/WI.

I got it. So, it sounds like the client won't be able to do something bad,
and we won't receive an interrupt storm here in Xen.


>
>
>>
>> Quite clearly, if we get a fault, then the address is not mapped. I think
>> it can be both: by issuing a wrong address (buggy driver, malicious
>> driver) or by a race (unlikely). If this is a real race (a device hits
>> break-before-make, for example), we could give it another attempt.
>> Looks like we need some mechanism to report the faulted address to the
>> P2M code (which manages the page table) to analyze? Or is it not
>> worth doing that?
>
> You seem to speak about break-before-make as if it were an error.
> Break-Before-Make is just a sequence to prevent the TLB walker from
> caching both the old and new mapping at the same time. At a given point the
> IOVA translation can only be:
>    1) The old physical address
>    2) No address -> results in a fault
>    3) The new physical address
>
> 1) and 3) should not result in a fault. 2) will result in a fault, but
> then the TLB should not cache the invalid entry, right?

right.


>
> In order to see 2), we always flush the TLBs after removing the old
> physical address.
>
> Unfortunately, as some of the IOMMUs are not able to restart
> transactions, Xen currently avoids flushing the TLBs after 2). So you
> may be able to see both mappings at the same time.
>
> Looking at your driver, I believe you would have the flag IMSTR.MHIT
> (multiple TLB hits) set, because this is the condition we are trying to
> prevent with break-before-make. The comment in the code leads one to think
> this is a fatal error, so I am not sure why you would recover here...
>
> If your IOMMU is able to stall transactions, then it would be best if
> we properly handled break-before-make with it.

Thank you for the detailed answer. I would like to say that I have never seen
a Multiple TLB hits error raised by the IPMMU in Xen.


>
> Overall, it feels to me the TLB flush is here for a different reason.


I will drop this TLB flush from the interrupt handler until it is clarified.
Julien Grall Aug. 14, 2019, 5:38 p.m. UTC | #18
Hi Oleksandr,

On 02/08/2019 17:39, Oleksandr Tyshchenko wrote:
> +static int ipmmu_iommu_domain_init(struct domain *d)
> +{
> +    struct ipmmu_vmsa_xen_domain *xen_domain;
> +
> +    xen_domain = xzalloc(struct ipmmu_vmsa_xen_domain);
> +    if ( !xen_domain )
> +        return -ENOMEM;
> +
> +    spin_lock_init(&xen_domain->lock);
> +    INIT_LIST_HEAD(&xen_domain->cache_domains);
> +    /*
> +     * We don't create Root IPMMU domain here, it will be created on demand
> +     * only, when attaching the first master device to this Xen domain in
> +     * ipmmu_assign_device().
> +     * xen_domain->root_domain = NULL;
> +    */
> +
> +    dom_iommu(d)->arch.priv = xen_domain;

While looking at other parts of Xen, I realized you don't set
IOMMU_FEAT_COHERENT_WALK. Does it mean the IOMMU walker does not support
coherent walk (i.e. snooping the cache)?

Note that when this feature is not set, the p2m code will need to clean each
P2M entry when it is updated. So if the IPMMU supports coherent walk, I would
strongly suggest setting the flag :).
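
Roughly, every P2M entry update then has to pay for an extra cache clean (a
sketch modeled on the p2m code; exact helper names may differ):

    static inline void p2m_write_pte(lpae_t *p, lpae_t pte, bool clean_pte)
    {
        write_pte(p, pte);      /* update the entry in memory */
        if ( clean_pte )        /* true when the IOMMU walker cannot snoop */
            clean_dcache(*p);   /* make the entry visible to the table walker */
    }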

Cheers,
Oleksandr Tyshchenko Aug. 14, 2019, 6:45 p.m. UTC | #19
On 14.08.19 20:38, Julien Grall wrote:
> Hi Oleksandr,

Hi Julien.


>
> On 02/08/2019 17:39, Oleksandr Tyshchenko wrote:
>> +static int ipmmu_iommu_domain_init(struct domain *d)
>> +{
>> +    struct ipmmu_vmsa_xen_domain *xen_domain;
>> +
>> +    xen_domain = xzalloc(struct ipmmu_vmsa_xen_domain);
>> +    if ( !xen_domain )
>> +        return -ENOMEM;
>> +
>> +    spin_lock_init(&xen_domain->lock);
>> +    INIT_LIST_HEAD(&xen_domain->cache_domains);
>> +    /*
>> +     * We don't create Root IPMMU domain here, it will be created on 
>> demand
>> +     * only, when attaching the first master device to this Xen 
>> domain in
>> +     * ipmmu_assign_device().
>> +     * xen_domain->root_domain = NULL;
>> +    */
>> +
>> +    dom_iommu(d)->arch.priv = xen_domain;
>
> While looking at other parts of Xen, I realized you don't set
> IOMMU_FEAT_COHERENT_WALK. Does it mean the IOMMU walker does not
> support coherent walk (i.e. snooping the cache)?

*AFAIK*, not supported.

The Linux driver reports that coherent_walk is not supported as well.


>
>
> Note that when this feature is not set, the p2m code will need to
> clean each P2M entry when it is updated. So if the IPMMU supports coherent
> walk, I would strongly suggest setting the flag :).

When playing with a non-shared IOMMU in Xen (two years ago), I noticed that I
had forgotten to use clean_dcache after updating a page table entry. I could
see faults when shattering superpages, for example. Once I added it, the
faults went away completely.

So, I will leave IOMMU_FEAT_COHERENT_WALK disabled, but will keep your
suggestion in mind.


>
> Cheers,
>

Patch

diff --git a/xen/arch/arm/platforms/Kconfig b/xen/arch/arm/platforms/Kconfig
index bc0e9cd..c93a6b2 100644
--- a/xen/arch/arm/platforms/Kconfig
+++ b/xen/arch/arm/platforms/Kconfig
@@ -25,6 +25,7 @@  config RCAR3
 	bool "Renesas RCar3 support"
 	depends on ARM_64
 	select HAS_SCIF
+	select IPMMU_VMSA
 	---help---
 	Enable all the required drivers for Renesas RCar3
 
diff --git a/xen/drivers/passthrough/Kconfig b/xen/drivers/passthrough/Kconfig
index a3c0649..3daee16 100644
--- a/xen/drivers/passthrough/Kconfig
+++ b/xen/drivers/passthrough/Kconfig
@@ -12,4 +12,17 @@  config ARM_SMMU
 
 	  Say Y here if your SoC includes an IOMMU device implementing the
 	  ARM SMMU architecture.
+
+config IPMMU_VMSA
+	bool "Renesas IPMMU-VMSA found in R-Car Gen3 SoCs"
+	default n
+	depends on ARM_64
+	---help---
+	  Support for implementations of the Renesas IPMMU-VMSA found
+	  in R-Car Gen3 SoCs.
+
+	  Say Y here if you are using the newest R-Car Gen3 SoC revisions
+	  (H3 ES3.0, M3 ES3.0, etc.) whose IPMMU hardware supports the stage 2
+	  translation table format and is able to use the CPU's P2M table as is.
+
 endif
diff --git a/xen/drivers/passthrough/arm/Makefile b/xen/drivers/passthrough/arm/Makefile
index 5fbad45..fcd918e 100644
--- a/xen/drivers/passthrough/arm/Makefile
+++ b/xen/drivers/passthrough/arm/Makefile
@@ -1,2 +1,3 @@ 
 obj-y += iommu.o iommu_helpers.o iommu_fwspec.o
 obj-$(CONFIG_ARM_SMMU) += smmu.o
+obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
diff --git a/xen/drivers/passthrough/arm/ipmmu-vmsa.c b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
new file mode 100644
index 0000000..a34a8f8
--- /dev/null
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -0,0 +1,1342 @@ 
+/*
+ * xen/drivers/passthrough/arm/ipmmu-vmsa.c
+ *
+ * Driver for the Renesas IPMMU-VMSA found in R-Car Gen3 SoCs.
+ *
+ * The IPMMU-VMSA is a VMSA-compatible I/O Memory Management Unit (IOMMU)
+ * which provides address translation and access protection functionalities
+ * to processing units and interconnect networks.
+ *
+ * Please note, the current driver is supposed to work only with the newest
+ * Gen3 SoC revisions, whose IPMMU hardware supports the stage 2 translation
+ * table format and is able to use the CPU's P2M table as is.
+ *
+ * Based on Linux's IPMMU-VMSA driver from Renesas BSP:
+ *    drivers/iommu/ipmmu-vmsa.c
+ * which you can find at:
+ *    url: git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
+ *    branch: v4.14.75-ltsi/rcar-3.9.6
+ *    commit: e206eb5b81a60e64c35fbc3a999b1a0db2b98044
+ * and Xen's SMMU driver:
+ *    xen/drivers/passthrough/arm/smmu.c
+ *
+ * Copyright (C) 2016-2019 EPAM Systems Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/delay.h>
+#include <xen/err.h>
+#include <xen/irq.h>
+#include <xen/lib.h>
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <xen/sched.h>
+#include <xen/vmap.h>
+#include <asm/atomic.h>
+#include <asm/device.h>
+#include <asm/io.h>
+
+#define dev_name(dev) dt_node_full_name(dev_to_dt(dev))
+
+/* Device logger functions */
+#define dev_print(dev, lvl, fmt, ...)    \
+    printk(lvl "ipmmu: %s: " fmt, dev_name(dev), ## __VA_ARGS__)
+
+#define dev_info(dev, fmt, ...)    \
+    dev_print(dev, XENLOG_INFO, fmt, ## __VA_ARGS__)
+#define dev_warn(dev, fmt, ...)    \
+    dev_print(dev, XENLOG_WARNING, fmt, ## __VA_ARGS__)
+#define dev_err(dev, fmt, ...)     \
+    dev_print(dev, XENLOG_ERR, fmt, ## __VA_ARGS__)
+#define dev_err_ratelimited(dev, fmt, ...)    \
+    dev_print(dev, XENLOG_ERR, fmt, ## __VA_ARGS__)
+
+/*
+ * Gen3 SoCs make use of up to 8 IPMMU contexts (sets of page table) and
+ * these can be managed independently. Each context is mapped to one Xen domain.
+ */
+#define IPMMU_CTX_MAX     8
+/* Gen3 SoCs make use of up to 48 micro-TLBs per IPMMU device. */
+#define IPMMU_UTLB_MAX    48
+
+/* IPMMU context supports IPA size up to 40 bit. */
+#define IPMMU_MAX_P2M_IPA_BITS    40
+
+/*
+ * Xen's domain IPMMU information stored in dom_iommu(d)->arch.priv
+ *
+ * As each context (set of page table) is mapped to one Xen domain,
+ * all associated IPMMU domains use the same context mapped to this Xen domain.
+ * This makes all master devices being attached to the same Xen domain share
+ * the same context (P2M table).
+ */
+struct ipmmu_vmsa_xen_domain {
+    /*
+     * Used to protect everything which belongs to this Xen domain:
+     * device assignment, domain init/destroy, flush ops, etc
+     */
+    spinlock_t lock;
+    /* One or more Cache IPMMU domains associated with this Xen domain */
+    struct list_head cache_domains;
+    /* Root IPMMU domain associated with this Xen domain */
+    struct ipmmu_vmsa_domain *root_domain;
+};
+
+/* Xen master device's IPMMU information stored in fwspec->iommu_priv */
+struct ipmmu_vmsa_xen_device {
+    /* Cache IPMMU domain this master device is logically attached to */
+    struct ipmmu_vmsa_domain *domain;
+    /* Cache IPMMU this master device is physically connected to */
+    struct ipmmu_vmsa_device *mmu;
+};
+
+/* Root/Cache IPMMU device's information */
+struct ipmmu_vmsa_device {
+    struct device *dev;
+    void __iomem *base;
+    struct ipmmu_vmsa_device *root;
+    struct list_head list;
+    unsigned int num_utlbs;
+    unsigned int num_ctx;
+    spinlock_t lock;    /* Protects ctx and domains[] */
+    DECLARE_BITMAP(ctx, IPMMU_CTX_MAX);
+    struct ipmmu_vmsa_domain *domains[IPMMU_CTX_MAX];
+};
+
+/*
+ * Root/Cache IPMMU domain's information
+ *
+ * Root IPMMU device is assigned to Root IPMMU domain while Cache IPMMU device
+ * is assigned to Cache IPMMU domain. Master devices are connected to Cache
+ * IPMMU devices through specific ports called micro-TLBs.
+ * All Cache IPMMU devices, in turn, are connected to Root IPMMU device
+ * which manages IPMMU context.
+ */
+struct ipmmu_vmsa_domain {
+    /*
+     * IPMMU device assigned to this IPMMU domain.
+     * Either Root device which is located at the main memory bus domain or
+     * Cache device which is located at each hierarchy bus domain.
+     */
+    struct ipmmu_vmsa_device *mmu;
+
+    /* Context used for this IPMMU domain */
+    unsigned int context_id;
+
+    /* Xen domain associated with this IPMMU domain */
+    struct domain *d;
+
+    /* The fields below are used for Cache IPMMU domain only */
+
+    /*
+     * Used to keep track of the master devices which are attached to this
+     * IPMMU domain (domain users). Master devices behind the same IPMMU device
+     * are grouped together by putting into the same IPMMU domain.
+     * Only when the refcount reaches 0 this IPMMU domain can be destroyed.
+     */
+    unsigned int refcount;
+    /* Used to link this IPMMU domain for the same Xen domain */
+    struct list_head list;
+};
+
+/* Used to keep track of registered IPMMU devices */
+static LIST_HEAD(ipmmu_devices);
+static DEFINE_SPINLOCK(ipmmu_devices_lock);
+
+#define TLB_LOOP_TIMEOUT    100 /* 100us */
+
+/* Registers Definition */
+#define IM_CTX_SIZE    0x40
+
+#define IMCTR                0x0000
+/*
+ * These fields are implemented in IPMMU-MM only. So, can be set for
+ * Root IPMMU only.
+ */
+#define IMCTR_VA64           (1 << 29)
+#define IMCTR_TRE            (1 << 17)
+#define IMCTR_AFE            (1 << 16)
+#define IMCTR_RTSEL_MASK     (3 << 4)
+#define IMCTR_RTSEL_SHIFT    4
+#define IMCTR_TREN           (1 << 3)
+/*
+ * These fields are common for all IPMMU devices. So, can be set for
+ * Cache IPMMUs as well.
+ */
+#define IMCTR_INTEN          (1 << 2)
+#define IMCTR_FLUSH          (1 << 1)
+#define IMCTR_MMUEN          (1 << 0)
+#define IMCTR_COMMON_MASK    (7 << 0)
+
+#define IMCAAR               0x0004
+
+#define IMTTBCR                        0x0008
+#define IMTTBCR_EAE                    (1 << 31)
+#define IMTTBCR_PMB                    (1 << 30)
+#define IMTTBCR_SH1_NON_SHAREABLE      (0 << 28)
+#define IMTTBCR_SH1_OUTER_SHAREABLE    (2 << 28)
+#define IMTTBCR_SH1_INNER_SHAREABLE    (3 << 28)
+#define IMTTBCR_SH1_MASK               (3 << 28)
+#define IMTTBCR_ORGN1_NC               (0 << 26)
+#define IMTTBCR_ORGN1_WB_WA            (1 << 26)
+#define IMTTBCR_ORGN1_WT               (2 << 26)
+#define IMTTBCR_ORGN1_WB               (3 << 26)
+#define IMTTBCR_ORGN1_MASK             (3 << 26)
+#define IMTTBCR_IRGN1_NC               (0 << 24)
+#define IMTTBCR_IRGN1_WB_WA            (1 << 24)
+#define IMTTBCR_IRGN1_WT               (2 << 24)
+#define IMTTBCR_IRGN1_WB               (3 << 24)
+#define IMTTBCR_IRGN1_MASK             (3 << 24)
+#define IMTTBCR_TSZ1_MASK              (0x1f << 16)
+#define IMTTBCR_TSZ1_SHIFT             16
+#define IMTTBCR_SH0_NON_SHAREABLE      (0 << 12)
+#define IMTTBCR_SH0_OUTER_SHAREABLE    (2 << 12)
+#define IMTTBCR_SH0_INNER_SHAREABLE    (3 << 12)
+#define IMTTBCR_SH0_MASK               (3 << 12)
+#define IMTTBCR_ORGN0_NC               (0 << 10)
+#define IMTTBCR_ORGN0_WB_WA            (1 << 10)
+#define IMTTBCR_ORGN0_WT               (2 << 10)
+#define IMTTBCR_ORGN0_WB               (3 << 10)
+#define IMTTBCR_ORGN0_MASK             (3 << 10)
+#define IMTTBCR_IRGN0_NC               (0 << 8)
+#define IMTTBCR_IRGN0_WB_WA            (1 << 8)
+#define IMTTBCR_IRGN0_WT               (2 << 8)
+#define IMTTBCR_IRGN0_WB               (3 << 8)
+#define IMTTBCR_IRGN0_MASK             (3 << 8)
+#define IMTTBCR_SL0_LVL_2              (0 << 6)
+#define IMTTBCR_SL0_LVL_1              (1 << 6)
+#define IMTTBCR_TSZ0_MASK              (0x1f << 0)
+#define IMTTBCR_TSZ0_SHIFT             0
+
+#define IMTTLBR0              0x0010
+#define IMTTLBR0_TTBR_MASK    (0xfffff << 12)
+#define IMTTUBR0              0x0014
+#define IMTTUBR0_TTBR_MASK    (0xff << 0)
+#define IMTTLBR1              0x0018
+#define IMTTLBR1_TTBR_MASK    (0xfffff << 12)
+#define IMTTUBR1              0x001c
+#define IMTTUBR1_TTBR_MASK    (0xff << 0)
+
+#define IMSTR                          0x0020
+#define IMSTR_ERRLVL_MASK              (3 << 12)
+#define IMSTR_ERRLVL_SHIFT             12
+#define IMSTR_ERRCODE_TLB_FORMAT       (1 << 8)
+#define IMSTR_ERRCODE_ACCESS_PERM      (4 << 8)
+#define IMSTR_ERRCODE_SECURE_ACCESS    (5 << 8)
+#define IMSTR_ERRCODE_MASK             (7 << 8)
+#define IMSTR_MHIT                     (1 << 4)
+#define IMSTR_ABORT                    (1 << 2)
+#define IMSTR_PF                       (1 << 1)
+#define IMSTR_TF                       (1 << 0)
+
+#define IMELAR    0x0030
+#define IMEUAR    0x0034
+
+#define IMUCTR(n)              ((n) < 32 ? IMUCTR0(n) : IMUCTR32(n))
+#define IMUCTR0(n)             (0x0300 + ((n) * 16))
+#define IMUCTR32(n)            (0x0600 + (((n) - 32) * 16))
+#define IMUCTR_FIXADDEN        (1 << 31)
+#define IMUCTR_FIXADD_MASK     (0xff << 16)
+#define IMUCTR_FIXADD_SHIFT    16
+#define IMUCTR_TTSEL_MMU(n)    ((n) << 4)
+#define IMUCTR_TTSEL_PMB       (8 << 4)
+#define IMUCTR_TTSEL_MASK      (15 << 4)
+#define IMUCTR_FLUSH           (1 << 1)
+#define IMUCTR_MMUEN           (1 << 0)
+
+#define IMUASID(n)             ((n) < 32 ? IMUASID0(n) : IMUASID32(n))
+#define IMUASID0(n)            (0x0308 + ((n) * 16))
+#define IMUASID32(n)           (0x0608 + (((n) - 32) * 16))
+#define IMUASID_ASID8_MASK     (0xff << 8)
+#define IMUASID_ASID8_SHIFT    8
+#define IMUASID_ASID0_MASK     (0xff << 0)
+#define IMUASID_ASID0_SHIFT    0
+
+#define IMSAUXCTLR          0x0504
+#define IMSAUXCTLR_S2PTE    (1 << 3)
+
+static struct ipmmu_vmsa_device *to_ipmmu(struct device *dev)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+    return fwspec && fwspec->iommu_priv ?
+        ((struct ipmmu_vmsa_xen_device *)fwspec->iommu_priv)->mmu : NULL;
+}
+
+static void set_ipmmu(struct device *dev, struct ipmmu_vmsa_device *mmu)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+    ((struct ipmmu_vmsa_xen_device *)fwspec->iommu_priv)->mmu = mmu;
+}
+
+static struct ipmmu_vmsa_domain *to_domain(struct device *dev)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+    return fwspec && fwspec->iommu_priv ?
+        ((struct ipmmu_vmsa_xen_device *)fwspec->iommu_priv)->domain : NULL;
+}
+
+static void set_domain(struct device *dev, struct ipmmu_vmsa_domain *domain)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+    ((struct ipmmu_vmsa_xen_device *)fwspec->iommu_priv)->domain = domain;
+}
+
+static struct ipmmu_vmsa_device *ipmmu_find_mmu_by_dev(struct device *dev)
+{
+    struct ipmmu_vmsa_device *mmu = NULL;
+    bool found = false;
+
+    spin_lock(&ipmmu_devices_lock);
+
+    list_for_each_entry ( mmu, &ipmmu_devices, list )
+    {
+        if ( mmu->dev == dev )
+        {
+            found = true;
+            break;
+        }
+    }
+
+    spin_unlock(&ipmmu_devices_lock);
+
+    return found ? mmu : NULL;
+}
+
+/* Root device handling */
+static bool ipmmu_is_root(struct ipmmu_vmsa_device *mmu)
+{
+    return mmu->root == mmu;
+}
+
+static struct ipmmu_vmsa_device *ipmmu_find_root(void)
+{
+    struct ipmmu_vmsa_device *mmu = NULL;
+    bool found = false;
+
+    spin_lock(&ipmmu_devices_lock);
+
+    list_for_each_entry( mmu, &ipmmu_devices, list )
+    {
+        if ( ipmmu_is_root(mmu) )
+        {
+            found = true;
+            break;
+        }
+    }
+
+    spin_unlock(&ipmmu_devices_lock);
+
+    return found ? mmu : NULL;
+}
+
+/* Read/Write Access */
+static uint32_t ipmmu_read(struct ipmmu_vmsa_device *mmu, uint32_t offset)
+{
+    return readl(mmu->base + offset);
+}
+
+static void ipmmu_write(struct ipmmu_vmsa_device *mmu, uint32_t offset,
+                        uint32_t data)
+{
+    writel(data, mmu->base + offset);
+}
+
+static uint32_t ipmmu_ctx_read_root(struct ipmmu_vmsa_domain *domain,
+                                    uint32_t reg)
+{
+    return ipmmu_read(domain->mmu->root,
+                      domain->context_id * IM_CTX_SIZE + reg);
+}
+
+static void ipmmu_ctx_write_root(struct ipmmu_vmsa_domain *domain,
+                                 uint32_t reg, uint32_t data)
+{
+    ipmmu_write(domain->mmu->root,
+                domain->context_id * IM_CTX_SIZE + reg, data);
+}
+
+static void ipmmu_ctx_write_cache(struct ipmmu_vmsa_domain *domain,
+                                  uint32_t reg, uint32_t data)
+{
+    /* We expect only IMCTR value to be passed as a reg. */
+    ASSERT(reg == IMCTR);
+
+    /* Mask fields which are implemented in IPMMU-MM only. */
+    if ( !ipmmu_is_root(domain->mmu) )
+        ipmmu_write(domain->mmu, domain->context_id * IM_CTX_SIZE + reg,
+                    data & IMCTR_COMMON_MASK);
+}
+
+/*
+ * Write the context to both Root IPMMU and all Cache IPMMUs assigned
+ * to this Xen domain.
+ */
+static void ipmmu_ctx_write_all(struct ipmmu_vmsa_domain *domain,
+                                uint32_t reg, uint32_t data)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(domain->d)->arch.priv;
+    struct ipmmu_vmsa_domain *cache_domain;
+
+    list_for_each_entry( cache_domain, &xen_domain->cache_domains, list )
+        ipmmu_ctx_write_cache(cache_domain, reg, data);
+
+    ipmmu_ctx_write_root(domain, reg, data);
+}
+
+/* TLB and micro-TLB Management */
+
+/* Wait for any pending TLB invalidations to complete. */
+static void ipmmu_tlb_sync(struct ipmmu_vmsa_domain *domain)
+{
+    unsigned int count = 0;
+
+    while ( ipmmu_ctx_read_root(domain, IMCTR) & IMCTR_FLUSH )
+    {
+        cpu_relax();
+        if ( ++count == TLB_LOOP_TIMEOUT )
+        {
+            dev_err_ratelimited(domain->mmu->dev, "TLB sync timed out -- MMU may be deadlocked\n");
+            return;
+        }
+        udelay(1);
+    }
+}
+
+static void ipmmu_tlb_invalidate(struct ipmmu_vmsa_domain *domain)
+{
+    uint32_t data;
+
+    data = ipmmu_ctx_read_root(domain, IMCTR);
+    data |= IMCTR_FLUSH;
+    ipmmu_ctx_write_all(domain, IMCTR, data);
+
+    ipmmu_tlb_sync(domain);
+}
+
+/* Enable MMU translation for the micro-TLB. */
+static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
+                              unsigned int utlb)
+{
+    struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+    /*
+     * TODO: Reference-count the micro-TLB as several bus masters can be
+     * connected to the same micro-TLB. Prevent the use cases where
+     * the same micro-TLB could be shared between multiple Xen domains.
+     */
+    ipmmu_write(mmu, IMUASID(utlb), 0);
+    ipmmu_write(mmu, IMUCTR(utlb), ipmmu_read(mmu, IMUCTR(utlb)) |
+                IMUCTR_TTSEL_MMU(domain->context_id) | IMUCTR_MMUEN);
+}
+
+/* Disable MMU translation for the micro-TLB. */
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
+                               unsigned int utlb)
+{
+    struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+    ipmmu_write(mmu, IMUCTR(utlb), 0);
+}
+
+/* Domain/Context Management */
+static int ipmmu_domain_allocate_context(struct ipmmu_vmsa_device *mmu,
+                                         struct ipmmu_vmsa_domain *domain)
+{
+    unsigned long flags;
+    int ret;
+
+    spin_lock_irqsave(&mmu->lock, flags);
+
+    ret = find_first_zero_bit(mmu->ctx, mmu->num_ctx);
+    if ( ret != mmu->num_ctx )
+    {
+        mmu->domains[ret] = domain;
+        set_bit(ret, mmu->ctx);
+    }
+    else
+        ret = -EBUSY;
+
+    spin_unlock_irqrestore(&mmu->lock, flags);
+
+    return ret;
+}
+
+static void ipmmu_domain_free_context(struct ipmmu_vmsa_device *mmu,
+                                      unsigned int context_id)
+{
+    unsigned long flags;
+
+    spin_lock_irqsave(&mmu->lock, flags);
+
+    clear_bit(context_id, mmu->ctx);
+    mmu->domains[context_id] = NULL;
+
+    spin_unlock_irqrestore(&mmu->lock, flags);
+}
+
+static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
+{
+    uint64_t ttbr;
+    uint32_t tsz0;
+    int ret;
+
+    /* Find an unused context. */
+    ret = ipmmu_domain_allocate_context(domain->mmu->root, domain);
+    if ( ret < 0 )
+        return ret;
+
+    domain->context_id = ret;
+
+    /*
+     * TTBR0
+     * Use P2M table for this Xen domain.
+     */
+    ASSERT(domain->d != NULL);
+    ttbr = page_to_maddr(domain->d->arch.p2m.root);
+
+    dev_info(domain->mmu->root->dev, "%pd: Set IPMMU context %u (pgd 0x%"PRIx64")\n",
+             domain->d, domain->context_id, ttbr);
+
+    ipmmu_ctx_write_root(domain, IMTTLBR0, ttbr & IMTTLBR0_TTBR_MASK);
+    ipmmu_ctx_write_root(domain, IMTTUBR0, (ttbr >> 32) & IMTTUBR0_TTBR_MASK);
+
+    /*
+     * TTBCR
+     * We use long descriptors and allocate the whole "p2m_ipa_bits" IPA space
+     * to TTBR0. Use 4KB page granule. Start page table walks at first level.
+     * Always bypass stage 1 translation.
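+     * For example, a 40-bit IPA space gives TSZ0 = 64 - 40 = 24.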
+     */
+    tsz0 = (64 - p2m_ipa_bits) << IMTTBCR_TSZ0_SHIFT;
+    ipmmu_ctx_write_root(domain, IMTTBCR, IMTTBCR_EAE | IMTTBCR_PMB |
+                         IMTTBCR_SL0_LVL_1 | tsz0);
+
+    /*
+     * IMSTR
+     * Clear all interrupt flags.
+     */
+    ipmmu_ctx_write_root(domain, IMSTR, ipmmu_ctx_read_root(domain, IMSTR));
+
+    /*
+     * IMCTR
+     * Enable the MMU and interrupt generation. The long-descriptor
+     * translation table format doesn't use TEX remapping. Don't enable AF
+     * software management as we have no use for it. Use VMSAv8-64 mode.
+     * Enable the context for Root IPMMU only. Flush the TLB as required
+     * when modifying the context registers.
+     */
+    ipmmu_ctx_write_root(domain, IMCTR,
+                         IMCTR_VA64 | IMCTR_INTEN | IMCTR_FLUSH | IMCTR_MMUEN);
+
+    return 0;
+}
+
+static void ipmmu_domain_destroy_context(struct ipmmu_vmsa_domain *domain)
+{
+    if ( !domain->mmu )
+        return;
+
+    /*
+     * Disable the context for Root IPMMU only. Flush the TLB as required
+     * when modifying the context registers.
+     */
+    ipmmu_ctx_write_root(domain, IMCTR, IMCTR_FLUSH);
+    ipmmu_tlb_sync(domain);
+
+    ipmmu_domain_free_context(domain->mmu->root, domain->context_id);
+}
+
+/* Fault Handling */
+static void ipmmu_domain_irq(struct ipmmu_vmsa_domain *domain)
+{
+    const uint32_t err_mask = IMSTR_MHIT | IMSTR_ABORT | IMSTR_PF | IMSTR_TF;
+    struct ipmmu_vmsa_device *mmu = domain->mmu;
+    uint32_t status;
+    uint64_t iova;
+
+    status = ipmmu_ctx_read_root(domain, IMSTR);
+    if ( !(status & err_mask) )
+        return;
+
+    iova = ipmmu_ctx_read_root(domain, IMELAR) |
+        ((uint64_t)ipmmu_ctx_read_root(domain, IMEUAR) << 32);
+
+    /*
+     * Clear the error status flags. Unlike traditional interrupt flag
+     * registers that must be cleared by writing 1, this status register
+     * seems to require 0. The error address register must be read before
+     * clearing the status, otherwise its value will be 0.
+     */
+    ipmmu_ctx_write_root(domain, IMSTR, 0);
+
+    /* Log fatal errors. */
+    if ( status & IMSTR_MHIT )
+        dev_err_ratelimited(mmu->dev, "%pd: Multiple TLB hits @0x%"PRIx64"\n",
+                            domain->d, iova);
+    if ( status & IMSTR_ABORT )
+        dev_err_ratelimited(mmu->dev, "%pd: Page Table Walk Abort @0x%"PRIx64"\n",
+                            domain->d, iova);
+
+    /* Return if it is neither Permission Fault nor Translation Fault. */
+    if ( !(status & (IMSTR_PF | IMSTR_TF)) )
+        return;
+
+    /* Flush the TLB as required when an IPMMU translation error occurs. */
+    ipmmu_tlb_invalidate(domain);
+
+    dev_err_ratelimited(mmu->dev, "%pd: Unhandled fault: status 0x%08x iova 0x%"PRIx64"\n",
+                        domain->d, status, iova);
+}
+
+static void ipmmu_irq(int irq, void *dev, struct cpu_user_regs *regs)
+{
+    struct ipmmu_vmsa_device *mmu = dev;
+    unsigned int i;
+    unsigned long flags;
+
+    spin_lock_irqsave(&mmu->lock, flags);
+
+    /*
+     * When an interrupt arrives, we don't know which context it relates to.
+     * So, check the interrupt status of all active contexts to locate
+     * the contexts with status bits set.
+     */
+    for ( i = 0; i < mmu->num_ctx; i++ )
+    {
+        if ( !mmu->domains[i] )
+            continue;
+        ipmmu_domain_irq(mmu->domains[i]);
+    }
+
+    spin_unlock_irqrestore(&mmu->lock, flags);
+}
+
+/* Master devices management */
+static int ipmmu_attach_device(struct ipmmu_vmsa_domain *domain,
+                               struct device *dev)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+    struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
+    unsigned int i;
+
+    if ( !mmu )
+    {
+        dev_err(dev, "Cannot attach to IPMMU\n");
+        return -ENXIO;
+    }
+
+    if ( !domain->mmu )
+    {
+        /* The domain hasn't been used yet, initialize it. */
+        domain->mmu = mmu;
+
+        /*
+         * We have already enabled the context for the Root IPMMU assigned
+         * to this Xen domain in ipmmu_domain_init_context().
+         * Enable the context for the Cache IPMMU only. Flush the TLB as
+         * required when modifying the context registers.
+         */
+        ipmmu_ctx_write_cache(domain, IMCTR,
+                              ipmmu_ctx_read_root(domain, IMCTR) | IMCTR_FLUSH);
+
+        dev_info(dev, "Using IPMMU context %u\n", domain->context_id);
+    }
+    else if ( domain->mmu != mmu )
+    {
+        /*
+         * Something is wrong; we can't attach two master devices using
+         * different IOMMUs to the same IPMMU domain.
+         */
+        dev_err(dev, "Can't attach IPMMU %s to domain on IPMMU %s\n",
+                dev_name(mmu->dev), dev_name(domain->mmu->dev));
+        return -EINVAL;
+    }
+    else
+        dev_info(dev, "Reusing IPMMU context %u\n", domain->context_id);
+
+    for ( i = 0; i < fwspec->num_ids; ++i )
+        ipmmu_utlb_enable(domain, fwspec->ids[i]);
+
+    return 0;
+}
+
+static void ipmmu_detach_device(struct ipmmu_vmsa_domain *domain,
+                                struct device *dev)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+    unsigned int i;
+
+    for ( i = 0; i < fwspec->num_ids; ++i )
+        ipmmu_utlb_disable(domain, fwspec->ids[i]);
+}
+
+static void ipmmu_device_reset(struct ipmmu_vmsa_device *mmu)
+{
+    unsigned int i;
+
+    /* Disable all contexts. */
+    for ( i = 0; i < mmu->num_ctx; ++i )
+        ipmmu_write(mmu, i * IM_CTX_SIZE + IMCTR, 0);
+}
+
+/*
+ * This function relies on the Root IPMMU device being probed first.
+ * If that is not the case, it defers further Cache IPMMU device probes
+ * (by returning -EAGAIN) until the Root IPMMU device has actually been
+ * registered.
+ */
+static int ipmmu_probe(struct dt_device_node *node)
+{
+    struct ipmmu_vmsa_device *mmu;
+    uint64_t addr, size;
+    int irq, ret;
+
+    mmu = xzalloc(struct ipmmu_vmsa_device);
+    if ( !mmu )
+    {
+        dev_err(&node->dev, "Cannot allocate device data\n");
+        return -ENOMEM;
+    }
+
+    mmu->dev = &node->dev;
+    mmu->num_utlbs = IPMMU_UTLB_MAX;
+    mmu->num_ctx = IPMMU_CTX_MAX;
+    spin_lock_init(&mmu->lock);
+    bitmap_zero(mmu->ctx, IPMMU_CTX_MAX);
+
+    /* Map I/O memory and request IRQ. */
+    ret = dt_device_get_address(node, 0, &addr, &size);
+    if ( ret )
+    {
+        dev_err(&node->dev, "Failed to get MMIO\n");
+        goto out;
+    }
+
+    mmu->base = ioremap_nocache(addr, size);
+    if ( !mmu->base )
+    {
+        dev_err(&node->dev, "Failed to ioremap MMIO (addr 0x%"PRIx64" size 0x%"PRIx64")\n",
+                addr, size);
+        ret = -ENOMEM;
+        goto out;
+    }
+
+    /*
+     * Determine if this IPMMU node is a Root device by checking for
+     * the absence of the "renesas,ipmmu-main" property.
+     */
+    if ( !dt_find_property(node, "renesas,ipmmu-main", NULL) )
+        mmu->root = mmu;
+    else
+        mmu->root = ipmmu_find_root();
+
+    /* Defer probing if the Root device has not been registered yet. */
+    if ( !mmu->root )
+    {
+        dev_err(&node->dev, "Root IPMMU hasn't been registered yet\n");
+        ret = -EAGAIN;
+        goto out;
+    }
+
+    /* Root devices have mandatory IRQs. */
+    if ( ipmmu_is_root(mmu) )
+    {
+        irq = platform_get_irq(node, 0);
+        if ( irq < 0 )
+        {
+            dev_err(&node->dev, "No IRQ found\n");
+            ret = irq;
+            goto out;
+        }
+
+        ret = request_irq(irq, 0, ipmmu_irq, dev_name(&node->dev), mmu);
+        if ( ret < 0 )
+        {
+            dev_err(&node->dev, "Failed to request IRQ %d\n", irq);
+            goto out;
+        }
+
+        ipmmu_device_reset(mmu);
+
+        /*
+         * Use the stage 2 translation table format when stage 2 translation
+         * is enabled.
+         */
+        ipmmu_write(mmu, IMSAUXCTLR,
+                    ipmmu_read(mmu, IMSAUXCTLR) | IMSAUXCTLR_S2PTE);
+
+        dev_info(&node->dev, "IPMMU context 0 is reserved\n");
+        set_bit(0, mmu->ctx);
+    }
+
+    spin_lock(&ipmmu_devices_lock);
+    list_add(&mmu->list, &ipmmu_devices);
+    spin_unlock(&ipmmu_devices_lock);
+
+    dev_info(&node->dev, "Registered %s IPMMU\n",
+             ipmmu_is_root(mmu) ? "Root" : "Cache");
+
+    return 0;
+
+out:
+    if ( mmu->base )
+        iounmap(mmu->base);
+    xfree(mmu);
+
+    return ret;
+}
+
+/* Xen IOMMU ops */
+static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+
+    if ( !xen_domain || !xen_domain->root_domain )
+        return 0;
+
+    spin_lock(&xen_domain->lock);
+    ipmmu_tlb_invalidate(xen_domain->root_domain);
+    spin_unlock(&xen_domain->lock);
+
+    return 0;
+}
+
+static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
+                                          unsigned int page_count,
+                                          unsigned int flush_flags)
+{
+    ASSERT(flush_flags);
+
+    /* The hardware doesn't support selective TLB flush. */
+    return ipmmu_iotlb_flush_all(d);
+}
+
+static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
+                                                        struct device *dev)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+    struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
+    struct ipmmu_vmsa_domain *domain;
+
+    if ( !mmu )
+        return NULL;
+
+    /*
+     * Loop through all Cache IPMMU domains associated with this Xen domain
+     * to locate an IPMMU domain this IPMMU device is assigned to.
+     */
+    list_for_each_entry( domain, &xen_domain->cache_domains, list )
+    {
+        if ( domain->mmu == mmu )
+            return domain;
+    }
+
+    return NULL;
+}
+
+static struct ipmmu_vmsa_domain *ipmmu_alloc_cache_domain(struct domain *d)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+    struct ipmmu_vmsa_domain *domain;
+
+    domain = xzalloc(struct ipmmu_vmsa_domain);
+    if ( !domain )
+        return ERR_PTR(-ENOMEM);
+
+    /*
+     * We don't assign the Cache IPMMU device here; it will be assigned when
+     * attaching a master device to this domain in ipmmu_attach_device().
+     * domain->mmu = NULL;
+     */
+
+    domain->d = d;
+    /* Use the same context mapped to this Xen domain. */
+    domain->context_id = xen_domain->root_domain->context_id;
+
+    return domain;
+}
+
+static void ipmmu_free_cache_domain(struct ipmmu_vmsa_domain *domain)
+{
+    list_del(&domain->list);
+    /*
+     * Disable the context for Cache IPMMU only. Flush the TLB as required
+     * when modifying the context registers.
+     */
+    ipmmu_ctx_write_cache(domain, IMCTR, IMCTR_FLUSH);
+    xfree(domain);
+}
+
+static struct ipmmu_vmsa_domain *ipmmu_alloc_root_domain(struct domain *d)
+{
+    struct ipmmu_vmsa_domain *domain;
+    struct ipmmu_vmsa_device *root;
+    int ret;
+
+    /* If we are here, the Root device must have been registered. */
+    root = ipmmu_find_root();
+    if ( !root )
+    {
+        printk(XENLOG_ERR "ipmmu: Unable to locate Root IPMMU\n");
+        return ERR_PTR(-ENODEV);
+    }
+
+    domain = xzalloc(struct ipmmu_vmsa_domain);
+    if ( !domain )
+        return ERR_PTR(-ENOMEM);
+
+    domain->mmu = root;
+    domain->d = d;
+
+    /* Initialize the context to be mapped to this Xen domain. */
+    ret = ipmmu_domain_init_context(domain);
+    if ( ret < 0 )
+    {
+        dev_err(root->dev, "%pd: Unable to initialize IPMMU context\n", d);
+        xfree(domain);
+        return ERR_PTR(ret);
+    }
+
+    return domain;
+}
+
+static void ipmmu_free_root_domain(struct ipmmu_vmsa_domain *domain)
+{
+    ipmmu_domain_destroy_context(domain);
+    xfree(domain);
+}
+
+static int ipmmu_assign_device(struct domain *d, u8 devfn, struct device *dev,
+                               uint32_t flag)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+    struct ipmmu_vmsa_domain *domain;
+    int ret;
+
+    if ( !xen_domain )
+        return -EINVAL;
+
+    if ( !to_ipmmu(dev) )
+        return -ENODEV;
+
+    spin_lock(&xen_domain->lock);
+
+    /*
+     * The IPMMU context for the Xen domain is not allocated beforehand
+     * (at Xen domain creation time), but on demand only, when the first
+     * master device is attached to it.
+     * Create the Root IPMMU domain whose context will be mapped to this
+     * Xen domain if it doesn't exist yet.
+     */
+    if ( !xen_domain->root_domain )
+    {
+        domain = ipmmu_alloc_root_domain(d);
+        if ( IS_ERR(domain) )
+        {
+            ret = PTR_ERR(domain);
+            goto out;
+        }
+
+        xen_domain->root_domain = domain;
+    }
+
+    if ( to_domain(dev) )
+    {
+        dev_err(dev, "Already attached to IPMMU domain\n");
+        ret = -EEXIST;
+        goto out;
+    }
+
+    /*
+     * Master devices behind the same Cache IPMMU can be attached to the same
+     * Cache IPMMU domain.
+     * Before creating a new IPMMU domain, check whether the required one
+     * already exists for this Xen domain.
+     */
+    domain = ipmmu_get_cache_domain(d, dev);
+    if ( !domain )
+    {
+        /* Create a new IPMMU domain this master device will be attached to. */
+        domain = ipmmu_alloc_cache_domain(d);
+        if ( IS_ERR(domain) )
+        {
+            ret = PTR_ERR(domain);
+            goto out;
+        }
+
+        /* Chain the new IPMMU domain to the Xen domain. */
+        list_add(&domain->list, &xen_domain->cache_domains);
+    }
+
+    ret = ipmmu_attach_device(domain, dev);
+    if ( ret )
+    {
+        /*
+         * Destroy Cache IPMMU domain only if there are no master devices
+         * attached to it.
+         */
+        if ( !domain->refcount )
+            ipmmu_free_cache_domain(domain);
+    }
+    else
+    {
+        domain->refcount++;
+        set_domain(dev, domain);
+    }
+
+out:
+    spin_unlock(&xen_domain->lock);
+
+    return ret;
+}
+
+static int ipmmu_deassign_device(struct domain *d, struct device *dev)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+    struct ipmmu_vmsa_domain *domain = to_domain(dev);
+
+    if ( !domain || domain->d != d )
+    {
+        dev_err(dev, "Not attached to %pd\n", d);
+        return -ESRCH;
+    }
+
+    spin_lock(&xen_domain->lock);
+
+    ipmmu_detach_device(domain, dev);
+    set_domain(dev, NULL);
+    domain->refcount--;
+
+    /*
+     * Destroy Cache IPMMU domain only if there are no master devices
+     * attached to it.
+     */
+    if ( !domain->refcount )
+        ipmmu_free_cache_domain(domain);
+
+    spin_unlock(&xen_domain->lock);
+
+    return 0;
+}
+
+static int ipmmu_reassign_device(struct domain *s, struct domain *t,
+                                 u8 devfn,  struct device *dev)
+{
+    int ret = 0;
+
+    /* Don't allow remapping to a domain other than hwdom. */
+    if ( t && t != hardware_domain )
+        return -EPERM;
+
+    if ( t == s )
+        return 0;
+
+    ret = ipmmu_deassign_device(s, dev);
+    if ( ret )
+        return ret;
+
+    if ( t )
+    {
+        /* No flags are defined for ARM. */
+        ret = ipmmu_assign_device(t, devfn, dev, 0);
+        if ( ret )
+            return ret;
+    }
+
+    return 0;
+}
+
+static int ipmmu_add_device(u8 devfn, struct device *dev)
+{
+    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+    struct ipmmu_vmsa_device *mmu;
+    unsigned int i;
+
+    if ( !fwspec || !fwspec->iommu_dev )
+        return -EINVAL;
+
+    mmu = ipmmu_find_mmu_by_dev(fwspec->iommu_dev);
+    if ( !mmu )
+        return -ENODEV;
+
+    /*
+     * Perform a sanity check of the fwspec->num_ids and fwspec->ids[] fields.
+     * These fields describe the master device's connections to the Cache
+     * IPMMU (micro-TLBs). Each master device gets its micro-TLB assignment
+     * via the "iommus" property in DT.
+     */
+    if ( fwspec->num_ids == 0 )
+        return -EINVAL;
+
+    for ( i = 0; i < fwspec->num_ids; ++i )
+    {
+        if ( fwspec->ids[i] >= mmu->num_utlbs )
+            return -EINVAL;
+    }
+
+    if ( to_ipmmu(dev) )
+    {
+        dev_err(dev, "Already added to IPMMU\n");
+        return -EEXIST;
+    }
+
+    fwspec->iommu_priv = xzalloc(struct ipmmu_vmsa_xen_device);
+    if ( !fwspec->iommu_priv )
+        return -ENOMEM;
+
+    set_ipmmu(dev, mmu);
+
+    /* Let Xen know that the master device is protected by an IOMMU. */
+    dt_device_set_protected(dev_to_dt(dev));
+
+    dev_info(dev, "Added master device (IPMMU %s micro-TLBs %u)\n",
+             dev_name(mmu->dev), fwspec->num_ids);
+
+    return 0;
+}
+
+static int ipmmu_iommu_domain_init(struct domain *d)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain;
+
+    xen_domain = xzalloc(struct ipmmu_vmsa_xen_domain);
+    if ( !xen_domain )
+        return -ENOMEM;
+
+    spin_lock_init(&xen_domain->lock);
+    INIT_LIST_HEAD(&xen_domain->cache_domains);
+    /*
+     * We don't create the Root IPMMU domain here; it will be created
+     * on demand only, when attaching the first master device to this
+     * Xen domain in ipmmu_assign_device().
+     * xen_domain->root_domain = NULL;
+     */
+
+    dom_iommu(d)->arch.priv = xen_domain;
+
+    return 0;
+}
+
+static void __hwdom_init ipmmu_iommu_hwdom_init(struct domain *d)
+{
+    /* Disable the dom0-iommu options that are not supported on ARM. */
+    if ( iommu_hwdom_inclusive )
+        printk(XENLOG_WARNING "ipmmu: map-inclusive dom0-iommu option is not supported on ARM\n");
+    iommu_hwdom_inclusive = false;
+    if ( iommu_hwdom_reserved == 1 )
+        printk(XENLOG_WARNING "ipmmu: map-reserved dom0-iommu option is not supported on ARM\n");
+    iommu_hwdom_reserved = 0;
+
+    arch_iommu_hwdom_init(d);
+}
+
+static void ipmmu_iommu_domain_teardown(struct domain *d)
+{
+    struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+
+    if ( !xen_domain )
+        return;
+
+    spin_lock(&xen_domain->lock);
+    /*
+     * Destroy the Root IPMMU domain whose context is mapped to this
+     * Xen domain, if it exists.
+     */
+    if ( xen_domain->root_domain )
+        ipmmu_free_root_domain(xen_domain->root_domain);
+
+    spin_unlock(&xen_domain->lock);
+
+    /*
+     * We assume that all master devices have already been detached from
+     * this Xen domain and that there are no associated Cache IPMMU domains
+     * left in use.
+     */
+    ASSERT(list_empty(&xen_domain->cache_domains));
+    xfree(xen_domain);
+    dom_iommu(d)->arch.priv = NULL;
+}
+
+static const struct iommu_ops ipmmu_iommu_ops =
+{
+    .init            = ipmmu_iommu_domain_init,
+    .hwdom_init      = ipmmu_iommu_hwdom_init,
+    .teardown        = ipmmu_iommu_domain_teardown,
+    .iotlb_flush     = ipmmu_iotlb_flush,
+    .iotlb_flush_all = ipmmu_iotlb_flush_all,
+    .assign_device   = ipmmu_assign_device,
+    .reassign_device = ipmmu_reassign_device,
+    .map_page        = arm_iommu_map_page,
+    .unmap_page      = arm_iommu_unmap_page,
+    .add_device      = ipmmu_add_device,
+};
+
+/* R-Car Gen3 product and cut information. */
+#define RCAR_PRODUCT_MASK    0x00007F00
+#define RCAR_PRODUCT_H3      0x00004F00
+#define RCAR_PRODUCT_M3      0x00005200
+#define RCAR_PRODUCT_M3N     0x00005500
+#define RCAR_CUT_MASK        0x000000FF
+#define RCAR_CUT_VER30       0x00000020
+
+static __init bool ipmmu_stage2_supported(void)
+{
+    struct dt_device_node *np;
+    uint64_t addr, size;
+    void __iomem *base;
+    uint32_t product, cut;
+    static enum
+    {
+        UNKNOWN,
+        SUPPORTED,
+        NOTSUPPORTED
+    } stage2_supported = UNKNOWN;
+
+    /* Use the flag to avoid checking the compatibility more than once. */
+    switch ( stage2_supported )
+    {
+    case SUPPORTED:
+        return true;
+
+    case NOTSUPPORTED:
+        return false;
+
+    case UNKNOWN:
+    default:
+        stage2_supported = NOTSUPPORTED;
+        break;
+    }
+
+    np = dt_find_compatible_node(NULL, NULL, "renesas,prr");
+    if ( !np )
+    {
+        printk(XENLOG_ERR "ipmmu: Failed to find PRR node\n");
+        return false;
+    }
+
+    if ( dt_device_get_address(np, 0, &addr, &size) )
+    {
+        printk(XENLOG_ERR "ipmmu: Failed to get PRR MMIO\n");
+        return false;
+    }
+
+    base = ioremap_nocache(addr, size);
+    if ( !base )
+    {
+        printk(XENLOG_ERR "ipmmu: Failed to ioremap PRR MMIO\n");
+        return false;
+    }
+
+    product = readl(base);
+    cut = product & RCAR_CUT_MASK;
+    product &= RCAR_PRODUCT_MASK;
+
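+    /* H3/M3 support stage 2 only from cut VER30 (ES3.0) on; M3-N always does. */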
+    switch ( product )
+    {
+    case RCAR_PRODUCT_H3:
+    case RCAR_PRODUCT_M3:
+        if ( cut >= RCAR_CUT_VER30 )
+            stage2_supported = SUPPORTED;
+        break;
+
+    case RCAR_PRODUCT_M3N:
+        stage2_supported = SUPPORTED;
+        break;
+
+    default:
+        printk(XENLOG_ERR "ipmmu: Unsupported SoC version\n");
+        break;
+    }
+
+    iounmap(base);
+
+    return stage2_supported == SUPPORTED;
+}
+
+static const struct dt_device_match ipmmu_dt_match[] __initconst =
+{
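+    /* R-Car Gen3: H3 (r8a7795), M3-N (r8a77965) and M3 (r8a7796). */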
+    DT_MATCH_COMPATIBLE("renesas,ipmmu-r8a7795"),
+    DT_MATCH_COMPATIBLE("renesas,ipmmu-r8a77965"),
+    DT_MATCH_COMPATIBLE("renesas,ipmmu-r8a7796"),
+    { /* sentinel */ },
+};
+
+static __init int ipmmu_init(struct dt_device_node *node, const void *data)
+{
+    int ret;
+
+    /*
+     * Even if the device can't be initialized, we don't want to give
+     * the IPMMU device to dom0.
+     */
+    dt_device_set_used_by(node, DOMID_XEN);
+
+    if ( !iommu_hap_pt_share )
+    {
+        printk_once(XENLOG_ERR "ipmmu: P2M table must always be shared between the CPU and the IPMMU\n");
+        return -EINVAL;
+    }
+
+    if ( !ipmmu_stage2_supported() )
+    {
+        printk_once(XENLOG_ERR "ipmmu: P2M sharing is not supported in current SoC revision\n");
+        return -EOPNOTSUPP;
+    }
+    else
+    {
+        /*
+         * As a 4-level translation table is not supported by the IPMMU,
+         * we need to check the IPA size used for the P2M table beforehand
+         * to be sure it is 3-level and the IPMMU will be able to use it.
+         *
+         * TODO: First initialize the IOMMU and gather the requirements,
+         * then initialize the P2M. In the P2M code, take the IOMMU
+         * requirements into account and restrict "pa_range" if necessary.
+         */
+        if ( IPMMU_MAX_P2M_IPA_BITS < p2m_ipa_bits )
+        {
+            printk_once(XENLOG_ERR "ipmmu: P2M IPA size is not supported (P2M=%u IPMMU=%u)!\n",
+                        p2m_ipa_bits, IPMMU_MAX_P2M_IPA_BITS);
+            return -EOPNOTSUPP;
+        }
+    }
+
+    ret = ipmmu_probe(node);
+    if ( ret )
+    {
+        dev_err(&node->dev, "Failed to init IPMMU (%d)\n", ret);
+        return ret;
+    }
+
+    iommu_set_ops(&ipmmu_iommu_ops);
+
+    return 0;
+}
+
+DT_DEVICE_START(ipmmu, "Renesas IPMMU-VMSA", DEVICE_IOMMU)
+    .dt_match = ipmmu_dt_match,
+    .init = ipmmu_init,
+DT_DEVICE_END
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */