[RESEND,6/9] hw/arm/smmu-common: Manage IOTLB block entries

Message ID: 20200611161500.23580-7-eric.auger@redhat.com
State: New, archived
Series: SMMUv3.2 Range-based TLB Invalidation Support

Commit Message

Eric Auger June 11, 2020, 4:14 p.m. UTC
At the moment each entry in the IOTLB corresponds to a page-sized
mapping (4K, 16K or 64K), even if the page belongs to a mapped
block. In case of block mappings this inefficiently consumes IOTLB
entries.

Change the value of the entry so that it reflects the actual
mapping it belongs to (block or page start address and size).

The level/tg of the entry is also encoded in the key. Subsequent
patches will enable range invalidation, which is able to provide
the level/tg of the entry.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmu-internal.h       | 10 ++++++
 include/hw/arm/smmu-common.h |  8 +++--
 hw/arm/smmu-common.c         | 61 +++++++++++++++++++++++++++---------
 hw/arm/smmuv3.c              |  6 ++--
 hw/arm/trace-events          |  2 +-
 5 files changed, 66 insertions(+), 21 deletions(-)
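
For reference, the new shifts give the following 64-bit key layout
(level and tg each fit in two bits):

/*
 * bits [63:60]  unused
 * bits [59:58]  TG   (translation granule)
 * bits [57:56]  LVL  (lookup level)
 * bits [55:40]  ASID
 * bits [39:0]   IOVA >> 12
 */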

Comments

Peter Maydell June 25, 2020, 3:30 p.m. UTC | #1
On Thu, 11 Jun 2020 at 17:16, Eric Auger <eric.auger@redhat.com> wrote:
>
> At the moment each entry in the IOTLB corresponds to a page-sized
> mapping (4K, 16K or 64K), even if the page belongs to a mapped
> block. In case of block mappings this inefficiently consumes IOTLB
> entries.
>
> Change the value of the entry so that it reflects the actual
> mapping it belongs to (block or page start address and size).
>
> The level/tg of the entry is also encoded in the key. Subsequent
> patches will enable range invalidation, which is able to provide
> the level/tg of the entry.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>


> -uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova)
> +uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova,
> +                            uint8_t tg, uint8_t level)
>  {
> -    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT;
> +    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT |
> +           (uint64_t)(level) << SMMU_IOTLB_LEVEL_SHIFT |
> +           (uint64_t)(tg) << SMMU_IOTLB_TG_SHIFT;
>  }

>  SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
> -                                 hwaddr iova)
> +                                SMMUTransTableInfo *tt, hwaddr iova)
>  {
> -    uint64_t key = smmu_get_iotlb_key(cfg->asid, iova);
> -    SMMUTLBEntry *entry = g_hash_table_lookup(bs->iotlb, &key);
> +    uint8_t tg = (tt->granule_sz - 10) / 2;
> +    int level = tt->starting_level;
> +    SMMUTLBEntry *entry = NULL;
> +
> +    while (level <= 3) {
> +        uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz);
> +        uint64_t mask = subpage_size - 1;
> +        uint64_t key;
> +
> +        key = smmu_get_iotlb_key(cfg->asid, iova & ~mask, tg, level);
> +        entry = g_hash_table_lookup(bs->iotlb, &key);
> +        if (entry) {
> +            break;
> +        }
> +        level++;

Rather than looping around doing multiple hash table lookups like
this, why not just avoid including the tg and level in the
key equality test?

If I understand the range-based-invalidation feature correctly,
the only time we care about the TG/LVL is if we're processing
an invalidate-range command that specifies them. But in that
case there should never be multiple entries in the bs->iotlb
with the same iova, so we can just check whether the entry
matches the requested TG/LVL once we've pulled it out of the
hash table. (Or we could architecturally validly just blow
it away regardless of requested TG/LVL -- they are only hints,
not required-to-match.)

thanks
-- PMM
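
A minimal sketch of the alternative suggested above, assuming the key
went back to carrying only the ASID and IOVA (hypothetical helper, not
code from this series):

/* Hypothetical: key without the TG/LVL fields. */
static uint64_t smmu_get_iotlb_key_plain(uint16_t asid, uint64_t iova)
{
    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT;
}

/*
 * On an invalidation, look the entry up by ASID/IOVA alone; the TG/LVL
 * recorded in the SMMUTLBEntry could then be compared against the
 * command's hints -- or ignored entirely, since the hints are not
 * required to match.
 */
static void smmu_inv_one_page(SMMUState *bs, uint16_t asid, uint64_t iova)
{
    uint64_t key = smmu_get_iotlb_key_plain(asid, iova);

    g_hash_table_remove(bs->iotlb, &key);
}
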
Eric Auger June 26, 2020, 1:53 p.m. UTC | #2
Hi Peter,

On 6/25/20 5:30 PM, Peter Maydell wrote:
> On Thu, 11 Jun 2020 at 17:16, Eric Auger <eric.auger@redhat.com> wrote:
>>
>> At the moment each entry in the IOTLB corresponds to a page-sized
>> mapping (4K, 16K or 64K), even if the page belongs to a mapped
>> block. In case of block mappings this inefficiently consumes IOTLB
>> entries.
>>
>> Change the value of the entry so that it reflects the actual
>> mapping it belongs to (block or page start address and size).
>>
>> The level/tg of the entry is also encoded in the key. Subsequent
>> patches will enable range invalidation, which is able to provide
>> the level/tg of the entry.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> 
>> -uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova)
>> +uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova,
>> +                            uint8_t tg, uint8_t level)
>>  {
>> -    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT;
>> +    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT |
>> +           (uint64_t)(level) << SMMU_IOTLB_LEVEL_SHIFT |
>> +           (uint64_t)(tg) << SMMU_IOTLB_TG_SHIFT;
>>  }
> 
>>  SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
>> -                                 hwaddr iova)
>> +                                SMMUTransTableInfo *tt, hwaddr iova)
>>  {
>> -    uint64_t key = smmu_get_iotlb_key(cfg->asid, iova);
>> -    SMMUTLBEntry *entry = g_hash_table_lookup(bs->iotlb, &key);
>> +    uint8_t tg = (tt->granule_sz - 10) / 2;
>> +    int level = tt->starting_level;
>> +    SMMUTLBEntry *entry = NULL;
>> +
>> +    while (level <= 3) {
>> +        uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz);
>> +        uint64_t mask = subpage_size - 1;
>> +        uint64_t key;
>> +
>> +        key = smmu_get_iotlb_key(cfg->asid, iova & ~mask, tg, level);
>> +        entry = g_hash_table_lookup(bs->iotlb, &key);
>> +        if (entry) {
>> +            break;
>> +        }
>> +        level++;
> 
> Rather than looping around doing multiple hash table lookups like
> this, why not just avoid including the tg and level in the
> key equality test?
> 
> If I understand the range-based-invalidation feature correctly,
> the only time we care about the TG/LVL is if we're processing
> an invalidate-range command that specifies them. But in that
> case there should never be multiple entries in the bs->iotlb
> with the same iova, so we can just check whether the entry
> matches the requested TG/LVL once we've pulled it out of the
> hash table. (Or we could architecturally validly just blow
> it away regardless of requested TG/LVL -- they are only hints,
> not required-to-match.)

This change could have been done independently of the RIL feature. As we
now put block entries in the IOTLB, when we look up an IOVA translation,
the IOVA can be mapped using different block sizes or using page
entries. So we start looking at blocks of the largest size (the entry
level) down to the page size, for instance 4TB/512MB/64KB. We cannot
know in advance which block size the address belongs to. I do not know
whether we can assume the driver is forbidden to invalidate an address
that is not the starting address of the initial mapping.
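
To make those sizes concrete: assuming level_shift(l, gran) computes
gran + (3 - l) * (gran - 3), as in hw/arm/smmu-internal.h, a 64KB
granule (granule_sz = 16) gives:

/*
 * subpage_size = 1ULL << level_shift(level, granule_sz):
 *   level 1: 1ULL << 42  ->   4TB block
 *   level 2: 1ULL << 29  -> 512MB block
 *   level 3: 1ULL << 16  ->  64KB page
 * so the lookup loop probes at most three keys per IOVA.
 */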

Not a justification, but for information: this is implemented the same
way on x86 (except they don't have a variable TG), see
vtd_lookup_iotlb() in hw/i386/intel_iommu.c.

Thanks

Eric
> 
> thanks
> -- PMM
>
Eric Auger June 30, 2020, 3:46 p.m. UTC | #3
Hi Peter,

On 6/26/20 3:53 PM, Auger Eric wrote:
> Hi Peter,
> 
> On 6/25/20 5:30 PM, Peter Maydell wrote:
>> On Thu, 11 Jun 2020 at 17:16, Eric Auger <eric.auger@redhat.com> wrote:
>>>
>>> At the moment each entry in the IOTLB corresponds to a page-sized
>>> mapping (4K, 16K or 64K), even if the page belongs to a mapped
>>> block. In case of block mappings this inefficiently consumes IOTLB
>>> entries.
>>>
>>> Change the value of the entry so that it reflects the actual
>>> mapping it belongs to (block or page start address and size).
>>>
>>> The level/tg of the entry is also encoded in the key. Subsequent
>>> patches will enable range invalidation, which is able to provide
>>> the level/tg of the entry.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>>
>>> -uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova)
>>> +uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova,
>>> +                            uint8_t tg, uint8_t level)
>>>  {
>>> -    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT;
>>> +    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT |
>>> +           (uint64_t)(level) << SMMU_IOTLB_LEVEL_SHIFT |
>>> +           (uint64_t)(tg) << SMMU_IOTLB_TG_SHIFT;
>>>  }
>>
>>>  SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
>>> -                                 hwaddr iova)
>>> +                                SMMUTransTableInfo *tt, hwaddr iova)
>>>  {
>>> -    uint64_t key = smmu_get_iotlb_key(cfg->asid, iova);
>>> -    SMMUTLBEntry *entry = g_hash_table_lookup(bs->iotlb, &key);
>>> +    uint8_t tg = (tt->granule_sz - 10) / 2;
>>> +    int level = tt->starting_level;
>>> +    SMMUTLBEntry *entry = NULL;
>>> +
>>> +    while (level <= 3) {
>>> +        uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz);
>>> +        uint64_t mask = subpage_size - 1;
>>> +        uint64_t key;
>>> +
>>> +        key = smmu_get_iotlb_key(cfg->asid, iova & ~mask, tg, level);
>>> +        entry = g_hash_table_lookup(bs->iotlb, &key);
>>> +        if (entry) {
>>> +            break;
>>> +        }
>>> +        level++;
>>
>> Rather than looping around doing multiple hash table lookups like
>> this, why not just avoid including the tg and level in the
>> key equality test?
>>
>> If I understand the range-based-invalidation feature correctly,
>> the only time we care about the TG/LVL is if we're processing
>> an invalidate-range command that specifies them. But in that
>> case there should never be multiple entries in the bs->iotlb
>> with the same iova, so we can just check whether the entry
>> matches the requested TG/LVL once we've pulled it out of the
>> hash table. (Or we could architecturally validly just blow
>> it away regardless of requested TG/LVL -- they are only hints,
>> not required-to-match.)
> 
> This change could have been done independently of the RIL feature. As we
> now put block entries in the IOTLB, when we look up an IOVA translation,
> the IOVA can be mapped using different block sizes or using page
> entries. So we start looking at blocks of the largest size (the entry
> level) down to the page size, for instance 4TB/512MB/64KB. We cannot
> know in advance which block size the address belongs to. I do not know
> whether we can assume the driver is forbidden to invalidate an address
> that is not the starting address of the initial mapping.
> 
> Not a justification, but for information: this is implemented the same
> way on x86 (except they don't have a variable TG), see
> vtd_lookup_iotlb() in hw/i386/intel_iommu.c.

Does that make more sense overall? I will respin shortly.

Thanks

Eric
> 
> Thanks
> 
> Eric
>>
>> thanks
>> -- PMM
>>
> 
>
Peter Maydell June 30, 2020, 3:50 p.m. UTC | #4
On Fri, 26 Jun 2020 at 14:53, Auger Eric <eric.auger@redhat.com> wrote:
> On 6/25/20 5:30 PM, Peter Maydell wrote:
> > Rather than looping around doing multiple hash table lookups like
> > this, why not just avoid including the tg and level in the
> > key equality test?
> >
> > If I understand the range-based-invalidation feature correctly,
> > the only time we care about the TG/LVL is if we're processing
> > an invalidate-range command that specifies them. But in that
> > case there should never be multiple entries in the bs->iotlb
> > with the same iova, so we can just check whether the entry
> > matches the requested TG/LVL once we've pulled it out of the
> > hash table. (Or we could architecturally validly just blow
> > it away regardless of requested TG/LVL -- they are only hints,
> > not required-to-match.)
>
> This change could have been done independently of the RIL feature. As we
> now put block entries in the IOTLB, when we look up an IOVA translation,
> the IOVA can be mapped using different block sizes or using page
> entries. So we start looking at blocks of the largest size (the entry
> level) down to the page size, for instance 4TB/512MB/64KB. We cannot
> know in advance which block size the address belongs to.

Yes, but we wouldn't need to care which TG and LVL the
address belongs to if we didn't put them into
the key, would we? I'm probably missing something here, but
just because the hardware might want to use the TG and LVL
hints in the invalidation command doesn't inherently
mean QEMU is most efficient if it cares about the hints.

thanks
-- PMM
Eric Auger June 30, 2020, 4:29 p.m. UTC | #5
Hi Peter,

On 6/30/20 5:50 PM, Peter Maydell wrote:
> On Fri, 26 Jun 2020 at 14:53, Auger Eric <eric.auger@redhat.com> wrote:
>> On 6/25/20 5:30 PM, Peter Maydell wrote:
>>> Rather than looping around doing multiple hash table lookups like
>>> this, why not just avoid including the tg and level in the
>>> key equality test?
>>>
>>> If I understand the range-based-invalidation feature correctly,
>>> the only time we care about the TG/LVL is if we're processing
>>> an invalidate-range command that specifies them. But in that
>>> case there should never be multiple entries in the bs->iotlb
>>> with the same iova, so we can just check whether the entry
>>> matches the requested TG/LVL once we've pulled it out of the
>>> hash table. (Or we could architecturally validly just blow
>>> it away regardless of requested TG/LVL -- they are only hints,
>>> not required-to-match.)
>>
>> This change could have been done independently of the RIL feature. As we
>> now put block entries in the IOTLB, when we look up an IOVA translation,
>> the IOVA can be mapped using different block sizes or using page
>> entries. So we start looking at blocks of the largest size (the entry
>> level) down to the page size, for instance 4TB/512MB/64KB. We cannot
>> know in advance which block size the address belongs to.
> 
> Yes, but we wouldn't need to care which TG and LVL the
> address belongs to if we didn't put them into
> the key, would we? I'm probably missing something here, but
> just because the hardware might want to use the TG and LVL
> hints in the invalidation command doesn't inherently
> mean QEMU is most efficient if it cares about the hints.

OK, I think I understand your point now. It is not necessary to put
TG/LVL in the key as long as they are in the entry. I will look at this
implementation...

Thanks

Eric
> 
> thanks
> -- PMM
>
Eric Auger July 2, 2020, 2:39 p.m. UTC | #6
Hi Peter,

On 6/30/20 6:29 PM, Auger Eric wrote:
> Hi Peter,
> 
> On 6/30/20 5:50 PM, Peter Maydell wrote:
>> On Fri, 26 Jun 2020 at 14:53, Auger Eric <eric.auger@redhat.com> wrote:
>>> On 6/25/20 5:30 PM, Peter Maydell wrote:
>>>> Rather than looping around doing multiple hash table lookups like
>>>> this, why not just avoid including the tg and level in the
>>>> key equality test?
>>>>
>>>> If I understand the range-based-invalidation feature correctly,
>>>> the only time we care about the TG/LVL is if we're processing
>>>> an invalidate-range command that specifies them. But in that
>>>> case there should never be multiple entries in the bs->iotlb
>>>> with the same iova, so we can just check whether the entry
>>>> matches the requested TG/LVL once we've pulled it out of the
>>>> hash table. (Or we could architecturally validly just blow
>>>> it away regardless of requested TG/LVL -- they are only hints,
>>>> not required-to-match.)
>>>
>>> This change could have been done independently of the RIL feature. As we
>>> now put block entries in the IOTLB, when we look up an IOVA translation,
>>> the IOVA can be mapped using different block sizes or using page
>>> entries. So we start looking at blocks of the largest size (the entry
>>> level) down to the page size, for instance 4TB/512MB/64KB. We cannot
>>> know in advance which block size the address belongs to.
>>
>> Yes, but we wouldn't need to care which TG and LVL the
>> address belongs to if we didn't put them into
>> the key, would we? I'm probably missing something here, but
>> just because the hardware might want to use the TG and LVL
>> hints in the invalidation command doesn't inherently
>> mean QEMU is most efficient if it cares about the hints.
> 
> OK, I think I understand your point now. It is not necessary to put
> TG/LVL in the key as long as they are in the entry. I will look at this
> implementation...

Looking further at the implementation, if we don't encode the LVL in the
key (but just encode the block address), we are not able, on
invalidation, to remove the associated entry in one shot using
g_hash_table_remove(). We are obliged to use
g_hash_table_foreach_remove() and filter the entries according to the
invalidation parameters.

Putting the TG and LVL in the key allows us to do this in one shot when
num_pages == 1. See the smmu_iotlb_inv_iova() implementation in [8/9]
"hw/arm/smmuv3: Get prepared for range invalidation".

So my understanding is that it is feasible to get rid of TG/LVL in the
key, but it may be less optimal once range invalidation gets used by the
guest (most invalidations having num_pages == 1).
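
A sketch of the two removal paths just described, reusing names from
this patch (the actual [8/9] implementation may differ):

void smmu_iotlb_inv_iova_sketch(SMMUState *s, int asid, dma_addr_t iova,
                                uint8_t tg, uint64_t num_pages, uint8_t ttl)
{
    SMMUIOTLBPageInvInfo info = { .asid = asid, .iova = iova };

    if (ttl && num_pages == 1) {
        /* TG/LVL present in the key: one g_hash_table_remove() suffices. */
        uint64_t key = smmu_get_iotlb_key(asid, iova, tg, ttl);

        if (g_hash_table_remove(s->iotlb, &key)) {
            return;
        }
    }
    /* Otherwise walk the whole table, filtering entry by entry. */
    g_hash_table_foreach_remove(s->iotlb, smmu_hash_remove_by_asid_iova,
                                &info);
}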

Thanks

Eric
> 
> Thanks
> 
> Eric
>>
>> thanks
>> -- PMM
>>
> 
>

Patch

diff --git a/hw/arm/smmu-internal.h b/hw/arm/smmu-internal.h
index 2ecb6f1dc6..f60b48dc9b 100644
--- a/hw/arm/smmu-internal.h
+++ b/hw/arm/smmu-internal.h
@@ -97,7 +97,17 @@  uint64_t iova_level_offset(uint64_t iova, int inputsize,
 }
 
 #define SMMU_IOTLB_ASID_SHIFT  40
+#define SMMU_IOTLB_LEVEL_SHIFT 56
+#define SMMU_IOTLB_TG_SHIFT    58
 
 #define SMMU_IOTLB_ASID(key) (((key) >> SMMU_IOTLB_ASID_SHIFT) & 0xFFFF)
 #define SMMU_IOTLB_IOVA(key) (((key) & MAKE_64BIT_MASK(0, 40)) << 12)
+
+struct SMMUIOTLBPageInvInfo {
+    int asid;
+    uint64_t iova;
+    uint64_t mask;
+};
+typedef struct SMMUIOTLBPageInvInfo SMMUIOTLBPageInvInfo;
+
 #endif
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 676e53c086..34a13c7cd6 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -157,12 +157,14 @@  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid);
 
 #define SMMU_IOTLB_MAX_SIZE 256
 
-SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg, hwaddr iova);
+SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
+                                SMMUTransTableInfo *tt, hwaddr iova);
 void smmu_iotlb_insert(SMMUState *bs, SMMUTransCfg *cfg, SMMUTLBEntry *entry);
-uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova);
+uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova,
+                            uint8_t tg, uint8_t level);
 void smmu_iotlb_inv_all(SMMUState *s);
 void smmu_iotlb_inv_asid(SMMUState *s, uint16_t asid);
-void smmu_iotlb_inv_iova(SMMUState *s, uint16_t asid, dma_addr_t iova);
+void smmu_iotlb_inv_iova(SMMUState *s, int asid, dma_addr_t iova);
 
 /* Unmap the range of all the notifiers registered to any IOMMU mr */
 void smmu_inv_notifiers_all(SMMUState *s);
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index c2ed8346fb..27804fbb2d 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -42,16 +42,33 @@  static gboolean smmu_iotlb_key_equal(gconstpointer v1, gconstpointer v2)
     return *((const uint64_t *)v1) == *((const uint64_t *)v2);
 }
 
-uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova)
+uint64_t smmu_get_iotlb_key(uint16_t asid, uint64_t iova,
+                            uint8_t tg, uint8_t level)
 {
-    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT;
+    return iova >> 12 | (uint64_t)(asid) << SMMU_IOTLB_ASID_SHIFT |
+           (uint64_t)(level) << SMMU_IOTLB_LEVEL_SHIFT |
+           (uint64_t)(tg) << SMMU_IOTLB_TG_SHIFT;
 }
 
 SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
-                                 hwaddr iova)
+                                SMMUTransTableInfo *tt, hwaddr iova)
 {
-    uint64_t key = smmu_get_iotlb_key(cfg->asid, iova);
-    SMMUTLBEntry *entry = g_hash_table_lookup(bs->iotlb, &key);
+    uint8_t tg = (tt->granule_sz - 10) / 2;
+    int level = tt->starting_level;
+    SMMUTLBEntry *entry = NULL;
+
+    while (level <= 3) {
+        uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz);
+        uint64_t mask = subpage_size - 1;
+        uint64_t key;
+
+        key = smmu_get_iotlb_key(cfg->asid, iova & ~mask, tg, level);
+        entry = g_hash_table_lookup(bs->iotlb, &key);
+        if (entry) {
+            break;
+        }
+        level++;
+    }
 
     if (entry) {
         cfg->iotlb_hits++;
@@ -72,13 +89,14 @@  SMMUTLBEntry *smmu_iotlb_lookup(SMMUState *bs, SMMUTransCfg *cfg,
 void smmu_iotlb_insert(SMMUState *bs, SMMUTransCfg *cfg, SMMUTLBEntry *new)
 {
     uint64_t *key = g_new0(uint64_t, 1);
+    uint8_t tg = (new->granule - 10) / 2;
 
     if (g_hash_table_size(bs->iotlb) >= SMMU_IOTLB_MAX_SIZE) {
         smmu_iotlb_inv_all(bs);
     }
 
-    *key = smmu_get_iotlb_key(cfg->asid, new->entry.iova);
-    trace_smmu_iotlb_insert(cfg->asid, new->entry.iova);
+    *key = smmu_get_iotlb_key(cfg->asid, new->entry.iova, tg, new->level);
+    trace_smmu_iotlb_insert(cfg->asid, new->entry.iova, tg, new->level);
     g_hash_table_insert(bs->iotlb, key, new);
 }
 
@@ -97,12 +115,28 @@  static gboolean smmu_hash_remove_by_asid(gpointer key, gpointer value,
     return SMMU_IOTLB_ASID(*iotlb_key) == asid;
 }
 
-inline void smmu_iotlb_inv_iova(SMMUState *s, uint16_t asid, dma_addr_t iova)
+static gboolean smmu_hash_remove_by_asid_iova(gpointer key, gpointer value,
+                                         gpointer user_data)
 {
-    uint64_t key = smmu_get_iotlb_key(asid, iova);
+    SMMUTLBEntry *iter = (SMMUTLBEntry *)value;
+    IOMMUTLBEntry *entry = &iter->entry;
+    SMMUIOTLBPageInvInfo *info = (SMMUIOTLBPageInvInfo *)user_data;
+    uint64_t *iotlb_key = (uint64_t *)key;
+
+    if (info->asid >= 0) {
+        return (info->asid == SMMU_IOTLB_ASID(*iotlb_key)) &&
+                ((info->iova & ~entry->addr_mask) == entry->iova);
+    } else {
+        return (info->iova & ~entry->addr_mask) == entry->iova;
+    }
+}
+
+inline void smmu_iotlb_inv_iova(SMMUState *s, int asid, dma_addr_t iova)
+{
+    SMMUIOTLBPageInvInfo info = {.asid = asid, .iova = iova};
 
     trace_smmu_iotlb_inv_iova(asid, iova);
-    g_hash_table_remove(s->iotlb, &key);
+    g_hash_table_foreach_remove(s->iotlb, smmu_hash_remove_by_asid_iova, &info);
 }
 
 inline void smmu_iotlb_inv_asid(SMMUState *s, uint16_t asid)
@@ -229,9 +263,6 @@  static int smmu_ptw_64(SMMUTransCfg *cfg,
     baseaddr = extract64(tt->ttb, 0, 48);
     baseaddr &= ~indexmask;
 
-    tlbe->entry.iova = iova;
-    tlbe->entry.addr_mask = (1 << granule_sz) - 1;
-
     while (level <= 3) {
         uint64_t subpage_size = 1ULL << level_shift(level, granule_sz);
         uint64_t mask = subpage_size - 1;
@@ -281,7 +312,9 @@  static int smmu_ptw_64(SMMUTransCfg *cfg,
             goto error;
         }
 
-        tlbe->entry.translated_addr = gpa + (iova & mask);
+        tlbe->entry.translated_addr = gpa;
+        tlbe->entry.iova = iova & ~mask;
+        tlbe->entry.addr_mask = mask;
         tlbe->entry.perm = PTE_AP_TO_PERM(ap);
         tlbe->level = level;
         tlbe->granule = granule_sz;
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 12d3e972d6..931a2b6872 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -678,7 +678,7 @@  static IOMMUTLBEntry smmuv3_translate(IOMMUMemoryRegion *mr, hwaddr addr,
     page_mask = (1ULL << (tt->granule_sz)) - 1;
     aligned_addr = addr & ~page_mask;
 
-    cached_entry = smmu_iotlb_lookup(bs, cfg, aligned_addr);
+    cached_entry = smmu_iotlb_lookup(bs, cfg, tt, aligned_addr);
     if (cached_entry) {
         if ((flag & IOMMU_WO) && !(cached_entry->entry.perm & IOMMU_WO)) {
             status = SMMU_TRANS_ERROR;
@@ -748,7 +748,7 @@  epilogue:
     case SMMU_TRANS_SUCCESS:
         entry.perm = flag;
         entry.translated_addr = cached_entry->entry.translated_addr +
-                                    (addr & page_mask);
+                                    (addr & cached_entry->entry.addr_mask);
         entry.addr_mask = cached_entry->entry.addr_mask;
         trace_smmuv3_translate_success(mr->parent_obj.name, sid, addr,
                                        entry.translated_addr, entry.perm);
@@ -976,7 +976,7 @@  static int smmuv3_cmdq_consume(SMMUv3State *s)
 
             trace_smmuv3_cmdq_tlbi_nh_vaa(vmid, addr);
             smmuv3_inv_notifiers_iova(bs, -1, addr);
-            smmu_iotlb_inv_all(bs);
+            smmu_iotlb_inv_iova(bs, -1, addr);
             break;
         }
         case SMMU_CMD_TLBI_NH_VA:
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 7263b9c586..2d445c99b7 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -16,7 +16,7 @@  smmu_iotlb_inv_iova(uint16_t asid, uint64_t addr) "IOTLB invalidate asid=%d addr
 smmu_inv_notifiers_mr(const char *name) "iommu mr=%s"
 smmu_iotlb_lookup_hit(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache HIT asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
 smmu_iotlb_lookup_miss(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache MISS asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
-smmu_iotlb_insert(uint16_t asid, uint64_t addr) "IOTLB ++ asid=%d addr=0x%"PRIx64
+smmu_iotlb_insert(uint16_t asid, uint64_t addr, uint8_t tg, uint8_t level) "IOTLB ++ asid=%d addr=0x%"PRIx64" tg=%d level=%d"
 
 # smmuv3.c
 smmuv3_read_mmio(uint64_t addr, uint64_t val, unsigned size, uint32_t r) "addr: 0x%"PRIx64" val:0x%"PRIx64" size: 0x%x(%d)"
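
As a worked example of the smmuv3.c epilogue change above (hypothetical
numbers): a 2MB block entry (4K granule, level 2) is cached as

/*
 *   entry.iova            = 0x40200000
 *   entry.addr_mask       = 0x001FFFFF
 *   entry.translated_addr = 0x80000000
 *
 * A DMA access at addr = 0x402ABCDE now resolves to
 *   0x80000000 + (0x402ABCDE & 0x001FFFFF) = 0x800ABCDE
 * where, before this patch, covering the same block would have taken
 * 512 separate 4K IOTLB entries.
 */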