
[v3,04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Message ID 4-v3-d794f8d934da+411a-smmuv3_newapi_p1_jgg@nvidia.com (mailing list archive)
State New, archived
Series Update SMMUv3 to the modern iommu API (part 1/3)

Commit Message

Jason Gunthorpe Dec. 5, 2023, 7:14 p.m. UTC
As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than currently.

Thus, consolidate all the complexity here. Callers do not have to care
about what STE transition they are doing; this function will handle
everything optimally.

Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
required programming sequence to avoid creating an incoherent 'torn' STE
in the HW caches. The update algorithm follows the same design that the
driver already uses: it is safe to change bits that HW doesn't currently
use and then do a single 64 bit update, with syncs in between.

The basic idea is to express in a bitmask what bits the HW is actually
using based on the V and CFG bits. Based on that mask we know what STE
changes are safe and which are disruptive. We can count how many 64 bit
QWORDS need a disruptive update and know if a step with V=0 is required.

This gives two basic flows through the algorithm.

If only a single 64 bit quantity needs disruptive replacement:
 - Write the target value into all currently unused bits
 - Write the single 64 bit quantity
 - Zero the remaining different bits

If multiple 64 bit quantities need disruptive replacement then do:
 - Write V=0 to QWORD 0
 - Write the entire STE except QWORD 0
 - Write QWORD 0

HW flushes are issued at each step, and a step can be skipped if the STE
did not change in that step.
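
As an illustration only (not part of the patch): stripped of the kernel
__le64 types and the WRITE_ONCE()/sync machinery, the classification that
selects between the two flows amounts to counting the qwords that still
differ in bits the target configuration will use:

#include <stdint.h>

/* Sketch: count the qwords that cannot reach their target value without
 * the HW possibly observing a mix of old and new bits that it reads. */
static unsigned int num_disruptive_qwords(const uint64_t *cur,
					  const uint64_t *cur_used,
					  const uint64_t *target,
					  const uint64_t *target_used,
					  unsigned int len)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i != len; i++) {
		/* Bits the HW currently ignores may take their target value for free */
		uint64_t step = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);

		/* Still wrong in a bit the target configuration will use? */
		if ((step & target_used[i]) != (target[i] & target_used[i]))
			count++;
	}
	return count;	/* 0: no disruptive store, 1: single 64 bit store, >1: V=0 needed */
}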

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT if a hitful (non-hitless) change is required. This seems more
appropriate: ABORT fails DMAs without any logging, whereas a DMA dropped
during a transient V=0 probably signals a bug, so the C_BAD_STE event it
generates is valuable.

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 272 +++++++++++++++-----
 1 file changed, 208 insertions(+), 64 deletions(-)

Comments

Will Deacon Dec. 12, 2023, 4:23 p.m. UTC | #1
Hi Jason,

On Tue, Dec 05, 2023 at 03:14:36PM -0400, Jason Gunthorpe wrote:
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
> 
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
> 
> Thus, consolidate all the complexity here. Callers do not have to care
> about what STE transition they are doing, this function will handle
> everything optimally.
> 
> Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
> required programming sequence to avoid creating an incoherent 'torn' STE
> in the HW caches. The update algorithm follows the same design that the
> driver already uses: it is safe to change bits that HW doesn't currently
> use and then do a single 64 bit update, with sync's in between.
> 
> The basic idea is to express in a bitmask what bits the HW is actually
> using based on the V and CFG bits. Based on that mask we know what STE
> changes are safe and which are disruptive. We can count how many 64 bit
> QWORDS need a disruptive update and know if a step with V=0 is required.
> 
> This gives two basic flows through the algorithm.
> 
> If only a single 64 bit quantity needs disruptive replacement:
>  - Write the target value into all currently unused bits
>  - Write the single 64 bit quantity
>  - Zero the remaining different bits
> 
> If multiple 64 bit quantities need disruptive replacement then do:
>  - Write V=0 to QWORD 0
>  - Write the entire STE except QWORD 0
>  - Write QWORD 0
> 
> With HW flushes at each step, that can be skipped if the STE didn't change
> in that step.
> 
> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).

This is certainly very clever, but at the same time I can't help but feel
that it's slightly over-engineered to solve the general case, whereas I'm
struggling to see why such a level of complexity is necessary.

In the comment, you say:

> + * In the most general case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused bits, all bits except V
> + *  - Make valid (V=1), single 64 bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process.

Please can you spell out these "interesting cases"? For cases where we're
changing CONFIG, I'd have thought it would be perfectly fine to go via an
invalid STE. What am I missing?

Generally, I like where the later patches in the series take things, but
I'd really like to reduce the complexity of the strtab updating code to
what is absolutely required.

I've left some minor comments on the code below, but I'd really like to
see this whole thing simplified if possible.

> + */
> +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> +				      const __le64 *target,
> +				      const __le64 *target_used, __le64 *step,
> +				      __le64 v_bit,
> +				      unsigned int len)
> +{
> +	u8 step_used_diff = 0;
> +	u8 step_change = 0;
> +	unsigned int i;
> +
> +	/*
> +	 * Compute a step that has all the bits currently unused by HW set to
> +	 * their target values.
> +	 */

Well, ok, I do have a cosmetic nit here: using 'step' for both "STE pointer"
and "incremental change" is perhaps, err, a step too far ;)

> +	for (i = 0; i != len; i++) {
> +		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);

Isn't 'cur[i] & cur_used[i]' always cur[i]?

> +		if (cur[i] != step[i])
> +			step_change |= 1 << i;
> +		/*
> +		 * Each bit indicates if the step is incorrect compared to the
> +		 * target, considering only the used bits in the target
> +		 */
> +		if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
> +			step_used_diff |= 1 << i;
> +	}
> +
> +	if (hweight8(step_used_diff) > 1) {
> +		/*
> +		 * More than 1 qword is mismatched, this cannot be done without
> +		 * a break. Clear the V bit and go again.
> +		 */
> +		step[0] &= ~v_bit;
> +	} else if (!step_change && step_used_diff) {
> +		/*
> +		 * Have exactly one critical qword, all the other qwords are set
> +		 * correctly, so we can set this qword now.
> +		 */
> +		i = ffs(step_used_diff) - 1;
> +		step[i] = target[i];
> +	} else if (!step_change) {
> +		/* cur == target, so all done */
> +		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
> +			return true;
> +
> +		/*
> +		 * All the used HW bits match, but unused bits are different.
> +		 * Set them as well. Technically this isn't necessary but it
> +		 * brings the entry to the full target state, so if there are
> +		 * bugs in the mask calculation this will obscure them.
> +		 */
> +		memcpy(step, target, len * sizeof(*step));

Bah, I'm not a huge fan of this sort of defensive programming. I'd prefer
to propagate the error rather than quietly try to cover it up.

> +/*
> + * Based on the value of ent report which bits of the STE the HW will access. It
> + * would be nice if this was complete according to the spec, but minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
> +				  struct arm_smmu_ste *used_bits)
> +{
> +	memset(used_bits, 0, sizeof(*used_bits));
> +
> +	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> +	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +		return;
> +
> +	/*
> +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +	 * attribute/permission configuration fields for the SHCFG behavior.
> +	 */
> +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0])) & 1 &&
> +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> +		    STRTAB_STE_1_S1DSS_BYPASS)
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> +	case STRTAB_STE_0_CFG_ABORT:
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +		break;
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +						  STRTAB_STE_0_S1CTXPTR_MASK |
> +						  STRTAB_STE_0_S1CDMAX);
> +		used_bits->data[1] |=
> +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +		break;
> +	case STRTAB_STE_0_CFG_S2_TRANS:
> +		used_bits->data[1] |=
> +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +		used_bits->data[2] |=
> +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +		break;

I think this is going to be a pain to maintain :/

> +static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
> +				    const struct arm_smmu_ste *target,
> +				    const struct arm_smmu_ste *target_used)
> +{
> +	struct arm_smmu_ste cur_used;
> +	struct arm_smmu_ste step;
> +
> +	arm_smmu_get_ste_used(cur, &cur_used);
> +	return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
> +					 target_used->data, step.data,
> +					 cpu_to_le64(STRTAB_STE_0_V),
> +					 ARRAY_SIZE(cur->data));
> +}
> +
> +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> +			       struct arm_smmu_ste *ste,
> +			       const struct arm_smmu_ste *target)
> +{
> +	struct arm_smmu_ste target_used;
> +	int i;
> +
> +	arm_smmu_get_ste_used(target, &target_used);
> +	/* Masks in arm_smmu_get_ste_used() are up to date */
> +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);

That's a runtime cost on every single STE update for what would be a driver
bug.

> +
> +	while (true) {
> +		if (arm_smmu_write_ste_step(ste, target, &target_used))
> +			break;
> +		arm_smmu_sync_ste_for_sid(smmu, sid);
> +	}

This really should be bounded...

Will
Jason Gunthorpe Dec. 12, 2023, 6:04 p.m. UTC | #2
On Tue, Dec 12, 2023 at 04:23:53PM +0000, Will Deacon wrote:
> > At this point it generates the same sequence of updates as the current
> > code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> > extra sync (this seems to be an existing bug).
> 
> This is certainly very clever, but at the same time I can't help but feel
> that it's slightly over-engineered to solve the general case, whereas I'm
> struggling to see why such a level of complexity is necessary.

I'm open to any alternative that accomplishes the same outcome of
decoupling the STE programming from the STE state and adding the API
required hitless transitions.

This is a really important part of all of this work.

IMHO this is inherently complex. The SMMU spec devotes several pages
to describing how to do tearless updates. However, that text does not
even consider the need for hitless updates. It only contemplates the
V=0 flow.

Here we are very much adding hitless and intending to do so.

> In the comment, you say:
> 
> > + * In the most general case we can make any update in three steps:
> > + *  - Disrupting the entry (V=0)
> > + *  - Fill now unused bits, all bits except V
> > + *  - Make valid (V=1), single 64 bit store
> > + *
> > + * However this disrupts the HW while it is happening. There are several
> > + * interesting cases where a STE/CD can be updated without disturbing the HW
> > + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> > + * because the used bits don't intersect. We can detect this by calculating how
> > + * many 64 bit values need update after adjusting the unused bits and skip the
> > + * V=0 process.
> 
> Please can you spell out these "interesting cases"? For cases where we're
> changing CONFIG, I'd have thought it would be perfectly fine to go via an
> invalid STE. What am I missing?

The modern IOMMU driver API requires hitless transitions in a bunch of
cases. I think I explained it in an email to Michael:

https://lore.kernel.org/linux-iommu/20231012121616.GF3952@nvidia.com/

 - IDENTITY -> DMA -> IDENTITY hitless with RESV_DIRECT
 - STE -> S1DSS -> STE hitless (PASID upgrade)
 - S1 -> BLOCKING -> S1 with active PASID hitless (iommufd case)
 - NESTING -> NESTING (eg to change S1DSS, change CD table pointers, etc)
 - CD ASID change hitless (BTM S1 replacement)
 - CD quiet_cd hitless (SVA mm release)

I will add that list to the commit message, the cover letter talks
about it a bit too.

To properly implement the API the driver has to support this.
Rather than try to just target those few cases (and hope I even know
all the needed cases) I just did everything.

"Everything" was a reaction to the complexity, at the end of part 3 we
have 7 different categories of STEs:
 - IDENTITY
 - BLOCKING
 - S2 domain
 - IDENTITY w/ PASID
 - BLOCKING w/ PASID
 - S1 domain
 - S1 domain nested on S2 domain

(plus a bunch of those have ATS and !ATS variations, and the nesting
has its own sub types..)

Which is 42 different permutations of transitions (7 states, each able to
transition to any of the other 6), plus the CD stuff. Even if we want to
just special case the sequences above we
still have to identify them and open code the arcs. Then the next
person along has to understand why those specific arcs are special
which is a complex and subtle argument based on the spec saying the HW
does not read certain bits.

Finally, there is a VM emulation reason - we don't know what the VM
will do so it may be assuming hitless changes. To actually correctly
emulate this we do need this general approach. Otherwise we have a
weird emulation gap where STE sequences generated by a VM that should
be hitless on real HW (eg the ones above, perhaps) will not
work. Indeed a VM running kernels with all three parts of this will be
doing and expecting hitless STE updates!

While this is perhaps not a real-world functional concern, it falls
under the usual rubric of accurate VM emulation.

So, to my mind ignoring the general case and just doing V=0 all the
time is not OK.

> Generally, I like where the later patches in the series take things, but
> I'd really like to reduce the complexity of the strtab updating code to
> what is absolutely required.

This is the sort of thing where the algorithm in
arm_smmu_write_entry_step() is perhaps complex, but the ongoing
support of it is not. Just update the used bits function when new STE
features are (rarely) introduced.

> I've left some minor comments on the code below, but I'd really like to
> see this whole thing simplified if possible.

I have no particularly good ideas on this front, sorry. I did try
several other things. Many thoughts were so difficult I couldn't even
write them down correctly :(

> > + */
> > +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> > +				      const __le64 *target,
> > +				      const __le64 *target_used, __le64 *step,
> > +				      __le64 v_bit,
> > +				      unsigned int len)
> > +{
> > +	u8 step_used_diff = 0;
> > +	u8 step_change = 0;
> > +	unsigned int i;
> > +
> > +	/*
> > +	 * Compute a step that has all the bits currently unused by HW set to
> > +	 * their target values.
> > +	 */
> 
> Well, ok, I do have a cosmetic nit here: using 'step' for both "STE pointer"
> and "incremental change" is perhaps, err, a step too far ;)

Ok.. I'll call it arm_smmu_write_entry_next()

> > +	for (i = 0; i != len; i++) {
> > +		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
> 
> Isn't 'cur[i] & cur_used[i]' always cur[i]?

No. The routine is called iteratively as the comment explained. The
first iteration may set unused bits to their target values, so we have
to mask them away again here during the second iteration. Later
iterations may change the config, for example, which means we will again
have unused values left over from the prior config.
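
For example, here is roughly how a BYPASS -> S1 attach (with S1DSS=SSID0,
as this driver programs it) walks through the calls:

 - Call 1: under BYPASS only V, CFG and SHCFG are used, so the S1 context
   pointer in qword 0 and the stage 1 fields in qword 1 are filled in as
   unused bits while CFG stays BYPASS. Sync.
 - Call 2: only CFG now differs in target-used bits, so qword 0 is written
   as a single 64 bit store that flips CFG to S1. Sync.
 - Call 3: cur_used is recomputed for the S1 config, so SHCFG is no longer
   used; 'cur[i] & cur_used[i]' masks away the stale SHCFG_INCOMING value
   left from the BYPASS config and it is zeroed to match the target. Sync,
   and the next call sees cur == target and stops.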

> > +	} else if (!step_change) {
> > +		/* cur == target, so all done */
> > +		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
> > +			return true;
> > +
> > +		/*
> > +		 * All the used HW bits match, but unused bits are different.
> > +		 * Set them as well. Technically this isn't necessary but it
> > +		 * brings the entry to the full target state, so if there are
> > +		 * bugs in the mask calculation this will obscure them.
> > +		 */
> > +		memcpy(step, target, len * sizeof(*step));
> 
> Bah, I'm not a huge fan of this sort of defensive programming. I'd prefer
> to propagate the error rather than quietly try to cover it up.

There is no error.

This is adjusting the unused bits that were left over from the prior
config to 0; it is a similar answer to the 'cur[i]' question above.

The defensiveness is only a decision that the installed STE should
fully match the given target STE.

In principle the HW doesn't read unused bits in the target STE so we
could leave them set to 1 instead of fully matching.

"bugs in the mask calculation" means everything from the "HW should
not look at bit X but does" to "the spec was misread and the HW does
look at bit X"

> > +/*
> > + * Based on the value of ent report which bits of the STE the HW will access. It
> > + * would be nice if this was complete according to the spec, but minimally it
> > + * has to capture the bits this driver uses.
> > + */
> > +static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
> > +				  struct arm_smmu_ste *used_bits)
> > +{
> > +	memset(used_bits, 0, sizeof(*used_bits));
> > +
> > +	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> > +	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > +		return;
> > +
> > +	/*
> > +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> > +	 * attribute/permission configuration fields for the SHCFG behavior.
> > +	 */
> > +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0])) & 1 &&
> > +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> > +		    STRTAB_STE_1_S1DSS_BYPASS)
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > +
> > +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> > +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> > +	case STRTAB_STE_0_CFG_ABORT:
> > +		break;
> > +	case STRTAB_STE_0_CFG_BYPASS:
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > +		break;
> > +	case STRTAB_STE_0_CFG_S1_TRANS:
> > +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> > +						  STRTAB_STE_0_S1CTXPTR_MASK |
> > +						  STRTAB_STE_0_S1CDMAX);
> > +		used_bits->data[1] |=
> > +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> > +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> > +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > +		break;
> > +	case STRTAB_STE_0_CFG_S2_TRANS:
> > +		used_bits->data[1] |=
> > +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> > +		used_bits->data[2] |=
> > +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> > +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> > +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> > +		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> > +		break;
> 
> I think this is going to be a pain to maintain :/

Ah, but not your pain :)

Some day someone adds a new STE bit; they don't understand this, so
they just change one of the make functions.

They test and hit the WARN_ON, which brings them here. Then they
hopefully realize they need to read the spec and understand the new
bit so they do that and make a try at the right conditions.

Reviewer sees the new hunk in this function and double-checks that the
conditions for the new bit are correct. Reviewer needs to consider if
any hitless flows become broken (and I've considered adding an
explicit self test for this)

The hard work of reviewing the spec and deciding how a new STE bit is
processed cannot be skipped, but we can make it harder for that skipped
work to go unnoticed. I think the current design makes skipping this
work too easy.

So where you are calling it a "pain to maintain", I'm calling it
"enforced rigor" going forward. Like type safety.

> > +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> > +			       struct arm_smmu_ste *ste,
> > +			       const struct arm_smmu_ste *target)
> > +{
> > +	struct arm_smmu_ste target_used;
> > +	int i;
> > +
> > +	arm_smmu_get_ste_used(target, &target_used);
> > +	/* Masks in arm_smmu_get_ste_used() are up to date */
> > +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> > +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> 
> That's a runtime cost on every single STE update for what would be a driver
> bug.

See above. STE programming is not critical path, and I think this
series makes it faster overall anyhow.

> > +	while (true) {
> > +		if (arm_smmu_write_ste_step(ste, target, &target_used))
> > +			break;
> > +		arm_smmu_sync_ste_for_sid(smmu, sid);
> > +	}
> 
> This really should be bounded...

Sure, I think the bound is 4 but I will check.
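
A bounded variant could look roughly like this (taking four as the assumed
worst case, still to be confirmed):

	for (i = 0; i != 4; i++) {
		if (arm_smmu_write_ste_step(ste, target, &target_used))
			break;
		arm_smmu_sync_ste_for_sid(smmu, sid);
	}
	/* The step computation is expected to converge within the bound */
	WARN_ON_ONCE(i == 4);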

Thanks,
Jason
Mostafa Saleh Jan. 29, 2024, 7:10 p.m. UTC | #3
Hi Jason,

On Tue, Dec 05, 2023 at 03:14:36PM -0400, Jason Gunthorpe wrote:
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
> 
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
> 
> Thus, consolidate all the complexity here. Callers do not have to care
> about what STE transition they are doing, this function will handle
> everything optimally.
> 
> Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
> required programming sequence to avoid creating an incoherent 'torn' STE
> in the HW caches. The update algorithm follows the same design that the
> driver already uses: it is safe to change bits that HW doesn't currently
> use and then do a single 64 bit update, with sync's in between.
> 
> The basic idea is to express in a bitmask what bits the HW is actually
> using based on the V and CFG bits. Based on that mask we know what STE
> changes are safe and which are disruptive. We can count how many 64 bit
> QWORDS need a disruptive update and know if a step with V=0 is required.
> 
> This gives two basic flows through the algorithm.
> 
> If only a single 64 bit quantity needs disruptive replacement:
>  - Write the target value into all currently unused bits
>  - Write the single 64 bit quantity
>  - Zero the remaining different bits
> 
> If multiple 64 bit quantities need disruptive replacement then do:
>  - Write V=0 to QWORD 0
>  - Write the entire STE except QWORD 0
>  - Write QWORD 0
> 
> With HW flushes at each step, that can be skipped if the STE didn't change
> in that step.
> 
> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).
> 
> Going forward this will use a V=0 transition instead of cycling through
> ABORT if a hitfull change is required. This seems more appropriate as ABORT
> will fail DMAs without any logging, but dropping a DMA due to transient
> V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
Would the driver do anything in that case, or would it just print the log message?

> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 272 +++++++++++++++-----
>  1 file changed, 208 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index b120d836681c1c..0934f882b94e94 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -971,6 +971,101 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
>  	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>  
> +/*
> + * This algorithm updates any STE/CD to any value without creating a situation
> + * where the HW can perceive a corrupted entry. HW is only required to have a 64
> + * bit atomicity with stores from the CPU, while entries are many 64 bit values
> + * big.
> + *
> + * The algorithm works by evolving the entry toward the target in a series of
> + * steps. Each step synchronizes with the HW so that the HW can not see an entry
> + * torn across two steps. Upon each call cur/cur_used reflect the current
> + * synchronized value seen by the HW.
> + *
> + * During each step the HW can observe a torn entry that has any combination of
> + * the step's old/new 64 bit words. The algorithm objective is for the HW
> + * behavior to always be one of current behavior, V=0, or new behavior, during
> + * each step, and across all steps.
> + *
> + * At each step one of three actions is chosen to evolve cur to target:
> + *  - Update all unused bits with their target values.
> + *    This relies on the IGNORED behavior described in the specification
> + *  - Update a single 64-bit value
> + *  - Update all unused bits and set V=0
> + *
> + * The last two actions will cause cur_used to change, which will then allow the
> + * first action on the next step.
> + *
> + * In the most general case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused bits, all bits except V
> + *  - Make valid (V=1), single 64 bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process.
> + */
> +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> +				      const __le64 *target,
> +				      const __le64 *target_used, __le64 *step,
> +				      __le64 v_bit,
I think this is confusing here. I believe we have this as an argument because
this function will be used for the CD later; however, for this series it is
unnecessary. IMHO, this should be removed and added in another patch as part
of the CD rework.

> +				      unsigned int len)
> +{
> +	u8 step_used_diff = 0;
> +	u8 step_change = 0;
> +	unsigned int i;
> +
> +	/*
> +	 * Compute a step that has all the bits currently unused by HW set to
> +	 * their target values.
> +	 */
> +	for (i = 0; i != len; i++) {
> +		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
> +		if (cur[i] != step[i])
> +			step_change |= 1 << i;
> +		/*
> +		 * Each bit indicates if the step is incorrect compared to the
> +		 * target, considering only the used bits in the target
> +		 */
> +		if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
> +			step_used_diff |= 1 << i;
> +	}
> +
> +	if (hweight8(step_used_diff) > 1) {
> +		/*
> +		 * More than 1 qword is mismatched, this cannot be done without
> +		 * a break. Clear the V bit and go again.
> +		 */
> +		step[0] &= ~v_bit;
> +	} else if (!step_change && step_used_diff) {
> +		/*
> +		 * Have exactly one critical qword, all the other qwords are set
> +		 * correctly, so we can set this qword now.
> +		 */
> +		i = ffs(step_used_diff) - 1;
> +		step[i] = target[i];
> +	} else if (!step_change) {
> +		/* cur == target, so all done */
> +		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
> +			return true;
> +
> +		/*
> +		 * All the used HW bits match, but unused bits are different.
> +		 * Set them as well. Technically this isn't necessary but it
> +		 * brings the entry to the full target state, so if there are
> +		 * bugs in the mask calculation this will obscure them.
> +		 */
> +		memcpy(step, target, len * sizeof(*step));
> +	}
> +
> +	for (i = 0; i != len; i++)
> +		WRITE_ONCE(cur[i], step[i]);
> +	return false;
> +}
> +
>  static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>  			     int ssid, bool leaf)
>  {
> @@ -1248,37 +1343,115 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
>  	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>  
> +/*
> + * Based on the value of ent report which bits of the STE the HW will access. It
> + * would be nice if this was complete according to the spec, but minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
> +				  struct arm_smmu_ste *used_bits)
> +{
> +	memset(used_bits, 0, sizeof(*used_bits));
> +
> +	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> +	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +		return;
> +
> +	/*
> +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +	 * attribute/permission configuration fields for the SHCFG behavior.
> +	 */
> +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0])) & 1 &&
> +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> +		    STRTAB_STE_1_S1DSS_BYPASS)
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> +	case STRTAB_STE_0_CFG_ABORT:
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +		break;
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +						  STRTAB_STE_0_S1CTXPTR_MASK |
> +						  STRTAB_STE_0_S1CDMAX);
> +		used_bits->data[1] |=
> +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +		break;
AFAIU, this is missing something like (while passing smmu->features)
	used_bits->data[2] |= features & ARM_SMMU_FEAT_NESTING ?
			      cpu_to_le64(STRTAB_STE_2_S2VMID) : 0;

As the SMMUv3 manual says:
“ For a Non-secure STE when stage 2 is implemented (SMMU_IDR0.S2P == 1)
  translations resulting from a StreamWorld == NS-EL1 configuration are
  VMID-tagged with S2VMID when either of stage 1 (Config[0] == 1) or stage 2
  (Config[1] == 1) provide translation.“

Which means that in the case of an S1=>S2 switch (or vice versa) this
algorithm will ignore the VMID even though it is in use.

> +	case STRTAB_STE_0_CFG_S2_TRANS:
> +		used_bits->data[1] |=
> +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +		used_bits->data[2] |=
> +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +		break;
> +
> +	default:
> +		memset(used_bits, 0xFF, sizeof(*used_bits));
> +		WARN_ON(true);
> +	}
> +}
> +
> +static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
> +				    const struct arm_smmu_ste *target,
> +				    const struct arm_smmu_ste *target_used)
> +{
> +	struct arm_smmu_ste cur_used;
> +	struct arm_smmu_ste step;
> +
> +	arm_smmu_get_ste_used(cur, &cur_used);
> +	return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
> +					 target_used->data, step.data,
> +					 cpu_to_le64(STRTAB_STE_0_V),
> +					 ARRAY_SIZE(cur->data));
> +}
> +
> +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> +			       struct arm_smmu_ste *ste,
> +			       const struct arm_smmu_ste *target)
> +{
> +	struct arm_smmu_ste target_used;
> +	int i;
> +
> +	arm_smmu_get_ste_used(target, &target_used);
> +	/* Masks in arm_smmu_get_ste_used() are up to date */
> +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
In what situation would this be triggered? Is that for future proofing?
Maybe we can move it to arm_smmu_get_ste_used()?

> +
> +	while (true) {
> +		if (arm_smmu_write_ste_step(ste, target, &target_used))
> +			break;
> +		arm_smmu_sync_ste_for_sid(smmu, sid);
> +	}
> +
> +	/* It's likely that we'll want to use the new STE soon */
> +	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +		struct arm_smmu_cmdq_ent
> +			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +					 .prefetch = {
> +						 .sid = sid,
> +					 } };
> +
> +		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	}
> +}
> +
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  				      struct arm_smmu_ste *dst)
>  {
> -	/*
> -	 * This is hideously complicated, but we only really care about
> -	 * three cases at the moment:
> -	 *
> -	 * 1. Invalid (all zero) -> bypass/fault (init)
> -	 * 2. Bypass/fault -> translation/bypass (attach)
> -	 * 3. Translation/bypass -> bypass/fault (detach)
> -	 *
> -	 * Given that we can't update the STE atomically and the SMMU
> -	 * doesn't read the thing in a defined order, that leaves us
> -	 * with the following maintenance requirements:
> -	 *
> -	 * 1. Update Config, return (init time STEs aren't live)
> -	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -	 * 3. Update Config, sync
> -	 */
> -	u64 val = le64_to_cpu(dst->data[0]);
> -	bool ste_live = false;
> +	u64 val;
>  	struct arm_smmu_device *smmu = master->smmu;
>  	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>  	struct arm_smmu_s2_cfg *s2_cfg = NULL;
>  	struct arm_smmu_domain *smmu_domain = master->domain;
> -	struct arm_smmu_cmdq_ent prefetch_cmd = {
> -		.opcode		= CMDQ_OP_PREFETCH_CFG,
> -		.prefetch	= {
> -			.sid	= sid,
> -		},
> -	};
> +	struct arm_smmu_ste target = {};
>  
>  	if (smmu_domain) {
>  		switch (smmu_domain->stage) {
> @@ -1293,22 +1466,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		}
>  	}
>  
> -	if (val & STRTAB_STE_0_V) {
> -		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -		case STRTAB_STE_0_CFG_BYPASS:
> -			break;
> -		case STRTAB_STE_0_CFG_S1_TRANS:
> -		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> -			break;
> -		case STRTAB_STE_0_CFG_ABORT:
> -			BUG_ON(!disable_bypass);
> -			break;
> -		default:
> -			BUG(); /* STE corruption */
> -		}
> -	}
> -
>  	/* Nuke the existing STE_0 value, as we're going to rewrite it */
>  	val = STRTAB_STE_0_V;
>  
> @@ -1319,16 +1476,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		else
>  			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>  
> -		dst->data[0] = cpu_to_le64(val);
> -		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +		target.data[0] = cpu_to_le64(val);
> +		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>  						STRTAB_STE_1_SHCFG_INCOMING));
> -		dst->data[2] = 0; /* Nuke the VMID */
> -		/*
> -		 * The SMMU can perform negative caching, so we must sync
> -		 * the STE regardless of whether the old value was live.
> -		 */
> -		if (smmu)
> -			arm_smmu_sync_ste_for_sid(smmu, sid);
> +		target.data[2] = 0; /* Nuke the VMID */
> +		arm_smmu_write_ste(smmu, sid, dst, &target);
>  		return;
>  	}
>  
> @@ -1336,8 +1488,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>  			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
>  
> -		BUG_ON(ste_live);
> -		dst->data[1] = cpu_to_le64(
> +		target.data[1] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>  			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>  			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1346,7 +1497,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  
>  		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>  		    !master->stall_enabled)
> -			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>  
>  		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>  			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1355,8 +1506,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  	}
>  
>  	if (s2_cfg) {
> -		BUG_ON(ste_live);
> -		dst->data[2] = cpu_to_le64(
> +		target.data[2] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>  			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>  #ifdef __BIG_ENDIAN
> @@ -1365,23 +1515,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>  			 STRTAB_STE_2_S2R);
>  
> -		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>  
>  		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>  	}
>  
>  	if (master->ats_enabled)
> -		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>  						 STRTAB_STE_1_EATS_TRANS));
>  
> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -	/* See comment in arm_smmu_write_ctx_desc() */
> -	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -	/* It's likely that we'll want to use the new STE soon */
> -	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	target.data[0] = cpu_to_le64(val);
> +	arm_smmu_write_ste(smmu, sid, dst, &target);
>  }
>  
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> -- 
> 2.43.0
>
Thanks,
Mostafa
Jason Gunthorpe Jan. 29, 2024, 7:49 p.m. UTC | #4
On Mon, Jan 29, 2024 at 07:10:47PM +0000, Mostafa Saleh wrote:

> > Going forward this will use a V=0 transition instead of cycling through
> > ABORT if a hitfull change is required. This seems more appropriate as ABORT
> > will fail DMAs without any logging, but dropping a DMA due to transient
> > V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
> Would the driver do anything in that case, or would just print the log message?

Just log, AFAIK.
 
> > +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> > +				      const __le64 *target,
> > +				      const __le64 *target_used, __le64 *step,
> > +				      __le64 v_bit,
> I think this is confusing here, I believe we have this as an argument as this
> function would be used for CD later, however for this series it is unnecessary,
> IMHO, this should be removed and added in another patch for the CD rework.

It is a lot of code churn to do that, even more so on the new version.

> > +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> > +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> > +	case STRTAB_STE_0_CFG_ABORT:
> > +		break;
> > +	case STRTAB_STE_0_CFG_BYPASS:
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > +		break;
> > +	case STRTAB_STE_0_CFG_S1_TRANS:
> > +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> > +						  STRTAB_STE_0_S1CTXPTR_MASK |
> > +						  STRTAB_STE_0_S1CDMAX);
> > +		used_bits->data[1] |=
> > +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> > +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> > +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > +		break;
> AFAIU, this is missing something like (while passing smmu->features)
> 	used_bits->data[2] |= features & ARM_SMMU_FEAT_NESTING ?
> 			      cpu_to_le64(STRTAB_STE_2_S2VMID) : 0;
> 
> As the SMMUv3 manual says:
> “ For a Non-secure STE when stage 2 is implemented (SMMU_IDR0.S2P == 1)
>   translations resulting from a StreamWorld == NS-EL1 configuration are
>   VMID-tagged with S2VMID when either of stage 1 (Config[0] == 1) or stage 2
>   (Config[1] == 1) provide translation.“
> 
> Which means in case of S1=>S2 switch or vice versa this algorithm will ignore
> VMID while it is used.

Ah, yes, that is a small miss, thanks. I don't think we need the
features test though; S2VMID doesn't mean something different if the
feature is not present.
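
As a sketch of that (on top of the S1_TRANS case in arm_smmu_get_ste_used()
above, unconditionally since no feature test is needed):

	case STRTAB_STE_0_CFG_S1_TRANS:
		/* ... the existing S1 used bits from the hunk above ... */
		/* NS-EL1 output is VMID-tagged whenever stage 2 is implemented */
		used_bits->data[2] |= cpu_to_le64(STRTAB_STE_2_S2VMID);
		break;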

> > +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> > +			       struct arm_smmu_ste *ste,
> > +			       const struct arm_smmu_ste *target)
> > +{
> > +	struct arm_smmu_ste target_used;
> > +	int i;
> > +
> > +	arm_smmu_get_ste_used(target, &target_used);
> > +	/* Masks in arm_smmu_get_ste_used() are up to date */
> > +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> > +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> In what situation this would be triggered, is that for future proofing,
> maybe we can move it to arm_smmu_get_ste_used()?

Yes, prevent people from making an error down the road.

It can't be in ste_used due to how this specific algorithm works
iteratively.

And in the v4 version it still wouldn't be a good idea at this point
due to how the series slowly migrates STE and CD programming
over. There are cases where the current STE will not have been written
by this code and may not pass this test.

Thanks,
Jason
Mostafa Saleh Jan. 29, 2024, 8:48 p.m. UTC | #5
On Mon, Jan 29, 2024 at 03:49:10PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 29, 2024 at 07:10:47PM +0000, Mostafa Saleh wrote:
> 
> > > Going forward this will use a V=0 transition instead of cycling through
> > > ABORT if a hitfull change is required. This seems more appropriate as ABORT
> > > will fail DMAs without any logging, but dropping a DMA due to transient
> > > V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
> > Would the driver do anything in that case, or would just print the log message?
> 
> Just log, AFAIK.
>  
> > > +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> > > +				      const __le64 *target,
> > > +				      const __le64 *target_used, __le64 *step,
> > > +				      __le64 v_bit,
> > I think this is confusing here, I believe we have this as an argument as this
> > function would be used for CD later, however for this series it is unnecessary,
> > IMHO, this should be removed and added in another patch for the CD rework.
> 
> It is alot of code churn to do that, even more on the new version.
> 
> > > +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> > > +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> > > +	case STRTAB_STE_0_CFG_ABORT:
> > > +		break;
> > > +	case STRTAB_STE_0_CFG_BYPASS:
> > > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > > +		break;
> > > +	case STRTAB_STE_0_CFG_S1_TRANS:
> > > +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> > > +						  STRTAB_STE_0_S1CTXPTR_MASK |
> > > +						  STRTAB_STE_0_S1CDMAX);
> > > +		used_bits->data[1] |=
> > > +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> > > +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> > > +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> > > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > > +		break;
> > AFAIU, this is missing something like (while passing smmu->features)
> > 	used_bits->data[2] |= features & ARM_SMMU_FEAT_NESTING ?
> > 			      cpu_to_le64(STRTAB_STE_2_S2VMID) : 0;
> > 
> > As the SMMUv3 manual says:
> > “ For a Non-secure STE when stage 2 is implemented (SMMU_IDR0.S2P == 1)
> >   translations resulting from a StreamWorld == NS-EL1 configuration are
> >   VMID-tagged with S2VMID when either of stage 1 (Config[0] == 1) or stage 2
> >   (Config[1] == 1) provide translation.“
> > 
> > Which means in case of S1=>S2 switch or vice versa this algorithm will ignore
> > VMID while it is used.
Yes. In that case we would consider S2VMID even for stage-1-only instances,
even though it should never change, and in that case the algorithm will take
the same steps. I guess it might still look confusing, but I have no strong
opinion.

> 
> Ah, yes, that is a small miss, thanks. I don't think we need the
> features test though, s2vmid doesn't mean something different if the
> feature is not present..
> 
> > > +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> > > +			       struct arm_smmu_ste *ste,
> > > +			       const struct arm_smmu_ste *target)
> > > +{
> > > +	struct arm_smmu_ste target_used;
> > > +	int i;
> > > +
> > > +	arm_smmu_get_ste_used(target, &target_used);
> > > +	/* Masks in arm_smmu_get_ste_used() are up to date */
> > > +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> > > +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> > In what situation this would be triggered, is that for future proofing,
> > maybe we can move it to arm_smmu_get_ste_used()?
> 
> Yes, prevent people from making an error down the road.
> 
> It can't be in ste_used due to how this specific algorithm works
> iteratively
> 
> And in the v4 version it still wouldn't be a good idea at this point
> due to how the series slowly migrates STE and CD programming
> over. There are cases where the current STE will not have been written
> by this code and may not pass this test.
> 
> Thanks,
> Jason

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b120d836681c1c..0934f882b94e94 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,101 @@  void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+/*
+ * This algorithm updates any STE/CD to any value without creating a situation
+ * where the HW can perceive a corrupted entry. HW is only required to have a 64
+ * bit atomicity with stores from the CPU, while entries are many 64 bit values
+ * big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. Upon each call cur/cur_used reflect the current
+ * synchronized value seen by the HW.
+ *
+ * During each step the HW can observe a torn entry that has any combination of
+ * the step's old/new 64 bit words. The algorithm objective is for the HW
+ * behavior to always be one of current behavior, V=0, or new behavior, during
+ * each step, and across all steps.
+ *
+ * At each step one of three actions is chosen to evolve cur to target:
+ *  - Update all unused bits with their target values.
+ *    This relies on the IGNORED behavior described in the specification
+ *  - Update a single 64-bit value
+ *  - Update all unused bits and set V=0
+ *
+ * The last two actions will cause cur_used to change, which will then allow the
+ * first action on the next step.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process.
+ */
+static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
+				      const __le64 *target,
+				      const __le64 *target_used, __le64 *step,
+				      __le64 v_bit,
+				      unsigned int len)
+{
+	u8 step_used_diff = 0;
+	u8 step_change = 0;
+	unsigned int i;
+
+	/*
+	 * Compute a step that has all the bits currently unused by HW set to
+	 * their target values.
+	 */
+	for (i = 0; i != len; i++) {
+		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
+		if (cur[i] != step[i])
+			step_change |= 1 << i;
+		/*
+		 * Each bit indicates if the step is incorrect compared to the
+		 * target, considering only the used bits in the target
+		 */
+		if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
+			step_used_diff |= 1 << i;
+	}
+
+	if (hweight8(step_used_diff) > 1) {
+		/*
+		 * More than 1 qword is mismatched, this cannot be done without
+		 * a break. Clear the V bit and go again.
+		 */
+		step[0] &= ~v_bit;
+	} else if (!step_change && step_used_diff) {
+		/*
+		 * Have exactly one critical qword, all the other qwords are set
+		 * correctly, so we can set this qword now.
+		 */
+		i = ffs(step_used_diff) - 1;
+		step[i] = target[i];
+	} else if (!step_change) {
+		/* cur == target, so all done */
+		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
+			return true;
+
+		/*
+		 * All the used HW bits match, but unused bits are different.
+		 * Set them as well. Technically this isn't necessary but it
+		 * brings the entry to the full target state, so if there are
+		 * bugs in the mask calculation this will obscure them.
+		 */
+		memcpy(step, target, len * sizeof(*step));
+	}
+
+	for (i = 0; i != len; i++)
+		WRITE_ONCE(cur[i], step[i]);
+	return false;
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1248,37 +1343,115 @@  static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+/*
+ * Based on the value of ent report which bits of the STE the HW will access. It
+ * would be nice if this was complete according to the spec, but minimally it
+ * has to capture the bits this driver uses.
+ */
+static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
+				  struct arm_smmu_ste *used_bits)
+{
+	memset(used_bits, 0, sizeof(*used_bits));
+
+	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
+	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
+		return;
+
+	/*
+	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+	 * attribute/permission configuration fields for the SHCFG behavior.
+	 */
+	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0])) & 1 &&
+	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
+		    STRTAB_STE_1_S1DSS_BYPASS)
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
+	case STRTAB_STE_0_CFG_ABORT:
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+						  STRTAB_STE_0_S1CTXPTR_MASK |
+						  STRTAB_STE_0_S1CDMAX);
+		used_bits->data[1] |=
+			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+		break;
+	case STRTAB_STE_0_CFG_S2_TRANS:
+		used_bits->data[1] |=
+			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+		used_bits->data[2] |=
+			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+		break;
+
+	default:
+		memset(used_bits, 0xFF, sizeof(*used_bits));
+		WARN_ON(true);
+	}
+}
+
+static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
+				    const struct arm_smmu_ste *target,
+				    const struct arm_smmu_ste *target_used)
+{
+	struct arm_smmu_ste cur_used;
+	struct arm_smmu_ste step;
+
+	arm_smmu_get_ste_used(cur, &cur_used);
+	return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
+					 target_used->data, step.data,
+					 cpu_to_le64(STRTAB_STE_0_V),
+					 ARRAY_SIZE(cur->data));
+}
+
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+			       struct arm_smmu_ste *ste,
+			       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_ste target_used;
+	int i;
+
+	arm_smmu_get_ste_used(target, &target_used);
+	/* Masks in arm_smmu_get_ste_used() are up to date */
+	for (i = 0; i != ARRAY_SIZE(target->data); i++)
+		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+
+	while (true) {
+		if (arm_smmu_write_ste_step(ste, target, &target_used))
+			break;
+		arm_smmu_sync_ste_for_sid(smmu, sid);
+	}
+
+	/* It's likely that we'll want to use the new STE soon */
+	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		struct arm_smmu_cmdq_ent
+			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+					 .prefetch = {
+						 .sid = sid,
+					 } };
+
+		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	}
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
-	/*
-	 * This is hideously complicated, but we only really care about
-	 * three cases at the moment:
-	 *
-	 * 1. Invalid (all zero) -> bypass/fault (init)
-	 * 2. Bypass/fault -> translation/bypass (attach)
-	 * 3. Translation/bypass -> bypass/fault (detach)
-	 *
-	 * Given that we can't update the STE atomically and the SMMU
-	 * doesn't read the thing in a defined order, that leaves us
-	 * with the following maintenance requirements:
-	 *
-	 * 1. Update Config, return (init time STEs aren't live)
-	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
-	 * 3. Update Config, sync
-	 */
-	u64 val = le64_to_cpu(dst->data[0]);
-	bool ste_live = false;
+	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_cmdq_ent prefetch_cmd = {
-		.opcode		= CMDQ_OP_PREFETCH_CFG,
-		.prefetch	= {
-			.sid	= sid,
-		},
-	};
+	struct arm_smmu_ste target = {};
 
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
@@ -1293,22 +1466,6 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		}
 	}
 
-	if (val & STRTAB_STE_0_V) {
-		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-		case STRTAB_STE_0_CFG_BYPASS:
-			break;
-		case STRTAB_STE_0_CFG_S1_TRANS:
-		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
-			break;
-		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
-			break;
-		default:
-			BUG(); /* STE corruption */
-		}
-	}
-
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
@@ -1319,16 +1476,11 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst->data[0] = cpu_to_le64(val);
-		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		target.data[0] = cpu_to_le64(val);
+		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst->data[2] = 0; /* Nuke the VMID */
-		/*
-		 * The SMMU can perform negative caching, so we must sync
-		 * the STE regardless of whether the old value was live.
-		 */
-		if (smmu)
-			arm_smmu_sync_ste_for_sid(smmu, sid);
+		target.data[2] = 0; /* Nuke the VMID */
+		arm_smmu_write_ste(smmu, sid, dst, &target);
 		return;
 	}
 
@@ -1336,8 +1488,7 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1497,7 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1506,7 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1515,17 @@  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,