[kvm-unit-tests,10/10] arm64: gic: Use IPI test checking for the LPI tests

Message ID 20201125155113.192079-11-alexandru.elisei@arm.com (mailing list archive)
State New, archived
Series GIC fixes and improvements

Commit Message

Alexandru Elisei Nov. 25, 2020, 3:51 p.m. UTC
The LPI code validates a result similarly to the IPI tests, by checking if
the target CPU received the interrupt with the expected interrupt number.
However, the LPI tests invent their own way of checking the test results by
creating a global struct (lpi_stats), using a separate interrupt handler
(lpi_handler) and test function (check_lpi_stats).

There are several areas that can be improved in the LPI code, which are
already covered by the IPI tests:

- check_lpi_stats() doesn't take into account that the target CPU can
  receive the correct interrupt multiple times.
- check_lpi_stats() doesn't take into account the scenarios where all
  online CPUs can receive the interrupt, but the target CPU is the last CPU
  that touches lpi_stats.observed.
- Insufficient or missing memory synchronization.

Instead of duplicating code, let's convert the LPI tests to use
check_acked() and the same interrupt handler as the IPI tests, which has
been renamed to irq_handler() to avoid any confusion.

check_lpi_stats() has been replaced with check_acked() which, together with
using irq_handler(), instantly gives us more correctness checks and proper
memory synchronization between threads. lpi_stats.expected has been
replaced by the CPU mask and the expected interrupt number arguments to
check_acked(), with no change in semantics.
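
A minimal sketch of why per-CPU counters catch more than a "last observed"
record. This is not the code from arm/gic.c: NR_CPUS_MODEL, struct
lpi_last_seen, struct irq_counts, record(), check_last_seen() and
check_counts() are hypothetical names, and the model leaves out the memory
barriers and the sender check that the real check_acked() performs.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count, for illustration only */

/* Old style: remember only the last (cpu, irq) pair that was observed. */
struct lpi_last_seen {
	int cpu;
	int irq;
};

/* New style: one acknowledgement counter and one irq number per CPU. */
struct irq_counts {
	int acked[NR_CPUS_MODEL];
	int irqnr[NR_CPUS_MODEL];
};

/* What a handler would record when 'cpu' acknowledges 'irq'. */
void record(struct lpi_last_seen *last, struct irq_counts *c, int cpu, int irq)
{
	last->cpu = cpu;
	last->irq = irq;
	c->acked[cpu]++;
	c->irqnr[cpu] = irq;
}

/* Passes as long as the most recent interrupt matches; duplicates are invisible. */
bool check_last_seen(const struct lpi_last_seen *last, int cpu, int irq)
{
	return last->cpu == cpu && last->irq == irq;
}

/* Passes only if the target acked exactly once and no other CPU acked at all. */
bool check_counts(const struct irq_counts *c, int target, int irq)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		int want = (cpu == target) ? 1 : 0;

		if (c->acked[cpu] != want)
			return false;
		if (want && c->irqnr[cpu] != irq)
			return false;
	}
	return true;
}
```

In this model, delivering LPI 8195 twice to CPU 3 still satisfies
check_last_seen(), while check_counts() rejects it.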

lpi_handler() aborted the test if the interrupt number was not an LPI. This
was changed in favor of allowing the test to continue, as it will fail in
check_acked(), but possibly print information useful for debugging. If the
test receives spurious interrupts, those are reported via report_info() at
the end of the test for consistency with the IPI tests, which don't treat
spurious interrupts as critical errors.

In the spirit of code reuse, secondary_lpi_tests() has been replaced with
ipi_recv() because the two are now identical; ipi_recv() has been renamed
to irq_recv(), similarly to irq_handler(), to avoid confusion.

CC: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
With this change, I get the following failure for its-trigger on a
rockpro64 (running on the little cores):

$ taskset -c 0-3 arm/run arm/gic.flat -smp 4 -machine gic-version=3 -append its-trigger
/usr/bin/qemu-system-aarch64 -nodefaults -machine virt,gic-version=host,accel=kvm -cpu host -device virtio-serial-device -device virtconsole,chardev=ctd -chardev testdev,id=ctd -device pci-testdev -display none -serial stdio -kernel arm/gic.flat -smp 4 -machine gic-version=3 -append its-trigger # -initrd /tmp/tmp.wWW0iJY6DS
ITS: MAPD devid=2 size = 0x8 itt=0x403a0000 valid=1
ITS: MAPD devid=7 size = 0x8 itt=0x403b0000 valid=1
MAPC col_id=3 target_addr = 0x30000 valid=1
MAPC col_id=2 target_addr = 0x20000 valid=1
INVALL col_id=2
INVALL col_id=3
MAPTI dev_id=2 event_id=20 -> phys_id=8195, col_id=3
MAPTI dev_id=7 event_id=255 -> phys_id=8196, col_id=2
INT dev_id=2 event_id=20
PASS: gicv3: its-trigger: int: dev=2, eventid=20  -> lpi= 8195, col=3
INT dev_id=7 event_id=255
PASS: gicv3: its-trigger: int: dev=7, eventid=255 -> lpi= 8196, col=2
INV dev_id=2 event_id=20
INT dev_id=2 event_id=20
PASS: gicv3: its-trigger: inv/invall: dev2/eventid=20 does not trigger any LPI
INT dev_id=2 event_id=20
PASS: gicv3: its-trigger: inv/invall: dev2/eventid=20 still does not trigger any LPI
INVALL col_id=3
INT dev_id=2 event_id=20
INFO: gicv3: its-trigger: inv/invall: ACKS: missing=0 extra=1 unexpected=0
FAIL: gicv3: its-trigger: inv/invall: dev2/eventid=20 now triggers an LPI
ITS: MAPD devid=2 size = 0x8 itt=0x403a0000 valid=0
INT dev_id=2 event_id=20
PASS: gicv3: its-trigger: mapd valid=false: no LPI after device unmap
SUMMARY: 6 tests, 1 unexpected failures

The reason for the failure is that the test "dev2/eventid=20 now triggers
an LPI" triggers 2 LPIs, not one. This behavior was present before this
patch, but it was ignored because check_lpi_stats() wasn't looking at the
acked array.

I'm not familiar with the ITS so I'm not sure if this is expected, if the
test is incorrect or if there is something wrong with KVM emulation.

Did some more testing on an Ampere eMAG (fast out-of-order cores) using
qemu and kvmtool and Linux v5.8, here's what I found:

- Using qemu and gic.flat built from *master*: error encountered 864 times
  out of 1088 runs.
- Using qemu: error encountered 852 times out of 1027 runs.
- Using kvmtool: error encountered 8164 times out of 10602 runs.

Looks to me like it's consistent between master and this series, and
between qemu and kvmtool.
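
As a quick check of that claim, the raw counts above can be turned into
percentages. failure_rate_pct() is a hypothetical helper written only for
this note; it is not part of kvm-unit-tests.

```c
/* Failure rate as a percentage; returns 0.0 for an empty sample. */
double failure_rate_pct(unsigned long failures, unsigned long runs)
{
	if (runs == 0)
		return 0.0;
	return 100.0 * (double)failures / (double)runs;
}
```

Applied to the three sets of numbers, this gives roughly 79.4% (864/1088),
83.0% (852/1027) and 77.0% (8164/10602).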

Here's the diff that I used for testing master (I removed the diff line
because it causes trouble when applying the main patch):

@@ -772,8 +772,12 @@ static void test_its_trigger(void)
        /* Now call the invall and check the LPI hits */
        its_send_invall(col3);
        lpi_stats_expect(3, 8195);
+       acked[3] = 0;
+       dsb(ishst);
        its_send_int(dev2, 20);
        check_lpi_stats("dev2/eventid=20 now triggers an LPI");
+       report_info("acked[3] = %d", acked[3]);
+       report(acked[3] == 1, "dev2/eventid=20 received one interrupt");
 
        report_prefix_pop();
 

 arm/gic.c | 185 ++++++++++++++++++++++++++----------------------------
 1 file changed, 88 insertions(+), 97 deletions(-)

Comments

Zenghui Yu Nov. 26, 2020, 9:30 a.m. UTC | #1
On 2020/11/25 23:51, Alexandru Elisei wrote:
> The reason for the failure is that the test "dev2/eventid=20 now triggers
> an LPI" triggers 2 LPIs, not one. This behavior was present before this
> patch, but it was ignored because check_lpi_stats() wasn't looking at the
> acked array.
> 
> I'm not familiar with the ITS so I'm not sure if this is expected, if the
> test is incorrect or if there is something wrong with KVM emulation.

I think this is expected, or not.

Before INVALL, the LPI-8195 was already pending but disabled. On
receiving INVALL, VGIC will reload configuration for all LPIs targeting
collection-3 and deliver the now enabled LPI-8195. We'll therefore see
and handle it before sending the following INT (which will set the
LPI-8195 pending again).

> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
> qemu and kvmtool and Linux v5.8, here's what I found:
> 
> - Using qemu and gic.flat built from *master*: error encountered 864 times
>    out of 1088 runs.
> - Using qemu: error encountered 852 times out of 1027 runs.
> - Using kvmtool: error encountered 8164 times out of 10602 runs.

If vcpu-3 doesn't see and handle LPI-8195 quickly enough (e.g., because
vcpu-3 hasn't been scheduled), the following INT will set the already
pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
And we won't see the mentioned failure.
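
The two orderings described above can be sketched with a toy model. This is
not KVM's VGIC code: struct lpi_model, lpi_int(), lpi_invall(), vcpu_run()
and scenario() are hypothetical names, and the model assumes that the pending
state is a single latched bit and that the enable bit is only re-read on
INVALL.

```c
#include <stdbool.h>

struct lpi_model {
	bool cfg_enabled;	/* enable bit in the guest's configuration table */
	bool cached_enabled;	/* enable bit as last loaded by the emulated GIC */
	bool pending;		/* latched: an INT while already pending adds nothing */
	int delivered;		/* number of times the vCPU took the LPI */
};

/* INT: mark the LPI pending, whether or not it is enabled. */
void lpi_int(struct lpi_model *m)
{
	m->pending = true;
}

/* INVALL: reload the configuration, which may enable a pending LPI. */
void lpi_invall(struct lpi_model *m)
{
	m->cached_enabled = m->cfg_enabled;
}

/* The vCPU gets to run and takes the LPI if it is pending and enabled. */
void vcpu_run(struct lpi_model *m)
{
	if (m->pending && m->cached_enabled) {
		m->pending = false;
		m->delivered++;
	}
}

/* Returns how many LPIs the vCPU takes for the inv/invall test sequence. */
int scenario(bool vcpu_runs_before_int)
{
	struct lpi_model m = { false, false, false, 0 };

	lpi_int(&m);		/* LPI becomes pending while disabled */
	vcpu_run(&m);		/* nothing is delivered */
	m.cfg_enabled = true;	/* guest enables the LPI in memory */
	lpi_invall(&m);		/* the pending LPI is now deliverable */
	if (vcpu_runs_before_int)
		vcpu_run(&m);	/* fast vCPU takes it right away */
	lpi_int(&m);		/* the INT that follows INVALL in the test */
	vcpu_run(&m);
	return m.delivered;
}
```

In this model scenario(true) yields two deliveries (the failing case) and
scenario(false) yields one.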

I think we can just drop the (meaningless and confusing?) INT.


Thanks,
Zenghui
Alexandru Elisei Nov. 27, 2020, 2:50 p.m. UTC | #2
Hi Zenghui,

Thank you for having a look at this!

On 11/26/20 9:30 AM, Zenghui Yu wrote:
> On 2020/11/25 23:51, Alexandru Elisei wrote:
>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>> acked array.
>>
>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>> test is incorrect or if there is something wrong with KVM emulation.
>
> I think this is expected, or not.
>
> Before INVALL, the LPI-8195 was already pending but disabled. On
> receiving INVALL, VGIC will reload configuration for all LPIs targeting
> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
> and handle it before sending the following INT (which will set the
> LPI-8195 pending again).
>
>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>> qemu and kvmtool and Linux v5.8, here's what I found:
>>
>> - Using qemu and gic.flat built from *master*: error encountered 864 times
>>    out of 1088 runs.
>> - Using qemu: error encountered 852 times out of 1027 runs.
>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>
> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
> vcpu-3 hadn't been scheduled), the following INT will set the already
> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
> And we won't see the mentioned failure.
>
> I think we can just drop the (meaningless and confusing?) INT.

I think I understand your explanation: the VCPU takes the interrupt immediately
after the INVALL and before the INT, and the second interrupt that I am seeing is
the one caused by the INT command.

I tried modifying the test like this:

diff --git a/arm/gic.c b/arm/gic.c
index 6e93da80fe0d..0ef8c12ea234 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -761,10 +761,17 @@ static void test_its_trigger(void)
        wmb();
        cpumask_clear(&mask);
        cpumask_set_cpu(3, &mask);
-       its_send_int(dev2, 20);
        wait_for_interrupts(&mask);
        report(check_acked(&mask, 0, 8195),
-                       "dev2/eventid=20 now triggers an LPI");
+                       "dev2/eventid=20 pending LPI is received");
+
+       stats_reset();
+       wmb();
+       cpumask_clear(&mask);
+       cpumask_set_cpu(3, &mask);
+       its_send_int(dev2, 20);
+       wait_for_interrupts(&mask);
+       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
 
        report_prefix_pop();
 
I removed the INT from the initial test and added a separate one to check that
the INT command still works. To me, that looks like it preserves the spirit of
the original test. After stress testing, this is what I got:

- with kvmtool, over 47,709 iterations, the test timed out 27 times when
waiting for the interrupt after INVALL.
- with qemu, over 15,511 iterations, the test timed out 258 times when waiting
for the interrupt after INVALL, just like with kvmtool.

Judging from the fact that there are an order of magnitude fewer failures with
kvmtool than with qemu, I'm leaning towards some random timing issue. Over the
weekend, I will try increasing the timeout for wait_for_interrupts() and see if
the results improve.

Thanks,
Alex
>
>
> Thanks,
> Zenghui
Zenghui Yu Nov. 30, 2020, 1:59 p.m. UTC | #3
Hi Alex,

On 2020/11/27 22:50, Alexandru Elisei wrote:
> Hi Zenghui,
> 
> Thank you for having a look at this!
> 
> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>> acked array.
>>>
>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>> test is incorrect or if there is something wrong with KVM emulation.
>>
>> I think this is expected, or not.
>>
>> Before INVALL, the LPI-8195 was already pending but disabled. On
>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>> and handle it before sending the following INT (which will set the
>> LPI-8195 pending again).
>>
>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>
>>> - Using qemu and gic.flat built from *master*: error encountered 864 times
>>>     out of 1088 runs.
>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>
>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>> vcpu-3 hadn't been scheduled), the following INT will set the already
>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>> And we won't see the mentioned failure.
>>
>> I think we can just drop the (meaningless and confusing?) INT.
> 
> I think I understand your explanation, the VCPU takes the interrupt immediately
> after the INVALL and before the INT, and the second interrupt that I am seeing is
> the one caused by the INT command.

Yes.

> I tried modifying the test like this:
> 
> diff --git a/arm/gic.c b/arm/gic.c
> index 6e93da80fe0d..0ef8c12ea234 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>          wmb();
>          cpumask_clear(&mask);
>          cpumask_set_cpu(3, &mask);
> -       its_send_int(dev2, 20);

Shouldn't its_send_invall(col3) be moved down here? See below.

>          wait_for_interrupts(&mask);
>          report(check_acked(&mask, 0, 8195),
> -                       "dev2/eventid=20 now triggers an LPI");
> +                       "dev2/eventid=20 pending LPI is received");
> +
> +       stats_reset();
> +       wmb();
> +       cpumask_clear(&mask);
> +       cpumask_set_cpu(3, &mask);
> +       its_send_int(dev2, 20);
> +       wait_for_interrupts(&mask);
> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>   
>          report_prefix_pop();
>   
> I removed the INT from the initial test, and added a separate one to check that
> the INT command still works. That looks to me that preserves the spirit of the
> original test. After doing stress testing this is what I got:
> 
> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
> the interrupt after INVALL.
> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
> interrupt after INVALL, just like with kvmtool.

I guess the reason for the failure is that the LPI is taken *immediately*
after the INVALL?

	/* Now call the invall and check the LPI hits */
	its_send_invall(col3);
		<- LPI is taken, acked[]++
	stats_reset();
		<- acked[] is cleared unexpectedly
	wmb();
	cpumask_clear(&mask);
	cpumask_set_cpu(3, &mask);
	wait_for_interrupts(&mask);
		<- we'll hit timed-out since acked[] is 0
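
The window above can be reproduced with a deterministic, single-threaded
sketch. It models the worst case, in which the handler runs as soon as the
trigger is issued. acked_model, stats_reset_model(), trigger_model() and
wait_model() are hypothetical stand-ins, not the functions from arm/gic.c.

```c
#include <stdbool.h>

static int acked_model;	/* stands in for acked[3] */

/* Clear the statistics, as stats_reset() would. */
void stats_reset_model(void)
{
	acked_model = 0;
}

/* Worst case: the interrupt is taken and counted immediately. */
void trigger_model(void)
{
	acked_model++;
}

/* True if the interrupt was seen, false if the wait would time out. */
bool wait_model(void)
{
	return acked_model > 0;
}

/* Reset after the trigger: the acknowledgement is wiped before the wait. */
bool reset_after_trigger(void)
{
	trigger_model();
	stats_reset_model();
	return wait_model();
}

/* Reset before the trigger: the acknowledgement survives until the wait. */
bool reset_before_trigger(void)
{
	stats_reset_model();
	trigger_model();
	return wait_model();
}
```

Here reset_after_trigger() returns false (a timeout) and
reset_before_trigger() returns true, which suggests resetting the stats
before sending the command that can make the LPI fire.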


Thanks,
Zenghui

> Judging from the fact that there is an order of magnitude less failures with
> kvmtool than with qemu, I'm leaning towards some random timing issue. I will try
> increasing the timeout for wait_for_interrupts() and see if the results improve
> over the weekend.
> 
> Thanks,
> Alex
>>
>>
>> Thanks,
>> Zenghui
Alexandru Elisei Nov. 30, 2020, 2:19 p.m. UTC | #4
Hi Zenghui,

On 11/30/20 1:59 PM, Zenghui Yu wrote:
> Hi Alex,
>
> On 2020/11/27 22:50, Alexandru Elisei wrote:
>> Hi Zenghui,
>>
>> Thank you for having a look at this!
>>
>> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>>> acked array.
>>>>
>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>>> test is incorrect or if there is something wrong with KVM emulation.
>>>
>>> I think this is expected, or not.
>>>
>>> Before INVALL, the LPI-8195 was already pending but disabled. On
>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>>> and handle it before sending the following INT (which will set the
>>> LPI-8195 pending again).
>>>
>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>>
>>>> - Using qemu and gic.flat built from *master*: error encountered 864 times
>>>>     out of 1088 runs.
>>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>>
>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>>> vcpu-3 hadn't been scheduled), the following INT will set the already
>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>>> And we won't see the mentioned failure.
>>>
>>> I think we can just drop the (meaningless and confusing?) INT.
>>
>> I think I understand your explanation, the VCPU takes the interrupt immediately
>> after the INVALL and before the INT, and the second interrupt that I am seeing is
>> the one caused by the INT command.
>
> Yes.
>
>> I tried modifying the test like this:
>>
>> diff --git a/arm/gic.c b/arm/gic.c
>> index 6e93da80fe0d..0ef8c12ea234 100644
>> --- a/arm/gic.c
>> +++ b/arm/gic.c
>> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>>          wmb();
>>          cpumask_clear(&mask);
>>          cpumask_set_cpu(3, &mask);
>> -       its_send_int(dev2, 20);
>
> Shouldn't its_send_invall(col3) be moved down here? See below.
>
>>          wait_for_interrupts(&mask);
>>          report(check_acked(&mask, 0, 8195),
>> -                       "dev2/eventid=20 now triggers an LPI");
>> +                       "dev2/eventid=20 pending LPI is received");
>> +
>> +       stats_reset();
>> +       wmb();
>> +       cpumask_clear(&mask);
>> +       cpumask_set_cpu(3, &mask);
>> +       its_send_int(dev2, 20);
>> +       wait_for_interrupts(&mask);
>> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>>
>>         report_prefix_pop();
>>
>> I removed the INT from the initial test, and added a separate one to check that
>> the INT command still works. That looks to me that preserves the spirit of the
>> original test. After doing stress testing this is what I got:
>>
>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
>> the interrupt after INVALL.
>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
>> interrupt after INVALL, just like with kvmtool.
>
> I guess the reason of failure is that the LPI is taken *immediately*
> after the INVALL?
>
>     /* Now call the invall and check the LPI hits */
>     its_send_invall(col3);
>         <- LPI is taken, acked[]++
>     stats_reset();
>         <- acked[] is cleared unexpectedly
>     wmb();
>     cpumask_clear(&mask);
>     cpumask_set_cpu(3, &mask);
>     wait_for_interrupts(&mask);
>         <- we'll hit timed-out since acked[] is 0

Yes, of course, you're right, I didn't realize that I was resetting the stats
*after* the interrupt was enabled. This also explains why I was still seeing
timeouts even when the timeout duration was set to 50 seconds. I'll retest with
the fix:

diff --git a/arm/gic.c b/arm/gic.c
index 6e93da80fe0d..c4240f5aba39 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -756,15 +756,22 @@ static void test_its_trigger(void)
                        "dev2/eventid=20 still does not trigger any LPI");
 
        /* Now call the invall and check the LPI hits */
+       stats_reset();
+       wmb();
+       cpumask_clear(&mask);
+       cpumask_set_cpu(3, &mask);
        its_send_invall(col3);
+       wait_for_interrupts(&mask);
+       report(check_acked(&mask, 0, 8195),
+                       "dev2/eventid=20 pending LPI is received");
+
        stats_reset();
        wmb();
        cpumask_clear(&mask);
        cpumask_set_cpu(3, &mask);
        its_send_int(dev2, 20);
        wait_for_interrupts(&mask);
-       report(check_acked(&mask, 0, 8195),
-                       "dev2/eventid=20 now triggers an LPI");
+       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
 
        report_prefix_pop();
 
I also pushed a branch at [1].

Thank you so much for spotting this! You've saved me (and probably others) a lot
of time debugging.

[1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2

Thanks,
Alex
Eric Auger Nov. 30, 2020, 5:48 p.m. UTC | #5
Hi Alexandru, Zenghui
On 11/26/20 10:30 AM, Zenghui Yu wrote:
> On 2020/11/25 23:51, Alexandru Elisei wrote:
>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>> acked array.
>>
>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>> test is incorrect or if there is something wrong with KVM emulation.
> 
> I think this is expected, or not.
> 
> Before INVALL, the LPI-8195 was already pending but disabled. On
> receiving INVALL, VGIC will reload configuration for all LPIs targeting
> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
> and handle it before sending the following INT (which will set the
> LPI-8195 pending again).
> 
>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>> qemu and kvmtool and Linux v5.8, here's what I found:
>>
>> - Using qemu and gic.flat built from *master*: error encountered 864 times
>>    out of 1088 runs.
>> - Using qemu: error encountered 852 times out of 1027 runs.
>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
> 
> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
> vcpu-3 hadn't been scheduled), the following INT will set the already
> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
> And we won't see the mentioned failure.
> 
> I think we can just drop the (meaningless and confusing?) INT.
Yes, I agree with Zenghui: we can remove the INT and just check that the LPI
that was set pending while disabled eventually hits.

Thanks

Eric
> 
> 
> Thanks,
> Zenghui
>
Alexandru Elisei Dec. 1, 2020, 3:09 p.m. UTC | #6
Hi,

On 11/30/20 2:19 PM, Alexandru Elisei wrote:
> Hi Zenghui,
>
> On 11/30/20 1:59 PM, Zenghui Yu wrote:
>> Hi Alex,
>>
>> On 2020/11/27 22:50, Alexandru Elisei wrote:
>>> Hi Zenghui,
>>>
>>> Thank you for having a look at this!
>>>
>>> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>>>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>>>> acked array.
>>>>>
>>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>>>> test is incorrect or if there is something wrong with KVM emulation.
>>>> I think this is expected, or not.
>>>>
>>>> Before INVALL, the LPI-8195 was already pending but disabled. On
>>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>>>> and handle it before sending the following INT (which will set the
>>>> LPI-8195 pending again).
>>>>
>>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>>>
>>>>> - Using qemu and gic.flat built from *master*: error encountered 864 times
>>>>>     out of 1088 runs.
>>>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>>>> vcpu-3 hadn't been scheduled), the following INT will set the already
>>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>>>> And we won't see the mentioned failure.
>>>>
>>>> I think we can just drop the (meaningless and confusing?) INT.
>>> I think I understand your explanation, the VCPU takes the interrupt immediately
>>> after the INVALL and before the INT, and the second interrupt that I am seeing is
>>> the one caused by the INT command.
>> Yes.
>>
>>> I tried modifying the test like this:
>>>
>>> diff --git a/arm/gic.c b/arm/gic.c
>>> index 6e93da80fe0d..0ef8c12ea234 100644
>>> --- a/arm/gic.c
>>> +++ b/arm/gic.c
>>> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>>>          wmb();
>>>          cpumask_clear(&mask);
>>>          cpumask_set_cpu(3, &mask);
>>> -       its_send_int(dev2, 20);
>> Shouldn't its_send_invall(col3) be moved down here? See below.
>>
>>>          wait_for_interrupts(&mask);
>>>          report(check_acked(&mask, 0, 8195),
>>> -                       "dev2/eventid=20 now triggers an LPI");
>>> +                       "dev2/eventid=20 pending LPI is received");
>>> +
>>> +       stats_reset();
>>> +       wmb();
>>> +       cpumask_clear(&mask);
>>> +       cpumask_set_cpu(3, &mask);
>>> +       its_send_int(dev2, 20);
>>> +       wait_for_interrupts(&mask);
>>> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>>>
>>>         report_prefix_pop();
>>>
>>> I removed the INT from the initial test, and added a separate one to check that
>>> the INT command still works. That looks to me that preserves the spirit of the
>>> original test. After doing stress testing this is what I got:
>>>
>>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
>>> the interrupt after INVALL.
>>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
>>> interrupt after INVALL, just like with kvmtool.
>> I guess the reason of failure is that the LPI is taken *immediately*
>> after the INVALL?
>>
>>     /* Now call the invall and check the LPI hits */
>>     its_send_invall(col3);
>>         <- LPI is taken, acked[]++
>>     stats_reset();
>>         <- acked[] is cleared unexpectedly
>>     wmb();
>>     cpumask_clear(&mask);
>>     cpumask_set_cpu(3, &mask);
>>     wait_for_interrupts(&mask);
>>         <- we'll hit timed-out since acked[] is 0
> Yes, of course, you're right, I didn't realize that I was resetting the stats
> *after* the interrupt was enabled. This also explains why I was still seeing
> timeouts even when the timeout duration was set to 50 seconds. I'll retest with
> the fix:
>
> diff --git a/arm/gic.c b/arm/gic.c
> index 6e93da80fe0d..c4240f5aba39 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -756,15 +756,22 @@ static void test_its_trigger(void)
>                         "dev2/eventid=20 still does not trigger any LPI");
>  
>         /* Now call the invall and check the LPI hits */
> +       stats_reset();
> +       wmb();
> +       cpumask_clear(&mask);
> +       cpumask_set_cpu(3, &mask);
>         its_send_invall(col3);
> +       wait_for_interrupts(&mask);
> +       report(check_acked(&mask, 0, 8195),
> +                       "dev2/eventid=20 pending LPI is received");
> +
>         stats_reset();
>         wmb();
>         cpumask_clear(&mask);
>         cpumask_set_cpu(3, &mask);
>         its_send_int(dev2, 20);
>         wait_for_interrupts(&mask);
> -       report(check_acked(&mask, 0, 8195),
> -                       "dev2/eventid=20 now triggers an LPI");
> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>  
>         report_prefix_pop();
>  
> I also pushed a branch at [1].
>
> Thank you so much for spotting this! You've saved me (and probably others) a lot
> of time debugging.
>
> [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2

I have been testing the branch, no failures after 17,996 runs with qemu and 58,669
runs with kvmtool. This looks fine to me, I'll send a v2 with the fix.

Thanks,
Alex
Eric Auger Dec. 3, 2020, 2:59 p.m. UTC | #7
Hi Alexandru,
On 11/25/20 4:51 PM, Alexandru Elisei wrote:
> The LPI code validates a result similarly to the IPI tests, by checking if
> the target CPU received the interrupt with the expected interrupt number.
> However, the LPI tests invent their own way of checking the test results by
> creating a global struct (lpi_stats), using a separate interrupt handler
> (lpi_handler) and test function (check_lpi_stats).
> 
> There are several areas that can be improved in the LPI code, which are
> already covered by the IPI tests:
> 
> - check_lpi_stats() doesn't take into account that the target CPU can
>   receive the correct interrupt multiple times.
> - check_lpi_stats() doesn't take into account the scenarios where all
>   online CPUs can receive the interrupt, but the target CPU is the last CPU
>   that touches lpi_stats.observed.
> - Insufficient or missing memory synchronization.
> 
> Instead of duplicating code, let's convert the LPI tests to use
> check_acked() and the same interrupt handler as the IPI tests, which has
> been renamed to irq_handler() to avoid any confusion.
> 
> check_lpi_stats() has been replaced with check_acked() which, together with
> using irq_handler(), instantly gives us more correctness checks and proper
> memory synchronization between threads. lpi_stats.expected has been
> replaced by the CPU mask and the expected interrupt number arguments to
> check_acked(), with no change in semantics.
> 
> lpi_handler() aborted the test if the interrupt number was not an LPI. This
> was changed in favor of allowing the test to continue, as it will fail in
> check_acked(), but possibly print information useful for debugging. If the
> test receives spurious interrupts, those are reported via report_info() at
> the end of the test for consistency with the IPI tests, which don't treat
> spurious interrupts as critical errors.
> 
> In the spirit of code reuse, secondary_lpi_tests() has been replaced with
> ipi_recv() because the two are now identical; ipi_recv() has been renamed
> to irq_recv(), similarly to irq_handler(), to avoid confusion.
> 
> CC: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
> With this change, I get the following failure for its-trigger on a
> rockpro64 (running on the little cores):
> 
> $ taskset -c 0-3 arm/run arm/gic.flat -smp 4 -machine gic-version=3 -append its-trigger
> /usr/bin/qemu-system-aarch64 -nodefaults -machine virt,gic-version=host,accel=kvm -cpu host -device virtio-serial-device -device virtconsole,chardev=ctd -chardev testdev,id=ctd -device pci-testdev -display none -serial stdio -kernel arm/gic.flat -smp 4 -machine gic-version=3 -append its-trigger # -initrd /tmp/tmp.wWW0iJY6DS
> ITS: MAPD devid=2 size = 0x8 itt=0x403a0000 valid=1
> ITS: MAPD devid=7 size = 0x8 itt=0x403b0000 valid=1
> MAPC col_id=3 target_addr = 0x30000 valid=1
> MAPC col_id=2 target_addr = 0x20000 valid=1
> INVALL col_id=2
> INVALL col_id=3
> MAPTI dev_id=2 event_id=20 -> phys_id=8195, col_id=3
> MAPTI dev_id=7 event_id=255 -> phys_id=8196, col_id=2
> INT dev_id=2 event_id=20
> PASS: gicv3: its-trigger: int: dev=2, eventid=20  -> lpi= 8195, col=3
> INT dev_id=7 event_id=255
> PASS: gicv3: its-trigger: int: dev=7, eventid=255 -> lpi= 8196, col=2
> INV dev_id=2 event_id=20
> INT dev_id=2 event_id=20
> PASS: gicv3: its-trigger: inv/invall: dev2/eventid=20 does not trigger any LPI
> INT dev_id=2 event_id=20
> PASS: gicv3: its-trigger: inv/invall: dev2/eventid=20 still does not trigger any LPI
> INVALL col_id=3
> INT dev_id=2 event_id=20
> INFO: gicv3: its-trigger: inv/invall: ACKS: missing=0 extra=1 unexpected=0
> FAIL: gicv3: its-trigger: inv/invall: dev2/eventid=20 now triggers an LPI
> ITS: MAPD devid=2 size = 0x8 itt=0x403a0000 valid=0
> INT dev_id=2 event_id=20
> PASS: gicv3: its-trigger: mapd valid=false: no LPI after device unmap
> SUMMARY: 6 tests, 1 unexpected failures
> 
> The reason for the failure is that the test "dev2/eventid=20 now triggers
> an LPI" triggers 2 LPIs, not one. This behavior was present before this
> patch, but it was ignored because check_lpi_stats() wasn't looking at the
> acked array.
> 
> I'm not familiar with the ITS so I'm not sure if this is expected, if the
> test is incorrect or if there is something wrong with KVM emulation.
> 
> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
> qemu and kvmtool and Linux v5.8, here's what I found:
> 
> - Using qemu and gic.flat built from *master*: error encountered 864 times
>   out of 1088 runs.
> - Using qemu: error encountered 852 times out of 1027 runs.
> - Using kvmtool: error encountered 8164 times out of 10602 runs.
> 
> Looks to me like it's consistent between master and this series, and
> between qemu and kvmtool.
> 
> Here's the diff that I used for testing master (I removed the diff line
> because it causes trouble when applying the main patch):
> 
> @@ -772,8 +772,12 @@ static void test_its_trigger(void)
>         /* Now call the invall and check the LPI hits */
>         its_send_invall(col3);
>         lpi_stats_expect(3, 8195);
> +       acked[3] = 0;
> +       dsb(ishst);
>         its_send_int(dev2, 20);
>         check_lpi_stats("dev2/eventid=20 now triggers an LPI");
> +       report_info("acked[3] = %d", acked[3]);
> +       report(acked[3] == 1, "dev2/eventid=20 received one interrupt");
>  
>         report_prefix_pop();
>  
> 
>  arm/gic.c | 185 ++++++++++++++++++++++++++----------------------------
>  1 file changed, 88 insertions(+), 97 deletions(-)
> 
> diff --git a/arm/gic.c b/arm/gic.c
> index da7b42da5449..6e93da80fe0d 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -111,7 +111,7 @@ static bool check_acked(cpumask_t *mask, int sender, int irqnum)
>  		}
>  		if (!acked[cpu])
>  			continue;
> -		smp_rmb(); /* pairs with smp_wmb in ipi_handler */
> +		smp_rmb(); /* pairs with smp_wmb in irq_handler */
>  
>  		if (has_gicv2 && irq_sender[cpu] != sender) {
>  			report_info("cpu%d received IPI from wrong sender %d",
> @@ -149,11 +149,12 @@ static void check_spurious(void)
>  static int gic_get_sender(int irqstat)
>  {
>  	if (gic_version() == 2)
> +		/* GICC_IAR.CPUID is RAZ for non-SGIs */
>  		return (irqstat >> 10) & 7;
>  	return -1;
>  }
>  
> -static void ipi_handler(struct pt_regs *regs __unused)
> +static void irq_handler(struct pt_regs *regs __unused)
>  {
>  	u32 irqstat = gic_read_iar();
>  	u32 irqnr = gic_iar_irqnr(irqstat);
> @@ -192,75 +193,6 @@ static void setup_irq(irq_handler_fn handler)
>  }
>  
>  #if defined(__aarch64__)
> -struct its_event {
> -	int cpu_id;
> -	int lpi_id;
> -};
> -
> -struct its_stats {
> -	struct its_event expected;
> -	struct its_event observed;
> -};
> -
> -static struct its_stats lpi_stats;
> -
> -static void lpi_handler(struct pt_regs *regs __unused)
> -{
> -	u32 irqstat = gic_read_iar();
> -	int irqnr = gic_iar_irqnr(irqstat);
> -
> -	gic_write_eoir(irqstat);
> -	assert(irqnr >= 8192);
> -	smp_rmb(); /* pairs with wmb in lpi_stats_expect */
> -	lpi_stats.observed.cpu_id = smp_processor_id();
> -	lpi_stats.observed.lpi_id = irqnr;
> -	acked[lpi_stats.observed.cpu_id]++;
> -	smp_wmb(); /* pairs with rmb in check_lpi_stats */
> -}
> -
> -static void lpi_stats_expect(int exp_cpu_id, int exp_lpi_id)
> -{
> -	lpi_stats.expected.cpu_id = exp_cpu_id;
> -	lpi_stats.expected.lpi_id = exp_lpi_id;
> -	lpi_stats.observed.cpu_id = -1;
> -	lpi_stats.observed.lpi_id = -1;
> -	smp_wmb(); /* pairs with rmb in handler */
> -}
> -
> -static void check_lpi_stats(const char *msg)
> -{
> -	int i;
> -
> -	for (i = 0; i < 50; i++) {
> -		mdelay(100);
> -		smp_rmb(); /* pairs with wmb in lpi_handler */
> -		if (lpi_stats.observed.cpu_id == lpi_stats.expected.cpu_id &&
> -		    lpi_stats.observed.lpi_id == lpi_stats.expected.lpi_id) {
> -			report(true, "%s", msg);
> -			return;
> -		}
> -	}
> -
> -	if (lpi_stats.observed.cpu_id == -1 && lpi_stats.observed.lpi_id == -1) {
> -		report_info("No LPI received whereas (cpuid=%d, intid=%d) "
> -			    "was expected", lpi_stats.expected.cpu_id,
> -			    lpi_stats.expected.lpi_id);
> -	} else {
> -		report_info("Unexpected LPI (cpuid=%d, intid=%d)",
> -			    lpi_stats.observed.cpu_id,
> -			    lpi_stats.observed.lpi_id);
> -	}
> -	report(false, "%s", msg);
> -}
> -
> -static void secondary_lpi_test(void)
> -{
> -	setup_irq(lpi_handler);
> -	cpumask_set_cpu(smp_processor_id(), &ready);
> -	while (1)
> -		wfi();
> -}
> -
>  static void check_lpi_hits(int *expected, const char *msg)
>  {
>  	bool pass = true;
> @@ -347,7 +279,7 @@ static void ipi_test_smp(void)
>  
>  static void ipi_send(void)
>  {
> -	setup_irq(ipi_handler);
> +	setup_irq(irq_handler);
>  	wait_on_ready();
>  	ipi_test_self();
>  	ipi_test_smp();
> @@ -355,9 +287,9 @@ static void ipi_send(void)
>  	exit(report_summary());
>  }
>  
> -static void ipi_recv(void)
> +static void irq_recv(void)
>  {
> -	setup_irq(ipi_handler);
> +	setup_irq(irq_handler);
>  	cpumask_set_cpu(smp_processor_id(), &ready);
>  	while (1)
>  		wfi();
> @@ -368,7 +300,7 @@ static void ipi_test(void *data __unused)
>  	if (smp_processor_id() == IPI_SENDER)
>  		ipi_send();
>  	else
> -		ipi_recv();
> +		irq_recv();
>  }
>  
>  static struct gic gicv2 = {
> @@ -698,12 +630,12 @@ static int its_prerequisites(int nb_cpus)
>  
>  	stats_reset();
>  
> -	setup_irq(lpi_handler);
> +	setup_irq(irq_handler);
>  
>  	for_each_present_cpu(cpu) {
>  		if (cpu == 0)
>  			continue;
> -		smp_boot_secondary(cpu, secondary_lpi_test);
> +		smp_boot_secondary(cpu, irq_recv);
>  	}
>  	wait_on_ready();
>  
> @@ -757,6 +689,7 @@ static void test_its_trigger(void)
>  {
>  	struct its_collection *col3;
>  	struct its_device *dev2, *dev7;
> +	cpumask_t mask;
>  
>  	if (its_setup1())
>  		return;
> @@ -767,13 +700,27 @@ static void test_its_trigger(void)
>  
>  	report_prefix_push("int");
>  
> -	lpi_stats_expect(3, 8195);
> +	stats_reset();
> +	/*
> +	 * its_send_int() is missing the synchronization from the GICv3 IPI
> +	 * trigger functions.
> +	 */
> +	wmb();
So don't you want to add it in __its_send_int() instead?

Eric
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(3, &mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("dev=2, eventid=20  -> lpi= 8195, col=3");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8195),
> +			"dev=2, eventid=20  -> lpi= 8195, col=3");
>  
> -	lpi_stats_expect(2, 8196);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(2, &mask);
>  	its_send_int(dev7, 255);
> -	check_lpi_stats("dev=7, eventid=255 -> lpi= 8196, col=2");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8196),
> +			"dev=7, eventid=255 -> lpi= 8196, col=2");
>  
>  	report_prefix_pop();
>  
> @@ -786,9 +733,13 @@ static void test_its_trigger(void)
>  	gicv3_lpi_set_config(8195, LPI_PROP_DEFAULT & ~LPI_PROP_ENABLED);
>  	its_send_inv(dev2, 20);
>  
> -	lpi_stats_expect(-1, -1);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("dev2/eventid=20 does not trigger any LPI");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, -1, -1),
> +			"dev2/eventid=20 does not trigger any LPI");
>  
>  	/*
>  	 * re-enable the LPI but willingly do not call invall
> @@ -796,15 +747,24 @@ static void test_its_trigger(void)
>  	 * The LPI should not hit
>  	 */
>  	gicv3_lpi_set_config(8195, LPI_PROP_DEFAULT);
> -	lpi_stats_expect(-1, -1);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("dev2/eventid=20 still does not trigger any LPI");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, -1, -1),
> +			"dev2/eventid=20 still does not trigger any LPI");
>  
>  	/* Now call the invall and check the LPI hits */
>  	its_send_invall(col3);
> -	lpi_stats_expect(3, 8195);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(3, &mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("dev2/eventid=20 now triggers an LPI");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8195),
> +			"dev2/eventid=20 now triggers an LPI");
>  
>  	report_prefix_pop();
>  
> @@ -815,9 +775,14 @@ static void test_its_trigger(void)
>  	 */
>  
>  	its_send_mapd(dev2, false);
> -	lpi_stats_expect(-1, -1);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("no LPI after device unmap");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, -1, -1), "no LPI after device unmap");
> +
> +	check_spurious();
>  	report_prefix_pop();
>  }
>  
> @@ -825,6 +790,7 @@ static void test_its_migration(void)
>  {
>  	struct its_device *dev2, *dev7;
>  	bool test_skipped = false;
> +	cpumask_t mask;
>  
>  	if (its_setup1()) {
>  		test_skipped = true;
> @@ -841,13 +807,25 @@ do_migrate:
>  	if (test_skipped)
>  		return;
>  
> -	lpi_stats_expect(3, 8195);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(3, &mask);
>  	its_send_int(dev2, 20);
> -	check_lpi_stats("dev2/eventid=20 triggers LPI 8195 on PE #3 after migration");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8195),
> +			"dev2/eventid=20 triggers LPI 8195 on PE #3 after migration");
>  
> -	lpi_stats_expect(2, 8196);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(2, &mask);
>  	its_send_int(dev7, 255);
> -	check_lpi_stats("dev7/eventid=255 triggers LPI 8196 on PE #2 after migration");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8196),
> +			"dev7/eventid=255 triggers LPI 8196 on PE #2 after migration");
> +
> +	check_spurious();
>  }
>  
>  #define ERRATA_UNMAPPED_COLLECTIONS "ERRATA_8c58be34494b"
> @@ -857,6 +835,7 @@ static void test_migrate_unmapped_collection(void)
>  	struct its_collection *col = NULL;
>  	struct its_device *dev2 = NULL, *dev7 = NULL;
>  	bool test_skipped = false;
> +	cpumask_t mask;
>  	int pe0 = 0;
>  	u8 config;
>  
> @@ -891,17 +870,29 @@ do_migrate:
>  	its_send_mapc(col, true);
>  	its_send_invall(col);
>  
> -	lpi_stats_expect(2, 8196);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(2, &mask);
>  	its_send_int(dev7, 255);
> -	check_lpi_stats("dev7/eventid= 255 triggered LPI 8196 on PE #2");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8196),
> +			"dev7/eventid= 255 triggered LPI 8196 on PE #2");
>  
>  	config = gicv3_lpi_get_config(8192);
>  	report(config == LPI_PROP_DEFAULT,
>  	       "Config of LPI 8192 was properly migrated");
>  
> -	lpi_stats_expect(pe0, 8192);
> +	stats_reset();
> +	wmb();
> +	cpumask_clear(&mask);
> +	cpumask_set_cpu(pe0, &mask);
>  	its_send_int(dev2, 0);
> -	check_lpi_stats("dev2/eventid = 0 triggered LPI 8192 on PE0");
> +	wait_for_interrupts(&mask);
> +	report(check_acked(&mask, 0, 8192),
> +			"dev2/eventid = 0 triggered LPI 8192 on PE0");
> +
> +	check_spurious();
>  }
>  
>  static void test_its_pending_migration(void)
>
Alexandru Elisei Dec. 9, 2020, 10:29 a.m. UTC | #8
Hi Eric,

On 12/3/20 2:59 PM, Auger Eric wrote:
> Hi Alexandru,
> On 11/25/20 4:51 PM, Alexandru Elisei wrote:
>> The LPI code validates a result similarly to the IPI tests, by checking if
>> the target CPU received the interrupt with the expected interrupt number.
>> However, the LPI tests invent their own way of checking the test results by
>> creating a global struct (lpi_stats), using a separate interrupt handler
>> (lpi_handler) and test function (check_lpi_stats).
>>
>> There are several areas that can be improved in the LPI code, which are
>> already covered by the IPI tests:
>>
>> - check_lpi_stats() doesn't take into account that the target CPU can
>>   receive the correct interrupt multiple times.
>> - check_lpi_stats() doesn't take into account the scenarios where all
>>   online CPUs can receive the interrupt, but the target CPU is the last CPU
>>   that touches lpi_stats.observed.
>> - Insufficient or missing memory synchronization.
>>
>> Instead of duplicating code, let's convert the LPI tests to use
>> check_acked() and the same interrupt handler as the IPI tests, which has
>> been renamed to irq_handler() to avoid any confusion.
>>
>> check_lpi_stats() has been replaced with check_acked() which, together with
>> using irq_handler(), instantly gives us more correctness checks and proper
>> memory synchronization between threads. lpi_stats.expected has been
>> replaced by the CPU mask and the expected interrupt number arguments to
>> check_acked(), with no change in semantics.
>>
>> lpi_handler() aborted the test if the interrupt number was not an LPI. This
>> was changed in favor of allowing the test to continue, as it will fail in
>> check_acked(), but possibly print information useful for debugging. If the
>> test receives spurious interrupts, those are reported via report_info() at
>> the end of the test for consistency with the IPI tests, which don't treat
>> spurious interrupts as critical errors.
>>
>> In the spirit of code reuse, secondary_lpi_test() has been replaced with
>> ipi_recv() because the two are now identical; ipi_recv() has been renamed
>> to irq_recv(), similarly to irq_handler(), to avoid confusion.
>>
>> CC: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>> [..]
>> @@ -767,13 +700,27 @@ static void test_its_trigger(void)
>>  
>>  	report_prefix_push("int");
>>  
>> -	lpi_stats_expect(3, 8195);
>> +	stats_reset();
>> +	/*
>> +	 * its_send_int() is missing the synchronization from the GICv3 IPI
>> +	 * trigger functions.
>> +	 */
>> +	wmb();
> so don't you want to add it in __its_send_int instead?

The memory synchronization in the IPI sender functions makes perfect sense, that's
how IPIs are used - one CPU kicks the target, the target reads from a shared
memory location.

I don't think receiving an interrupt from a device is how one would usually expect
to do inter-processor communication. However, I did more digging about this
ability to trigger interrupts from made-up devices, and it seems to me that this
was introduced for testing purposes (please correct me if I'm wrong). With this in
mind, I guess it wouldn't be awkward to have the wmb() in its_send_int(), because
we are using it just like we would an IPI. And it also reduces the boilerplate code.

I'll make the change in the next iteration.
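
As a rough sketch of the idea only, the barrier can live inside the send helper so that callers no longer repeat it before every trigger. The code below is a portable C11 model, not the kvm-unit-tests implementation: shared_stats, doorbell and the model_*() functions are hypothetical, and a release fence merely plays the role of wmb():

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins, not the kvm-unit-tests definitions. */
static int shared_stats = -1;	/* state written before the interrupt is raised */
static atomic_int doorbell;	/* models the command that raises the interrupt */

/* Models the raw trigger, which has no ordering guarantee of its own. */
static void model_raw_send_int(void)
{
	atomic_store_explicit(&doorbell, 1, memory_order_relaxed);
}

/* Models a send helper with the barrier folded in. */
static void model_send_int(void)
{
	/* Plays the role of wmb(): earlier stores are ordered before the trigger. */
	atomic_thread_fence(memory_order_release);
	model_raw_send_int();
}

/* Sender side: reset the shared state, then trigger, with no explicit barrier. */
static void model_sender(void)
{
	shared_stats = 0;	/* plays the role of stats_reset() */
	model_send_int();
}

/* Receiver side: models the handler, which only runs after the trigger. */
static int model_receiver(void)
{
	assert(atomic_load_explicit(&doorbell, memory_order_acquire) == 1);
	return shared_stats;
}
```

With the fence inside model_send_int(), every call site gets the ordering for free, which is the boilerplate reduction mentioned above.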

Thanks,
Alex
Patch

diff --git a/arm/gic.c b/arm/gic.c
index da7b42da5449..6e93da80fe0d 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -111,7 +111,7 @@  static bool check_acked(cpumask_t *mask, int sender, int irqnum)
 		}
 		if (!acked[cpu])
 			continue;
-		smp_rmb(); /* pairs with smp_wmb in ipi_handler */
+		smp_rmb(); /* pairs with smp_wmb in irq_handler */
 
 		if (has_gicv2 && irq_sender[cpu] != sender) {
 			report_info("cpu%d received IPI from wrong sender %d",
@@ -149,11 +149,12 @@  static void check_spurious(void)
 static int gic_get_sender(int irqstat)
 {
 	if (gic_version() == 2)
+		/* GICC_IAR.CPUID is RAZ for non-SGIs */
 		return (irqstat >> 10) & 7;
 	return -1;
 }
 
-static void ipi_handler(struct pt_regs *regs __unused)
+static void irq_handler(struct pt_regs *regs __unused)
 {
 	u32 irqstat = gic_read_iar();
 	u32 irqnr = gic_iar_irqnr(irqstat);
@@ -192,75 +193,6 @@  static void setup_irq(irq_handler_fn handler)
 }
 
 #if defined(__aarch64__)
-struct its_event {
-	int cpu_id;
-	int lpi_id;
-};
-
-struct its_stats {
-	struct its_event expected;
-	struct its_event observed;
-};
-
-static struct its_stats lpi_stats;
-
-static void lpi_handler(struct pt_regs *regs __unused)
-{
-	u32 irqstat = gic_read_iar();
-	int irqnr = gic_iar_irqnr(irqstat);
-
-	gic_write_eoir(irqstat);
-	assert(irqnr >= 8192);
-	smp_rmb(); /* pairs with wmb in lpi_stats_expect */
-	lpi_stats.observed.cpu_id = smp_processor_id();
-	lpi_stats.observed.lpi_id = irqnr;
-	acked[lpi_stats.observed.cpu_id]++;
-	smp_wmb(); /* pairs with rmb in check_lpi_stats */
-}
-
-static void lpi_stats_expect(int exp_cpu_id, int exp_lpi_id)
-{
-	lpi_stats.expected.cpu_id = exp_cpu_id;
-	lpi_stats.expected.lpi_id = exp_lpi_id;
-	lpi_stats.observed.cpu_id = -1;
-	lpi_stats.observed.lpi_id = -1;
-	smp_wmb(); /* pairs with rmb in handler */
-}
-
-static void check_lpi_stats(const char *msg)
-{
-	int i;
-
-	for (i = 0; i < 50; i++) {
-		mdelay(100);
-		smp_rmb(); /* pairs with wmb in lpi_handler */
-		if (lpi_stats.observed.cpu_id == lpi_stats.expected.cpu_id &&
-		    lpi_stats.observed.lpi_id == lpi_stats.expected.lpi_id) {
-			report(true, "%s", msg);
-			return;
-		}
-	}
-
-	if (lpi_stats.observed.cpu_id == -1 && lpi_stats.observed.lpi_id == -1) {
-		report_info("No LPI received whereas (cpuid=%d, intid=%d) "
-			    "was expected", lpi_stats.expected.cpu_id,
-			    lpi_stats.expected.lpi_id);
-	} else {
-		report_info("Unexpected LPI (cpuid=%d, intid=%d)",
-			    lpi_stats.observed.cpu_id,
-			    lpi_stats.observed.lpi_id);
-	}
-	report(false, "%s", msg);
-}
-
-static void secondary_lpi_test(void)
-{
-	setup_irq(lpi_handler);
-	cpumask_set_cpu(smp_processor_id(), &ready);
-	while (1)
-		wfi();
-}
-
 static void check_lpi_hits(int *expected, const char *msg)
 {
 	bool pass = true;
@@ -347,7 +279,7 @@  static void ipi_test_smp(void)
 
 static void ipi_send(void)
 {
-	setup_irq(ipi_handler);
+	setup_irq(irq_handler);
 	wait_on_ready();
 	ipi_test_self();
 	ipi_test_smp();
@@ -355,9 +287,9 @@  static void ipi_send(void)
 	exit(report_summary());
 }
 
-static void ipi_recv(void)
+static void irq_recv(void)
 {
-	setup_irq(ipi_handler);
+	setup_irq(irq_handler);
 	cpumask_set_cpu(smp_processor_id(), &ready);
 	while (1)
 		wfi();
@@ -368,7 +300,7 @@  static void ipi_test(void *data __unused)
 	if (smp_processor_id() == IPI_SENDER)
 		ipi_send();
 	else
-		ipi_recv();
+		irq_recv();
 }
 
 static struct gic gicv2 = {
@@ -698,12 +630,12 @@  static int its_prerequisites(int nb_cpus)
 
 	stats_reset();
 
-	setup_irq(lpi_handler);
+	setup_irq(irq_handler);
 
 	for_each_present_cpu(cpu) {
 		if (cpu == 0)
 			continue;
-		smp_boot_secondary(cpu, secondary_lpi_test);
+		smp_boot_secondary(cpu, irq_recv);
 	}
 	wait_on_ready();
 
@@ -757,6 +689,7 @@  static void test_its_trigger(void)
 {
 	struct its_collection *col3;
 	struct its_device *dev2, *dev7;
+	cpumask_t mask;
 
 	if (its_setup1())
 		return;
@@ -767,13 +700,27 @@  static void test_its_trigger(void)
 
 	report_prefix_push("int");
 
-	lpi_stats_expect(3, 8195);
+	stats_reset();
+	/*
+	 * its_send_int() is missing the synchronization from the GICv3 IPI
+	 * trigger functions.
+	 */
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(3, &mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("dev=2, eventid=20  -> lpi= 8195, col=3");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8195),
+			"dev=2, eventid=20  -> lpi= 8195, col=3");
 
-	lpi_stats_expect(2, 8196);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(2, &mask);
 	its_send_int(dev7, 255);
-	check_lpi_stats("dev=7, eventid=255 -> lpi= 8196, col=2");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8196),
+			"dev=7, eventid=255 -> lpi= 8196, col=2");
 
 	report_prefix_pop();
 
@@ -786,9 +733,13 @@  static void test_its_trigger(void)
 	gicv3_lpi_set_config(8195, LPI_PROP_DEFAULT & ~LPI_PROP_ENABLED);
 	its_send_inv(dev2, 20);
 
-	lpi_stats_expect(-1, -1);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("dev2/eventid=20 does not trigger any LPI");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, -1, -1),
+			"dev2/eventid=20 does not trigger any LPI");
 
 	/*
 	 * re-enable the LPI but willingly do not call invall
@@ -796,15 +747,24 @@  static void test_its_trigger(void)
 	 * The LPI should not hit
 	 */
 	gicv3_lpi_set_config(8195, LPI_PROP_DEFAULT);
-	lpi_stats_expect(-1, -1);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("dev2/eventid=20 still does not trigger any LPI");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, -1, -1),
+			"dev2/eventid=20 still does not trigger any LPI");
 
 	/* Now call the invall and check the LPI hits */
 	its_send_invall(col3);
-	lpi_stats_expect(3, 8195);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(3, &mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("dev2/eventid=20 now triggers an LPI");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8195),
+			"dev2/eventid=20 now triggers an LPI");
 
 	report_prefix_pop();
 
@@ -815,9 +775,14 @@  static void test_its_trigger(void)
 	 */
 
 	its_send_mapd(dev2, false);
-	lpi_stats_expect(-1, -1);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("no LPI after device unmap");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, -1, -1), "no LPI after device unmap");
+
+	check_spurious();
 	report_prefix_pop();
 }
 
@@ -825,6 +790,7 @@  static void test_its_migration(void)
 {
 	struct its_device *dev2, *dev7;
 	bool test_skipped = false;
+	cpumask_t mask;
 
 	if (its_setup1()) {
 		test_skipped = true;
@@ -841,13 +807,25 @@  do_migrate:
 	if (test_skipped)
 		return;
 
-	lpi_stats_expect(3, 8195);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(3, &mask);
 	its_send_int(dev2, 20);
-	check_lpi_stats("dev2/eventid=20 triggers LPI 8195 on PE #3 after migration");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8195),
+			"dev2/eventid=20 triggers LPI 8195 on PE #3 after migration");
 
-	lpi_stats_expect(2, 8196);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(2, &mask);
 	its_send_int(dev7, 255);
-	check_lpi_stats("dev7/eventid=255 triggers LPI 8196 on PE #2 after migration");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8196),
+			"dev7/eventid=255 triggers LPI 8196 on PE #2 after migration");
+
+	check_spurious();
 }
 
 #define ERRATA_UNMAPPED_COLLECTIONS "ERRATA_8c58be34494b"
@@ -857,6 +835,7 @@  static void test_migrate_unmapped_collection(void)
 	struct its_collection *col = NULL;
 	struct its_device *dev2 = NULL, *dev7 = NULL;
 	bool test_skipped = false;
+	cpumask_t mask;
 	int pe0 = 0;
 	u8 config;
 
@@ -891,17 +870,29 @@  do_migrate:
 	its_send_mapc(col, true);
 	its_send_invall(col);
 
-	lpi_stats_expect(2, 8196);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(2, &mask);
 	its_send_int(dev7, 255);
-	check_lpi_stats("dev7/eventid= 255 triggered LPI 8196 on PE #2");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8196),
+			"dev7/eventid= 255 triggered LPI 8196 on PE #2");
 
 	config = gicv3_lpi_get_config(8192);
 	report(config == LPI_PROP_DEFAULT,
 	       "Config of LPI 8192 was properly migrated");
 
-	lpi_stats_expect(pe0, 8192);
+	stats_reset();
+	wmb();
+	cpumask_clear(&mask);
+	cpumask_set_cpu(pe0, &mask);
 	its_send_int(dev2, 0);
-	check_lpi_stats("dev2/eventid = 0 triggered LPI 8192 on PE0");
+	wait_for_interrupts(&mask);
+	report(check_acked(&mask, 0, 8192),
+			"dev2/eventid = 0 triggered LPI 8192 on PE0");
+
+	check_spurious();
 }
 
 static void test_its_pending_migration(void)