diff mbox series

drm/doc: ci: Require more context for flaky tests

Message ID 20231019094609.251787-1-mripard@kernel.org (mailing list archive)
State New, archived
Series drm/doc: ci: Require more context for flaky tests

Commit Message

Maxime Ripard Oct. 19, 2023, 9:46 a.m. UTC
Flaky tests can be very difficult to reproduce after the fact, which
makes them even harder to ever fix.

Let's document the metadata we agreed on to provide more context to
anyone trying to address these failures.

Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
 Documentation/gpu/automated_testing.rst | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Daniel Vetter Oct. 19, 2023, 10:46 a.m. UTC | #1
On Thu, Oct 19, 2023 at 11:46:09AM +0200, Maxime Ripard wrote:
> Flaky tests can be very difficult to reproduce after the facts, which
> will make it even harder to ever fix.
> 
> Let's document the metadata we agreed on to provide more context to
> anyone trying to address these fixes.
> 
> Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
> Signed-off-by: Maxime Ripard <mripard@kernel.org>

Not that my opinion matters much since I'm really not involved in the
details, and I have no opinion on the specific format and all that, but
this sounds like a very good idea to me.

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Cheers, Sima
> ---
>  Documentation/gpu/automated_testing.rst | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/gpu/automated_testing.rst b/Documentation/gpu/automated_testing.rst
> index 469b6fb65c30..2dd0e221c2c3 100644
> --- a/Documentation/gpu/automated_testing.rst
> +++ b/Documentation/gpu/automated_testing.rst
> @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a specific hardware revision are
>  known to behave unreliably. These tests won't cause a job to fail regardless of
>  the result. They will still be run.
>  
> +Each new flake entry must be associated with a link to a bug report to
> +the author of the affected driver, the board name or Device Tree name of
> +the board, the first kernel version affected, and an approximation of
> +the failure rate.
> +
> +They should be provided under the following format::
> +
> +  # Bug Report: $LORE_OR_PATCHWORK_URL
> +  # Board Name: broken-board.dtb
> +  # Version: 6.6-rc1
> +  # Failure Rate: 100
> +  flaky-test
> +
>  drivers/gpu/drm/ci/${DRIVER_NAME}-${HW_REVISION}-skips.txt
>  -----------------------------------------------------------
>  
> -- 
> 2.41.0
>
Helen Mae Koike Fornazier Oct. 19, 2023, 4:51 p.m. UTC | #2
On 19/10/2023 06:46, Maxime Ripard wrote:
> Flaky tests can be very difficult to reproduce after the facts, which
> will make it even harder to ever fix.
> 
> Let's document the metadata we agreed on to provide more context to
> anyone trying to address these fixes.
> 
> Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
>   Documentation/gpu/automated_testing.rst | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/gpu/automated_testing.rst b/Documentation/gpu/automated_testing.rst
> index 469b6fb65c30..2dd0e221c2c3 100644
> --- a/Documentation/gpu/automated_testing.rst
> +++ b/Documentation/gpu/automated_testing.rst
> @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a specific hardware revision are
>   known to behave unreliably. These tests won't cause a job to fail regardless of
>   the result. They will still be run.
>   
> +Each new flake entry must be associated with a link to a bug report to

What do you mean by bug report? Is just a link to an email to the
mailing list enough?

Also, I made a mistake in the first flakes lists, which I corrected
with https://www.spinics.net/lists/kernel/msg4959629.html (there was a
bug in my script which ended up erroneously adding a bunch of tests to
the flake list, so I cleaned them up). I would like to kindly ask to be
allowed to add that documentation in a future patch, so as not to block
that patch series.

Thanks
Helen


> +the author of the affected driver, the board name or Device Tree name of
> +the board, the first kernel version affected, and an approximation of
> +the failure rate.
> +
> +They should be provided under the following format::
> +
> +  # Bug Report: $LORE_OR_PATCHWORK_URL
> +  # Board Name: broken-board.dtb
> +  # Version: 6.6-rc1
> +  # Failure Rate: 100
> +  flaky-test
> +
>   drivers/gpu/drm/ci/${DRIVER_NAME}-${HW_REVISION}-skips.txt
>   -----------------------------------------------------------
>
Helen Mae Koike Fornazier Oct. 20, 2023, 4:33 a.m. UTC | #3
On 19/10/2023 13:51, Helen Koike wrote:
> 
> 
> On 19/10/2023 06:46, Maxime Ripard wrote:
>> Flaky tests can be very difficult to reproduce after the facts, which
>> will make it even harder to ever fix.
>>
>> Let's document the metadata we agreed on to provide more context to
>> anyone trying to address these fixes.
>>
>> Link: 
>> https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
>> Signed-off-by: Maxime Ripard <mripard@kernel.org>
>> ---
>>   Documentation/gpu/automated_testing.rst | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/Documentation/gpu/automated_testing.rst 
>> b/Documentation/gpu/automated_testing.rst
>> index 469b6fb65c30..2dd0e221c2c3 100644
>> --- a/Documentation/gpu/automated_testing.rst
>> +++ b/Documentation/gpu/automated_testing.rst
>> @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a 
>> specific hardware revision are
>>   known to behave unreliably. These tests won't cause a job to fail 
>> regardless of
>>   the result. They will still be run.
>> +Each new flake entry must be associated with a link to a bug report to
> 
> What do you mean by but report? Just a link to an email to the mailing 
> list is enough?
> 
> Also, I had made a mistake to the first flakes lists, which I corrected 
> with https://www.spinics.net/lists/kernel/msg4959629.html (there was a 
> bug in my script which ended up erroneous adding a bunch of tests in the 
> flake list, so I cleaned them up), I would like to kind request to let 
> me add those documentation in a future patch to not block that patch 
> series.
> 
> Thanks
> Helen
> 
> 
>> +the author of the affected driver, the board name or Device Tree name of
>> +the board, the first kernel version affected, and an approximation of
>> +the failure rate.
>> +
>> +They should be provided under the following format::
>> +
>> +  # Bug Report: $LORE_OR_PATCHWORK_URL

I wonder if the commit adding the test to the flakes.txt file, with an
Acked-by from the device maintainer, shouldn't already be considered the
Bug Report.

>> +  # Board Name: broken-board.dtb

Maybe Board Name isn't required, since it is already in the name of the 
file.

>> +  # Version: 6.6-rc1
>> +  # Failure Rate: 100

Maybe also:

   # Pipeline url: 
https://gitlab.freedesktop.org/helen.fornazier/linux/-/pipelines/1014435

All this info will complicate the update-xfails.py script a bit, but
well, we can handle it...
(see 
https://patchwork.kernel.org/project/dri-devel/patch/20231020034124.136295-4-helen.koike@collabora.com/ 
)
We need to update that script to make life easier.
Vignesh sent a patch adding at least the pipeline url to the file
https://patchwork.kernel.org/project/linux-arm-msm/patch/20231019070650.61159-9-vignesh.raman@collabora.com/
but to meet this doc that needs to be updated too.

Regards,
Helen

>> +  flaky-test
>> +
>>   drivers/gpu/drm/ci/${DRIVER_NAME}-${HW_REVISION}-skips.txt
>>   -----------------------------------------------------------
Maxime Ripard Oct. 23, 2023, 3:05 p.m. UTC | #4
On Thu, Oct 19, 2023 at 01:51:59PM -0300, Helen Koike wrote:
> 
> 
> On 19/10/2023 06:46, Maxime Ripard wrote:
> > Flaky tests can be very difficult to reproduce after the facts, which
> > will make it even harder to ever fix.
> > 
> > Let's document the metadata we agreed on to provide more context to
> > anyone trying to address these fixes.
> > 
> > Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
> > Signed-off-by: Maxime Ripard <mripard@kernel.org>
> > ---
> >   Documentation/gpu/automated_testing.rst | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/Documentation/gpu/automated_testing.rst b/Documentation/gpu/automated_testing.rst
> > index 469b6fb65c30..2dd0e221c2c3 100644
> > --- a/Documentation/gpu/automated_testing.rst
> > +++ b/Documentation/gpu/automated_testing.rst
> > @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a specific hardware revision are
> >   known to behave unreliably. These tests won't cause a job to fail regardless of
> >   the result. They will still be run.
> > +Each new flake entry must be associated with a link to a bug report to
> 
> What do you mean by but report? Just a link to an email to the mailing list
> is enough?

Yes, a mail to the maintainers of that driver is enough. Waiting for an
actual fix would take too long, but at least that way we have the
opportunity to come back later on and see if there's progress.

> Also, I had made a mistake to the first flakes lists, which I corrected with
> https://www.spinics.net/lists/kernel/msg4959629.html (there was a bug in my
> script which ended up erroneous adding a bunch of tests in the flake list,
> so I cleaned them up), I would like to kind request to let me add those
> documentation in a future patch to not block that patch series.

Sounds fair, especially since you remove a significant number of them

Maxime
Maxime Ripard Oct. 23, 2023, 3:09 p.m. UTC | #5
On Fri, Oct 20, 2023 at 01:33:59AM -0300, Helen Koike wrote:
> On 19/10/2023 13:51, Helen Koike wrote:
> > On 19/10/2023 06:46, Maxime Ripard wrote:
> > > Flaky tests can be very difficult to reproduce after the facts, which
> > > will make it even harder to ever fix.
> > > 
> > > Let's document the metadata we agreed on to provide more context to
> > > anyone trying to address these fixes.
> > > 
> > > Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
> > > Signed-off-by: Maxime Ripard <mripard@kernel.org>
> > > ---
> > >   Documentation/gpu/automated_testing.rst | 13 +++++++++++++
> > >   1 file changed, 13 insertions(+)
> > > 
> > > diff --git a/Documentation/gpu/automated_testing.rst
> > > b/Documentation/gpu/automated_testing.rst
> > > index 469b6fb65c30..2dd0e221c2c3 100644
> > > --- a/Documentation/gpu/automated_testing.rst
> > > +++ b/Documentation/gpu/automated_testing.rst
> > > @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a
> > > specific hardware revision are
> > >   known to behave unreliably. These tests won't cause a job to fail
> > > regardless of
> > >   the result. They will still be run.
> > > +Each new flake entry must be associated with a link to a bug report to
> > 
> > What do you mean by but report? Just a link to an email to the mailing
> > list is enough?
> > 
> > Also, I had made a mistake to the first flakes lists, which I corrected
> > with https://www.spinics.net/lists/kernel/msg4959629.html (there was a
> > bug in my script which ended up erroneous adding a bunch of tests in the
> > flake list, so I cleaned them up), I would like to kind request to let
> > me add those documentation in a future patch to not block that patch
> > series.
> > 
> > Thanks
> > Helen
> > 
> > 
> > > +the author of the affected driver, the board name or Device Tree name of
> > > +the board, the first kernel version affected, and an approximation of
> > > +the failure rate.
> > > +
> > > +They should be provided under the following format::
> > > +
> > > +  # Bug Report: $LORE_OR_PATCHWORK_URL
> 
> I wonder if the commit adding the test into the flakes.txt file with and
> Acked-by from the device maintainer shouldn't be already considered the Bug
> Report.

I guess it could, yes. I think I'd still prefer the link, since it would
also let us evaluate whether the issue has been fixed by now.

> > > +  # Board Name: broken-board.dtb
> 
> Maybe Board Name isn't required, since it is already in the name of the
> file.

I have no idea how the i915 naming works, but on ARM at least the name
of the file contains the name of the SoC, not the board where it was
observed.

> > > +  # Version: 6.6-rc1
> > > +  # Failure Rate: 100
> 
> Maybe also:
> 
>   # Pipeline url:
> https://gitlab.freedesktop.org/helen.fornazier/linux/-/pipelines/1014435

Sounds like a good idea yeah :) Are those artifacts archived/deleted at
some point or do they stick around forever?

> All this info will complicated a bit the update-xfails.py script, but well,
> we can handle...
> (see https://patchwork.kernel.org/project/dri-devel/patch/20231020034124.136295-4-helen.koike@collabora.com/
> )
> We need to update that script to make life easier.

I guess we could just add a template for now? It would keep the script
simple and yet still hint to its user that we want more data.

> Vignesh sent a patch adding at least the pipeline url to the file
> https://patchwork.kernel.org/project/linux-arm-msm/patch/20231019070650.61159-9-vignesh.raman@collabora.com/
> but to meet this doc that needs to be updated too.

Sure, I'll update it

Maxime
Helen Mae Koike Fornazier Oct. 25, 2023, 12:47 p.m. UTC | #6
On 23/10/2023 12:09, Maxime Ripard wrote:
> On Fri, Oct 20, 2023 at 01:33:59AM -0300, Helen Koike wrote:
>> On 19/10/2023 13:51, Helen Koike wrote:
>>> On 19/10/2023 06:46, Maxime Ripard wrote:
>>>> Flaky tests can be very difficult to reproduce after the facts, which
>>>> will make it even harder to ever fix.
>>>>
>>>> Let's document the metadata we agreed on to provide more context to
>>>> anyone trying to address these fixes.
>>>>
>>>> Link: https://lore.kernel.org/dri-devel/CAPj87rPbJ1V1-R7WMTHkDat2A4nwSd61Df9mdGH2PR=ZzxaU=Q@mail.gmail.com/
>>>> Signed-off-by: Maxime Ripard <mripard@kernel.org>
>>>> ---
>>>>    Documentation/gpu/automated_testing.rst | 13 +++++++++++++
>>>>    1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/Documentation/gpu/automated_testing.rst
>>>> b/Documentation/gpu/automated_testing.rst
>>>> index 469b6fb65c30..2dd0e221c2c3 100644
>>>> --- a/Documentation/gpu/automated_testing.rst
>>>> +++ b/Documentation/gpu/automated_testing.rst
>>>> @@ -67,6 +67,19 @@ Lists the tests that for a given driver on a
>>>> specific hardware revision are
>>>>    known to behave unreliably. These tests won't cause a job to fail
>>>> regardless of
>>>>    the result. They will still be run.
>>>> +Each new flake entry must be associated with a link to a bug report to
>>>
>>> What do you mean by but report? Just a link to an email to the mailing
>>> list is enough?
>>>
>>> Also, I had made a mistake to the first flakes lists, which I corrected
>>> with https://www.spinics.net/lists/kernel/msg4959629.html (there was a
>>> bug in my script which ended up erroneous adding a bunch of tests in the
>>> flake list, so I cleaned them up), I would like to kind request to let
>>> me add those documentation in a future patch to not block that patch
>>> series.
>>>
>>> Thanks
>>> Helen
>>>
>>>
>>>> +the author of the affected driver, the board name or Device Tree name of
>>>> +the board, the first kernel version affected, and an approximation of
>>>> +the failure rate.
>>>> +
>>>> +They should be provided under the following format::
>>>> +
>>>> +  # Bug Report: $LORE_OR_PATCHWORK_URL
>>
>> I wonder if the commit adding the test into the flakes.txt file with and
>> Acked-by from the device maintainer shouldn't be already considered the Bug
>> Report.
> 
> I guess it could, yes. I think I'd still prefer the link since it would
> allow to also evaluate if the issue is fixed or not now.
> 
>>>> +  # Board Name: broken-board.dtb
>>
>> Maybe Board Name isn't required, since it is already in the name of the
>> file.
> 
> I have no idea how the i915 naming works, but on ARM at least the name
> of the file contains the name of the SoC, not the board where it was
> observed.

right, yeah we could use the dtb to be more clear/precise, no problem.

> 
>>>> +  # Version: 6.6-rc1
>>>> +  # Failure Rate: 100
>>
>> Maybe also:
>>
>>    # Pipeline url:
>> https://gitlab.freedesktop.org/helen.fornazier/linux/-/pipelines/1014435
> 
> Sounds like a good idea yeah :) Are those artifacts archived/deleted at
> some point or do they stick around forever?

Good point. I asked the admins: they stick around for 4 weeks (could be
more, but it is not forever) :(

> 
>> All this info will complicated a bit the update-xfails.py script, but well,
>> we can handle...
>> (see https://patchwork.kernel.org/project/dri-devel/patch/20231020034124.136295-4-helen.koike@collabora.com/
>> )
>> We need to update that script to make life easier.
> 
> I guess we could just add a template for now? It would keep the script
> easy and yet still hint its user that we want more data

ack

Thanks
Helen

> 
>> Vignesh sent a patch adding at least the pipeline url to the file
>> https://patchwork.kernel.org/project/linux-arm-msm/patch/20231019070650.61159-9-vignesh.raman@collabora.com/
>> but to meet this doc that needs to be updated too.
> 
> Sure, I'll update it
> 
> Maxime
Maxime Ripard Oct. 25, 2023, 2:19 p.m. UTC | #7
On Wed, Oct 25, 2023 at 09:47:07AM -0300, Helen Koike wrote:
> > > > > +  # Version: 6.6-rc1
> > > > > +  # Failure Rate: 100
> > > 
> > > Maybe also:
> > > 
> > >    # Pipeline url:
> > > https://gitlab.freedesktop.org/helen.fornazier/linux/-/pipelines/1014435
> > 
> > Sounds like a good idea yeah :) Are those artifacts archived/deleted at
> > some point or do they stick around forever?
> 
> Good point, I asked the admins, they stick for 4 weeks (could be more, but
> it is not forever) :(

That's not even a release cycle :/

I guess it's too short to be useful. We can definitely revisit if that
delay is extended at some point though.

Maxime
Maxime Ripard Oct. 26, 2023, 10:58 a.m. UTC | #8
On Thu, 19 Oct 2023 11:46:09 +0200, Maxime Ripard wrote:
> Flaky tests can be very difficult to reproduce after the facts, which
> will make it even harder to ever fix.
> 
> Let's document the metadata we agreed on to provide more context to
> anyone trying to address these fixes.
> 
> 
> [...]

Applied to drm/drm-misc (drm-misc-next).

Thanks!
Maxime
Maxime Ripard Oct. 26, 2023, 11:02 a.m. UTC | #9
On Thu, Oct 26, 2023 at 12:58:48PM +0200, Maxime Ripard wrote:
> On Thu, 19 Oct 2023 11:46:09 +0200, Maxime Ripard wrote:
> > Flaky tests can be very difficult to reproduce after the facts, which
> > will make it even harder to ever fix.
> > 
> > Let's document the metadata we agreed on to provide more context to
> > anyone trying to address these fixes.
> > 
> > 
> > [...]
> 
> Applied to drm/drm-misc (drm-misc-next).

b4 might have been confused, but I only applied the v2.

Maxime
Patch

diff --git a/Documentation/gpu/automated_testing.rst b/Documentation/gpu/automated_testing.rst
index 469b6fb65c30..2dd0e221c2c3 100644
--- a/Documentation/gpu/automated_testing.rst
+++ b/Documentation/gpu/automated_testing.rst
@@ -67,6 +67,19 @@  Lists the tests that for a given driver on a specific hardware revision are
 known to behave unreliably. These tests won't cause a job to fail regardless of
 the result. They will still be run.
 
+Each new flake entry must be associated with a link to a bug report to
+the author of the affected driver, the board name or Device Tree name of
+the board, the first kernel version affected, and an approximation of
+the failure rate.
+
+They should be provided under the following format::
+
+  # Bug Report: $LORE_OR_PATCHWORK_URL
+  # Board Name: broken-board.dtb
+  # Version: 6.6-rc1
+  # Failure Rate: 100
+  flaky-test
+
 drivers/gpu/drm/ci/${DRIVER_NAME}-${HW_REVISION}-skips.txt
 -----------------------------------------------------------
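
The metadata format documented in the patch above lends itself to a simple
check. What follows is only a minimal sketch, not part of update-xfails.py or
of the patch: the names parse_flakes, missing_fields and REQUIRED_FIELDS are
hypothetical, and it assumes that metadata comments directly precede the test
they describe and that a blank line ends an entry.

```python
import re

# Field names come from the format documented in the patch. Treating all
# four as mandatory for every entry is an assumption of this sketch.
REQUIRED_FIELDS = ("Bug Report", "Board Name", "Version", "Failure Rate")

# Matches comment lines of the form "# Key: value".
_FIELD_RE = re.compile(r"^#\s*([^:]+):\s*(.*)$")


def parse_flakes(text):
    """Return a list of (test_name, metadata_dict) pairs."""
    entries = []
    pending = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Assumption: a blank line separates entries, so metadata
            # that is not followed by a test name is dropped.
            pending = {}
            continue
        match = _FIELD_RE.match(line)
        if match:
            pending[match.group(1).strip()] = match.group(2).strip()
        elif line.startswith("#"):
            continue  # free-form comment without a "Key: value" pair
        else:
            # Any other non-empty line is taken to be a test name.
            entries.append((line, pending))
            pending = {}
    return entries


def missing_fields(metadata):
    """List the required fields absent from one entry's metadata."""
    return [name for name in REQUIRED_FIELDS if name not in metadata]
```

Running parse_flakes() over the contents of a flakes file and reporting every
entry for which missing_fields() is non-empty would flag flakes added without
the requested context, without having to reject older entries outright.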