[RFC] tests/functional: Don't fail any precaching errors

Message ID 20250311131327.903329-1-npiggin@gmail.com (mailing list archive)
State New

Commit Message

Nicholas Piggin March 11, 2025, 1:13 p.m. UTC
The NetBSD archive is currently failing part-way through downloads,
which results in no clean HTTP error but a short transfer and checksum
error. This is treated as fatal in the precache download, and it halts
an entire set of tests even if some others could run.

I hacked up this patch to get a bunch of CI tests going again for ppc
merge testing.

Don't treat any precaching failures as errors.
This causes tests to be skipped when they try to fetch their asset.
Some CI results before/after patching:

functional-system-fedora
https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good

functional-system-debian
https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good

This is making the tests skip. Is there a way to make the error more
prominent / obvious in the output? Should they fail instead? I think
there should be a more obvious indication of failure due to asset so
it does not go unnoticed.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 tests/functional/qemu_test/asset.py | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

Comments

Daniel P. Berrangé March 11, 2025, 1:37 p.m. UTC | #1
On Tue, Mar 11, 2025 at 11:13:26PM +1000, Nicholas Piggin wrote:
> The NetBSD archive is currently failing part-way through downloads,
> which results in no clean HTTP error but a short transfer and checksum
> error. This is treated as fatal in the precache download, and it halts
> an entire set of tests even if some others could run.
> 
> I hacked up this patch to get a bunch of CI tests going again for ppc
> merge testing.
> 
> Don't treat any precaching failures as errors.
> This causes tests to be skipped when they try to fetch their asset.
> Some CI results before/after patching:
> 
> functional-system-fedora
> https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
> https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good
> 
> functional-system-debian
> https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
> https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good
> 
> This is making the tests skip. Is there a way to make the error more
> prominent / obvious in the output? Should they fail instead? I think
> there should be a more obvious indication of failure due to asset so
> it does not go unnoticed.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  tests/functional/qemu_test/asset.py | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
> index f0730695f09..3134ccb10da 100644
> --- a/tests/functional/qemu_test/asset.py
> +++ b/tests/functional/qemu_test/asset.py
> @@ -174,14 +174,11 @@ def precache_test(test):
>                  try:
>                      asset.fetch()
>                  except HTTPError as e:
> -                    # Treat 404 as fatal, since it is highly likely to
> -                    # indicate a broken test rather than a transient
> -                    # server or networking problem
> -                    if e.code == 404:
> -                        raise
> -

Why are you removing this ? The commit above does not make any reference
to the problem being a missing URL (404 code). We want missing URLs to
be fatal so that we identify when images we rely on are deleted by their
host, as that is not a transient problem.

>                      log.debug(f"HTTP error {e.code} from {asset.url} " +
>                                "skipping asset precache")
> +                except:
> +                    log.debug(f"Error from {asset.url} " +
> +                              "skipping asset precache")

So this is the bit that actually deals with the exception you show in
the jobs above.

Best practice would be for us to define an 'AssetException' and use that
in asset.py when raising exceptions, or to wrap other exceptions in cases
where we propagate exceptions. Then this code can be more tailored to
catch AssetException, instead of Exception.
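
A minimal sketch of that idea, not the code that was eventually posted.
The 'transient' field and the fetch_wrapped() helper are hypothetical,
made up here only to illustrate wrapping lower-level errors:

```python
class AssetException(Exception):
    """Hypothetical asset-specific error; the fields are assumptions."""

    def __init__(self, url, msg, transient=False):
        super().__init__(f"{url}: {msg}")
        self.url = url
        # transient=True marks server/network trouble that may go away
        self.transient = transient


def fetch_wrapped(url, fetch):
    """Run fetch(url), wrapping any lower-level error in AssetException."""
    try:
        return fetch(url)
    except AssetException:
        raise  # already the right type, pass through unchanged
    except Exception as e:
        # Keep the original error reachable via __cause__
        raise AssetException(url, str(e), transient=True) from e
```

With something like this in place, precache_test() could catch
AssetException alone and leave programming errors to propagate.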


With regards,
Daniel
Thomas Huth March 11, 2025, 1:55 p.m. UTC | #2
On 11/03/2025 14.37, Daniel P. Berrangé wrote:
> On Tue, Mar 11, 2025 at 11:13:26PM +1000, Nicholas Piggin wrote:
>> The NetBSD archive is currently failing part-way through downloads,
>> which results in no clean HTTP error but a short transfer and checksum
>> error. This is treated as fatal in the precache download, and it halts
>> an entire set of tests even if some others could run.
>>
>> I hacked up this patch to get a bunch of CI tests going again for ppc
>> merge testing.
>>
>> Don't treat any precaching failures as errors.
>> This causes tests to be skipped when they try to fetch their asset.
>> Some CI results before/after patching:
>>
>> functional-system-fedora
>> https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
>> https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good
>>
>> functional-system-debian
>> https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
>> https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good
>>
>> This is making the tests skip. Is there a way to make the error more
>> prominent / obvious in the output? Should they fail instead? I think
>> there should be a more obvious indication of failure due to asset so
>> it does not go unnoticed.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>   tests/functional/qemu_test/asset.py | 9 +++------
>>   1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
>> index f0730695f09..3134ccb10da 100644
>> --- a/tests/functional/qemu_test/asset.py
>> +++ b/tests/functional/qemu_test/asset.py
>> @@ -174,14 +174,11 @@ def precache_test(test):
>>                   try:
>>                       asset.fetch()
>>                   except HTTPError as e:
>> -                    # Treat 404 as fatal, since it is highly likely to
>> -                    # indicate a broken test rather than a transient
>> -                    # server or networking problem
>> -                    if e.code == 404:
>> -                        raise
>> -
> 
> Why are you removing this ? The commit above does not make any reference
> to the problem being a missing URL (404 code). We want missing URLs to
> be fatal so that we identify when images we rely on are deleted by their
> host, as that is not a transient problem.
> 
>>                       log.debug(f"HTTP error {e.code} from {asset.url} " +
>>                                 "skipping asset precache")
>> +                except:
>> +                    log.debug(f"Error from {asset.url} " +
>> +                              "skipping asset precache")
> 
> So this is the bit that actually deals with the exception you show in
> the jobs above.
> 
> Best practice would be for us to define an 'AssetException' and use that
> in asset.py when raising exceptions, or to wrap other exceptions in cases
> where we propagate exceptions. Then this code can be more tailored to
> catch AssetException, instead of Exception.

At least we should distinguish between "HTTP server bailed out early" (in 
which case we should likely skip the test), and "checksum of the asset does 
not match" in which case we should rather fail the test since this is a hard 
error that needs to be tackled if the file has been changed on the server 
(otherwise this would go unnoticed and the test will never be run).
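
As a rough sketch of that split, assuming two hypothetical exception
types (ShortTransfer and ChecksumMismatch do not exist in asset.py) and
a hypothetical precache_one() helper:

```python
from urllib.error import HTTPError


class ShortTransfer(Exception):
    """Hypothetical: the server closed the connection before all data arrived."""


class ChecksumMismatch(Exception):
    """Hypothetical: a complete download whose hash differs from the expected one."""


def precache_one(fetch, log):
    """Return 'cached' or 'skipped'; let hard errors propagate."""
    try:
        fetch()
    except HTTPError as e:
        if e.code == 404:
            raise  # missing URL: a broken test, not a transient problem
        log(f"HTTP error {e.code}, skipping asset precache")
        return "skipped"
    except ShortTransfer as e:
        log(f"short transfer ({e}), skipping asset precache")
        return "skipped"
    # ChecksumMismatch is deliberately not caught, so it fails the run
    return "cached"
```

The point of the design is only that the transient cases are listed
explicitly, so anything not listed stays fatal by default.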

  Thomas
Daniel P. Berrangé March 11, 2025, 2:11 p.m. UTC | #3
On Tue, Mar 11, 2025 at 02:55:25PM +0100, Thomas Huth wrote:
> On 11/03/2025 14.37, Daniel P. Berrangé wrote:
> > On Tue, Mar 11, 2025 at 11:13:26PM +1000, Nicholas Piggin wrote:
> > > The NetBSD archive is currently failing part-way through downloads,
> > > which results in no clean HTTP error but a short transfer and checksum
> > > error. This is treated as fatal in the precache download, and it halts
> > > an entire set of tests even if some others could run.
> > > 
> > > I hacked up this patch to get a bunch of CI tests going again for ppc
> > > merge testing.
> > > 
> > > Don't treat any precaching failures as errors.
> > > This causes tests to be skipped when they try to fetch their asset.
> > > Some CI results before/after patching:
> > > 
> > > functional-system-fedora
> > > https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
> > > https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good
> > > 
> > > functional-system-debian
> > > https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
> > > https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good
> > > 
> > > This is making the tests skip. Is there a way to make the error more
> > > prominent / obvious in the output? Should they fail instead? I think
> > > there should be a more obvious indication of failure due to asset so
> > > it does not go unnoticed.
> > > 
> > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > > ---
> > >   tests/functional/qemu_test/asset.py | 9 +++------
> > >   1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
> > > index f0730695f09..3134ccb10da 100644
> > > --- a/tests/functional/qemu_test/asset.py
> > > +++ b/tests/functional/qemu_test/asset.py
> > > @@ -174,14 +174,11 @@ def precache_test(test):
> > >                   try:
> > >                       asset.fetch()
> > >                   except HTTPError as e:
> > > -                    # Treat 404 as fatal, since it is highly likely to
> > > -                    # indicate a broken test rather than a transient
> > > -                    # server or networking problem
> > > -                    if e.code == 404:
> > > -                        raise
> > > -
> > 
> > Why are you removing this ? The commit above does not make any reference
> > to the problem being a missing URL (404 code). We want missing URLs to
> > be fatal so that we identify when images we rely on are deleted by their
> > host, as that is not a transient problem.
> > 
> > >                       log.debug(f"HTTP error {e.code} from {asset.url} " +
> > >                                 "skipping asset precache")
> > > +                except:
> > > +                    log.debug(f"Error from {asset.url} " +
> > > +                              "skipping asset precache")
> > 
> > So this is the bit that actually deals with the exception you show in
> > the jobs above.
> > 
> > Best practice would be for us to define an 'AssetException' and use that
> > in asset.py when raising exceptions, or to wrap other exceptions in cases
> > where we propagate exceptions. Then this code can be more tailored to
> > catch AssetException, instead of Exception.
> 
> At least we should distinguish between "HTTP server bailed out early" (in
> which case we should likely skip the test), and "checksum of the asset does
> not match" in which case we should rather fail the test since this is a hard
> error that needs to be tackled if the file has been changed on the server
> (otherwise this would go unnoticed and the test will never be run).

In theory we are already distinguishing those cases. In the logs above,
the stack traces come from a part of the code that indicates we have
successfully downloaded the asset and now try to validate the result
and failed.

If the connection is terminating early, then our logic at:

                with tmp_cache_file.open("xb") as dst:
                    with urllib.request.urlopen(self.url) as resp:
                        copyfileobj(resp, dst)

does not appear to be seeing a fatal error, otherwise we would
have seen the log from

            except Exception as e:
                self.log.error("Unable to download %s: %s", self.url, e)
                tmp_cache_file.unlink()
                raise

and would never have got as far as comparing the checksums.

So I'm wondering if 'copyfileobj' is not reliable in detecting
premature termination of the http connection before all data
has been sent. 
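
One possible way to catch this, sketched under the assumption that the
server sends a Content-Length header: count the bytes written and
compare. copy_checked() is a hypothetical helper, not existing code:

```python
import io
from shutil import copyfileobj


def copy_checked(resp, dst, content_length):
    """Copy resp to dst and raise OSError if the byte count is wrong.

    content_length is the raw header value (str or None); in asset.py it
    could come from resp.headers.get("Content-Length"), which is an
    assumption about where to hook this in.
    """
    start = dst.tell()
    copyfileobj(resp, dst)
    received = dst.tell() - start
    if content_length is not None and received != int(content_length):
        raise OSError(f"short transfer: got {received} of {content_length} bytes")
    return received


# Simulated truncated response: 10 bytes announced, only 4 delivered
try:
    copy_checked(io.BytesIO(b"abcd"), io.BytesIO(), "10")
except OSError as e:
    print(e)
```

This would not help for responses without a Content-Length (chunked
encoding, for example), where only the checksum can tell.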

With regards,
Daniel
Nicholas Piggin March 11, 2025, 11:41 p.m. UTC | #4
On Tue Mar 11, 2025 at 11:37 PM AEST, Daniel P. Berrangé wrote:
> On Tue, Mar 11, 2025 at 11:13:26PM +1000, Nicholas Piggin wrote:
>> The NetBSD archive is currently failing part-way through downloads,
>> which results in no clean HTTP error but a short transfer and checksum
>> error. This is treated as fatal in the precache download, and it halts
>> an entire set of tests even if some others could run.
>> 
>> I hacked up this patch to get a bunch of CI tests going again for ppc
>> merge testing.
>> 
>> Don't treat any precaching failures as errors.
>> This causes tests to be skipped when they try to fetch their asset.
>> Some CI results before/after patching:
>> 
>> functional-system-fedora
>> https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
>> https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good
>> 
>> functional-system-debian
>> https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
>> https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good
>> 
>> This is making the tests skip. Is there a way to make the error more
>> prominent / obvious in the output? Should they fail instead? I think
>> there should be a more obvious indication of failure due to asset so
>> it does not go unnoticed.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>  tests/functional/qemu_test/asset.py | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>> 
>> diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
>> index f0730695f09..3134ccb10da 100644
>> --- a/tests/functional/qemu_test/asset.py
>> +++ b/tests/functional/qemu_test/asset.py
>> @@ -174,14 +174,11 @@ def precache_test(test):
>>                  try:
>>                      asset.fetch()
>>                  except HTTPError as e:
>> -                    # Treat 404 as fatal, since it is highly likely to
>> -                    # indicate a broken test rather than a transient
>> -                    # server or networking problem
>> -                    if e.code == 404:
>> -                        raise
>> -
>
> Why are you removing this ? The commit above does not make any reference
> to the problem being a missing URL (404 code). We want missing URLs to
> be fatal so that we identify when images we rely on are deleted by their
> host, as that is not a transient problem.

Yeah it was just a quick hack I guess so I didn't put a
complete changelog or get it quite to the point I wanted.

I think *no* precaching errors including 404 should cause
failure because you would still want other tests to proceed
(in some cases).

But the failure should be caught when the test case tries to
fetch the asset, so you can still easily identify the download
failure.

Thanks,
Nick
Nicholas Piggin March 12, 2025, 5:54 a.m. UTC | #5
On Wed Mar 12, 2025 at 12:11 AM AEST, Daniel P. Berrangé wrote:
> On Tue, Mar 11, 2025 at 02:55:25PM +0100, Thomas Huth wrote:
>> On 11/03/2025 14.37, Daniel P. Berrangé wrote:
>> > On Tue, Mar 11, 2025 at 11:13:26PM +1000, Nicholas Piggin wrote:
>> > > The NetBSD archive is currently failing part-way through downloads,
>> > > which results in no clean HTTP error but a short transfer and checksum
>> > > error. This is treated as fatal in the precache download, and it halts
>> > > an entire set of tests even if some others could run.
>> > > 
>> > > I hacked up this patch to get a bunch of CI tests going again for ppc
>> > > merge testing.
>> > > 
>> > > Don't treat any precaching failures as errors.
>> > > This causes tests to be skipped when they try to fetch their asset.
>> > > Some CI results before/after patching:
>> > > 
>> > > functional-system-fedora
>> > > https://gitlab.com/npiggin/qemu/-/jobs/9370860490 #bad
>> > > https://gitlab.com/npiggin/qemu/-/jobs/9373246826 #good
>> > > 
>> > > functional-system-debian
>> > > https://gitlab.com/npiggin/qemu/-/jobs/9370860479 #bad
>> > > https://gitlab.com/npiggin/qemu/-/jobs/9373246822 #good
>> > > 
>> > > This is making the tests skip. Is there a way to make the error more
>> > > prominent / obvious in the output? Should they fail instead? I think
>> > > there should be a more obvious indication of failure due to asset so
>> > > it does not go unnoticed.
>> > > 
>> > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> > > ---
>> > >   tests/functional/qemu_test/asset.py | 9 +++------
>> > >   1 file changed, 3 insertions(+), 6 deletions(-)
>> > > 
>> > > diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
>> > > index f0730695f09..3134ccb10da 100644
>> > > --- a/tests/functional/qemu_test/asset.py
>> > > +++ b/tests/functional/qemu_test/asset.py
>> > > @@ -174,14 +174,11 @@ def precache_test(test):
>> > >                   try:
>> > >                       asset.fetch()
>> > >                   except HTTPError as e:
>> > > -                    # Treat 404 as fatal, since it is highly likely to
>> > > -                    # indicate a broken test rather than a transient
>> > > -                    # server or networking problem
>> > > -                    if e.code == 404:
>> > > -                        raise
>> > > -
>> > 
>> > Why are you removing this ? The commit above does not make any reference
>> > to the problem being a missing URL (404 code). We want missing URLs to
>> > be fatal so that we identify when images we rely on are deleted by their
>> > host, as that is not a transient problem.
>> > 
>> > >                       log.debug(f"HTTP error {e.code} from {asset.url} " +
>> > >                                 "skipping asset precache")
>> > > +                except:
>> > > +                    log.debug(f"Error from {asset.url} " +
>> > > +                              "skipping asset precache")
>> > 
>> > So this is the bit that actually deals with the exception you show in
>> > the jobs above.
>> > 
>> > Best practice would be for us to define an 'AssetException' and use that
>> > in asset.py when raising exceptions, or to wrap other exceptions in cases
>> > where we propagate exceptions. Then this code can be more tailored to
>> > catch AssetException, instead of Exception.
>> 
>> At least we should distinguish between "HTTP server bailed out early" (in
>> which case we should likely skip the test), and "checksum of the asset does
>> not match" in which case we should rather fail the test since this is a hard
>> error that needs to be tackled if the file has been changed on the server
>> (otherwise this would go unnoticed and the test will never be run).

Yes, at least doing that would take care of this particular case. You
might like to have that case in the retry loop too, if possible.

I don't suppose the API can resume fetching from an offset and it might
be getting overkill to care so much, but that does allow the client to
see the error (wget attempts again after short xfer then gets a 503).

> In theory we are already distinguishing those cases. In the logs above,
> the stack traces come from a part of the code that indicates we have
> successfully downloaded the asset and now try to validate the result
> and failed.
>
> If the connection is terminating early, then our logic at:
>
>                 with tmp_cache_file.open("xb") as dst:
>                     with urllib.request.urlopen(self.url) as resp:
>                         copyfileobj(resp, dst)
>
> does not appear to be seeing a fatal error, otherwise we would
> have seen the log from
>
>             except Exception as e:
>                 self.log.error("Unable to download %s: %s", self.url, e)
>                 tmp_cache_file.unlink()
>                 raise
>
> and would never have got as far as comparing the checksums.
>
> So I'm wondering if 'copyfileobj' is not reliable in detecting
> premature termination of the http connection before all data
> has been sent. 

It seems not. I posted a new series that tries to handle it. And
adds a new asset specific exception.

Thanks,
Nick
Thomas Huth March 12, 2025, 6:36 a.m. UTC | #6
On 12/03/2025 00.41, Nicholas Piggin wrote:
...
> I think *no* precaching errors including 404 should cause
> failure because you would still want other tests to proceed
> (in some cases).
> 
> But the failure should be caught when the test case tries to
> fetch the asset, so you can still easily identify the download
> failure.

Sorry, I did not get that... if we ignore the 404 during the precaching 
step, how should the failure be caught when the test runs? The downloads are
disabled in that case, so the test cannot know whether the asset is not
available due to a 404 or any other reason...?

  Thomas
Nicholas Piggin March 12, 2025, 6:55 a.m. UTC | #7
On Wed Mar 12, 2025 at 4:36 PM AEST, Thomas Huth wrote:
> On 12/03/2025 00.41, Nicholas Piggin wrote:
> ...
>> I think *no* precaching errors including 404 should cause
>> failure because you would still want other tests to proceed
>> (in some cases).
>> 
>> But the failure should be caught when the test case tries to
>> fetch the asset, so you can still easily identify the download
>> failure.
>
> Sorry, I did not get that... if we ignore the 404 during the precaching 
> step, how should the failure be caught when the test runs? The downloads are
> disabled in that case, so the test cannot know whether the asset is not
> available due to a 404 or any other reason...?

Right, it would need something to get that error to the test case.
I had not thought too much about it; if there is disagreement about
it being a good solution then there is no point.

But some possibilities:
- downloads could be enabled even for pre-cache case
- the pre-cache step could build an in-memory dictionary of the
  assets
- some additional file name or attribute could be added to the
  pre-cache to convey status

I didn't follow the asset caching rework too closely so I could be
well off the mark, but I would look into the first option (enabling
downloads) first since it might be simplest. If that does not work,
maybe in-memory metadata of assets might not be too hard and could
help with flexibility in future.
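
To make the third possibility a bit more concrete, here is a sketch of a
status file written next to the cache. The file layout and both function
names are hypothetical; nothing like this exists in qemu_test today:

```python
import json
from pathlib import Path


def record_status(status_file, url, status, detail=""):
    """Store the precache outcome for url in a JSON sidecar file."""
    path = Path(status_file)
    data = json.loads(path.read_text()) if path.exists() else {}
    data[url] = {"status": status, "detail": detail}
    path.write_text(json.dumps(data, indent=2))


def lookup_status(status_file, url):
    """Return the recorded outcome for url, or None if nothing was recorded."""
    path = Path(status_file)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(url)
```

The test case could then call lookup_status() when an asset is missing
and fail on a recorded 404 but skip on a transient error. The sketch does
no locking, so parallel precache jobs would need more care.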

Thanks,
Nick
Patch

diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
index f0730695f09..3134ccb10da 100644
--- a/tests/functional/qemu_test/asset.py
+++ b/tests/functional/qemu_test/asset.py
@@ -174,14 +174,11 @@  def precache_test(test):
                 try:
                     asset.fetch()
                 except HTTPError as e:
-                    # Treat 404 as fatal, since it is highly likely to
-                    # indicate a broken test rather than a transient
-                    # server or networking problem
-                    if e.code == 404:
-                        raise
-
                     log.debug(f"HTTP error {e.code} from {asset.url} " +
                               "skipping asset precache")
+                except:
+                    log.debug(f"Error from {asset.url} " +
+                              "skipping asset precache")
 
         log.removeHandler(handler)