diff mbox series

[v1,bpf] selftests/bpf: Add exponential backoff to map_update_retriable in test_maps

Message ID 20210816175250.296110-1-fallentree@fb.com (mailing list archive)
State Accepted
Delegated to: BPF
Headers show
Series [v1,bpf] selftests/bpf: Add exponential backoff to map_update_retriable in test_maps | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers warning 10 maintainers not CCed: songliubraving@fb.com shuah@kernel.org daniel@iogearbox.net kafai@fb.com netdev@vger.kernel.org ast@kernel.org john.fastabend@gmail.com linux-kselftest@vger.kernel.org kpsingh@kernel.org yhs@fb.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link

Commit Message

Yucong Sun Aug. 16, 2021, 5:52 p.m. UTC
Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
action CI system. This patch adds exponential backoff with a cap of 50ms, to
reduce the flakyness of the test.

Signed-off-by: Yucong Sun <fallentree@fb.com>
---
 tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Song Liu Aug. 16, 2021, 11:28 p.m. UTC | #1
On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote:
>
> Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> action CI system. This patch adds exponential backoff with a cap of 50ms, to
> reduce the flakyness of the test.

Do we have data showing how flaky the test is before and after this change?

>
> Signed-off-by: Yucong Sun <fallentree@fb.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 14cea869235b..ed92d56c19cf 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
>  static int map_update_retriable(int map_fd, const void *key, const void *value,
>                                 int flags, int attempts)
>  {
> +       int delay = 1;
> +
>         while (bpf_map_update_elem(map_fd, key, value, flags)) {
>                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
>                         return -errno;
>
> -               usleep(1);
> +               if (delay < 50)
> +                       delay *= 2;
> +
> +               usleep(delay);

It is a little weird that the delay times in microseconds are 2, 4, 8,
16, 32, 64, 64, ...
Maybe just use rand()?

Thanks,
Song

>                 attempts--;
>         }
>
> --
> 2.30.2
>
sunyucong@gmail.com Aug. 16, 2021, 11:45 p.m. UTC | #2
On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote:
>
> On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote:
> >
> > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> > action CI system. This patch adds exponential backoff with a cap of 50ms, to
> > reduce the flakiness of the test.
>
> Do we have data showing how flaky the test is before and after this change?

Before the change, on 2 CPU KVM on my laptop the test is perfectly
fine, on Github action (2 emulated CPU) , it appeared to fail roughly
1 in 10 runs or even more frequently.
After the change, it appears pretty robust both on my laptop and on
github action, I ran the github action a couple times and it succeeded
every time.

>
> >
> > Signed-off-by: Yucong Sun <fallentree@fb.com>
> > ---
> >  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> > index 14cea869235b..ed92d56c19cf 100644
> > --- a/tools/testing/selftests/bpf/test_maps.c
> > +++ b/tools/testing/selftests/bpf/test_maps.c
> > @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
> >  static int map_update_retriable(int map_fd, const void *key, const void *value,
> >                                 int flags, int attempts)
> >  {
> > +       int delay = 1;
> > +
> >         while (bpf_map_update_elem(map_fd, key, value, flags)) {
> >                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
> >                         return -errno;
> >
> > -               usleep(1);
> > +               if (delay < 50)
> > +                       delay *= 2;
> > +
> > +               usleep(delay);
>
> It is a little weird that the delay times in microseconds are 2, 4, 8,
> 16, 32, 64, 64, ...
> Maybe just use rand()?

map_update_retriable is called by test_map_update() , which is being
parallel executed in 1024 threads, so the lock contention is
intentional, I think if we introduce randomness in the delay it kind
of defeats the purpose of the test.
My original proposal is to just increase the attempts to 10X , Andrii
proposed to use an exponential back-off, which is what I ended up
implementing.

>
> Thanks,
> Song
>
> >                 attempts--;
> >         }
> >
> > --
> > 2.30.2
> >
Song Liu Aug. 17, 2021, 12:02 a.m. UTC | #3
On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote:
> > >
> > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> > > action CI system. This patch adds exponential backoff with a cap of 50ms, to
> > > reduce the flakiness of the test.
> >
> > Do we have data showing how flaky the test is before and after this change?
>
> Before the change, on 2 CPU KVM on my laptop the test is perfectly
> fine, on Github action (2 emulated CPU) , it appeared to fail roughly
> 1 in 10 runs or even more frequently.
> After the change, it appears pretty robust both on my laptop and on
> github action, I ran the github action a couple times and it succeeded
> every time.

Thanks for the data!

We should include this in the commit log. Maybe the maintainer could just
amend it when applying the patch.

>
> >
> > >
> > > Signed-off-by: Yucong Sun <fallentree@fb.com>
> > > ---
> > >  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> > > index 14cea869235b..ed92d56c19cf 100644
> > > --- a/tools/testing/selftests/bpf/test_maps.c
> > > +++ b/tools/testing/selftests/bpf/test_maps.c
> > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
> > >  static int map_update_retriable(int map_fd, const void *key, const void *value,
> > >                                 int flags, int attempts)
> > >  {
> > > +       int delay = 1;
> > > +
> > >         while (bpf_map_update_elem(map_fd, key, value, flags)) {
> > >                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
> > >                         return -errno;
> > >
> > > -               usleep(1);
> > > +               if (delay < 50)
> > > +                       delay *= 2;
> > > +
> > > +               usleep(delay);
> >
> > It is a little weird that the delay times in microseconds are 2, 4, 8,
> > 16, 32, 64, 64, ...
> > Maybe just use rand()?
>
> map_update_retriable is called by test_map_update() , which is being
> parallel executed in 1024 threads, so the lock contention is
> intentional, I think if we introduce randomness in the delay it kind
> of defeats the purpose of the test.
> My original proposal is to just increase the attempts to 10X , Andrii
> proposed to use an exponential back-off, which is what I ended up
> implementing.

If that is what we agreed on, it works.

Acked-by: Song Liu <songliubraving@fb.com>
Andrii Nakryiko Aug. 17, 2021, 2:19 a.m. UTC | #4
On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote:
> > >
> > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> > > action CI system. This patch adds exponential backoff with a cap of 50ms, to
> > > reduce the flakiness of the test.
> >
> > Do we have data showing how flaky the test is before and after this change?
>
> Before the change, on 2 CPU KVM on my laptop the test is perfectly
> fine, on Github action (2 emulated CPU) , it appeared to fail roughly
> 1 in 10 runs or even more frequently.
> After the change, it appears pretty robust both on my laptop and on
> github action, I ran the github action a couple times and it succeeded
> every time.
>
> >
> > >
> > > Signed-off-by: Yucong Sun <fallentree@fb.com>
> > > ---
> > >  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> > > index 14cea869235b..ed92d56c19cf 100644
> > > --- a/tools/testing/selftests/bpf/test_maps.c
> > > +++ b/tools/testing/selftests/bpf/test_maps.c
> > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
> > >  static int map_update_retriable(int map_fd, const void *key, const void *value,
> > >                                 int flags, int attempts)
> > >  {
> > > +       int delay = 1;
> > > +
> > >         while (bpf_map_update_elem(map_fd, key, value, flags)) {
> > >                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
> > >                         return -errno;
> > >
> > > -               usleep(1);
> > > +               if (delay < 50)
> > > +                       delay *= 2;
> > > +
> > > +               usleep(delay);
> >
> > It is a little weird that the delay times in microseconds are 2, 4, 8,
> > 16, 32, 64, 64, ...
> > Maybe just use rand()?
>
> map_update_retriable is called by test_map_update() , which is being
> parallel executed in 1024 threads, so the lock contention is
> intentional, I think if we introduce randomness in the delay it kind
> of defeats the purpose of the test.

Not really, the purpose of the test is to test a lot of concurrent
updates with some collisions. It doesn't matter if there is one
collision or 20. We will still get an initial collision (we only
usleep() after the initial attempt fails with EAGAIN or EBUSY) even
with randomized initial sleep *after* the first collision, but after
that we should make sure that test eventually completes, so for that
random initial delay is a good thing.

I reworked the code a bit, added initial random delay between [0ms,
5ms), and then actually capped delay under 50ms (it can't go to
>50ms). Note also that your code is actually working with microseconds
(usleep() takes microseconds), while commit was actually talking about
milliseconds.

Applied to bpf-next, thanks.

> My original proposal is to just increase the attempts to 10X , Andrii
> proposed to use an exponential back-off, which is what I ended up
> implementing.
>
> >
> > Thanks,
> > Song
> >
> > >                 attempts--;
> > >         }
> > >
> > > --
> > > 2.30.2
> > >
Andrii Nakryiko Aug. 17, 2021, 4:15 a.m. UTC | #5
On Mon, Aug 16, 2021 at 7:19 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote:
> >
> > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote:
> > >
> > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote:
> > > >
> > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to
> > > > reduce the flakiness of the test.
> > >
> > > Do we have data showing how flaky the test is before and after this change?
> >
> > Before the change, on 2 CPU KVM on my laptop the test is perfectly
> > fine, on Github action (2 emulated CPU) , it appeared to fail roughly
> > 1 in 10 runs or even more frequently.
> > After the change, it appears pretty robust both on my laptop and on
> > github action, I ran the github action a couple times and it succeeded
> > every time.
> >
> > >
> > > >
> > > > Signed-off-by: Yucong Sun <fallentree@fb.com>
> > > > ---
> > > >  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
> > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> > > > index 14cea869235b..ed92d56c19cf 100644
> > > > --- a/tools/testing/selftests/bpf/test_maps.c
> > > > +++ b/tools/testing/selftests/bpf/test_maps.c
> > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
> > > >  static int map_update_retriable(int map_fd, const void *key, const void *value,
> > > >                                 int flags, int attempts)
> > > >  {
> > > > +       int delay = 1;
> > > > +
> > > >         while (bpf_map_update_elem(map_fd, key, value, flags)) {
> > > >                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
> > > >                         return -errno;
> > > >
> > > > -               usleep(1);
> > > > +               if (delay < 50)
> > > > +                       delay *= 2;
> > > > +
> > > > +               usleep(delay);
> > >
> > > It is a little weird that the delay times in microseconds are 2, 4, 8,
> > > 16, 32, 64, 64, ...
> > > Maybe just use rand()?
> >
> > map_update_retriable is called by test_map_update() , which is being
> > parallel executed in 1024 threads, so the lock contention is
> > intentional, I think if we introduce randomness in the delay it kind
> > of defeats the purpose of the test.
>
> Not really, the purpose of the test is to test a lot of concurrent
> updates with some collisions. It doesn't matter if there is one
> collision or 20. We will still get an initial collision (we only
> usleep() after the initial attempt fails with EAGAIN or EBUSY) even
> with randomized initial sleep *after* the first collision, but after
> that we should make sure that test eventually completes, so for that
> random initial delay is a good thing.
>
> I reworked the code a bit, added initial random delay between [0ms,
> 5ms), and then actually capped delay under 50ms (it can't go to
> >50ms). Note also that your code is actually working with microseconds
> (usleep() takes microseconds), while commit was actually talking about
> milliseconds.
>
> Applied to bpf-next, thanks.

Fun, two out of three test runs now fail with (see [0], for an example):

  error -16 16
  test_maps: test_maps.c:1374: test_update_delete: Assertion `err == 0' failed.
  test_maps: test_maps.c:1312: __run_parallel: Assertion `status == 0' failed.
  ./libbpf/travis-ci/vmtest/run_selftests.sh: line 19:  1226 Aborted
              ./test_maps


Seems like we need the same trick for bpf_map_delete_elem() calls?..


  [0] https://github.com/kernel-patches/bpf/pull/1632/checks?check_run_id=3345916773

>
> > My original proposal is to just increase the attempts to 10X , Andrii
> > proposed to use an exponential back-off, which is what I ended up
> > implementing.
> >
> > >
> > > Thanks,
> > > Song
> > >
> > > >                 attempts--;
> > > >         }
> > > >
> > > > --
> > > > 2.30.2
> > > >
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 14cea869235b..ed92d56c19cf 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1400,11 +1400,16 @@  static void test_map_stress(void)
 static int map_update_retriable(int map_fd, const void *key, const void *value,
 				int flags, int attempts)
 {
+	int delay = 1;
+
 	while (bpf_map_update_elem(map_fd, key, value, flags)) {
 		if (!attempts || (errno != EAGAIN && errno != EBUSY))
 			return -errno;
 
-		usleep(1);
+		if (delay < 50)
+			delay *= 2;
+
+		usleep(delay);
 		attempts--;
 	}