Message ID | 20210816175250.296110-1-fallentree@fb.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | BPF |
Headers | show |
Series | [v1,bpf] selftests/bpf: Add exponential backoff to map_update_retriable in test_maps | expand |
On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote: > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > action CI system. This patch adds exponential backoff with a cap of 50ms, to > reduce the flakyness of the test. Do we have data showing how flaky the test is before and after this change? > > Signed-off-by: Yucong Sun <fallentree@fb.com> > --- > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > index 14cea869235b..ed92d56c19cf 100644 > --- a/tools/testing/selftests/bpf/test_maps.c > +++ b/tools/testing/selftests/bpf/test_maps.c > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > static int map_update_retriable(int map_fd, const void *key, const void *value, > int flags, int attempts) > { > + int delay = 1; > + > while (bpf_map_update_elem(map_fd, key, value, flags)) { > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > return -errno; > > - usleep(1); > + if (delay < 50) > + delay *= 2; > + > + usleep(delay); It is a little weird that the delay times in microseconds are 2, 4, 8, 16, 32, 64, 64, ... Maybe just use rand()? Thanks, Song > attempts--; > } > > -- > 2.30.2 >
On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote: > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote: > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > reduce the flakiness of the test. > > Do we have data showing how flaky the test is before and after this change? Before the change, on 2 CPU KVM on my laptop the test is perfectly fine, on Github action (2 emulated CPU) , it appeared to fail roughly 1 in 10 runs or even more frequently. After the change, it appears pretty robust both on my laptop and on github action, I ran the github action a couple times and it succeeded every time. > > > > > Signed-off-by: Yucong Sun <fallentree@fb.com> > > --- > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > index 14cea869235b..ed92d56c19cf 100644 > > --- a/tools/testing/selftests/bpf/test_maps.c > > +++ b/tools/testing/selftests/bpf/test_maps.c > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > int flags, int attempts) > > { > > + int delay = 1; > > + > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > return -errno; > > > > - usleep(1); > > + if (delay < 50) > > + delay *= 2; > > + > > + usleep(delay); > > It is a little weird that the delay times in microseconds are 2, 4, 8, > 16, 32, 64, 64, ... > Maybe just use rand()? map_update_retriable is called by test_map_update() , which is being parallel executed in 1024 threads, so the lock contention is intentional, I think if we introduce randomness in the delay it kind of defeats the purpose of the test. My original proposal is to just increase the attempts to 10X , Andrii proposed to use an exponential back-off, which is what I ended up implementing. > > Thanks, > Song > > > attempts--; > > } > > > > -- > > 2.30.2 > >
On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote: > > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote: > > > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote: > > > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > > reduce the flakiness of the test. > > > > Do we have data showing how flaky the test is before and after this change? > > Before the change, on 2 CPU KVM on my laptop the test is perfectly > fine, on Github action (2 emulated CPU) , it appeared to fail roughly > 1 in 10 runs or even more frequently. > After the change, it appears pretty robust both on my laptop and on > github action, I ran the github action a couple times and it succeeded > every time. Thanks for the data! We should include this in the commit log. Maybe the maintainer could just amend it when applying the patch. > > > > > > > > > Signed-off-by: Yucong Sun <fallentree@fb.com> > > > --- > > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > > index 14cea869235b..ed92d56c19cf 100644 > > > --- a/tools/testing/selftests/bpf/test_maps.c > > > +++ b/tools/testing/selftests/bpf/test_maps.c > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > > int flags, int attempts) > > > { > > > + int delay = 1; > > > + > > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > > return -errno; > > > > > > - usleep(1); > > > + if (delay < 50) > > > + delay *= 2; > > > + > > > + usleep(delay); > > > > It is a little weird that the delay times in microseconds are 2, 4, 8, > > 16, 32, 64, 64, ... > > Maybe just use rand()? > > map_update_retriable is called by test_map_update() , which is being > parallel executed in 1024 threads, so the lock contention is > intentional, I think if we introduce randomness in the delay it kind > of defeats the purpose of the test. > My original proposal is to just increase the attempts to 10X , Andrii > proposed to use an exponential back-off, which is what I ended up > implementing. If that is what we agreed on, it works. Acked-by: Song Liu <songliubraving@fb.com>
On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote: > > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote: > > > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote: > > > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > > reduce the flakiness of the test. > > > > Do we have data showing how flaky the test is before and after this change? > > Before the change, on 2 CPU KVM on my laptop the test is perfectly > fine, on Github action (2 emulated CPU) , it appeared to fail roughly > 1 in 10 runs or even more frequently. > After the change, it appears pretty robust both on my laptop and on > github action, I ran the github action a couple times and it succeeded > every time. > > > > > > > > > Signed-off-by: Yucong Sun <fallentree@fb.com> > > > --- > > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > > index 14cea869235b..ed92d56c19cf 100644 > > > --- a/tools/testing/selftests/bpf/test_maps.c > > > +++ b/tools/testing/selftests/bpf/test_maps.c > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > > int flags, int attempts) > > > { > > > + int delay = 1; > > > + > > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > > return -errno; > > > > > > - usleep(1); > > > + if (delay < 50) > > > + delay *= 2; > > > + > > > + usleep(delay); > > > > It is a little weird that the delay times in microseconds are 2, 4, 8, > > 16, 32, 64, 64, ... > > Maybe just use rand()? > > map_update_retriable is called by test_map_update() , which is being > parallel executed in 1024 threads, so the lock contention is > intentional, I think if we introduce randomness in the delay it kind > of defeats the purpose of the test. Not really, the purpose of the test is to test a lot of concurrent updates with some collisions. It doesn't matter if there is one collision or 20. We will still get an initial collision (we only usleep() after the initial attempt fails with EAGAIN or EBUSY) even with randomized initial sleep *after* the first collision, but after that we should make sure that test eventually completes, so for that random initial delay is a good thing. I reworked the code a bit, added initial random delay between [0ms, 5ms), and then actually capped delay under 50ms (it can't go to >50ms). Note also that your code is actually working with microseconds (usleep() takes microseconds), while commit was actually talking about milliseconds. Applied to bpf-next, thanks. > My original proposal is to just increase the attempts to 10X , Andrii > proposed to use an exponential back-off, which is what I ended up > implementing. > > > > > Thanks, > > Song > > > > > attempts--; > > > } > > > > > > -- > > > 2.30.2 > > >
On Mon, Aug 16, 2021 at 7:19 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, Aug 16, 2021 at 4:45 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote: > > > > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@kernel.org> wrote: > > > > > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@fb.com> wrote: > > > > > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > > > reduce the flakiness of the test. > > > > > > Do we have data showing how flaky the test is before and after this change? > > > > Before the change, on 2 CPU KVM on my laptop the test is perfectly > > fine, on Github action (2 emulated CPU) , it appeared to fail roughly > > 1 in 10 runs or even more frequently. > > After the change, it appears pretty robust both on my laptop and on > > github action, I ran the github action a couple times and it succeeded > > every time. > > > > > > > > > > > > > Signed-off-by: Yucong Sun <fallentree@fb.com> > > > > --- > > > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > > > index 14cea869235b..ed92d56c19cf 100644 > > > > --- a/tools/testing/selftests/bpf/test_maps.c > > > > +++ b/tools/testing/selftests/bpf/test_maps.c > > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > > > int flags, int attempts) > > > > { > > > > + int delay = 1; > > > > + > > > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > > > return -errno; > > > > > > > > - usleep(1); > > > > + if (delay < 50) > > > > + delay *= 2; > > > > + > > > > + usleep(delay); > > > > > > It is a little weird that the delay times in microseconds are 2, 4, 8, > > > 16, 32, 64, 64, ... > > > Maybe just use rand()? > > > > map_update_retriable is called by test_map_update() , which is being > > parallel executed in 1024 threads, so the lock contention is > > intentional, I think if we introduce randomness in the delay it kind > > of defeats the purpose of the test. > > Not really, the purpose of the test is to test a lot of concurrent > updates with some collisions. It doesn't matter if there is one > collision or 20. We will still get an initial collision (we only > usleep() after the initial attempt fails with EAGAIN or EBUSY) even > with randomized initial sleep *after* the first collision, but after > that we should make sure that test eventually completes, so for that > random initial delay is a good thing. > > I reworked the code a bit, added initial random delay between [0ms, > 5ms), and then actually capped delay under 50ms (it can't go to > >50ms). Note also that your code is actually working with microseconds > (usleep() takes microseconds), while commit was actually talking about > milliseconds. > > Applied to bpf-next, thanks. Fun, two out of three test runs now fail with (see [0], for an example): error -16 16 test_maps: test_maps.c:1374: test_update_delete: Assertion `err == 0' failed. test_maps: test_maps.c:1312: __run_parallel: Assertion `status == 0' failed. ./libbpf/travis-ci/vmtest/run_selftests.sh: line 19: 1226 Aborted ./test_maps Seems like we need the same trick for bpf_map_delete_elem() calls?.. [0] https://github.com/kernel-patches/bpf/pull/1632/checks?check_run_id=3345916773 > > > My original proposal is to just increase the attempts to 10X , Andrii > > proposed to use an exponential back-off, which is what I ended up > > implementing. > > > > > > > > Thanks, > > > Song > > > > > > > attempts--; > > > > } > > > > > > > > -- > > > > 2.30.2 > > > >
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index 14cea869235b..ed92d56c19cf 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -1400,11 +1400,16 @@ static void test_map_stress(void) static int map_update_retriable(int map_fd, const void *key, const void *value, int flags, int attempts) { + int delay = 1; + while (bpf_map_update_elem(map_fd, key, value, flags)) { if (!attempts || (errno != EAGAIN && errno != EBUSY)) return -errno; - usleep(1); + if (delay < 50) + delay *= 2; + + usleep(delay); attempts--; }
Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github action CI system. This patch adds exponential backoff with a cap of 50ms, to reduce the flakyness of the test. Signed-off-by: Yucong Sun <fallentree@fb.com> --- tools/testing/selftests/bpf/test_maps.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)