Message ID | df69012695c7094ccb1943ca02b4920db3537466.1644421921.git.fmaurer@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 4a11678f683814df82fca9018d964771e02d7e6d |
Delegated to: | BPF |
Headers | show |
Series | [bpf-next,v2] bpf: Do not try bpf_msg_push_data with len 0 | expand |
On 2/9/22 7:55 AM, Felix Maurer wrote: > If bpf_msg_push_data is called with len 0 (as it happens during > selftests/bpf/test_sockmap), we do not need to do anything and can > return early. > > Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM > error: we later called get_order(copy + len); if len was 0, copy + len > was also often 0 and get_order returned some undefined value (at the > moment 52). alloc_pages caught that and failed, but then > bpf_msg_push_data returned ENOMEM. This was wrong because we are most > probably not out of memory and actually do not need any additional > memory. > > v2: Add bug description and Fixes tag > > Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") > Signed-off-by: Felix Maurer <fmaurer@redhat.com> LGTM. I am wondering why bpf CI didn't catch this problem. Did you modified the test with length 0 in order to trigger that? If this is the case, it would be great you can add such a test to the test_sockmap. Acked-by: Yonghong Song <yhs@fb.com>
On 09.02.22 18:06, Yonghong Song wrote: > On 2/9/22 7:55 AM, Felix Maurer wrote: >> If bpf_msg_push_data is called with len 0 (as it happens during >> selftests/bpf/test_sockmap), we do not need to do anything and can >> return early. >> >> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM >> error: we later called get_order(copy + len); if len was 0, copy + len >> was also often 0 and get_order returned some undefined value (at the >> moment 52). alloc_pages caught that and failed, but then >> bpf_msg_push_data returned ENOMEM. This was wrong because we are most >> probably not out of memory and actually do not need any additional >> memory. >> >> v2: Add bug description and Fixes tag >> >> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") >> Signed-off-by: Felix Maurer <fmaurer@redhat.com> > > LGTM. I am wondering why bpf CI didn't catch this problem. Did you > modified the test with length 0 in order to trigger that? If this > is the case, it would be great you can add such a test to the > test_sockmap. I did not modify the tests to trigger that. The state of the selftests around that is unfortunately not very good. There is no explicit test with length 0 but bpf_msg_push_data is still called with length 0, because of what I consider to be bugs in the test. On the other hand, explicit tests with other lengths are sometimes not called as well. I'll elaborate on that in a bit. Something easy to fix is that the tests do not check the return value of bpf_msg_push_data which they probably should. That may have helped find the problem earlier. Now to the issue mentioned in the beginning: Only some of the BPF programs used in test_sockmap actually call bpf_msg_push_data. However, they are not always attached, just for particular scenarios: txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, bpf_msg_push_data is never called. This happens for example in test_txmsg_push. Out of the four defined tests only one actually calls the helper. But after a test, the parameters in the map are reset to 0 (instead of being removed). Therefore, when the maps are reused in a subsequent test which is one of the scenarios above, the values are present and bpf_msg_push_data is called, albeit with the parameters set to 0. This is also what triggered the wrong behavior fixed in the patch. Unfortunately, I do not have the time to fix these issues in the test at the moment. > Acked-by: Yonghong Song <yhs@fb.com> Thanks!
On 2/10/22 7:45 AM, Felix Maurer wrote: > On 09.02.22 18:06, Yonghong Song wrote: >> On 2/9/22 7:55 AM, Felix Maurer wrote: >>> If bpf_msg_push_data is called with len 0 (as it happens during >>> selftests/bpf/test_sockmap), we do not need to do anything and can >>> return early. >>> >>> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM >>> error: we later called get_order(copy + len); if len was 0, copy + len >>> was also often 0 and get_order returned some undefined value (at the >>> moment 52). alloc_pages caught that and failed, but then >>> bpf_msg_push_data returned ENOMEM. This was wrong because we are most >>> probably not out of memory and actually do not need any additional >>> memory. >>> >>> v2: Add bug description and Fixes tag >>> >>> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") >>> Signed-off-by: Felix Maurer <fmaurer@redhat.com> >> >> LGTM. I am wondering why bpf CI didn't catch this problem. Did you >> modified the test with length 0 in order to trigger that? If this >> is the case, it would be great you can add such a test to the >> test_sockmap. > > I did not modify the tests to trigger that. The state of the selftests > around that is unfortunately not very good. There is no explicit test > with length 0 but bpf_msg_push_data is still called with length 0, > because of what I consider to be bugs in the test. On the other hand, > explicit tests with other lengths are sometimes not called as well. I'll > elaborate on that in a bit. > > Something easy to fix is that the tests do not check the return value of > bpf_msg_push_data which they probably should. That may have helped find > the problem earlier. > > Now to the issue mentioned in the beginning: Only some of the BPF > programs used in test_sockmap actually call bpf_msg_push_data. However, > they are not always attached, just for particular scenarios: > txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, > bpf_msg_push_data is never called. This happens for example in > test_txmsg_push. Out of the four defined tests only one actually calls > the helper. > > But after a test, the parameters in the map are reset to 0 (instead of > being removed). Therefore, when the maps are reused in a subsequent test > which is one of the scenarios above, the values are present and > bpf_msg_push_data is called, albeit with the parameters set to 0. This > is also what triggered the wrong behavior fixed in the patch. > > Unfortunately, I do not have the time to fix these issues in the test at > the moment. Thanks for detailed explanation. Maybe for the immediate case, can you just fix this in the selftest, > Something easy to fix is that the tests do not check the return value of > bpf_msg_push_data which they probably should. That may have helped find > the problem earlier. This will be enough to verify your kernel change as without it the test will fail. The rest of test improvements can come later. > >> Acked-by: Yonghong Song <yhs@fb.com> > > Thanks! >
On Thu, Feb 10, 2022 at 10:05 AM Yonghong Song <yhs@fb.com> wrote: > > > > On 2/10/22 7:45 AM, Felix Maurer wrote: > > On 09.02.22 18:06, Yonghong Song wrote: > >> On 2/9/22 7:55 AM, Felix Maurer wrote: > >>> If bpf_msg_push_data is called with len 0 (as it happens during > >>> selftests/bpf/test_sockmap), we do not need to do anything and can > >>> return early. > >>> > >>> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM > >>> error: we later called get_order(copy + len); if len was 0, copy + len > >>> was also often 0 and get_order returned some undefined value (at the > >>> moment 52). alloc_pages caught that and failed, but then > >>> bpf_msg_push_data returned ENOMEM. This was wrong because we are most > >>> probably not out of memory and actually do not need any additional > >>> memory. > >>> > >>> v2: Add bug description and Fixes tag > >>> > >>> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") > >>> Signed-off-by: Felix Maurer <fmaurer@redhat.com> > >> > >> LGTM. I am wondering why bpf CI didn't catch this problem. Did you > >> modified the test with length 0 in order to trigger that? If this > >> is the case, it would be great you can add such a test to the > >> test_sockmap. > > > > I did not modify the tests to trigger that. The state of the selftests > > around that is unfortunately not very good. There is no explicit test > > with length 0 but bpf_msg_push_data is still called with length 0, > > because of what I consider to be bugs in the test. On the other hand, > > explicit tests with other lengths are sometimes not called as well. I'll > > elaborate on that in a bit. > > > > Something easy to fix is that the tests do not check the return value of > > bpf_msg_push_data which they probably should. That may have helped find > > the problem earlier. > > > > Now to the issue mentioned in the beginning: Only some of the BPF > > programs used in test_sockmap actually call bpf_msg_push_data. However, > > they are not always attached, just for particular scenarios: > > txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, > > bpf_msg_push_data is never called. This happens for example in > > test_txmsg_push. Out of the four defined tests only one actually calls > > the helper. > > > > But after a test, the parameters in the map are reset to 0 (instead of > > being removed). Therefore, when the maps are reused in a subsequent test > > which is one of the scenarios above, the values are present and > > bpf_msg_push_data is called, albeit with the parameters set to 0. This > > is also what triggered the wrong behavior fixed in the patch. > > > > Unfortunately, I do not have the time to fix these issues in the test at > > the moment. > > Thanks for detailed explanation. Maybe for the immediate case, can you > just fix this in the selftest, > > > Something easy to fix is that the tests do not check the return > value of > > bpf_msg_push_data which they probably should. That may have helped find > > the problem earlier. > > This will be enough to verify your kernel change as without it the > test will fail. > > The rest of test improvements can come later. John, what is your take on this fix? bpf tree material?
Alexei Starovoitov wrote: > On Thu, Feb 10, 2022 at 10:05 AM Yonghong Song <yhs@fb.com> wrote: > > > > > > > > On 2/10/22 7:45 AM, Felix Maurer wrote: > > > On 09.02.22 18:06, Yonghong Song wrote: > > >> On 2/9/22 7:55 AM, Felix Maurer wrote: > > >>> If bpf_msg_push_data is called with len 0 (as it happens during > > >>> selftests/bpf/test_sockmap), we do not need to do anything and can > > >>> return early. > > >>> > > >>> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM > > >>> error: we later called get_order(copy + len); if len was 0, copy + len > > >>> was also often 0 and get_order returned some undefined value (at the > > >>> moment 52). alloc_pages caught that and failed, but then > > >>> bpf_msg_push_data returned ENOMEM. This was wrong because we are most > > >>> probably not out of memory and actually do not need any additional > > >>> memory. > > >>> > > >>> v2: Add bug description and Fixes tag > > >>> > > >>> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") > > >>> Signed-off-by: Felix Maurer <fmaurer@redhat.com> > > >> > > >> LGTM. I am wondering why bpf CI didn't catch this problem. Did you > > >> modified the test with length 0 in order to trigger that? If this > > >> is the case, it would be great you can add such a test to the > > >> test_sockmap. > > > > > > I did not modify the tests to trigger that. The state of the selftests > > > around that is unfortunately not very good. There is no explicit test > > > with length 0 but bpf_msg_push_data is still called with length 0, > > > because of what I consider to be bugs in the test. On the other hand, > > > explicit tests with other lengths are sometimes not called as well. I'll > > > elaborate on that in a bit. > > > > > > Something easy to fix is that the tests do not check the return value of > > > bpf_msg_push_data which they probably should. That may have helped find > > > the problem earlier. > > > > > > Now to the issue mentioned in the beginning: Only some of the BPF > > > programs used in test_sockmap actually call bpf_msg_push_data. However, > > > they are not always attached, just for particular scenarios: > > > txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, > > > bpf_msg_push_data is never called. This happens for example in > > > test_txmsg_push. Out of the four defined tests only one actually calls > > > the helper. > > > > > > But after a test, the parameters in the map are reset to 0 (instead of > > > being removed). Therefore, when the maps are reused in a subsequent test > > > which is one of the scenarios above, the values are present and > > > bpf_msg_push_data is called, albeit with the parameters set to 0. This > > > is also what triggered the wrong behavior fixed in the patch. > > > > > > Unfortunately, I do not have the time to fix these issues in the test at > > > the moment. > > > > Thanks for detailed explanation. Maybe for the immediate case, can you > > just fix this in the selftest, > > > > > Something easy to fix is that the tests do not check the return > > value of > > > bpf_msg_push_data which they probably should. That may have helped find > > > the problem earlier. > > > > This will be enough to verify your kernel change as without it the > > test will fail. > > > > The rest of test improvements can come later. > > John, > what is your take on this fix? Fix looks good its nice to return 0 here instead of ENOMEM as a result of paticulars of passing 0 to get_order(). Ack for me. > bpf tree material? I checked our code here and we would never pass '0' to pull data. Its hard to imagine what type of code would do that, but on the other hand its a bug and its only rc3... I've no strong opinion if I wrote the patch I would have pointed it at bpf tree so slight preference to push it as a fix. Acked-by: John Fastabend <john.fastabend@gmail.com>
Hello: This patch was applied to bpf/bpf.git (master) by Daniel Borkmann <daniel@iogearbox.net>: On Wed, 9 Feb 2022 16:55:26 +0100 you wrote: > If bpf_msg_push_data is called with len 0 (as it happens during > selftests/bpf/test_sockmap), we do not need to do anything and can > return early. > > Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM > error: we later called get_order(copy + len); if len was 0, copy + len > was also often 0 and get_order returned some undefined value (at the > moment 52). alloc_pages caught that and failed, but then > bpf_msg_push_data returned ENOMEM. This was wrong because we are most > probably not out of memory and actually do not need any additional > memory. > > [...] Here is the summary with links: - [bpf-next,v2] bpf: Do not try bpf_msg_push_data with len 0 https://git.kernel.org/bpf/bpf/c/4a11678f6838 You are awesome, thank you!
On 10.02.22 19:04, Yonghong Song wrote: > On 2/10/22 7:45 AM, Felix Maurer wrote: >> On 09.02.22 18:06, Yonghong Song wrote: >>> On 2/9/22 7:55 AM, Felix Maurer wrote: >>>> If bpf_msg_push_data is called with len 0 (as it happens during >>>> selftests/bpf/test_sockmap), we do not need to do anything and can >>>> return early. >>>> >>>> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM >>>> error: we later called get_order(copy + len); if len was 0, copy + len >>>> was also often 0 and get_order returned some undefined value (at the >>>> moment 52). alloc_pages caught that and failed, but then >>>> bpf_msg_push_data returned ENOMEM. This was wrong because we are most >>>> probably not out of memory and actually do not need any additional >>>> memory. >>>> >>>> v2: Add bug description and Fixes tag >>>> >>>> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") >>>> Signed-off-by: Felix Maurer <fmaurer@redhat.com> >>> >>> LGTM. I am wondering why bpf CI didn't catch this problem. Did you >>> modified the test with length 0 in order to trigger that? If this >>> is the case, it would be great you can add such a test to the >>> test_sockmap. >> >> I did not modify the tests to trigger that. The state of the selftests >> around that is unfortunately not very good. There is no explicit test >> with length 0 but bpf_msg_push_data is still called with length 0, >> because of what I consider to be bugs in the test. On the other hand, >> explicit tests with other lengths are sometimes not called as well. I'll >> elaborate on that in a bit. >> >> Something easy to fix is that the tests do not check the return value of >> bpf_msg_push_data which they probably should. That may have helped find >> the problem earlier. >> >> Now to the issue mentioned in the beginning: Only some of the BPF >> programs used in test_sockmap actually call bpf_msg_push_data. However, >> they are not always attached, just for particular scenarios: >> txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, >> bpf_msg_push_data is never called. This happens for example in >> test_txmsg_push. Out of the four defined tests only one actually calls >> the helper. >> >> But after a test, the parameters in the map are reset to 0 (instead of >> being removed). Therefore, when the maps are reused in a subsequent test >> which is one of the scenarios above, the values are present and >> bpf_msg_push_data is called, albeit with the parameters set to 0. This >> is also what triggered the wrong behavior fixed in the patch. >> >> Unfortunately, I do not have the time to fix these issues in the test at >> the moment. > > Thanks for detailed explanation. Maybe for the immediate case, can you > just fix this in the selftest, > > > Something easy to fix is that the tests do not check the return > value of > > bpf_msg_push_data which they probably should. That may have helped find > > the problem earlier. > > This will be enough to verify your kernel change as without it the > test will fail. I just send a patch checking the return values of the bpf_msg_push_data usages in the test [1]. Passing the errors to userspace by dropping packets is not very nice, but a straightforward way in the current test program. I did try the same checks of the return values of bpf_msg_pull_data, but then the tests fail. So there might be something hidden here as well. [1]:https://lore.kernel.org/bpf/89f767bb44005d6b4dd1f42038c438f76b3ebfad.1644601294.git.fmaurer@redhat.com/ > The rest of test improvements can come later. > >> >>> Acked-by: Yonghong Song <yhs@fb.com> >> >> Thanks! >> >
On 2/11/22 9:52 AM, Felix Maurer wrote: > On 10.02.22 19:04, Yonghong Song wrote: >> On 2/10/22 7:45 AM, Felix Maurer wrote: >>> On 09.02.22 18:06, Yonghong Song wrote: >>>> On 2/9/22 7:55 AM, Felix Maurer wrote: >>>>> If bpf_msg_push_data is called with len 0 (as it happens during >>>>> selftests/bpf/test_sockmap), we do not need to do anything and can >>>>> return early. >>>>> >>>>> Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM >>>>> error: we later called get_order(copy + len); if len was 0, copy + len >>>>> was also often 0 and get_order returned some undefined value (at the >>>>> moment 52). alloc_pages caught that and failed, but then >>>>> bpf_msg_push_data returned ENOMEM. This was wrong because we are most >>>>> probably not out of memory and actually do not need any additional >>>>> memory. >>>>> >>>>> v2: Add bug description and Fixes tag >>>>> >>>>> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") >>>>> Signed-off-by: Felix Maurer <fmaurer@redhat.com> >>>> >>>> LGTM. I am wondering why bpf CI didn't catch this problem. Did you >>>> modified the test with length 0 in order to trigger that? If this >>>> is the case, it would be great you can add such a test to the >>>> test_sockmap. >>> >>> I did not modify the tests to trigger that. The state of the selftests >>> around that is unfortunately not very good. There is no explicit test >>> with length 0 but bpf_msg_push_data is still called with length 0, >>> because of what I consider to be bugs in the test. On the other hand, >>> explicit tests with other lengths are sometimes not called as well. I'll >>> elaborate on that in a bit. >>> >>> Something easy to fix is that the tests do not check the return value of >>> bpf_msg_push_data which they probably should. That may have helped find >>> the problem earlier. >>> >>> Now to the issue mentioned in the beginning: Only some of the BPF >>> programs used in test_sockmap actually call bpf_msg_push_data. However, >>> they are not always attached, just for particular scenarios: >>> txmsg_pass==1, txmsg_redir==1, or txmsg_drop==1. If none of those apply, >>> bpf_msg_push_data is never called. This happens for example in >>> test_txmsg_push. Out of the four defined tests only one actually calls >>> the helper. >>> >>> But after a test, the parameters in the map are reset to 0 (instead of >>> being removed). Therefore, when the maps are reused in a subsequent test >>> which is one of the scenarios above, the values are present and >>> bpf_msg_push_data is called, albeit with the parameters set to 0. This >>> is also what triggered the wrong behavior fixed in the patch. >>> >>> Unfortunately, I do not have the time to fix these issues in the test at >>> the moment. >> >> Thanks for detailed explanation. Maybe for the immediate case, can you >> just fix this in the selftest, >> >> > Something easy to fix is that the tests do not check the return >> value of >> > bpf_msg_push_data which they probably should. That may have helped find >> > the problem earlier. >> >> This will be enough to verify your kernel change as without it the >> test will fail. > > I just send a patch checking the return values of the bpf_msg_push_data > usages in the test [1]. Passing the errors to userspace by dropping > packets is not very nice, but a straightforward way in the current test > program. > > I did try the same checks of the return values of bpf_msg_pull_data, but > then the tests fail. So there might be something hidden here as well. Thanks for the patch! Maybe this can be the first step to fix test_sockmap. John, could you help take a look at the patch? > > [1]:https://lore.kernel.org/bpf/89f767bb44005d6b4dd1f42038c438f76b3ebfad.1644601294.git.fmaurer@redhat.com/ > >> The rest of test improvements can come later. >> >>> >>>> Acked-by: Yonghong Song <yhs@fb.com> >>> >>> Thanks! >>> >> >
diff --git a/net/core/filter.c b/net/core/filter.c index 4603b7cd3cd1..9eb785842258 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2710,6 +2710,9 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, if (unlikely(flags)) return -EINVAL; + if (unlikely(len == 0)) + return 0; + /* First find the starting scatterlist element */ i = msg->sg.start; do {
If bpf_msg_push_data is called with len 0 (as it happens during selftests/bpf/test_sockmap), we do not need to do anything and can return early. Calling bpf_msg_push_data with len 0 previously lead to a wrong ENOMEM error: we later called get_order(copy + len); if len was 0, copy + len was also often 0 and get_order returned some undefined value (at the moment 52). alloc_pages caught that and failed, but then bpf_msg_push_data returned ENOMEM. This was wrong because we are most probably not out of memory and actually do not need any additional memory. v2: Add bug description and Fixes tag Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") Signed-off-by: Felix Maurer <fmaurer@redhat.com> --- net/core/filter.c | 3 +++ 1 file changed, 3 insertions(+)