
handling EINTR from bpf_map_lookup_batch

Message ID Z6JXtA1M5jAZx8xD@debian.debian (mailing list archive)
State New

Checks

Context                Check    Description
netdev/tree_selection  success  Guessing tree name failed - patch did not apply

Commit Message

Yan Zhai Feb. 4, 2025, 6:08 p.m. UTC
I am getting EINTR when trying to use bpf_map_lookup_batch on an
array_of_maps. The error happens when there is a "hole" in the array.
For example, say the outer map has max entries of 256, each inner map
is used for a transport protocol, and I only populated key 6 and
17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
This so far seems to only happen with array of maps. Does it make
sense to allow skipping to the next key for this map type? Something
like:


Also the context about my scenario if anyone is curious: I am trying
to associate each map to a userspace service in a multi tenant
environment. This is an addition to cgroup accounting, in case the
creator cgroup goes away, e.g. systemd service restarts always
recreate cgroups. And we also want to monitor the utilization level of
non-prealloc maps of different tenants. When dealing with inner maps,
it is not always trivial. To connect dots I choose to read these IDs
periodically and link them to the tenant of the outer map, that's
where this EINTR occurred.

best
Yan
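
A minimal user-space sketch of the behaviour described above, for readers who want to see why a hole leads to EINTR. This is a simplified model, not the kernel's generic_map_lookup_batch: struct toy_fd_array and the toy_* helpers are hypothetical names, the retry budget of 3 is taken from the discussion in this thread, and skip_holes stands in for the proposed "goto next_key" behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define LOOKUP_RETRIES 3   /* retry budget mentioned in the thread */
#define MAX_ENTRIES    256 /* outer map size from the example */

/* Hypothetical stand-in for an array_of_maps: id 0 means "empty slot". */
struct toy_fd_array {
	unsigned int inner_id[MAX_ENTRIES];
};

/* Array-style next-key: every index is a valid key, populated or not. */
static int toy_get_next_key(const unsigned int *prev, unsigned int *next)
{
	unsigned int idx = prev ? *prev + 1 : 0;

	if (idx >= MAX_ENTRIES)
		return -ENOENT; /* end of map */
	*next = idx;
	return 0;
}

/* Lookup fails with -ENOENT on a hole, like an unpopulated fd array slot. */
static int toy_lookup(const struct toy_fd_array *m, unsigned int key,
		      unsigned int *id)
{
	assert(key < MAX_ENTRIES);
	if (m->inner_id[key] == 0)
		return -ENOENT;
	*id = m->inner_id[key];
	return 0;
}

/*
 * Simplified batch lookup. Returns 0 or a negative errno and stores the
 * number of copied (key, id) pairs in *count.
 */
static int toy_lookup_batch(const struct toy_fd_array *m, bool skip_holes,
			    unsigned int *keys, unsigned int *ids,
			    unsigned int max_count, unsigned int *count)
{
	unsigned int prev, key, id, cp = 0;
	const unsigned int *prev_key = NULL;
	int retry = LOOKUP_RETRIES;
	int err = 0;

	while (cp < max_count) {
		err = toy_get_next_key(prev_key, &key);
		if (err)
			break;
		err = toy_lookup(m, key, &id);
		if (err == -ENOENT) {
			if (skip_holes)
				goto next_key; /* err is reassigned next iteration */
			if (retry) {
				retry--;
				continue; /* same key again; a hole never fills */
			}
			err = -EINTR;
			break;
		}
		keys[cp] = key;
		ids[cp] = id;
		cp++;
		retry = LOOKUP_RETRIES;
next_key:
		prev = key;
		prev_key = &prev;
	}
	*count = cp;
	return err;
}
```

With only keys 6 and 17 populated, calling toy_lookup_batch with skip_holes set to false stops at key 0 with -EINTR and a count of 0, no matter how often the caller retries. With skip_holes set to true the same call walks past the holes and returns both entries.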

Comments

Hou Tao Feb. 5, 2025, 2:19 a.m. UTC | #1
Hi,

On 2/5/2025 2:08 AM, Yan Zhai wrote:
> I am getting EINTR when trying to use bpf_map_lookup_batch on an
> array_of_maps. The error happens when there is a "hole" in the array.
> For example, say the outer map has max entries of 256, each inner map
> is used for a transport protocol, and I only populated key 6 and
> 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> This so far seems to only happen with array of maps. Does it make
> sense to allow skipping to the next key for this map type? Something
> like:
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..83915a8059ef 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>                                          attr->batch.elem_flags);
>
>                 if (err == -ENOENT) {
> +                       if (IS_FD_ARRAY(map))
> +                               goto next_key;

It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operations, so
map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
better to reset err to 0, otherwise generic_map_lookup_batch may return
-ENOENT.
>                         if (retry) {
>                                 retry--;
>                                 continue;
> @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
>                         goto free_buf;
>                 }
>
> +next_key:
>                 if (!prev_key)
>                         prev_key = buf_prevkey;
>

Makes sense. Please add a selftest for it. Another way is to return id 0
for these non-existent values in the fd array, but it may break existing
programs. Just skipping the empty array slot is better.
> Also the context about my scenario if anyone is curious: I am trying
> to associate each map to a userspace service in a multi tenant
> environment. This is an addition to cgroup accounting, in case the
> creator cgroup goes away, e.g. systemd service restarts always
> recreate cgroups. And we also want to monitor the utilization level of
> non-prealloc maps of different tenants. When dealing with inner maps,
> it is not always trivial. To connect dots I choose to read these IDs
> periodically and link them to the tenant of the outer map, that's
> where this EINTR occurred.
>
> best
> Yan
>
> .
Alexei Starovoitov Feb. 5, 2025, 9:56 a.m. UTC | #2
On Wed, Feb 5, 2025 at 2:19 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/5/2025 2:08 AM, Yan Zhai wrote:
> > I am getting EINTR when trying to use bpf_map_lookup_batch on an
> > array_of_maps. The error happens when there is a "hole" in the array.
> > For example, say the outer map has max entries of 256, each inner map
> > is used for a transport protocol, and I only populated key 6 and
> > 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> > This so far seems to only happen with array of maps. Does it make
> > sense to allow skipping to the next key for this map type? Something
> > like:
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..83915a8059ef 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                                          attr->batch.elem_flags);
> >
> >                 if (err == -ENOENT) {
> > +                       if (IS_FD_ARRAY(map))
> > +                               goto next_key;
>
> It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operation, so
> map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
> better to reset err as 0, otherwise generic_map_lookup_batch may return
> -ENOENT.
> >                         if (retry) {
> >                                 retry--;
> >                                 continue;
> > @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                         goto free_buf;
> >                 }
> >
> > +next_key:
> >                 if (!prev_key)
> >                         prev_key = buf_prevkey;
> >
>
> Makes sense. Please add a selftest for it. Another way is to return id 0
> for these non-existent values in the fd array, but it may break existing
> programs. Just skipping the empty array slot is better.

Let's not invent new magic return values.

But stepping back... why do we have this EINTR case at all?
Can we always goto next_key for all map types?
The command returns a set of (key, value) pairs.
It's always better to skip than to get stuck in EINTR,
since EINTR implies that user space should retry and it
might be successful next time.
Here that is not the case.
I don't see any selftests for EINTR, so I suspect it was added
as an escape path in case the retry count exceeds 3, and the author
assumed that it should never happen in practice, so EINTR was expected
to be 'never happens'. Clearly that's not the case.
Yan Zhai Feb. 5, 2025, 4:15 p.m. UTC | #3
On Tue, Feb 4, 2025 at 8:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/5/2025 2:08 AM, Yan Zhai wrote:
> > I am getting EINTR when trying to use bpf_map_lookup_batch on an
> > array_of_maps. The error happens when there is a "hole" in the array.
> > For example, say the outer map has max entries of 256, each inner map
> > is used for a transport protocol, and I only populated key 6 and
> > 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> > This so far seems to only happen with array of maps. Does it make
> > sense to allow skipping to the next key for this map type? Something
> > like:
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..83915a8059ef 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                                          attr->batch.elem_flags);
> >
> >                 if (err == -ENOENT) {
> > +                       if (IS_FD_ARRAY(map))
> > +                               goto next_key;
>
> It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operation, so
> map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
> better to reset err as 0, otherwise generic_map_lookup_batch may return
> -ENOENT.

Jumping to the next key always restarts the loop, so err will be
set correctly afterwards.

> >                         if (retry) {
> >                                 retry--;
> >                                 continue;
> > @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                         goto free_buf;
> >                 }
> >
> > +next_key:
> >                 if (!prev_key)
> >                         prev_key = buf_prevkey;
> >
>
> Makes sense. Please add a selftest for it. Another way is to return id 0
> for these non-existent values in the fd array, but it may break existing
> programs. Just skipping the empty array slot is better.

Working on it.

thanks
Yan

> > Also the context about my scenario if anyone is curious: I am trying
> > to associate each map to a userspace service in a multi tenant
> > environment. This is an addition to cgroup accounting, in case the
> > creator cgroup goes away, e.g. systemd service restarts always
> > recreate cgroups. And we also want to monitor the utilization level of
> > non-prealloc maps of different tenants. When dealing with inner maps,
> > it is not always trivial. To connect dots I choose to read these IDs
> > periodically and link them to the tenant of the outer map, that's
> > where this EINTR occurred.
> >
> > best
> > Yan
> >
> > .
>
Yan Zhai Feb. 5, 2025, 4:27 p.m. UTC | #4
On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Let's not invent new magic return values.
>
> But stepping back... why do we have this EINTR case at all?
> Can we always goto next_key for all map types?
> The command returns a set of (key, value) pairs.
> It's always better to skip than to get stuck in EINTR,
> since EINTR implies that user space should retry and it
> might be successful next time.
> Here that is not the case.
> I don't see any selftests for EINTR, so I suspect it was added
> as an escape path in case the retry count exceeds 3, and the author
> assumed that it should never happen in practice, so EINTR was expected
> to be 'never happens'. Clearly that's not the case.

It makes more sense to me if we just go to the next key for all types.
At least for the current users of generic batch lookup, arrays and
lpm_trie, I did not see any case where a retry would help.

best
Yan
Yan Zhai Feb. 5, 2025, 5 p.m. UTC | #5
On Wed, Feb 05, 2025 at 10:27:25AM -0600, Yan Zhai wrote:
> On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Let's not invent new magic return values.
> >
> > But stepping back... why do we have this EINTR case at all?
> > Can we always goto next_key for all map types?
> > The command returns a set of (key, value) pairs.
> > It's always better to skip than to get stuck in EINTR,
> > since EINTR implies that user space should retry and it
> > might be successful next time.
> > Here that is not the case.
> > I don't see any selftests for EINTR, so I suspect it was added
> > as an escape path in case the retry count exceeds 3, and the author
> > assumed that it should never happen in practice, so EINTR was expected
> > to be 'never happens'. Clearly that's not the case.
> 
> It makes more sense to me if we just go to the next key for all types.
> At least for the current users of generic batch lookup, arrays and
> lpm_trie, I did not see any case where a retry would help.
> 

I opened a patch here:
https://lore.kernel.org/bpf/Z6OYbS4WqQnmzi2z@debian.debian/

Yan

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c420edbfb7c8..83915a8059ef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2027,6 +2027,8 @@  int generic_map_lookup_batch(struct bpf_map *map,
                                         attr->batch.elem_flags);

                if (err == -ENOENT) {
+                       if (IS_FD_ARRAY(map))
+                               goto next_key;
                        if (retry) {
                                retry--;
                                continue;
@@ -2048,6 +2050,7 @@  int generic_map_lookup_batch(struct bpf_map *map,
                        goto free_buf;
                }

+next_key:
                if (!prev_key)
                        prev_key = buf_prevkey;