diff mbox series

[RFC,net-next,v5,2/2] net: add netmem to skb_frag_t

Message ID 20240109011455.1061529-3-almasrymina@google.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Abstract page from net stack | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6067 this patch: 6067
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 2321 this patch: 2321
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6410 this patch: 6410
netdev/checkpatch warning WARNING: do not add new typedefs
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Mina Almasry Jan. 9, 2024, 1:14 a.m. UTC
Use struct netmem* instead of page in skb_frag_t. Currently struct
netmem* is always a struct page underneath, but the abstraction
allows efforts to add support for skb frags not backed by pages.

There is unfortunately 1 instance where the skb_frag_t is assumed to be
a exactly a bio_vec in kcm. For this case, WARN_ON_ONCE and return error
before doing a cast.

Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v4:
- Handle error in kcm_write_msgs() instead of only warning (Willem)

v3:
- Renamed the fields in skb_frag_t.

v2:
- Add skb frag filling helpers.

---
 include/linux/skbuff.h | 90 +++++++++++++++++++++++++++++-------------
 net/core/skbuff.c      | 22 ++++++++---
 net/kcm/kcmsock.c      |  9 ++++-
 3 files changed, 86 insertions(+), 35 deletions(-)

Comments

Yunsheng Lin Jan. 11, 2024, 12:44 p.m. UTC | #1
On 2024/1/9 9:14, Mina Almasry wrote:

...

>  
> +		if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {

I am really hate to bring it up again.
If you are not willing to introduce a new helper, do you care to use some
existing API like skb_frag_address_safe()?

> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>  		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
> -			      skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> -			      msize);
> +			      (const struct bio_vec *)skb_shinfo(skb)->frags,
> +			      skb_shinfo(skb)->nr_frags, msize);

I think we need to use some built-time checking to ensure some consistency
between skb_frag_t and bio_vec.

>  		iov_iter_advance(&msg.msg_iter, txm->frag_offset);
>  
>  		do {
>
Mina Almasry Jan. 12, 2024, 12:34 a.m. UTC | #2
On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/9 9:14, Mina Almasry wrote:
>
> ...
>
> >
> > +             if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
>
> I am really hate to bring it up again.
> If you are not willing to introduce a new helper,

I'm actually more than happy to add a new helper like:

static inline netmem_ref skb_frag_netmem();

For future callers to obtain frag->netmem to use the netmem_ref directly.

What I'm hung up on is really your follow up request:

"Is it possible to introduce something like skb_frag_netmem() for
netmem? so that we can keep most existing users of skb_frag_page()
unchanged and avoid adding additional checking overhead for existing
users."

With this patchseries, skb_frag_t no longer has a page pointer inside
of it, it only has a netmem_ref. The netmem_ref is currently always a
page, but in the future may not be a page. Can you clarify how we keep
skb_frag_page() unchanged and without checks? What do you expect
skb_frag_page() and its callers to do? We can not assume netmem_ref is
always a struct page. I'm happy to implement a change but I need to
understand it a bit better.

> do you care to use some
> existing API like skb_frag_address_safe()?
>

skb_frag_address_safe() checks that the page is mapped. In this case,
we are not checking if the frag page is mapped; we would like to make
sure that the skb_frag has a page inside of it in the first place.
Seems like a different check from skb_frag_address_safe().

In fact, skb_frag_address[_safe]() actually assume that the skb frag
is always a page currently, I think I need to squash this fix:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e59f76151628..bc8b107d0235 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3533,7 +3533,9 @@ static inline void skb_frag_unref(struct sk_buff
*skb, int f)
  */
 static inline void *skb_frag_address(const skb_frag_t *frag)
 {
-       return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
+       return skb_frag_page(frag) ?
+               page_address(skb_frag_page(frag)) + skb_frag_off(frag) :
+               NULL;
 }

 /**
@@ -3545,7 +3547,14 @@ static inline void *skb_frag_address(const
skb_frag_t *frag)
  */
 static inline void *skb_frag_address_safe(const skb_frag_t *frag)
 {
-       void *ptr = page_address(skb_frag_page(frag));
+       struct page *page;
+       void *ptr;
+
+       page = skb_frag_page(frag);
+       if (!page)
+               return NULL;
+
+       ptr = page_address(skb_frag_page(frag));
        if (unlikely(!ptr))
                return NULL;

> > +                     ret = -EINVAL;
> > +                     goto out;
> > +             }
> > +
> >               iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
> > -                           skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> > -                           msize);
> > +                           (const struct bio_vec *)skb_shinfo(skb)->frags,
> > +                           skb_shinfo(skb)->nr_frags, msize);
>
> I think we need to use some built-time checking to ensure some consistency
> between skb_frag_t and bio_vec.
>

I can add static_assert() that bio_vec->bv_len & bio_vec->bv_offset
are aligned with skb_frag_t->len & skb_frag_t->offset.

I can also maybe add a helper skb_frag_bvec() to do the cast instead
of doing it at the calling site. That may be a bit cleaner.

> >               iov_iter_advance(&msg.msg_iter, txm->frag_offset);
> >
> >               do {
> >



--
Thanks,
Mina
Yunsheng Lin Jan. 12, 2024, 11:51 a.m. UTC | #3
On 2024/1/12 8:34, Mina Almasry wrote:
> On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/1/9 9:14, Mina Almasry wrote:
>>
>> ...
>>
>>>
>>> +             if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
>>
>> I am really hate to bring it up again.
>> If you are not willing to introduce a new helper,
> 
> I'm actually more than happy to add a new helper like:
> 
> static inline netmem_ref skb_frag_netmem();
> 
> For future callers to obtain frag->netmem to use the netmem_ref directly.
> 
> What I'm hung up on is really your follow up request:
> 
> "Is it possible to introduce something like skb_frag_netmem() for
> netmem? so that we can keep most existing users of skb_frag_page()
> unchanged and avoid adding additional checking overhead for existing
> users."
> 
> With this patchseries, skb_frag_t no longer has a page pointer inside
> of it, it only has a netmem_ref. The netmem_ref is currently always a
> page, but in the future may not be a page. Can you clarify how we keep
> skb_frag_page() unchanged and without checks? What do you expect
> skb_frag_page() and its callers to do? We can not assume netmem_ref is
> always a struct page. I'm happy to implement a change but I need to
> understand it a bit better.

There are still many existing places still not expecting or handling
skb_frag_page() returning NULL, mostly those are in the drivers not
supporting devmem, what's the point of adding the extral overhead for
those driver?

The networking stack should forbid skb with devmem frag being forwarded
to drivers not supporting devmem yet. I am sure if this is done properly
in your patchset yet? if not, I think you might to audit every existing
drivers handling skb_frag_page() returning NULL correctly.


> 
>> do you care to use some
>> existing API like skb_frag_address_safe()?
>>
> 
> skb_frag_address_safe() checks that the page is mapped. In this case,
> we are not checking if the frag page is mapped; we would like to make
> sure that the skb_frag has a page inside of it in the first place.
> Seems like a different check from skb_frag_address_safe().
> 
> In fact, skb_frag_address[_safe]() actually assume that the skb frag
> is always a page currently, I think I need to squash this fix:
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index e59f76151628..bc8b107d0235 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3533,7 +3533,9 @@ static inline void skb_frag_unref(struct sk_buff
> *skb, int f)
>   */
>  static inline void *skb_frag_address(const skb_frag_t *frag)
>  {
> -       return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
> +       return skb_frag_page(frag) ?
> +               page_address(skb_frag_page(frag)) + skb_frag_off(frag) :
> +               NULL;
>  }
> 
>  /**
> @@ -3545,7 +3547,14 @@ static inline void *skb_frag_address(const
> skb_frag_t *frag)
>   */
>  static inline void *skb_frag_address_safe(const skb_frag_t *frag)
>  {
> -       void *ptr = page_address(skb_frag_page(frag));
> +       struct page *page;
> +       void *ptr;
> +
> +       page = skb_frag_page(frag);
> +       if (!page)
> +               return NULL;
> +
> +       ptr = page_address(skb_frag_page(frag));
>         if (unlikely(!ptr))
>                 return NULL;
> 
>>> +                     ret = -EINVAL;
>>> +                     goto out;
>>> +             }
>>> +
>>>               iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
>>> -                           skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
>>> -                           msize);
>>> +                           (const struct bio_vec *)skb_shinfo(skb)->frags,
>>> +                           skb_shinfo(skb)->nr_frags, msize);
>>
>> I think we need to use some built-time checking to ensure some consistency
>> between skb_frag_t and bio_vec.
>>
> 
> I can add static_assert() that bio_vec->bv_len & bio_vec->bv_offset
> are aligned with skb_frag_t->len & skb_frag_t->offset.
> 
> I can also maybe add a helper skb_frag_bvec() to do the cast instead
> of doing it at the calling site. That may be a bit cleaner.

I think the more generic way to do is to add something like iov_iter_netmem()
instead of doing any cast, so that netmem can be decoupled from bvec completely.

> 
>>>               iov_iter_advance(&msg.msg_iter, txm->frag_offset);
>>>
>>>               do {
>>>
> 
> 
> 
> --
> Thanks,
> Mina
> .
>
Mina Almasry Jan. 12, 2024, 3:35 p.m. UTC | #4
On Fri, Jan 12, 2024 at 3:51 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/12 8:34, Mina Almasry wrote:
> > On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/1/9 9:14, Mina Almasry wrote:
> >>
> >> ...
> >>
> >>>
> >>> +             if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
> >>
> >> I am really hate to bring it up again.
> >> If you are not willing to introduce a new helper,
> >
> > I'm actually more than happy to add a new helper like:
> >
> > static inline netmem_ref skb_frag_netmem();
> >
> > For future callers to obtain frag->netmem to use the netmem_ref directly.
> >
> > What I'm hung up on is really your follow up request:
> >
> > "Is it possible to introduce something like skb_frag_netmem() for
> > netmem? so that we can keep most existing users of skb_frag_page()
> > unchanged and avoid adding additional checking overhead for existing
> > users."
> >
> > With this patchseries, skb_frag_t no longer has a page pointer inside
> > of it, it only has a netmem_ref. The netmem_ref is currently always a
> > page, but in the future may not be a page. Can you clarify how we keep
> > skb_frag_page() unchanged and without checks? What do you expect
> > skb_frag_page() and its callers to do? We can not assume netmem_ref is
> > always a struct page. I'm happy to implement a change but I need to
> > understand it a bit better.


You did not answer my question that I asked here, and ignoring this
question is preventing us from making any forward progress on this
discussion. What do you expect or want skb_frag_page() to do when
there is no page in the frag?

>
> There are still many existing places still not expecting or handling
> skb_frag_page() returning NULL, mostly those are in the drivers not
> supporting devmem,

As of this series skb_frag_page() cannot return NULL.

In the devmem series, all core networking stack places where
skb_frag_page() may return NULL are audited.

skb_frag_page() returning NULL in a driver that doesn't support devmem
is not possible. The driver is the one that creates the devmem frags
in the first place. When the driver author adds devmem support, they
should also add support for skb_frag_page() returning NULL.

> what's the point of adding the extral overhead for
> those driver?
>

There is no overhead with static branches. The checks are no-op unless
the user enables devmem, at which point the devmem connections see no
overhead and non-devmem connections will see minimal overhead that I
suspect will not reproduce any perf issue. If the user is not fine
with that they can simply not enable devmem and continue to not
experience any overhead.

> The networking stack should forbid skb with devmem frag being forwarded
> to drivers not supporting devmem yet. I am sure if this is done properly
> in your patchset yet? if not, I think you might to audit every existing
> drivers handling skb_frag_page() returning NULL correctly.
>

There is no audit required. The devmem frags are generated by the
driver and forwarded to the TCP stack, not the other way around.

>
> >
> >> do you care to use some
> >> existing API like skb_frag_address_safe()?
> >>
> >
> > skb_frag_address_safe() checks that the page is mapped. In this case,
> > we are not checking if the frag page is mapped; we would like to make
> > sure that the skb_frag has a page inside of it in the first place.
> > Seems like a different check from skb_frag_address_safe().
> >
> > In fact, skb_frag_address[_safe]() actually assume that the skb frag
> > is always a page currently, I think I need to squash this fix:
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index e59f76151628..bc8b107d0235 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3533,7 +3533,9 @@ static inline void skb_frag_unref(struct sk_buff
> > *skb, int f)
> >   */
> >  static inline void *skb_frag_address(const skb_frag_t *frag)
> >  {
> > -       return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
> > +       return skb_frag_page(frag) ?
> > +               page_address(skb_frag_page(frag)) + skb_frag_off(frag) :
> > +               NULL;
> >  }
> >
> >  /**
> > @@ -3545,7 +3547,14 @@ static inline void *skb_frag_address(const
> > skb_frag_t *frag)
> >   */
> >  static inline void *skb_frag_address_safe(const skb_frag_t *frag)
> >  {
> > -       void *ptr = page_address(skb_frag_page(frag));
> > +       struct page *page;
> > +       void *ptr;
> > +
> > +       page = skb_frag_page(frag);
> > +       if (!page)
> > +               return NULL;
> > +
> > +       ptr = page_address(skb_frag_page(frag));
> >         if (unlikely(!ptr))
> >                 return NULL;
> >
> >>> +                     ret = -EINVAL;
> >>> +                     goto out;
> >>> +             }
> >>> +
> >>>               iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
> >>> -                           skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> >>> -                           msize);
> >>> +                           (const struct bio_vec *)skb_shinfo(skb)->frags,
> >>> +                           skb_shinfo(skb)->nr_frags, msize);
> >>
> >> I think we need to use some built-time checking to ensure some consistency
> >> between skb_frag_t and bio_vec.
> >>
> >
> > I can add static_assert() that bio_vec->bv_len & bio_vec->bv_offset
> > are aligned with skb_frag_t->len & skb_frag_t->offset.
> >
> > I can also maybe add a helper skb_frag_bvec() to do the cast instead
> > of doing it at the calling site. That may be a bit cleaner.
>
> I think the more generic way to do is to add something like iov_iter_netmem()
> instead of doing any cast, so that netmem can be decoupled from bvec completely.
>
> >
> >>>               iov_iter_advance(&msg.msg_iter, txm->frag_offset);
> >>>
> >>>               do {
> >>>
> >
> >
> >
> > --
> > Thanks,
> > Mina
> > .
> >
Yunsheng Lin Jan. 15, 2024, 9:37 a.m. UTC | #5
On 2024/1/12 23:35, Mina Almasry wrote:
> On Fri, Jan 12, 2024 at 3:51 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/1/12 8:34, Mina Almasry wrote:
>>> On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> On 2024/1/9 9:14, Mina Almasry wrote:
>>>>
>>>> ...
>>>>
>>>>>
>>>>> +             if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
>>>>
>>>> I am really hate to bring it up again.
>>>> If you are not willing to introduce a new helper,
>>>
>>> I'm actually more than happy to add a new helper like:
>>>
>>> static inline netmem_ref skb_frag_netmem();
>>>
>>> For future callers to obtain frag->netmem to use the netmem_ref directly.
>>>
>>> What I'm hung up on is really your follow up request:
>>>
>>> "Is it possible to introduce something like skb_frag_netmem() for
>>> netmem? so that we can keep most existing users of skb_frag_page()
>>> unchanged and avoid adding additional checking overhead for existing
>>> users."
>>>
>>> With this patchseries, skb_frag_t no longer has a page pointer inside
>>> of it, it only has a netmem_ref. The netmem_ref is currently always a
>>> page, but in the future may not be a page. Can you clarify how we keep
>>> skb_frag_page() unchanged and without checks? What do you expect
>>> skb_frag_page() and its callers to do? We can not assume netmem_ref is
>>> always a struct page. I'm happy to implement a change but I need to
>>> understand it a bit better.
> 
> 
> You did not answer my question that I asked here, and ignoring this
> question is preventing us from making any forward progress on this
> discussion. What do you expect or want skb_frag_page() to do when
> there is no page in the frag?

I would expect it to do nothing.
IMHO, the caller is expected to only call skb_frag_page() for the frag
with normal page, for example, doing some 'readable' checking before
callingskb_frag_page(), or not doing any checking at all if it is called
in some existing driver not support devmem.

> 
>>
>> There are still many existing places still not expecting or handling
>> skb_frag_page() returning NULL, mostly those are in the drivers not
>> supporting devmem,
> 
> As of this series skb_frag_page() cannot return NULL.
> 
> In the devmem series, all core networking stack places where
> skb_frag_page() may return NULL are audited.
> 
> skb_frag_page() returning NULL in a driver that doesn't support devmem
> is not possible. The driver is the one that creates the devmem frags

Is it possible a netdev supporting devmen and a netdev not supporting
devmen are put into the same bridge, and a rx skb from netdev supporting
devmen is forwarded to netdev not supporting devmem?

br_forward() or ip_forward() may be the place that might do this kind
of forwarding?

> in the first place. When the driver author adds devmem support, they
> should also add support for skb_frag_page() returning NULL.
> 
>> what's the point of adding the extral overhead for
>> those driver?
>>
> 
> There is no overhead with static branches. The checks are no-op unless
> the user enables devmem, at which point the devmem connections see no

no-op is still some cpu instruction that will be replaced by some jumping
instruction when devmem is enabled, right?

Maybe Alexander had better words for those kinds of overhead:
"The overhead may not seem like much, but when you are having to deal
with it per page and you are processing pages at millions per second it
can quickly start to add up."


> overhead and non-devmem connections will see minimal overhead that I
> suspect will not reproduce any perf issue. If the user is not fine
> with that they can simply not enable devmem and continue to not
> experience any overhead.
> 
>> The networking stack should forbid skb with devmem frag being forwarded
>> to drivers not supporting devmem yet. I am sure if this is done properly
>> in your patchset yet? if not, I think you might to audit every existing
>> drivers handling skb_frag_page() returning NULL correctly.
>>
> 
> There is no audit required. The devmem frags are generated by the
> driver and forwarded to the TCP stack, not the other way around.
> 
>>
>>>
Mina Almasry Jan. 15, 2024, 11:23 p.m. UTC | #6
On Mon, Jan 15, 2024 at 1:37 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/12 23:35, Mina Almasry wrote:
> > On Fri, Jan 12, 2024 at 3:51 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/1/12 8:34, Mina Almasry wrote:
> >>> On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>
> >>>> On 2024/1/9 9:14, Mina Almasry wrote:
> >>>>
> >>>> ...
> >>>>
> >>>>>
> >>>>> +             if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
> >>>>
> >>>> I am really hate to bring it up again.
> >>>> If you are not willing to introduce a new helper,
> >>>
> >>> I'm actually more than happy to add a new helper like:
> >>>
> >>> static inline netmem_ref skb_frag_netmem();
> >>>
> >>> For future callers to obtain frag->netmem to use the netmem_ref directly.
> >>>
> >>> What I'm hung up on is really your follow up request:
> >>>
> >>> "Is it possible to introduce something like skb_frag_netmem() for
> >>> netmem? so that we can keep most existing users of skb_frag_page()
> >>> unchanged and avoid adding additional checking overhead for existing
> >>> users."
> >>>
> >>> With this patchseries, skb_frag_t no longer has a page pointer inside
> >>> of it, it only has a netmem_ref. The netmem_ref is currently always a
> >>> page, but in the future may not be a page. Can you clarify how we keep
> >>> skb_frag_page() unchanged and without checks? What do you expect
> >>> skb_frag_page() and its callers to do? We can not assume netmem_ref is
> >>> always a struct page. I'm happy to implement a change but I need to
> >>> understand it a bit better.
> >
> >
> > You did not answer my question that I asked here, and ignoring this
> > question is preventing us from making any forward progress on this
> > discussion. What do you expect or want skb_frag_page() to do when
> > there is no page in the frag?
>
> I would expect it to do nothing.

I don't understand. skb_frag_page() with an empty implementation just
results in a compiler error as the function needs to return a page
pointer. Do you actually expect skb_frag_page() to unconditionally
cast frag->netmem to a page pointer? That was explained as
unacceptable over and over again by Jason and Christian as it risks
casting devmem to page; completely unacceptable and will get nacked.
Do you have a suggestion of what skb_frag_page() should do that will
not get nacked by mm?

> IMHO, the caller is expected to only call skb_frag_page() for the frag
> with normal page, for example, doing some 'readable' checking before
> callingskb_frag_page(), or not doing any checking at all if it is called
> in some existing driver not support devmem.
>

You did not specify what you actually want skb_frag_page() to do, but
IMO leaving it up to the programmer to guess/decide if it's safe to
use with no checking in the function is very error prone and hacky, as
the programmer is obviously liable to assume the wrong thing.

All skb_frag_page() callers (which are mainly core networking
actually) would need to do a check anyway to ensure the compiler type
safety, set as a hard requirement by mm & dmabuf maintainers, and at
that point having the check in the function makes sense.

> >
> >>
> >> There are still many existing places still not expecting or handling
> >> skb_frag_page() returning NULL, mostly those are in the drivers not
> >> supporting devmem,
> >
> > As of this series skb_frag_page() cannot return NULL.
> >
> > In the devmem series, all core networking stack places where
> > skb_frag_page() may return NULL are audited.
> >
> > skb_frag_page() returning NULL in a driver that doesn't support devmem
> > is not possible. The driver is the one that creates the devmem frags
>
> Is it possible a netdev supporting devmen and a netdev not supporting
> devmen are put into the same bridge, and a rx skb from netdev supporting
> devmen is forwarded to netdev not supporting devmem?
>
> br_forward() or ip_forward() may be the place that might do this kind
> of forwarding?
>

I'd need to check, but even at that point it sounds to me like this
forwarding code would need to check handle devmem or disabled for
devmem, not a global audit of drivers.

> > in the first place. When the driver author adds devmem support, they
> > should also add support for skb_frag_page() returning NULL.
> >
> >> what's the point of adding the extral overhead for
> >> those driver?
> >>
> >
> > There is no overhead with static branches. The checks are no-op unless
> > the user enables devmem, at which point the devmem connections see no
>
> no-op is still some cpu instruction that will be replaced by some jumping
> instruction when devmem is enabled, right?
>

No, there is no overhead with static branches. There seems to be a
misunderstanding of how they work causing this massive overestimation
of the 'overhead' of the checks. Branches in code are free or almost
free when they are correctly predicted by the compiler:

if (unlikely(check()))
   do_work();
continue();

Is free or almost free when check() is false, as the compiler assumes
check() is false and starts executing continue() immediately. There is
a minor perf hit when the branch is mispredicted as the CPU needs to
start executing do_work() after the check() result is available.
Static branches enable us to at runtime decide which branch the
compiler predicts. We could put devmem processing to always be in the
unlikely path without static branches, but that's not necessary.
Optimizing devmem processing with static branches when devmem is
enabled seems the right choice here. That's my understanding at least.

> Maybe Alexander had better words for those kinds of overhead:
> "The overhead may not seem like much, but when you are having to deal
> with it per page and you are processing pages at millions per second it
> can quickly start to add up."
>
>

This IMO would be relevant if we're adding (significant, or
measureable) overhead to page processing, but with static branches or
putting devmem in the unlikely path we are not.

> > overhead and non-devmem connections will see minimal overhead that I
> > suspect will not reproduce any perf issue. If the user is not fine
> > with that they can simply not enable devmem and continue to not
> > experience any overhead.
> >
> >> The networking stack should forbid skb with devmem frag being forwarded
> >> to drivers not supporting devmem yet. I am sure if this is done properly
> >> in your patchset yet? if not, I think you might to audit every existing
> >> drivers handling skb_frag_page() returning NULL correctly.
> >>
> >
> > There is no audit required. The devmem frags are generated by the
> > driver and forwarded to the TCP stack, not the other way around.
> >
> >>
> >>>
>
Jason Gunthorpe Jan. 16, 2024, 12:01 a.m. UTC | #7
On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > You did not answer my question that I asked here, and ignoring this
> > > question is preventing us from making any forward progress on this
> > > discussion. What do you expect or want skb_frag_page() to do when
> > > there is no page in the frag?
> >
> > I would expect it to do nothing.
> 
> I don't understand. skb_frag_page() with an empty implementation just
> results in a compiler error as the function needs to return a page
> pointer. Do you actually expect skb_frag_page() to unconditionally
> cast frag->netmem to a page pointer? That was explained as
> unacceptable over and over again by Jason and Christian as it risks
> casting devmem to page; completely unacceptable and will get nacked.
> Do you have a suggestion of what skb_frag_page() should do that will
> not get nacked by mm?

WARN_ON and return NULL seems reasonable?

Jason
Yunsheng Lin Jan. 16, 2024, 11:04 a.m. UTC | #8
On 2024/1/16 8:01, Jason Gunthorpe wrote:
> On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
>>>> You did not answer my question that I asked here, and ignoring this
>>>> question is preventing us from making any forward progress on this
>>>> discussion. What do you expect or want skb_frag_page() to do when
>>>> there is no page in the frag?
>>>
>>> I would expect it to do nothing.
>>
>> I don't understand. skb_frag_page() with an empty implementation just
>> results in a compiler error as the function needs to return a page
>> pointer. Do you actually expect skb_frag_page() to unconditionally
>> cast frag->netmem to a page pointer? That was explained as
>> unacceptable over and over again by Jason and Christian as it risks
>> casting devmem to page; completely unacceptable and will get nacked.
>> Do you have a suggestion of what skb_frag_page() should do that will
>> not get nacked by mm?
> 
> WARN_ON and return NULL seems reasonable?

While I am agreed that it may be a nightmare to debug the case of passing
a false page into the mm system, but I am not sure what's the point of
returning NULL to caller if the caller is not expecting or handling the
NULL returning[for example, most of mm API called by the networking does not
seems to handling NULL as input page], isn't the NULL returning will make
the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
As returning NULL seems to be causing a confusion for the caller of
skb_frag_page() as whether to or how to handle the NULL returning case.

For existing cases, the skb_frag_page() is mostly called with below pattern,
it means there may be up to 17 checkings for each skb:
for (i = 0; i < shinfo->nr_frags; i++) {
	skb_frag_t *frag = &sinfo->frags[i];
	....
	calling some function with skb_frag_page(frag);
	.....
}

IMHO, we should avoid this kind of problem at API level, I am not able to
come up with an idea how to do that for now, is there any idea to do that in
your mind?
Even if we can not come up with an idea for now, doesn't it make sense to avoid
this kind of overhead as much as possible even if we have marked the checking
as unlikely() cases?

> 
> Jason
> .
>
Jason Gunthorpe Jan. 16, 2024, 12:16 p.m. UTC | #9
On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> >>>> You did not answer my question that I asked here, and ignoring this
> >>>> question is preventing us from making any forward progress on this
> >>>> discussion. What do you expect or want skb_frag_page() to do when
> >>>> there is no page in the frag?
> >>>
> >>> I would expect it to do nothing.
> >>
> >> I don't understand. skb_frag_page() with an empty implementation just
> >> results in a compiler error as the function needs to return a page
> >> pointer. Do you actually expect skb_frag_page() to unconditionally
> >> cast frag->netmem to a page pointer? That was explained as
> >> unacceptable over and over again by Jason and Christian as it risks
> >> casting devmem to page; completely unacceptable and will get nacked.
> >> Do you have a suggestion of what skb_frag_page() should do that will
> >> not get nacked by mm?
> > 
> > WARN_ON and return NULL seems reasonable?
> 
> While I am agreed that it may be a nightmare to debug the case of passing
> a false page into the mm system, but I am not sure what's the point of
> returning NULL to caller if the caller is not expecting or handling
> the

You have to return something and NULL will largely reliably crash the
thread. The WARN_ON explains in detail why your thread just crashed.

> NULL returning[for example, most of mm API called by the networking does not
> seems to handling NULL as input page], isn't the NULL returning will make
> the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> As returning NULL seems to be causing a confusion for the caller of
> skb_frag_page() as whether to or how to handle the NULL returning case.

Possibly, though Linus doesn't like BUG_ON on principle..

I think the bigger challenge is convincing people that this devmem
stuff doesn't just open a bunch of holes in the kernel where userspace
can crash it.

The fact you all are debating what to do with skb_frag_page() suggests
to me there isn't confidence...

Jason
Yunsheng Lin Jan. 17, 2024, 9:28 a.m. UTC | #10
On 2024/1/16 20:16, Jason Gunthorpe wrote:
> On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
>> On 2024/1/16 8:01, Jason Gunthorpe wrote:
>>> On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
>>>>>> You did not answer my question that I asked here, and ignoring this
>>>>>> question is preventing us from making any forward progress on this
>>>>>> discussion. What do you expect or want skb_frag_page() to do when
>>>>>> there is no page in the frag?
>>>>>
>>>>> I would expect it to do nothing.
>>>>
>>>> I don't understand. skb_frag_page() with an empty implementation just
>>>> results in a compiler error as the function needs to return a page
>>>> pointer. Do you actually expect skb_frag_page() to unconditionally
>>>> cast frag->netmem to a page pointer? That was explained as
>>>> unacceptable over and over again by Jason and Christian as it risks
>>>> casting devmem to page; completely unacceptable and will get nacked.
>>>> Do you have a suggestion of what skb_frag_page() should do that will
>>>> not get nacked by mm?
>>>
>>> WARN_ON and return NULL seems reasonable?
>>
>> While I am agreed that it may be a nightmare to debug the case of passing
>> a false page into the mm system, but I am not sure what's the point of
>> returning NULL to caller if the caller is not expecting or handling
>> the
> 
> You have to return something and NULL will largely reliably crash the
> thread. The WARN_ON explains in detail why your thread just crashed.
> 
>> NULL returning[for example, most of mm API called by the networking does not
>> seems to handling NULL as input page], isn't the NULL returning will make
>> the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
>> depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
>> As returning NULL seems to be causing a confusion for the caller of
>> skb_frag_page() as whether to or how to handle the NULL returning case.
> 
> Possibly, though Linus doesn't like BUG_ON on principle..

From the below discussion, it seems the BUG_ON belonging to something
like below:
(a) development code where it replaces error handling that you just
haven't written yet, and you haven't really thought through all the
possibilities, so you're saying "this can't happen, I'll fix it
later".

https://linux.kernel.narkive.com/WPFYQ9d4/bug-on-in-workingset-node-shadows-dec-triggers#post6

Anyway, it would be fine for me if skb_frag_page() is not used to decide
if a normal page or a devmem is in a frag. As for the overhead of catching
the calling of skb_frag_page() for devmem frag, it would be better to avoid
it using some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM.

> 
> I think the bigger challenge is convincing people that this devmem
> stuff doesn't just open a bunch of holes in the kernel where userspace
> can crash it.
> 
> The fact you all are debating what to do with skb_frag_page() suggests
> to me there isn't confidence...
> 
> Jason
> .
>
Mina Almasry Jan. 17, 2024, 6 p.m. UTC | #11
On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > >>>> You did not answer my question that I asked here, and ignoring this
> > >>>> question is preventing us from making any forward progress on this
> > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > >>>> there is no page in the frag?
> > >>>
> > >>> I would expect it to do nothing.
> > >>
> > >> I don't understand. skb_frag_page() with an empty implementation just
> > >> results in a compiler error as the function needs to return a page
> > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > >> cast frag->netmem to a page pointer? That was explained as
> > >> unacceptable over and over again by Jason and Christian as it risks
> > >> casting devmem to page; completely unacceptable and will get nacked.
> > >> Do you have a suggestion of what skb_frag_page() should do that will
> > >> not get nacked by mm?
> > >
> > > WARN_ON and return NULL seems reasonable?
> >

That's more or less what I'm thinking.

> > While I am agreed that it may be a nightmare to debug the case of passing
> > a false page into the mm system, but I am not sure what's the point of
> > returning NULL to caller if the caller is not expecting or handling
> > the
>
> You have to return something and NULL will largely reliably crash the
> thread. The WARN_ON explains in detail why your thread just crashed.
>

Agreed.

> > NULL returning[for example, most of mm API called by the networking does not
> > seems to handling NULL as input page], isn't the NULL returning will make
> > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > As returning NULL seems to be causing a confusion for the caller of
> > skb_frag_page() as whether to or how to handle the NULL returning case.
>
> Possibly, though Linus doesn't like BUG_ON on principle..
>
> I think the bigger challenge is convincing people that this devmem
> stuff doesn't just open a bunch of holes in the kernel where userspace
> can crash it.
>

It does not, and as of right now there are no pending concerns from
any netdev maintainers regarding mishandled devmem checks at least.
This is because the devmem series comes with a full audit of
skb_frag_page() callers [1] and all areas in the net stack attempting
to access the skb [2].

[1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@google.com/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@google.com/

> The fact you all are debating what to do with skb_frag_page() suggests
> to me there isn't confidence...
>

The debate raging on is related to the performance of skb_frag_page(),
not correctness (and even then, I don't think it's related to
perf...). Yunsheng would like us to optimize skb_frag_page() using an
unconditional cast from netmem to page. This in Yunsheng's mind is a
performance optimization as we don't need to add an if statement
checking if the netmem is a page. I'm resistant to implement that
change so far because:

(a) unconditionally casting from netmem to page negates the compiler
type safety that you and Christian are laying out as a requirement for
the devmem stuff.
(b) With likely/unlikely or static branches the check to make sure
netmem is page is a no-op for existing use cases anyway, so AFAIU,
there is no perf gain from optimizing it out anyway.

But none of this is related to correctness. Code calling
skb_frag_page() will fail or crash if it's not handled correctly
regardless of the implementation details of skb_frag_page(). In the
devmem series we add support to handle it correctly via [1] & [2].
Mina Almasry Jan. 17, 2024, 6:34 p.m. UTC | #12
On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > >>>> You did not answer my question that I asked here, and ignoring this
> > > >>>> question is preventing us from making any forward progress on this
> > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > >>>> there is no page in the frag?
> > > >>>
> > > >>> I would expect it to do nothing.
> > > >>
> > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > >> results in a compiler error as the function needs to return a page
> > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > >> cast frag->netmem to a page pointer? That was explained as
> > > >> unacceptable over and over again by Jason and Christian as it risks
> > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > >> not get nacked by mm?
> > > >
> > > > WARN_ON and return NULL seems reasonable?
> > >
>
> That's more or less what I'm thinking.
>
> > > While I am agreed that it may be a nightmare to debug the case of passing
> > > a false page into the mm system, but I am not sure what's the point of
> > > returning NULL to caller if the caller is not expecting or handling
> > > the
> >
> > You have to return something and NULL will largely reliably crash the
> > thread. The WARN_ON explains in detail why your thread just crashed.
> >
>
> Agreed.
>
> > > NULL returning[for example, most of mm API called by the networking does not
> > > seems to handling NULL as input page], isn't the NULL returning will make
> > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > As returning NULL seems to be causing a confusion for the caller of
> > > skb_frag_page() as whether to or how to handle the NULL returning case.
> >
> > Possibly, though Linus doesn't like BUG_ON on principle..
> >
> > I think the bigger challenge is convincing people that this devmem
> > stuff doesn't just open a bunch of holes in the kernel where userspace
> > can crash it.
> >
>
> It does not, and as of right now there are no pending concerns from
> any netdev maintainers regarding mishandled devmem checks at least.
> This is because the devmem series comes with a full audit of
> skb_frag_page() callers [1] and all areas in the net stack attempting
> to access the skb [2].
>
> [1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@google.com/
> [2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@google.com/
>
> > The fact you all are debating what to do with skb_frag_page() suggests
> > to me there isn't confidence...
> >
>
> The debate raging on is related to the performance of skb_frag_page(),
> not correctness (and even then, I don't think it's related to
> perf...). Yunsheng would like us to optimize skb_frag_page() using an
> unconditional cast from netmem to page. This in Yunsheng's mind is a
> performance optimization as we don't need to add an if statement
> checking if the netmem is a page. I'm resistant to implement that
> change so far because:
>
> (a) unconditionally casting from netmem to page negates the compiler
> type safety that you and Christian are laying out as a requirement for
> the devmem stuff.
> (b) With likely/unlikely or static branches the check to make sure
> netmem is page is a no-op for existing use cases anyway, so AFAIU,
> there is no perf gain from optimizing it out anyway.
>

Another thought, if anyone else is concerned about the devmem checks
performance,  potentially we could introduce CONFIG_NET_DEVMEM which
when disabled prevents the user from using devmem at all (disables the
netlink API).

When that is disabled, skb_frag_page(), netmem_to_page() & friends can
assume netmem is always a page and do a straight cast between netmem &
page. When it's enabled, it will check that netmem == page before
doing a cast, and return NULL if it is not a page.

I think this is technically viable and I think preserves the compiler
type safety requirements set by mm folks. From my POV though, bloating
the kernel with a a new CONFIG just to optimize out no-op checks seems
unnecessary, but if there is agreement that the checks are a concern,
adding CONFIG_NET_DEVMEM should address it while being acceptable to
mm maintainers.

> But none of this is related to correctness. Code calling
> skb_frag_page() will fail or crash if it's not handled correctly
> regardless of the implementation details of skb_frag_page(). In the
> devmem series we add support to handle it correctly via [1] & [2].
>
> --
> Thanks,
> Mina
Willem de Bruijn Jan. 17, 2024, 6:54 p.m. UTC | #13
Mina Almasry wrote:
> On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > > >>>> You did not answer my question that I asked here, and ignoring this
> > > > >>>> question is preventing us from making any forward progress on this
> > > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > > >>>> there is no page in the frag?
> > > > >>>
> > > > >>> I would expect it to do nothing.
> > > > >>
> > > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > > >> results in a compiler error as the function needs to return a page
> > > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > > >> cast frag->netmem to a page pointer? That was explained as
> > > > >> unacceptable over and over again by Jason and Christian as it risks
> > > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > > >> not get nacked by mm?
> > > > >
> > > > > WARN_ON and return NULL seems reasonable?
> > > >
> >
> > That's more or less what I'm thinking.
> >
> > > > While I am agreed that it may be a nightmare to debug the case of passing
> > > > a false page into the mm system, but I am not sure what's the point of
> > > > returning NULL to caller if the caller is not expecting or handling
> > > > the
> > >
> > > You have to return something and NULL will largely reliably crash the
> > > thread. The WARN_ON explains in detail why your thread just crashed.
> > >
> >
> > Agreed.
> >
> > > > NULL returning[for example, most of mm API called by the networking does not
> > > > seems to handling NULL as input page], isn't the NULL returning will make
> > > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > > As returning NULL seems to be causing a confusion for the caller of
> > > > skb_frag_page() as whether to or how to handle the NULL returning case.
> > >
> > > Possibly, though Linus doesn't like BUG_ON on principle..
> > >
> > > I think the bigger challenge is convincing people that this devmem
> > > stuff doesn't just open a bunch of holes in the kernel where userspace
> > > can crash it.
> > >
> >
> > It does not, and as of right now there are no pending concerns from
> > any netdev maintainers regarding mishandled devmem checks at least.
> > This is because the devmem series comes with a full audit of
> > skb_frag_page() callers [1] and all areas in the net stack attempting
> > to access the skb [2].
> >
> > [1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@google.com/
> > [2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@google.com/
> >
> > > The fact you all are debating what to do with skb_frag_page() suggests
> > > to me there isn't confidence...
> > >
> >
> > The debate raging on is related to the performance of skb_frag_page(),
> > not correctness (and even then, I don't think it's related to
> > perf...). Yunsheng would like us to optimize skb_frag_page() using an
> > unconditional cast from netmem to page. This in Yunsheng's mind is a
> > performance optimization as we don't need to add an if statement
> > checking if the netmem is a page. I'm resistant to implement that
> > change so far because:
> >
> > (a) unconditionally casting from netmem to page negates the compiler
> > type safety that you and Christian are laying out as a requirement for
> > the devmem stuff.
> > (b) With likely/unlikely or static branches the check to make sure
> > netmem is page is a no-op for existing use cases anyway, so AFAIU,
> > there is no perf gain from optimizing it out anyway.
> >
> 
> Another thought, if anyone else is concerned about the devmem checks
> performance,  potentially we could introduce CONFIG_NET_DEVMEM which
> when disabled prevents the user from using devmem at all (disables the
> netlink API).
> 
> When that is disabled, skb_frag_page(), netmem_to_page() & friends can
> assume netmem is always a page and do a straight cast between netmem &
> page. When it's enabled, it will check that netmem == page before
> doing a cast, and return NULL if it is not a page.
> 
> I think this is technically viable and I think preserves the compiler
> type safety requirements set by mm folks. From my POV though, bloating
> the kernel with a a new CONFIG just to optimize out no-op checks seems
> unnecessary, but if there is agreement that the checks are a concern,
> adding CONFIG_NET_DEVMEM should address it while being acceptable to
> mm maintainers.

I agree. A concern with CONFIGs is that what matters in practice is
which default the distros compile with. In this case, adding hurdles
to using the feature likely for no real reason.

Static branches are used throughout the kernel in performance
sensitive paths, exactly because they allow optional paths effectively
for free. I'm quite surprised that this issue is being raised so
strongly here, as they are hardly new or controversial.

But perhaps best is to show with data. Is there a representative page
pool benchmark that exercises the most sensitive case (XDP_DROP?) that
we can run with and without a static branch to demonstrate that any
diff is within the noise?

> > But none of this is related to correctness. Code calling
> > skb_frag_page() will fail or crash if it's not handled correctly
> > regardless of the implementation details of skb_frag_page(). In the
> > devmem series we add support to handle it correctly via [1] & [2].
> >
> > --
> > Thanks,
> > Mina
> 
> 
> 
> -- 
> Thanks,
> Mina
Yunsheng Lin Jan. 18, 2024, 8:52 a.m. UTC | #14
On 2024/1/18 2:54, Willem de Bruijn wrote:
> 
> I agree. A concern with CONFIGs is that what matters in practice is
> which default the distros compile with. In this case, adding hurdles
> to using the feature likely for no real reason.
> 
> Static branches are used throughout the kernel in performance
> sensitive paths, exactly because they allow optional paths effectively
> for free. I'm quite surprised that this issue is being raised so
> strongly here, as they are hardly new or controversial.

The new or controversial part about its usage in the devmem patchset as my
understanding is:
1. It is assumed the devmem and normal page processing in networking does
   not have to be treated equally in the same system, either the performance
   of devmem is favored or the performance of normal page is favored. I think
   if distros is starting to worry about the CONFIG for devmem, devmem must be
   quite popular that we might need the best performance of both. IMHO, static
   branches might be just a convenient way to start supporting the devmem for
   now as we seems to not have a clear idea of unified handling or proper API
   for both devmem and normal page.

2. Specifically to skb_frag_page(), if the returning NULL is to catch its misuse
   for devmem, then I am agreed with this generally. But the NULL returning
   handling in kcm_write_msgs() seems to suggest otherwise to me. Isn't it
   reasonable to make the semantic obvious by using WARN_ON or BUG_ON directly in
   skb_frag_page(), and returning NULL does not 100% reliably crash the thread as
   suggested by jason?

> 
> But perhaps best is to show with data. Is there a representative page
> pool benchmark that exercises the most sensitive case (XDP_DROP?) that
> we can run with and without a static branch to demonstrate that any
> diff is within the noise?
> 
>>> But none of this is related to correctness. Code calling
>>> skb_frag_page() will fail or crash if it's not handled correctly
>>> regardless of the implementation details of skb_frag_page(). In the
>>> devmem series we add support to handle it correctly via [1] & [2].
>>>
>>> --
>>> Thanks,
>>> Mina
>>
>>
>>
>> -- 
>> Thanks,
>> Mina
> 
> 
> 
> .
>
Mina Almasry Jan. 18, 2024, 1:56 p.m. UTC | #15
On Wed, Jan 17, 2024 at 10:54 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Mina Almasry wrote:
> > On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > > > >>>> You did not answer my question that I asked here, and ignoring this
> > > > > >>>> question is preventing us from making any forward progress on this
> > > > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > > > >>>> there is no page in the frag?
> > > > > >>>
> > > > > >>> I would expect it to do nothing.
> > > > > >>
> > > > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > > > >> results in a compiler error as the function needs to return a page
> > > > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > > > >> cast frag->netmem to a page pointer? That was explained as
> > > > > >> unacceptable over and over again by Jason and Christian as it risks
> > > > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > > > >> not get nacked by mm?
> > > > > >
> > > > > > WARN_ON and return NULL seems reasonable?
> > > > >
> > >
> > > That's more or less what I'm thinking.
> > >
> > > > > While I am agreed that it may be a nightmare to debug the case of passing
> > > > > a false page into the mm system, but I am not sure what's the point of
> > > > > returning NULL to caller if the caller is not expecting or handling
> > > > > the
> > > >
> > > > You have to return something and NULL will largely reliably crash the
> > > > thread. The WARN_ON explains in detail why your thread just crashed.
> > > >
> > >
> > > Agreed.
> > >
> > > > > NULL returning[for example, most of mm API called by the networking does not
> > > > > seems to handling NULL as input page], isn't the NULL returning will make
> > > > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > > > As returning NULL seems to be causing a confusion for the caller of
> > > > > skb_frag_page() as whether to or how to handle the NULL returning case.
> > > >
> > > > Possibly, though Linus doesn't like BUG_ON on principle..
> > > >
> > > > I think the bigger challenge is convincing people that this devmem
> > > > stuff doesn't just open a bunch of holes in the kernel where userspace
> > > > can crash it.
> > > >
> > >
> > > It does not, and as of right now there are no pending concerns from
> > > any netdev maintainers regarding mishandled devmem checks at least.
> > > This is because the devmem series comes with a full audit of
> > > skb_frag_page() callers [1] and all areas in the net stack attempting
> > > to access the skb [2].
> > >
> > > [1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@google.com/
> > > [2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@google.com/
> > >
> > > > The fact you all are debating what to do with skb_frag_page() suggests
> > > > to me there isn't confidence...
> > > >
> > >
> > > The debate raging on is related to the performance of skb_frag_page(),
> > > not correctness (and even then, I don't think it's related to
> > > perf...). Yunsheng would like us to optimize skb_frag_page() using an
> > > unconditional cast from netmem to page. This in Yunsheng's mind is a
> > > performance optimization as we don't need to add an if statement
> > > checking if the netmem is a page. I'm resistant to implement that
> > > change so far because:
> > >
> > > (a) unconditionally casting from netmem to page negates the compiler
> > > type safety that you and Christian are laying out as a requirement for
> > > the devmem stuff.
> > > (b) With likely/unlikely or static branches the check to make sure
> > > netmem is page is a no-op for existing use cases anyway, so AFAIU,
> > > there is no perf gain from optimizing it out anyway.
> > >
> >
> > Another thought, if anyone else is concerned about the devmem checks
> > performance,  potentially we could introduce CONFIG_NET_DEVMEM which
> > when disabled prevents the user from using devmem at all (disables the
> > netlink API).
> >
> > When that is disabled, skb_frag_page(), netmem_to_page() & friends can
> > assume netmem is always a page and do a straight cast between netmem &
> > page. When it's enabled, it will check that netmem == page before
> > doing a cast, and return NULL if it is not a page.
> >
> > I think this is technically viable and I think preserves the compiler
> > type safety requirements set by mm folks. From my POV though, bloating
> > the kernel with a a new CONFIG just to optimize out no-op checks seems
> > unnecessary, but if there is agreement that the checks are a concern,
> > adding CONFIG_NET_DEVMEM should address it while being acceptable to
> > mm maintainers.
>
> I agree. A concern with CONFIGs is that what matters in practice is
> which default the distros compile with. In this case, adding hurdles
> to using the feature likely for no real reason.
>
> Static branches are used throughout the kernel in performance
> sensitive paths, exactly because they allow optional paths effectively
> for free. I'm quite surprised that this issue is being raised so
> strongly here, as they are hardly new or controversial.
>
> But perhaps best is to show with data. Is there a representative page
> pool benchmark that exercises the most sensitive case (XDP_DROP?) that
> we can run with and without a static branch to demonstrate that any
> diff is within the noise?
>

Yes, Jesper has a page_pool benchmark that he pointed me to in RFC v2:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

When running the results on RFCv2, the results were:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

On RFC v2 the benchmark showed only a single instruction regression in
the page_pool fast path & the change deemed acceptable to Jesper from
a performance POV [1].

I have not run the benchmark continually on follow up iterations of
the RFC, but I think I'll start doing so in the future.

[1] https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/

> > > But none of this is related to correctness. Code calling
> > > skb_frag_page() will fail or crash if it's not handled correctly
> > > regardless of the implementation details of skb_frag_page(). In the
> > > devmem series we add support to handle it correctly via [1] & [2].
> > >
> > > --
> > > Thanks,
> > > Mina
> >
> >
> >
> > --
> > Thanks,
> > Mina
>
>
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5ae952454c8..e59f76151628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@ 
 #endif
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
+#include <net/netmem.h>
 
 /**
  * DOC: skb checksums
@@ -359,7 +360,11 @@  extern int sysctl_max_skb_frags;
  */
 #define GSO_BY_FRAGS	0xFFFF
 
-typedef struct bio_vec skb_frag_t;
+typedef struct skb_frag {
+	netmem_ref netmem;
+	unsigned int len;
+	unsigned int offset;
+} skb_frag_t;
 
 /**
  * skb_frag_size() - Returns the size of a skb fragment
@@ -367,7 +372,7 @@  typedef struct bio_vec skb_frag_t;
  */
 static inline unsigned int skb_frag_size(const skb_frag_t *frag)
 {
-	return frag->bv_len;
+	return frag->len;
 }
 
 /**
@@ -377,7 +382,7 @@  static inline unsigned int skb_frag_size(const skb_frag_t *frag)
  */
 static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
 {
-	frag->bv_len = size;
+	frag->len = size;
 }
 
 /**
@@ -387,7 +392,7 @@  static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
  */
 static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_len += delta;
+	frag->len += delta;
 }
 
 /**
@@ -397,7 +402,7 @@  static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
 {
-	frag->bv_len -= delta;
+	frag->len -= delta;
 }
 
 /**
@@ -417,7 +422,7 @@  static inline bool skb_frag_must_loop(struct page *p)
  *	skb_frag_foreach_page - loop over pages in a fragment
  *
  *	@f:		skb frag to operate on
- *	@f_off:		offset from start of f->bv_page
+ *	@f_off:		offset from start of f->netmem
  *	@f_len:		length from f_off to loop over
  *	@p:		(temp var) current page
  *	@p_off:		(temp var) offset from start of current page,
@@ -2429,22 +2434,37 @@  static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
+					     netmem_ref netmem, int off,
+					     int size)
+{
+	frag->netmem = netmem;
+	frag->offset = off;
+	skb_frag_size_set(frag, size);
+}
+
 static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
 					   struct page *page,
 					   int off, int size)
 {
-	frag->bv_page = page;
-	frag->bv_offset = off;
-	skb_frag_size_set(frag, size);
+	skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
+}
+
+static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info *shinfo,
+						int i, netmem_ref netmem,
+						int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	skb_frag_fill_netmem_desc(frag, netmem, off, size);
 }
 
 static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
 					      int i, struct page *page,
 					      int off, int size)
 {
-	skb_frag_t *frag = &shinfo->frags[i];
-
-	skb_frag_fill_page_desc(frag, page, off, size);
+	__skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
+				     size);
 }
 
 /**
@@ -2460,10 +2480,10 @@  static inline void skb_len_add(struct sk_buff *skb, int delta)
 }
 
 /**
- * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * __skb_fill_netmem_desc - initialise a fragment in an skb
  * @skb: buffer containing fragment to be initialised
- * @i: paged fragment index to initialise
- * @page: the page to use for this fragment
+ * @i: fragment index to initialise
+ * @netmem: the netmem to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
  *
@@ -2472,10 +2492,12 @@  static inline void skb_len_add(struct sk_buff *skb, int delta)
  *
  * Does not take any additional reference on the fragment.
  */
-static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
-					struct page *page, int off, int size)
+static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
+					  netmem_ref netmem, int off, int size)
 {
-	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+	struct page *page = netmem_to_page(netmem);
+
+	__skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
 
 	/* Propagate page pfmemalloc to the skb if we can. The problem is
 	 * that not all callers have unique ownership of the page but rely
@@ -2483,7 +2505,20 @@  static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	 */
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
-		skb->pfmemalloc	= true;
+		skb->pfmemalloc = true;
+}
+
+static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
+					struct page *page, int off, int size)
+{
+	__skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
+}
+
+static inline void skb_fill_netmem_desc(struct sk_buff *skb, int i,
+					netmem_ref netmem, int off, int size)
+{
+	__skb_fill_netmem_desc(skb, i, netmem, off, size);
+	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
 /**
@@ -2503,8 +2538,7 @@  static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
 				      struct page *page, int off, int size)
 {
-	__skb_fill_page_desc(skb, i, page, off, size);
-	skb_shinfo(skb)->nr_frags = i + 1;
+	skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
 }
 
 /**
@@ -2530,6 +2564,8 @@  static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i,
 
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size, unsigned int truesize);
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+			    int off, int size, unsigned int truesize);
 
 void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
 			  unsigned int truesize);
@@ -3378,7 +3414,7 @@  static inline void skb_propagate_pfmemalloc(const struct page *page,
  */
 static inline unsigned int skb_frag_off(const skb_frag_t *frag)
 {
-	return frag->bv_offset;
+	return frag->offset;
 }
 
 /**
@@ -3388,7 +3424,7 @@  static inline unsigned int skb_frag_off(const skb_frag_t *frag)
  */
 static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_offset += delta;
+	frag->offset += delta;
 }
 
 /**
@@ -3398,7 +3434,7 @@  static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 {
-	frag->bv_offset = offset;
+	frag->offset = offset;
 }
 
 /**
@@ -3409,7 +3445,7 @@  static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 static inline void skb_frag_off_copy(skb_frag_t *fragto,
 				     const skb_frag_t *fragfrom)
 {
-	fragto->bv_offset = fragfrom->bv_offset;
+	fragto->offset = fragfrom->offset;
 }
 
 /**
@@ -3420,7 +3456,7 @@  static inline void skb_frag_off_copy(skb_frag_t *fragto,
  */
 static inline struct page *skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->bv_page;
+	return netmem_to_page(frag->netmem);
 }
 
 /**
@@ -3524,7 +3560,7 @@  static inline void *skb_frag_address_safe(const skb_frag_t *frag)
 static inline void skb_frag_page_copy(skb_frag_t *fragto,
 				      const skb_frag_t *fragfrom)
 {
-	fragto->bv_page = fragfrom->bv_page;
+	fragto->netmem = fragfrom->netmem;
 }
 
 bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 12d22c0b8551..4fdc33c81969 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -845,16 +845,24 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 }
 EXPORT_SYMBOL(__napi_alloc_skb);
 
-void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
-		     int size, unsigned int truesize)
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+			    int off, int size, unsigned int truesize)
 {
 	DEBUG_NET_WARN_ON_ONCE(size > truesize);
 
-	skb_fill_page_desc(skb, i, page, off, size);
+	skb_fill_netmem_desc(skb, i, netmem, off, size);
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += truesize;
 }
+EXPORT_SYMBOL(skb_add_rx_frag_netmem);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		     int size, unsigned int truesize)
+{
+	skb_add_rx_frag_netmem(skb, i, page_to_netmem(page), off, size,
+			       truesize);
+}
 EXPORT_SYMBOL(skb_add_rx_frag);
 
 void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
@@ -1904,10 +1912,11 @@  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 
 	/* skb frags point to kernel buffers */
 	for (i = 0; i < new_frags - 1; i++) {
-		__skb_fill_page_desc(skb, i, head, 0, psize);
+		__skb_fill_netmem_desc(skb, i, page_to_netmem(head), 0, psize);
 		head = (struct page *)page_private(head);
 	}
-	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+	__skb_fill_netmem_desc(skb, new_frags - 1, page_to_netmem(head), 0,
+			       d_off);
 	skb_shinfo(skb)->nr_frags = new_frags;
 
 release:
@@ -3645,7 +3654,8 @@  skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		if (plen) {
 			page = virt_to_head_page(from->head);
 			offset = from->data - (unsigned char *)page_address(page);
-			__skb_fill_page_desc(to, 0, page, offset, plen);
+			__skb_fill_netmem_desc(to, 0, page_to_netmem(page),
+					       offset, plen);
 			get_page(page);
 			j = 1;
 			len -= plen;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 1184d40167b8..145ef22b2b35 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -636,9 +636,14 @@  static int kcm_write_msgs(struct kcm_sock *kcm)
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			msize += skb_frag_size(&skb_shinfo(skb)->frags[i]);
 
+		if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
+			ret = -EINVAL;
+			goto out;
+		}
+
 		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
-			      skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
-			      msize);
+			      (const struct bio_vec *)skb_shinfo(skb)->frags,
+			      skb_shinfo(skb)->nr_frags, msize);
 		iov_iter_advance(&msg.msg_iter, txm->frag_offset);
 
 		do {