[net-next,v1,00/12] First try to replace page_frag with page_frag_cache

Message ID 20240407130850.19625-1-linyunsheng@huawei.com (mailing list archive)

Message

Yunsheng Lin April 7, 2024, 1:08 p.m. UTC
After [1], there are only two implementations of page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().

This patchset tries to unify the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seem to matter that much now that
we have pcp support for high-order pages since commit
44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists").

As the changes are mostly related to networking, this series
targets net-next. The rest of the page_frag users will be
replaced in a follow-up patchset.

After this patchset, we are not only able to unify the page
frag implementation a little, but also seem to get about a
0.5+% performance boost when testing with the vhost_net_test
introduced in [1] and the page_frag_test.ko introduced in this
patchset.

Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

         603027.29 msec task-clock                       #    1.756 CPUs utilized               ( +-  0.04% )
           2097713      context-switches                 #    3.479 K/sec                       ( +-  0.00% )
               212      cpu-migrations                   #    0.352 /sec                        ( +-  4.72% )
                40      page-faults                      #    0.066 /sec                        ( +-  1.18% )
      467215266413      cycles                           #    0.775 GHz                         ( +-  0.12% )  (66.02%)
      131736729037      stalled-cycles-frontend          #   28.20% frontend cycles idle        ( +-  2.38% )  (64.34%)
       77728393294      stalled-cycles-backend           #   16.64% backend cycles idle         ( +-  3.98% )  (65.42%)
      345874254764      instructions                     #    0.74  insn per cycle
                                                  #    0.38  stalled cycles per insn     ( +-  0.75% )  (70.28%)
      105166217892      branches                         #  174.397 M/sec                       ( +-  0.65% )  (68.56%)
        9649321070      branch-misses                    #    9.18% of all branches             ( +-  0.69% )  (65.38%)

           343.376 +- 0.147 seconds time elapsed  ( +-  0.04% )


 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             39.12 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.51% )
                 5      context-switches                 #  127.805 /sec                        ( +-  3.76% )
                 1      cpu-migrations                   #   25.561 /sec                        ( +- 15.52% )
               197      page-faults                      #    5.035 K/sec                       ( +-  0.10% )
          10689913      cycles                           #    0.273 GHz                         ( +-  9.46% )  (72.72%)
           2821237      stalled-cycles-frontend          #   26.39% frontend cycles idle        ( +- 12.04% )  (76.23%)
           5035549      stalled-cycles-backend           #   47.11% backend cycles idle         ( +-  9.69% )  (49.40%)
           5439395      instructions                     #    0.51  insn per cycle
                                                  #    0.93  stalled cycles per insn     ( +- 11.58% )  (51.45%)
           1274419      branches                         #   32.575 M/sec                       ( +- 12.69% )  (77.88%)
             49562      branch-misses                    #    3.89% of all branches             ( +-  9.91% )  (72.32%)

            30.309 +- 0.305 seconds time elapsed  ( +-  1.01% )


After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

         598081.02 msec task-clock                       #    1.752 CPUs utilized               ( +-  0.11% )
           2097738      context-switches                 #    3.507 K/sec                       ( +-  0.00% )
               220      cpu-migrations                   #    0.368 /sec                        ( +-  6.58% )
                40      page-faults                      #    0.067 /sec                        ( +-  0.92% )
      469788205101      cycles                           #    0.785 GHz                         ( +-  0.27% )  (64.86%)
      137108509582      stalled-cycles-frontend          #   29.19% frontend cycles idle        ( +-  0.96% )  (63.62%)
       75499065401      stalled-cycles-backend           #   16.07% backend cycles idle         ( +-  1.04% )  (65.86%)
      345469451681      instructions                     #    0.74  insn per cycle
                                                  #    0.40  stalled cycles per insn     ( +-  0.37% )  (70.16%)
      102782224964      branches                         #  171.853 M/sec                       ( +-  0.62% )  (69.28%)
        9295357532      branch-misses                    #    9.04% of all branches             ( +-  1.08% )  (66.21%)

           341.466 +- 0.305 seconds time elapsed  ( +-  0.09% )


 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             40.09 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.60% )
                 5      context-switches                 #  124.722 /sec                        ( +-  3.45% )
                 1      cpu-migrations                   #   24.944 /sec                        ( +- 12.62% )
               197      page-faults                      #    4.914 K/sec                       ( +-  0.11% )
          10221721      cycles                           #    0.255 GHz                         ( +-  9.05% )  (27.73%)
           2459009      stalled-cycles-frontend          #   24.06% frontend cycles idle        ( +- 10.80% )  (29.05%)
           5148423      stalled-cycles-backend           #   50.37% backend cycles idle         ( +-  7.30% )  (82.47%)
           5889929      instructions                     #    0.58  insn per cycle
                                                  #    0.87  stalled cycles per insn     ( +- 11.85% )  (87.75%)
           1276667      branches                         #   31.846 M/sec                       ( +- 11.48% )  (89.80%)
             50631      branch-misses                    #    3.97% of all branches             ( +-  8.72% )  (83.20%)

            29.341 +- 0.300 seconds time elapsed  ( +-  1.02% )
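
As a rough sanity check on the figures above, the relative change in
elapsed time can be computed directly from the reported means. The sketch
below is plain userspace C; the helper names relative_gain_percent() and
report_gain() are made up for illustration, and the inputs are meant to be
the "seconds time elapsed" values copied from the perf output. Note that
the "+-" figures printed by perf only describe run-to-run variation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: percentage reduction going from 'before' to 'after'. */
static inline double relative_gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Print the gain for one benchmark; both values are mean elapsed seconds. */
static inline void report_gain(const char *name, double before, double after)
{
	printf("%s: %.2f%% less elapsed time\n", name,
	       relative_gain_percent(before, after));
}
```

Calling report_gain("vhost_net_test", 343.376, 341.466) and
report_gain("page_frag_test.ko", 30.309, 29.341) gives roughly 0.56% and
3.19% respectively.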

CC: Alexander Duyck <alexander.duyck@gmail.com>

1. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/

Yunsheng Lin (12):
  mm: Move the page fragment allocator from page_alloc into its own file
  mm: page_frag: use initial zero offset for page_frag_alloc_align()
  mm: page_frag: change page_frag_alloc_* API to accept align param
  mm: page_frag: add '_va' suffix to page_frag API
  mm: page_frag: add two inline helper for page_frag API
  mm: page_frag: reuse MSB of 'size' field for pfmemalloc
  mm: page_frag: reuse existing bit field of 'va' for pagecnt_bias
  net: introduce the skb_copy_to_va_nocache() helper
  mm: page_frag: introduce prepare/commit API for page_frag
  net: replace page_frag with page_frag_cache
  mm: page_frag: add a test module for page_frag
  mm: page_frag: update documentation and maintainer for page_frag

 Documentation/mm/page_frags.rst               | 115 ++++--
 MAINTAINERS                                   |  10 +
 .../chelsio/inline_crypto/chtls/chtls.h       |   3 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 101 ++---
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   3 -
 drivers/net/ethernet/google/gve/gve_rx.c      |   4 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |   2 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
 .../marvell/octeontx2/nic/otx2_common.c       |   2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c    |   4 +-
 drivers/net/tun.c                             |  34 +-
 drivers/nvme/host/tcp.c                       |   8 +-
 drivers/nvme/target/tcp.c                     |  22 +-
 drivers/vhost/net.c                           |   6 +-
 include/linux/gfp.h                           |  22 --
 include/linux/mm_types.h                      |  18 -
 include/linux/page_frag_cache.h               | 339 ++++++++++++++++
 include/linux/sched.h                         |   4 +-
 include/linux/skbuff.h                        |  15 +-
 include/net/sock.h                            |  29 +-
 kernel/bpf/cpumap.c                           |   2 +-
 kernel/exit.c                                 |   3 +-
 kernel/fork.c                                 |   2 +-
 mm/Kconfig.debug                              |   8 +
 mm/Makefile                                   |   2 +
 mm/page_alloc.c                               | 136 -------
 mm/page_frag_cache.c                          | 185 +++++++++
 mm/page_frag_test.c                           | 366 ++++++++++++++++++
 net/core/skbuff.c                             |  57 +--
 net/core/skmsg.c                              |  22 +-
 net/core/sock.c                               |  46 ++-
 net/core/xdp.c                                |   2 +-
 net/ipv4/ip_output.c                          |  35 +-
 net/ipv4/tcp.c                                |  35 +-
 net/ipv4/tcp_output.c                         |  28 +-
 net/ipv6/ip6_output.c                         |  35 +-
 net/kcm/kcmsock.c                             |  30 +-
 net/mptcp/protocol.c                          |  74 ++--
 net/rxrpc/txbuf.c                             |  16 +-
 net/sunrpc/svcsock.c                          |   4 +-
 net/tls/tls_device.c                          | 139 ++++---
 43 files changed, 1404 insertions(+), 572 deletions(-)
 create mode 100644 include/linux/page_frag_cache.h
 create mode 100644 mm/page_frag_cache.c
 create mode 100644 mm/page_frag_test.c

Comments

Alexander Duyck April 7, 2024, 5:02 p.m. UTC | #1
On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> After [1], Only there are two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
>    rx part with 'struct page_frag_cache' and the main API
>    being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
>    tx part with 'struct page_frag' and the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unfiy the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seems matter that much now have
> have pcp support for high-order pages in commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the
> per-cpu lists").
>
> As the related change is mostly related to networking, so
> targeting the net-next. And will try to replace the rest
> of page_frag in the follow patchset.
>
> After this patchset, we are not only able to unify the page
> frag implementation a little, but seems able to have about
> 0.5+% performance boost testing by using the vhost_net_test
> introduced in [1] and page_frag_test.ko introduced in this
> patch.

One question that jumps out at me for this is "why?". No offense but
this is a pretty massive set of changes with over 1400 additions and
500+ deletions and I can't help but ask why, and this cover page
doesn't give me any good reason to think about accepting this set.
What is meant to be the benefit to the community for adding this? All
I am seeing is a ton of extra code to have to review as this
unification is adding an additional 1000+ lines without a good
explanation as to why they are needed.

Also I wouldn't bother mentioning the 0.5+% performance gain as a
"bonus". Changes of that amount usually mean it is within the margin
of error. At best it likely means you haven't introduced a noticeable
regression.
Yunsheng Lin April 8, 2024, 1:37 p.m. UTC | #2
On 2024/4/8 1:02, Alexander Duyck wrote:
> On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> After [1], Only there are two implementations for page frag:
>>
>> 1. mm/page_alloc.c: net stack seems to be using it in the
>>    rx part with 'struct page_frag_cache' and the main API
>>    being page_frag_alloc_align().
>> 2. net/core/sock.c: net stack seems to be using it in the
>>    tx part with 'struct page_frag' and the main API being
>>    skb_page_frag_refill().
>>
>> This patchset tries to unfiy the page frag implementation
>> by replacing page_frag with page_frag_cache for sk_page_frag()
>> first. net_high_order_alloc_disable_key for the implementation
>> in net/core/sock.c doesn't seems matter that much now have
>> have pcp support for high-order pages in commit 44042b449872
>> ("mm/page_alloc: allow high-order pages to be stored on the
>> per-cpu lists").
>>
>> As the related change is mostly related to networking, so
>> targeting the net-next. And will try to replace the rest
>> of page_frag in the follow patchset.
>>
>> After this patchset, we are not only able to unify the page
>> frag implementation a little, but seems able to have about
>> 0.5+% performance boost testing by using the vhost_net_test
>> introduced in [1] and page_frag_test.ko introduced in this
>> patch.
> 
> One question that jumps out at me for this is "why?". No offense but
> this is a pretty massive set of changes with over 1400 additions and
> 500+ deletions and I can't help but ask why, and this cover page
> doesn't give me any good reason to think about accepting this set.

There are 375 + 256 additions for the testing module and the documentation
update in the last two patches, and there are 198 additions and 176
deletions for moving the page fragment allocator from page_alloc into
its own file in patch 1.
Excluding those, there are about 600+ additions and 300+ deletions.
Does that seem reasonable, considering that 140+ additions are needed for
the new API, and 300+ additions and deletions are needed to update the
many users of the old API?

> What is meant to be the benefit to the community for adding this? All
> I am seeing is a ton of extra code to have to review as this
> unification is adding an additional 1000+ lines without a good
> explanation as to why they are needed.

Some benefits I see for now:
1. Improve the maintainability of page frag's implementation:
   (1) Future bugfixes and performance improvements can be done
       in one place. For example, we may be able to save some
       space for the 'page_frag_cache' API user, and avoid
       'get_page()' for the old 'page_frag' API user.

   (2) Provide a proper API so that the caller does not need to access
       internal data fields. Exposing the internal data fields may
       enable the caller to do some unexpected implementation of
       its own like below. After this patchset the API user is not
       supposed to access the data fields of 'page_frag_cache'
       directly. [Currently they are still accessible to an API caller
       that does not follow the rule; I am not sure how to
       limit the access without any performance impact yet.]
https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c#L1141

2. The page_frag API may provide a central point for networking to allocate
   memory instead of calling the page allocator directly in the future, so
   that we can decouple 'struct page' from networking.

> 
> Also I wouldn't bother mentioning the 0.5+% performance gain as a
> "bonus". Changes of that amount usually mean it is within the margin
> of error. At best it likely means you haven't introduced a noticeable
> regression.

For the micro-benchmark ko added in this patchset, the performance gain seems
quite stable when testing on a system without any other load.

> .
>
Alexander Duyck April 8, 2024, 3:09 p.m. UTC | #3
On Mon, Apr 8, 2024 at 6:38 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/4/8 1:02, Alexander Duyck wrote:
> > On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> After [1], Only there are two implementations for page frag:
> >>
> >> 1. mm/page_alloc.c: net stack seems to be using it in the
> >>    rx part with 'struct page_frag_cache' and the main API
> >>    being page_frag_alloc_align().
> >> 2. net/core/sock.c: net stack seems to be using it in the
> >>    tx part with 'struct page_frag' and the main API being
> >>    skb_page_frag_refill().
> >>
> >> This patchset tries to unfiy the page frag implementation
> >> by replacing page_frag with page_frag_cache for sk_page_frag()
> >> first. net_high_order_alloc_disable_key for the implementation
> >> in net/core/sock.c doesn't seems matter that much now have
> >> have pcp support for high-order pages in commit 44042b449872
> >> ("mm/page_alloc: allow high-order pages to be stored on the
> >> per-cpu lists").
> >>
> >> As the related change is mostly related to networking, so
> >> targeting the net-next. And will try to replace the rest
> >> of page_frag in the follow patchset.
> >>
> >> After this patchset, we are not only able to unify the page
> >> frag implementation a little, but seems able to have about
> >> 0.5+% performance boost testing by using the vhost_net_test
> >> introduced in [1] and page_frag_test.ko introduced in this
> >> patch.
> >
> > One question that jumps out at me for this is "why?". No offense but
> > this is a pretty massive set of changes with over 1400 additions and
> > 500+ deletions and I can't help but ask why, and this cover page
> > doesn't give me any good reason to think about accepting this set.
>
> There are 375 + 256 additions for testing module and the documentation
> update in the last two patches, and there is 198 additions and 176
> deletions for moving the page fragment allocator from page_alloc into
> its own file in patch 1.
> Without above number, there are above 600+ additions and 300+ deletions,
> deos that seems reasonable considering 140+ additions are needed to for
> the new API, 300+ additions and deletions for updating the users to use
> the new API as there are many users using the old API?

Maybe it would make more sense to break this into 2 sets. The first
one adding your testing, and the second one consolidating the API.
With that we would have a clearly defined test infrastructure in place
for the second set which is making significant changes to the API. In
addition it would provide the opportunity for others to point out any
other test that they might want pulled in since this is likely to have
impact outside of just the tests you have proposed.

> > What is meant to be the benefit to the community for adding this? All
> > I am seeing is a ton of extra code to have to review as this
> > unification is adding an additional 1000+ lines without a good
> > explanation as to why they are needed.
>
> Some benefits I see for now:
> 1. Improve the maintainability of page frag's implementation:
>    (1) future bugfix and performance can be done in one place.
>        For example, we may able to save some space for the
>        'page_frag_cache' API user, and avoid 'get_page()' for
>        the old 'page_frag' API user.

The problem as I see it is it is consolidating all the consumers down
to the least common denominator in terms of performance. You have
already demonstrated that with patch 2 which enforces that all drivers
have to work from the bottom up instead of being able to work top down
in the page.
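
To make the bottom-up versus top-down distinction concrete, here is a
minimal userspace sketch. It is not the kernel implementation: struct
frag_cursor and both function names are invented for illustration, and
refill, alignment and reference counting are all left out. The only point
is where the exhaustion check compares against zero and where it needs the
buffer size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache: one backing buffer and a cursor. */
struct frag_cursor {
	unsigned char *base; /* start of the backing buffer */
	int size;            /* buffer size in bytes */
	int offset;          /* cursor; its meaning depends on the scheme */
};

/* Top-down: offset starts at 'size' and moves toward zero. */
static inline void *frag_alloc_top_down(struct frag_cursor *c, int fragsz)
{
	int offset;

	assert(c && fragsz > 0);
	offset = c->offset - fragsz;
	if (offset < 0)		/* exhausted: compared against zero only */
		return NULL;
	c->offset = offset;
	return c->base + offset;
}

/* Bottom-up: offset starts at zero and moves toward 'size'. */
static inline void *frag_alloc_bottom_up(struct frag_cursor *c, int fragsz)
{
	void *p;

	assert(c && fragsz > 0);
	if (fragsz > c->size - c->offset)	/* exhausted: needs 'size' too */
		return NULL;
	p = c->base + c->offset;
	c->offset += fragsz;
	return p;
}
```

With a 4096-byte buffer, the first 1024-byte top-down allocation lands at
base + 3072, while the first bottom-up allocation lands at base.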

This eventually leads you down the path where every time somebody has
a use case for it that may not be optimal for others it is going to be
a fight to see if the new use case can degrade the performance of the
other use cases.

>    (2) Provide a proper API so that caller does not need to access
>        internal data field. Exposing the internal data field may
>        enable the caller to do some unexpcted implementation of
>        its own like below, after this patchset the API user is not
>        supposed to do access the data field of 'page_frag_cache'
>        directly[Currently it is still acessable from API caller if
>        the caller is not following the rule, I am not sure how to
>        limit the access without any performance impact yet].
> https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c#L1141

This just makes the issue I point out in 1 even worse. The problem is
this code has to be used at the very lowest of levels and is as
tightly optimized as it is since it is called at least once per packet
in the case of networking. Networking that is still getting faster
mind you and demanding even fewer cycles per packet to try and keep
up. I just see this change as taking us in the wrong direction.

> 2. page_frag API may provide a central point for netwroking to allocate
>    memory instead of calling page allocator directly in the future, so
>    that we can decouple 'struct page' from networking.

I hope not. The fact is the page allocator serves a very specific
purpose, and the page frag API was meant to serve a different one and
not be a replacement for it. One thing that has really irked me is the
fact that I have seen it abused as much as it has been where people
seem to think it is just a page allocator when it was really meant to
just provide a way to shard order 0 pages into sizes that are half a
page or less in size. I really meant for it to be a quick-n-dirty slab
allocator for sizes 2K or less where ideally we are working with
powers of 2.

It concerns me that you are talking about taking this down a path that
will likely lead to further misuse of the code as a backdoor way to
allocate order 0 pages using this instead of just using the page
allocator.

> >
> > Also I wouldn't bother mentioning the 0.5+% performance gain as a
> > "bonus". Changes of that amount usually mean it is within the margin
> > of error. At best it likely means you haven't introduced a noticeable
> > regression.
>
> For micro-benchmark ko added in this patchset, performance gain seems quit
> stable from testing in system without any other load.

Again, that doesn't mean anything. It could just be that the code
shifted somewhere due to all the code moved so a loop got more aligned
than it was before. To give you an idea I have seen performance gains
in the past from turning off Rx checksum for some workloads and that
was simply due to the fact that the CPUs were staying awake longer
instead of going into deep sleep states; as such, we could handle more
packets per second even though we were using more cycles. Without
significantly more context it is hard to say that the gain is anything
real at all and a 0.5% gain is well within that margin of error.
Yunsheng Lin April 9, 2024, 7:59 a.m. UTC | #4
On 2024/4/8 23:09, Alexander Duyck wrote:
> On Mon, Apr 8, 2024 at 6:38 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/4/8 1:02, Alexander Duyck wrote:
>>> On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> After [1], Only there are two implementations for page frag:
>>>>
>>>> 1. mm/page_alloc.c: net stack seems to be using it in the
>>>>    rx part with 'struct page_frag_cache' and the main API
>>>>    being page_frag_alloc_align().
>>>> 2. net/core/sock.c: net stack seems to be using it in the
>>>>    tx part with 'struct page_frag' and the main API being
>>>>    skb_page_frag_refill().
>>>>
>>>> This patchset tries to unfiy the page frag implementation
>>>> by replacing page_frag with page_frag_cache for sk_page_frag()
>>>> first. net_high_order_alloc_disable_key for the implementation
>>>> in net/core/sock.c doesn't seems matter that much now have
>>>> have pcp support for high-order pages in commit 44042b449872
>>>> ("mm/page_alloc: allow high-order pages to be stored on the
>>>> per-cpu lists").
>>>>
>>>> As the related change is mostly related to networking, so
>>>> targeting the net-next. And will try to replace the rest
>>>> of page_frag in the follow patchset.
>>>>
>>>> After this patchset, we are not only able to unify the page
>>>> frag implementation a little, but seems able to have about
>>>> 0.5+% performance boost testing by using the vhost_net_test
>>>> introduced in [1] and page_frag_test.ko introduced in this
>>>> patch.
>>>
>>> One question that jumps out at me for this is "why?". No offense but
>>> this is a pretty massive set of changes with over 1400 additions and
>>> 500+ deletions and I can't help but ask why, and this cover page
>>> doesn't give me any good reason to think about accepting this set.
>>
>> There are 375 + 256 additions for testing module and the documentation
>> update in the last two patches, and there is 198 additions and 176
>> deletions for moving the page fragment allocator from page_alloc into
>> its own file in patch 1.
>> Without above number, there are above 600+ additions and 300+ deletions,
>> deos that seems reasonable considering 140+ additions are needed to for
>> the new API, 300+ additions and deletions for updating the users to use
>> the new API as there are many users using the old API?
> 
> Maybe it would make more sense to break this into 2 sets. The first
> one adding your testing, and the second one consolidating the API.
> With that we would have a clearly defined test infrastructure in place
> for the second set which is making significant changes to the API. In
> addition it would provide the opportunity for others to point out any
> other test that they might want pulled in since this is likely to have
> impact outside of just the tests you have proposed.

Do you have someone in mind who might want some tests pulled in? If so,
it might make sense to work together to minimise possible duplicated
work. If not, it does not make much sense to break this into 2 sets just to
introduce the testing in the first set.

If it makes it easier for you or someone else to do comparison testing before
and after the patchset, I can reorder the series so that the patch adding the
micro-benchmark ko comes first.

> 
>>> What is meant to be the benefit to the community for adding this? All
>>> I am seeing is a ton of extra code to have to review as this
>>> unification is adding an additional 1000+ lines without a good
>>> explanation as to why they are needed.
>>
>> Some benefits I see for now:
>> 1. Improve the maintainability of page frag's implementation:
>>    (1) future bugfix and performance can be done in one place.
>>        For example, we may able to save some space for the
>>        'page_frag_cache' API user, and avoid 'get_page()' for
>>        the old 'page_frag' API user.
> 
> The problem as I see it is it is consolidating all the consumers down
> to the least common denominator in terms of performance. You have
> already demonstrated that with patch 2 which enforces that all drivers
> have to work from the bottom up instead of being able to work top down
> in the page.

I agree that consolidating to 'the least common denominator' is what we
do when we design a subsystem/library, and sometimes we need to make a
trade-off between maintainability and performance.

But your argument 'having to load two registers with the values and then
compare them which saves us a few cycles' in [1] does not seem to justify
keeping a separate implementation of page_frag, not to mention that
the 'work top down' way has its own disadvantages as mentioned in patch 2.

Also, in patch 5 & 6, we need to load 'size' to a register anyway so that we
can remove 'pagecnt_bias' and 'pfmemalloc' from 'struct page_frag_cache'. It
would be better if you could work through the whole patchset to get a bigger
picture.

1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/
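
The 'size' handling referred to above relates to the patch in this series
titled "mm: page_frag: reuse MSB of 'size' field for pfmemalloc". As a
rough illustration of that kind of packing, a flag can share a word with a
size as sketched below. The 32-bit width, the bit position and all of the
names are assumptions made for this sketch, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: bit 31 is the flag, bits 0..30 the size. */
#define FRAG_PFMEMALLOC_BIT	(UINT32_C(1) << 31)
#define FRAG_SIZE_MASK		(FRAG_PFMEMALLOC_BIT - 1)

/* Combine a size and a flag into one word. */
static inline uint32_t frag_pack_size(uint32_t size, bool pfmemalloc)
{
	assert(size <= FRAG_SIZE_MASK);
	return size | (pfmemalloc ? FRAG_PFMEMALLOC_BIT : 0);
}

/* Recover the size by masking off the flag bit. */
static inline uint32_t frag_unpack_size(uint32_t packed)
{
	return packed & FRAG_SIZE_MASK;
}

/* Recover the flag by testing the top bit. */
static inline bool frag_unpack_pfmemalloc(uint32_t packed)
{
	return (packed & FRAG_PFMEMALLOC_BIT) != 0;
}
```

Reading the size back always costs a load plus a mask, which is the sense
in which the size ends up in a register regardless.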

> 
> This eventually leads you down the path where every time somebody has
> a use case for it that may not be optimal for others it is going to be
> a fight to see if the new use case can degrade the performance of the
> other use cases.

I think it is always better to have a discussion [or 'fight'] about how to
support a new use case:
1. Refactor the existing implementation to support the new use case, and
   introduce a new API for it if we have to.
2. If the above does not work, and the use case is important enough, we
   might create/design a subsystem/library for it.

But judging from the page_frag API update, I do not see that we need the
second option yet.

> 
>>    (2) Provide a proper API so that caller does not need to access
>>        internal data field. Exposing the internal data field may
>>        enable the caller to do some unexpcted implementation of
>>        its own like below, after this patchset the API user is not
>>        supposed to do access the data field of 'page_frag_cache'
>>        directly[Currently it is still acessable from API caller if
>>        the caller is not following the rule, I am not sure how to
>>        limit the access without any performance impact yet].
>> https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c#L1141
> 
> This just makes the issue I point out in 1 even worse. The problem is
> this code has to be used at the very lowest of levels and is as
> tightly optimized as it is since it is called at least once per packet
> in the case of networking. Networking that is still getting faster
> mind you and demanding even fewer cycles per packet to try and keep
> up. I just see this change as taking us in the wrong direction.

Yes, I agree with your point about 'demanding even fewer cycles per
packet', but not so much with 'tightly optimized'.

'tightly optimized' may mean everybody reinventing their own wheel.

> 
>> 2. page_frag API may provide a central point for netwroking to allocate
>>    memory instead of calling page allocator directly in the future, so
>>    that we can decouple 'struct page' from networking.
> 
> I hope not. The fact is the page allocator serves a very specific
> purpose, and the page frag API was meant to serve a different one and
> not be a replacement for it. One thing that has really irked me is the
> fact that I have seen it abused as much as it has been where people
> seem to think it is just a page allocator when it was really meant to
> just provide a way to shard order 0 pages into sizes that are half a
> page or less in size. I really meant for it to be a quick-n-dirty slab
> allocator for sizes 2K or less where ideally we are working with
> powers of 2.
> 
> It concerns me that you are talking about taking this down a path that
> will likely lead to further misuse of the code as a backdoor way to
> allocate order 0 pages using this instead of just using the page
> allocator.

Let's not jump to a conclusion here, and wait to see how things evolve
in the future.

> 
>>>
>>> Also I wouldn't bother mentioning the 0.5+% performance gain as a
>>> "bonus". Changes of that amount usually mean it is within the margin
>>> of error. At best it likely means you haven't introduced a noticeable
>>> regression.
>>
>> For micro-benchmark ko added in this patchset, performance gain seems quit
>> stable from testing in system without any other load.
> 
> Again, that doesn't mean anything. It could just be that the code
> shifted somewhere due to all the code moved so a loop got more aligned
> than it was before. To give you an idea I have seen performance gains
> in the past from turning off Rx checksum for some workloads and that
> was simply due to the fact that the CPUs were staying awake longer
> instead of going into deep sleep states as such we could handle more
> packets per second even though we were using more cycles. Without
> significantly more context it is hard to say that the gain is anything
> real at all and a 0.5% gain is well within that margin of error.

As the vhost_net_test added in [2] is heavily involved with tun and virtio
handling, the 0.5% gain does seem to be within that margin of error, which is
why I added a micro-benchmark specifically for page_frag in this patchset.

It was tested five times, three times with this patchset and two times without
it. The complete log is below; even with some noise, every result with this
patchset is better than every result without it:

with this patchset:
 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             40.09 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.60% )
                 5      context-switches                 #  124.722 /sec                        ( +-  3.45% )
                 1      cpu-migrations                   #   24.944 /sec                        ( +- 12.62% )
               197      page-faults                      #    4.914 K/sec                       ( +-  0.11% )
          10221721      cycles                           #    0.255 GHz                         ( +-  9.05% )  (27.73%)
           2459009      stalled-cycles-frontend          #   24.06% frontend cycles idle        ( +- 10.80% )  (29.05%)
           5148423      stalled-cycles-backend           #   50.37% backend cycles idle         ( +-  7.30% )  (82.47%)
           5889929      instructions                     #    0.58  insn per cycle
                                                  #    0.87  stalled cycles per insn     ( +- 11.85% )  (87.75%)
           1276667      branches                         #   31.846 M/sec                       ( +- 11.48% )  (89.80%)
             50631      branch-misses                    #    3.97% of all branches             ( +-  8.72% )  (83.20%)

            29.341 +- 0.300 seconds time elapsed  ( +-  1.02% )

Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             36.56 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.29% )
                 6      context-switches                 #  164.130 /sec                        ( +-  2.65% )
                 1      cpu-migrations                   #   27.355 /sec                        ( +- 15.67% )
               197      page-faults                      #    5.389 K/sec                       ( +-  0.12% )
          10006308      cycles                           #    0.274 GHz                         ( +-  8.36% )  (81.62%)
           2928275      stalled-cycles-frontend          #   29.26% frontend cycles idle        ( +- 11.50% )  (82.62%)
           5321882      stalled-cycles-backend           #   53.19% backend cycles idle         ( +-  8.39% )  (32.25%)
           6653737      instructions                     #    0.66  insn per cycle
                                                  #    0.80  stalled cycles per insn     ( +- 14.95% )  (37.23%)
           1301600      branches                         #   35.605 M/sec                       ( +- 14.24% )  (86.14%)
             47880      branch-misses                    #    3.68% of all branches             ( +- 10.70% )  (80.16%)

            28.683 +- 0.253 seconds time elapsed  ( +-  0.88% )

 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             39.02 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.13% )
                 6      context-switches                 #  153.753 /sec                        ( +-  2.98% )
                 1      cpu-migrations                   #   25.626 /sec                        ( +- 14.50% )
               197      page-faults                      #    5.048 K/sec                       ( +-  0.08% )
          10184452      cycles                           #    0.261 GHz                         ( +-  8.30% )  (40.64%)
           2756400      stalled-cycles-frontend          #   27.06% frontend cycles idle        ( +- 10.82% )  (71.70%)
           5127852      stalled-cycles-backend           #   50.35% backend cycles idle         ( +-  8.95% )  (78.94%)
           6353385      instructions                     #    0.62  insn per cycle
                                                  #    0.81  stalled cycles per insn     ( +- 18.79% )  (84.34%)
           1409873      branches                         #   36.129 M/sec                       ( +- 23.85% )  (80.42%)
             52044      branch-misses                    #    3.69% of all branches             ( +- 10.68% )  (43.96%)

            28.730 +- 0.201 seconds time elapsed  ( +-  0.70% )

-----------------------------------------------------------------------------------------------------------

without this patchset:
 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             39.12 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.51% )
                 5      context-switches                 #  127.805 /sec                        ( +-  3.76% )
                 1      cpu-migrations                   #   25.561 /sec                        ( +- 15.52% )
               197      page-faults                      #    5.035 K/sec                       ( +-  0.10% )
          10689913      cycles                           #    0.273 GHz                         ( +-  9.46% )  (72.72%)
           2821237      stalled-cycles-frontend          #   26.39% frontend cycles idle        ( +- 12.04% )  (76.23%)
           5035549      stalled-cycles-backend           #   47.11% backend cycles idle         ( +-  9.69% )  (49.40%)
           5439395      instructions                     #    0.51  insn per cycle
                                                  #    0.93  stalled cycles per insn     ( +- 11.58% )  (51.45%)
           1274419      branches                         #   32.575 M/sec                       ( +- 12.69% )  (77.88%)
             49562      branch-misses                    #    3.89% of all branches             ( +-  9.91% )  (72.32%)

            30.309 +- 0.305 seconds time elapsed  ( +-  1.01% )

 Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):

             37.40 msec task-clock                       #    0.001 CPUs utilized               ( +-  4.72% )
                 5      context-switches                 #  133.691 /sec                        ( +-  3.65% )
                 1      cpu-migrations                   #   26.738 /sec                        ( +- 14.13% )
               197      page-faults                      #    5.267 K/sec                       ( +-  0.12% )
          10196250      cycles                           #    0.273 GHz                         ( +-  9.37% )  (79.84%)
           2579562      stalled-cycles-frontend          #   25.30% frontend cycles idle        ( +- 13.05% )  (48.29%)
           4833236      stalled-cycles-backend           #   47.40% backend cycles idle         ( +-  9.84% )  (45.64%)
           5992762      instructions                     #    0.59  insn per cycle
                                                  #    0.81  stalled cycles per insn     ( +- 11.01% )  (76.56%)
           1274592      branches                         #   34.080 M/sec                       ( +- 12.88% )  (74.52%)
             51015      branch-misses                    #    4.00% of all branches             ( +- 10.60% )  (75.15%)

            29.958 +- 0.314 seconds time elapsed  ( +-  1.05% )
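
As a rough cross-check of the five logs above, the elapsed times can be
summarized with a few lines of Python. This is only arithmetic on the figures
already posted, not a new measurement. Treating the "+-" values printed by
perf as independent uncertainties and combining them in quadrature is an
assumption made for illustration, and the function and variable names are
hypothetical.

```python
from math import sqrt
from statistics import mean

# (seconds elapsed, "+-" value) copied from the perf stat logs above.
with_patchset = [(29.341, 0.300), (28.683, 0.253), (28.730, 0.201)]
without_patchset = [(30.309, 0.305), (29.958, 0.314)]


def summarize(with_runs, without_runs):
    """Compare mean elapsed time and the closest pair of runs."""
    mean_with = mean(t for t, _ in with_runs)
    mean_without = mean(t for t, _ in without_runs)
    rel_gain = (mean_without - mean_with) / mean_without

    # Closest pair: slowest run with the patchset vs fastest run without it.
    worst_with = max(with_runs)        # tuples compare by elapsed time first
    best_without = min(without_runs)
    gap = best_without[0] - worst_with[0]

    # Assumption: the two "+-" values are independent, so add in quadrature.
    combined = sqrt(worst_with[1] ** 2 + best_without[1] ** 2)
    return {
        "mean_with": mean_with,
        "mean_without": mean_without,
        "rel_gain": rel_gain,
        "gap": gap,
        "gap_over_combined": gap / combined,
    }


result = summarize(with_patchset, without_patchset)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

The mean difference says how large the gain looks on this micro-benchmark,
while the last figure gives a feel for how far the closest pair of runs sits
from the reported noise. Neither says anything about other users of the
allocator.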



2. https://lore.kernel.org/all/20240228093013.8263-6-linyunsheng@huawei.com/

> .
>
Alexander Duyck April 9, 2024, 3:29 p.m. UTC | #5
On Tue, Apr 9, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/4/8 23:09, Alexander Duyck wrote:
> > On Mon, Apr 8, 2024 at 6:38 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/4/8 1:02, Alexander Duyck wrote:
> >>> On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>
> >>>> After [1], Only there are two implementations for page frag:
> >>>>
> >>>> 1. mm/page_alloc.c: net stack seems to be using it in the
> >>>>    rx part with 'struct page_frag_cache' and the main API
> >>>>    being page_frag_alloc_align().
> >>>> 2. net/core/sock.c: net stack seems to be using it in the
> >>>>    tx part with 'struct page_frag' and the main API being
> >>>>    skb_page_frag_refill().
> >>>>
> >>>> This patchset tries to unify the page frag implementation
> >>>> by replacing page_frag with page_frag_cache for sk_page_frag()
> >>>> first. net_high_order_alloc_disable_key for the implementation
> >>>> in net/core/sock.c doesn't seem to matter that much now that we
> >>>> have pcp support for high-order pages in commit 44042b449872
> >>>> ("mm/page_alloc: allow high-order pages to be stored on the
> >>>> per-cpu lists").
> >>>>
> >>>> As the related change is mostly related to networking, so
> >>>> targeting the net-next. And will try to replace the rest
> >>>> of page_frag in the follow patchset.
> >>>>
> >>>> After this patchset, we are not only able to unify the page
> >>>> frag implementation a little, but seems able to have about
> >>>> 0.5+% performance boost testing by using the vhost_net_test
> >>>> introduced in [1] and page_frag_test.ko introduced in this
> >>>> patch.
> >>>
> >>> One question that jumps out at me for this is "why?". No offense but
> >>> this is a pretty massive set of changes with over 1400 additions and
> >>> 500+ deletions and I can't help but ask why, and this cover page
> >>> doesn't give me any good reason to think about accepting this set.
> >>
> >> There are 375 + 256 additions for testing module and the documentation
> >> update in the last two patches, and there is 198 additions and 176
> >> deletions for moving the page fragment allocator from page_alloc into
> >> its own file in patch 1.
> >> Without above number, there are above 600+ additions and 300+ deletions,
> >> does that seem reasonable considering 140+ additions are needed for
> >> the new API, 300+ additions and deletions for updating the users to use
> >> the new API as there are many users using the old API?
> >
> > Maybe it would make more sense to break this into 2 sets. The first
> > one adding your testing, and the second one consolidating the API.
> > With that we would have a clearly defined test infrastructure in place
> > for the second set which is making significant changes to the API. In
> > addition it would provide the opportunity for others to point out any
> > other test that they might want pulled in since this is likely to have
> > impact outside of just the tests you have proposed.
>
> Do you have someone might want pulled in some test in mind, if yes, then
> it might make sense to work together to minimise some possible duplicated
> work. If no, it does not make much sense to break this into 2 sets just to
> introduce a testing in the first set.
>
> If it helps you or someone to do the comparing test before and after patchset
> easier, I would reorder the patch adding the micro-benchmark ko to the first
> patch.

Well, the socket code will be largely impacted by any changes to this.
It might make sense to come up with a socket-based test, for example,
that makes good use of the allocator located there so we can test
consolidating the page frag code out of there.

> >
> >>> What is meant to be the benefit to the community for adding this? All
> >>> I am seeing is a ton of extra code to have to review as this
> >>> unification is adding an additional 1000+ lines without a good
> >>> explanation as to why they are needed.
> >>
> >> Some benefits I see for now:
> >> 1. Improve the maintainability of page frag's implementation:
> >>    (1) future bugfix and performance can be done in one place.
> >>        For example, we may able to save some space for the
> >>        'page_frag_cache' API user, and avoid 'get_page()' for
> >>        the old 'page_frag' API user.
> >
> > The problem as I see it is it is consolidating all the consumers down
> > to the least common denominator in terms of performance. You have
> > already demonstrated that with patch 2 which enforces that all drivers
> > have to work from the bottom up instead of being able to work top down
> > in the page.
>
> I am agreed that consolidating 'the least common denominator' is what we
> do when we design a subsystem/library and sometimes we may need to have a
> trade off between maintainability and performance.
>
> But your argument 'having to load two registers with the values and then
> compare them which saves us a few cycles' in [1] does not seems to justify
> that we need to have it's own implementation of page_frag, not to mention
> the 'work top down' way has its own disadvantages as mentioned in patch 2.
>
> Also, in patch 5 & 6, we need to load 'size' to a register anyway so that we
> can remove 'pagecnt_bias' and 'pfmemalloc' from 'struct page_frag_cache', it
> would be better you can work through the whole patchset to get a bigger picture.
>
> 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/

I haven't had a chance to review the entire patch set yet. I am hoping
to get to it tomorrow. That said, my main concern is that this becomes
a slippery slope, where one thing leads to another and eventually this
becomes some overgrown setup that is no longer performant and has
people migrating back to the slab cache.

> >
> > This eventually leads you down the path where every time somebody has
> > a use case for it that may not be optimal for others it is going to be
> > a fight to see if the new use case can degrade the performance of the
> > other use cases.
>
> I think it is always better to have a discussion [or 'fight'] about how to
> support a new use case:
> 1. refactor the existing implementation to support the new use case, and
>    introduce a new API for it if have to.
> 2. if the above does not work, and the use case is important enough that
>    we might create/design a subsystem/library for it.
>
> But from updating page_frag API, I do not see that we need the second
> option yet.

That is why we are having this discussion right now though. It seems
like you have your own use case that you want to use this for. So as a
result you are refactoring all the existing implementations and
crafting them to support your use case while trying to avoid
introducing regressions in the others. I would argue that based on
this set you are already trying to take the existing code and create a
"new" subsystem/library from it that is based on the original code
with only a few tweaks.

> >
> >>    (2) Provide a proper API so that caller does not need to access
> >>        internal data field. Exposing the internal data field may
> >>        enable the caller to do some unexpected implementation of
> >>        its own like below, after this patchset the API user is not
> >>        supposed to access the data field of 'page_frag_cache'
> >>        directly [Currently it is still accessible from API caller if
> >>        the caller is not following the rule, I am not sure how to
> >>        limit the access without any performance impact yet].
> >> https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c#L1141
> >
> > This just makes the issue I point out in 1 even worse. The problem is
> > this code has to be used at the very lowest of levels and is as
> > tightly optimized as it is since it is called at least once per packet
> > in the case of networking. Networking that is still getting faster
> > mind you and demanding even fewer cycles per packet to try and keep
> > up. I just see this change as taking us in the wrong direction.
>
> Yes, I am agreed with your point about 'demanding even fewer cycles per
> packet', but not so with 'tightly optimized'.
>
> 'tightly optimized' may mean everybody inventing their own wheels.

I hate to break this to you but that is the nature of things. If you
want decent performance you can only be so abstracted away from the
underlying implementation. The more generic you go the less
performance you will get.

> >
> >> 2. page_frag API may provide a central point for networking to allocate
> >>    memory instead of calling page allocator directly in the future, so
> >>    that we can decouple 'struct page' from networking.
> >
> > I hope not. The fact is the page allocator serves a very specific
> > purpose, and the page frag API was meant to serve a different one and
> > not be a replacement for it. One thing that has really irked me is the
> > fact that I have seen it abused as much as it has been where people
> > seem to think it is just a page allocator when it was really meant to
> > just provide a way to shard order 0 pages into sizes that are half a
> > page or less in size. I really meant for it to be a quick-n-dirty slab
> > allocator for sizes 2K or less where ideally we are working with
> > powers of 2.
> >
> > It concerns me that you are talking about taking this down a path that
> > will likely lead to further misuse of the code as a backdoor way to
> > allocate order 0 pages using this instead of just using the page
> > allocator.
>
> Let's not get to a conclusion here and wait to see how thing evolve
> in the future.

I still have an open mind, but this is a warning on where I will not
let this go. This is *NOT* an alternative to the page allocator. If we
need order 0 pages we should be allocating order 0 pages. Ideally this
is just for cases where we need memory in sizes 2K or less.
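
To make the intended use concrete, here is a toy Python model of carving
fragments of half a page or less out of a single page, in both the "top down"
and "bottom up" directions debated above. It is only an illustrative sketch,
not the kernel code: the class name, the 4096-byte page size, and the
(page index, offset) return value are all assumptions, and refcounting,
alignment and pfmemalloc handling are left out.

```python
PAGE_SIZE = 4096  # assumed order-0 page size for this toy model


class ToyFragCache:
    """Hypothetical model of a page fragment cache (not the kernel struct)."""

    def __init__(self, page_size=PAGE_SIZE, top_down=True):
        self.page_size = page_size
        self.top_down = top_down
        self.pages_used = 0
        self.offset = None  # no backing page yet

    def _refill(self):
        # Pretend to grab a fresh page and reset the cursor.
        self.pages_used += 1
        self.offset = self.page_size if self.top_down else 0

    def alloc(self, fragsz):
        """Return (page index, offset) for a fragment of fragsz bytes."""
        if fragsz <= 0 or fragsz > self.page_size // 2:
            raise ValueError("toy model only serves sizes up to half a page")
        if self.offset is None:
            self._refill()
        if self.top_down:
            # Top down: the "does it fit" check is a comparison against zero.
            if self.offset - fragsz < 0:
                self._refill()
            self.offset -= fragsz
            return (self.pages_used - 1, self.offset)
        # Bottom up: the check needs both the cursor and the page size.
        if self.offset + fragsz > self.page_size:
            self._refill()
        start = self.offset
        self.offset += fragsz
        return (self.pages_used - 1, start)


top = ToyFragCache(top_down=True)
bottom = ToyFragCache(top_down=False)
print([top.alloc(2048) for _ in range(3)])
print([bottom.alloc(2048) for _ in range(3)])
```

Either direction packs two 2K fragments into each page; the difference is
only in where the cursor starts and what the fit check compares against.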

> >
> >>>
> >>> Also I wouldn't bother mentioning the 0.5+% performance gain as a
> >>> "bonus". Changes of that amount usually mean it is within the margin
> >>> of error. At best it likely means you haven't introduced a noticeable
> >>> regression.
> >>
> >> For micro-benchmark ko added in this patchset, performance gain seems quite
> >> stable from testing in system without any other load.
> >
> > Again, that doesn't mean anything. It could just be that the code
> > shifted somewhere due to all the code moved so a loop got more aligned
> > than it was before. To give you an idea I have seen performance gains
> > in the past from turning off Rx checksum for some workloads and that
> > was simply due to the fact that the CPUs were staying awake longer
> > instead of going into deep sleep states as such we could handle more
> > packets per second even though we were using more cycles. Without
> > significantly more context it is hard to say that the gain is anything
> > real at all and a 0.5% gain is well within that margin of error.
>
> As vhost_net_test added in [2] is heavily involved with tun and virtio
> handling, the 0.5% gain does seems within that margin of error, that is
> why I added a micro-benchmark specifically for page_frag in this patchset.
>
> It is tested five times, three times with this patchset and two times without
> this patchset, the complete log is as below, even there is some noise, all
> the result with this patchset is better than the result without this patchset:

The problem is that vhost_net_test is you optimizing the page fragment
allocator for *YOUR* use case. I get that you want to show overall
improvement, but that test doesn't do it. You need to provide context for
the current users of the page fragment allocator in the form of
something other than one synthetic benchmark.

I could do the same thing by tweaking the stack and making it drop
all network packets. The NICs would show a huge performance gain. It
doesn't mean it is usable by anybody. A benchmark is worthless without
the context about how it will impact other users.

Think about testing with real use cases for the areas that are already
making use of the page frags rather than your new synthetic benchmark
and the vhost case which you are optimizing for. Arguably this is why
so many implementations go their own way. It is difficult to optimize
for one use case without penalizing another and so the community needs
to be willing to make that trade-off.
Yunsheng Lin April 10, 2024, 11:55 a.m. UTC | #6
On 2024/4/9 23:29, Alexander Duyck wrote:
...

> 
> Well the socket code will be largely impacted by any changes to this.
> Seems like it might make sense to think about coming up with a socket
> based test for example that might make good use of the allocator
> located there so we can test the consolidating of the page frag code
> out of there.

Does it make sense to use netcat + dummy netdev to test the socket code?
Any better idea in mind?

> 
>>>
>>>>> What is meant to be the benefit to the community for adding this? All
>>>>> I am seeing is a ton of extra code to have to review as this
>>>>> unification is adding an additional 1000+ lines without a good
>>>>> explanation as to why they are needed.
>>>>
>>>> Some benefits I see for now:
>>>> 1. Improve the maintainability of page frag's implementation:
>>>>    (1) future bugfix and performance can be done in one place.
>>>>        For example, we may able to save some space for the
>>>>        'page_frag_cache' API user, and avoid 'get_page()' for
>>>>        the old 'page_frag' API user.
>>>
>>> The problem as I see it is it is consolidating all the consumers down
>>> to the least common denominator in terms of performance. You have
>>> already demonstrated that with patch 2 which enforces that all drivers
>>> have to work from the bottom up instead of being able to work top down
>>> in the page.
>>
>> I am agreed that consolidating 'the least common denominator' is what we
>> do when we design a subsystem/library and sometimes we may need to have a
>> trade off between maintainability and performance.
>>
>> But your argument 'having to load two registers with the values and then
>> compare them which saves us a few cycles' in [1] does not seems to justify
>> that we need to have it's own implementation of page_frag, not to mention
>> the 'work top down' way has its own disadvantages as mentioned in patch 2.
>>
>> Also, in patch 5 & 6, we need to load 'size' to a register anyway so that we
>> can remove 'pagecnt_bias' and 'pfmemalloc' from 'struct page_frag_cache', it
>> would be better you can work through the whole patchset to get a bigger picture.
>>
>> 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/
> 
> I haven't had a chance to review the entire patch set yet. I am hoping
> to get to it tomorrow. That said, my main concern is that this becomes
> a slippery slope. Where one thing leads to another and eventually this
> becomes some overgrown setup that is no longer performant and has
> people migrating back to the slab cache.

The problem with the slab cache is that it has no metadata on which
we can take an extra reference, right?

> 
>>>
>>> This eventually leads you down the path where every time somebody has
>>> a use case for it that may not be optimal for others it is going to be
>>> a fight to see if the new use case can degrade the performance of the
>>> other use cases.
>>
>> I think it is always better to have a discussion [or 'fight'] about how to
>> support a new use case:
>> 1. refactor the existing implementation to support the new use case, and
>>    introduce a new API for it if have to.
>> 2. if the above does not work, and the use case is important enough that
>>    we might create/design a subsystem/library for it.
>>
>> But from updating page_frag API, I do not see that we need the second
>> option yet.
> 
> That is why we are having this discussion right now though. It seems
> like you have your own use case that you want to use this for. So as a
> result you are refactoring all the existing implementations and
> crafting them to support your use case while trying to avoid
> introducing regressions in the others. I would argue that based on
> this set you are already trying to take the existing code and create a
> "new" subsystem/library from it that is based on the original code
> with only a few tweaks.

Yes, in some ways. Maybe the plan is something like taking the best of
all the existing implementations and forming a "new" subsystem/library.

> 
>>>
>>>>    (2) Provide a proper API so that caller does not need to access
>>>>        internal data field. Exposing the internal data field may
>>>>        enable the caller to do some unexpected implementation of
>>>>        its own like below, after this patchset the API user is not
>>>>        supposed to access the data field of 'page_frag_cache'
>>>>        directly [Currently it is still accessible from API caller if
>>>>        the caller is not following the rule, I am not sure how to
>>>>        limit the access without any performance impact yet].
>>>> https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c#L1141
>>>
>>> This just makes the issue I point out in 1 even worse. The problem is
>>> this code has to be used at the very lowest of levels and is as
>>> tightly optimized as it is since it is called at least once per packet
>>> in the case of networking. Networking that is still getting faster
>>> mind you and demanding even fewer cycles per packet to try and keep
>>> up. I just see this change as taking us in the wrong direction.
>>
>> Yes, I am agreed with your point about 'demanding even fewer cycles per
>> packet', but not so with 'tightly optimized'.
>>
>> 'tightly optimized' may mean everybody inventing their own wheels.
> 
> I hate to break this to you but that is the nature of things. If you
> want to perform with decent performance you can only be so abstracted
> away from the underlying implementation. The more generic you go the
> less performance you will get.

But we need a balance between performance and maintainability. I think
what we are arguing about is where that balance might be?

> 
>>>
>>>> 2. page_frag API may provide a central point for networking to allocate
>>>>    memory instead of calling page allocator directly in the future, so
>>>>    that we can decouple 'struct page' from networking.
>>>
>>> I hope not. The fact is the page allocator serves a very specific
>>> purpose, and the page frag API was meant to serve a different one and
>>> not be a replacement for it. One thing that has really irked me is the
>>> fact that I have seen it abused as much as it has been where people
>>> seem to think it is just a page allocator when it was really meant to
>>> just provide a way to shard order 0 pages into sizes that are half a
>>> page or less in size. I really meant for it to be a quick-n-dirty slab
>>> allocator for sizes 2K or less where ideally we are working with
>>> powers of 2.
>>>
>>> It concerns me that you are talking about taking this down a path that
>>> will likely lead to further misuse of the code as a backdoor way to
>>> allocate order 0 pages using this instead of just using the page
>>> allocator.
>>
>> Let's not get to a conclusion here and wait to see how thing evolve
>> in the future.
> 
> I still have an open mind, but this is a warning on where I will not
> let this go. This is *NOT* an alternative to the page allocator. If we
> need order 0 pages we should be allocating order 0 pages. Ideally this
> is just for cases where we need memory in sizes 2K or less.

If the whole folio plan works, the page allocator may return a single
pointer without the 'struct page' metadata for networking. I am not sure
whether I am worrying too much here, but we might need to prepare for that.

> 
>>>
>>>>>
>>>>> Also I wouldn't bother mentioning the 0.5+% performance gain as a
>>>>> "bonus". Changes of that amount usually mean it is within the margin
>>>>> of error. At best it likely means you haven't introduced a noticeable
>>>>> regression.
>>>>
>>>> For micro-benchmark ko added in this patchset, performance gain seems quite
>>>> stable from testing in system without any other load.
>>>
>>> Again, that doesn't mean anything. It could just be that the code
>>> shifted somewhere due to all the code moved so a loop got more aligned
>>> than it was before. To give you an idea I have seen performance gains
>>> in the past from turning off Rx checksum for some workloads and that
>>> was simply due to the fact that the CPUs were staying awake longer
>>> instead of going into deep sleep states as such we could handle more
>>> packets per second even though we were using more cycles. Without
>>> significantly more context it is hard to say that the gain is anything
>>> real at all and a 0.5% gain is well within that margin of error.
>>
>> As vhost_net_test added in [2] is heavily involved with tun and virtio
>> handling, the 0.5% gain does seems within that margin of error, that is
>> why I added a micro-benchmark specifically for page_frag in this patchset.
>>
>> It is tested five times, three times with this patchset and two times without
>> this patchset, the complete log is as below, even there is some noise, all
>> the result with this patchset is better than the result without this patchset:
> 
> The problem is the vhost_net_test is you optimizing the page fragment
> allocator for *YOUR* use case. I get that you want to show overall
> improvement but that doesn't. You need to provide it with context for
> the current users of the page fragment allocator in the form of
> something other than one synthetic benchmark.
> 
> I could do the same thing by tweaking the stack and making it drop
> all network packets. The NICs would show a huge performance gain. It
> doesn't mean it is usable by anybody. A benchmark is worthless without
> the context about how it will impact other users.
> 
> Think about testing with real use cases for the areas that are already
> making use of the page frags rather than your new synthetic benchmark
> and the vhost case which you are optimizing for. Arguably this is why
> so many implementations go their own way. It is difficult to optimize
> for one use case without penalizing another and so the community needs
> to be willing to make that trade-off.
> .
>