
[v2,bpf-next,0/2] xsk: introduce generic almost-zerocopy xmit

Message ID 20210331122602.6000-1-alobakin@pm.me (mailing list archive)

Message

Alexander Lobakin March 31, 2021, 12:26 p.m. UTC
This series is based on the excellent generic zerocopy xmit logic
initially introduced by Xuan Zhuo. It extends that work so it can
cover all sane drivers, not only the ones capable of xmitting skbs
with no linear space.

The first patch is a small while-we-are-here improvement to the
full-copy path, and the second is the main course. See the individual
commit messages for the details.

The original (full-zerocopy) path is still here and still generally
faster, but for now it seems like virtio_net will remain the only
user of it, at least for a considerable period of time.

From v1 [0]:
 - don't add a whole SMP_CACHE_BYTES because of only two bytes
   (NET_IP_ALIGN);
 - switch to zerocopy if the frame is 129 bytes or longer, not 128.
   A 128-byte frame still fits into kmalloc-512, while a zerocopy skb
   always comes from kmalloc-1024, so zerocopy can potentially be
   slower at this frame size.

[0] https://lore.kernel.org/netdev/20210330231528.546284-1-alobakin@pm.me
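The second change above boils down to a single size check. A
hypothetical sketch of the heuristic (names are illustrative, not
taken from the actual patch):

```c
#include <stdbool.h>

/* Frames of up to 128 bytes still fit a kmalloc-512 linear skb, while
 * an almost-zerocopy skb always lands in kmalloc-1024, so plain
 * copying stays at least as fast up to and including 128 bytes.
 */
#define XSK_COPY_MAX_LEN 128

static bool xsk_use_almost_zerocopy(unsigned int frame_len)
{
	return frame_len > XSK_COPY_MAX_LEN; /* 129 bytes or longer */
}
```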

Alexander Lobakin (2):
  xsk: speed-up generic full-copy xmit
  xsk: introduce generic almost-zerocopy xmit

 net/xdp/xsk.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

--
Well, this is untested. I currently don't have access to my setup
and am tied up with moving to another country. Since I don't know for
sure when I'll next get back to working on the kernel, I found it
worth publishing this now -- if any further changes are required once
I'm already out of sight, maybe someone could carry on and make
another revision (I'm still here for any questions, comments, reviews
and improvements till the end of this week).
But this *should* work with all sane drivers. If a particular one
can't handle it, that driver is likely ill. Any tests are highly
appreciated. Thanks!
--
2.31.1

Comments

Magnus Karlsson April 12, 2021, 2:13 p.m. UTC | #1
On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Alexander Lobakin (2):
>   xsk: speed-up generic full-copy xmit

I took both your patches for a spin on my machine, and for the first
one I do see a small but consistent drop in performance. I thought it
would go the other way, but it does not, so let us put this one on
the shelf for now.

>   xsk: introduce generic almost-zerocopy xmit

This one wreaked havoc on my machine ;-). Performance dropped by 75%
for packets larger than 128 bytes, when the new scheme kicks in.
Checking with perf top, it seems that we spend much more time
executing the sendmsg syscall. Analyzing some more:

$ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
Attaching 2 probes...
calls/sec: 1539509 with your patch compared to

calls/sec: 105796 without your patch

The application spends a lot more time trying to get the kernel to
send new packets, but the kernel replies with "have not completed the
outstanding ones, so come back later" = EAGAIN. It seems the
transmission takes longer when the skbs have fragments, but I have
not examined this any further. Did you get a speed-up?

Magnus Karlsson April 13, 2021, 7:14 a.m. UTC | #2
On Tue, Apr 13, 2021 at 3:49 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 12 Apr 2021 16:13:12 +0200, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >   xsk: introduce generic almost-zerocopy xmit
> >
> > This one wreaked havoc on my machine ;-). Performance dropped by 75%
> > for packets larger than 128 bytes, when the new scheme kicks in.
> > Checking with perf top, it seems that we spend much more time
> > executing the sendmsg syscall. Analyzing some more:
> >
> > $ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
> > interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
> > Attaching 2 probes...
> > calls/sec: 1539509 with your patch compared to
> >
> > calls/sec: 105796 without your patch
> >
> > The application spends a lot more time trying to get the kernel to
> > send new packets, but the kernel replies with "have not completed the
> > outstanding ones, so come back later" = EAGAIN. It seems the
> > transmission takes longer when the skbs have fragments, but I have
> > not examined this any further. Did you get a speed-up?
>
> Regarding this solution, I actually tested it on my mlx5 network card,
> but the performance was severely degraded, so I did not continue with
> it. I guess it might have something to do with the physical network
> card. We can try other network cards.

I tried it on a third card and got a 40% degradation, so let us scrap
this idea. It should stay optional, as it is today, so that the
(software) drivers that benefit from this can turn it on explicitly.

> links: https://www.spinics.net/lists/netdev/msg710918.html
>
> Thanks.
>
> >
> > >  net/xdp/xsk.c | 32 ++++++++++++++++++++++----------
> > >  1 file changed, 22 insertions(+), 10 deletions(-)
> > >
> > > --
> > > Well, this is untested. I currently don't have an access to my setup
> > > and is bound by moving to another country, but as I don't know for
> > > sure at the moment when I'll get back to work on the kernel next time,
> > > I found it worthy to publish this now -- if any further changes will
> > > be required when I already will be out-of-sight, maybe someone could
> > > carry on to make a another revision and so on (I'm still here for any
> > > questions, comments, reviews and improvements till the end of this
> > > week).
> > > But this *should* work with all the sane drivers. If a particular
> > > one won't handle this, it's likely ill. Any tests are highly
> > > appreciated. Thanks!
> > > --
> > > 2.31.1
> > >
> > >
Alexander Lobakin April 18, 2021, 12:04 p.m. UTC | #3
From: Magnus Karlsson <magnus.karlsson@gmail.com>
Date: Tue, 13 Apr 2021 09:14:02 +0200

Hi!

I've finally settled into a kinda comfy setup after moving to another
country and can finally continue working on patches and stuff.

> On Tue, Apr 13, 2021 at 3:49 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 12 Apr 2021 16:13:12 +0200, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > Alexander Lobakin (2):
> > > >   xsk: speed-up generic full-copy xmit
> > >
> > > I took both your patches for a spin on my machine and for the first
> > > one I do see a small but consistent drop in performance. I thought it
> > > would go the other way, but it does not so let us put this one on the
> > > shelf for now.

This is kinda strange, as the solution is pretty straightforward.
But sure, if the performance dropped with it applied, it should not
be taken.
I might have a look at it later.

> > > >   xsk: introduce generic almost-zerocopy xmit
> > >
> > > This one wreaked havoc on my machine ;-). Performance dropped by 75%
> > > for packets larger than 128 bytes, when the new scheme kicks in.
> > > Checking with perf top, it seems that we spend much more time
> > > executing the sendmsg syscall. Analyzing some more:
> > >
> > > $ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
> > > interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
> > > Attaching 2 probes...
> > > calls/sec: 1539509 with your patch compared to
> > >
> > > calls/sec: 105796 without your patch
> > >
> > > The application spends a lot more time trying to get the kernel to
> > > send new packets, but the kernel replies with "have not completed the
> > > outstanding ones, so come back later" = EAGAIN. It seems the
> > > transmission takes longer when the skbs have fragments, but I have
> > > not examined this any further. Did you get a speed-up?
> >
> > Regarding this solution, I actually tested it on my mlx5 network card,
> > but the performance was severely degraded, so I did not continue with
> > it. I guess it might have something to do with the physical network
> > card. We can try other network cards.
>
> I tried it on a third card and got a 40% degradation, so let us scrap
> this idea. It should stay optional, as it is today, so that the
> (software) drivers that benefit from this can turn it on explicitly.

Thank you guys a lot for the testing!

I think the main reason is the DMA mapping of one additional frag
per packet (with only the 14 bytes of the MAC header left in the
linear part, which is excessive). It can take a lot of CPU cycles,
especially when the device is behind an IOMMU, and it seems that
memcpying is faster here.
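To illustrate where that extra frag comes from, here is a toy model
of the almost-zerocopy layout (made-up names, not the actual
net/xdp/xsk.c code): only the MAC header is copied into the skb
linear area, while the payload stays in the umem frame and becomes a
single page frag that the driver then has to DMA-map on its own.

```c
#define MAC_HDR_LEN 14	/* ETH_HLEN */

/* Toy skb: just enough fields to show the split. */
struct toy_skb {
	unsigned int linear_len;	/* memcpy'd headers */
	unsigned int nr_frags;		/* each frag needs its own DMA mapping */
	unsigned int frag_len;		/* payload left in place in the umem */
};

static void build_almost_zc(struct toy_skb *skb, unsigned int frame_len)
{
	skb->linear_len = MAC_HDR_LEN;
	skb->nr_frags = 1;
	skb->frag_len = frame_len - MAC_HDR_LEN;
}
```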

Moreover, if Xuan tested this as one of the steps towards his
full-zerocopy work and found it to be a bad idea, it should not go
any further.
So I'm burying this.

Thanks,
Al