[0/7] net: openvswitch: Reduce stack usage

Message ID 20231011034344.104398-1-npiggin@gmail.com (mailing list archive)

Message

Nicholas Piggin Oct. 11, 2023, 3:43 a.m. UTC
Hi,

I'll post this out again to keep discussion going. Thanks all for the
testing and comments so far.

Changes since the RFC
https://lore.kernel.org/netdev/20230927001308.749910-1-npiggin@gmail.com/

- Replace slab allocations for flow keys with expanding the use
  of the per-CPU key allocator to ovs_vport_receive.

- Drop patch 1 with Ilya's since they did the same thing (that is
  added at patch 3).

- Change push_nsh stack reduction from slab allocation to per-cpu
  buffer.

- Drop the ovs_fragment stack usage reduction for now since it used
  slab and was a bit more complicated.

I posted an initial version of the per-cpu flow allocator patch in
the RFC thread. Since then I cleaned up some debug code and increased
the allocator size to accommodate the additional user of it.
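
For anyone who hasn't looked at the allocator, the rough shape of the
idea is a small per-CPU stack of scratch flow keys handed out in LIFO
order with BH disabled, something like the sketch below. This is only
an illustration; the names, pool depth, and exhaustion handling here
are made up and not necessarily what the patches use.

/*
 * Illustrative sketch only, not the code from the series: a per-CPU
 * array of scratch flow keys handed out in LIFO order.  Callers run
 * with BH disabled, so the per-CPU pointer stays stable across get/put.
 */
#include <linux/percpu.h>
#include "flow.h"			/* struct sw_flow_key */

#define OVS_KEY_STACK_DEPTH	8	/* assumed per-CPU pool size */

struct ovs_key_stack {
	int level;			/* next free slot */
	struct sw_flow_key key[OVS_KEY_STACK_DEPTH];
};

static DEFINE_PER_CPU(struct ovs_key_stack, ovs_key_stack);

static struct sw_flow_key *ovs_key_stack_get(void)
{
	struct ovs_key_stack *s = this_cpu_ptr(&ovs_key_stack);

	if (s->level >= OVS_KEY_STACK_DEPTH)
		return NULL;		/* exhausted; caller must cope */
	return &s->key[s->level++];
}

static void ovs_key_stack_put(struct sw_flow_key *key)
{
	struct ovs_key_stack *s = this_cpu_ptr(&ovs_key_stack);

	WARN_ON_ONCE(!s->level || key != &s->key[s->level - 1]);
	s->level--;
}

ovs_vport_receive() then borrows a key from this pool instead of keeping
a large struct sw_flow_key on its stack, and returns it before unwinding.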

Thanks,
Nick

Ilya Maximets (1):
  openvswitch: reduce stack usage in do_execute_actions

Nicholas Piggin (6):
  net: openvswitch: generalise the per-cpu flow key allocation stack
  net: openvswitch: Use flow key allocator in ovs_vport_receive
  net: openvswitch: Reduce push_nsh stack usage
  net: openvswitch: uninline action execution
  net: openvswitch: uninline ovs_fragment to control stack usage
  net: openvswitch: Reduce stack usage in ovs_dp_process_packet

 net/openvswitch/actions.c  | 208 +++++++++++++++++++++++--------------
 net/openvswitch/datapath.c |  56 +++++-----
 net/openvswitch/flow.h     |   3 +
 net/openvswitch/vport.c    |  27 +++--
 4 files changed, 185 insertions(+), 109 deletions(-)

Comments

Ilya Maximets Oct. 11, 2023, 12:22 p.m. UTC | #1
On 10/11/23 05:43, Nicholas Piggin wrote:
> Hi,
> 
> I'll post this out again to keep discussion going. Thanks all for the
> testing and comments so far.

Hi, Nicholas.  This patch set still needs performance evaluation
since it touches very performance-sensitive parts of the stack.
Did you run any performance tests with this version?

IIRC, Aaron was still working on testing for the RFC.  I think,
we should wait for his feedback before re-spinning a new version.

> 
> Changes since the RFC
> https://lore.kernel.org/netdev/20230927001308.749910-1-npiggin@gmail.com/
> 
> - Replace slab allocations for flow keys with expanding the use
>   of the per-CPU key allocator to ovs_vport_receive.

While this is likely to work faster than a dynamic memory allocation,
it is unlikely to be on par with a stack allocation.  Performance
evaluation is necessary.

> 
> - Drop patch 1 with Ilya's since they did the same thing (that is
>   added at patch 3).

The patch is already in net-next, so should not be included in this set.
For the next version (please, hold) please rebase the set on the
net-next/main and add the net-next to the subject prefix of the patches.
They are not simple bug fixes, so should go through net-next, IMO.

You may also see in netdev+bpf patchwork that CI failed trying to guess
on which tree the patches should be applied and no tests were executed.

> 
> - Change push_nsh stack reduction from slab allocation to per-cpu
>   buffer.

I still think this change is not needed and will only consume a lot
of per-CPU memory space for no reason, as NSH is not a frequently
used thing in OVS and the function is not on the recursive path and
explicitly not inlined already.

Best regards, Ilya Maximets.

P.S.  Please use my ovn.org email instead.

> 
> - Drop the ovs_fragment stack usage reduction for now since it used
>   slab and was a bit more complicated.
> 
> I posted an initial version of the per-cpu flow allocator patch in
> the RFC thread. Since then I cleaned up some debug code and increased
> the allocator size to accommodate the additional user of it.
> 
> Thanks,
> Nick
> 
> Ilya Maximets (1):
>   openvswitch: reduce stack usage in do_execute_actions
> 
> Nicholas Piggin (6):
>   net: openvswitch: generalise the per-cpu flow key allocation stack
>   net: openvswitch: Use flow key allocator in ovs_vport_receive
>   net: openvswitch: Reduce push_nsh stack usage
>   net: openvswitch: uninline action execution
>   net: openvswitch: uninline ovs_fragment to control stack usage
>   net: openvswitch: Reduce stack usage in ovs_dp_process_packet
> 
>  net/openvswitch/actions.c  | 208 +++++++++++++++++++++++--------------
>  net/openvswitch/datapath.c |  56 +++++-----
>  net/openvswitch/flow.h     |   3 +
>  net/openvswitch/vport.c    |  27 +++--
>  4 files changed, 185 insertions(+), 109 deletions(-)
>
Aaron Conole Oct. 11, 2023, 1:23 p.m. UTC | #2
Nicholas Piggin <npiggin@gmail.com> writes:

> Hi,
>
> I'll post this out again to keep discussion going. Thanks all for the
> testing and comments so far.

Thanks for the update - did you mean for this to be tagged RFC as well?

I don't see any performance data with the deployments on x86_64 and
ppc64le that cause these stack overflows.  Are you able to provide the
impact on ppc64le and x86_64?

I guess the change probably should be tagged as -next since it doesn't
really have a specific set of commits it is "fixing."  It's really like
a major change and shouldn't really go through stable trees, but I'll
let the maintainers tell me off if I got it wrong.

> Changes since the RFC
> https://lore.kernel.org/netdev/20230927001308.749910-1-npiggin@gmail.com/
>
> - Replace slab allocations for flow keys with expanding the use
>   of the per-CPU key allocator to ovs_vport_receive.
>
> - Drop patch 1 with Ilya's since they did the same thing (that is
>   added at patch 3).
>
> - Change push_nsh stack reduction from slab allocation to per-cpu
>   buffer.
>
> - Drop the ovs_fragment stack usage reduction for now since it used
>   slab and was a bit more complicated.
>
> I posted an initial version of the per-cpu flow allocator patch in
> the RFC thread. Since then I cleaned up some debug code and increased
> the allocator size to accommodate the additional user of it.
>
> Thanks,
> Nick
>
> Ilya Maximets (1):
>   openvswitch: reduce stack usage in do_execute_actions
>
> Nicholas Piggin (6):
>   net: openvswitch: generalise the per-cpu flow key allocation stack
>   net: openvswitch: Use flow key allocator in ovs_vport_receive
>   net: openvswitch: Reduce push_nsh stack usage
>   net: openvswitch: uninline action execution
>   net: openvswitch: uninline ovs_fragment to control stack usage
>   net: openvswitch: Reduce stack usage in ovs_dp_process_packet
>
>  net/openvswitch/actions.c  | 208 +++++++++++++++++++++++--------------
>  net/openvswitch/datapath.c |  56 +++++-----
>  net/openvswitch/flow.h     |   3 +
>  net/openvswitch/vport.c    |  27 +++--
>  4 files changed, 185 insertions(+), 109 deletions(-)
Nicholas Piggin Oct. 12, 2023, 12:08 a.m. UTC | #3
On Wed Oct 11, 2023 at 10:22 PM AEST, Ilya Maximets wrote:
> On 10/11/23 05:43, Nicholas Piggin wrote:
> > Hi,
> > 
> > I'll post this out again to keep discussion going. Thanks all for the
> > testing and comments so far.
>
> Hi, Nicholas.  This patch set still needs performance evaluation
> since it touches very performance-sensitive parts of the stack.
> Did you run any performance tests with this version?

I did; the recipe in the previous thread was in the noise on my
system.

> IIRC, Aaron was still working on testing for the RFC.  I think,
> we should wait for his feedback before re-spinning a new version.

The RFC was a couple of % slower on the same microbenchmark. I
gave him an updated git tree reworked to avoid the slab allocs
he was looking at, but I thought I'd post it out for others to
see.

> > 
> > Changes since the RFC
> > https://lore.kernel.org/netdev/20230927001308.749910-1-npiggin@gmail.com/
> > 
> > - Replace slab allocations for flow keys with expanding the use
> >   of the per-CPU key allocator to ovs_vport_receive.
>
> While this is likely to work faster than a dynamic memory allocation,
> it is unlikely to be on par with a stack allocation.  Performance
> evaluation is necessary.

Sure.

> > 
> > - Drop patch 1 with Ilya's since they did the same thing (that is
> >   added at patch 3).
>
> The patch is already in net-next, so should not be included in this set.
> For the next version (please, hold) please rebase the set on the
> net-next/main and add the net-next to the subject prefix of the patches.
> They are not simple bug fixes, so should go through net-next, IMO.
>
> You may also see in netdev+bpf patchwork that CI failed trying to guess
> on which tree the patches should be applied and no tests were executed.

I was thinking you might take them through your ovs merge process,
but I'm happy to go whichever way you like. And yes, they're not
intended for merge now; I did intend to add an RFC v2 prefix.

>
> > 
> > - Change push_nsh stack reduction from slab allocation to per-cpu
> >   buffer.
>
> I still think this change is not needed and will only consume a lot
> of per-CPU memory space for no reason, as NSH is not a frequently
> used thing in OVS and the function is not on the recursive path and
> explicitly not inlined already.

If it's infrequent and you're concerned with per-CPU memory usage, we
could go back to using slab.

It's not in the recursive path but it can be a leaf called from the
recursive path. It could still be the function that uses the most stack
in any given scenario, no?
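
For reference, the slab variant would look roughly like the sketch
below (illustrative shape only, not the RFC patch itself; it assumes
the existing NSH_HDR_MAX_LEN bound and nsh_hdr_from_nlattr() helper
and trims the error handling the real code would want):

static noinline int do_push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
				const struct nlattr *attr)
{
	struct nshhdr *nh;
	int err;

	/* NSH_HDR_MAX_LEN bytes is too much to keep on the stack of
	 * anything sitting under the recursive action execution path,
	 * so take it from the slab for this infrequent action instead. */
	nh = kmalloc(NSH_HDR_MAX_LEN, GFP_ATOMIC);
	if (unlikely(!nh))
		return -ENOMEM;

	err = nsh_hdr_from_nlattr(nla_data(attr), nh, NSH_HDR_MAX_LEN);
	if (likely(!err))
		err = push_nsh(skb, key, nh);

	kfree(nh);
	return err;
}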

Thanks,
Nick
Nicholas Piggin Oct. 12, 2023, 1:19 a.m. UTC | #4
On Wed Oct 11, 2023 at 11:23 PM AEST, Aaron Conole wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Hi,
> >
> > I'll post this out again to keep discussion going. Thanks all for the
> > testing and comments so far.
>
> Thanks for the update - did you mean for this to be tagged RFC as well?

Yeah, it wasn't intended for merge with no RB or tests of course.
I intended to tag it RFC v2.

>
> I don't see any performance data with the deployments on x86_64 and
> ppc64le that cause these stack overflows.  Are you able to provide the
> impact on ppc64le and x86_64?

Don't think it'll be easy, but they are not pushing such rates
so it wouldn't say much.  If you want to show the worst case, those
tput and latency microbenchmarks should do it.

It's the same tradeoff and reasons the per-cpu key allocator was
added in the first place, presumably. Probably more expensive than
stack, but similar order of magnitude O(cycles) vs slab which is
probably O(100s cycles).

> I guess the change probably should be tagged as -next since it doesn't
> really have a specific set of commits it is "fixing."  It's really like
> a major change and shouldn't really go through stable trees, but I'll
> let the maintainers tell me off if I got it wrong.

It should go upstream first if anything. I thought it was relatively
simple and elegant to reuse the per-cpu key allocator though :(

It is a kernel crash, so we need something for stable. But in a case
like this there's not one single problem. Linux kernel stack use has
always been pretty dumb - "don't use too much", for some values of
too much, and just cross fingers that config and compiler and workload
don't hit some overflow case.

And powerpc has always used more stack than x86, so probably it should stay
one power-of-two larger to be safe. And that may be the best fix for
-stable.

But also, ovs uses too much stack. Look at the stack sizes in the first
RFC patch, and ovs takes the 5 largest. That's because it has always
been the practice to not put large variables on stack, and when you're
introducing significant recursion that puts extra onus on you to be
lean. Even if it costs a percent. There are probably lots of places in
the kernel that could get a few cycles by sticking large structures on
stack, but unfortunately we can't all do that.

Thanks,
Nick
David Laight Oct. 13, 2023, 8:27 a.m. UTC | #5
From: Nicholas Piggin
> Sent: 12 October 2023 02:19
...
> It is a kernel crash, so we need something for stable. But in a case
> like this there's not one single problem. Linux kernel stack use has
> always been pretty dumb - "don't use too much", for some values of
> too much, and just cross fingers that config and compiler and workload
> don't hit some overflow case.

I think it ought to be possible to use a FineIBT (I think that
is what it is called) compile to get indirect calls grouped
and change objtool to output the stack offset of every call.
It is then a simple (for some definition of simple) process
to work out the static maximum stack usage.

Any recursion causes grief (the stack depth for each
loop can be reported).

Also I suspect the compiler will need an enhancement to
add a constant into the hash to disambiguate indirect
calls with the same argument list.

My suspicion (from doing this exercise on an embedded system
a long time ago) is that the stack will be blown somewhere
inside printk() in some error path.
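
As a toy illustration of the walk itself (user space, made-up input
format; the real per-call stack offsets would have to come from objtool
or -fstack-usage output, and a real tool has to do better than bailing
out on recursion):

#include <stdio.h>
#include <string.h>

#define MAX_FN		1024
#define MAX_EDGE	8192

/* Read "caller callee call_site_offset" triples and report the deepest
 * byte count reachable from the first caller seen. */
static char name[MAX_FN][64];
static int frame[MAX_FN], nfn;
static int efrom[MAX_EDGE], eto[MAX_EDGE], nedge;

static int intern(const char *s, int bytes)
{
	int i;

	for (i = 0; i < nfn; i++) {
		if (!strcmp(name[i], s)) {
			if (bytes > frame[i])
				frame[i] = bytes;	/* keep the worst call site */
			return i;
		}
	}
	strcpy(name[nfn], s);
	frame[nfn] = bytes;
	return nfn++;
}

static int depth(int f, int *seen)
{
	int e, best = 0;

	if (seen[f])
		return 0;	/* recursion: a real tool must flag this loop */
	seen[f] = 1;
	for (e = 0; e < nedge; e++) {
		if (efrom[e] == f) {
			int d = depth(eto[e], seen);

			if (d > best)
				best = d;
		}
	}
	seen[f] = 0;
	return frame[f] + best;
}

int main(void)
{
	static int seen[MAX_FN];
	char a[64], b[64];
	int bytes;

	/* e.g. one line per call: "do_execute_actions clone_execute 544" */
	while (scanf("%63s %63s %d", a, b, &bytes) == 3) {
		efrom[nedge] = intern(a, bytes);
		eto[nedge] = intern(b, 0);
		nedge++;
	}
	if (nfn)
		printf("worst-case stack from %s: %d bytes\n",
		       name[0], depth(0, seen));
	return 0;
}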

	David

Aaron Conole Oct. 20, 2023, 5:04 p.m. UTC | #6
"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Wed Oct 11, 2023 at 11:23 PM AEST, Aaron Conole wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>> > Hi,
>> >
>> > I'll post this out again to keep discussion going. Thanks all for the
>> > testing and comments so far.
>>
>> Thanks for the update - did you mean for this to be tagged RFC as well?
>
> Yeah, it wasn't intended for merge with no RB or tests of course.
> I intended to tag it RFC v2.

I only did a basic test with this because of some other stuff, and I
only tested 1, 2, and 3.  I didn't see any real performance changes, but
that is only with a simple port-port test.  I plan to do some additional
testing with some recursive calls.  That will also help to understand
the limits a bit.

That said, I'm very nervous about the key allocator, especially if it is
possible that it runs out.  We probably will need the limit to be
bigger, but I want to get a worst-case flow from OVN side to understand.
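
Regardless of where the limit ends up, I'd also want a graceful path
when the pool runs dry. Roughly like the sketch below, building on the
made-up get/put names from the sketch in the cover letter (the real
series may simply size the pool so this can't happen):

/* Illustrative only: fall back to the slab when the per-CPU pool is
 * exhausted, and tell the two cases apart at free time by address.
 * Assumes get and put happen on the same CPU with BH disabled, as in
 * the receive path. */
static struct sw_flow_key *ovs_key_get(void)
{
	struct sw_flow_key *key = ovs_key_stack_get();

	if (unlikely(!key))
		key = kmalloc(sizeof(*key), GFP_ATOMIC);	/* rare slow path */
	return key;
}

static void ovs_key_put(struct sw_flow_key *key)
{
	struct ovs_key_stack *s = this_cpu_ptr(&ovs_key_stack);

	if (key >= s->key && key < &s->key[OVS_KEY_STACK_DEPTH])
		ovs_key_stack_put(key);		/* came from the pool */
	else
		kfree(key);			/* came from the slab */
}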

>>
>> I don't see any performance data with the deployments on x86_64 and
>> ppc64le that cause these stack overflows.  Are you able to provide the
>> impact on ppc64le and x86_64?
>
> Don't think it'll be easy, but they are not pushing such rates
> so it wouldn't say much.  If you want to show the worst case, those
> tput and latency microbenchmarks should do it.
>
> It's the same tradeoff and reasons the per-cpu key allocator was
> added in the first place, presumably. Probably more expensive than
> stack, but similar order of magnitude O(cycles) vs slab which is
> probably O(100s cycles).
>
>> I guess the change probably should be tagged as -next since it doesn't
>> really have a specific set of commits it is "fixing."  It's really like
>> a major change and shouldn't really go through stable trees, but I'll
>> let the maintainers tell me off if I got it wrong.
>
> It should go upstream first if anything. I thought it was relatively
> simple and elegant to reuse the per-cpu key allocator though :(
>
> It is a kernel crash, so we need something for stable. But in a case
> like this there's not one single problem. Linux kernel stack use has
> always been pretty dumb - "don't use too much", for some values of
> too much, and just cross fingers that config and compiler and workload
> don't hit some overflow case.
>
> And powerpc has always used more stack than x86, so probably it should stay
> one power-of-two larger to be safe. And that may be the best fix for
> -stable.

Given the reply from David (with msg-id:
<ff6cd12e28894f158d9a6c9f7157487f@AcuMS.aculab.com>), are there other
things we can look at with respect to the compiler as well?

> But also, ovs uses too much stack. Look at the stack sizes in the first
> RFC patch, and ovs takes the 5 largest. That's because it has always
> been the practice to not put large variables on stack, and when you're
> introducing significant recursion that puts extra onus on you to be
> lean. Even if it costs a percent. There are probably lots of places in
> the kernel that could get a few cycles by sticking large structures on
> stack, but unfortunately we can't all do that.

Well, OVS operated this way for at least 6 years, so it isn't a recent
thing.  But we should look at it.

I also wonder if we need to recurse in the internal devices, or if we
shouldn't just push the skb into the packet queue.  That will cut out
1/3 of the stack frame that you reported originally, and then when doing
the xmit, will cut out 2/3rds. I have no idea what the performance
impact there might be.  Maybe it looks more like a latency hit
rather than a throughput hit, but just speculating.
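
To make that concrete, I'm picturing something roughly like the sketch
below - illustrative only, with made-up names, and ignoring the flow
key and stats plumbing the real code would need:

/* Instead of the internal-device xmit path calling back into
 * ovs_vport_receive() and growing the stack, queue the skb on a per-CPU
 * list and let only the outermost context drain it, turning the nesting
 * into iteration.  The input vport is remembered in the skb cb as the
 * existing code already does. */
struct ovs_pending_rx {
	struct sk_buff_head skbs;	/* __skb_queue_head_init() at module init */
	bool draining;			/* true while the outer loop runs */
};

static DEFINE_PER_CPU(struct ovs_pending_rx, ovs_pending_rx);

static void ovs_internal_dev_rx(struct vport *vport, struct sk_buff *skb)
{
	struct ovs_pending_rx *q = this_cpu_ptr(&ovs_pending_rx);

	OVS_CB(skb)->input_vport = vport;
	__skb_queue_tail(&q->skbs, skb);

	if (q->draining)
		return;			/* the outer invocation will get to it */

	q->draining = true;
	while ((skb = __skb_dequeue(&q->skbs)) != NULL)
		ovs_vport_receive(OVS_CB(skb)->input_vport, skb, NULL);	/* errors ignored in this sketch */
	q->draining = false;
}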

> Thanks,
> Nick
Nicholas Piggin Oct. 25, 2023, 4:06 a.m. UTC | #7
On Sat Oct 21, 2023 at 3:04 AM AEST, Aaron Conole wrote:
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Wed Oct 11, 2023 at 11:23 PM AEST, Aaron Conole wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >> > Hi,
> >> >
> >> > I'll post this out again to keep discussion going. Thanks all for the
> >> > testing and comments so far.
> >>
> >> Thanks for the update - did you mean for this to be tagged RFC as well?
> >
> > Yeah, it wasn't intended for merge with no RB or tests of course.
> > I intended to tag it RFC v2.
>
> I only did a basic test with this because of some other stuff, and I
> only tested 1, 2, and 3.  I didn't see any real performance changes, but
> that is only with a simple port-port test.  I plan to do some additional
> testing with some recursive calls.  That will also help to understand
> the limits a bit.

Thanks. Good so far.

>
> That said, I'm very nervous about the key allocator, especially if it is
> possible that it runs out.  We probably will need the limit to be
> bigger, but I want to get a worst-case flow from OVN side to understand.
>
> >>
> >> I don't see any performance data with the deployments on x86_64 and
> >> ppc64le that cause these stack overflows.  Are you able to provide the
> >> impact on ppc64le and x86_64?
> >
> > Don't think it'll be easy, but they are not pushing such rates
> > so it wouldn't say much.  If you want to show the worst case, those
> > tput and latency microbenchmarks should do it.
> >
> > It's the same tradeoff and reasons the per-cpu key allocator was
> > added in the first place, presumably. Probably more expensive than
> > stack, but similar order of magnitude O(cycles) vs slab which is
> > probably O(100s cycles).
> >
> >> I guess the change probably should be tagged as -next since it doesn't
> >> really have a specific set of commits it is "fixing."  It's really like
> >> a major change and shouldn't really go through stable trees, but I'll
> >> let the maintainers tell me off if I got it wrong.
> >
> > It should go upstream first if anything. I thought it was relatively
> > simple and elegant to reuse the per-cpu key allocator though :(
> >
> > It is a kernel crash, so we need something for stable. But in a case
> > like this there's not one single problem. Linux kernel stack use has
> > always been pretty dumb - "don't use too much", for some values of
> > too much, and just cross fingers that config and compiler and workload
> > don't hit some overflow case.
> >
> > And powerpc has always used more stack than x86, so probably it should stay
> > one power-of-two larger to be safe. And that may be the best fix for
> > -stable.
>
> Given the reply from David (with msg-id:
> <ff6cd12e28894f158d9a6c9f7157487f@AcuMS.aculab.com>), are there other
> things we can look at with respect to the compiler as well?

Not too easily or immediately, and if we did get something improved
then old compilers still have to be supported for some time.

ppc64 stack might be increased to 32K too, which would avoid the
problem.

But whatever else we do, it would still be good to reduce ovs stack.

> > But also, ovs uses too much stack. Look at the stack sizes in the first
> > RFC patch, and ovs takes the 5 largest. That's because it has always
> > been the practice to not put large variables on stack, and when you're
> > introducing significant recursion that puts extra onus on you to be
> > lean. Even if it costs a percent. There are probably lots of places in
> > the kernel that could get a few cycles by sticking large structures on
> > stack, but unfortunately we can't all do that.
>
> Well, OVS operated this way for at least 6 years, so it isn't a recent
> thing.  But we should look at it.

Maybe ppc64 hadn't been using it much before, and it was perfectly
fine on x86-64. And ovs developers shouldn't really be expected to
test on or understand the stack allocation of every architecture. It
can't be called an ovs bug; there's just a bunch of things where
improvements could be made.

Thanks,
Nick

> I also wonder if we need to recurse in the internal devices, or if we
> shouldn't just push the skb into the packet queue.  That will cut out
> 1/3 of the stack frame that you reported originally, and then when doing
> the xmit, will cut out 2/3rds. I have no idea what the performance
> impact there might be.  Maybe it looks more like a latency hit
> rather than a throughput hit, but just speculating.