
[RFC,0/7] net: openvswitch: Reduce stack usage

Message ID: 20230927001308.749910-1-npiggin@gmail.com

Message

Nicholas Piggin Sept. 27, 2023, 12:13 a.m. UTC
Hi,

We've got a report of a stack overflow on ppc64le with a 16kB kernel
stack. Openvswitch is just one of many things in the stack, but it
does cause recursion and contributes to some usage.

Here are a few patches for reducing stack overhead. I don't know the
code well, so consider them just ideas. GFP_ATOMIC allocations
introduced in a couple of places might be controversial, but there
are still some savings to be had if you skip those.
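
To illustrate the kind of change: the sketch below trades an on-stack
scratch buffer for a GFP_ATOMIC allocation. The names (example_action,
example_do_work, EXAMPLE_BUF_LEN) are made up for the example and are not
taken from the series, and each real call site still has to decide what
to do when the allocation fails.

#include <linux/skbuff.h>
#include <linux/slab.h>

#define EXAMPLE_BUF_LEN 256	/* size picked for the example only */

/* Stand-in for whatever work needs the scratch buffer. */
int example_do_work(struct sk_buff *skb, u8 *buf, size_t len);

static int example_action(struct sk_buff *skb)
{
	u8 *buf;
	int err;

	/* Before: "u8 buf[EXAMPLE_BUF_LEN];" lived on the kernel stack. */
	buf = kmalloc(EXAMPLE_BUF_LEN, GFP_ATOMIC);
	if (unlikely(!buf))
		return -ENOMEM;	/* callers must tolerate the failure */

	err = example_do_work(skb, buf, EXAMPLE_BUF_LEN);

	kfree(buf);
	return err;
}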

Here is one place detected where the stack reaches >14kB before
overflowing a little later. I massaged the output so it just shows
the stack frame address on the left.

[c00000037d480b40] __kmalloc+0x8c/0x5e0
[c00000037d480bc0] virtqueue_add_outbuf+0x354/0xac0
[c00000037d480cc0] xmit_skb+0x1dc/0x350 [virtio_net]
[c00000037d480d50] start_xmit+0xd4/0x3b0 [virtio_net]
[c00000037d480e00] dev_hard_start_xmit+0x11c/0x280
[c00000037d480e80] sch_direct_xmit+0xec/0x330
[c00000037d480f20] __dev_xmit_skb+0x41c/0xa80
[c00000037d480f90] __dev_queue_xmit+0x414/0x950
[c00000037d481070] ovs_vport_send+0xb4/0x210 [openvswitch]
[c00000037d4810f0] do_output+0x7c/0x200 [openvswitch]
[c00000037d481140] do_execute_actions+0xe48/0xeb0 [openvswitch]
[c00000037d481300] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d481380] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d481450] ovs_vport_receive+0x8c/0x130 [openvswitch]
[c00000037d481660] internal_dev_xmit+0x40/0xd0 [openvswitch]
[c00000037d481690] dev_hard_start_xmit+0x11c/0x280
[c00000037d481710] __dev_queue_xmit+0x634/0x950
[c00000037d4817f0] neigh_hh_output+0xd0/0x180
[c00000037d481840] ip_finish_output2+0x31c/0x5c0
[c00000037d4818e0] ip_local_out+0x64/0x90
[c00000037d481920] iptunnel_xmit+0x194/0x290
[c00000037d4819c0] udp_tunnel_xmit_skb+0x100/0x140 [udp_tunnel]
[c00000037d481a80] geneve_xmit_skb+0x34c/0x610 [geneve]
[c00000037d481bb0] geneve_xmit+0x94/0x1e8 [geneve]
[c00000037d481c30] dev_hard_start_xmit+0x11c/0x280
[c00000037d481cb0] __dev_queue_xmit+0x634/0x950
[c00000037d481d90] ovs_vport_send+0xb4/0x210 [openvswitch]
[c00000037d481e10] do_output+0x7c/0x200 [openvswitch]
[c00000037d481e60] do_execute_actions+0xe48/0xeb0 [openvswitch]
[c00000037d482020] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d4820a0] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d482170] clone_execute+0x2c8/0x370 [openvswitch]
[c00000037d482210] do_execute_actions+0x4b8/0xeb0 [openvswitch]
[c00000037d4823d0] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d482450] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d482520] ovs_vport_receive+0x8c/0x130 [openvswitch]
[c00000037d482730] internal_dev_xmit+0x40/0xd0 [openvswitch]
[c00000037d482760] dev_hard_start_xmit+0x11c/0x280
[c00000037d4827e0] __dev_queue_xmit+0x634/0x950
[c00000037d4828c0] neigh_hh_output+0xd0/0x180
[c00000037d482910] ip_finish_output2+0x31c/0x5c0
[c00000037d4829b0] __ip_queue_xmit+0x1b0/0x4f0
[c00000037d482a40] __tcp_transmit_skb+0x450/0x9a0
[c00000037d482b10] tcp_write_xmit+0x4e0/0xb40
[c00000037d482be0] __tcp_push_pending_frames+0x44/0x130
[c00000037d482c50] __tcp_sock_set_cork.part.0+0x8c/0xb0
[c00000037d482c80] tcp_sock_set_cork+0x78/0xa0
[c00000037d482cb0] xs_tcp_send_request+0x2d4/0x430 [sunrpc]
[c00000037d482e50] xprt_request_transmit.constprop.0+0xa8/0x3c0 [sunrpc]
[c00000037d482eb0] xprt_transmit+0x12c/0x260 [sunrpc]
[c00000037d482f20] call_transmit+0xd0/0x100 [sunrpc]
[c00000037d482f50] __rpc_execute+0xec/0x570 [sunrpc]
[c00000037d482fd0] rpc_execute+0x168/0x1d0 [sunrpc]
[c00000037d483010] rpc_run_task+0x1cc/0x2a0 [sunrpc]
[c00000037d483070] nfs4_call_sync_sequence+0x98/0x100 [nfsv4]
[c00000037d483120] _nfs4_server_capabilities+0xd4/0x3c0 [nfsv4]
[c00000037d483210] nfs4_server_capabilities+0x74/0xd0 [nfsv4]
[c00000037d483270] nfs4_proc_get_root+0x3c/0x150 [nfsv4]
[c00000037d4832f0] nfs_get_root+0xac/0x660 [nfs]
[c00000037d483420] nfs_get_tree_common+0x104/0x5f0 [nfs]
[c00000037d4834b0] nfs_get_tree+0x90/0xc0 [nfs]
[c00000037d4834e0] vfs_get_tree+0x48/0x160
[c00000037d483560] nfs_do_submount+0x170/0x210 [nfs]
[c00000037d483600] nfs4_submount+0x250/0x360 [nfsv4]
[c00000037d4836b0] nfs_d_automount+0x194/0x2d0 [nfs]
[c00000037d483710] __traverse_mounts+0x114/0x330
[c00000037d483770] step_into+0x364/0x4d0
[c00000037d4837f0] walk_component+0x8c/0x300
[c00000037d483870] path_lookupat+0xa8/0x260
[c00000037d4838c0] filename_lookup+0xc8/0x230
[c00000037d483a00] vfs_path_lookup+0x68/0xc0
[c00000037d483a60] mount_subtree+0xd0/0x1e0
[c00000037d483ad0] do_nfs4_mount+0x280/0x520 [nfsv4]
[c00000037d483ba0] nfs4_try_get_tree+0x60/0x140 [nfsv4]
[c00000037d483c20] nfs_get_tree+0x60/0xc0 [nfs]
[c00000037d483c50] vfs_get_tree+0x48/0x160
[c00000037d483cd0] do_new_mount+0x204/0x3c0
[c00000037d483d40] sys_mount+0x168/0x1c0
[c00000037d483db0] system_call_exception+0x164/0x310
[c00000037d483e10] system_call_vectored_common+0xe8/0x278

That's hard to decipher, so here are all the stack frames sorted by
size, with the number of appearances if > 1:

528 ovs_vport_receive (x2)
448 do_execute_actions (x3)
416 xs_tcp_send_request
320 filename_lookup
304 nfs_get_root
304 geneve_xmit_skb
256 virtqueue_add_outbuf
240 _nfs4_server_capabilities
224 __dev_queue_xmit (x4)
208 tcp_write_xmit
208 __tcp_transmit_skb
208 ovs_dp_process_packet (x3)
208 do_nfs4_mount
192 udp_tunnel_xmit_skb
176 start_xmit
176 nfs4_submount
176 nfs4_call_sync_sequence
160 sch_direct_xmit
160 nfs_do_submount
160 iptunnel_xmit
160 ip_finish_output2 (x2)
160 clone_execute
144 xmit_skb
144 nfs_get_tree_common
144 __ip_queue_xmit
128 walk_component
128 vfs_get_tree (x2)
128 step_into
128 __rpc_execute
128 ovs_vport_send (x2)
128 ovs_execute_actions (x3)
128 nfs4_try_get_tree
128 nfs4_proc_get_root
128 __kmalloc
128 geneve_xmit
128 dev_hard_start_xmit (x4)
112 xprt_transmit
112 __tcp_push_pending_frames
112 sys_mount
112 mount_subtree
112 do_new_mount
112 __dev_xmit_skb
96 xprt_request_transmit.constprop.0
96 vfs_path_lookup
96 __traverse_mounts
96 system_call_exception
96 rpc_run_task
96 nfs_d_automount
96 nfs4_server_capabilities
80 path_lookupat
80 neigh_hh_output (x2)
80 do_output (x2)
64 rpc_execute
64 ip_local_out
48 __tcp_sock_set_cork.part.0
48 tcp_sock_set_cork
48 nfs_get_tree (x2)
48 internal_dev_xmit (x2)
48 call_transmit
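
(For reference, each size above is just the difference between consecutive
frame addresses in the trace, e.g. c00000037d481660 - c00000037d481450 =
528 for the first ovs_vport_receive frame; the outermost frame's size
cannot be derived this way. A throwaway userspace helper along the lines
below, fed the raw trace on stdin, reproduces the numbers; sorting and
counting duplicates is left to sort/uniq.)

#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned long long prev_addr = 0;
	char prev_name[128] = "";
	char line[256];

	while (fgets(line, sizeof(line), stdin)) {
		unsigned long long addr;
		char name[128];
		char *plus;

		/* Expect lines like "[c00000037d480b40] __kmalloc+0x8c/0x5e0". */
		if (sscanf(line, " [%llx] %127s", &addr, name) != 2)
			continue;
		plus = strchr(name, '+');	/* strip the +offset/length part */
		if (plus)
			*plus = '\0';
		if (prev_addr)
			/* frame size = caller's frame address - own frame address */
			printf("%llu %s\n", addr - prev_addr, prev_name);
		prev_addr = addr;
		strcpy(prev_name, name);
	}
	return 0;
}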

Thanks,
Nick

Nicholas Piggin (7):
  net: openvswitch: Move NSH buffer out of do_execute_actions
  net: openvswitch: Reduce execute_push_nsh stack overhead
  net: openvswitch: uninline action execution
  net: openvswitch: ovs_vport_receive reduce stack usage
  net: openvswitch: uninline ovs_fragment to control stack usage
  net: openvswitch: Reduce ovs_fragment stack usage
  net: openvswitch: Reduce stack usage in ovs_dp_process_packet

 net/openvswitch/actions.c  | 139 +++++++++++++++++++++++--------------
 net/openvswitch/datapath.c |  55 ++++++++-------
 net/openvswitch/drop.h     |   1 +
 net/openvswitch/vport.c    |  14 +++-
 4 files changed, 129 insertions(+), 80 deletions(-)

Comments

Ilya Maximets Sept. 27, 2023, 8:36 a.m. UTC | #1
On 9/27/23 02:13, Nicholas Piggin wrote:
> Hi,
> 
> We've got a report of a stack overflow on ppc64le with a 16kB kernel
> stack. Openvswitch is just one of many things in the stack, but it
> does cause recursion and contributes to some usage.
> 
> Here are a few patches for reducing stack overhead. I don't know the
> code well so consider them just ideas. GFP_ATOMIC allocations
> introduced in a couple of places might be controversial, but there
> is still some savings to be had if you skip those.
> 
> Here is one place detected where the stack reaches >14kB before
> overflowing a little later. I massaged the output so it just shows
> the stack frame address on the left.

Hi, Nicholas.  Thanks for the patches!

Though it looks like OVS is not really playing a huge role in the
stack trace below.  How much of the stack does the patch set save
in total?  How much do patches 2-7 contribute (I posted a patch similar
to the first one last week, so we may not count it)?

Also, most of the changes introduced here have a real chance to
noticeably impact performance.  Did you run any performance tests
with this to assess the impact?

One last thing is that at least some of the patches seem to change
non-inlined non-recursive functions.  Seems unnecessary.

Best regards, Ilya Maximets.

Nicholas Piggin Sept. 28, 2023, 1:52 a.m. UTC | #2
On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> On 9/27/23 02:13, Nicholas Piggin wrote:
> > Hi,
> > 
> > We've got a report of a stack overflow on ppc64le with a 16kB kernel
> > stack. Openvswitch is just one of many things in the stack, but it
> > does cause recursion and contributes to some usage.
> > 
> > Here are a few patches for reducing stack overhead. I don't know the
> > code well so consider them just ideas. GFP_ATOMIC allocations
> > introduced in a couple of places might be controversial, but there
> > is still some savings to be had if you skip those.
> > 
> > Here is one place detected where the stack reaches >14kB before
> > overflowing a little later. I massaged the output so it just shows
> > the stack frame address on the left.
>
> Hi, Nicholas.  Thanks for the patches!

Hey, sorry your mail didn't come through for me (though it's on the
list)... Anyway thanks for the feedback.

And the important thing I forgot to mention: this was reproduced on a
RHEL9 kernel and that's where the traces are from. Upstream is quite
similar though so the code and call chains and stack use should be
pretty close.

It's a complicated configuration that we're having difficulty testing
with an upstream kernel. People are working to test things on the RHEL
kernel, but I wanted to bring this upstream before we get too far down
that road.

Unfortunately that means I don't have performance or exact stack
use savings yet. But I will let you know if/when I get results.

> Though it looks like OVS is not really playing a huge role in the
> stack trace below.  How much of the stack does the patch set save
> in total?  How much patches 2-7 contribute (I posted a patch similar
> to the first one last week, so we may not count it)?

ovs functions themselves are maybe 30% of stack use, which is significant.
I did find they are the ones with some of the biggest structures in local
variables though, so they are low-hanging fruit. This series should save about
2kB of stack, by eyeball. Should be enough to get us out of trouble for
this scenario, at least.

I don't suggest ovs is the only problem; I'm just trying to trim things
where possible. I have been trying to find other savings too, e.g.,
https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/

Recursion is a difficulty. I think we recursed 3 times in ovs, and it
looks like there's either 1 or 2 more recursions possible before the
limit (depending on how the accounting works, not sure if it stops at
4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
amount of stack, probably more than x86-64 but not by a big factor. So it
could be risky for any arch with a 16kB stack.

I wonder if we should have an arch function that can be called by
significant recursion points such as this, which signals free stack is
low and you should bail out ASAP. I don't think it's reasonable to
expect ovs to know about every arch's stack size and usage. You could
keep your hard limit for consistency, but if that goes wrong the
low free stack indication could save you.
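
A very rough sketch of the kind of helper I mean is below. The name, the
margin, and the generic fallback are all made up for illustration (and
current_stack_pointer / end_of_stack() are just one way a generic version
could be written); an arch could provide something more precise.

#include <linux/sched/task_stack.h>

#define STACK_LOW_WATERMARK	2048	/* made-up safety margin */

/*
 * Illustrative only: report whether the current task is running low on
 * kernel stack, so significant recursion points can bail out early.
 */
static inline bool kstack_is_low(void)
{
	unsigned long sp = current_stack_pointer;
	unsigned long end = (unsigned long)end_of_stack(current);

	/* Assumes the stack grows down, as on ppc64le and x86-64. */
	return sp - end < STACK_LOW_WATERMARK;
}

Significant recursion points like the ovs recursion test could then drop
or defer when this returns true, on top of the existing fixed depth limit.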

>
> Also, most of the changes introduced here has a real chance to
> noticeably impact performance.  Did you run any performance tests
> with this to assess the impact?

I will see if we can do that, but I doubt this setup would be too
sensitive to small changes, so it might be something upstream would have
to help evaluate. Function calls and even small kmalloc/free on the same
CPU shouldn't be too costly I hope, but I don't know how hot these paths
can get.

>
> One last thing is that at least some of the patches seem to change
> non-inlined non-recursive functions.  Seems unnecessary.

I was concentrating on functions in the recursive path, but there
were one or two big ones just off the side that still can be called
when you're deep into the stack. In general it's just a good idea to
be as frugal as reasonably possible with kernel stack, so I wouldn't
say unnecessary, but yes, arguably less important. I defer to your
judgement about the cost and benefit of all these changes though.

Thanks,
Nick
Nicholas Piggin Sept. 29, 2023, 7:06 a.m. UTC | #3
On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> On 9/27/23 02:13, Nicholas Piggin wrote:
> > Hi,
> > 
> > We've got a report of a stack overflow on ppc64le with a 16kB kernel
> > stack. Openvswitch is just one of many things in the stack, but it
> > does cause recursion and contributes to some usage.
> > 
> > Here are a few patches for reducing stack overhead. I don't know the
> > code well so consider them just ideas. GFP_ATOMIC allocations
> > introduced in a couple of places might be controversial, but there
> > is still some savings to be had if you skip those.
> > 
> > Here is one place detected where the stack reaches >14kB before
> > overflowing a little later. I massaged the output so it just shows
> > the stack frame address on the left.
>
> Hi, Nicholas.  Thanks for the patches!
>
> Though it looks like OVS is not really playing a huge role in the
> stack trace below.  How much of the stack does the patch set save
> in total?  How much patches 2-7 contribute (I posted a patch similar
> to the first one last week, so we may not count it)?

Stack usage was tested for the same path (this is backported to the
RHEL9 kernel), and the saving was 2080 bytes for that. It's enough
to get us out of trouble. But if it was a config that caused more
recursions then it might still be a problem.

>
> Also, most of the changes introduced here has a real chance to
> noticeably impact performance.  Did you run any performance tests
> with this to assess the impact?

Some numbers were posted by Aaron, as you would have seen: 2-4% for that
patch, but I suspect the rest should have a much smaller impact.

Maybe patch 2 if you were doing a lot of push_nsh operations, but
that might be less important since it's out of the recursive path.

>
> One last thing is that at least some of the patches seem to change
> non-inlined non-recursive functions.  Seems unnecessary.
>
> Best regards, Ilya Maximets.
>

One thing I do notice in the trace:

> > 
> > [c00000037d480b40] __kmalloc+0x8c/0x5e0
> > [c00000037d480bc0] virtqueue_add_outbuf+0x354/0xac0
> > [c00000037d480cc0] xmit_skb+0x1dc/0x350 [virtio_net]
> > [c00000037d480d50] start_xmit+0xd4/0x3b0 [virtio_net]
> > [c00000037d480e00] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d480e80] sch_direct_xmit+0xec/0x330
> > [c00000037d480f20] __dev_xmit_skb+0x41c/0xa80
> > [c00000037d480f90] __dev_queue_xmit+0x414/0x950
> > [c00000037d481070] ovs_vport_send+0xb4/0x210 [openvswitch]
> > [c00000037d4810f0] do_output+0x7c/0x200 [openvswitch]
> > [c00000037d481140] do_execute_actions+0xe48/0xeb0 [openvswitch]
> > [c00000037d481300] ovs_execute_actions+0x78/0x1f0 [openvswitch]
> > [c00000037d481380] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
> > [c00000037d481450] ovs_vport_receive+0x8c/0x130 [openvswitch]
> > [c00000037d481660] internal_dev_xmit+0x40/0xd0 [openvswitch]
> > [c00000037d481690] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d481710] __dev_queue_xmit+0x634/0x950
> > [c00000037d4817f0] neigh_hh_output+0xd0/0x180
> > [c00000037d481840] ip_finish_output2+0x31c/0x5c0
> > [c00000037d4818e0] ip_local_out+0x64/0x90
> > [c00000037d481920] iptunnel_xmit+0x194/0x290
> > [c00000037d4819c0] udp_tunnel_xmit_skb+0x100/0x140 [udp_tunnel]
> > [c00000037d481a80] geneve_xmit_skb+0x34c/0x610 [geneve]
> > [c00000037d481bb0] geneve_xmit+0x94/0x1e8 [geneve]
> > [c00000037d481c30] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d481cb0] __dev_queue_xmit+0x634/0x950
> > [c00000037d481d90] ovs_vport_send+0xb4/0x210 [openvswitch]
> > [c00000037d481e10] do_output+0x7c/0x200 [openvswitch]
> > [c00000037d481e60] do_execute_actions+0xe48/0xeb0 [openvswitch]
> > [c00000037d482020] ovs_execute_actions+0x78/0x1f0 [openvswitch]
> > [c00000037d4820a0] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
> > [c00000037d482170] clone_execute+0x2c8/0x370 [openvswitch]

                       ^^^^^

clone_execute is an action which can be deferred AFAIKS, but it is
not deferred until several recursions deep.

If we always deferred when possible, then we might avoid such a big
stack (at least for this config). Is it very costly to defer? Would
it help here, or is it just going to process it right away and
cause basically the same call chain?
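
For concreteness, the choice I'm asking about looks roughly like the
sketch below; this is a simplified paraphrase from reading the code, not
the exact clone_execute(), and the helpers marked as placeholders are not
real functions.

/* Simplified paraphrase of the recurse-vs-defer decision. */
static int clone_execute_sketch(struct datapath *dp, struct sk_buff *skb,
				struct sw_flow_key *key,
				const struct nlattr *actions, int len)
{
	struct sw_flow_key *clone;

	/* Take a per-CPU flow-key slot and recurse right away... */
	clone = clone_flow_key_if_room(key);	/* placeholder name */
	if (clone)
		return do_execute_actions(dp, skb, clone, actions, len);

	/*
	 * ...otherwise fall back to the deferred-actions FIFO, which is
	 * drained one level up in ovs_execute_actions(); if the FIFO is
	 * full the packet is dropped.
	 */
	if (!defer_actions(skb, key, actions, len)) {	/* placeholder name */
		kfree_skb(skb);
		return -ENOBUFS;
	}
	return 0;
}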

Thanks,
Nick
Ilya Maximets Oct. 2, 2023, 11:54 a.m. UTC | #4
On 9/28/23 03:52, Nicholas Piggin wrote:
> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>> Hi,
>>>
>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>> stack. Openvswitch is just one of many things in the stack, but it
>>> does cause recursion and contributes to some usage.
>>>
>>> Here are a few patches for reducing stack overhead. I don't know the
>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>> introduced in a couple of places might be controversial, but there
>>> is still some savings to be had if you skip those.
>>>
>>> Here is one place detected where the stack reaches >14kB before
>>> overflowing a little later. I massaged the output so it just shows
>>> the stack frame address on the left.
>>
>> Hi, Nicholas.  Thanks for the patches!
> 
> Hey, sorry your mail didn't come through for me (though it's on the
> list)... Anyway thanks for the feedback.
> 
> And the important thing I forgot to mention: this was reproduced on a
> RHEL9 kernel and that's where the traces are from. Upstream is quite
> similar though so the code and call chains and stack use should be
> pretty close.
> 
> It's a complicated configuration we're having difficulty with testing
> upstream kernel. People are working to test things on the RHEL kernel
> but I wanted to bring this upstream before we get too far down that
> road.
> 
> Unfortunately that means I don't have performance or exact stack
> use savings yet. But I will let you know if/when I get results.
> 
>> Though it looks like OVS is not really playing a huge role in the
>> stack trace below.  How much of the stack does the patch set save
>> in total?  How much patches 2-7 contribute (I posted a patch similar
>> to the first one last week, so we may not count it)?
> 
> ovs functions themselves are maybe 30% of stack use, so significant.  I
> did find they are the ones with some of the biggest structures in local
> variables though, so low hanging fruit. This series should save about
> 2kB of stack, by eyeball. Should be enough to get us out of trouble for
> this scenario, at least.

Unfortunately, the only low hanging fruit in this set is patch #1;
the rest needs a serious performance evaluation.

> 
> I don't suggest ovs is the only problem, I'm just trying to trim things
> where possible. I have been trying to find other savings too, e.g.,
> https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/
> 
> Recursion is a difficulty. I think we recursed 3 times in ovs, and it
> looks like there's either 1 or 2 more recursions possible before the
> limit (depending on how the accounting works, not sure if it stops at
> 4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
> amount of stack, probably more than x86-64, but shouldn't be by a big
> factor. So it could be risky for any arch with 16kB stack.

The stack trace looks like a very standard trace for something like
an ovn-kubernetes setup.  And I haven't seen such issues on x86 or
aarch64 systems.  What architectures besides ppc64le use a 16kB stack?

> 
> I wonder if we should have an arch function that can be called by
> significant recursion points such as this, which signals free stack is
> low and you should bail out ASAP. I don't think it's reasonable to
> expect ovs to know about all arch size and usage of stack. You could
> keep your hard limit for consistency, but if that goes wrong the
> low free stack indication could save you.

Every part of the code will need to react somehow to such a signal,
so I'm not sure what the implementations would look like.

> 
>>
>> Also, most of the changes introduced here has a real chance to
>> noticeably impact performance.  Did you run any performance tests
>> with this to assess the impact?
> 
> Will see if we can do that, but I doubt this setup would be too
> sensitive to small changes so it might be something upstream would have
> to help evaluate. Function calls and even small kmalloc/free on the same
> CPU shouldn't be too costly I hope, but I don't know how hot these paths
> can get.

This code path unfortunately is as hot as it gets, as it needs to be able
to process millions of packets per second.  Even small changes may cause
a significant performance impact.

> 
>>
>> One last thing is that at least some of the patches seem to change
>> non-inlined non-recursive functions.  Seems unnecessary.
> 
> I was concentrating on functions in the recursive path, but there
> were one or two big ones just off the side that still can be called
> when you're deep into stack. In general it's just a good idea to
> be frugal as reasonably possible with kernel stack always so I
> wouldn't say unnecessary, but yes arguably less important. I defer
> to your judgement about cost and benefit of all these changes
> though.
> 
> Thanks,
> Nick
Ilya Maximets Oct. 2, 2023, 11:56 a.m. UTC | #5
On 9/29/23 09:06, Nicholas Piggin wrote:
> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>> Hi,
>>>
>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>> stack. Openvswitch is just one of many things in the stack, but it
>>> does cause recursion and contributes to some usage.
>>>
>>> Here are a few patches for reducing stack overhead. I don't know the
>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>> introduced in a couple of places might be controversial, but there
>>> is still some savings to be had if you skip those.
>>>
>>> Here is one place detected where the stack reaches >14kB before
>>> overflowing a little later. I massaged the output so it just shows
>>> the stack frame address on the left.
>>
>> Hi, Nicholas.  Thanks for the patches!
>>
>> Though it looks like OVS is not really playing a huge role in the
>> stack trace below.  How much of the stack does the patch set save
>> in total?  How much patches 2-7 contribute (I posted a patch similar
>> to the first one last week, so we may not count it)?
> 
> Stack usage was tested for the same path (this is backported to
> RHEL9 kernel), and saving was 2080 bytes for that. It's enough
> to get us out of trouble. But if it was a config that caused more
> recursions then it might still be a problem.

The 2K total value likely means that only patches 1 and 4 actually
contribute much to the savings.  And I agree that running at
85%+ stack utilization seems risky.  It can likely be overflowed
by just a few more recirculations in OVS pipeline or traversing
one more network namespace on a way out.  And it's possible that
some of the traffic will take such a route in your system even if
you didn't see it yet.

>> Also, most of the changes introduced here has a real chance to
>> noticeably impact performance.  Did you run any performance tests
>> with this to assess the impact?
> 
> Some numbers were posted by Aaron as you would see. 2-4% for that
> patch, but I suspect the rest should have much smaller impact.

They also seem to have a very small impact on the stack usage,
so they may not be worth touching at all, since a performance evaluation
will be necessary before they can be accepted.

> 
> Maybe patch 2 if you were doing a lot of push_nsh operations, but
> that might be less important since it's out of the recursive path.

It's also unlikely that you have an NSH pipeline configured in OVS.

> 
>>
>> One last thing is that at least some of the patches seem to change
>> non-inlined non-recursive functions.  Seems unnecessary.
>>
>> Best regards, Ilya Maximets.
>>
> 
> One thing I do notice in the trace:
> 
> 
> clone_execute is an action which can be deferred AFAIKS, but it is
> not deferred until several recursions deep.
> 
> If we deferred always when possible, then might avoid such a big
> stack (at least for this config). Is it very costly to defer? Would
> it help here, or is it just going to process it right away and
> cause basically the same call chain?

It may save at most a couple of stack frames, because deferred actions
will be called just one function up, in ovs_execute_actions(), and
it will not save us from packets exiting the openvswitch module and
re-entering from a different port, which is the case in the provided
trace.

Also, I'd vote against deferring, because then we'll start hitting
the limit of deferred actions much faster causing packet drops, which
is already a problem for some OVN deployments.  And deferring involves
copying a lot of memory, which will hit performance once again.

Best regards, Ilya Maximets.
Aaron Conole Oct. 3, 2023, 1:31 p.m. UTC | #6
Ilya Maximets <i.maximets@ovn.org> writes:

> On 9/29/23 09:06, Nicholas Piggin wrote:
>> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>>> Hi,
>>>>
>>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>>> stack. Openvswitch is just one of many things in the stack, but it
>>>> does cause recursion and contributes to some usage.
>>>>
>>>> Here are a few patches for reducing stack overhead. I don't know the
>>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>>> introduced in a couple of places might be controversial, but there
>>>> is still some savings to be had if you skip those.
>>>>
>>>> Here is one place detected where the stack reaches >14kB before
>>>> overflowing a little later. I massaged the output so it just shows
>>>> the stack frame address on the left.
>>>
>>> Hi, Nicholas.  Thanks for the patches!
>>>
>>> Though it looks like OVS is not really playing a huge role in the
>>> stack trace below.  How much of the stack does the patch set save
>>> in total?  How much patches 2-7 contribute (I posted a patch similar
>>> to the first one last week, so we may not count it)?
>> 
>> Stack usage was tested for the same path (this is backported to
>> RHEL9 kernel), and saving was 2080 bytes for that. It's enough
>> to get us out of trouble. But if it was a config that caused more
>> recursions then it might still be a problem.
>
> The 2K total value likely means that only patches 1 and 4 actually
> contribute much into the savings.  And I agree that running at
> 85%+ stack utilization seems risky.  It can likely be overflowed
> by just a few more recirculations in OVS pipeline or traversing
> one more network namespace on a way out.  And it's possible that
> some of the traffic will take such a route in your system even if
> you didn't see it yet.
>
>>> Also, most of the changes introduced here has a real chance to
>>> noticeably impact performance.  Did you run any performance tests
>>> with this to assess the impact?
>> 
>> Some numbers were posted by Aaron as you would see. 2-4% for that
>> patch, but I suspect the rest should have much smaller impact.
>
> They also seem to have a very small impact on the stack usage,
> so may be not worth touching at all, since performance evaluation
> for them will be necessary before they can be accepted.

Actually, it's also important to keep in mind that vport_receive only
happens once in my performance test.  I expect it gets worse when
running in the scenario (a br-ex, br-int, br-tun setup).

>> 
>> Maybe patch 2 if you were doing a lot of push_nsh operations, but
>> that might be less important since it's out of the recursive path.
>
> It's also unlikely that you have NHS pipeline configured in OVS.
>
>> 
>>>
>>> One last thing is that at least some of the patches seem to change
>>> non-inlined non-recursive functions.  Seems unnecessary.
>>>
>>> Best regards, Ilya Maximets.
>>>
>> 
>> One thing I do notice in the trace:
>> 
>> 
>> clone_execute is an action which can be deferred AFAIKS, but it is
>> not deferred until several recursions deep.
>> 
>> If we deferred always when possible, then might avoid such a big
>> stack (at least for this config). Is it very costly to defer? Would
>> it help here, or is it just going to process it right away and
>> cause basically the same call chain?
>
> It may save at most two stack frames maybe, because deferred actions
> will be called just one function above in ovs_execute_actions(), and
> it will not save us from packets exiting openvswitch module and
> re-entering from a different port, which is a case in the provided
> trace.

It used to always be deferred, but IIUC we were hitting the deferred actions
limit quite a bit, so when the sample and clone actions were unified there
was a choice to recurse to avoid dropping packets.

> Also, I'd vote against deferring, because then we'll start hitting
> the limit of deferred actions much faster causing packet drops, which
> is already a problem for some OVN deployments.  And deferring involves
> copying a lot of memory, which will hit performance once again.
>
> Best regards, Ilya Maximets.
Nicholas Piggin Oct. 4, 2023, 9:56 a.m. UTC | #7
On Mon Oct 2, 2023 at 9:54 PM AEST, Ilya Maximets wrote:
> On 9/28/23 03:52, Nicholas Piggin wrote:
> > On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> >> On 9/27/23 02:13, Nicholas Piggin wrote:
> >>> Hi,
> >>>
> >>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
> >>> stack. Openvswitch is just one of many things in the stack, but it
> >>> does cause recursion and contributes to some usage.
> >>>
> >>> Here are a few patches for reducing stack overhead. I don't know the
> >>> code well so consider them just ideas. GFP_ATOMIC allocations
> >>> introduced in a couple of places might be controversial, but there
> >>> is still some savings to be had if you skip those.
> >>>
> >>> Here is one place detected where the stack reaches >14kB before
> >>> overflowing a little later. I massaged the output so it just shows
> >>> the stack frame address on the left.
> >>
> >> Hi, Nicholas.  Thanks for the patches!
> > 
> > Hey, sorry your mail didn't come through for me (though it's on the
> > list)... Anyway thanks for the feedback.
> > 
> > And the important thing I forgot to mention: this was reproduced on a
> > RHEL9 kernel and that's where the traces are from. Upstream is quite
> > similar though so the code and call chains and stack use should be
> > pretty close.
> > 
> > It's a complicated configuration we're having difficulty with testing
> > upstream kernel. People are working to test things on the RHEL kernel
> > but I wanted to bring this upstream before we get too far down that
> > road.
> > 
> > Unfortunately that means I don't have performance or exact stack
> > use savings yet. But I will let you know if/when I get results.
> > 
> >> Though it looks like OVS is not really playing a huge role in the
> >> stack trace below.  How much of the stack does the patch set save
> >> in total?  How much patches 2-7 contribute (I posted a patch similar
> >> to the first one last week, so we may not count it)?
> > 
> > ovs functions themselves are maybe 30% of stack use, so significant.  I
> > did find they are the ones with some of the biggest structures in local
> > variables though, so low hanging fruit. This series should save about
> > 2kB of stack, by eyeball. Should be enough to get us out of trouble for
> > this scenario, at least.
>
> Unfortunately, the only low handing fruit in this set is patch #1,
> the rest needs a serious performance evaluation.
>
> > 
> > I don't suggest ovs is the only problem, I'm just trying to trim things
> > where possible. I have been trying to find other savings too, e.g.,
> > https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/
> > 
> > Recursion is a difficulty. I think we recursed 3 times in ovs, and it
> > looks like there's either 1 or 2 more recursions possible before the
> > limit (depending on how the accounting works, not sure if it stops at
> > 4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
> > amount of stack, probably more than x86-64, but shouldn't be by a big
> > factor. So it could be risky for any arch with 16kB stack.
>
> The stack trace looks like a very standard trace for something like
> an ovn-kubernetes setup.  And I haven't seen such issues on x86 or
> aarch64 systems.  What architectures beside ppc64le use 16kB stack?

AFAIKS from browsing the default defines for 64-bit builds, all the ones
I looked at do (riscv, s390, loongarch, mips, sparc).

They will all differ in how much stack the compiler uses, in the sizes
of some types that could end up in local variables, and in the details
of kernel entry and how irq stacks are implemented. It would be
interesting to compare typical stack usage across archs; I haven't made
a good study of it.

> > 
> > I wonder if we should have an arch function that can be called by
> > significant recursion points such as this, which signals free stack is
> > low and you should bail out ASAP. I don't think it's reasonable to
> > expect ovs to know about all arch size and usage of stack. You could
> > keep your hard limit for consistency, but if that goes wrong the
> > low free stack indication could save you.
>
> Every part of the code will need to react somehow to such a signal,
> so I'm not sure how the implementations would look like.

Not every part; it can be a few strategic checks. The recursion test
that is already in ovs, for example.

Thanks,
Nick