
[net-next,v6,2/3] udp: implement memory accounting helpers

Message ID 4d2e0fc8f5c3d1309b0fb71bc65a2719a8e82825.1477043395.git.pabeni@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Abeni Oct. 21, 2016, 11:55 a.m. UTC
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.

On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.

On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
  costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
  dropped the skb dst
- avoid orphaning the skb, since skb_steal_sock() already did
  the work for us

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().
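
For illustration, the partial reclaim on dequeue rounds sk_forward_alloc
down to an SK_MEM_QUANTUM boundary; with a non-zero "partial" argument,
up to one quantum stays cached on the socket. A minimal, self-contained
sketch of that arithmetic (the 4096-byte quantum and the byte counts are
assumptions for this example, not values taken from the patch):

	#include <stdio.h>

	#define SK_MEM_QUANTUM 4096	/* PAGE_SIZE on most architectures */

	int main(void)
	{
		/* hypothetical socket state after freeing one skb */
		int fwd_alloc = 5000;	/* bytes forward allocated to the socket */
		int partial = 1;	/* non-zero: keep up to one quantum cached */
		int amt = (fwd_alloc - partial) & ~(SK_MEM_QUANTUM - 1);

		/* returns 4096 bytes to the protocol counter, keeps 904 */
		printf("reclaim %d, keep %d\n", amt, fwd_alloc - amt);
		return 0;
	}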

v5 -> v6:
  - don't orphan the skb on enqueue

v4 -> v5:
  - replace the mem_lock with the receive queue spin lock
  - ensure that the bottom half is always allowed to enqueue at least
    one skb, even if sk_rcvbuf is exceeded

v3 -> v4:
  - reworked memory accounting, simplifying the scheme
  - provide a helper for both memory scheduling and enqueuing

v1 -> v2:
  - use a UDP-specific destructor to perform memory reclaiming
  - remove a couple of helpers, unneeded after the above cleanup
  - do not reclaim memory on dequeue if not under memory
    pressure
  - reworked the forward-allocation accounting scheme to avoid a
    potential integer overflow

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/udp.h |   4 +++
 net/ipv4/udp.c    | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

Comments

Eric Dumazet Oct. 21, 2016, 12:24 p.m. UTC | #1
On Fri, 2016-10-21 at 13:55 +0200, Paolo Abeni wrote:
> Avoid using the generic helpers.
> Use the receive queue spin lock to protect the memory
> accounting operation, both on enqueue and on dequeue.
> 
> On dequeue perform partial memory reclaiming, trying to
> leave a quantum of forward allocated memory.
> 
> On enqueue use a custom helper, to allow some optimizations:
> - use a plain spin_lock() variant instead of the slightly
>   costly spin_lock_irqsave(),
> - avoid dst_force check, since the calling code has already
>   dropped the skb dst
> - avoid orphaning the skb, since skb_steal_sock() already did
>   the work for us

>  
> +static void udp_rmem_release(struct sock *sk, int size, int partial)
> +{
> +	int amt;
> +
> +	atomic_sub(size, &sk->sk_rmem_alloc);
> +
> +	spin_lock_bh(&sk->sk_receive_queue.lock);
> +	sk->sk_forward_alloc += size;
> +	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
> +	sk->sk_forward_alloc -= amt;
> +	spin_unlock_bh(&sk->sk_receive_queue.lock);
> +
> +	if (amt)
> +		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> +}
> +
> +static void udp_rmem_free(struct sk_buff *skb)
> +{
> +	udp_rmem_release(skb->sk, skb->truesize, 1);
> +}
> +


It looks like you are acquiring/releasing sk_receive_queue.lock twice
per packet in recvmsg() (the second time in the destructor above).

We could do slightly better if we did not set skb->destructor at all,
and managed the sk_rmem_alloc/sk_forward_alloc changes at the time we
dequeue the skb (if !MSG_PEEK), before the copy to user space.
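
A rough sketch of that idea, for illustration only (udp_skb_dequeue() is
a hypothetical helper name, not something proposed in this series): the
unlink and the memory accounting share a single locked section, so no
destructor needs to take the lock again later.

	static struct sk_buff *udp_skb_dequeue(struct sock *sk)
	{
		struct sk_buff_head *list = &sk->sk_receive_queue;
		struct sk_buff *skb;
		int amt = 0;

		spin_lock_bh(&list->lock);
		skb = __skb_dequeue(list);
		if (skb) {
			/* no skb->destructor was set: account here instead */
			atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
			sk->sk_forward_alloc += skb->truesize;
			amt = (sk->sk_forward_alloc - 1) &
			      ~(SK_MEM_QUANTUM - 1);
			sk->sk_forward_alloc -= amt;
		}
		spin_unlock_bh(&list->lock);

		if (amt)
			__sk_mem_reduce_allocated(sk,
						  amt >> SK_MEM_QUANTUM_SHIFT);
		return skb;
	}

As the reply below notes, the atomic update of sk_rmem_alloc cannot be
dropped (the error queue path also touches that counter), so this moves
an atomic operation inside the critical section.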





Paolo Abeni Oct. 21, 2016, 1:34 p.m. UTC | #2
On Fri, 2016-10-21 at 05:24 -0700, Eric Dumazet wrote:
> On Fri, 2016-10-21 at 13:55 +0200, Paolo Abeni wrote:
> > Avoid using the generic helpers.
> > Use the receive queue spin lock to protect the memory
> > accounting operation, both on enqueue and on dequeue.
> > 
> > On dequeue perform partial memory reclaiming, trying to
> > leave a quantum of forward allocated memory.
> > 
> > On enqueue use a custom helper, to allow some optimizations:
> > - use a plain spin_lock() variant instead of the slightly
> >   costly spin_lock_irqsave(),
> > - avoid dst_force check, since the calling code has already
> >   dropped the skb dst
> > - avoid orphaning the skb, since skb_steal_sock() already did
> >   the work for us
> 
> >  
> > +static void udp_rmem_release(struct sock *sk, int size, int partial)
> > +{
> > +	int amt;
> > +
> > +	atomic_sub(size, &sk->sk_rmem_alloc);
> > +
> > +	spin_lock_bh(&sk->sk_receive_queue.lock);
> > +	sk->sk_forward_alloc += size;
> > +	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
> > +	sk->sk_forward_alloc -= amt;
> > +	spin_unlock_bh(&sk->sk_receive_queue.lock);
> > +
> > +	if (amt)
> > +		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > +}
> > +
> > +static void udp_rmem_free(struct sk_buff *skb)
> > +{
> > +	udp_rmem_release(skb->sk, skb->truesize, 1);
> > +}
> > +
> 
> 
> It looks like you are acquiring/releasing sk_receive_queue.lock twice
> per packet in recvmsg() (the second time in the destructor above).
> 
> We could do slightly better if we did not set skb->destructor at all,
> and managed the sk_rmem_alloc/sk_forward_alloc changes at the time we
> dequeue the skb (if !MSG_PEEK), before the copy to user space.

Thank you again for reviewing this!

Updating sk_rmem_alloc would still need an atomic operation, because
it is also touched by the error queue path: we would end up adding an
atomic operation (or two, when reclaiming the forward-allocated memory)
inside the critical section. The contention will likely increase.

The above is going to be quite intrusive: we need to pass an
additional argument all the way up to __skb_try_recv_datagram() and
change the function behavior according to its value.

If you are otherwise satisfied with the series in its current shape,
can't we instead build incrementally on top of it? That will simplify
both the review and the refactoring work.

Thank you,

Paolo


Eric Dumazet Oct. 21, 2016, 1:41 p.m. UTC | #3
On Fri, 2016-10-21 at 15:34 +0200, Paolo Abeni wrote:

> Updating sk_rmem_alloc would still need an atomic operation, because
> it is also touched by the error queue path: we would end up adding an
> atomic operation (or two, when reclaiming the forward-allocated memory)
> inside the critical section. The contention will likely increase.
> 
> The above is going to be quite intrusive: we need to pass an
> additional argument all the way up to __skb_try_recv_datagram() and
> change the function behavior according to its value.
> 
> If you are otherwise satisfied with the series in its current shape,
> can't we instead build incrementally on top of it? That will simplify
> both the review and the refactoring work.

I certainly can provide an incremental patch; it might be easier for
both of us ;)



Eric Dumazet Oct. 21, 2016, 9:13 p.m. UTC | #4
On Fri, 2016-10-21 at 13:55 +0200, Paolo Abeni wrote:
> Avoid using the generic helpers.
> Use the receive queue spin lock to protect the memory
> accounting operation, both on enqueue and on dequeue.
> 
> On dequeue perform partial memory reclaiming, trying to
> leave a quantum of forward allocated memory.
> 
> On enqueue use a custom helper, to allow some optimizations:
> - use a plain spin_lock() variant instead of the slightly
>   costly spin_lock_irqsave(),
> - avoid dst_force check, since the calling code has already
>   dropped the skb dst
> - avoid orphaning the skb, since skb_steal_sock() already did
>   the work for us
> 
> The above needs custom memory reclaiming on shutdown, provided
> by the udp_destruct_sock().

Acked-by: Eric Dumazet <edumazet@google.com>



Patch

diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..18f1e6b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,6 +246,9 @@  static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
 }
 
 /* net/ipv4/udp.c */
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
+
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
@@ -258,6 +261,7 @@  int udp_get_port(struct sock *sk, unsigned short snum,
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
+int udp_init_sock(struct sock *sk);
 int udp_disconnect(struct sock *sk, int flags);
 unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7d96dc2..e137e1d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1172,6 +1172,112 @@  int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	return ret;
 }
 
+static void udp_rmem_release(struct sock *sk, int size, int partial)
+{
+	int amt;
+
+	atomic_sub(size, &sk->sk_rmem_alloc);
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	sk->sk_forward_alloc += size;
+	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
+	sk->sk_forward_alloc -= amt;
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+	if (amt)
+		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+}
+
+static void udp_rmem_free(struct sk_buff *skb)
+{
+	udp_rmem_release(skb->sk, skb->truesize, 1);
+}
+
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct sk_buff_head *list = &sk->sk_receive_queue;
+	int rmem, delta, amt, err = -ENOMEM;
+	int size = skb->truesize;
+
+	/* try to avoid the costly atomic add/sub pair when the receive
+	 * queue is full; always allow at least a packet
+	 */
+	rmem = atomic_read(&sk->sk_rmem_alloc);
+	if (rmem && (rmem + size > sk->sk_rcvbuf))
+		goto drop;
+
+	/* we drop only if the receive buf is full and the receive
+	 * queue contains some other skb
+	 */
+	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
+	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+		goto uncharge_drop;
+
+	spin_lock(&list->lock);
+	if (size >= sk->sk_forward_alloc) {
+		amt = sk_mem_pages(size);
+		delta = amt << SK_MEM_QUANTUM_SHIFT;
+		if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
+			err = -ENOBUFS;
+			spin_unlock(&list->lock);
+			goto uncharge_drop;
+		}
+
+		sk->sk_forward_alloc += delta;
+	}
+
+	sk->sk_forward_alloc -= size;
+
+	/* the skb owner is now the udp socket */
+	skb->sk = sk;
+	skb->destructor = udp_rmem_free;
+	skb->dev = NULL;
+	sock_skb_set_dropcount(sk, skb);
+
+	__skb_queue_tail(list, skb);
+	spin_unlock(&list->lock);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_data_ready(sk);
+
+	return 0;
+
+uncharge_drop:
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+
+drop:
+	atomic_inc(&sk->sk_drops);
+	return err;
+}
+EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
+
+static void udp_destruct_sock(struct sock *sk)
+{
+	/* reclaim completely the forward allocated memory */
+	__skb_queue_purge(&sk->sk_receive_queue);
+	udp_rmem_release(sk, 0, 0);
+	inet_sock_destruct(sk);
+}
+
+int udp_init_sock(struct sock *sk)
+{
+	sk->sk_destruct = udp_destruct_sock;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(udp_init_sock);
+
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
+{
+	if (unlikely(READ_ONCE(sk->sk_peek_off) >= 0)) {
+		bool slow = lock_sock_fast(sk);
+
+		sk_peek_offset_bwd(sk, len);
+		unlock_sock_fast(sk, slow);
+	}
+	consume_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_consume_udp);
+
 /**
  *	first_packet_length	- return length of first packet in receive queue
  *	@sk: socket