[v2] vsock/virtio: Remove queued_replies pushback logic

Message ID 20250401201349.23867-1-graf@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 0 this patch: 1
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 4 maintainers not CCed: xuanzhuo@linux.alibaba.com eperezma@redhat.com horms@kernel.org jasowang@redhat.com
netdev/build_clang fail Errors and warnings before: 0 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn fail Errors and warnings before: 0 this patch: 1
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 148 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexander Graf April 1, 2025, 8:13 p.m. UTC
Ever since the introduction of the virtio vsock driver, it has included
pushback logic that blocks it from taking any new RX packets until the
TX queue backlog becomes shallower than the virtqueue size.

This logic works fine when you connect a user space application on the
hypervisor with a virtio-vsock target, because the guest will stop
receiving data until the host has pulled all outstanding data from the
VM.

With Nitro Enclaves however, we connect 2 VMs directly via vsock:

  Parent      Enclave

    RX -------- TX
    TX -------- RX

This means we now have 2 virtio-vsock backends that both have the pushback
logic. If the parent's TX queue runs full at the same time as the
Enclave's, both virtio-vsock drivers fall into the pushback path and
no longer accept RX traffic. However, that RX traffic is TX traffic on
the other side which blocks that driver from making any forward
progress. We're now in a deadlock.

To resolve this, let's remove that pushback logic altogether and rely on
higher levels (like credits) to ensure we do not consume unbounded
memory.

RX and TX queues share the same work queue. To prevent starvation of TX
by an RX flood and vice versa now that the pushback logic is gone, let's
deliberately reschedule RX and TX work after processing a fixed
threshold (256) of packets.

Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
Signed-off-by: Alexander Graf <graf@amazon.com>
---
 net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
 1 file changed, 19 insertions(+), 51 deletions(-)
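
At its core, the patch replaces the pushback with a bounded-batch
requeue. As a rough illustration, here is a self-contained userspace
model of that pattern; the queue and packet plumbing are toy stand-ins
rather than the kernel workqueue API, and only the counting logic
mirrors the patch:

#include <stdbool.h>
#include <stdio.h>

#define VSOCK_MAX_PKTS_PER_WORK 256

static int pending = 1000;	/* packets waiting in the send queue */
static int requeues;		/* times the work re-enqueued itself */

/* models virtio_vsock_skb_dequeue() on a toy counter */
static bool dequeue_pkt(void)
{
	if (pending == 0)
		return false;
	pending--;
	return true;
}

/* one work invocation; returns true if it re-enqueued itself */
static bool send_pkt_work(void)
{
	int pkts = 0;

	for (;;) {
		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
			requeues++;	/* models queue_work(wq, work) */
			return true;	/* packets remain, work requeued */
		}
		if (!dequeue_pkt())
			return false;	/* queue drained */
		/* the real code adds the skb to the tx virtqueue here */
	}
}

int main(void)
{
	while (send_pkt_work())
		;
	printf("requeues: %d, pending: %d\n", requeues, pending);
	return 0;
}

With 1000 pending packets and a threshold of 256, the work yields back
to the (modeled) workqueue three times before the queue drains, which is
how TX work avoids monopolizing the shared workqueue once the pushback
is gone.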

Comments

Simon Horman April 2, 2025, 9:26 a.m. UTC | #1
On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
> Ever since the introduction of the virtio vsock driver, it included
> pushback logic that blocks it from taking any new RX packets until the
> TX queue backlog becomes shallower than the virtqueue size.
> 
> This logic works fine when you connect a user space application on the
> hypervisor with a virtio-vsock target, because the guest will stop
> receiving data until the host pulled all outstanding data from the VM.
> 
> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> 
>   Parent      Enclave
> 
>     RX -------- TX
>     TX -------- RX
> 
> This means we now have 2 virtio-vsock backends that both have the pushback
> logic. If the parent's TX queue runs full at the same time as the
> Enclave's, both virtio-vsock drivers fall into the pushback path and
> no longer accept RX traffic. However, that RX traffic is TX traffic on
> the other side which blocks that driver from making any forward
> progress. We're now in a deadlock.
> 
> To resolve this, let's remove that pushback logic altogether and rely on
> higher levels (like credits) to ensure we do not consume unbounded
> memory.
> 
> RX and TX queues share the same work queue. To prevent starvation of TX
> by an RX flood and vice versa now that the pushback logic is gone, let's
> deliberately reschedule RX and TX work after a fixed threshold (256) of
> packets to process.
> 
> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
> Signed-off-by: Alexander Graf <graf@amazon.com>
> ---
>  net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
>  1 file changed, 19 insertions(+), 51 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c

...

> @@ -158,7 +162,7 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		container_of(work, struct virtio_vsock, send_pkt_work);
>  	struct virtqueue *vq;
>  	bool added = false;
> -	bool restart_rx = false;
> +	int pkts = 0;
>  
>  	mutex_lock(&vsock->tx_lock);
>  
> @@ -172,6 +176,12 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		bool reply;
>  		int ret;
>  
> +		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
> +			/* Allow other works on the same queue to run */
> +			queue_work(virtio_vsock_workqueue, work);
> +			break;
> +		}
> +
>  		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>  		if (!skb)
>  			break;

Hi Alexander,

The next non-blank line of code looks like this:

		reply = virtio_vsock_skb_reply(skb);

But with this patch reply is assigned but otherwise unused.
So perhaps the line above, and the declaration of reply, can be removed?

Flagged by W=1 builds.

> @@ -184,17 +194,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  			break;
>  		}
>  
> -		if (reply) {
> -			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> -			int val;
> -
> -			val = atomic_dec_return(&vsock->queued_replies);
> -
> -			/* Do we now have resources to resume rx processing? */
> -			if (val + 1 == virtqueue_get_vring_size(rx_vq))
> -				restart_rx = true;
> -		}
> -
>  		added = true;
>  	}
>  
> @@ -203,9 +202,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  
>  out:
>  	mutex_unlock(&vsock->tx_lock);
> -
> -	if (restart_rx)
> -		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
>  /* Caller need to hold RCU for vsock.

...
Stefano Garzarella April 2, 2025, 1:26 p.m. UTC | #2
On Wed, Apr 02, 2025 at 10:26:05AM +0100, Simon Horman wrote:
>On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
>> Ever since the introduction of the virtio vsock driver, it included
>> pushback logic that blocks it from taking any new RX packets until the
>> TX queue backlog becomes shallower than the virtqueue size.
>>
>> This logic works fine when you connect a user space application on the
>> hypervisor with a virtio-vsock target, because the guest will stop
>> receiving data until the host pulled all outstanding data from the VM.
>>
>> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>>
>>   Parent      Enclave
>>
>>     RX -------- TX
>>     TX -------- RX
>>
>> This means we now have 2 virtio-vsock backends that both have the pushback
>> logic. If the parent's TX queue runs full at the same time as the
>> Enclave's, both virtio-vsock drivers fall into the pushback path and
>> no longer accept RX traffic. However, that RX traffic is TX traffic on
>> the other side which blocks that driver from making any forward
>> progress. We're now in a deadlock.
>>
>> To resolve this, let's remove that pushback logic altogether and rely on
>> higher levels (like credits) to ensure we do not consume unbounded
>> memory.
>>
>> RX and TX queues share the same work queue. To prevent starvation of TX
>> by an RX flood and vice versa now that the pushback logic is gone, let's
>> deliberately reschedule RX and TX work after a fixed threshold (256) of
>> packets to process.
>>
>> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>> ---
>>  net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
>>  1 file changed, 19 insertions(+), 51 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>
>...
>
>> @@ -158,7 +162,7 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  		container_of(work, struct virtio_vsock, send_pkt_work);
>>  	struct virtqueue *vq;
>>  	bool added = false;
>> -	bool restart_rx = false;
>> +	int pkts = 0;
>>
>>  	mutex_lock(&vsock->tx_lock);
>>
>> @@ -172,6 +176,12 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  		bool reply;
>>  		int ret;
>>
>> +		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
>> +			/* Allow other works on the same queue to run */
>> +			queue_work(virtio_vsock_workqueue, work);
>> +			break;
>> +		}
>> +
>>  		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>>  		if (!skb)
>>  			break;
>
>Hi Alexander,
>
>The next non-blank line of code looks like this:
>
>		reply = virtio_vsock_skb_reply(skb);
>
>But with this patch reply is assigned but otherwise unused.

Thanks for the report!

>So perhaps the line above, and the declaration of reply, can be removed?

@Alex: yes, please remove it.

Apart from that, the rest LGTM!

I've been running some tests for a while and everything seems okay.

I guess we can do something similar also in vhost-vsock, where we 
already have "vhost weight" support. IIUC it was added later by commit 
e79b431fb901 ("vhost: vsock: add weight support"), but we never removed 
the "queued_replies" stuff, which IMO is pretty much useless after that 
commit.

I'm not asking you to do that in this series; if you don't have time I 
can do it separately ;-)

Thanks,
Stefano

>
>Flagged by W=1 builds.
>
>> @@ -184,17 +194,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  			break;
>>  		}
>>
>> -		if (reply) {
>> -			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
>> -			int val;
>> -
>> -			val = atomic_dec_return(&vsock->queued_replies);
>> -
>> -			/* Do we now have resources to resume rx processing? */
>> -			if (val + 1 == virtqueue_get_vring_size(rx_vq))
>> -				restart_rx = true;
>> -		}
>> -
>>  		added = true;
>>  	}
>>
>> @@ -203,9 +202,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>
>>  out:
>>  	mutex_unlock(&vsock->tx_lock);
>> -
>> -	if (restart_rx)
>> -		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>  }
>>
>>  /* Caller need to hold RCU for vsock.
>
>...
>
Stefan Hajnoczi April 2, 2025, 4:14 p.m. UTC | #3
On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
> Ever since the introduction of the virtio vsock driver, it included
> pushback logic that blocks it from taking any new RX packets until the
> TX queue backlog becomes shallower than the virtqueue size.
> 
> This logic works fine when you connect a user space application on the
> hypervisor with a virtio-vsock target, because the guest will stop
> receiving data until the host pulled all outstanding data from the VM.
> 
> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> 
>   Parent      Enclave
> 
>     RX -------- TX
>     TX -------- RX
> 
> This means we now have 2 virtio-vsock backends that both have the pushback
> logic. If the parent's TX queue runs full at the same time as the
> Enclave's, both virtio-vsock drivers fall into the pushback path and
> no longer accept RX traffic. However, that RX traffic is TX traffic on
> the other side which blocks that driver from making any forward
> progress. We're now in a deadlock.
> 
> To resolve this, let's remove that pushback logic altogether and rely on
> higher levels (like credits) to ensure we do not consume unbounded
> memory.

The reason for queued_replies is that rx packet processing may emit tx
packets. Therefore tx virtqueue space is required in order to process
the rx virtqueue.

queued_replies puts a bound on the amount of tx packets that can be
queued in memory so the other side cannot consume unlimited memory. Once
that bound has been reached, rx processing stops until the other side
frees up tx virtqueue space.

It's been a while since I looked at this problem, so I don't have a
solution ready. In fact, last time I thought about it I wondered if the
design of virtio-vsock fundamentally suffers from deadlocks.

I don't think removing queued_replies is possible without a replacement
for the bounded memory and virtqueue exhaustion issue though. Credits
are not a solution - they are about socket buffer space, not about
virtqueue space, which includes control packets that are not accounted
by socket buffer space.
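
To make that bound concrete, here is a minimal userspace model of the
accounting this patch removes; the names mirror the driver, but this is
a sketch rather than kernel code, and RX_VRING_SIZE is a stand-in
constant:

#include <stdbool.h>

#define RX_VRING_SIZE 128	/* stand-in for virtqueue_get_vring_size() */

static int queued_replies;	/* replies in memory, not yet on the wire */

/* mirrors virtio_transport_more_replies(): may rx keep running? */
static bool more_replies_allowed(void)
{
	return queued_replies < RX_VRING_SIZE;
}

/* rx side: a packet demanding a reply (e.g. OP_REQUEST) accounts it */
static bool rx_process_one(bool needs_reply)
{
	if (!more_replies_allowed())
		return false;	/* stop rx until tx drains replies */
	if (needs_reply)
		queued_replies++;
	return true;
}

/* tx side, called once per reply actually sent: frees budget and
 * reports whether rx processing may resume (mirrors restart_rx) */
static bool tx_sent_one_reply(void)
{
	return --queued_replies + 1 == RX_VRING_SIZE;
}

Nothing in the credit mechanism touches queued_replies, which is why
credits alone cannot replace this bound.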

> 
> RX and TX queues share the same work queue. To prevent starvation of TX
> by an RX flood and vice versa now that the pushback logic is gone, let's
> deliberately reschedule RX and TX work after a fixed threshold (256) of
> packets to process.
> 
> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
> Signed-off-by: Alexander Graf <graf@amazon.com>
> ---
>  net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
>  1 file changed, 19 insertions(+), 51 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index f0e48e6911fc..54030c729767 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -26,6 +26,12 @@ static struct virtio_vsock __rcu *the_virtio_vsock;
>  static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
>  static struct virtio_transport virtio_transport; /* forward declaration */
>  
> +/*
> + * Max number of RX packets transferred before requeueing so we do
> + * not starve TX traffic because they share the same work queue.
> + */
> +#define VSOCK_MAX_PKTS_PER_WORK 256
> +
>  struct virtio_vsock {
>  	struct virtio_device *vdev;
>  	struct virtqueue *vqs[VSOCK_VQ_MAX];
> @@ -44,8 +50,6 @@ struct virtio_vsock {
>  	struct work_struct send_pkt_work;
>  	struct sk_buff_head send_pkt_queue;
>  
> -	atomic_t queued_replies;
> -
>  	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
>  	 * must be accessed with rx_lock held.
>  	 */
> @@ -158,7 +162,7 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		container_of(work, struct virtio_vsock, send_pkt_work);
>  	struct virtqueue *vq;
>  	bool added = false;
> -	bool restart_rx = false;
> +	int pkts = 0;
>  
>  	mutex_lock(&vsock->tx_lock);
>  
> @@ -172,6 +176,12 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		bool reply;
>  		int ret;
>  
> +		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
> +			/* Allow other works on the same queue to run */
> +			queue_work(virtio_vsock_workqueue, work);
> +			break;
> +		}
> +
>  		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>  		if (!skb)
>  			break;
> @@ -184,17 +194,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  			break;
>  		}
>  
> -		if (reply) {
> -			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> -			int val;
> -
> -			val = atomic_dec_return(&vsock->queued_replies);
> -
> -			/* Do we now have resources to resume rx processing? */
> -			if (val + 1 == virtqueue_get_vring_size(rx_vq))
> -				restart_rx = true;
> -		}
> -
>  		added = true;
>  	}
>  
> @@ -203,9 +202,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  
>  out:
>  	mutex_unlock(&vsock->tx_lock);
> -
> -	if (restart_rx)
> -		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
>  /* Caller need to hold RCU for vsock.
> @@ -261,9 +257,6 @@ virtio_transport_send_pkt(struct sk_buff *skb)
>  	 */
>  	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
>  	    virtio_transport_send_skb_fast_path(vsock, skb)) {
> -		if (virtio_vsock_skb_reply(skb))
> -			atomic_inc(&vsock->queued_replies);
> -
>  		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>  	}
> @@ -277,7 +270,7 @@ static int
>  virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>  {
>  	struct virtio_vsock *vsock;
> -	int cnt = 0, ret;
> +	int ret;
>  
>  	rcu_read_lock();
>  	vsock = rcu_dereference(the_virtio_vsock);
> @@ -286,17 +279,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>  		goto out_rcu;
>  	}
>  
> -	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
> -
> -	if (cnt) {
> -		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> -		int new_cnt;
> -
> -		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
> -		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
> -		    new_cnt < virtqueue_get_vring_size(rx_vq))
> -			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> -	}
> +	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>  
>  	ret = 0;
>  
> @@ -367,18 +350,6 @@ static void virtio_transport_tx_work(struct work_struct *work)
>  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>  }
>  
> -/* Is there space left for replies to rx packets? */
> -static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
> -{
> -	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
> -	int val;
> -
> -	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
> -	val = atomic_read(&vsock->queued_replies);
> -
> -	return val < virtqueue_get_vring_size(vq);
> -}
> -
>  /* event_lock must be held */
>  static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
>  				       struct virtio_vsock_event *event)
> @@ -613,6 +584,7 @@ static void virtio_transport_rx_work(struct work_struct *work)
>  	struct virtio_vsock *vsock =
>  		container_of(work, struct virtio_vsock, rx_work);
>  	struct virtqueue *vq;
> +	int pkts = 0;
>  
>  	vq = vsock->vqs[VSOCK_VQ_RX];
>  
> @@ -627,11 +599,9 @@ static void virtio_transport_rx_work(struct work_struct *work)
>  			struct sk_buff *skb;
>  			unsigned int len;
>  
> -			if (!virtio_transport_more_replies(vsock)) {
> -				/* Stop rx until the device processes already
> -				 * pending replies.  Leave rx virtqueue
> -				 * callbacks disabled.
> -				 */
> +			if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
> +				/* Allow other works on the same queue to run */
> +				queue_work(virtio_vsock_workqueue, work);
>  				goto out;
>  			}
>  
> @@ -675,8 +645,6 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
>  	vsock->rx_buf_max_nr = 0;
>  	mutex_unlock(&vsock->rx_lock);
>  
> -	atomic_set(&vsock->queued_replies, 0);
> -
>  	ret = virtio_find_vqs(vdev, VSOCK_VQ_MAX, vsock->vqs, vqs_info, NULL);
>  	if (ret < 0)
>  		return ret;
> -- 
> 2.47.1
>
Stefano Garzarella April 3, 2025, 8:24 a.m. UTC | #4
On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
>On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
>> Ever since the introduction of the virtio vsock driver, it included
>> pushback logic that blocks it from taking any new RX packets until the
>> TX queue backlog becomes shallower than the virtqueue size.
>>
>> This logic works fine when you connect a user space application on the
>> hypervisor with a virtio-vsock target, because the guest will stop
>> receiving data until the host pulled all outstanding data from the VM.
>>
>> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>>
>>   Parent      Enclave
>>
>>     RX -------- TX
>>     TX -------- RX
>>
>> This means we now have 2 virtio-vsock backends that both have the pushback
>> logic. If the parent's TX queue runs full at the same time as the
>> Enclave's, both virtio-vsock drivers fall into the pushback path and
>> no longer accept RX traffic. However, that RX traffic is TX traffic on
>> the other side which blocks that driver from making any forward
>> progress. We're now in a deadlock.
>>
>> To resolve this, let's remove that pushback logic altogether and rely on
>> higher levels (like credits) to ensure we do not consume unbounded
>> memory.
>
>The reason for queued_replies is that rx packet processing may emit tx
>packets. Therefore tx virtqueue space is required in order to process
>the rx virtqueue.
>
>queued_replies puts a bound on the amount of tx packets that can be
>queued in memory so the other side cannot consume unlimited memory. Once
>that bound has been reached, rx processing stops until the other side
>frees up tx virtqueue space.
>
>It's been a while since I looked at this problem, so I don't have a
>solution ready. In fact, last time I thought about it I wondered if the
>design of virtio-vsock fundamentally suffers from deadlocks.
>
>I don't think removing queued_replies is possible without a replacement
>for the bounded memory and virtqueue exhaustion issue though. Credits
>are not a solution - they are about socket buffer space, not about
>virtqueue space, which includes control packets that are not accounted
>by socket buffer space.

This is a very good point that I missed. I need to add a comment in the 
code to explain it, because it wasn't clear to me! Thank you very much, 
Stefan!

So, IIUC, with this patch a host or a sibling VM (e.g. enclave, 
parent) can flood the VM with requests like VIRTIO_VSOCK_OP_REQUEST 
(even, for example, to a random port that is not open) that require a 
response. If the peer sending the requests over the RX virtqueue does 
not consume the TX virtqueue, it can easily consume all the memory on 
the other peer, which initially fills up the TX virtqueue but, once 
that becomes full, starts using the internal queue without bound.

I agree: if we want to get rid of queued_replies, we should find some 
other way to avoid this. So far I can't think of anything other than 
stopping consumption of the virtqueue and waiting for the other peer to 
consume the other one.

Any other ideas?

Thanks,
Stefano


>
>>
>> RX and TX queues share the same work queue. To prevent starvation of TX
>> by an RX flood and vice versa now that the pushback logic is gone, let's
>> deliberately reschedule RX and TX work after a fixed threshold (256) of
>> packets to process.
>>
>> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>> ---
>>  net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
>>  1 file changed, 19 insertions(+), 51 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index f0e48e6911fc..54030c729767 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -26,6 +26,12 @@ static struct virtio_vsock __rcu *the_virtio_vsock;
>>  static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
>>  static struct virtio_transport virtio_transport; /* forward declaration */
>>
>> +/*
>> + * Max number of RX packets transferred before requeueing so we do
>> + * not starve TX traffic because they share the same work queue.
>> + */
>> +#define VSOCK_MAX_PKTS_PER_WORK 256
>> +
>>  struct virtio_vsock {
>>  	struct virtio_device *vdev;
>>  	struct virtqueue *vqs[VSOCK_VQ_MAX];
>> @@ -44,8 +50,6 @@ struct virtio_vsock {
>>  	struct work_struct send_pkt_work;
>>  	struct sk_buff_head send_pkt_queue;
>>
>> -	atomic_t queued_replies;
>> -
>>  	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
>>  	 * must be accessed with rx_lock held.
>>  	 */
>> @@ -158,7 +162,7 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  		container_of(work, struct virtio_vsock, send_pkt_work);
>>  	struct virtqueue *vq;
>>  	bool added = false;
>> -	bool restart_rx = false;
>> +	int pkts = 0;
>>
>>  	mutex_lock(&vsock->tx_lock);
>>
>> @@ -172,6 +176,12 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  		bool reply;
>>  		int ret;
>>
>> +		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
>> +			/* Allow other works on the same queue to run */
>> +			queue_work(virtio_vsock_workqueue, work);
>> +			break;
>> +		}
>> +
>>  		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>>  		if (!skb)
>>  			break;
>> @@ -184,17 +194,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>  			break;
>>  		}
>>
>> -		if (reply) {
>> -			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
>> -			int val;
>> -
>> -			val = atomic_dec_return(&vsock->queued_replies);
>> -
>> -			/* Do we now have resources to resume rx processing? */
>> -			if (val + 1 == virtqueue_get_vring_size(rx_vq))
>> -				restart_rx = true;
>> -		}
>> -
>>  		added = true;
>>  	}
>>
>> @@ -203,9 +202,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>
>>  out:
>>  	mutex_unlock(&vsock->tx_lock);
>> -
>> -	if (restart_rx)
>> -		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>  }
>>
>>  /* Caller need to hold RCU for vsock.
>> @@ -261,9 +257,6 @@ virtio_transport_send_pkt(struct sk_buff *skb)
>>  	 */
>>  	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
>>  	    virtio_transport_send_skb_fast_path(vsock, skb)) {
>> -		if (virtio_vsock_skb_reply(skb))
>> -			atomic_inc(&vsock->queued_replies);
>> -
>>  		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>>  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>>  	}
>> @@ -277,7 +270,7 @@ static int
>>  virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>>  {
>>  	struct virtio_vsock *vsock;
>> -	int cnt = 0, ret;
>> +	int ret;
>>
>>  	rcu_read_lock();
>>  	vsock = rcu_dereference(the_virtio_vsock);
>> @@ -286,17 +279,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>>  		goto out_rcu;
>>  	}
>>
>> -	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>> -
>> -	if (cnt) {
>> -		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
>> -		int new_cnt;
>> -
>> -		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
>> -		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
>> -		    new_cnt < virtqueue_get_vring_size(rx_vq))
>> -			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>> -	}
>> +	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>>
>>  	ret = 0;
>>
>> @@ -367,18 +350,6 @@ static void virtio_transport_tx_work(struct work_struct *work)
>>  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>>  }
>>
>> -/* Is there space left for replies to rx packets? */
>> -static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
>> -{
>> -	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
>> -	int val;
>> -
>> -	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
>> -	val = atomic_read(&vsock->queued_replies);
>> -
>> -	return val < virtqueue_get_vring_size(vq);
>> -}
>> -
>>  /* event_lock must be held */
>>  static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
>>  				       struct virtio_vsock_event *event)
>> @@ -613,6 +584,7 @@ static void virtio_transport_rx_work(struct work_struct *work)
>>  	struct virtio_vsock *vsock =
>>  		container_of(work, struct virtio_vsock, rx_work);
>>  	struct virtqueue *vq;
>> +	int pkts = 0;
>>
>>  	vq = vsock->vqs[VSOCK_VQ_RX];
>>
>> @@ -627,11 +599,9 @@ static void virtio_transport_rx_work(struct work_struct *work)
>>  			struct sk_buff *skb;
>>  			unsigned int len;
>>
>> -			if (!virtio_transport_more_replies(vsock)) {
>> -				/* Stop rx until the device processes already
>> -				 * pending replies.  Leave rx virtqueue
>> -				 * callbacks disabled.
>> -				 */
>> +			if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
>> +				/* Allow other works on the same queue to run */
>> +				queue_work(virtio_vsock_workqueue, work);
>>  				goto out;
>>  			}
>>
>> @@ -675,8 +645,6 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
>>  	vsock->rx_buf_max_nr = 0;
>>  	mutex_unlock(&vsock->rx_lock);
>>
>> -	atomic_set(&vsock->queued_replies, 0);
>> -
>>  	ret = virtio_find_vqs(vdev, VSOCK_VQ_MAX, vsock->vqs, vqs_info, NULL);
>>  	if (ret < 0)
>>  		return ret;
>> --
>> 2.47.1
>>
Michael S. Tsirkin April 3, 2025, 12:21 p.m. UTC | #5
On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
> On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
> > Ever since the introduction of the virtio vsock driver, it included
> > pushback logic that blocks it from taking any new RX packets until the
> > TX queue backlog becomes shallower than the virtqueue size.
> > 
> > This logic works fine when you connect a user space application on the
> > hypervisor with a virtio-vsock target, because the guest will stop
> > receiving data until the host pulled all outstanding data from the VM.
> > 
> > With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> > 
> >   Parent      Enclave
> > 
> >     RX -------- TX
> >     TX -------- RX
> > 
> > This means we now have 2 virtio-vsock backends that both have the pushback
> > logic. If the parent's TX queue runs full at the same time as the
> > Enclave's, both virtio-vsock drivers fall into the pushback path and
> > no longer accept RX traffic. However, that RX traffic is TX traffic on
> > the other side which blocks that driver from making any forward
> > progress. We're now in a deadlock.
> > 
> > To resolve this, let's remove that pushback logic altogether and rely on
> > higher levels (like credits) to ensure we do not consume unbounded
> > memory.
> 
> The reason for queued_replies is that rx packet processing may emit tx
> packets. Therefore tx virtqueue space is required in order to process
> the rx virtqueue.
> 
> queued_replies puts a bound on the amount of tx packets that can be
> queued in memory so the other side cannot consume unlimited memory. Once
> that bound has been reached, rx processing stops until the other side
> frees up tx virtqueue space.
> 
> It's been a while since I looked at this problem, so I don't have a
> solution ready. In fact, last time I thought about it I wondered if the
> design of virtio-vsock fundamentally suffers from deadlocks.
> 
> I don't think removing queued_replies is possible without a replacement
> for the bounded memory and virtqueue exhaustion issue though. Credits
> are not a solution - they are about socket buffer space, not about
> virtqueue space, which includes control packets that are not accounted
> by socket buffer space.


Hmm.
Actually, let's think which packets require a response.

VIRTIO_VSOCK_OP_REQUEST
VIRTIO_VSOCK_OP_SHUTDOWN
VIRTIO_VSOCK_OP_CREDIT_REQUEST


The response to these always reports the state of an existing socket,
and only one type of response is relevant for each socket.

So here's my suggestion:
stop queueing replies on the vsock device; instead,
simply store the response on the socket, and create a list of sockets
that have replies to be transmitted.


WDYT?
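
A rough sketch of that shape in C (every name below is hypothetical and
locking is elided): each socket stores at most one pending reply, and
sockets with something to transmit sit on a list the tx worker walks,
so memory stays bounded at one reply per socket no matter how hard the
peer floods:

#include <stdbool.h>

enum reply_op { OP_NONE, OP_RESPONSE, OP_RST, OP_CREDIT_UPDATE };

struct sock_reply {
	struct sock_reply *next;	/* link in pending-replies list */
	bool on_list;
	enum reply_op op;		/* the one outstanding reply */
};

static struct sock_reply *reply_list;	/* sockets with a pending reply */

/* rx side: record the reply; newer state supersedes the old one */
static void sock_set_reply(struct sock_reply *sk, enum reply_op op)
{
	sk->op = op;
	if (!sk->on_list) {
		sk->on_list = true;
		sk->next = reply_list;
		reply_list = sk;
	}
}

/* tx side: pop one socket and emit its reply into the virtqueue */
static struct sock_reply *reply_list_pop(void)
{
	struct sock_reply *sk = reply_list;

	if (sk) {
		reply_list = sk->next;
		sk->on_list = false;
	}
	return sk;
}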


> > 
> > RX and TX queues share the same work queue. To prevent starvation of TX
> > by an RX flood and vice versa now that the pushback logic is gone, let's
> > deliberately reschedule RX and TX work after a fixed threshold (256) of
> > packets to process.
> > 
> > Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > ---
> >  net/vmw_vsock/virtio_transport.c | 70 +++++++++-----------------------
> >  1 file changed, 19 insertions(+), 51 deletions(-)
> > 
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index f0e48e6911fc..54030c729767 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -26,6 +26,12 @@ static struct virtio_vsock __rcu *the_virtio_vsock;
> >  static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
> >  static struct virtio_transport virtio_transport; /* forward declaration */
> >  
> > +/*
> > + * Max number of RX packets transferred before requeueing so we do
> > + * not starve TX traffic because they share the same work queue.
> > + */
> > +#define VSOCK_MAX_PKTS_PER_WORK 256
> > +
> >  struct virtio_vsock {
> >  	struct virtio_device *vdev;
> >  	struct virtqueue *vqs[VSOCK_VQ_MAX];
> > @@ -44,8 +50,6 @@ struct virtio_vsock {
> >  	struct work_struct send_pkt_work;
> >  	struct sk_buff_head send_pkt_queue;
> >  
> > -	atomic_t queued_replies;
> > -
> >  	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
> >  	 * must be accessed with rx_lock held.
> >  	 */
> > @@ -158,7 +162,7 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> >  		container_of(work, struct virtio_vsock, send_pkt_work);
> >  	struct virtqueue *vq;
> >  	bool added = false;
> > -	bool restart_rx = false;
> > +	int pkts = 0;
> >  
> >  	mutex_lock(&vsock->tx_lock);
> >  
> > @@ -172,6 +176,12 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> >  		bool reply;
> >  		int ret;
> >  
> > +		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
> > +			/* Allow other works on the same queue to run */
> > +			queue_work(virtio_vsock_workqueue, work);
> > +			break;
> > +		}
> > +
> >  		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
> >  		if (!skb)
> >  			break;
> > @@ -184,17 +194,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> >  			break;
> >  		}
> >  
> > -		if (reply) {
> > -			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> > -			int val;
> > -
> > -			val = atomic_dec_return(&vsock->queued_replies);
> > -
> > -			/* Do we now have resources to resume rx processing? */
> > -			if (val + 1 == virtqueue_get_vring_size(rx_vq))
> > -				restart_rx = true;
> > -		}
> > -
> >  		added = true;
> >  	}
> >  
> > @@ -203,9 +202,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> >  
> >  out:
> >  	mutex_unlock(&vsock->tx_lock);
> > -
> > -	if (restart_rx)
> > -		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >  }
> >  
> >  /* Caller need to hold RCU for vsock.
> > @@ -261,9 +257,6 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> >  	 */
> >  	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
> >  	    virtio_transport_send_skb_fast_path(vsock, skb)) {
> > -		if (virtio_vsock_skb_reply(skb))
> > -			atomic_inc(&vsock->queued_replies);
> > -
> >  		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
> >  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
> >  	}
> > @@ -277,7 +270,7 @@ static int
> >  virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> >  {
> >  	struct virtio_vsock *vsock;
> > -	int cnt = 0, ret;
> > +	int ret;
> >  
> >  	rcu_read_lock();
> >  	vsock = rcu_dereference(the_virtio_vsock);
> > @@ -286,17 +279,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> >  		goto out_rcu;
> >  	}
> >  
> > -	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
> > -
> > -	if (cnt) {
> > -		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> > -		int new_cnt;
> > -
> > -		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
> > -		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
> > -		    new_cnt < virtqueue_get_vring_size(rx_vq))
> > -			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> > -	}
> > +	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
> >  
> >  	ret = 0;
> >  
> > @@ -367,18 +350,6 @@ static void virtio_transport_tx_work(struct work_struct *work)
> >  		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
> >  }
> >  
> > -/* Is there space left for replies to rx packets? */
> > -static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
> > -{
> > -	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
> > -	int val;
> > -
> > -	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
> > -	val = atomic_read(&vsock->queued_replies);
> > -
> > -	return val < virtqueue_get_vring_size(vq);
> > -}
> > -
> >  /* event_lock must be held */
> >  static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
> >  				       struct virtio_vsock_event *event)
> > @@ -613,6 +584,7 @@ static void virtio_transport_rx_work(struct work_struct *work)
> >  	struct virtio_vsock *vsock =
> >  		container_of(work, struct virtio_vsock, rx_work);
> >  	struct virtqueue *vq;
> > +	int pkts = 0;
> >  
> >  	vq = vsock->vqs[VSOCK_VQ_RX];
> >  
> > @@ -627,11 +599,9 @@ static void virtio_transport_rx_work(struct work_struct *work)
> >  			struct sk_buff *skb;
> >  			unsigned int len;
> >  
> > -			if (!virtio_transport_more_replies(vsock)) {
> > -				/* Stop rx until the device processes already
> > -				 * pending replies.  Leave rx virtqueue
> > -				 * callbacks disabled.
> > -				 */
> > +			if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
> > +				/* Allow other works on the same queue to run */
> > +				queue_work(virtio_vsock_workqueue, work);
> >  				goto out;
> >  			}
> >  
> > @@ -675,8 +645,6 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
> >  	vsock->rx_buf_max_nr = 0;
> >  	mutex_unlock(&vsock->rx_lock);
> >  
> > -	atomic_set(&vsock->queued_replies, 0);
> > -
> >  	ret = virtio_find_vqs(vdev, VSOCK_VQ_MAX, vsock->vqs, vqs_info, NULL);
> >  	if (ret < 0)
> >  		return ret;
> > -- 
> > 2.47.1
> >
Alexander Graf April 4, 2025, 8:04 a.m. UTC | #6
On 03.04.25 14:21, Michael S. Tsirkin wrote:
> On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
>> On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
>>> Ever since the introduction of the virtio vsock driver, it included
>>> pushback logic that blocks it from taking any new RX packets until the
>>> TX queue backlog becomes shallower than the virtqueue size.
>>>
>>> This logic works fine when you connect a user space application on the
>>> hypervisor with a virtio-vsock target, because the guest will stop
>>> receiving data until the host pulled all outstanding data from the VM.
>>>
>>> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>>>
>>>    Parent      Enclave
>>>
>>>      RX -------- TX
>>>      TX -------- RX
>>>
>>> This means we now have 2 virtio-vsock backends that both have the pushback
>>> logic. If the parent's TX queue runs full at the same time as the
>>> Enclave's, both virtio-vsock drivers fall into the pushback path and
>>> no longer accept RX traffic. However, that RX traffic is TX traffic on
>>> the other side which blocks that driver from making any forward
>>> progress. We're now in a deadlock.
>>>
>>> To resolve this, let's remove that pushback logic altogether and rely on
>>> higher levels (like credits) to ensure we do not consume unbounded
>>> memory.
>> The reason for queued_replies is that rx packet processing may emit tx
>> packets. Therefore tx virtqueue space is required in order to process
>> the rx virtqueue.
>>
>> queued_replies puts a bound on the amount of tx packets that can be
>> queued in memory so the other side cannot consume unlimited memory. Once
>> that bound has been reached, rx processing stops until the other side
>> frees up tx virtqueue space.
>>
>> It's been a while since I looked at this problem, so I don't have a
>> solution ready. In fact, last time I thought about it I wondered if the
>> design of virtio-vsock fundamentally suffers from deadlocks.
>>
>> I don't think removing queued_replies is possible without a replacement
>> for the bounded memory and virtqueue exhaustion issue though. Credits
>> are not a solution - they are about socket buffer space, not about
>> virtqueue space, which includes control packets that are not accounted
>> by socket buffer space.
>
> Hmm.
> Actually, let's think which packets require a response.
>
> VIRTIO_VSOCK_OP_REQUEST
> VIRTIO_VSOCK_OP_SHUTDOWN
> VIRTIO_VSOCK_OP_CREDIT_REQUEST
>
>
> the response to these always reports a state of an existing socket.
> and, only one type of response is relevant for each socket.
>
> So here's my suggestion:
> stop queueing replies on the vsock device, instead,
> simply store the response on the socket, and create a list of sockets
> that have replies to be transmitted
>
>
> WDYT?


Wouldn't that create the same problem again? The socket will eventually 
push back on any new data because its FIFO is full. At that point, the 
"other side" could still have a queue full of requests on exactly that 
socket that need to get processed. We now cannot pull those packets off 
the virtio queue, because we cannot enqueue responses.

But that means the one queue is now blocked from making forward 
progress, because we are applying back pressure. And that means 
everything can grind to a halt and we have the same deadlock this patch 
is trying to fix.

I don't see how we can possibly guarantee a lossless data channel over a 
tiny wire (single, fixed-size, in-order virtio ring) while also 
guaranteeing bounded memory usage. One of the constraints needs to go: 
either we are no longer lossless or we effectively allow unbounded 
memory usage.


Alex
Michael S. Tsirkin April 4, 2025, 8:14 a.m. UTC | #7
On Fri, Apr 04, 2025 at 10:04:38AM +0200, Alexander Graf wrote:
> 
> On 03.04.25 14:21, Michael S. Tsirkin wrote:
> > On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
> > > > Ever since the introduction of the virtio vsock driver, it included
> > > > pushback logic that blocks it from taking any new RX packets until the
> > > > TX queue backlog becomes shallower than the virtqueue size.
> > > > 
> > > > This logic works fine when you connect a user space application on the
> > > > hypervisor with a virtio-vsock target, because the guest will stop
> > > > receiving data until the host pulled all outstanding data from the VM.
> > > > 
> > > > With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> > > > 
> > > >    Parent      Enclave
> > > > 
> > > >      RX -------- TX
> > > >      TX -------- RX
> > > > 
> > > > This means we now have 2 virtio-vsock backends that both have the pushback
> > > > logic. If the parent's TX queue runs full at the same time as the
> > > > Enclave's, both virtio-vsock drivers fall into the pushback path and
> > > > no longer accept RX traffic. However, that RX traffic is TX traffic on
> > > > the other side which blocks that driver from making any forward
> > > > progress. We're now in a deadlock.
> > > > 
> > > > To resolve this, let's remove that pushback logic altogether and rely on
> > > > higher levels (like credits) to ensure we do not consume unbounded
> > > > memory.
> > > The reason for queued_replies is that rx packet processing may emit tx
> > > packets. Therefore tx virtqueue space is required in order to process
> > > the rx virtqueue.
> > > 
> > > queued_replies puts a bound on the amount of tx packets that can be
> > > queued in memory so the other side cannot consume unlimited memory. Once
> > > that bound has been reached, rx processing stops until the other side
> > > frees up tx virtqueue space.
> > > 
> > > It's been a while since I looked at this problem, so I don't have a
> > > solution ready. In fact, last time I thought about it I wondered if the
> > > design of virtio-vsock fundamentally suffers from deadlocks.
> > > 
> > > I don't think removing queued_replies is possible without a replacement
> > > for the bounded memory and virtqueue exhaustion issue though. Credits
> > > are not a solution - they are about socket buffer space, not about
> > > virtqueue space, which includes control packets that are not accounted
> > > by socket buffer space.
> > 
> > Hmm.
> > Actually, let's think which packets require a response.
> > 
> > VIRTIO_VSOCK_OP_REQUEST
> > VIRTIO_VSOCK_OP_SHUTDOWN
> > VIRTIO_VSOCK_OP_CREDIT_REQUEST
> > 
> > 
> > the response to these always reports a state of an existing socket.
> > and, only one type of response is relevant for each socket.
> > 
> > So here's my suggestion:
> > stop queueing replies on the vsock device, instead,
> > simply store the response on the socket, and create a list of sockets
> > that have replies to be transmitted
> > 
> > 
> > WDYT?
> 
> 
> Wouldn't that create the same problem again? The socket will eventually push
> back any new data that it can take because its FIFO is full. At that point,
> the "other side" could still have a queue full of requests on exactly that
> socket that need to get processed. We can now not pull those packets off the
> virtio queue, because we can not enqueue responses.

Either I don't understand what you wrote or I did not explain myself
clearly. 

In this idea only a single response ever needs to be enqueued
like this in the socket, because no more than one needs to
be outstanding per socket.

For example, until VIRTIO_VSOCK_OP_REQUEST
is responded to, the socket is not active and does not need to
send anything.


> 
> But that means now the one queue is blocked from making forward progress,
> because we are applying back pressure. And that means everything can grind
> to a halt and we have the same deadlock this patch is trying to fix.
> 
> I don't see how we can possibly guarantee a lossless data channel over a
> tiny wire (single, fixed size, in order virtio ring) while also guaranteeing
> bounded memory usage. One of the constraints need to go: Either we are no
> longer lossless or we effectively allow unbounded memory usage.
> 
> 
> Alex
Stefano Garzarella April 4, 2025, 8:30 a.m. UTC | #8
On Fri, Apr 04, 2025 at 04:14:51AM -0400, Michael S. Tsirkin wrote:
>On Fri, Apr 04, 2025 at 10:04:38AM +0200, Alexander Graf wrote:
>>
>> On 03.04.25 14:21, Michael S. Tsirkin wrote:
>> > On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
>> > > On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
>> > > > Ever since the introduction of the virtio vsock driver, it included
>> > > > pushback logic that blocks it from taking any new RX packets until the
>> > > > TX queue backlog becomes shallower than the virtqueue size.
>> > > >
>> > > > This logic works fine when you connect a user space application on the
>> > > > hypervisor with a virtio-vsock target, because the guest will stop
>> > > > receiving data until the host pulled all outstanding data from the VM.
>> > > >
>> > > > With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>> > > >
>> > > >    Parent      Enclave
>> > > >
>> > > >      RX -------- TX
>> > > >      TX -------- RX
>> > > >
>> > > > This means we now have 2 virtio-vsock backends that both have the pushback
>> > > > logic. If the parent's TX queue runs full at the same time as the
>> > > > Enclave's, both virtio-vsock drivers fall into the pushback path and
>> > > > no longer accept RX traffic. However, that RX traffic is TX traffic on
>> > > > the other side which blocks that driver from making any forward
>> > > > progress. We're now in a deadlock.
>> > > >
>> > > > To resolve this, let's remove that pushback logic altogether and rely on
>> > > > higher levels (like credits) to ensure we do not consume unbounded
>> > > > memory.
>> > > The reason for queued_replies is that rx packet processing may emit tx
>> > > packets. Therefore tx virtqueue space is required in order to process
>> > > the rx virtqueue.
>> > >
>> > > queued_replies puts a bound on the amount of tx packets that can be
>> > > queued in memory so the other side cannot consume unlimited memory. Once
>> > > that bound has been reached, rx processing stops until the other side
>> > > frees up tx virtqueue space.
>> > >
>> > > It's been a while since I looked at this problem, so I don't have a
>> > > solution ready. In fact, last time I thought about it I wondered if the
>> > > design of virtio-vsock fundamentally suffers from deadlocks.
>> > >
>> > > I don't think removing queued_replies is possible without a replacement
>> > > for the bounded memory and virtqueue exhaustion issue though. Credits
>> > > are not a solution - they are about socket buffer space, not about
>> > > virtqueue space, which includes control packets that are not accounted
>> > > by socket buffer space.
>> >
>> > Hmm.
>> > Actually, let's think which packets require a response.
>> >
>> > VIRTIO_VSOCK_OP_REQUEST
>> > VIRTIO_VSOCK_OP_SHUTDOWN
>> > VIRTIO_VSOCK_OP_CREDIT_REQUEST
>> >
>> >
>> > the response to these always reports a state of an existing socket.
>> > and, only one type of response is relevant for each socket.
>> >
>> > So here's my suggestion:
>> > stop queueing replies on the vsock device, instead,
>> > simply store the response on the socket, and create a list of sockets
>> > that have replies to be transmitted
>> >
>> >
>> > WDYT?
>>
>>
>> Wouldn't that create the same problem again? The socket will eventually push
>> back any new data that it can take because its FIFO is full. At that point,
>> the "other side" could still have a queue full of requests on exactly that
>> socket that need to get processed. We can now not pull those packets off the
>> virtio queue, because we can not enqueue responses.
>
>Either I don't understand what you wrote or I did not explain myself
>clearly.

I didn't fully understand either, but with this last message of yours 
it's clear to me and I like the idea!

>
>In this idea there needs to be a single response enqueued
>like this in the socket, because, no more than one ever needs to
>be outstanding per socket.
>
>For example, until VIRTIO_VSOCK_OP_REQUEST
>is responded to, the socket is not active and does not need to
>send anything.

One case I see is responding when we don't have a listening socket 
(e.g. the port is not open): where before the user got a message that 
the port was not open, connect() will now time out instead. So we could 
respond if we have space in the virtqueue, and otherwise discard the 
reply without losing any important information or the guarantee of a 
lossless channel.

So in summary:

- if we have an associated socket, then always respond (possibly
   allocating memory in the intermediate queue if the virtqueue is full,
   as we already do). We need to figure out whether a flood of
   VIRTIO_VSOCK_OP_CREDIT_REQUEST would cause problems, but we can always
   decide not to respond if we have sent this identical information
   before.

- if there is no associated socket, we only respond if the virtqueue has
   space.

I like it and it seems feasible without changing anything in the 
specification.
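
As a sketch, that policy could look roughly like this (every helper 
below is a hypothetical stand-in, not a driver function):

#include <stdbool.h>

struct pkt;
struct sock;

/* hypothetical stand-ins for the real driver plumbing */
extern struct sock *lookup_socket(struct pkt *p);
extern bool tx_vq_has_space(void);
extern void send_now(struct pkt *reply);
extern void park_on_socket(struct sock *s, struct pkt *reply);

static void handle_rx_needing_reply(struct pkt *p, struct pkt *reply)
{
	struct sock *s = lookup_socket(p);

	if (s) {
		/* associated socket: always respond; if the virtqueue
		 * is full, park the single reply on the socket */
		if (tx_vq_has_space())
			send_now(reply);
		else
			park_on_socket(s, reply);
	} else if (tx_vq_has_space()) {
		/* no socket (e.g. closed port): best-effort reply only
		 * when the virtqueue has room */
		send_now(reply);
	}
	/* else: drop it; the peer's connect() times out instead of
	 * failing fast, and nothing important is lost */
}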

Did I get it right?

Thanks,
Stefano
Michael S. Tsirkin April 4, 2025, 8:37 a.m. UTC | #9
On Fri, Apr 04, 2025 at 10:30:43AM +0200, Stefano Garzarella wrote:
> On Fri, Apr 04, 2025 at 04:14:51AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Apr 04, 2025 at 10:04:38AM +0200, Alexander Graf wrote:
> > > 
> > > On 03.04.25 14:21, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 02, 2025 at 12:14:24PM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Apr 01, 2025 at 08:13:49PM +0000, Alexander Graf wrote:
> > > > > > Ever since the introduction of the virtio vsock driver, it included
> > > > > > pushback logic that blocks it from taking any new RX packets until the
> > > > > > TX queue backlog becomes shallower than the virtqueue size.
> > > > > >
> > > > > > This logic works fine when you connect a user space application on the
> > > > > > hypervisor with a virtio-vsock target, because the guest will stop
> > > > > > receiving data until the host pulled all outstanding data from the VM.
> > > > > >
> > > > > > With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> > > > > >
> > > > > >    Parent      Enclave
> > > > > >
> > > > > >      RX -------- TX
> > > > > >      TX -------- RX
> > > > > >
> > > > > > This means we now have 2 virtio-vsock backends that both have the pushback
> > > > > > logic. If the parent's TX queue runs full at the same time as the
> > > > > > Enclave's, both virtio-vsock drivers fall into the pushback path and
> > > > > > no longer accept RX traffic. However, that RX traffic is TX traffic on
> > > > > > the other side which blocks that driver from making any forward
> > > > > > progress. We're now in a deadlock.
> > > > > >
> > > > > > To resolve this, let's remove that pushback logic altogether and rely on
> > > > > > higher levels (like credits) to ensure we do not consume unbounded
> > > > > > memory.
> > > > > The reason for queued_replies is that rx packet processing may emit tx
> > > > > packets. Therefore tx virtqueue space is required in order to process
> > > > > the rx virtqueue.
> > > > >
> > > > > queued_replies puts a bound on the amount of tx packets that can be
> > > > > queued in memory so the other side cannot consume unlimited memory. Once
> > > > > that bound has been reached, rx processing stops until the other side
> > > > > frees up tx virtqueue space.
> > > > >
> > > > > It's been a while since I looked at this problem, so I don't have a
> > > > > solution ready. In fact, last time I thought about it I wondered if the
> > > > > design of virtio-vsock fundamentally suffers from deadlocks.
> > > > >
> > > > > I don't think removing queued_replies is possible without a replacement
> > > > > for the bounded memory and virtqueue exhaustion issue though. Credits
> > > > > are not a solution - they are about socket buffer space, not about
> > > > > virtqueue space, which includes control packets that are not accounted
> > > > > by socket buffer space.
> > > >
> > > > Hmm.
> > > > Actually, let's think which packets require a response.
> > > >
> > > > VIRTIO_VSOCK_OP_REQUEST
> > > > VIRTIO_VSOCK_OP_SHUTDOWN
> > > > VIRTIO_VSOCK_OP_CREDIT_REQUEST
> > > >
> > > >
> > > > the response to these always reports a state of an existing socket.
> > > > and, only one type of response is relevant for each socket.
> > > >
> > > > So here's my suggestion:
> > > > stop queueing replies on the vsock device, instead,
> > > > simply store the response on the socket, and create a list of sockets
> > > > that have replies to be transmitted
> > > >
> > > >
> > > > WDYT?
> > > 
> > > 
> > > Wouldn't that create the same problem again? The socket will eventually push
> > > back any new data that it can take because its FIFO is full. At that point,
> > > the "other side" could still have a queue full of requests on exactly that
> > > socket that need to get processed. We can now not pull those packets off the
> > > virtio queue, because we can not enqueue responses.
> > 
> > Either I don't understand what you wrote or I did not explain myself
> > clearly.
> 
> I didn't fully understand either, but with this last message of yours it's
> clear to me and I like the idea!
> 
> > 
> > In this idea there needs to be a single response enqueued
> > like this in the socket, because, no more than one ever needs to
> > be outstanding per socket.
> > 
> > For example, until VIRTIO_VSOCK_OP_REQUEST
> > is responded to, the socket is not active and does not need to
> > send anything.
> 
> One case I see is responding when we don't have the socket listening (e.g.
> the port is not open), so if before the user had a message that the port was
> not open, now instead connect() will timeout. So we could respond if we have
> space in the virtqueue, otherwise discard it without losing any important
> information or guarantee of a lossless channel.
> 
> So in summary:
> 
> - if we have an associated socket, then always respond (possibly
>   allocating memory in the intermediate queue if the virtqueue is full
>   as we already do). We need to figure out if a flood of
>   VIRTIO_VSOCK_OP_CREDIT_REQUEST would cause problems, but we can always
>   decide not to respond if we have sent this identical information
>   before.

If we take this path, we need to consider whether not responding is
within spec. But again, a needed credit update is just a single flag we
have to set on a socket. If we have anything else we need to send, it
can also carry the credit update.


> - if there is no associated socket, we only respond if virtqueue has
>   space.
> 
> I like it and it seems feasible without changing anything in the
> specification.
> 
> Did I get it right?
> 
> Thanks,
> Stefano

That was the idea, yes.

Patch

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f0e48e6911fc..54030c729767 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -26,6 +26,12 @@  static struct virtio_vsock __rcu *the_virtio_vsock;
 static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
 static struct virtio_transport virtio_transport; /* forward declaration */
 
+/*
+ * Max number of RX packets transferred before requeueing so we do
+ * not starve TX traffic because they share the same work queue.
+ */
+#define VSOCK_MAX_PKTS_PER_WORK 256
+
 struct virtio_vsock {
 	struct virtio_device *vdev;
 	struct virtqueue *vqs[VSOCK_VQ_MAX];
@@ -44,8 +50,6 @@  struct virtio_vsock {
 	struct work_struct send_pkt_work;
 	struct sk_buff_head send_pkt_queue;
 
-	atomic_t queued_replies;
-
 	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
 	 * must be accessed with rx_lock held.
 	 */
@@ -158,7 +162,7 @@  virtio_transport_send_pkt_work(struct work_struct *work)
 		container_of(work, struct virtio_vsock, send_pkt_work);
 	struct virtqueue *vq;
 	bool added = false;
-	bool restart_rx = false;
+	int pkts = 0;
 
 	mutex_lock(&vsock->tx_lock);
 
@@ -172,6 +176,12 @@  virtio_transport_send_pkt_work(struct work_struct *work)
 		bool reply;
 		int ret;
 
+		if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
+			/* Allow other works on the same queue to run */
+			queue_work(virtio_vsock_workqueue, work);
+			break;
+		}
+
 		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
 		if (!skb)
 			break;
@@ -184,17 +194,6 @@  virtio_transport_send_pkt_work(struct work_struct *work)
 			break;
 		}
 
-		if (reply) {
-			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
-			int val;
-
-			val = atomic_dec_return(&vsock->queued_replies);
-
-			/* Do we now have resources to resume rx processing? */
-			if (val + 1 == virtqueue_get_vring_size(rx_vq))
-				restart_rx = true;
-		}
-
 		added = true;
 	}
 
@@ -203,9 +202,6 @@  virtio_transport_send_pkt_work(struct work_struct *work)
 
 out:
 	mutex_unlock(&vsock->tx_lock);
-
-	if (restart_rx)
-		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
 /* Caller need to hold RCU for vsock.
@@ -261,9 +257,6 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 	 */
 	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
 	    virtio_transport_send_skb_fast_path(vsock, skb)) {
-		if (virtio_vsock_skb_reply(skb))
-			atomic_inc(&vsock->queued_replies);
-
 		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
 		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
 	}
@@ -277,7 +270,7 @@  static int
 virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct virtio_vsock *vsock;
-	int cnt = 0, ret;
+	int ret;
 
 	rcu_read_lock();
 	vsock = rcu_dereference(the_virtio_vsock);
@@ -286,17 +279,7 @@  virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 		goto out_rcu;
 	}
 
-	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
-
-	if (cnt) {
-		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
-		int new_cnt;
-
-		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
-		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
-		    new_cnt < virtqueue_get_vring_size(rx_vq))
-			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
-	}
+	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
 
 	ret = 0;
 
@@ -367,18 +350,6 @@  static void virtio_transport_tx_work(struct work_struct *work)
 		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
 }
 
-/* Is there space left for replies to rx packets? */
-static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
-{
-	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
-	int val;
-
-	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
-	val = atomic_read(&vsock->queued_replies);
-
-	return val < virtqueue_get_vring_size(vq);
-}
-
 /* event_lock must be held */
 static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
 				       struct virtio_vsock_event *event)
@@ -613,6 +584,7 @@  static void virtio_transport_rx_work(struct work_struct *work)
 	struct virtio_vsock *vsock =
 		container_of(work, struct virtio_vsock, rx_work);
 	struct virtqueue *vq;
+	int pkts = 0;
 
 	vq = vsock->vqs[VSOCK_VQ_RX];
 
@@ -627,11 +599,9 @@  static void virtio_transport_rx_work(struct work_struct *work)
 			struct sk_buff *skb;
 			unsigned int len;
 
-			if (!virtio_transport_more_replies(vsock)) {
-				/* Stop rx until the device processes already
-				 * pending replies.  Leave rx virtqueue
-				 * callbacks disabled.
-				 */
+			if (++pkts > VSOCK_MAX_PKTS_PER_WORK) {
+				/* Allow other works on the same queue to run */
+				queue_work(virtio_vsock_workqueue, work);
 				goto out;
 			}
 
@@ -675,8 +645,6 @@  static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
 	vsock->rx_buf_max_nr = 0;
 	mutex_unlock(&vsock->rx_lock);
 
-	atomic_set(&vsock->queued_replies, 0);
-
 	ret = virtio_find_vqs(vdev, VSOCK_VQ_MAX, vsock->vqs, vqs_info, NULL);
 	if (ret < 0)
 		return ret;