diff mbox series

[1/3] ALSA USB MIDI: Fix port starvation

Message ID c9aed355adc93d5de0cc4c740d16d19e3e210f79.camel@domdv.de (mailing list archive)
State New, archived
Headers show
Series ALSA USB MIDI: Predictable low latency for USB MIDI output | expand

Commit Message

Andreas Steinmetz March 14, 2020, 8:11 a.m. UTC
In case of a multiport USB MIDI interface the lower numbered ports starve
higher numbered ports when more than one port is busy transmitting data.
The starvation as of the current code is complete and can be for an
arbitrarily long time. Control from userspace is not possible without
at least halving the theoretically possible device throughput.
This is especially bad as there are now 16x16 interface products available.

An unpredicable and arbitrarily long latency is not acceptable. The
loss of half of the throughput is not acceptable, either. Fair
scheduling of all busy ports is required.

The patch below balances the actual USB output between all busy ports.
It is done in such a way that the single port use performance before
applying the patch is identical to the one after applying the patch.
To get there, the snd_usbmidi_transmit_byte helper had to be converted to
return output notification to allow for the 'repeat' shortcut.

The patch tries to avoid O(2) load increase by scaling the balancing
loop to the ports that are actually busy as far as this is possible.

For the patch to properly work the wMaxPacketSize of the device must be
large enough to allow for at least one MIDI event per port in a URB.
If this constraint is not met, which is quite improbable, starvation
will occur again. Though this could be fixed the likelyhood of such
a device is so low that the additional overhead introduced for all
other devices is not worth it. Current multi port MIDI interfaces do
typically have 2^n output ports and 2^x as wMaxPacketSize where x>n.

For a four port USB MIDI interface with a wMaxPacketSize of 64 the
maximum latency for any port is changed by the patch from indefinite
to 107.52ms.

The patch affects the output of all class compliant USB MIDI interfaces.
Users will typically experience either no change or increased response,
depending on their use case.

Signed-off-by: Andreas Steinmetz <ast@domdv.de>

Comments

Clemens Ladisch March 16, 2020, 12:03 p.m. UTC | #1
Andreas Steinmetz wrote:
> the snd_usbmidi_transmit_byte helper had to be converted to
> return output notification to allow for the 'repeat' shortcut.

Why not simply handle one MIDI byte per port in each iteration?  It
could be argued that single-byte MIDI commands are likely to be real-
time messages and deserve to go first.

> Current multi port MIDI interfaces do
> typically have 2^n output ports and 2^x as wMaxPacketSize where x>n.

The USB specification requires bulk endpoints to have a wMaxPacketSize
value of 8/16/32/64 for full speed, or exactly 512 for high speed.

> For the patch to properly work the wMaxPacketSize of the device must be
> large enough to allow for at least one MIDI event per port in a URB.

There are devices that handle only the first four bytes of a received
packet (because Windows used to send only small packets), and one of
them, the ESI M4U, actually has more than one port.

My original idea for that FIXME was to use round robin until the packet
is filled (or all ports are empty), and to store the next port index
(where to start for the next packet) in the endpoint.  This would be
able to distribute the balancing over multiple packets.


Regards,
Clemens
Andreas Steinmetz March 16, 2020, 1:31 p.m. UTC | #2
On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
> Andreas Steinmetz wrote:
> > the snd_usbmidi_transmit_byte helper had to be converted to
> > return output notification to allow for the 'repeat' shortcut.
> 
> Why not simply handle one MIDI byte per port in each iteration?  It
> could be argued that single-byte MIDI commands are likely to be real-
> time messages and deserve to go first.
> 

Actually the patch does exactly this. As soon as the helper signals
that a message is complete the next port is processed. The "repeat"
loop is just necessary to get a complete class compliant message for
transfer as the helper processes one byte at a time.
The range optimization is there to prevent O(2) performance cost if
possible.

> > Current multi port MIDI interfaces do
> > typically have 2^n output ports and 2^x as wMaxPacketSize where
> > x>n.
> 
> The USB specification requires bulk endpoints to have a
> wMaxPacketSize
> value of 8/16/32/64 for full speed, or exactly 512 for high speed.
> 
> > For the patch to properly work the wMaxPacketSize of the device
> > must be
> > large enough to allow for at least one MIDI event per port in a
> > URB.
> 
> There are devices that handle only the first four bytes of a received
> packet (because Windows used to send only small packets), and one of
> them, the ESI M4U, actually has more than one port.
> 

Again no problem as the patch is designed to prefer quirks over user
settings to prevent malfunctioning devices. Quirk processing and thus
size restrictions are processed after user selection.

> My original idea for that FIXME was to use round robin until the
> packet
> is filled (or all ports are empty), and to store the next port index
> (where to start for the next packet) in the endpoint.  This would be
> able to distribute the balancing over multiple packets.
> 

The problem with "round robin until the packet is filled" is that in
case of a large wMaxPacketSize there's then a huge latency.
I just got for further internal testing an USB3 8x8 interface that uses
a wMaxPacketSize of 512.
In such a case a port doing a sysex transfer will delay another port
then sending a realtime message by a minimum of 123ms per queued URB.
Make that 7 URBs and you have a sysex queue of 860ms until the realtime
message can be transmitted.
> 
> Regards,
> Clemens
Clemens Ladisch March 17, 2020, 10:26 a.m. UTC | #3
Andreas Steinmetz wrote:
> On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
>> Andreas Steinmetz wrote:
>>> the snd_usbmidi_transmit_byte helper had to be converted to
>>> return output notification to allow for the 'repeat' shortcut.
>>
>> Why not simply handle one MIDI byte per port in each iteration?  It
>> could be argued that single-byte MIDI commands are likely to be real-
>> time messages and deserve to go first.
>
> Actually the patch does exactly this. As soon as the helper signals
> that a message is complete the next port is processed. The "repeat"
> loop is just necessary to get a complete class compliant message for
> transfer as the helper processes one byte at a time.

Why is it necessary to immedately get a complete class-compliant message?

> The range optimization is there to prevent O(2) performance cost if
> possible.

Do you mean O(n^2)?  For what n?

>> My original idea for that FIXME was to use round robin until the packet
>> is filled (or all ports are empty), and to store the next port index
>> (where to start for the next packet) in the endpoint.  This would be
>> able to distribute the balancing over multiple packets.
>
> The problem with "round robin until the packet is filled" is that in
> case of a large wMaxPacketSize there's then a huge latency.

This mechanism does not require using the original packet size.
The point is that we should not restart with the first port in each
packet.


Regards,
Clemens
Andreas Steinmetz March 17, 2020, 3:47 p.m. UTC | #4
On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
> Andreas Steinmetz wrote:
> > On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
> > > Andreas Steinmetz wrote:
> > > > the snd_usbmidi_transmit_byte helper had to be converted to
> > > > return output notification to allow for the 'repeat' shortcut.
> > > 
> > > Why not simply handle one MIDI byte per port in each
> > > iteration?  It
> > > could be argued that single-byte MIDI commands are likely to be
> > > real-
> > > time messages and deserve to go first.
> > 
> > Actually the patch does exactly this. As soon as the helper signals
> > that a message is complete the next port is processed. The "repeat"
> > loop is just necessary to get a complete class compliant message
> > for
> > transfer as the helper processes one byte at a time.
> 
> Why is it necessary to immedately get a complete class-compliant
> message?
> 

Maybe that's a misunderstanding caused by me using 'message' instead of
'event'. We need a complete MIDI event (or up to 3 sysex bytes) to form
the class compliant 4 byte USB event. That's what the 'repeat' shortcut
is for. And after one such event is collected to the URB the code
switches to the next port.

> > The range optimization is there to prevent O(2) performance cost if
> > possible.
> 
> Do you mean O(n^2)?  For what n?
> 

What I do mean is that in case of only one port being active we would
do 16 full outer iterations over all ports to collect 1 event. And we
need to do this for all events of that port. Even if we use the hint
from port->active as a shortcut its always the full set of iterations.

Memorizing the first and the last port of the previous loop and then
restricting to this set makes a full 16 port run in the first iteration
and then only the active port for all subsequent iterations.

And yes, in the worst case of a 16x16 interface with ports 1 and 16
being active we're back to processing 16 ports for every iteration - I
do, however, see this scenario as being quite theoretical in practice.

> > > My original idea for that FIXME was to use round robin until the
> > > packet
> > > is filled (or all ports are empty), and to store the next port
> > > index
> > > (where to start for the next packet) in the endpoint.  This would
> > > be
> > > able to distribute the balancing over multiple packets.
> > 
> > The problem with "round robin until the packet is filled" is that
> > in
> > case of a large wMaxPacketSize there's then a huge
> > latency.
> 
> This mechanism does not require using the original packet size.
> The point is that we should not restart with the first port in each
> packet.

I was thinking about that but in practice I do not know any class
compliant interface that does have a number of outputs that is not
equal to 2^n, i.e. 1, 2, 4, 8, 16.

Then wMaxPacketSize is also always a 2^m value (16, 64, 512).

With m>=n+2 (I don't believe that there is any class compliant
interface for which this is not true) we can always start at the first
port while doing fair scheduling (when a new URB is filled all ports
get an equal change if they do have events pending).

Implementing a restart based on the last packet scheduled for the very
rare possibility of the above being not true just adds unnecessary
complexity.

> 
> 
> Regards,
> Clemens
Regards
Clemens Ladisch March 18, 2020, 9:35 a.m. UTC | #5
Andreas Steinmetz wrote:
> On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
>> Why is it necessary to immedately get a complete class-compliant
>> message?
>
> Maybe that's a misunderstanding caused by me using 'message' instead of
> 'event'. We need a complete MIDI event (or up to 3 sysex bytes) to form
> the class compliant 4 byte USB event. That's what the 'repeat' shortcut
> is for. And after one such event is collected to the URB the code
> switches to the next port.

Why is it necessary to get a complete USB event before switching to the
next port?  Why not just process one byte in each loop iteration?

>> The point is that we should not restart with the first port in each
>> packet.
>
> I was thinking about that but in practice I do not know any class
> compliant interface that does have a number of outputs that is not
> equal to 2^n, i.e. 1, 2, 4, 8, 16.
>
> Then wMaxPacketSize is also always a 2^m value (16, 64, 512).
>
> With m>=n+2 (I don't believe that there is any class compliant
> interface for which this is not true)

At least ESI M4U has n=2, m=2.  It's for devices like this that the
driver uses multiple packets.  (I do not know if this (revision of
the) device is still used.)

It's certainly true that using the full packet size and 7 packets is
excessive in most situations.  The driver should _automatically_
reduce these values when it is safe.  (I'm not saying that module
parameters are completely superfluous, but they should never be
necessary if the driver can avoid it.)


I think the lowest-hanging fruit are the hardcoded 0x10, and that
port->active is not a single bitfield; the latter will make
searching for active ports easier.  I'll create patches for this.


Regards,
Clemens
Andreas Steinmetz March 18, 2020, 3:50 p.m. UTC | #6
On Wed, 2020-03-18 at 10:35 +0100, Clemens Ladisch wrote:
> Andreas Steinmetz wrote:
> > On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
> > > Why is it necessary to immedately get a complete class-compliant
> > > message?
> > 
> > Maybe that's a misunderstanding caused by me using 'message'
> > instead of
> > 'event'. We need a complete MIDI event (or up to 3 sysex bytes) to
> > form
> > the class compliant 4 byte USB event. That's what the 'repeat'
> > shortcut
> > is for. And after one such event is collected to the URB the code
> > switches to the next port.
> 
> Why is it necessary to get a complete USB event before switching to
> the
> next port?  Why not just process one byte in each loop iteration?
> 

Because of fairness. Let's assume a keyboard sends 6 note on messages,
that's 6 3 byte events on a port, while another port just sends channel
pressure which is two bytes and with running status only one byte.
Instead of fair scheduling the new notes will be delayed by the channel
pressure when doing only one byte per iteration. So the grouped
together note on events would be torn apart further than necessary
which may be audible.

> > > The point is that we should not restart with the first port in
> > > each
> > > packet.
> > 
> > I was thinking about that but in practice I do not know any class
> > compliant interface that does have a number of outputs that is not
> > equal to 2^n, i.e. 1, 2, 4, 8, 16.
> > 
> > Then wMaxPacketSize is also always a 2^m value (16, 64, 512).
> > 
> > With m>=n+2 (I don't believe that there is any class compliant
> > interface for which this is not true)
> 
> At least ESI M4U has n=2, m=2.  It's for devices like this that the
> driver uses multiple packets.  (I do not know if this (revision of
> the) device is still used.)
> 

Well, in this rare case the patch wouldn't do any harm. The only thing
is that there would be no improvement in behaviour to the previous code
version for this special device.

Please note that I did take especially care to not increase cpu usage
(on x86_64) because I wanted to make sure that (hopefully) there is no
load increase for slow embedded systems.

> It's certainly true that using the full packet size and 7 packets is
> excessive in most situations.  The driver should _automatically_
> reduce these values when it is safe.  (I'm not saying that module
> parameters are completely superfluous, but they should never be
> necessary if the driver can avoid it.)
> 

If automatic tuning is possible, fine. Manual override should still be
possible, though, for the case the user knows better than an automatism
for the use case of the user.

> 
> I think the lowest-hanging fruit are the hardcoded 0x10, and that
> port->active is not a single bitfield; the latter will make
> searching for active ports easier.  I'll create patches for this.

I'm very interested.

Cheers
Andreas Steinmetz March 24, 2020, 6:17 p.m. UTC | #7
On Wed, 2020-03-18 at 10:35 +0100, Clemens Ladisch wrote:
> I think the lowest-hanging fruit are the hardcoded 0x10, and that
> port->active is not a single bitfield; the latter will make
> searching for active ports easier.  I'll create patches for this.

I've created a small test utility that may be useful to test
patches. See https://github.com/not1337/rawmidiperf

BTW: Have you you already have something to look at?

Regards
Andreas
diff mbox series

Patch

--- a/sound/usb/midi.c	2020-03-12 22:45:06.611877152 +0100
+++ b/sound/usb/midi.c	2020-03-13 02:33:21.392847930 +0100
@@ -554,7 +554,7 @@  static void snd_usbmidi_output_midiman_p
 /*
  * Converts MIDI commands to USB MIDI packets.
  */
-static void snd_usbmidi_transmit_byte(struct usbmidi_out_port *port,
+static int snd_usbmidi_transmit_byte(struct usbmidi_out_port *port,
 				      uint8_t b, struct urb *urb)
 {
 	uint8_t p0 = port->cable;
@@ -563,6 +563,7 @@  static void snd_usbmidi_transmit_byte(st
 
 	if (b >= 0xf8) {
 		output_packet(urb, p0 | 0x0f, b, 0, 0);
+		return 1;
 	} else if (b >= 0xf0) {
 		switch (b) {
 		case 0xf0:
@@ -585,20 +586,20 @@  static void snd_usbmidi_transmit_byte(st
 		case 0xf6:
 			output_packet(urb, p0 | 0x05, 0xf6, 0, 0);
 			port->state = STATE_UNKNOWN;
-			break;
+			return 1;
 		case 0xf7:
 			switch (port->state) {
 			case STATE_SYSEX_0:
 				output_packet(urb, p0 | 0x05, 0xf7, 0, 0);
-				break;
+				return 1;
 			case STATE_SYSEX_1:
 				output_packet(urb, p0 | 0x06, port->data[0],
 					      0xf7, 0);
-				break;
+				return 1;
 			case STATE_SYSEX_2:
 				output_packet(urb, p0 | 0x07, port->data[0],
 					      port->data[1], 0xf7);
-				break;
+				return 1;
 			}
 			port->state = STATE_UNKNOWN;
 			break;
@@ -619,7 +620,7 @@  static void snd_usbmidi_transmit_byte(st
 				port->state = STATE_UNKNOWN;
 			}
 			output_packet(urb, p0, port->data[0], b, 0);
-			break;
+			return 1;
 		case STATE_2PARAM_1:
 			port->data[1] = b;
 			port->state = STATE_2PARAM_2;
@@ -633,7 +634,7 @@  static void snd_usbmidi_transmit_byte(st
 				port->state = STATE_UNKNOWN;
 			}
 			output_packet(urb, p0, port->data[0], port->data[1], b);
-			break;
+			return 1;
 		case STATE_SYSEX_0:
 			port->data[0] = b;
 			port->state = STATE_SYSEX_1;
@@ -646,29 +647,49 @@  static void snd_usbmidi_transmit_byte(st
 			output_packet(urb, p0 | 0x04, port->data[0],
 				      port->data[1], b);
 			port->state = STATE_SYSEX_0;
-			break;
+			return 1;
 		}
 	}
+	return 0;
 }
 
 static void snd_usbmidi_standard_output(struct snd_usb_midi_out_endpoint *ep,
 					struct urb *urb)
 {
 	int p;
+	int start = 0;
+	int total = 0x10;
+	int max = ep->max_transfer - 4;
+
+	if (max < 0)
+		return;
+
+	while (1) {
+		int first = -1;
+		int final = -1;
 
-	/* FIXME: lower-numbered ports can starve higher-numbered ports */
-	for (p = 0; p < 0x10; ++p) {
-		struct usbmidi_out_port *port = &ep->ports[p];
-		if (!port->active)
-			continue;
-		while (urb->transfer_buffer_length + 3 < ep->max_transfer) {
+		for (p = start; p < total; ++p) {
+			struct usbmidi_out_port *port = &ep->ports[p];
 			uint8_t b;
+			if (!port->active)
+				continue;
+repeat:
 			if (snd_rawmidi_transmit(port->substream, &b, 1) != 1) {
 				port->active = 0;
-				break;
+				continue;
 			}
-			snd_usbmidi_transmit_byte(port, b, urb);
+			if (!snd_usbmidi_transmit_byte(port, b, urb))
+				goto repeat;
+			if (urb->transfer_buffer_length > max)
+				return;
+			if (first == -1)
+				first = p;
+			final = p;
 		}
+		if (final == -1)
+			return;
+		start = first;
+		total = final + 1;
 	}
 }