ALSA: snd-usb: fix IRQ triggered NULL pointer dereference
diff mbox

Message ID 20161220012514.9348-1-o-takashi@sakamocchi.jp
State New
Headers show

Commit Message

Takashi Sakamoto Dec. 20, 2016, 1:25 a.m. UTC
---

Hi,

> Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream")
> fixes a race-codition but it also introduces another really nasty data race
> regression which makes my usb sound card [1] completely useless, throwing
> the kernel into a panic if anything from userspace tries to start playback.
> 
> The problem is this: ep->data_subs is now set to NULL every time inside
> wait_clear_urbs(). ep->data_subs is initalized only in one place in
> start_endpoints(), then it is immediately wiped by the pre-existing call to
> wait_clear_urbs() inside snd_usb_endpoint_start().
> 
> To ilustrate, this is what happens in the non-irq part of the code:
> 
> Step 1 (inside start_endpoints): ep->data_subs = subs;
> Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep);
> Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep);
> Step 4 (inside wait_clear_urbs): ep->data_subs = NULL;
> 
> Here's a simplified call stack for the IRQ part (full one at the end):
> 
> (NULL dereference, param "subs" is passed NULL value of ep->data_subs)
> retire_playback_urb
> retire_outbound_urb
> snd_complete_urb
> (...)
> xhci_irq
> 
> Looking at the git log there seems to be quite a lot of history in this
> part of the codebase, dating back to 2012 or earlier. My fix is based on
> 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and
> e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture
> stream") with a few modifications of my own. My idea is to do the
> ep->data_subs wiping before endpoint initialization in a manner similar
> to these older commits by using stop_endpoints() in snd_usb_pcm_prepare()
> and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to
> not trigger the recently fixed stream stopping race again.
> 
> Full call stack:
> 
> [  131.093240] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> [  131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.095046] PGD 0
> [  131.095047]
> [  131.096509] Oops: 0000 [#1] PREEMPT SMP
> [  131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath kvm_intel mac80211 kvm
>  input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel pcbc aesni_intel libps2 cfg8
> 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core lpc_ich tpm
> snd_timer evdev shpchp i2c_i801 sch_fq_codel
> [  131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted 4.9.0-gd824cdc58ba0 #10
> [  131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 07/31/2015
> [  131.109592] task: ffff880154a70000 task.stack: ffffc90000f48000
> [  131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.113899] RSP: 0018:ffffc90000f4bc10 EFLAGS: 00010082
> [  131.116080] RAX: ffffffffa04cabe0 RBX: 0000000000000000 RCX: 0000000000000000
> [  131.118284] RDX: 0000000000000000 RSI: ffff8801435a4c00 RDI: 0000000000000000
> [  131.120505] RBP: ffffc90000f4bc40 R08: 0000000000000001 R09: 0000000000000001
> [  131.122807] R10: 0000000000000001 R11: ffffffff82f60d6d R12: ffff8801432a0238
> [  131.125265] R13: ffff8801435a4c00 R14: 0000000000000000 R15: ffff8801535c83b8
> [  131.127723] FS:  0000000000000000(0000) GS:ffff88015a000000(0000) knlGS:0000000000000000
> [  131.130217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  131.132568] CR2: 0000000000000010 CR3: 0000000001e0b000 CR4: 00000000001406e0
> [  131.135007] Call Trace:
> [  131.137329]  snd_complete_urb+0x190/0x2b0 [snd_usb_audio]
> [  131.139526]  __usb_hcd_giveback_urb+0x8e/0x160
> [  131.141948]  usb_hcd_giveback_urb+0x43/0x110
> [  131.144104]  xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd]
> [  131.146521]  finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd]
> [  131.148736]  xhci_irq+0x13a2/0x1ca0 [xhci_hcd]
> [  131.150972]  ? trace_hardirqs_on+0xd/0x10
> [  131.153209]  ? _raw_spin_unlock_irq+0x2c/0x50
> [  131.155390]  ? irq_thread+0xb5/0x1d0
> [  131.157771]  xhci_msi_irq+0x11/0x20 [xhci_hcd]
> [  131.159938]  irq_forced_thread_fn+0x2f/0x70
> [  131.162072]  ? irq_thread+0xb5/0x1d0
> [  131.164141]  irq_thread+0x149/0x1d0
> [  131.166132]  ? irq_finalize_oneshot.part.2+0xe0/0xe0
> [  131.168142]  ? wake_threads_waitq+0x30/0x30
> [  131.170149]  kthread+0x10f/0x150
> [  131.172144]  ? irq_thread_dtor+0xc0/0xc0
> [  131.174139]  ? kthread_create_on_node+0x40/0x40
> [  131.176116]  ret_from_fork+0x2a/0x40
> [  131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3 0f 1f 00 0f 1f 44 00 00 55 31 d2 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 ec 08 <48> 8b 47 10 48 8b 4f 70 48 c7 c7 88 7b 4d a0 4c 8b a0 78 01 00
> [  131.182382] RIP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] RSP: ffffc90000f4bc10
> [  131.184562] CR2: 0000000000000010
> 
> [1] 041e:3232 Creative Technology SoundBlaster X-FI HD
> [2] http://mailman.alsa-project.org/pipermail/alsa-devel/2016-December/115425.html

I can regenerate this bug with my EMU 0404 USB. My understanding of this bug is
quite similar to your perspective. This bug is quite critical. In my
understanding, we encounter this bug in all of cases in which snd-usb-audio is
applied for PCM playback substream.

Below workaround can also suppress this bug.

If initialization of 'struct snd_usb_endpoint.data_subs' can be done after all
of queued URBs are processed and corresponding complete interrupts are cought,
we can solve this critical bug.

> Signed-off-by: Ioan-Adrian Ratiu <adi@adirat.com>
> ---
>  sound/usb/endpoint.c | 11 ++---------
>  sound/usb/endpoint.h |  2 +-
>  sound/usb/pcm.c      | 13 ++++++-------
>  3 files changed, 9 insertions(+), 17 deletions(-)
> 
> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c
> index a2cdf3370afe..4465f324c2c2 100644
> --- a/sound/usb/endpoint.c
> +++ b/sound/usb/endpoint.c
> @@ -920,9 +920,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
>  /**
>   * snd_usb_endpoint_start: start an snd_usb_endpoint
>   *
> - * @ep:		the endpoint to start
> - * @can_sleep:	flag indicating whether the operation is executed in
> - * 		non-atomic context
> + * @ep: the endpoint to start
>   *
>   * A call to this function will increment the use count of the endpoint.
>   * In case it is not already running, the URBs for this endpoint will be
> @@ -932,7 +930,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
>   *
>   * Returns an error if the URB submission failed, 0 in all other cases.
>   */
> -int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep)
> +int snd_usb_endpoint_start(struct snd_usb_endpoint *ep)
>  {
>  	int err;
>  	unsigned int i;
> @@ -944,11 +942,6 @@ int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep)
>  	if (++ep->use_count != 1)
>  		return 0;
>  
> -	/* just to be sure */
> -	deactivate_urbs(ep, false);
> -	if (can_sleep)
> -		wait_clear_urbs(ep);
> -
>  	ep->active_mask = 0;
>  	ep->unlink_mask = 0;
>  	ep->phase = 0;
> diff --git a/sound/usb/endpoint.h b/sound/usb/endpoint.h
> index 6428392d8f62..584f295d7c77 100644
> --- a/sound/usb/endpoint.h
> +++ b/sound/usb/endpoint.h
> @@ -18,7 +18,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
>  				struct audioformat *fmt,
>  				struct snd_usb_endpoint *sync_ep);
>  
> -int  snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep);
> +int  snd_usb_endpoint_start(struct snd_usb_endpoint *ep);
>  void snd_usb_endpoint_stop(struct snd_usb_endpoint *ep);
>  void snd_usb_endpoint_sync_pending_stop(struct snd_usb_endpoint *ep);
>  int  snd_usb_endpoint_activate(struct snd_usb_endpoint *ep);
> diff --git a/sound/usb/pcm.c b/sound/usb/pcm.c
> index 34c6d4f2c0b6..db26f767f851 100644
> --- a/sound/usb/pcm.c
> +++ b/sound/usb/pcm.c
> @@ -218,7 +218,7 @@ int snd_usb_init_pitch(struct snd_usb_audio *chip, int iface,
>  	}
>  }
>  
> -static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
> +static int start_endpoints(struct snd_usb_substream *subs)
>  {
>  	int err;
>  
> @@ -231,7 +231,7 @@ static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
>  		dev_dbg(&subs->dev->dev, "Starting data EP @%p\n", ep);
>  
>  		ep->data_subs = subs;
> -		err = snd_usb_endpoint_start(ep, can_sleep);
> +		err = snd_usb_endpoint_start(ep);
>  		if (err < 0) {
>  			clear_bit(SUBSTREAM_FLAG_DATA_EP_STARTED, &subs->flags);
>  			return err;
> @@ -260,7 +260,7 @@ static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
>  		dev_dbg(&subs->dev->dev, "Starting sync EP @%p\n", ep);
>  
>  		ep->sync_slave = subs->data_endpoint;
> -		err = snd_usb_endpoint_start(ep, can_sleep);
> +		err = snd_usb_endpoint_start(ep);
>  		if (err < 0) {
>  			clear_bit(SUBSTREAM_FLAG_SYNC_EP_STARTED, &subs->flags);
>  			return err;
> @@ -809,8 +809,7 @@ static int snd_usb_pcm_prepare(struct snd_pcm_substream *substream)
>  		goto unlock;
>  	}
>  
> -	snd_usb_endpoint_sync_pending_stop(subs->sync_endpoint);
> -	snd_usb_endpoint_sync_pending_stop(subs->data_endpoint);
> +	stop_endpoints(subs, true);
>  
>  	ret = set_format(subs, subs->cur_audiofmt);
>  	if (ret < 0)
> @@ -850,7 +849,7 @@ static int snd_usb_pcm_prepare(struct snd_pcm_substream *substream)
>  	/* for playback, submit the URBs now; otherwise, the first hwptr_done
>  	 * updates for all URBs would happen at the same time when starting */
>  	if (subs->direction == SNDRV_PCM_STREAM_PLAYBACK)
> -		ret = start_endpoints(subs, true);
> +		return start_endpoints(subs);
>  
>   unlock:
>  	snd_usb_unlock_shutdown(subs->stream->chip);
> @@ -1666,7 +1665,7 @@ static int snd_usb_substream_capture_trigger(struct snd_pcm_substream *substream
>  
>  	switch (cmd) {
>  	case SNDRV_PCM_TRIGGER_START:
> -		err = start_endpoints(subs, false);
> +		err = start_endpoints(subs);
>  		if (err < 0)
>  			return err;

This patch works better, but not the best. It's a bit intrusive.

What we need to synchronize process contexts to interrupt context till all of
queued URBs are handled. This seems to be a main purpose of wait_clear_bits,
while current implementation of snd_complete_urb()/wait_clear_bits() is not
enough.

Thanks for your reporting.

Takashi Sakamoto

Comments

Takashi Iwai Dec. 21, 2016, 10:17 a.m. UTC | #1
On Tue, 20 Dec 2016 14:04:56 +0100,
Takashi Sakamoto wrote:
> 
> Hi,
> 
> On Dec 20 2016 20:47, Ioan-Adrian Ratiu wrote:
> > On Tue, 20 Dec 2016, Takashi Sakamoto <o-takashi@sakamocchi.jp> wrote:
> >> ---
> >>
> >> Hi,
> >>
> >>> Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream")
> >>> fixes a race-codition but it also introduces another really nasty data race
> >>> regression which makes my usb sound card [1] completely useless, throwing
> >>> the kernel into a panic if anything from userspace tries to start playback.
> >>>
> >>> The problem is this: ep->data_subs is now set to NULL every time inside
> >>> wait_clear_urbs(). ep->data_subs is initalized only in one place in
> >>> start_endpoints(), then it is immediately wiped by the pre-existing call to
> >>> wait_clear_urbs() inside snd_usb_endpoint_start().
> >>>
> >>> To ilustrate, this is what happens in the non-irq part of the code:
> >>>
> >>> Step 1 (inside start_endpoints): ep->data_subs = subs;
> >>> Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep);
> >>> Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep);
> >>> Step 4 (inside wait_clear_urbs): ep->data_subs = NULL;
> >>>
> >>> Here's a simplified call stack for the IRQ part (full one at the end):
> >>>
> >>> (NULL dereference, param "subs" is passed NULL value of ep->data_subs)
> >>> retire_playback_urb
> >>> retire_outbound_urb
> >>> snd_complete_urb
> >>> (...)
> >>> xhci_irq
> >>>
> >>> Looking at the git log there seems to be quite a lot of history in this
> >>> part of the codebase, dating back to 2012 or earlier. My fix is based on
> >>> 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and
> >>> e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture
> >>> stream") with a few modifications of my own. My idea is to do the
> >>> ep->data_subs wiping before endpoint initialization in a manner similar
> >>> to these older commits by using stop_endpoints() in snd_usb_pcm_prepare()
> >>> and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to
> >>> not trigger the recently fixed stream stopping race again.
> >>>
> >>> Full call stack:
> >>>
> >>> [  131.093240] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> >>> [  131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> >>> [  131.095046] PGD 0
> >>> [  131.095047]
> >>> [  131.096509] Oops: 0000 [#1] PREEMPT SMP
> >>> [  131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath kvm_intel mac80211 kvm
> >>>  input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel pcbc aesni_intel libps2 cfg8
> >>> 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core lpc_ich tpm
> >>> snd_timer evdev shpchp i2c_i801 sch_fq_codel
> >>> [  131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted 4.9.0-gd824cdc58ba0 #10
> >>> [  131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 07/31/2015
> >>> [  131.109592] task: ffff880154a70000 task.stack: ffffc90000f48000
> >>> [  131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> >>> [  131.113899] RSP: 0018:ffffc90000f4bc10 EFLAGS: 00010082
> >>> [  131.116080] RAX: ffffffffa04cabe0 RBX: 0000000000000000 RCX: 0000000000000000
> >>> [  131.118284] RDX: 0000000000000000 RSI: ffff8801435a4c00 RDI: 0000000000000000
> >>> [  131.120505] RBP: ffffc90000f4bc40 R08: 0000000000000001 R09: 0000000000000001
> >>> [  131.122807] R10: 0000000000000001 R11: ffffffff82f60d6d R12: ffff8801432a0238
> >>> [  131.125265] R13: ffff8801435a4c00 R14: 0000000000000000 R15: ffff8801535c83b8
> >>> [  131.127723] FS:  0000000000000000(0000) GS:ffff88015a000000(0000) knlGS:0000000000000000
> >>> [  131.130217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [  131.132568] CR2: 0000000000000010 CR3: 0000000001e0b000 CR4: 00000000001406e0
> >>> [  131.135007] Call Trace:
> >>> [  131.137329]  snd_complete_urb+0x190/0x2b0 [snd_usb_audio]
> >>> [  131.139526]  __usb_hcd_giveback_urb+0x8e/0x160
> >>> [  131.141948]  usb_hcd_giveback_urb+0x43/0x110
> >>> [  131.144104]  xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd]
> >>> [  131.146521]  finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd]
> >>> [  131.148736]  xhci_irq+0x13a2/0x1ca0 [xhci_hcd]
> >>> [  131.150972]  ? trace_hardirqs_on+0xd/0x10
> >>> [  131.153209]  ? _raw_spin_unlock_irq+0x2c/0x50
> >>> [  131.155390]  ? irq_thread+0xb5/0x1d0
> >>> [  131.157771]  xhci_msi_irq+0x11/0x20 [xhci_hcd]
> >>> [  131.159938]  irq_forced_thread_fn+0x2f/0x70
> >>> [  131.162072]  ? irq_thread+0xb5/0x1d0
> >>> [  131.164141]  irq_thread+0x149/0x1d0
> >>> [  131.166132]  ? irq_finalize_oneshot.part.2+0xe0/0xe0
> >>> [  131.168142]  ? wake_threads_waitq+0x30/0x30
> >>> [  131.170149]  kthread+0x10f/0x150
> >>> [  131.172144]  ? irq_thread_dtor+0xc0/0xc0
> >>> [  131.174139]  ? kthread_create_on_node+0x40/0x40
> >>> [  131.176116]  ret_from_fork+0x2a/0x40
> >>> [  131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3 0f 1f 00 0f 1f 44 00 00 55 31 d2 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 ec 08 <48> 8b 47 10 48 8b 4f 70 48 c7 c7 88 7b 4d a0 4c 8b a0 78 01 00
> >>> [  131.182382] RIP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] RSP: ffffc90000f4bc10
> >>> [  131.184562] CR2: 0000000000000010
> >>>
> >>> [1] 041e:3232 Creative Technology SoundBlaster X-FI HD
> >>> [2] http://mailman.alsa-project.org/pipermail/alsa-devel/2016-December/115425.html
> >>
> >> I can regenerate this bug with my EMU 0404 USB. My understanding of this bug is
> >> quite similar to your perspective. This bug is quite critical. In my
> >> understanding, we encounter this bug in all of cases in which snd-usb-audio is
> >> applied for PCM playback substream.
> >>
> >> Below workaround can also suppress this bug.
> >>
> >> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c
> >> index a2cdf33..9b34f76 100644
> >> --- a/sound/usb/endpoint.c
> >> +++ b/sound/usb/endpoint.c
> >> @@ -537,11 +537,14 @@ static int wait_clear_urbs(struct snd_usb_endpoint *ep)
> >>                         alive, ep->ep_num);
> >>         clear_bit(EP_FLAG_STOPPING, &ep->flags);
> >>
> >> -       ep->data_subs = NULL;
> >>         ep->sync_slave = NULL;
> >>         ep->retire_data_urb = NULL;
> >>         ep->prepare_data_urb = NULL;
> >>
> >> +       msleep(200);
> >> +
> >> +       ep->data_subs = NULL;
> >> +
> >>         return 0;
> >>  }
> >>
> >> If initialization of 'struct snd_usb_endpoint.data_subs' can be done after all
> >> of queued URBs are processed and corresponding complete interrupts are cought,
> >> we can solve this critical bug.
> >>
> >>> Signed-off-by: Ioan-Adrian Ratiu <adi@adirat.com>
> >>> ---
> >>>  sound/usb/endpoint.c | 11 ++---------
> >>>  sound/usb/endpoint.h |  2 +-
> >>>  sound/usb/pcm.c      | 13 ++++++-------
> >>>  3 files changed, 9 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c
> >>> index a2cdf3370afe..4465f324c2c2 100644
> >>> --- a/sound/usb/endpoint.c
> >>> +++ b/sound/usb/endpoint.c
> >>> @@ -920,9 +920,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
> >>>  /**
> >>>   * snd_usb_endpoint_start: start an snd_usb_endpoint
> >>>   *
> >>> - * @ep:		the endpoint to start
> >>> - * @can_sleep:	flag indicating whether the operation is executed in
> >>> - * 		non-atomic context
> >>> + * @ep: the endpoint to start
> >>>   *
> >>>   * A call to this function will increment the use count of the endpoint.
> >>>   * In case it is not already running, the URBs for this endpoint will be
> >>> @@ -932,7 +930,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
> >>>   *
> >>>   * Returns an error if the URB submission failed, 0 in all other cases.
> >>>   */
> >>> -int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep)
> >>> +int snd_usb_endpoint_start(struct snd_usb_endpoint *ep)
> >>>  {
> >>>  	int err;
> >>>  	unsigned int i;
> >>> @@ -944,11 +942,6 @@ int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep)
> >>>  	if (++ep->use_count != 1)
> >>>  		return 0;
> >>>
> >>> -	/* just to be sure */
> >>> -	deactivate_urbs(ep, false);
> >>> -	if (can_sleep)
> >>> -		wait_clear_urbs(ep);
> >>> -
> >>>  	ep->active_mask = 0;
> >>>  	ep->unlink_mask = 0;
> >>>  	ep->phase = 0;
> >>> diff --git a/sound/usb/endpoint.h b/sound/usb/endpoint.h
> >>> index 6428392d8f62..584f295d7c77 100644
> >>> --- a/sound/usb/endpoint.h
> >>> +++ b/sound/usb/endpoint.h
> >>> @@ -18,7 +18,7 @@ int snd_usb_endpoint_set_params(struct snd_usb_endpoint *ep,
> >>>  				struct audioformat *fmt,
> >>>  				struct snd_usb_endpoint *sync_ep);
> >>>
> >>> -int  snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep);
> >>> +int  snd_usb_endpoint_start(struct snd_usb_endpoint *ep);
> >>>  void snd_usb_endpoint_stop(struct snd_usb_endpoint *ep);
> >>>  void snd_usb_endpoint_sync_pending_stop(struct snd_usb_endpoint *ep);
> >>>  int  snd_usb_endpoint_activate(struct snd_usb_endpoint *ep);
> >>> diff --git a/sound/usb/pcm.c b/sound/usb/pcm.c
> >>> index 34c6d4f2c0b6..db26f767f851 100644
> >>> --- a/sound/usb/pcm.c
> >>> +++ b/sound/usb/pcm.c
> >>> @@ -218,7 +218,7 @@ int snd_usb_init_pitch(struct snd_usb_audio *chip, int iface,
> >>>  	}
> >>>  }
> >>>
> >>> -static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
> >>> +static int start_endpoints(struct snd_usb_substream *subs)
> >>>  {
> >>>  	int err;
> >>>
> >>> @@ -231,7 +231,7 @@ static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
> >>>  		dev_dbg(&subs->dev->dev, "Starting data EP @%p\n", ep);
> >>>
> >>>  		ep->data_subs = subs;
> >>> -		err = snd_usb_endpoint_start(ep, can_sleep);
> >>> +		err = snd_usb_endpoint_start(ep);
> >>>  		if (err < 0) {
> >>>  			clear_bit(SUBSTREAM_FLAG_DATA_EP_STARTED, &subs->flags);
> >>>  			return err;
> >>> @@ -260,7 +260,7 @@ static int start_endpoints(struct snd_usb_substream *subs, bool can_sleep)
> >>>  		dev_dbg(&subs->dev->dev, "Starting sync EP @%p\n", ep);
> >>>
> >>>  		ep->sync_slave = subs->data_endpoint;
> >>> -		err = snd_usb_endpoint_start(ep, can_sleep);
> >>> +		err = snd_usb_endpoint_start(ep);
> >>>  		if (err < 0) {
> >>>  			clear_bit(SUBSTREAM_FLAG_SYNC_EP_STARTED, &subs->flags);
> >>>  			return err;
> >>> @@ -809,8 +809,7 @@ static int snd_usb_pcm_prepare(struct snd_pcm_substream *substream)
> >>>  		goto unlock;
> >>>  	}
> >>>
> >>> -	snd_usb_endpoint_sync_pending_stop(subs->sync_endpoint);
> >>> -	snd_usb_endpoint_sync_pending_stop(subs->data_endpoint);
> >>> +	stop_endpoints(subs, true);
> >>>
> >>>  	ret = set_format(subs, subs->cur_audiofmt);
> >>>  	if (ret < 0)
> >>> @@ -850,7 +849,7 @@ static int snd_usb_pcm_prepare(struct snd_pcm_substream *substream)
> >>>  	/* for playback, submit the URBs now; otherwise, the first hwptr_done
> >>>  	 * updates for all URBs would happen at the same time when starting */
> >>>  	if (subs->direction == SNDRV_PCM_STREAM_PLAYBACK)
> >>> -		ret = start_endpoints(subs, true);
> >>> +		return start_endpoints(subs);
> >>>
> >>>   unlock:
> >>>  	snd_usb_unlock_shutdown(subs->stream->chip);
> >>> @@ -1666,7 +1665,7 @@ static int snd_usb_substream_capture_trigger(struct snd_pcm_substream *substream
> >>>
> >>>  	switch (cmd) {
> >>>  	case SNDRV_PCM_TRIGGER_START:
> >>> -		err = start_endpoints(subs, false);
> >>> +		err = start_endpoints(subs);
> >>>  		if (err < 0)
> >>>  			return err;
> >>
> >> This patch works better, but not the best. It's a bit intrusive.
> >
> > I disagree. Being intrusive is not a good reason to reject a patch,
> > especially if the alternative is to insert random magic number delays
> > in hopes of synchronizing process and irq contexts to avoid critical
> > errors.
> 
> I apologize if you got some negative opinions in my previous message.
> 
> My intention to show the ugly workaround is not to merge it instead of
> yours, but clear what causes this bug.
> 
> For example, we can also apply below patch for this bug, to solve it
> roughly.
> 
> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c
> index a2cdf33..6ec4eb1 100644
> --- a/sound/usb/endpoint.c
> +++ b/sound/usb/endpoint.c
> @@ -162,7 +162,7 @@ int snd_usb_endpoint_next_packet_size(struct
> snd_usb_endpoint *ep)
>  static void retire_outbound_urb(struct snd_usb_endpoint *ep,
>                                 struct snd_urb_ctx *urb_ctx)
>  {
> -       if (ep->retire_data_urb)
> +       if (ep->retire_data_urb && ep->data_subs)
>                 ep->retire_data_urb(ep->data_subs, urb_ctx->urb);
>  }
> 
> But this is just a workaround and can't fix it thoroughly. The main
> issue is synchronization between interrupt/process contexts.
> 
> > Please take the time to fully analyze my patch and let's have a
> > discussion based on it, not reject it outright and replace it with
> > a quick and dirty delay hack.
> 
> OK. I'll deliberately check it again so that I have no overlooks. I
> with this thread also catch the other developers enough helpful to
> you. (I just eventually caught your patch in LKML and not developer
> for this category of devices.)

Sorry for the late reply, as I've been (still) off and had bad net
connections.

About your fix: Sakamoto-san is right, it's no good way to fix this
kind or problem.  The easiest option right now is just to revert my
previous fix, as it obviously introduces another regression.  The
correct fix will be taken after that.

I'm going to prepare a revert patch and ask Linus to take it before
rc1.


thanks,

Takashi



> 
> > Ionel
> >
> >>
> >> What we need to synchronize process contexts to interrupt context till all of
> >> queued URBs are handled. This seems to be a main purpose of wait_clear_bits,
> >> while current implementation of snd_complete_urb()/wait_clear_bits() is not
> >> enough.
> >>
> >> Thanks for your reporting.
> >>
> >> Takashi Sakamoto
> >> --
> >> 2.9.3
> 
> 
> Thanks
> 
> Takashi Sakamoto
>
Takashi Iwai Dec. 30, 2016, 2:39 p.m. UTC | #2
On Wed, 21 Dec 2016 11:38:54 +0100,
Ioan-Adrian Ratiu wrote:
> 
> >> > Please take the time to fully analyze my patch and let's have a
> >> > discussion based on it, not reject it outright and replace it with
> >> > a quick and dirty delay hack.
> >> 
> >> OK. I'll deliberately check it again so that I have no overlooks. I
> >> with this thread also catch the other developers enough helpful to
> >> you. (I just eventually caught your patch in LKML and not developer
> >> for this category of devices.)
> >
> > Sorry for the late reply, as I've been (still) off and had bad net
> > connections.
> >
> > About your fix: Sakamoto-san is right, it's no good way to fix this
> > kind or problem.  The easiest option right now is just to revert my
> > previous fix, as it obviously introduces another regression.  The
> > correct fix will be taken after that.
> >
> > I'm going to prepare a revert patch and ask Linus to take it before
> > rc1.
> 
> I agree with reverting the initial commit decision because my problem
> disappears with it.
> 
> But can you please state a reason for why my patch is "no good way to
> fix"? Being too intrusive is not a good reason.

"Being too intrusive" is the exact reason why it's not good as a
"regression fix" like this case.  The logic you've implemented in the
patch itself looks good (although the code introduces a bug, the
unbalance of snd_usb_*lock_shutdown()).  The only point I couldn't
take it is that it's rather a fundamental change, not a quick fix for
a regression.

What's the worst case scenario in a regression fix?  It's when a fix
introduces yet another regression.  It'd be much worse if the new
regression is deeper.  The complicated logical change has a potential
risk of such.  Thus an intrusive change is not always good as a
"regression fix".

In general, if the change were trivial, it's obviously OK to take as a
regression fix -- where the logic is also trivial in most cases, too.
But when the fix becomes complex, we need to rethink.  Especially when
the original buggy commit is small, reverting it is often better as a
clear cut.

Think in that way: you're addressing a deeper problem that was
revealed accidentally by my previous bad fix.  Writing the change as
if it were merely a regression fix gives the wrong understanding to
readers :)

That said, I'd appreciate if you respin the fix again with a
combination of my previous fix and brush up the code a bit as a whole,
and document it not as a regression fix but as a complete fix of the
existing races.  I can write it in my side quickly, but it'd be almost
in the same form as you posted, I suppose.


thanks,

Takashi
Adi Ratiu Dec. 30, 2016, 9 p.m. UTC | #3
On Fri, 30 Dec 2016, Takashi Iwai <tiwai@suse.de> wrote:
> On Wed, 21 Dec 2016 11:38:54 +0100,
> Ioan-Adrian Ratiu wrote:
>> 
>> >> > Please take the time to fully analyze my patch and let's have a
>> >> > discussion based on it, not reject it outright and replace it with
>> >> > a quick and dirty delay hack.
>> >> 
>> >> OK. I'll deliberately check it again so that I have no overlooks. I
>> >> with this thread also catch the other developers enough helpful to
>> >> you. (I just eventually caught your patch in LKML and not developer
>> >> for this category of devices.)
>> >
>> > Sorry for the late reply, as I've been (still) off and had bad net
>> > connections.
>> >
>> > About your fix: Sakamoto-san is right, it's no good way to fix this
>> > kind or problem.  The easiest option right now is just to revert my
>> > previous fix, as it obviously introduces another regression.  The
>> > correct fix will be taken after that.
>> >
>> > I'm going to prepare a revert patch and ask Linus to take it before
>> > rc1.
>> 
>> I agree with reverting the initial commit decision because my problem
>> disappears with it.
>> 
>> But can you please state a reason for why my patch is "no good way to
>> fix"? Being too intrusive is not a good reason.
>
> "Being too intrusive" is the exact reason why it's not good as a
> "regression fix" like this case.  The logic you've implemented in the
> patch itself looks good (although the code introduces a bug, the
> unbalance of snd_usb_*lock_shutdown()).  The only point I couldn't
> take it is that it's rather a fundamental change, not a quick fix for
> a regression.
>
> What's the worst case scenario in a regression fix?  It's when a fix
> introduces yet another regression.  It'd be much worse if the new
> regression is deeper.  The complicated logical change has a potential
> risk of such.  Thus an intrusive change is not always good as a
> "regression fix".
>
> In general, if the change were trivial, it's obviously OK to take as a
> regression fix -- where the logic is also trivial in most cases, too.
> But when the fix becomes complex, we need to rethink.  Especially when
> the original buggy commit is small, reverting it is often better as a
> clear cut.
>
> Think in that way: you're addressing a deeper problem that was
> revealed accidentally by my previous bad fix.  Writing the change as
> if it were merely a regression fix gives the wrong understanding to
> readers :)

Yes, this makes sense. Thank you.

>
> That said, I'd appreciate if you respin the fix again with a
> combination of my previous fix and brush up the code a bit as a whole,
> and document it not as a regression fix but as a complete fix of the
> existing races.  I can write it in my side quickly, but it'd be almost
> in the same form as you posted, I suppose.

I'll find some time to do a rebase either tomorrow (31 Dec) or until
2 Jan at the latest.

Ionel

>
>
> thanks,
>
> Takashi

Patch
diff mbox

diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c
index a2cdf33..9b34f76 100644
--- a/sound/usb/endpoint.c
+++ b/sound/usb/endpoint.c
@@ -537,11 +537,14 @@  static int wait_clear_urbs(struct snd_usb_endpoint *ep)
                        alive, ep->ep_num);
        clear_bit(EP_FLAG_STOPPING, &ep->flags);
 
-       ep->data_subs = NULL;
        ep->sync_slave = NULL;
        ep->retire_data_urb = NULL;
        ep->prepare_data_urb = NULL;
 
+       msleep(200);
+
+       ep->data_subs = NULL;
+
        return 0;
 }