diff mbox series

[v3,06/19] ASoC: soc-core: add soc_unbind_dai_link()

Message ID 877e4e3jni.wl-kuninori.morimoto.gx@renesas.com (mailing list archive)
State New, archived
Headers show
Series ASoC: soc-core cleanup - step 4 | expand

Commit Message

Kuninori Morimoto Nov. 5, 2019, 6:46 a.m. UTC
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>

It is easy to read code if it is cleanly using paired function/naming,
like start <-> stop, register <-> unregister, etc, etc.
But, current ALSA SoC code is very random, unbalance, not paired, etc.
It is easy to create bug at the such code, and it will be difficult to
debug.

ALSA SoC has soc_bind_dai_link(), but its paired soc_unbind_dai_link()
is not implemented.
More confusable is that soc_remove_pcm_runtimes() which should be
soc_unbind_dai_link() is implemented without synchronised
to soc_bind_dai_link().

This patch cleanup this unbalance.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
---
v2 -> v3
	- no change
	- add Reviewed-by

 sound/soc/soc-core.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

Comments

Pierre-Louis Bossart Nov. 12, 2019, 5:21 a.m. UTC | #1
On 11/5/19 12:46 AM, Kuninori Morimoto wrote:
> 
> From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
> 
> It is easy to read code if it is cleanly using paired function/naming,
> like start <-> stop, register <-> unregister, etc, etc.
> But, current ALSA SoC code is very random, unbalance, not paired, etc.
> It is easy to create bug at the such code, and it will be difficult to
> debug.
> 
> ALSA SoC has soc_bind_dai_link(), but its paired soc_unbind_dai_link()
> is not implemented.
> More confusable is that soc_remove_pcm_runtimes() which should be
> soc_unbind_dai_link() is implemented without synchronised
> to soc_bind_dai_link().
> 
> This patch cleanup this unbalance.
> 
> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

Morimoto-san, this patch seems to introduce a regression in our SOF 
module removal tests. Without a couple of load/unload modules cycles, we 
hit a kernel oops while freeing the card.

see https://github.com/thesofproject/linux/issues/1466 for some logs.

This issue did not happen with our November 6 rebase on Mark's tree, and 
showed up today. I couldn't really bisect the whole tree due to other 
issues, so manually applied your patches on top of this 11/06 tree and 
bisected from there.

I will need to confirm this finding (it's quite late for me) but looking 
at the code I wonder if the move of pcm_runtime deletion is correct?


> ---
> v2 -> v3
> 	- no change
> 	- add Reviewed-by
> 
>   sound/soc/soc-core.c | 22 ++++++++++++----------
>   1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index e8ff6f2..1e8dd19 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -470,14 +470,6 @@ static struct snd_soc_pcm_runtime *soc_new_pcm_runtime(
>   	return NULL;
>   }
>   
> -static void soc_remove_pcm_runtimes(struct snd_soc_card *card)
> -{
> -	struct snd_soc_pcm_runtime *rtd, *_rtd;
> -
> -	for_each_card_rtds_safe(card, rtd, _rtd)
> -		soc_free_pcm_runtime(rtd);
> -}
> -
>   struct snd_soc_pcm_runtime *snd_soc_get_pcm_runtime(struct snd_soc_card *card,
>   		const char *dai_link)
>   {
> @@ -1037,6 +1029,16 @@ static int soc_dai_link_sanity_check(struct snd_soc_card *card,
>   	return 0;
>   }
>   
> +static void soc_unbind_dai_link(struct snd_soc_card *card,
> +				struct snd_soc_dai_link *dai_link)
> +{
> +	struct snd_soc_pcm_runtime *rtd;
> +
> +	rtd = snd_soc_get_pcm_runtime(card, dai_link->name);
> +	if (rtd)
> +		soc_free_pcm_runtime(rtd);
> +}
> +
>   static int soc_bind_dai_link(struct snd_soc_card *card,
>   	struct snd_soc_dai_link *dai_link)
>   {
> @@ -1466,6 +1468,8 @@ void snd_soc_remove_dai_link(struct snd_soc_card *card,
>   		card->remove_dai_link(card, dai_link);
>   
>   	list_del(&dai_link->list);
> +
> +	soc_unbind_dai_link(card, dai_link);
>   }
>   EXPORT_SYMBOL_GPL(snd_soc_remove_dai_link);
>   
> @@ -1974,8 +1978,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
>   	for_each_card_links_safe(card, link, _link)
>   		snd_soc_remove_dai_link(card, link);
>   
> -	soc_remove_pcm_runtimes(card);
> -
>   	/* remove auxiliary devices */
>   	soc_remove_aux_devices(card);
>   	soc_unbind_aux_dev(card);
>
Kuninori Morimoto Nov. 12, 2019, 5:37 a.m. UTC | #2
Hi Pierre-Louis

Thank you for reporting

> > From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
> > 
> > It is easy to read code if it is cleanly using paired function/naming,
> > like start <-> stop, register <-> unregister, etc, etc.
> > But, current ALSA SoC code is very random, unbalance, not paired, etc.
> > It is easy to create bug at the such code, and it will be difficult to
> > debug.
> > 
> > ALSA SoC has soc_bind_dai_link(), but its paired soc_unbind_dai_link()
> > is not implemented.
> > More confusable is that soc_remove_pcm_runtimes() which should be
> > soc_unbind_dai_link() is implemented without synchronised
> > to soc_bind_dai_link().
> > 
> > This patch cleanup this unbalance.
> > 
> > Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
> > Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> 
> Morimoto-san, this patch seems to introduce a regression in our SOF
> module removal tests. Without a couple of load/unload modules cycles,
> we hit a kernel oops while freeing the card.
> 
> see https://github.com/thesofproject/linux/issues/1466 for some logs.
> 
> This issue did not happen with our November 6 rebase on Mark's tree,
> and showed up today. I couldn't really bisect the whole tree due to
> other issues, so manually applied your patches on top of this 11/06
> tree and bisected from there.
> 
> I will need to confirm this finding (it's quite late for me) but
> looking at the code I wonder if the move of pcm_runtime deletion is
> correct?

Hmm...
It is just merged verbose 2 functions into 1,
nothing changed from logic point of view
if my understanding was correct.

And now, I tried unbind test for cpu/codec/card on my side,
but nothing happen...

I'm using this commit

	bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b
	("ASoC: soc-core: add soc_unbind_dai_link()")


Thank you for your help !!
Best regards
---
Kuninori Morimoto
Kuninori Morimoto Nov. 12, 2019, 5:50 a.m. UTC | #3
Hi Pierre-Louis, again

> > Morimoto-san, this patch seems to introduce a regression in our SOF
> > module removal tests. Without a couple of load/unload modules cycles,
> > we hit a kernel oops while freeing the card.
> > 
> > see https://github.com/thesofproject/linux/issues/1466 for some logs.
> > 
> > This issue did not happen with our November 6 rebase on Mark's tree,
> > and showed up today. I couldn't really bisect the whole tree due to
> > other issues, so manually applied your patches on top of this 11/06
> > tree and bisected from there.
> > 
> > I will need to confirm this finding (it's quite late for me) but
> > looking at the code I wonder if the move of pcm_runtime deletion is
> > correct?
> 
> Hmm...
> It is just merged verbose 2 functions into 1,
> nothing changed from logic point of view
> if my understanding was correct.
> 
> And now, I tried unbind test for cpu/codec/card on my side,
> but nothing happen...
> 
> I'm using this commit
> 
> 	bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b
> 	("ASoC: soc-core: add soc_unbind_dai_link()")

Does it happen from soc-topology.c :: remove_link ?

Thank you for your help !!
Best regards
---
Kuninori Morimoto
Kuninori Morimoto Nov. 12, 2019, 6:42 a.m. UTC | #4
Hi Pierre-Louis, again and again

> > I'm using this commit
> > 
> > 	bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b
> > 	("ASoC: soc-core: add soc_unbind_dai_link()")
> 
> Does it happen from soc-topology.c :: remove_link ?

I can't test, but can this patch solve your issue?

I guess topology related rtd free timing was changed by
this balanceup.

-----------
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 1e8dd19..af89aad 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -384,6 +384,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd)
 	if (!rtd)
 		return;
 
+	/* need to sync the delayed work before releasing resources */
+	flush_delayed_work(&rtd->delayed_work);
+
 	list_del(&rtd->list);
 
 	/*
diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
index 1c00119..9865a2d 100644
--- a/sound/soc/soc-pcm.c
+++ b/sound/soc/soc-pcm.c
@@ -2883,15 +2883,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream)
 	return ret;
 }
 
-static void soc_pcm_private_free(struct snd_pcm *pcm)
-{
-	struct snd_soc_pcm_runtime *rtd = pcm->private_data;
-
-	/* need to sync the delayed work before releasing resources */
-	flush_delayed_work(&rtd->delayed_work);
-	snd_soc_pcm_component_free(pcm);
-}
-
 /* create a new pcm */
 int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 {
@@ -3033,7 +3024,7 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 		return ret;
 	}
 
-	pcm->private_free = soc_pcm_private_free;
+	pcm->private_free = snd_soc_pcm_component_free;
 	pcm->no_device_suspend = true;
 out:
 	dev_info(rtd->card->dev, "%s <-> %s mapping ok\n",

 
Thank you for your help !!
Best regards
---
Kuninori Morimoto
Pierre-Louis Bossart Nov. 12, 2019, 5:11 p.m. UTC | #5
>> Does it happen from soc-topology.c :: remove_link ?

it seems to happen after the topology remove link, see the traces below

> 
> I can't test, but can this patch solve your issue?

No, the problem remains after applying your suggested fix.

I added a bunch of traces and it seems we have a nasty case of corrupted 
linked lists:

diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c
index 98ef0666add2..5b0139ebe8f3 100644
--- a/sound/soc/soc-component.c
+++ b/sound/soc/soc-component.c
@@ -518,11 +518,39 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm)

  void snd_soc_pcm_component_free(struct snd_pcm *pcm)
  {
-       struct snd_soc_pcm_runtime *rtd = pcm->private_data;
+       struct snd_soc_pcm_runtime *rtd;
         struct snd_soc_rtdcom_list *rtdcom;
         struct snd_soc_component *component;

-       for_each_rtd_components(rtd, rtdcom, component)
-               if (component->driver->pcm_destruct)
+       pr_err("plb: %s start\n", __func__);
+
+       if (!pcm)
+               pr_err("plb: %s PCM is NULL\n", __func__);
+
+       pr_err("plb: %s accessing private data\n", __func__);
+       rtd = pcm->private_data;
+       pr_err("plb: %s accessed private data\n", __func__);
+
+       if (!rtd)
+               pr_err("plb: %s RTD is NULL\n", __func__);
+
+       pr_err("plb: %s accessing components\n", __func__);
+       for_each_rtd_components(rtd, rtdcom, component) {
+               pr_err("plb: %s processing component\n", __func__);
+               if (!component)
+                       pr_err("plb: %s component is NULL\n", __func__);
+
+               if (!component->driver)
+                       pr_err("plb: %s component driver is NULL\n", 
__func__);
+
+               pr_err("plb: %s pcm_destruct checks\n", __func__);
+               if (component->driver->pcm_destruct) {
+                       pr_err("plb: %s pcm_destruct start\n", __func__);
                         component->driver->pcm_destruct(component, pcm);
+                       pr_err("plb: %s pcm_destruct done\n", __func__);
+               }
+               pr_err("plb: %s processing component done\n", __func__);
+       }
+
+       pr_err("plb: %s done\n", __func__);
  }

And the results show the for_each_rtd_components loop goes in the weeds.

    82.069990] sof-audio-pci 0000:00:1f.3: plb: remove_link start
[   82.069993] sof-audio-pci 0000:00:1f.3: plb: remove_link 2
[   82.069996] sof-audio-pci 0000:00:1f.3: plb: remove_link before 
snd_soc_remove_dai_link
[   82.069998] plb: snd_soc_remove_dai_link start
[   82.070016] plb: snd_soc_remove_dai_link done
[   82.070020] sof-audio-pci 0000:00:1f.3: plb: remove_link done
<removed DSP power down sequence>
[   82.179021] plb: snd_soc_pcm_component_free start
[   82.179023] plb: snd_soc_pcm_component_free accessing private data
[   82.179024] plb: snd_soc_pcm_component_free accessed private data
[   82.179025] plb: snd_soc_pcm_component_free accessing components
[   82.179025] plb: snd_soc_pcm_component_free processing component
[   82.179029] BUG: kernel NULL pointer dereference, address: 
0000000000000064
[   82.179030] #PF: supervisor read access in kernel mode
[   82.179031] #PF: error_code(0x0000) - not-present page
[   82.179032] PGD 0 P4D 0
[   82.179034] Oops: 0000 [#1] SMP NOPTI
[   82.179036] CPU: 3 PID: 768 Comm: pulseaudio Not tainted 
5.4.0-rc5-test+ #31
[   82.179036] Hardware name: Acer Swift SF314-55/MILLER_WL, BIOS V1.05 
10/03/2018
[   82.179042] RIP: 0010:snd_soc_pcm_component_free+0xc7/0x16a 
[snd_soc_core]
[   82.179043] Code: 43 08 48 c7 c6 f0 24 6e c0 4c 39 e0 0f 84 a9 00 00 
00 48 8b 2b 48 85 ed 0f 84 9d 00 00 00 48 c7 c7 00 51 6e c0 e8 d2 5d 5d 
f2 <48> 83 7d 60 00 75 13 48 c7 c6 f0 24 6e c0 48 c7 c7 20 51 6e c0 e8
[   82.179044] RSP: 0018:ffffa70180bf3d78 EFLAGS: 00010246
[   82.179046] RAX: 0000000000000034 RBX: ffffa00f7aaf3968 RCX: 
0000000000000006
[   82.179047] RDX: 0000000000000000 RSI: 0000000000000092 RDI: 
ffffa00fa5ad63d0
[   82.179048] RBP: 0000000000000004 R08: ffffa70180bf3c3d R09: 
0000000000001518
[   82.179049] R10: ffffa70180bf3c38 R11: ffffa70180bf3c3d R12: 
ffffa00fa1be4eb0
[   82.179050] R13: ffffa00fa27aa000 R14: dead000000000122 R15: 
dead000000000100
[   82.179052] FS:  00007f4e7e5ebc80(0000) GS:ffffa00fa5ac0000(0000) 
knlGS:0000000000000000
[   82.179054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.179055] CR2: 0000000000000064 CR3: 0000000253d68005 CR4: 
00000000003606e0
[   82.179056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[   82.179057] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[   82.179058] Call Trace:
[   82.179064]  snd_pcm_free+0x1a/0x50 [snd_pcm]


I have absolutely no idea what all these data structures are, just 
reporting this.

reverting "ASoC: soc-core: add soc_unbind_dai_link()" is the only 
work-around at this point. i've tested this module load/unload for hours 
without issues.

It's actually quite interesting since this snd_soc_pcm_component_free() 
calls a .pcm_destruct() callback that's not used by the SOF driver. It's 
only used on Intel platforms for the Skylake/SST driver, not sure why 
and if SOF is missing something.
Mark Brown Nov. 12, 2019, 7:03 p.m. UTC | #6
On Tue, Nov 12, 2019 at 11:11:32AM -0600, Pierre-Louis Bossart wrote:

> +       for_each_rtd_components(rtd, rtdcom, component) {
> +               pr_err("plb: %s processing component\n", __func__);
> +               if (!component)
> +                       pr_err("plb: %s component is NULL\n", __func__);

Could you perhaps add traces of which components are being accessed at
each stage?  We might want to go through and just add something like
that in the code anyway to help figure things out.
Pierre-Louis Bossart Nov. 12, 2019, 9:08 p.m. UTC | #7
>> +       for_each_rtd_components(rtd, rtdcom, component) {
>> +               pr_err("plb: %s processing component\n", __func__);
>> +               if (!component)
>> +                       pr_err("plb: %s component is NULL\n", __func__);
> 
> Could you perhaps add traces of which components are being accessed at
> each stage?  We might want to go through and just add something like
> that in the code anyway to help figure things out.

I tried to add more traces but couldn't triangulate on a clear issue, 
and the traces have an Heisenbug effect.

So I switched to higher-level code analysis: it turns out that 
soc_dai_link_remove() routine is called from both topology and on card 
cleanup.

The patch 06/19 in this series essentially forces the pcm_runtimes to be 
freed in both cases, so possibly twice for topology-managed dailinks - 
or using information that's been freed already.

I 'fixed' this by adding an additional parameter to avoid doing the pcm 
runtime free from the topology (as was done before), and the kernel oops 
goes away. My tests have been running for 45mn now, when without change 
I get a kernel oops in less than 10-20 cycles (but still more than 
apparently our CI tracks, something to improve).

I pushed the code on GitHub to check if there are any negative points 
reported by the Intel CI, should be complete shortly:
https://github.com/thesofproject/linux/pull/1469

I am not sure the suggested fix is correct, I don't fully get what the 
topology and card cleanups should do and how the work is split, if at all.
Kuninori Morimoto Nov. 13, 2019, 4:37 a.m. UTC | #8
Hi Pierre-Louis

Thank you for your report

> >> +       for_each_rtd_components(rtd, rtdcom, component) {
> >> +               pr_err("plb: %s processing component\n", __func__);
> >> +               if (!component)
> >> +                       pr_err("plb: %s component is NULL\n", __func__);
> > 
> > Could you perhaps add traces of which components are being accessed at
> > each stage?  We might want to go through and just add something like
> > that in the code anyway to help figure things out.
> 
> I tried to add more traces but couldn't triangulate on a clear issue,
> and the traces have an Heisenbug effect.
> 
> So I switched to higher-level code analysis: it turns out that
> soc_dai_link_remove() routine is called from both topology and on card
> cleanup.
> 
> The patch 06/19 in this series essentially forces the pcm_runtimes to
> be freed in both cases, so possibly twice for topology-managed
> dailinks - or using information that's been freed already.
> 
> I 'fixed' this by adding an additional parameter to avoid doing the
> pcm runtime free from the topology (as was done before), and the
> kernel oops goes away. My tests have been running for 45mn now, when
> without change I get a kernel oops in less than 10-20 cycles (but
> still more than apparently our CI tracks, something to improve).
> 
> I pushed the code on GitHub to check if there are any negative points
> reported by the Intel CI, should be complete shortly:
> https://github.com/thesofproject/linux/pull/1469
> 
> I am not sure the suggested fix is correct, I don't fully get what the
> topology and card cleanups should do and how the work is split, if at
> all.

BTW, I guess your kernel is appling this patch,
but, is it correct ?

	df95a16d2a9626dcfc3f2b3671c9b91fa076c997
	("ASoC: soc-core: fix RIP warning on card removal")

If so, I think I could understand the issue.
But, before explaining detail,
I want to confirm that my solution is correct or not first.

Can you please check attached patch ?
Then, please remove above RIP warning patch.
It is not clean patch, but OK to confirming, so far.
I think
	- no kernel Oops
	- no RIP warning

Thank you for your help !!
Best regards
---
Kuninori Morimoto
From 0d825237eea4baad0c489e1c43a58373f41a3632 Mon Sep 17 00:00:00 2001
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Date: Wed, 13 Nov 2019 11:58:18 +0900
Subject: [PATCH] topology die fixup patch candidate 1

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 include/sound/soc-component.h |  2 +-
 sound/soc/soc-component.c     |  5 ++---
 sound/soc/soc-core.c          | 14 ++++++++------
 sound/soc/soc-pcm.c           | 10 ----------
 4 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/include/sound/soc-component.h b/include/sound/soc-component.h
index 6aa3ecb..cf726a5 100644
--- a/include/sound/soc-component.h
+++ b/include/sound/soc-component.h
@@ -413,6 +413,6 @@ struct page *snd_soc_pcm_component_page(struct snd_pcm_substream *substream,
 int snd_soc_pcm_component_mmap(struct snd_pcm_substream *substream,
 			       struct vm_area_struct *vma);
 int snd_soc_pcm_component_new(struct snd_pcm *pcm);
-void snd_soc_pcm_component_free(struct snd_pcm *pcm);
+void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd);
 
 #endif /* __SOC_COMPONENT_H */
diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c
index 98ef066..195979f 100644
--- a/sound/soc/soc-component.c
+++ b/sound/soc/soc-component.c
@@ -516,13 +516,12 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm)
 	return 0;
 }
 
-void snd_soc_pcm_component_free(struct snd_pcm *pcm)
+void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd)
 {
-	struct snd_soc_pcm_runtime *rtd = pcm->private_data;
 	struct snd_soc_rtdcom_list *rtdcom;
 	struct snd_soc_component *component;
 
 	for_each_rtd_components(rtd, rtdcom, component)
 		if (component->driver->pcm_destruct)
-			component->driver->pcm_destruct(component, pcm);
+			component->driver->pcm_destruct(component, rtd->pcm);
 }
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 1e8dd19..8e7ff7d 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -386,6 +386,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd)
 
 	list_del(&rtd->list);
 
+	flush_delayed_work(&rtd->delayed_work);
+	snd_soc_pcm_component_free(rtd);
+
 	/*
 	 * we don't need to call kfree() for rtd->dev
 	 * see
@@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
 {
 	struct snd_soc_dai_link *link, *_link;
 
-	/* free the ALSA card at first; this syncs with pending operations */
-	if (card->snd_card) {
-		snd_card_free(card->snd_card);
-		card->snd_card = NULL;
-	}
-
 	/* remove and free each DAI */
 	soc_remove_link_dais(card);
 	soc_remove_link_components(card);
@@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
 	/* remove the card */
 	if (card->remove)
 		card->remove(card);
+
+	if (card->snd_card) {
+		snd_card_free(card->snd_card);
+		card->snd_card = NULL;
+	}
 }
 
 static int snd_soc_instantiate_card(struct snd_soc_card *card)
diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
index 1c00119..a60e381 100644
--- a/sound/soc/soc-pcm.c
+++ b/sound/soc/soc-pcm.c
@@ -2883,15 +2883,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream)
 	return ret;
 }
 
-static void soc_pcm_private_free(struct snd_pcm *pcm)
-{
-	struct snd_soc_pcm_runtime *rtd = pcm->private_data;
-
-	/* need to sync the delayed work before releasing resources */
-	flush_delayed_work(&rtd->delayed_work);
-	snd_soc_pcm_component_free(pcm);
-}
-
 /* create a new pcm */
 int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 {
@@ -3033,7 +3024,6 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 		return ret;
 	}
 
-	pcm->private_free = soc_pcm_private_free;
 	pcm->no_device_suspend = true;
 out:
 	dev_info(rtd->card->dev, "%s <-> %s mapping ok\n",
Takashi Iwai Nov. 13, 2019, 5:57 a.m. UTC | #9
On Wed, 13 Nov 2019 05:37:46 +0100,
Kuninori Morimoto wrote:
> 
> @@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
>  {
>  	struct snd_soc_dai_link *link, *_link;
>  
> -	/* free the ALSA card at first; this syncs with pending operations */
> -	if (card->snd_card) {
> -		snd_card_free(card->snd_card);
> -		card->snd_card = NULL;
> -	}
> -
>  	/* remove and free each DAI */
>  	soc_remove_link_dais(card);
>  	soc_remove_link_components(card);
> @@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
>  	/* remove the card */
>  	if (card->remove)
>  		card->remove(card);
> +
> +	if (card->snd_card) {
> +		snd_card_free(card->snd_card);
> +		card->snd_card = NULL;
> +	}

This will likely break unbind again; when unbind is performed in a
busy state, the code may release still-in-use resources.  The rcar
driver seems to have its own disconnect_sync() call so it'd work even
with this change, but others wouldn't.

At least you need to call snd_card_disconnect_sync() at the first
place, then release the rest at the second place.


thanks,

Takashi
Kuninori Morimoto Nov. 13, 2019, 6:33 a.m. UTC | #10
Hi Takashi-san

> > @@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
> >  {
> >  	struct snd_soc_dai_link *link, *_link;
> >  
> > -	/* free the ALSA card at first; this syncs with pending operations */
> > -	if (card->snd_card) {
> > -		snd_card_free(card->snd_card);
> > -		card->snd_card = NULL;
> > -	}
> > -
> >  	/* remove and free each DAI */
> >  	soc_remove_link_dais(card);
> >  	soc_remove_link_components(card);
> > @@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
> >  	/* remove the card */
> >  	if (card->remove)
> >  		card->remove(card);
> > +
> > +	if (card->snd_card) {
> > +		snd_card_free(card->snd_card);
> > +		card->snd_card = NULL;
> > +	}
> 
> This will likely break unbind again; when unbind is performed in a
> busy state, the code may release still-in-use resources.  The rcar
> driver seems to have its own disconnect_sync() call so it'd work even
> with this change, but others wouldn't.
> 
> At least you need to call snd_card_disconnect_sync() at the first
> place, then release the rest at the second place.

Ahh, I didn't notice about busy state and async process.
Thank you for pointing it.

Thank you for your help !!
Best regards
---
Kuninori Morimoto
Pierre-Louis Bossart Nov. 13, 2019, 4:05 p.m. UTC | #11
On 11/12/19 10:37 PM, Kuninori Morimoto wrote:
> 
> Hi Pierre-Louis
> 
> Thank you for your report
> 
>>>> +       for_each_rtd_components(rtd, rtdcom, component) {
>>>> +               pr_err("plb: %s processing component\n", __func__);
>>>> +               if (!component)
>>>> +                       pr_err("plb: %s component is NULL\n", __func__);
>>>
>>> Could you perhaps add traces of which components are being accessed at
>>> each stage?  We might want to go through and just add something like
>>> that in the code anyway to help figure things out.
>>
>> I tried to add more traces but couldn't triangulate on a clear issue,
>> and the traces have an Heisenbug effect.
>>
>> So I switched to higher-level code analysis: it turns out that
>> soc_dai_link_remove() routine is called from both topology and on card
>> cleanup.
>>
>> The patch 06/19 in this series essentially forces the pcm_runtimes to
>> be freed in both cases, so possibly twice for topology-managed
>> dailinks - or using information that's been freed already.
>>
>> I 'fixed' this by adding an additional parameter to avoid doing the
>> pcm runtime free from the topology (as was done before), and the
>> kernel oops goes away. My tests have been running for 45mn now, when
>> without change I get a kernel oops in less than 10-20 cycles (but
>> still more than apparently our CI tracks, something to improve).
>>
>> I pushed the code on GitHub to check if there are any negative points
>> reported by the Intel CI, should be complete shortly:
>> https://github.com/thesofproject/linux/pull/1469
>>
>> I am not sure the suggested fix is correct, I don't fully get what the
>> topology and card cleanups should do and how the work is split, if at
>> all.
> 
> BTW, I guess your kernel is appling this patch,
> but, is it correct ?
> 
> 	df95a16d2a9626dcfc3f2b3671c9b91fa076c997
> 	("ASoC: soc-core: fix RIP warning on card removal")

Sorry morimoto-san, I am not getting the question.

Are you asking if the patch is correct?

Or are you asking if the kernel used for this test include this patch? 
The answer is yes, this patch ("ASoC: soc-core: fix RIP warning on card 
removal") was merged by Mark and the SOF tree does use it, since we 
follow Mark's tree.

> If so, I think I could understand the issue.
> But, before explaining detail,
> I want to confirm that my solution is correct or not first.
> 
> Can you please check attached patch ?

Takashi's feedback seems to hint at problems with this patch, so will 
wait for further instructions here if you want me to test.
Thanks!

> Then, please remove above RIP warning patch.
> It is not clean patch, but OK to confirming, so far.
> I think
> 	- no kernel Oops
> 	- no RIP warning
> 
> Thank you for your help !!
> Best regards
> ---
> Kuninori Morimoto
> 
> 
> _______________________________________________
> Alsa-devel mailing list
> Alsa-devel@alsa-project.org
> https://mailman.alsa-project.org/mailman/listinfo/alsa-devel
>
Kuninori Morimoto Nov. 14, 2019, 12:18 a.m. UTC | #12
Hi Pierre-Louis

> > BTW, I guess your kernel is appling this patch,
> > but, is it correct ?
> > 
> > 	df95a16d2a9626dcfc3f2b3671c9b91fa076c997
> > 	("ASoC: soc-core: fix RIP warning on card removal")
> 
> Sorry morimoto-san, I am not getting the question.
> 
> Are you asking if the patch is correct?
> 
> Or are you asking if the kernel used for this test include this patch?
> The answer is yes, this patch ("ASoC: soc-core: fix RIP warning on
> card removal") was merged by Mark and the SOF tree does use it, since
> we follow Mark's tree.

Oops, sorry for my noise.
I thought you are using a little bit old kernel + extra patch + SOF tree.

> > If so, I think I could understand the issue.
> > But, before explaining detail,
> > I want to confirm that my solution is correct or not first.
> > 
> > Can you please check attached patch ?
> 
> Takashi's feedback seems to hint at problems with this patch, so will
> wait for further instructions here if you want me to test.
> Thanks!

OK, I will post patch as "RFC".
Because I can't test it.

Thank you for your help !!
Best regards
---
Kuninori Morimoto
Kuninori Morimoto Nov. 14, 2019, 1:03 a.m. UTC | #13
Hi Pierre-Louis, again
Cc Takashi-san

> > Takashi's feedback seems to hint at problems with this patch, so will
> > wait for further instructions here if you want me to test.
> > Thanks!
> 
> OK, I will post patch as "RFC".
> Because I can't test it.

I try to explain it here, not at RFC :)

I'm not 100% sure detail of SOF / topology, but in my understanding,
topology want to add *extra* dai_link by using snd_soc_add_dai_link().
And it will be removed by snd_soc_romove_dai_link().

Before, this snd_soc_add/remove_dai_link() and/or its related
functions are unbalanced.
Now, these are balance-uping, and it finds the random operation issue.

topology will call snd_soc_remove_dai_link() via (A).
In my understanding, currently, SOF and skylake are calling it.

	static void soc_cleanup_card_resources(struct snd_soc_card *card)
	{
		struct snd_soc_dai_link *link, *_link;

		/* This should be called before snd_card_free() */
	(A)	soc_remove_link_components(card);

		/* free the ALSA card at first; this syncs with pending operations */
		if (card->snd_card) {
	(B)		snd_card_free(card->snd_card);
			card->snd_card = NULL;
		}

		/* remove and free each DAI */
	(X)	soc_remove_link_dais(card);

		for_each_card_links_safe(card, link, _link)
	(C)		snd_soc_remove_dai_link(card, link);

		...
	}

At (A), topology calls snd_soc_remove_dai_link().
Then topology rtd, and its related all data are freed.

Next, (B) is called, and then, pcm->private_free = soc_pcm_private_free()
is called.

	static void soc_pcm_private_free(struct snd_pcm *pcm)
	{
		struct snd_soc_pcm_runtime *rtd = pcm->private_data;

		/* need to sync the delayed work before releasing resources */
		flush_delayed_work(&rtd->delayed_work);
		snd_soc_pcm_component_free(pcm);
	}

Here, it gets rtd via pcm->private_data.
But, topology related rtd are already freed at (A).
Above both flush_delayed_work() and snd_soc_pcm_component_free()
are using rtd via pcm->private_data.
I guess, your kernel access to already freed memory, and get Oops.

Normal sound card is freeing rtd at (C), thus no damage.

These flush_delayed_work() and snd_soc_pcm_component_free()
are rtd related data's finalization.
If so, these should be called when rtd was freed,
not sound card was freed.
It is very natural and understandable, I think.

In other words, pcm->private_free = soc_pcm_private_free()
is no longer needed.

But, here one other issue exist.
If we do above idea, there is zero chance to call
soc_remove_dai() to topology related dai at (X).
Because (A) removes rtd from card too, and,
(X) is based on card connected rtd.

This means, (X) need to be called before (C) (= for normal sound)
and (A) (= for topology).

Now, I want to focus this patch which is the reason why
snd_card_free() = (B) is located there.

	4efda5f2130da033aeedc5b3205569893b910de2
	("ASoC: Fix use-after-free at card unregistration")

Original snd_card_free() was called last of this function.
But moved to top to avoid use-after-free issue.
The issue was happen at soc_pcm_free() which was pcm->private_free,
today it is updated to soc_pcm_private_free().

In other words, (B) need to be called before (C) (= for normal sound)
and (A) (= for topology), because it needs (not yet freed) rtd.
But, (A) need to be called before (B),
because it needs card->snd_card pointer.

If we call flush_delayed_work() and snd_soc_pcm_component_free()
(= same as soc_pcm_private_free()) when rtd was freed (= (C), (A)),
there is no reason to call snd_card_free() at top of this function.
It can be called end of this function, again.

But, in such case, it will likely break unbind again, as Takashi-san said.
when unbind is performed in a busy state, the code may release still-in-use resources.
At least we need to call snd_card_disconnect_sync() at the first place.

The final code can be...

	static void soc_cleanup_card_resources(struct snd_soc_card *card)
	{
		struct snd_soc_dai_link *link, *_link;

		if (card->snd_card)
	(Z)		snd_card_disconnect_sync(card->snd_card);

	(X)	soc_remove_link_dais(card);
	(A)	soc_remove_link_components(card);

		for_each_card_links_safe(card, link, _link)
	(C)		snd_soc_remove_dai_link(card, link);

		...
		if (card->snd_card) {
	(B)		snd_card_free(card->snd_card);
			card->snd_card = NULL;
		}
	}

To avoid release still-in-use resources,
call snd_card_disconnect_sync() at (Z).

(X) is called for both non-topology and topology.

    topology removes rtd via (A), and
non topology removes rtd via (C).

And snd_card_free() is no longer related to
use-after-free issue. Thus, (B) is OK.

But, these are just my guess.
It works for me, but I can't re-produce the issue.

Below is the patch for "testing".

-------------
diff --git a/include/sound/soc-component.h b/include/sound/soc-component.h
index 6aa3ecb..cf726a5 100644
--- a/include/sound/soc-component.h
+++ b/include/sound/soc-component.h
@@ -413,6 +413,6 @@ struct page *snd_soc_pcm_component_page(struct snd_pcm_substream *substream,
 int snd_soc_pcm_component_mmap(struct snd_pcm_substream *substream,
 			       struct vm_area_struct *vma);
 int snd_soc_pcm_component_new(struct snd_pcm *pcm);
-void snd_soc_pcm_component_free(struct snd_pcm *pcm);
+void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd);
 
 #endif /* __SOC_COMPONENT_H */
diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c
index 98ef066..195979f 100644
--- a/sound/soc/soc-component.c
+++ b/sound/soc/soc-component.c
@@ -516,13 +516,12 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm)
 	return 0;
 }
 
-void snd_soc_pcm_component_free(struct snd_pcm *pcm)
+void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd)
 {
-	struct snd_soc_pcm_runtime *rtd = pcm->private_data;
 	struct snd_soc_rtdcom_list *rtdcom;
 	struct snd_soc_component *component;
 
 	for_each_rtd_components(rtd, rtdcom, component)
 		if (component->driver->pcm_destruct)
-			component->driver->pcm_destruct(component, pcm);
+			component->driver->pcm_destruct(component, rtd->pcm);
 }
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 92260a9..42d87bb 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -419,6 +419,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd)
 
 	list_del(&rtd->list);
 
+	flush_delayed_work(&rtd->delayed_work);
+	snd_soc_pcm_component_free(rtd);
+
 	/*
 	 * we don't need to call kfree() for rtd->dev
 	 * see
@@ -1944,17 +1947,12 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
 {
 	struct snd_soc_dai_link *link, *_link;
 
-	/* This should be called before snd_card_free() */
-	soc_remove_link_components(card);
-
-	/* free the ALSA card at first; this syncs with pending operations */
-	if (card->snd_card) {
-		snd_card_free(card->snd_card);
-		card->snd_card = NULL;
-	}
+	if (card->snd_card)
+		snd_card_disconnect_sync(card->snd_card);
 
 	/* remove and free each DAI */
 	soc_remove_link_dais(card);
+	soc_remove_link_components(card);
 
 	for_each_card_links_safe(card, link, _link)
 		snd_soc_remove_dai_link(card, link);
@@ -1969,6 +1967,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card)
 	/* remove the card */
 	if (card->remove)
 		card->remove(card);
+
+	if (card->snd_card) {
+		snd_card_free(card->snd_card);
+		card->snd_card = NULL;
+	}
 }
 
 static int snd_soc_bind_card(struct snd_soc_card *card)
diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
index 4bf71e3..6e24f0a5 100644
--- a/sound/soc/soc-pcm.c
+++ b/sound/soc/soc-pcm.c
@@ -2892,15 +2892,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream)
 	return ret;
 }
 
-static void soc_pcm_private_free(struct snd_pcm *pcm)
-{
-	struct snd_soc_pcm_runtime *rtd = pcm->private_data;
-
-	/* need to sync the delayed work before releasing resources */
-	flush_delayed_work(&rtd->delayed_work);
-	snd_soc_pcm_component_free(pcm);
-}
-
 /* create a new pcm */
 int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 {
@@ -3042,7 +3033,6 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num)
 		return ret;
 	}
 
-	pcm->private_free = soc_pcm_private_free;
 	pcm->no_device_suspend = true;
 out:
 	dev_info(rtd->card->dev, "%s <-> %s mapping ok\n",



Thank you for your help !!
Best regards
---
Kuninori Morimoto
Pierre-Louis Bossart Nov. 16, 2019, 12:19 a.m. UTC | #14
> But, these are just my guess.
> It works for me, but I can't re-produce the issue.
> 
> Below is the patch for "testing".

Morimoto-san, I haven't fully tested the logic and all the long 
description, but the patch suggested seems to work on all the platforms 
I tested on - and actually improves things on Baytrail/Cherrytrail.

I haven't seen any oopses in several hours now and no regression 
reported by Intel CI https://github.com/thesofproject/linux/pull/1504

Let's see what others think or tested.
Kuninori Morimoto Nov. 18, 2019, 12:47 a.m. UTC | #15
Hi Pierre-Louis

Thank you for testing

> > But, these are just my guess.
> > It works for me, but I can't re-produce the issue.
> > 
> > Below is the patch for "testing".
> 
> Morimoto-san, I haven't fully tested the logic and all the long
> description, but the patch suggested seems to work on all the
> platforms I tested on - and actually improves things on
> Baytrail/Cherrytrail.
> 
> I haven't seen any oopses in several hours now and no regression
> reported by Intel CI https://github.com/thesofproject/linux/pull/1504

Wow !! Nice to know !!
I'm very happy about that !!

OK, I will post the patch officially with your tested by.

> Let's see what others think or tested.

Yes, it is nice idea

Thank you for your help !!
Best regards
---
Kuninori Morimoto
Pierre-Louis Bossart Nov. 18, 2019, 3:14 p.m. UTC | #16
On 11/17/19 6:47 PM, Kuninori Morimoto wrote:
> 
> Hi Pierre-Louis
> 
> Thank you for testing
> 
>>> But, these are just my guess.
>>> It works for me, but I can't re-produce the issue.
>>>
>>> Below is the patch for "testing".
>>
>> Morimoto-san, I haven't fully tested the logic and all the long
>> description, but the patch suggested seems to work on all the
>> platforms I tested on - and actually improves things on
>> Baytrail/Cherrytrail.
>>
>> I haven't seen any oopses in several hours now and no regression
>> reported by Intel CI https://github.com/thesofproject/linux/pull/1504
> 
> Wow !! Nice to know !!
> I'm very happy about that !!
> 
> OK, I will post the patch officially with your tested by.
> 
>> Let's see what others think or tested.

the only issue we found in additional testing is that we need to insert 
a ~20s delay after each of module insertion/removal sequences, or we get 
stuck in at some point. That is a likely sign of delayed work not 
properly canceled - it may however be a different topic altogether, I 
saw the same sighting with my initial patch which reverted the "ASoC: 
soc-core: add soc_unbind_dai_link()" patch.

Others at Intel redid my tests and also found that the module 
load/unload tests worked across the board on CI devices.

from the Intel side we're good.
diff mbox series

Patch

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index e8ff6f2..1e8dd19 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -470,14 +470,6 @@  static struct snd_soc_pcm_runtime *soc_new_pcm_runtime(
 	return NULL;
 }
 
-static void soc_remove_pcm_runtimes(struct snd_soc_card *card)
-{
-	struct snd_soc_pcm_runtime *rtd, *_rtd;
-
-	for_each_card_rtds_safe(card, rtd, _rtd)
-		soc_free_pcm_runtime(rtd);
-}
-
 struct snd_soc_pcm_runtime *snd_soc_get_pcm_runtime(struct snd_soc_card *card,
 		const char *dai_link)
 {
@@ -1037,6 +1029,16 @@  static int soc_dai_link_sanity_check(struct snd_soc_card *card,
 	return 0;
 }
 
+static void soc_unbind_dai_link(struct snd_soc_card *card,
+				struct snd_soc_dai_link *dai_link)
+{
+	struct snd_soc_pcm_runtime *rtd;
+
+	rtd = snd_soc_get_pcm_runtime(card, dai_link->name);
+	if (rtd)
+		soc_free_pcm_runtime(rtd);
+}
+
 static int soc_bind_dai_link(struct snd_soc_card *card,
 	struct snd_soc_dai_link *dai_link)
 {
@@ -1466,6 +1468,8 @@  void snd_soc_remove_dai_link(struct snd_soc_card *card,
 		card->remove_dai_link(card, dai_link);
 
 	list_del(&dai_link->list);
+
+	soc_unbind_dai_link(card, dai_link);
 }
 EXPORT_SYMBOL_GPL(snd_soc_remove_dai_link);
 
@@ -1974,8 +1978,6 @@  static void soc_cleanup_card_resources(struct snd_soc_card *card)
 	for_each_card_links_safe(card, link, _link)
 		snd_soc_remove_dai_link(card, link);
 
-	soc_remove_pcm_runtimes(card);
-
 	/* remove auxiliary devices */
 	soc_remove_aux_devices(card);
 	soc_unbind_aux_dev(card);