Message ID | 877e4e3jni.wl-kuninori.morimoto.gx@renesas.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ASoC: soc-core cleanup - step 4 | expand |
On 11/5/19 12:46 AM, Kuninori Morimoto wrote: > > From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> > > It is easy to read code if it is cleanly using paired function/naming, > like start <-> stop, register <-> unregister, etc, etc. > But, current ALSA SoC code is very random, unbalance, not paired, etc. > It is easy to create bug at the such code, and it will be difficult to > debug. > > ALSA SoC has soc_bind_dai_link(), but its paired soc_unbind_dai_link() > is not implemented. > More confusable is that soc_remove_pcm_runtimes() which should be > soc_unbind_dai_link() is implemented without synchronised > to soc_bind_dai_link(). > > This patch cleanup this unbalance. > > Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> > Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Morimoto-san, this patch seems to introduce a regression in our SOF module removal tests. Without a couple of load/unload modules cycles, we hit a kernel oops while freeing the card. see https://github.com/thesofproject/linux/issues/1466 for some logs. This issue did not happen with our November 6 rebase on Mark's tree, and showed up today. I couldn't really bisect the whole tree due to other issues, so manually applied your patches on top of this 11/06 tree and bisected from there. I will need to confirm this finding (it's quite late for me) but looking at the code I wonder if the move of pcm_runtime deletion is correct? > --- > v2 -> v3 > - no change > - add Reviewed-by > > sound/soc/soc-core.c | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c > index e8ff6f2..1e8dd19 100644 > --- a/sound/soc/soc-core.c > +++ b/sound/soc/soc-core.c > @@ -470,14 +470,6 @@ static struct snd_soc_pcm_runtime *soc_new_pcm_runtime( > return NULL; > } > > -static void soc_remove_pcm_runtimes(struct snd_soc_card *card) > -{ > - struct snd_soc_pcm_runtime *rtd, *_rtd; > - > - for_each_card_rtds_safe(card, rtd, _rtd) > - soc_free_pcm_runtime(rtd); > -} > - > struct snd_soc_pcm_runtime *snd_soc_get_pcm_runtime(struct snd_soc_card *card, > const char *dai_link) > { > @@ -1037,6 +1029,16 @@ static int soc_dai_link_sanity_check(struct snd_soc_card *card, > return 0; > } > > +static void soc_unbind_dai_link(struct snd_soc_card *card, > + struct snd_soc_dai_link *dai_link) > +{ > + struct snd_soc_pcm_runtime *rtd; > + > + rtd = snd_soc_get_pcm_runtime(card, dai_link->name); > + if (rtd) > + soc_free_pcm_runtime(rtd); > +} > + > static int soc_bind_dai_link(struct snd_soc_card *card, > struct snd_soc_dai_link *dai_link) > { > @@ -1466,6 +1468,8 @@ void snd_soc_remove_dai_link(struct snd_soc_card *card, > card->remove_dai_link(card, dai_link); > > list_del(&dai_link->list); > + > + soc_unbind_dai_link(card, dai_link); > } > EXPORT_SYMBOL_GPL(snd_soc_remove_dai_link); > > @@ -1974,8 +1978,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) > for_each_card_links_safe(card, link, _link) > snd_soc_remove_dai_link(card, link); > > - soc_remove_pcm_runtimes(card); > - > /* remove auxiliary devices */ > soc_remove_aux_devices(card); > soc_unbind_aux_dev(card); >
Hi Pierre-Louis Thank you for reporting > > From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> > > > > It is easy to read code if it is cleanly using paired function/naming, > > like start <-> stop, register <-> unregister, etc, etc. > > But, current ALSA SoC code is very random, unbalance, not paired, etc. > > It is easy to create bug at the such code, and it will be difficult to > > debug. > > > > ALSA SoC has soc_bind_dai_link(), but its paired soc_unbind_dai_link() > > is not implemented. > > More confusable is that soc_remove_pcm_runtimes() which should be > > soc_unbind_dai_link() is implemented without synchronised > > to soc_bind_dai_link(). > > > > This patch cleanup this unbalance. > > > > Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> > > Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> > > Morimoto-san, this patch seems to introduce a regression in our SOF > module removal tests. Without a couple of load/unload modules cycles, > we hit a kernel oops while freeing the card. > > see https://github.com/thesofproject/linux/issues/1466 for some logs. > > This issue did not happen with our November 6 rebase on Mark's tree, > and showed up today. I couldn't really bisect the whole tree due to > other issues, so manually applied your patches on top of this 11/06 > tree and bisected from there. > > I will need to confirm this finding (it's quite late for me) but > looking at the code I wonder if the move of pcm_runtime deletion is > correct? Hmm... It is just merged verbose 2 functions into 1, nothing changed from logic point of view if my understanding was correct. And now, I tried unbind test for cpu/codec/card on my side, but nothing happen... I'm using this commit bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b ("ASoC: soc-core: add soc_unbind_dai_link()") Thank you for your help !! Best regards --- Kuninori Morimoto
Hi Pierre-Louis, again > > Morimoto-san, this patch seems to introduce a regression in our SOF > > module removal tests. Without a couple of load/unload modules cycles, > > we hit a kernel oops while freeing the card. > > > > see https://github.com/thesofproject/linux/issues/1466 for some logs. > > > > This issue did not happen with our November 6 rebase on Mark's tree, > > and showed up today. I couldn't really bisect the whole tree due to > > other issues, so manually applied your patches on top of this 11/06 > > tree and bisected from there. > > > > I will need to confirm this finding (it's quite late for me) but > > looking at the code I wonder if the move of pcm_runtime deletion is > > correct? > > Hmm... > It is just merged verbose 2 functions into 1, > nothing changed from logic point of view > if my understanding was correct. > > And now, I tried unbind test for cpu/codec/card on my side, > but nothing happen... > > I'm using this commit > > bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b > ("ASoC: soc-core: add soc_unbind_dai_link()") Does it happen from soc-topology.c :: remove_link ? Thank you for your help !! Best regards --- Kuninori Morimoto
Hi Pierre-Louis, again and again > > I'm using this commit > > > > bc7a9091e5b927ecc20dbb3bc07a5a09783fc27b > > ("ASoC: soc-core: add soc_unbind_dai_link()") > > Does it happen from soc-topology.c :: remove_link ? I can't test, but can this patch solve your issue? I guess topology related rtd free timing was changed by this balanceup. ----------- diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c index 1e8dd19..af89aad 100644 --- a/sound/soc/soc-core.c +++ b/sound/soc/soc-core.c @@ -384,6 +384,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd) if (!rtd) return; + /* need to sync the delayed work before releasing resources */ + flush_delayed_work(&rtd->delayed_work); + list_del(&rtd->list); /* diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c index 1c00119..9865a2d 100644 --- a/sound/soc/soc-pcm.c +++ b/sound/soc/soc-pcm.c @@ -2883,15 +2883,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream) return ret; } -static void soc_pcm_private_free(struct snd_pcm *pcm) -{ - struct snd_soc_pcm_runtime *rtd = pcm->private_data; - - /* need to sync the delayed work before releasing resources */ - flush_delayed_work(&rtd->delayed_work); - snd_soc_pcm_component_free(pcm); -} - /* create a new pcm */ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) { @@ -3033,7 +3024,7 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) return ret; } - pcm->private_free = soc_pcm_private_free; + pcm->private_free = snd_soc_pcm_component_free; pcm->no_device_suspend = true; out: dev_info(rtd->card->dev, "%s <-> %s mapping ok\n", Thank you for your help !! Best regards --- Kuninori Morimoto
>> Does it happen from soc-topology.c :: remove_link ? it seems to happen after the topology remove link, see the traces below > > I can't test, but can this patch solve your issue? No, the problem remains after applying your suggested fix. I added a bunch of traces and it seems we have a nasty case of corrupted linked lists: diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c index 98ef0666add2..5b0139ebe8f3 100644 --- a/sound/soc/soc-component.c +++ b/sound/soc/soc-component.c @@ -518,11 +518,39 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm) void snd_soc_pcm_component_free(struct snd_pcm *pcm) { - struct snd_soc_pcm_runtime *rtd = pcm->private_data; + struct snd_soc_pcm_runtime *rtd; struct snd_soc_rtdcom_list *rtdcom; struct snd_soc_component *component; - for_each_rtd_components(rtd, rtdcom, component) - if (component->driver->pcm_destruct) + pr_err("plb: %s start\n", __func__); + + if (!pcm) + pr_err("plb: %s PCM is NULL\n", __func__); + + pr_err("plb: %s accessing private data\n", __func__); + rtd = pcm->private_data; + pr_err("plb: %s accessed private data\n", __func__); + + if (!rtd) + pr_err("plb: %s RTD is NULL\n", __func__); + + pr_err("plb: %s accessing components\n", __func__); + for_each_rtd_components(rtd, rtdcom, component) { + pr_err("plb: %s processing component\n", __func__); + if (!component) + pr_err("plb: %s component is NULL\n", __func__); + + if (!component->driver) + pr_err("plb: %s component driver is NULL\n", __func__); + + pr_err("plb: %s pcm_destruct checks\n", __func__); + if (component->driver->pcm_destruct) { + pr_err("plb: %s pcm_destruct start\n", __func__); component->driver->pcm_destruct(component, pcm); + pr_err("plb: %s pcm_destruct done\n", __func__); + } + pr_err("plb: %s processing component done\n", __func__); + } + + pr_err("plb: %s done\n", __func__); } And the results show the for_each_rtd_components loop goes in the weeds. 82.069990] sof-audio-pci 0000:00:1f.3: plb: remove_link start [ 82.069993] sof-audio-pci 0000:00:1f.3: plb: remove_link 2 [ 82.069996] sof-audio-pci 0000:00:1f.3: plb: remove_link before snd_soc_remove_dai_link [ 82.069998] plb: snd_soc_remove_dai_link start [ 82.070016] plb: snd_soc_remove_dai_link done [ 82.070020] sof-audio-pci 0000:00:1f.3: plb: remove_link done <removed DSP power down sequence> [ 82.179021] plb: snd_soc_pcm_component_free start [ 82.179023] plb: snd_soc_pcm_component_free accessing private data [ 82.179024] plb: snd_soc_pcm_component_free accessed private data [ 82.179025] plb: snd_soc_pcm_component_free accessing components [ 82.179025] plb: snd_soc_pcm_component_free processing component [ 82.179029] BUG: kernel NULL pointer dereference, address: 0000000000000064 [ 82.179030] #PF: supervisor read access in kernel mode [ 82.179031] #PF: error_code(0x0000) - not-present page [ 82.179032] PGD 0 P4D 0 [ 82.179034] Oops: 0000 [#1] SMP NOPTI [ 82.179036] CPU: 3 PID: 768 Comm: pulseaudio Not tainted 5.4.0-rc5-test+ #31 [ 82.179036] Hardware name: Acer Swift SF314-55/MILLER_WL, BIOS V1.05 10/03/2018 [ 82.179042] RIP: 0010:snd_soc_pcm_component_free+0xc7/0x16a [snd_soc_core] [ 82.179043] Code: 43 08 48 c7 c6 f0 24 6e c0 4c 39 e0 0f 84 a9 00 00 00 48 8b 2b 48 85 ed 0f 84 9d 00 00 00 48 c7 c7 00 51 6e c0 e8 d2 5d 5d f2 <48> 83 7d 60 00 75 13 48 c7 c6 f0 24 6e c0 48 c7 c7 20 51 6e c0 e8 [ 82.179044] RSP: 0018:ffffa70180bf3d78 EFLAGS: 00010246 [ 82.179046] RAX: 0000000000000034 RBX: ffffa00f7aaf3968 RCX: 0000000000000006 [ 82.179047] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffa00fa5ad63d0 [ 82.179048] RBP: 0000000000000004 R08: ffffa70180bf3c3d R09: 0000000000001518 [ 82.179049] R10: ffffa70180bf3c38 R11: ffffa70180bf3c3d R12: ffffa00fa1be4eb0 [ 82.179050] R13: ffffa00fa27aa000 R14: dead000000000122 R15: dead000000000100 [ 82.179052] FS: 00007f4e7e5ebc80(0000) GS:ffffa00fa5ac0000(0000) knlGS:0000000000000000 [ 82.179054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 82.179055] CR2: 0000000000000064 CR3: 0000000253d68005 CR4: 00000000003606e0 [ 82.179056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 82.179057] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 82.179058] Call Trace: [ 82.179064] snd_pcm_free+0x1a/0x50 [snd_pcm] I have absolutely no idea what all these data structures are, just reporting this. reverting "ASoC: soc-core: add soc_unbind_dai_link()" is the only work-around at this point. i've tested this module load/unload for hours without issues. It's actually quite interesting since this snd_soc_pcm_component_free() calls a .pcm_destruct() callback that's not used by the SOF driver. It's only used on Intel platforms for the Skylake/SST driver, not sure why and if SOF is missing something.
On Tue, Nov 12, 2019 at 11:11:32AM -0600, Pierre-Louis Bossart wrote: > + for_each_rtd_components(rtd, rtdcom, component) { > + pr_err("plb: %s processing component\n", __func__); > + if (!component) > + pr_err("plb: %s component is NULL\n", __func__); Could you perhaps add traces of which components are being accessed at each stage? We might want to go through and just add something like that in the code anyway to help figure things out.
>> + for_each_rtd_components(rtd, rtdcom, component) { >> + pr_err("plb: %s processing component\n", __func__); >> + if (!component) >> + pr_err("plb: %s component is NULL\n", __func__); > > Could you perhaps add traces of which components are being accessed at > each stage? We might want to go through and just add something like > that in the code anyway to help figure things out. I tried to add more traces but couldn't triangulate on a clear issue, and the traces have an Heisenbug effect. So I switched to higher-level code analysis: it turns out that soc_dai_link_remove() routine is called from both topology and on card cleanup. The patch 06/19 in this series essentially forces the pcm_runtimes to be freed in both cases, so possibly twice for topology-managed dailinks - or using information that's been freed already. I 'fixed' this by adding an additional parameter to avoid doing the pcm runtime free from the topology (as was done before), and the kernel oops goes away. My tests have been running for 45mn now, when without change I get a kernel oops in less than 10-20 cycles (but still more than apparently our CI tracks, something to improve). I pushed the code on GitHub to check if there are any negative points reported by the Intel CI, should be complete shortly: https://github.com/thesofproject/linux/pull/1469 I am not sure the suggested fix is correct, I don't fully get what the topology and card cleanups should do and how the work is split, if at all.
Hi Pierre-Louis Thank you for your report > >> + for_each_rtd_components(rtd, rtdcom, component) { > >> + pr_err("plb: %s processing component\n", __func__); > >> + if (!component) > >> + pr_err("plb: %s component is NULL\n", __func__); > > > > Could you perhaps add traces of which components are being accessed at > > each stage? We might want to go through and just add something like > > that in the code anyway to help figure things out. > > I tried to add more traces but couldn't triangulate on a clear issue, > and the traces have an Heisenbug effect. > > So I switched to higher-level code analysis: it turns out that > soc_dai_link_remove() routine is called from both topology and on card > cleanup. > > The patch 06/19 in this series essentially forces the pcm_runtimes to > be freed in both cases, so possibly twice for topology-managed > dailinks - or using information that's been freed already. > > I 'fixed' this by adding an additional parameter to avoid doing the > pcm runtime free from the topology (as was done before), and the > kernel oops goes away. My tests have been running for 45mn now, when > without change I get a kernel oops in less than 10-20 cycles (but > still more than apparently our CI tracks, something to improve). > > I pushed the code on GitHub to check if there are any negative points > reported by the Intel CI, should be complete shortly: > https://github.com/thesofproject/linux/pull/1469 > > I am not sure the suggested fix is correct, I don't fully get what the > topology and card cleanups should do and how the work is split, if at > all. BTW, I guess your kernel is appling this patch, but, is it correct ? df95a16d2a9626dcfc3f2b3671c9b91fa076c997 ("ASoC: soc-core: fix RIP warning on card removal") If so, I think I could understand the issue. But, before explaining detail, I want to confirm that my solution is correct or not first. Can you please check attached patch ? Then, please remove above RIP warning patch. It is not clean patch, but OK to confirming, so far. I think - no kernel Oops - no RIP warning Thank you for your help !! Best regards --- Kuninori Morimoto From 0d825237eea4baad0c489e1c43a58373f41a3632 Mon Sep 17 00:00:00 2001 From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Date: Wed, 13 Nov 2019 11:58:18 +0900 Subject: [PATCH] topology die fixup patch candidate 1 Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> --- include/sound/soc-component.h | 2 +- sound/soc/soc-component.c | 5 ++--- sound/soc/soc-core.c | 14 ++++++++------ sound/soc/soc-pcm.c | 10 ---------- 4 files changed, 11 insertions(+), 20 deletions(-) diff --git a/include/sound/soc-component.h b/include/sound/soc-component.h index 6aa3ecb..cf726a5 100644 --- a/include/sound/soc-component.h +++ b/include/sound/soc-component.h @@ -413,6 +413,6 @@ struct page *snd_soc_pcm_component_page(struct snd_pcm_substream *substream, int snd_soc_pcm_component_mmap(struct snd_pcm_substream *substream, struct vm_area_struct *vma); int snd_soc_pcm_component_new(struct snd_pcm *pcm); -void snd_soc_pcm_component_free(struct snd_pcm *pcm); +void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd); #endif /* __SOC_COMPONENT_H */ diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c index 98ef066..195979f 100644 --- a/sound/soc/soc-component.c +++ b/sound/soc/soc-component.c @@ -516,13 +516,12 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm) return 0; } -void snd_soc_pcm_component_free(struct snd_pcm *pcm) +void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd) { - struct snd_soc_pcm_runtime *rtd = pcm->private_data; struct snd_soc_rtdcom_list *rtdcom; struct snd_soc_component *component; for_each_rtd_components(rtd, rtdcom, component) if (component->driver->pcm_destruct) - component->driver->pcm_destruct(component, pcm); + component->driver->pcm_destruct(component, rtd->pcm); } diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c index 1e8dd19..8e7ff7d 100644 --- a/sound/soc/soc-core.c +++ b/sound/soc/soc-core.c @@ -386,6 +386,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd) list_del(&rtd->list); + flush_delayed_work(&rtd->delayed_work); + snd_soc_pcm_component_free(rtd); + /* * we don't need to call kfree() for rtd->dev * see @@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) { struct snd_soc_dai_link *link, *_link; - /* free the ALSA card at first; this syncs with pending operations */ - if (card->snd_card) { - snd_card_free(card->snd_card); - card->snd_card = NULL; - } - /* remove and free each DAI */ soc_remove_link_dais(card); soc_remove_link_components(card); @@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) /* remove the card */ if (card->remove) card->remove(card); + + if (card->snd_card) { + snd_card_free(card->snd_card); + card->snd_card = NULL; + } } static int snd_soc_instantiate_card(struct snd_soc_card *card) diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c index 1c00119..a60e381 100644 --- a/sound/soc/soc-pcm.c +++ b/sound/soc/soc-pcm.c @@ -2883,15 +2883,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream) return ret; } -static void soc_pcm_private_free(struct snd_pcm *pcm) -{ - struct snd_soc_pcm_runtime *rtd = pcm->private_data; - - /* need to sync the delayed work before releasing resources */ - flush_delayed_work(&rtd->delayed_work); - snd_soc_pcm_component_free(pcm); -} - /* create a new pcm */ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) { @@ -3033,7 +3024,6 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) return ret; } - pcm->private_free = soc_pcm_private_free; pcm->no_device_suspend = true; out: dev_info(rtd->card->dev, "%s <-> %s mapping ok\n",
On Wed, 13 Nov 2019 05:37:46 +0100, Kuninori Morimoto wrote: > > @@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) > { > struct snd_soc_dai_link *link, *_link; > > - /* free the ALSA card at first; this syncs with pending operations */ > - if (card->snd_card) { > - snd_card_free(card->snd_card); > - card->snd_card = NULL; > - } > - > /* remove and free each DAI */ > soc_remove_link_dais(card); > soc_remove_link_components(card); > @@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) > /* remove the card */ > if (card->remove) > card->remove(card); > + > + if (card->snd_card) { > + snd_card_free(card->snd_card); > + card->snd_card = NULL; > + } This will likely break unbind again; when unbind is performed in a busy state, the code may release still-in-use resources. The rcar driver seems to have its own disconnect_sync() call so it'd work even with this change, but others wouldn't. At least you need to call snd_card_disconnect_sync() at the first place, then release the rest at the second place. thanks, Takashi
Hi Takashi-san > > @@ -1965,12 +1968,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) > > { > > struct snd_soc_dai_link *link, *_link; > > > > - /* free the ALSA card at first; this syncs with pending operations */ > > - if (card->snd_card) { > > - snd_card_free(card->snd_card); > > - card->snd_card = NULL; > > - } > > - > > /* remove and free each DAI */ > > soc_remove_link_dais(card); > > soc_remove_link_components(card); > > @@ -1988,6 +1985,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) > > /* remove the card */ > > if (card->remove) > > card->remove(card); > > + > > + if (card->snd_card) { > > + snd_card_free(card->snd_card); > > + card->snd_card = NULL; > > + } > > This will likely break unbind again; when unbind is performed in a > busy state, the code may release still-in-use resources. The rcar > driver seems to have its own disconnect_sync() call so it'd work even > with this change, but others wouldn't. > > At least you need to call snd_card_disconnect_sync() at the first > place, then release the rest at the second place. Ahh, I didn't notice about busy state and async process. Thank you for pointing it. Thank you for your help !! Best regards --- Kuninori Morimoto
On 11/12/19 10:37 PM, Kuninori Morimoto wrote: > > Hi Pierre-Louis > > Thank you for your report > >>>> + for_each_rtd_components(rtd, rtdcom, component) { >>>> + pr_err("plb: %s processing component\n", __func__); >>>> + if (!component) >>>> + pr_err("plb: %s component is NULL\n", __func__); >>> >>> Could you perhaps add traces of which components are being accessed at >>> each stage? We might want to go through and just add something like >>> that in the code anyway to help figure things out. >> >> I tried to add more traces but couldn't triangulate on a clear issue, >> and the traces have an Heisenbug effect. >> >> So I switched to higher-level code analysis: it turns out that >> soc_dai_link_remove() routine is called from both topology and on card >> cleanup. >> >> The patch 06/19 in this series essentially forces the pcm_runtimes to >> be freed in both cases, so possibly twice for topology-managed >> dailinks - or using information that's been freed already. >> >> I 'fixed' this by adding an additional parameter to avoid doing the >> pcm runtime free from the topology (as was done before), and the >> kernel oops goes away. My tests have been running for 45mn now, when >> without change I get a kernel oops in less than 10-20 cycles (but >> still more than apparently our CI tracks, something to improve). >> >> I pushed the code on GitHub to check if there are any negative points >> reported by the Intel CI, should be complete shortly: >> https://github.com/thesofproject/linux/pull/1469 >> >> I am not sure the suggested fix is correct, I don't fully get what the >> topology and card cleanups should do and how the work is split, if at >> all. > > BTW, I guess your kernel is appling this patch, > but, is it correct ? > > df95a16d2a9626dcfc3f2b3671c9b91fa076c997 > ("ASoC: soc-core: fix RIP warning on card removal") Sorry morimoto-san, I am not getting the question. Are you asking if the patch is correct? Or are you asking if the kernel used for this test include this patch? The answer is yes, this patch ("ASoC: soc-core: fix RIP warning on card removal") was merged by Mark and the SOF tree does use it, since we follow Mark's tree. > If so, I think I could understand the issue. > But, before explaining detail, > I want to confirm that my solution is correct or not first. > > Can you please check attached patch ? Takashi's feedback seems to hint at problems with this patch, so will wait for further instructions here if you want me to test. Thanks! > Then, please remove above RIP warning patch. > It is not clean patch, but OK to confirming, so far. > I think > - no kernel Oops > - no RIP warning > > Thank you for your help !! > Best regards > --- > Kuninori Morimoto > > > _______________________________________________ > Alsa-devel mailing list > Alsa-devel@alsa-project.org > https://mailman.alsa-project.org/mailman/listinfo/alsa-devel >
Hi Pierre-Louis > > BTW, I guess your kernel is appling this patch, > > but, is it correct ? > > > > df95a16d2a9626dcfc3f2b3671c9b91fa076c997 > > ("ASoC: soc-core: fix RIP warning on card removal") > > Sorry morimoto-san, I am not getting the question. > > Are you asking if the patch is correct? > > Or are you asking if the kernel used for this test include this patch? > The answer is yes, this patch ("ASoC: soc-core: fix RIP warning on > card removal") was merged by Mark and the SOF tree does use it, since > we follow Mark's tree. Oops, sorry for my noise. I thought you are using a little bit old kernel + extra patch + SOF tree. > > If so, I think I could understand the issue. > > But, before explaining detail, > > I want to confirm that my solution is correct or not first. > > > > Can you please check attached patch ? > > Takashi's feedback seems to hint at problems with this patch, so will > wait for further instructions here if you want me to test. > Thanks! OK, I will post patch as "RFC". Because I can't test it. Thank you for your help !! Best regards --- Kuninori Morimoto
Hi Pierre-Louis, again Cc Takashi-san > > Takashi's feedback seems to hint at problems with this patch, so will > > wait for further instructions here if you want me to test. > > Thanks! > > OK, I will post patch as "RFC". > Because I can't test it. I try to explain it here, not at RFC :) I'm not 100% sure detail of SOF / topology, but in my understanding, topology want to add *extra* dai_link by using snd_soc_add_dai_link(). And it will be removed by snd_soc_romove_dai_link(). Before, this snd_soc_add/remove_dai_link() and/or its related functions are unbalanced. Now, these are balance-uping, and it finds the random operation issue. topology will call snd_soc_remove_dai_link() via (A). In my understanding, currently, SOF and skylake are calling it. static void soc_cleanup_card_resources(struct snd_soc_card *card) { struct snd_soc_dai_link *link, *_link; /* This should be called before snd_card_free() */ (A) soc_remove_link_components(card); /* free the ALSA card at first; this syncs with pending operations */ if (card->snd_card) { (B) snd_card_free(card->snd_card); card->snd_card = NULL; } /* remove and free each DAI */ (X) soc_remove_link_dais(card); for_each_card_links_safe(card, link, _link) (C) snd_soc_remove_dai_link(card, link); ... } At (A), topology calls snd_soc_remove_dai_link(). Then topology rtd, and its related all data are freed. Next, (B) is called, and then, pcm->private_free = soc_pcm_private_free() is called. static void soc_pcm_private_free(struct snd_pcm *pcm) { struct snd_soc_pcm_runtime *rtd = pcm->private_data; /* need to sync the delayed work before releasing resources */ flush_delayed_work(&rtd->delayed_work); snd_soc_pcm_component_free(pcm); } Here, it gets rtd via pcm->private_data. But, topology related rtd are already freed at (A). Above both flush_delayed_work() and snd_soc_pcm_component_free() are using rtd via pcm->private_data. I guess, your kernel access to already freed memory, and get Oops. Normal sound card is freeing rtd at (C), thus no damage. These flush_delayed_work() and snd_soc_pcm_component_free() are rtd related data's finalization. If so, these should be called when rtd was freed, not sound card was freed. It is very natural and understandable, I think. In other words, pcm->private_free = soc_pcm_private_free() is no longer needed. But, here one other issue exist. If we do above idea, there is zero chance to call soc_remove_dai() to topology related dai at (X). Because (A) removes rtd from card too, and, (X) is based on card connected rtd. This means, (X) need to be called before (C) (= for normal sound) and (A) (= for topology). Now, I want to focus this patch which is the reason why snd_card_free() = (B) is located there. 4efda5f2130da033aeedc5b3205569893b910de2 ("ASoC: Fix use-after-free at card unregistration") Original snd_card_free() was called last of this function. But moved to top to avoid use-after-free issue. The issue was happen at soc_pcm_free() which was pcm->private_free, today it is updated to soc_pcm_private_free(). In other words, (B) need to be called before (C) (= for normal sound) and (A) (= for topology), because it needs (not yet freed) rtd. But, (A) need to be called before (B), because it needs card->snd_card pointer. If we call flush_delayed_work() and snd_soc_pcm_component_free() (= same as soc_pcm_private_free()) when rtd was freed (= (C), (A)), there is no reason to call snd_card_free() at top of this function. It can be called end of this function, again. But, in such case, it will likely break unbind again, as Takashi-san said. when unbind is performed in a busy state, the code may release still-in-use resources. At least we need to call snd_card_disconnect_sync() at the first place. The final code can be... static void soc_cleanup_card_resources(struct snd_soc_card *card) { struct snd_soc_dai_link *link, *_link; if (card->snd_card) (Z) snd_card_disconnect_sync(card->snd_card); (X) soc_remove_link_dais(card); (A) soc_remove_link_components(card); for_each_card_links_safe(card, link, _link) (C) snd_soc_remove_dai_link(card, link); ... if (card->snd_card) { (B) snd_card_free(card->snd_card); card->snd_card = NULL; } } To avoid release still-in-use resources, call snd_card_disconnect_sync() at (Z). (X) is called for both non-topology and topology. topology removes rtd via (A), and non topology removes rtd via (C). And snd_card_free() is no longer related to use-after-free issue. Thus, (B) is OK. But, these are just my guess. It works for me, but I can't re-produce the issue. Below is the patch for "testing". ------------- diff --git a/include/sound/soc-component.h b/include/sound/soc-component.h index 6aa3ecb..cf726a5 100644 --- a/include/sound/soc-component.h +++ b/include/sound/soc-component.h @@ -413,6 +413,6 @@ struct page *snd_soc_pcm_component_page(struct snd_pcm_substream *substream, int snd_soc_pcm_component_mmap(struct snd_pcm_substream *substream, struct vm_area_struct *vma); int snd_soc_pcm_component_new(struct snd_pcm *pcm); -void snd_soc_pcm_component_free(struct snd_pcm *pcm); +void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd); #endif /* __SOC_COMPONENT_H */ diff --git a/sound/soc/soc-component.c b/sound/soc/soc-component.c index 98ef066..195979f 100644 --- a/sound/soc/soc-component.c +++ b/sound/soc/soc-component.c @@ -516,13 +516,12 @@ int snd_soc_pcm_component_new(struct snd_pcm *pcm) return 0; } -void snd_soc_pcm_component_free(struct snd_pcm *pcm) +void snd_soc_pcm_component_free(struct snd_soc_pcm_runtime *rtd) { - struct snd_soc_pcm_runtime *rtd = pcm->private_data; struct snd_soc_rtdcom_list *rtdcom; struct snd_soc_component *component; for_each_rtd_components(rtd, rtdcom, component) if (component->driver->pcm_destruct) - component->driver->pcm_destruct(component, pcm); + component->driver->pcm_destruct(component, rtd->pcm); } diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c index 92260a9..42d87bb 100644 --- a/sound/soc/soc-core.c +++ b/sound/soc/soc-core.c @@ -419,6 +419,9 @@ static void soc_free_pcm_runtime(struct snd_soc_pcm_runtime *rtd) list_del(&rtd->list); + flush_delayed_work(&rtd->delayed_work); + snd_soc_pcm_component_free(rtd); + /* * we don't need to call kfree() for rtd->dev * see @@ -1944,17 +1947,12 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) { struct snd_soc_dai_link *link, *_link; - /* This should be called before snd_card_free() */ - soc_remove_link_components(card); - - /* free the ALSA card at first; this syncs with pending operations */ - if (card->snd_card) { - snd_card_free(card->snd_card); - card->snd_card = NULL; - } + if (card->snd_card) + snd_card_disconnect_sync(card->snd_card); /* remove and free each DAI */ soc_remove_link_dais(card); + soc_remove_link_components(card); for_each_card_links_safe(card, link, _link) snd_soc_remove_dai_link(card, link); @@ -1969,6 +1967,11 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) /* remove the card */ if (card->remove) card->remove(card); + + if (card->snd_card) { + snd_card_free(card->snd_card); + card->snd_card = NULL; + } } static int snd_soc_bind_card(struct snd_soc_card *card) diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c index 4bf71e3..6e24f0a5 100644 --- a/sound/soc/soc-pcm.c +++ b/sound/soc/soc-pcm.c @@ -2892,15 +2892,6 @@ static int dpcm_fe_dai_close(struct snd_pcm_substream *fe_substream) return ret; } -static void soc_pcm_private_free(struct snd_pcm *pcm) -{ - struct snd_soc_pcm_runtime *rtd = pcm->private_data; - - /* need to sync the delayed work before releasing resources */ - flush_delayed_work(&rtd->delayed_work); - snd_soc_pcm_component_free(pcm); -} - /* create a new pcm */ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) { @@ -3042,7 +3033,6 @@ int soc_new_pcm(struct snd_soc_pcm_runtime *rtd, int num) return ret; } - pcm->private_free = soc_pcm_private_free; pcm->no_device_suspend = true; out: dev_info(rtd->card->dev, "%s <-> %s mapping ok\n", Thank you for your help !! Best regards --- Kuninori Morimoto
> But, these are just my guess. > It works for me, but I can't re-produce the issue. > > Below is the patch for "testing". Morimoto-san, I haven't fully tested the logic and all the long description, but the patch suggested seems to work on all the platforms I tested on - and actually improves things on Baytrail/Cherrytrail. I haven't seen any oopses in several hours now and no regression reported by Intel CI https://github.com/thesofproject/linux/pull/1504 Let's see what others think or tested.
Hi Pierre-Louis Thank you for testing > > But, these are just my guess. > > It works for me, but I can't re-produce the issue. > > > > Below is the patch for "testing". > > Morimoto-san, I haven't fully tested the logic and all the long > description, but the patch suggested seems to work on all the > platforms I tested on - and actually improves things on > Baytrail/Cherrytrail. > > I haven't seen any oopses in several hours now and no regression > reported by Intel CI https://github.com/thesofproject/linux/pull/1504 Wow !! Nice to know !! I'm very happy about that !! OK, I will post the patch officially with your tested by. > Let's see what others think or tested. Yes, it is nice idea Thank you for your help !! Best regards --- Kuninori Morimoto
On 11/17/19 6:47 PM, Kuninori Morimoto wrote: > > Hi Pierre-Louis > > Thank you for testing > >>> But, these are just my guess. >>> It works for me, but I can't re-produce the issue. >>> >>> Below is the patch for "testing". >> >> Morimoto-san, I haven't fully tested the logic and all the long >> description, but the patch suggested seems to work on all the >> platforms I tested on - and actually improves things on >> Baytrail/Cherrytrail. >> >> I haven't seen any oopses in several hours now and no regression >> reported by Intel CI https://github.com/thesofproject/linux/pull/1504 > > Wow !! Nice to know !! > I'm very happy about that !! > > OK, I will post the patch officially with your tested by. > >> Let's see what others think or tested. the only issue we found in additional testing is that we need to insert a ~20s delay after each of module insertion/removal sequences, or we get stuck in at some point. That is a likely sign of delayed work not properly canceled - it may however be a different topic altogether, I saw the same sighting with my initial patch which reverted the "ASoC: soc-core: add soc_unbind_dai_link()" patch. Others at Intel redid my tests and also found that the module load/unload tests worked across the board on CI devices. from the Intel side we're good.
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c index e8ff6f2..1e8dd19 100644 --- a/sound/soc/soc-core.c +++ b/sound/soc/soc-core.c @@ -470,14 +470,6 @@ static struct snd_soc_pcm_runtime *soc_new_pcm_runtime( return NULL; } -static void soc_remove_pcm_runtimes(struct snd_soc_card *card) -{ - struct snd_soc_pcm_runtime *rtd, *_rtd; - - for_each_card_rtds_safe(card, rtd, _rtd) - soc_free_pcm_runtime(rtd); -} - struct snd_soc_pcm_runtime *snd_soc_get_pcm_runtime(struct snd_soc_card *card, const char *dai_link) { @@ -1037,6 +1029,16 @@ static int soc_dai_link_sanity_check(struct snd_soc_card *card, return 0; } +static void soc_unbind_dai_link(struct snd_soc_card *card, + struct snd_soc_dai_link *dai_link) +{ + struct snd_soc_pcm_runtime *rtd; + + rtd = snd_soc_get_pcm_runtime(card, dai_link->name); + if (rtd) + soc_free_pcm_runtime(rtd); +} + static int soc_bind_dai_link(struct snd_soc_card *card, struct snd_soc_dai_link *dai_link) { @@ -1466,6 +1468,8 @@ void snd_soc_remove_dai_link(struct snd_soc_card *card, card->remove_dai_link(card, dai_link); list_del(&dai_link->list); + + soc_unbind_dai_link(card, dai_link); } EXPORT_SYMBOL_GPL(snd_soc_remove_dai_link); @@ -1974,8 +1978,6 @@ static void soc_cleanup_card_resources(struct snd_soc_card *card) for_each_card_links_safe(card, link, _link) snd_soc_remove_dai_link(card, link); - soc_remove_pcm_runtimes(card); - /* remove auxiliary devices */ soc_remove_aux_devices(card); soc_unbind_aux_dev(card);