Message ID | 20241212110221.1509163-1-ckeepax@opensource.cirrus.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | soundwire: intel_auxdevice: Don't disable IRQs before removing children | expand |
On 12/12/24 5:02 AM, Charles Keepax wrote: > Currently the auxiliary device for the link disables IRQs before it > calls sdw_bus_master_delete(). This has the side effect that > none of the devices on the link can access their own registers whilst > their remove functions run, because the IRQs are required for bus > transactions to function. Obviously, devices should be able to access > their own registers during disable to park the device suitably. What does 'park the device suitably' mean really? That sounds scary. What happens if the manager abruptly ceases to operate and yanks the power? Does the device catch on fire? On a related note, the manager should *never* expect to find the device in a 'suitable' state but always proceed with complete re-initialization. It would be good to expand on the issue, introducing new locks is one of those "no good deed goes unpunished" situations. > It would appear the reason for the disabling of the IRQs is that the IRQ > handler iterates through a linked list of all the links, once a link is > removed the memory pointed at by this linked list is freed, but not > removed from the linked_list itself. Add a list_del() for the linked > list item, note whilst the list itself is contained in the intel_init > portion of the code, the list remove needs to be attached to the > auxiliary device for the link, since that owns the memory that the list > points at. Locking is also required to ensure the IRQ handler runs > before or after any additions/removals from the list. that sounds very detailed but the initial reason for all this is still unclear.
On Mon, Dec 16, 2024 at 11:35:23AM -0600, Pierre-Louis Bossart wrote: > On 12/12/24 5:02 AM, Charles Keepax wrote: > > Currently the auxiliary device for the link disables IRQs before it > > calls sdw_bus_master_delete(). This has the side effect that > > none of the devices on the link can access their own registers whilst > > their remove functions run, because the IRQs are required for bus > > transactions to function. Obviously, devices should be able to access > > their own registers during disable to park the device suitably. > > What does 'park the device suitably' mean really? That sounds > scary. What happens if the manager abruptly ceases to operate and > yanks the power? Does the device catch on fire? On a related note, > the manager should *never* expect to find the device in a 'suitable' > state but always proceed with complete re-initialization. > > It would be good to expand on the issue, introducing new locks > is one of those "no good deed goes unpunished" situations. > I would agree that one should never make hardware that needs parked to avoid damage, but people do stupid things. Also, it doesn't have to be as catastrophic as that, it is usually a case of wanting to move the device into its lowest power state, turn off regulators on the device etc. > > It would appear the reason for the disabling of the IRQs is that the IRQ > > handler iterates through a linked list of all the links, once a link is > > removed the memory pointed at by this linked list is freed, but not > > removed from the linked_list itself. Add a list_del() for the linked > > list item, note whilst the list itself is contained in the intel_init > > portion of the code, the list remove needs to be attached to the > > auxiliary device for the link, since that owns the memory that the list > > points at. Locking is also required to ensure the IRQ handler runs > > before or after any additions/removals from the list. > > that sounds very detailed but the initial reason for all this is still > unclear. If you want, I can add the exact reason I am adding this change to the commit message, which is that regmap IRQ sensibly masks IRQs as they are removed, so when the cs42l43 removes one sees a bunch of failed transations, as the bus has broken. But to be honest I feel like it is overly specific, one could construct any number of similar situations, the real problem here is it is completely normal and reasonable to want to communicate with a device in remove and we should support that. Thanks, Charles
On 16/12/2024 5:35 pm, Pierre-Louis Bossart wrote: > On 12/12/24 5:02 AM, Charles Keepax wrote: >> Currently the auxiliary device for the link disables IRQs before it >> calls sdw_bus_master_delete(). This has the side effect that >> none of the devices on the link can access their own registers whilst >> their remove functions run, because the IRQs are required for bus >> transactions to function. Obviously, devices should be able to access >> their own registers during disable to park the device suitably. > > What does 'park the device suitably' mean really? That sounds scary. What happens if the manager abruptly ceases to operate and yanks the power? Does the device catch on fire? On a related note, the manager should *never* expect to find the device in a 'suitable' state but always proceed with complete re-initialization. > > It would be good to expand on the issue, introducing new locks is one of those "no good deed goes unpunished" situations. > >> It would appear the reason for the disabling of the IRQs is that the IRQ >> handler iterates through a linked list of all the links, once a link is >> removed the memory pointed at by this linked list is freed, but not >> removed from the linked_list itself. Add a list_del() for the linked >> list item, note whilst the list itself is contained in the intel_init >> portion of the code, the list remove needs to be attached to the >> auxiliary device for the link, since that owns the memory that the list >> points at. Locking is also required to ensure the IRQ handler runs >> before or after any additions/removals from the list. > > that sounds very detailed but the initial reason for all this is still unclear. > For example, if the bus driver module is unloaded, the kernel will call remove() on all the child drivers. The bus should remain functional while the child drivers deal with this unexpected unload. This could for example be writing some registers to put the peripheral into a low-power state. On ACPI systems the drivers don't have control of regulators so can't just pull power from the peripheral.
On 12/17/24 4:49 AM, Richard Fitzgerald wrote: > On 16/12/2024 5:35 pm, Pierre-Louis Bossart wrote: >> On 12/12/24 5:02 AM, Charles Keepax wrote: >>> Currently the auxiliary device for the link disables IRQs before it >>> calls sdw_bus_master_delete(). This has the side effect that >>> none of the devices on the link can access their own registers whilst >>> their remove functions run, because the IRQs are required for bus >>> transactions to function. Obviously, devices should be able to access >>> their own registers during disable to park the device suitably. >> >> What does 'park the device suitably' mean really? That sounds scary. What happens if the manager abruptly ceases to operate and yanks the power? Does the device catch on fire? On a related note, the manager should *never* expect to find the device in a 'suitable' state but always proceed with complete re-initialization. >> >> It would be good to expand on the issue, introducing new locks is one of those "no good deed goes unpunished" situations. >> >>> It would appear the reason for the disabling of the IRQs is that the IRQ >>> handler iterates through a linked list of all the links, once a link is >>> removed the memory pointed at by this linked list is freed, but not >>> removed from the linked_list itself. Add a list_del() for the linked >>> list item, note whilst the list itself is contained in the intel_init >>> portion of the code, the list remove needs to be attached to the >>> auxiliary device for the link, since that owns the memory that the list >>> points at. Locking is also required to ensure the IRQ handler runs >>> before or after any additions/removals from the list. >> >> that sounds very detailed but the initial reason for all this is still unclear. >> > For example, if the bus driver module is unloaded, the kernel will call > remove() on all the child drivers. The bus should remain functional > while the child drivers deal with this unexpected unload. This could > for example be writing some registers to put the peripheral into a > low-power state. On ACPI systems the drivers don't have control of > regulators so can't just pull power from the peripheral. Answering to the two replies at once: If the bus driver is unloaded, then the SoundWire clock will stop toggling. That's a rather large piece of information for the device to change states - I am pretty sure the SDCA spec even mandates that the state changes to at least PS_2. But to some extent one could argue that a remove() should be more aggressive than a suspend() and the driver could use PS_4 as the lower power state - there is no real requirement to restart interaction with the device with a simple procedure. The other problem I have with the notion of 'link_lock' is that we already have a notion of 'bus_lock'. And in everything we did so far the terms manager, link and bus are interoperable. So adding a new concept that looks very similar to the existing one shouldn't be done with an explanation of what lock is used for what.
On Tue, Dec 17, 2024 at 05:49:17PM -0600, Pierre-Louis Bossart wrote: > On 12/17/24 4:49 AM, Richard Fitzgerald wrote: > > On 16/12/2024 5:35 pm, Pierre-Louis Bossart wrote: > >> On 12/12/24 5:02 AM, Charles Keepax wrote: > > For example, if the bus driver module is unloaded, the kernel will call > > remove() on all the child drivers. The bus should remain functional > > while the child drivers deal with this unexpected unload. This could > > for example be writing some registers to put the peripheral into a > > low-power state. On ACPI systems the drivers don't have control of > > regulators so can't just pull power from the peripheral. > > Answering to the two replies at once: > > If the bus driver is unloaded, then the SoundWire clock will stop > toggling. That's a rather large piece of information for the device > to change states - The clock should only stop toggling after the drivers have been removed, anything else is a bug. > I am pretty sure the SDCA spec even mandates that > the state changes to at least PS_2. This code applies to more than just SDCA devices. > But to some extent one could argue that a remove() should be more > aggressive than a suspend() and the driver could use PS_4 as the > lower power state - there is no real requirement to restart > interaction with the device with a simple procedure. > Not really sure I follow this bit, none of this has anything to do with when one restarts interacting with the device. It is about leaving the device in a nice state when you stop interacting with it. > The other problem I have with the notion of 'link_lock' is that > we already have a notion of 'bus_lock'. And in everything we did so > far the terms manager, link and bus are interoperable. So adding a > new concept that looks very similar to the existing one shouldn't > be done with an explanation of what lock is used for what. I don't see much confusion here, the two locks are at different levels in the stack. If is fairly normal for a framework to have a lock and drivers to have individual locks under that. And the comment with the lock states it is protecting the list. That said I am not attached to this way of solving the problem either, all I am attached to is allowing devices to communicate in their remove functions. I think perhaps the important questions here are do you object to my assertion that a device should be able to communicate in its remove function? Or do you object to the way I have solved that problem? I am certainly open to other solutions, if you have any suggestions? Thanks, Charles
On 12/18/24 4:51 AM, Charles Keepax wrote: > On Tue, Dec 17, 2024 at 05:49:17PM -0600, Pierre-Louis Bossart wrote: >> On 12/17/24 4:49 AM, Richard Fitzgerald wrote: >>> On 16/12/2024 5:35 pm, Pierre-Louis Bossart wrote: >>>> On 12/12/24 5:02 AM, Charles Keepax wrote: >>> For example, if the bus driver module is unloaded, the kernel will call >>> remove() on all the child drivers. The bus should remain functional >>> while the child drivers deal with this unexpected unload. This could >>> for example be writing some registers to put the peripheral into a >>> low-power state. On ACPI systems the drivers don't have control of >>> regulators so can't just pull power from the peripheral. >> >> Answering to the two replies at once: >> >> If the bus driver is unloaded, then the SoundWire clock will stop >> toggling. That's a rather large piece of information for the device >> to change states - > > The clock should only stop toggling after the drivers have been > removed, anything else is a bug. > >> I am pretty sure the SDCA spec even mandates that >> the state changes to at least PS_2. > > This code applies to more than just SDCA devices. > >> But to some extent one could argue that a remove() should be more >> aggressive than a suspend() and the driver could use PS_4 as the >> lower power state - there is no real requirement to restart >> interaction with the device with a simple procedure. >> > > Not really sure I follow this bit, none of this has anything to do > with when one restarts interacting with the device. It is about > leaving the device in a nice state when you stop interacting with > it. > >> The other problem I have with the notion of 'link_lock' is that >> we already have a notion of 'bus_lock'. And in everything we did so >> far the terms manager, link and bus are interoperable. So adding a >> new concept that looks very similar to the existing one shouldn't >> be done with an explanation of what lock is used for what. > > I don't see much confusion here, the two locks are at different > levels in the stack. If is fairly normal for a framework to have > a lock and drivers to have individual locks under that. And the > comment with the lock states it is protecting the list. > > That said I am not attached to this way of solving the problem > either, all I am attached to is allowing devices to communicate in > their remove functions. I think perhaps the important questions > here are do you object to my assertion that a device should be > able to communicate in its remove function? Or do you object to > the way I have solved that problem? I am certainly open to other > solutions, if you have any suggestions? I agree that the device should be reachable during the remove(), but I believe the scope of expected interaction should be limited to a strict minimum. To be clearer, so far not a single device had a requirement for any sort of interaction on remove. You would need to clarify which codec driver needs this. I don't see how the 'link_lock' and 'bus_lock' are at different levels of the stack, the 'master' device and the 'auxiliary' device are both quite thin and I don't quite see what's different between the two.
>> Or do you object to >> the way I have solved that problem? I am certainly open to other >> solutions, if you have any suggestions? Having looked at the code in more details, I think there are other issues, see e.g. this part of the code called from snd_bus_master_delete(). static int sdw_delete_slave(struct device *dev, void *data) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_bus *bus = slave->bus; pm_runtime_disable(dev); sdw_slave_debugfs_exit(slave); mutex_lock(&bus->bus_lock); if (slave->dev_num) { /* clear dev_num if assigned */ clear_bit(slave->dev_num, bus->assigned); if (bus->ops && bus->ops->put_device_num) bus->ops->put_device_num(bus, slave); } So at this point an interaction with the device is not longer possible, even if the Cadence interrupts are kept active, since there's no valid device number to use... list_del_init(&slave->node); mutex_unlock(&bus->bus_lock); ... but this is where the .remove() will take place. device_unregister(dev); return 0; } What am I missing?
On Wed, Dec 18, 2024 at 04:40:22PM -0500, Pierre-Louis Bossart wrote: > I agree that the device should be reachable during the remove(), > but I believe the scope of expected interaction should be limited > to a strict minimum. To be clearer, so far not a single device had > a requirement for any sort of interaction on remove. You would need > to clarify which codec driver needs this. Ok I can add that to the commit message. > I don't see how the 'link_lock' and 'bus_lock' are at different > levels of the stack, the 'master' device and the 'auxiliary' device > are both quite thin and I don't quite see what's different between > the two. Not sure I follow that, like for example I assume I could just make the code use the bus_lock instead of adding a lock but to me that feels like a layer violation, you have a driver directly taking framework locks. But I guess I don't totally object if everyone else prefers that. In terms of other solutions, I could look at redoing more of the IRQ handling. If rather than using a hard coded linked list, the IRQs were properly registered as IRQs with the IRQ framework then we could simply add and remove them using the normal APIs, which would be a lot cleaner. Of course, fundamentally that is pretty equivalent but it will then just use a lock from the IRQ framework. > Having looked at the code in more details, I think there are other issues, > see e.g. this part of the code called from snd_bus_master_delete(). > > static int sdw_delete_slave(struct device *dev, void *data) > { > struct sdw_slave *slave = dev_to_sdw_dev(dev); > struct sdw_bus *bus = slave->bus; > > pm_runtime_disable(dev); > > sdw_slave_debugfs_exit(slave); > > mutex_lock(&bus->bus_lock); > > if (slave->dev_num) { /* clear dev_num if assigned */ > clear_bit(slave->dev_num, bus->assigned); > if (bus->ops && bus->ops->put_device_num) > bus->ops->put_device_num(bus, slave); > } > > So at this point an interaction with the device is not longer possible, even > if the Cadence interrupts are kept active, since there's no valid device > number to use... > > list_del_init(&slave->node); > mutex_unlock(&bus->bus_lock); > > ... but this is where the .remove() will take place. > > device_unregister(dev); > return 0; > } > > What am I missing? Hmm... yes that is a good spot, I will investigate that further. Certainly I do see these operations happening in the wrong order but it doesn't seem to cause the register transactions to fail. Most likely it is best to reverse these two, it makes sense to not clear the device number until we are finished with the device, but would be good to understand what is going on first. Thanks, Charles
On Thu, Dec 19, 2024 at 10:27:53AM +0000, Charles Keepax wrote: > On Wed, Dec 18, 2024 at 04:40:22PM -0500, Pierre-Louis Bossart wrote: > > Having looked at the code in more details, I think there are other issues, > > see e.g. this part of the code called from snd_bus_master_delete(). > > > > static int sdw_delete_slave(struct device *dev, void *data) > > { > > struct sdw_slave *slave = dev_to_sdw_dev(dev); > > struct sdw_bus *bus = slave->bus; > > > > pm_runtime_disable(dev); > > > > sdw_slave_debugfs_exit(slave); > > > > mutex_lock(&bus->bus_lock); > > > > if (slave->dev_num) { /* clear dev_num if assigned */ > > clear_bit(slave->dev_num, bus->assigned); > > if (bus->ops && bus->ops->put_device_num) > > bus->ops->put_device_num(bus, slave); > > } > > > > So at this point an interaction with the device is not longer possible, even > > if the Cadence interrupts are kept active, since there's no valid device > > number to use... > > > > list_del_init(&slave->node); > > mutex_unlock(&bus->bus_lock); > > > > ... but this is where the .remove() will take place. > > > > device_unregister(dev); > > return 0; > > } > > > > What am I missing? > > Hmm... yes that is a good spot, I will investigate that further. > Certainly I do see these operations happening in the wrong order > but it doesn't seem to cause the register transactions to fail. > Most likely it is best to reverse these two, it makes sense > to not clear the device number until we are finished with the > device, but would be good to understand what is going on first. Ok, so yeah nothing on the read/write path checks assigned. So yeah it has marked the device as not being assigned but that doesn't actually stop anything from communicating with the device, the dev_num is still the same one the hardware has. Assigned is only checked when handling slave status events. So I think your point is valid this code should also be changed although the only way it can currently go wrong is if a new device registered on the bus at exactly this moment and was assigned the dev_num of the old device. Likely will be the new year before I get the time to think through the details. Thanks, Charles
diff --git a/drivers/soundwire/intel.h b/drivers/soundwire/intel.h index dddd293814418..4df582ceaed1a 100644 --- a/drivers/soundwire/intel.h +++ b/drivers/soundwire/intel.h @@ -45,6 +45,7 @@ struct sdw_intel_link_res { u32 link_mask; struct sdw_cdns *cdns; struct list_head list; + struct mutex *link_lock; /* lock protecting list */ struct hdac_bus *hbus; }; diff --git a/drivers/soundwire/intel_auxdevice.c b/drivers/soundwire/intel_auxdevice.c index 599954d927529..5b2bbafef1ac0 100644 --- a/drivers/soundwire/intel_auxdevice.c +++ b/drivers/soundwire/intel_auxdevice.c @@ -498,9 +498,12 @@ static void intel_link_remove(struct auxiliary_device *auxdev) if (!bus->prop.hw_disabled) { sdw_intel_debugfs_exit(sdw); cancel_delayed_work_sync(&cdns->attach_dwork); - sdw_cdns_enable_interrupt(cdns, false); } + sdw_bus_master_delete(bus); + + if (!bus->prop.hw_disabled) + sdw_cdns_enable_interrupt(cdns, false); } int intel_link_process_wakeen_event(struct auxiliary_device *auxdev) diff --git a/drivers/soundwire/intel_init.c b/drivers/soundwire/intel_init.c index 12e7a98f319f8..db49ee3808151 100644 --- a/drivers/soundwire/intel_init.c +++ b/drivers/soundwire/intel_init.c @@ -28,6 +28,15 @@ static void intel_link_dev_release(struct device *dev) kfree(ldev); } +static void intel_link_list_del(void *data) +{ + struct sdw_intel_link_res *link = data; + + mutex_lock(link->link_lock); + list_del(&link->list); + mutex_unlock(link->link_lock); +} + /* alloc, init and add link devices */ static struct sdw_intel_link_dev *intel_link_dev_register(struct sdw_intel_res *res, struct sdw_intel_ctx *ctx, @@ -78,6 +87,7 @@ static struct sdw_intel_link_dev *intel_link_dev_register(struct sdw_intel_res * link->shim_vs = res->mmio_base + SDW_SHIM2_VS_BASE(link_id); link->shim_lock = res->eml_lock; } + link->link_lock = &ctx->link_lock; link->ops = res->ops; link->dev = res->dev; @@ -144,8 +154,10 @@ irqreturn_t sdw_intel_thread(int irq, void *dev_id) struct sdw_intel_ctx *ctx = dev_id; struct sdw_intel_link_res *link; + mutex_lock(&ctx->link_lock); list_for_each_entry(link, &ctx->link_list, list) sdw_cdns_irq(irq, link->cdns); + mutex_unlock(&ctx->link_lock); return IRQ_HANDLED; } @@ -209,6 +221,7 @@ static struct sdw_intel_ctx ctx->link_mask = res->link_mask; ctx->handle = res->handle; mutex_init(&ctx->shim_lock); + mutex_init(&ctx->link_lock); link_mask = ctx->link_mask; @@ -245,7 +258,10 @@ static struct sdw_intel_ctx i++; goto err; } + mutex_lock(&ctx->link_lock); list_add_tail(&link->list, &ctx->link_list); + mutex_unlock(&ctx->link_lock); + devm_add_action_or_reset(&ldev->auxdev.dev, intel_link_list_del, link); bus = &link->cdns->bus; /* Calculate number of slaves */ list_for_each(node, &bus->slaves) diff --git a/include/linux/soundwire/sdw_intel.h b/include/linux/soundwire/sdw_intel.h index 580086417e4b0..4444c99aead5f 100644 --- a/include/linux/soundwire/sdw_intel.h +++ b/include/linux/soundwire/sdw_intel.h @@ -304,6 +304,7 @@ struct sdw_intel_ctx { acpi_handle handle; struct sdw_intel_link_dev **ldev; struct list_head link_list; + struct mutex link_lock; /* lock protecting link_list */ struct mutex shim_lock; /* lock for access to shared SHIM registers */ u32 shim_mask; u32 shim_base;