| Message ID | 1350894794-1494-3-git-send-email-ming.lei@canonical.com (mailing list archive) |
|---|---|
| State | Superseded, archived |
On Mon, 22 Oct 2012, Ming Lei wrote:

> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> +{
> +	dev->power.memalloc_noio_resume = enable;
> +
> +	if (!dev->parent)
> +		return;
> +
> +	if (enable) {
> +		pm_runtime_set_memalloc_noio(dev->parent, 1);
> +	} else {
> +		/* only clear the flag for one device if all
> +		 * children of the device don't set the flag.
> +		 */
> +		if (device_for_each_child(dev->parent, NULL,
> +			      dev_memalloc_noio))
> +			return;
> +
> +		pm_runtime_set_memalloc_noio(dev->parent, 0);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);

Tail recursion should be implemented as a loop, not as an explicit
recursion.  That is, the function should be:

	void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
	{
		do {
			dev->power.memalloc_noio_resume = enable;

			if (!enable) {
				/*
				 * Don't clear the parent's flag if any of the
				 * parent's children have their flag set.
				 */
				if (device_for_each_child(dev->parent, NULL,
						dev_memalloc_noio))
					return;
			}
			dev = dev->parent;
		} while (dev);
	}

except that you need to add locking, for two reasons:

	There's a race.  What happens if another child sets the flag
	between the time device_for_each_child() runs and the next loop
	iteration?

	Even without a race, access to bitfields is not SMP-safe
	without locking.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Tail recursion should be implemented as a loop, not as an explicit
> recursion.  That is, the function should be:
>
> 	void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> 	{
> 		do {
> 			dev->power.memalloc_noio_resume = enable;
>
> 			if (!enable) {
> 				/*
> 				 * Don't clear the parent's flag if any of the
> 				 * parent's children have their flag set.
> 				 */
> 				if (device_for_each_child(dev->parent, NULL,
> 						dev_memalloc_noio))
> 					return;
> 			}
> 			dev = dev->parent;
> 		} while (dev);
> 	}

OK, I will take the non-recursive implementation to save kernel stack
space.

> except that you need to add locking, for two reasons:
>
> 	There's a race.  What happens if another child sets the flag
> 	between the time device_for_each_child() runs and the next loop
> 	iteration?

Yes, I know about the race; I didn't add a lock because the function is
mostly called from .probe() or .remove() callbacks, where the parent's
device lock is held, which avoids this race.

Considering that it may also be called from an async probe() (SCSI disk),
a lock is needed; the simplest way is to add a global lock.  Any
suggestions?

> 	Even without a race, access to bitfields is not SMP-safe
> 	without locking.

Do you mean that an ancestor device might not be active while one of its
descendants is being probed or removed?

Thanks,
--
Ming Lei
On Tue, 23 Oct 2012, Ming Lei wrote:

> On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > Tail recursion should be implemented as a loop, not as an explicit
> > recursion.  That is, the function should be:
> >
> > 	void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> > 	{
> > 		do {
> > 			dev->power.memalloc_noio_resume = enable;
> >
> > 			if (!enable) {
> > 				/*
> > 				 * Don't clear the parent's flag if any of the
> > 				 * parent's children have their flag set.
> > 				 */
> > 				if (device_for_each_child(dev->parent, NULL,
> > 						dev_memalloc_noio))
> > 					return;
> > 			}
> > 			dev = dev->parent;
> > 		} while (dev);
> > 	}
>
> OK, I will take the non-recursive implementation to save kernel stack
> space.
>
> > except that you need to add locking, for two reasons:
> >
> > 	There's a race.  What happens if another child sets the flag
> > 	between the time device_for_each_child() runs and the next loop
> > 	iteration?
>
> Yes, I know about the race; I didn't add a lock because the function is
> mostly called from .probe() or .remove() callbacks, where the parent's
> device lock is held, which avoids this race.
>
> Considering that it may also be called from an async probe() (SCSI disk),
> a lock is needed; the simplest way is to add a global lock.  Any
> suggestions?

No.  Because of where you put the new flag, it must be protected by
dev->power.lock.  And this means the iterative implementation shown
above can't be used as is.  It will have to be more like this:

	void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
	{
		spin_lock_irq(&dev->power.lock);
		dev->power.memalloc_noio_resume = enable;

		while (dev->parent) {
			spin_unlock_irq(&dev->power.lock);
			dev = dev->parent;
			spin_lock_irq(&dev->power.lock);

			/*
			 * Don't clear the parent's flag if any of the
			 * parent's children have their flag set.
			 */
			if (!enable && device_for_each_child(dev->parent, NULL,
					dev_memalloc_noio))
				break;
			dev->power.memalloc_noio_resume = enable;
		}
		spin_unlock_irq(&dev->power.lock);
	}

> > 	Even without a race, access to bitfields is not SMP-safe
> > 	without locking.
>
> Do you mean that an ancestor device might not be active while one of its
> descendants is being probed or removed?

No.  Consider this example:

	struct foo {
		int a:1;
		int b:1;
	} x;

Consider what happens if CPU 0 does "x.a = 1" at the same time as
another CPU 1 does "x.b = 1".  The compiler might produce object code
looking like this for CPU 0:

	move	x, reg1
	or	0x1, reg1
	move	reg1, x

and this for CPU 1:

	move	x, reg2
	or	0x2, reg2
	move	reg2, x

With no locking, the two "or" instructions could execute
simultaneously.  What will the final value of x be?

The two CPUs will interfere, even though they are touching different
bitfields.

Alan Stern
On Tue, Oct 23, 2012 at 10:46 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Tue, 23 Oct 2012, Ming Lei wrote:
>
>> OK, I will take the non-recursive implementation to save kernel stack
>> space.
>>
>> Yes, I know about the race; I didn't add a lock because the function is
>> mostly called from .probe() or .remove() callbacks, where the parent's
>> device lock is held, which avoids this race.
>>
>> Considering that it may also be called from an async probe() (SCSI disk),
>> a lock is needed; the simplest way is to add a global lock.  Any
>> suggestions?
>
> No.  Because of where you put the new flag, it must be protected by
> dev->power.lock.  And this means the iterative implementation shown
> above can't be used as is.  It will have to be more like this:
>
> 	void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> 	{
> 		spin_lock_irq(&dev->power.lock);
> 		dev->power.memalloc_noio_resume = enable;
>
> 		while (dev->parent) {
> 			spin_unlock_irq(&dev->power.lock);
> 			dev = dev->parent;
> 			spin_lock_irq(&dev->power.lock);
>
> 			/*
> 			 * Don't clear the parent's flag if any of the
> 			 * parent's children have their flag set.
> 			 */
> 			if (!enable && device_for_each_child(dev->parent, NULL,

s/dev->parent/dev

> 					dev_memalloc_noio))
> 				break;
> 			dev->power.memalloc_noio_resume = enable;
> 		}
> 		spin_unlock_irq(&dev->power.lock);
> 	}

Given the non-SMP-safe bitfield access, power.lock should be held, but
that is not enough to prevent children from being probed or
disconnected, so it looks like another lock is still needed.  I think
a global lock is OK on this infrequent path.

>> 	Even without a race, access to bitfields is not SMP-safe
>> 	without locking.
>>
>> Do you mean that an ancestor device might not be active while one of its
>> descendants is being probed or removed?
>
> No.  Consider this example:
>
> 	struct foo {
> 		int a:1;
> 		int b:1;
> 	} x;
>
> Consider what happens if CPU 0 does "x.a = 1" at the same time as
> another CPU 1 does "x.b = 1".  The compiler might produce object code
> looking like this for CPU 0:
>
> 	move	x, reg1
> 	or	0x1, reg1
> 	move	reg1, x
>
> and this for CPU 1:
>
> 	move	x, reg2
> 	or	0x2, reg2
> 	move	reg2, x
>
> With no locking, the two "or" instructions could execute
> simultaneously.  What will the final value of x be?
>
> The two CPUs will interfere, even though they are touching different
> bitfields.

Got it, thanks for your detailed explanation.

It looks like the problem is even worse than that: not only bitfields
are affected; adjacent fields may be involved too, see:

http://lwn.net/Articles/478657/

Thanks,
--
Ming Lei
On Tue, 23 Oct 2012, Ming Lei wrote:

> Given the non-SMP-safe bitfield access, power.lock should be held, but
> that is not enough to prevent children from being probed or
> disconnected, so it looks like another lock is still needed.  I think
> a global lock is OK on this infrequent path.

Agreed.

> It looks like the problem is even worse than that: not only bitfields
> are affected; adjacent fields may be involved too, see:
>
> http://lwn.net/Articles/478657/

Linus made it clear (in various emails at the time) that the kernel
requires the compiler not to do the sort of things discussed in that
article.  But even the restrictions he wanted would not prevent
adjacent bitfields from interfering with each other.

Alan Stern
> It looks like the problem is even worse than that: not only bitfields
> are affected; adjacent fields may be involved too, see:
>
> http://lwn.net/Articles/478657/

Not mentioned in there is that even on x86/amd64, given a struct with
the following adjacent fields:

	char a;
	char b;
	char c;

then "foo->b |= 0x80;" might do a 32-bit RMW cycle.  This will (well,
might - but probably does) happen if it is compiled to a 'BTS'
instruction.  The x86 instruction set docs are actually unclear as to
whether the 32-bit cycle might even be misaligned!  amd64 might do a
64-bit cycle (I haven't checked the docs).

	David
On Wed, 24 Oct 2012 10:06:36 +0100
"David Laight" <David.Laight@ACULAB.COM> wrote:

> Not mentioned in there is that even on x86/amd64, given a struct with
> the following adjacent fields:
>
> 	char a;
> 	char b;
> 	char c;
>
> then "foo->b |= 0x80;" might do a 32-bit RMW cycle.

There are processors that will do this for the char case at least, as
they do byte ops by a mix of 32-bit ops and rotates.

> This will (well, might - but probably does) happen if it is compiled
> to a 'BTS' instruction.  The x86 instruction set docs are actually
> unclear as to whether the 32-bit cycle might even be misaligned!
> amd64 might do a 64-bit cycle (I haven't checked the docs).

Even with a suitably aligned field, the compiler is at liberty to
generate things like:

	reg = 0x80;
	reg |= foo->b;
	foo->b = reg;

which is one reason it's a good idea to use set_bit()/test_bit() and
friends.

Alan
diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3148b10..a75eeca 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -124,6 +124,59 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
 
+static int dev_memalloc_noio(struct device *dev, void *data)
+{
+	if (dev->power.memalloc_noio_resume)
+		return 1;
+	return 0;
+}
+
+/*
+ * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
+ * @dev: Device to handle.
+ * @enable: True for setting the flag and False for clearing the flag.
+ *
+ * Set the flag for all devices in the path from the device to the
+ * root device in the device tree if @enable is true, otherwise clear
+ * the flag for devices in the path whose siblings don't set the flag.
+ *
+ * The function should only be called by block device or network
+ * device drivers to solve the deadlock problem during runtime
+ * resume:
+ *	if memory allocation with GFP_KERNEL is called inside the runtime
+ *	resume callback of any one of its ancestors (or the block device
+ *	itself), the deadlock may be triggered inside the memory
+ *	allocation since it might not complete until the block device
+ *	becomes active and the involved page I/O finishes. The situation
+ *	was pointed out first by Alan Stern. Network devices are involved
+ *	in iSCSI kinds of situations.
+ *
+ * No lock is provided in the function for handling the hotplug race
+ * because pm_runtime_set_memalloc_noio(false) is called in the parent's
+ * remove path.
+ */
+void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
+{
+	dev->power.memalloc_noio_resume = enable;
+
+	if (!dev->parent)
+		return;
+
+	if (enable) {
+		pm_runtime_set_memalloc_noio(dev->parent, 1);
+	} else {
+		/* only clear the flag for one device if all
+		 * children of the device don't set the flag.
+		 */
+		if (device_for_each_child(dev->parent, NULL,
+			      dev_memalloc_noio))
+			return;
+
+		pm_runtime_set_memalloc_noio(dev->parent, 0);
+	}
+}
+EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);
+
 /**
  * rpm_check_suspend_allowed - Test whether a device may be suspended.
  * @dev: Device to test.
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 007e687..5b0ee4d 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -538,6 +538,7 @@ struct dev_pm_info {
 	unsigned int		irq_safe:1;
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
+	unsigned int		memalloc_noio_resume:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index f271860..775e063 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -47,6 +47,7 @@ extern void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
 extern unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
 extern void pm_runtime_update_max_time_suspended(struct device *dev,
 						 s64 delta_ns);
+extern void pm_runtime_set_memalloc_noio(struct device *dev, bool enable);
 
 static inline bool pm_children_suspended(struct device *dev)
 {
@@ -149,6 +150,8 @@ static inline void pm_runtime_set_autosuspend_delay(struct device *dev,
 						int delay) {}
 static inline unsigned long pm_runtime_autosuspend_expiration(
 				struct device *dev) { return 0; }
+static inline void pm_runtime_set_memalloc_noio(struct device *dev,
+						bool enable) {}
 
 #endif /* !CONFIG_PM_RUNTIME */
This patch introduces the memalloc_noio_resume flag in 'struct dev_pm_info'
to help the PM core tell mm not to allocate memory with GFP_KERNEL, to
avoid a probable deadlock.

As explained in the comment, any GFP_KERNEL allocation inside the
runtime_resume callback of any device in the path from a block or network
device to the root device in the device tree may cause a deadlock. The
introduced pm_runtime_set_memalloc_noio() sets or clears the flag on the
devices in that path recursively.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/base/power/runtime.c |   53 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/pm.h           |    1 +
 include/linux/pm_runtime.h   |    3 +++
 3 files changed, 57 insertions(+)