diff mbox

[RFC,v2,2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

Message ID 1350894794-1494-3-git-send-email-ming.lei@canonical.com (mailing list archive)
State Superseded, archived

Commit Message

Ming Lei Oct. 22, 2012, 8:33 a.m. UTC
This patch introduces the memalloc_noio_resume flag in
'struct dev_pm_info' so that the PM core can tell mm not to
allocate memory with the GFP_KERNEL flag, avoiding a possible
deadlock.

As explained in the comment, any GFP_KERNEL allocation inside the
runtime_resume callback of any device in the path from a block
or network device to the root device in the device tree may cause
a deadlock. The introduced pm_runtime_set_memalloc_noio() sets or
clears the flag recursively on the devices along that path.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/base/power/runtime.c |   53 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/pm.h           |    1 +
 include/linux/pm_runtime.h   |    3 +++
 3 files changed, 57 insertions(+)

Comments

Alan Stern Oct. 22, 2012, 2:33 p.m. UTC | #1
On Mon, 22 Oct 2012, Ming Lei wrote:

> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> +{
> +	dev->power.memalloc_noio_resume = enable;
> +
> +	if (!dev->parent)
> +		return;
> +
> +	if (enable) {
> +		pm_runtime_set_memalloc_noio(dev->parent, 1);
> +	} else {
> +		/* only clear the flag for one device if all
> +		 * children of the device don't set the flag.
> +		 */
> +		if (device_for_each_child(dev->parent, NULL,
> +					  dev_memalloc_noio))
> +			return;
> +
> +		pm_runtime_set_memalloc_noio(dev->parent, 0);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);

Tail recursion should be implemented as a loop, not as an explicit
recursion.  That is, the function should be:

void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
{
	do {
		dev->power.memalloc_noio_resume = enable;

		if (!enable) {
			/*
			 * Don't clear the parent's flag if any of the
			 * parent's children have their flag set.
			 */
			if (device_for_each_child(dev->parent, NULL,
					  dev_memalloc_noio))
				return;
		}
		dev = dev->parent;
	} while (dev);
}

except that you need to add locking, for two reasons:

	There's a race.  What happens if another child sets the flag
	between the time device_for_each_child() runs and the next loop
	iteration?

	Even without a race, access to bitfields is not SMP-safe 
	without locking.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ming Lei Oct. 23, 2012, 9:08 a.m. UTC | #2
On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Tail recursion should be implemented as a loop, not as an explicit
> recursion.  That is, the function should be:
>
> void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> {
>         do {
>                 dev->power.memalloc_noio_resume = enable;
>
>                 if (!enable) {
>                         /*
>                          * Don't clear the parent's flag if any of the
>                          * parent's children have their flag set.
>                          */
>                         if (device_for_each_child(dev->parent, NULL,
>                                           dev_memalloc_noio))
>                                 return;
>                 }
>                 dev = dev->parent;
>         } while (dev);
> }

OK, I will take the non-recursive implementation to save kernel
stack space.

>
> except that you need to add locking, for two reasons:
>
>         There's a race.  What happens if another child sets the flag
>         between the time device_for_each_child() runs and the next loop
>         iteration?

Yes, I know about the race. I did not add a lock because the function
is mostly called from a .probe() or .remove() callback, where the
parent's device lock is held, which avoids this race.

Considering that it may also be called from an async probe() (scsi disk),
one lock is needed; the simplest way is to add a global lock. Any suggestion?

>
>         Even without a race, access to bitfields is not SMP-safe
>         without locking.

Do you mean that one ancestor device might not be active while
one of its descendants is being probed or removed?


Thanks,
--
Ming Lei
Alan Stern Oct. 23, 2012, 2:46 p.m. UTC | #3
On Tue, 23 Oct 2012, Ming Lei wrote:

> On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > Tail recursion should be implemented as a loop, not as an explicit
> > recursion.  That is, the function should be:
> >
> > void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> > {
> >         do {
> >                 dev->power.memalloc_noio_resume = enable;
> >
> >                 if (!enable) {
> >                         /*
> >                          * Don't clear the parent's flag if any of the
> >                          * parent's children have their flag set.
> >                          */
> >                         if (device_for_each_child(dev->parent, NULL,
> >                                           dev_memalloc_noio))
> >                                 return;
> >                 }
> >                 dev = dev->parent;
> >         } while (dev);
> > }
> 
> OK, will take the non-recursion implementation for saving kernel
> stack space.
> 
> >
> > except that you need to add locking, for two reasons:
> >
> >         There's a race.  What happens if another child sets the flag
> >         between the time device_for_each_child() runs and the next loop
> >         iteration?
> 
> Yes, I know the race, and not adding a lock because the function
> is mostly called in .probe() or .remove() callback and its parent's device
> lock is held to avoid this race.
> 
> Considered that it may be called in async probe() (scsi disk), one lock
> is needed, the simplest way is to add a global lock. Any suggestion?

No.  Because of where you put the new flag, it must be protected by
dev->power.lock.  And this means the iterative implementation shown
above can't be used as is.  It will have to be more like this:

void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
{
	spin_lock_irq(&dev->power.lock);
	dev->power.memalloc_noio_resume = enable;

	while (dev->parent) {
		spin_unlock_irq(&dev->power.lock);
		dev = dev->parent;

		spin_lock_irq(&dev->power.lock);
		/*
		 * Don't clear the parent's flag if any of the
		 * parent's children have their flag set.
		 */
		if (!enable && device_for_each_child(dev->parent, NULL,
				dev_memalloc_noio))
			break;
		dev->power.memalloc_noio_resume = enable;
	}
	spin_unlock_irq(&dev->power.lock);
}

> >         Even without a race, access to bitfields is not SMP-safe
> >         without locking.
> 
> You mean one ancestor device might not be in active when
> one of its descendants is being probed or removed?

No.  Consider this example:

	struct foo {
		int a:1;
		int b:1;
	} x;

Consider what happens if CPU 0 does "x.a = 1" at the same time as 
another CPU 1 does "x.b = 1".  The compiler might produce object code 
looking like this for CPU 0:

	move	x, reg1
	or	0x1, reg1
	move	reg1, x

and this for CPU 1:

	move	x, reg2
	or	0x2, reg2
	move	reg2, x

With no locking, the two "or" instructions could execute 
simultaneously.  What will the final value of x be?

The two CPUs will interfere, even though they are touching different 
bitfields.

Alan Stern

Ming Lei Oct. 23, 2012, 3:18 p.m. UTC | #4
On Tue, Oct 23, 2012 at 10:46 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Tue, 23 Oct 2012, Ming Lei wrote:
>
>> On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> >
>> > Tail recursion should be implemented as a loop, not as an explicit
>> > recursion.  That is, the function should be:
>> >
>> > void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
>> > {
>> >         do {
>> >                 dev->power.memalloc_noio_resume = enable;
>> >
>> >                 if (!enable) {
>> >                         /*
>> >                          * Don't clear the parent's flag if any of the
>> >                          * parent's children have their flag set.
>> >                          */
>> >                         if (device_for_each_child(dev->parent, NULL,
>> >                                           dev_memalloc_noio))
>> >                                 return;
>> >                 }
>> >                 dev = dev->parent;
>> >         } while (dev);
>> > }
>>
>> OK, will take the non-recursion implementation for saving kernel
>> stack space.
>>
>> >
>> > except that you need to add locking, for two reasons:
>> >
>> >         There's a race.  What happens if another child sets the flag
>> >         between the time device_for_each_child() runs and the next loop
>> >         iteration?
>>
>> Yes, I know the race, and not adding a lock because the function
>> is mostly called in .probe() or .remove() callback and its parent's device
>> lock is held to avoid this race.
>>
>> Considered that it may be called in async probe() (scsi disk), one lock
>> is needed, the simplest way is to add a global lock. Any suggestion?
>
> No.  Because of where you put the new flag, it must be protected by
> dev->power.lock.  And this means the iterative implementation shown
> above can't be used as is.  It will have to be more like this:
>
> void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> {
>         spin_lock_irq(&dev->power.lock);
>         dev->power.memalloc_noio_resume = enable;
>
>         while (dev->parent) {
>                 spin_unlock_irq(&dev->power.lock);
>                 dev = dev->parent;
>
>                 spin_lock_irq(&dev->power.lock);
>                 /*
>                  * Don't clear the parent's flag if any of the
>                  * parent's children have their flag set.
>                  */
>                 if (!enable && device_for_each_child(dev->parent, NULL,

s/dev->parent/dev

>                                 dev_memalloc_noio))
>                         break;
>                 dev->power.memalloc_noio_resume = enable;
>         }
>         spin_unlock_irq(&dev->power.lock);
> }

Given the non-SMP-safe bitfield access, power.lock should be
held, but that is not enough to prevent children from being probed or
disconnected concurrently. It looks like another lock is still needed;
I think a global lock is OK in this infrequent path.

>
>> >         Even without a race, access to bitfields is not SMP-safe
>> >         without locking.
>>
>> You mean one ancestor device might not be in active when
>> one of its descendants is being probed or removed?
>
> No.  Consider this example:
>
>         struct foo {
>                 int a:1;
>                 int b:1;
>         } x;
>
> Consider what happens if CPU 0 does "x.a = 1" at the same time as
> another CPU 1 does "x.b = 1".  The compiler might produce object code
> looking like this for CPU 0:
>
>         move    x, reg1
>         or      0x1, reg1
>         move    reg1, x
>
> and this for CPU 1:
>
>         move    x, reg2
>         or      0x2, reg2
>         move    reg2, x
>
> With no locking, the two "or" instructions could execute
> simultaneously.  What will the final value of x be?
>
> The two CPUs will interfere, even though they are touching different
> bitfields.

Got it, thanks for your detailed explanation.

It looks like the problem is worse than the above: not only bitfields are
affected, adjacent fields may be involved too, see:

           http://lwn.net/Articles/478657/


Thanks,
--
Ming Lei
Alan Stern Oct. 23, 2012, 6:16 p.m. UTC | #5
On Tue, 23 Oct 2012, Ming Lei wrote:

> With the problem of non-SMP-safe bitfields access, the power.lock should
> be held, but that is not enough to prevent children from being probed or
> disconnected. Looks another lock is still needed. I think a global lock
> is OK in the infrequent path.

Agreed.

> Got it, thanks for your detailed explanation.
> 
> Looks the problem is worse than above, not only bitfields are affected, the
> adjacent fields might be involved too, see:
> 
>            http://lwn.net/Articles/478657/

Linus made it clear (in various emails at the time) that the kernel
requires the compiler not to do the sort of things discussed in that
article.  But even the restrictions he wanted would not prevent
adjacent bitfields from interfering with each other.

Alan Stern

David Laight Oct. 24, 2012, 9:06 a.m. UTC | #6
> Looks the problem is worse than above, not only bitfields are affected, the
> adjacent fields might be involved too, see:
> 
>            http://lwn.net/Articles/478657/

Not mentioned in there is that even with x86/amd64 given
a struct with the following adjacent fields:
	char a;
	char b;
	char c;
then foo->b |= 0x80; might do a 32bit RMW cycle.
This will (well might - but probably does) happen
if compiled to a 'BTS' instruction.
The x86 instruction set docs are actually unclear
as to whether the 32bit cycle might even be misaligned!
amd64 might do a 64bit cycle (not checked the docs).

	David



Alan Cox Oct. 24, 2012, 12:26 p.m. UTC | #7
On Wed, 24 Oct 2012 10:06:36 +0100
"David Laight" <David.Laight@ACULAB.COM> wrote:

> > Looks the problem is worse than above, not only bitfields are affected, the
> > adjacent fields might be involved too, see:
> > 
> >            http://lwn.net/Articles/478657/
> 
> Not mentioned in there is that even with x86/amd64 given
> a struct with the following adjacent fields:
> 	char a;
> 	char b;
> 	char c;
> then foo->b |= 0x80; might do a 32bit RMW cycle.

There are processors that will do this for the char case at least as they
do byte ops by a mix of 32bit ops and rotate.

> This will (well might - but probably does) happen
> if compiled to a 'BTS' instruction.
> The x86 instruction set docs are actually unclear
> as to whether the 32bit cycle might even be misaligned!
> amd64 might do a 64bit cycle (not checked the docs).

Even with a suitably aligned field the compiler is at liberty to generate
things like

	reg = 0x80
	reg |= foo->b
	foo->b = reg;

One reason it's a good idea to use set_bit/test_bit and friends.

Alan

Patch

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3148b10..a75eeca 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -124,6 +124,59 @@  unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
 
+static int dev_memalloc_noio(struct device *dev, void *data)
+{
+	if (dev->power.memalloc_noio_resume)
+		return 1;
+	return 0;
+}
+
+/*
+ * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
+ * @dev: Device to handle.
+ * @enable: True for setting the flag and False for clearing the flag.
+ *
+ * Set the flag for all devices in the path from the device to the
+ * root device in the device tree if @enable is true, otherwise clear
+ * the flag for devices in the path whose siblings don't set the flag.
+ *
+ * The function should only be called by block device or network
+ * device drivers for solving the deadlock problem during runtime
+ * resume:
+ * 	if memory allocation with GFP_KERNEL is called inside the runtime
+ * 	resume callback of any one of its ancestors (or the block device
+ * 	itself), the deadlock may be triggered inside the memory
+ * 	allocation since it might not complete until the block device
+ * 	becomes active and the involved page I/O finishes. The situation
+ * 	was first pointed out by Alan Stern. Network devices are involved
+ * 	in iSCSI kinds of situations.
+ *
+ * No lock is provided in the function for handling the hotplug race
+ * because pm_runtime_set_memalloc_noio(false) is called in the parent's
+ * remove path.
+ */
+void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
+{
+	dev->power.memalloc_noio_resume = enable;
+
+	if (!dev->parent)
+		return;
+
+	if (enable) {
+		pm_runtime_set_memalloc_noio(dev->parent, 1);
+	} else {
+		/* Only clear the flag for a device if none of
+		 * that device's children have the flag set.
+		 */
+		if (device_for_each_child(dev->parent, NULL,
+					  dev_memalloc_noio))
+			return;
+
+		pm_runtime_set_memalloc_noio(dev->parent, 0);
+	}
+}
+EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);
+
 /**
  * rpm_check_suspend_allowed - Test whether a device may be suspended.
  * @dev: Device to test.
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 007e687..5b0ee4d 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -538,6 +538,7 @@  struct dev_pm_info {
 	unsigned int		irq_safe:1;
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
+	unsigned int		memalloc_noio_resume:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index f271860..775e063 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -47,6 +47,7 @@  extern void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
 extern unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
 extern void pm_runtime_update_max_time_suspended(struct device *dev,
 						 s64 delta_ns);
+extern void pm_runtime_set_memalloc_noio(struct device *dev, bool enable);
 
 static inline bool pm_children_suspended(struct device *dev)
 {
@@ -149,6 +150,8 @@  static inline void pm_runtime_set_autosuspend_delay(struct device *dev,
 						int delay) {}
 static inline unsigned long pm_runtime_autosuspend_expiration(
 				struct device *dev) { return 0; }
+static inline void pm_runtime_set_memalloc_noio(struct device *dev,
+						bool enable){}
 
 #endif /* !CONFIG_PM_RUNTIME */