Message ID | 1350403183-12650-2-git-send-email-ming.lei@canonical.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
On Tue, 16 Oct 2012 23:59:41 +0800 Ming Lei <ming.lei@canonical.com> wrote: > This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of > 'struct task_struct'), so that the flag can be set by one task > to avoid doing I/O inside memory allocation in the task's context. > > The patch trys to solve one deadlock problem caused by block device, > and the problem may happen at least in the below situations: > > - during block device runtime resume, if memory allocation with > GFP_KERNEL is called inside runtime resume callback of any one > of its ancestors(or the block device itself), the deadlock may be > triggered inside the memory allocation since it might not complete > until the block device becomes active and the involed page I/O finishes. > The situation is pointed out first by Alan Stern. It is not a good > approach to convert all GFP_KERNEL in the path into GFP_NOIO because > several subsystems may be involved(for example, PCI, USB and SCSI may > be involved for usb mass stoarage device) > > - during error handling of usb mass storage deivce, USB bus reset > will be put on the device, so there shouldn't have any > memory allocation with GFP_KERNEL during USB bus reset, otherwise > the deadlock similar with above may be triggered. Unfortunately, any > usb device may include one mass storage interface in theory, so it > requires all usb interface drivers to handle the situation. In fact, > most usb drivers don't know how to handle bus reset on the device > and don't provide .pre_set() and .post_reset() callback at all, so > USB core has to unbind and bind driver for these devices. So it > is still not practical to resort to GFP_NOIO for solving the problem. > > Also the introduced solution can be used by block subsystem or block > drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing > actual I/O transfer. The patch seems reasonable to me. I'd like to see some examples of these resume-time callsite which are performing the GFP_KERNEL allocations, please. You have found some kernel bugs, so those should be fully described. > @@ -1848,6 +1849,16 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) > #define used_math() tsk_used_math(current) > > +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) > +#define memalloc_noio_save(noio_flag) do { \ > + (noio_flag) = current->flags & PF_MEMALLOC_NOIO; \ > + current->flags |= PF_MEMALLOC_NOIO; \ > +} while (0) > +#define memalloc_noio_restore(noio_flag) do { \ > + if (!(noio_flag)) \ > + current->flags &= ~PF_MEMALLOC_NOIO; \ > +} while (0) This is just awful. Why oh why do we write code in macros when we have a nice C compiler? These can all be done as nice, clean, type-safe, documented C functions. And if they can be done that way, they *should* be done that way! And I suggest that a better name for memalloc_noio_save() is memalloc_noio_set(). So this: static inline unsigned memalloc_noio(void) { return current->flags & PF_MEMALLOC_NOIO; } static inline unsigned memalloc_noio_set(unsigned flags) { unsigned ret = memalloc_noio(); current->flags |= PF_MEMALLOC_NOIO; return ret; } static inline unsigned memalloc_noio_restore(unsigned flags) { current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; } (I think that's correct? It's probably more efficient this way). -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 17, 2012 at 4:19 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > The patch seems reasonable to me. I'd like to see some examples of > these resume-time callsite which are performing the GFP_KERNEL > allocations, please. You have found some kernel bugs, so those should > be fully described. There are two examples on 2/3 and 3/3 of the patchset, see below link: http://marc.info/?l=linux-kernel&m=135040325717213&w=2 http://marc.info/?l=linux-kernel&m=135040327317222&w=2 Sorry for not Cc them to linux-mm because I am afraid of making noise in mm list. > > This is just awful. Why oh why do we write code in macros when we have > a nice C compiler? The two helpers are following style of local_irq_save() and local_irq_restore(), so that people can use them easily, that is why I define them as macro instead of inline. > > These can all be done as nice, clean, type-safe, documented C > functions. And if they can be done that way, they *should* be done > that way! > > And I suggest that a better name for memalloc_noio_save() is > memalloc_noio_set(). So this: IMO, renaming as memalloc_noio_set() might not be better than _save because the _set name doesn't indicate that the flag should be stored first. > > static inline unsigned memalloc_noio(void) > { > return current->flags & PF_MEMALLOC_NOIO; > } > > static inline unsigned memalloc_noio_set(unsigned flags) > { > unsigned ret = memalloc_noio(); > > current->flags |= PF_MEMALLOC_NOIO; > return ret; > } > > static inline unsigned memalloc_noio_restore(unsigned flags) > { > current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; > } > > (I think that's correct? It's probably more efficient this way). Yes, it is correct and more clean, and I will take it. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 17, 2012 at 4:19 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > The patch seems reasonable to me. I'd like to see some examples of > these resume-time callsite which are performing the GFP_KERNEL > allocations, please. You have found some kernel bugs, so those should > be fully described. OK, there are two examples of GFP_KERNEL allocation in subsystem runtime resume path: 1), almost all devices in some pci platform acpi_os_allocate <-acpi_ut_allocate <-ACPI_ALLOCATE_ZEROED <-acpi_evaluate_object <-__acpi_bus_set_power <-acpi_bus_set_power <-acpi_pci_set_power_state <-platform_pci_set_power_state <-pci_platform_power_transition <-__pci_complete_power_transition <-pci_set_power_state <-pci_restore_standard_config <-pci_pm_runtime_resume 2), all devices in usb subsystem usb_get_status <-finish_port_resume <-usb_port_resume <-generic_resume <-usb_resume_device <-usb_resume_both <-usb_runtime_resume I also have many examples in which GFP_KERNEL allocation is involved in runtime resume path of individual drivers. The above two examples just show how difficult to solve the problem by traditional way, :-) Also as pointed by Oliver, network driver need memory allocation with no io in iSCSI runtime resume situation too. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
(2012/10/17 0:59), Ming Lei wrote: > This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of > 'struct task_struct'), so that the flag can be set by one task > to avoid doing I/O inside memory allocation in the task's context. > > The patch trys to solve one deadlock problem caused by block device, > and the problem may happen at least in the below situations: > > - during block device runtime resume, if memory allocation with > GFP_KERNEL is called inside runtime resume callback of any one > of its ancestors(or the block device itself), the deadlock may be > triggered inside the memory allocation since it might not complete > until the block device becomes active and the involed page I/O finishes. > The situation is pointed out first by Alan Stern. It is not a good > approach to convert all GFP_KERNEL in the path into GFP_NOIO because > several subsystems may be involved(for example, PCI, USB and SCSI may > be involved for usb mass stoarage device) > > - during error handling of usb mass storage deivce, USB bus reset > will be put on the device, so there shouldn't have any > memory allocation with GFP_KERNEL during USB bus reset, otherwise > the deadlock similar with above may be triggered. Unfortunately, any > usb device may include one mass storage interface in theory, so it > requires all usb interface drivers to handle the situation. In fact, > most usb drivers don't know how to handle bus reset on the device > and don't provide .pre_set() and .post_reset() callback at all, so > USB core has to unbind and bind driver for these devices. So it > is still not practical to resort to GFP_NOIO for solving the problem. > > Also the introduced solution can be used by block subsystem or block > drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing > actual I/O transfer. > > Cc: Alan Stern <stern@rowland.harvard.edu> > Cc: Oliver Neukum <oneukum@suse.de> > Cc: Jiri Kosina <jiri.kosina@suse.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mel Gorman <mel@csn.ul.ie> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: "Rafael J. Wysocki" <rjw@sisk.pl> > Cc: linux-mm <linux-mm@kvack.org> > Signed-off-by: Minchan Kim <minchan@kernel.org> > Signed-off-by: Ming Lei <ming.lei@canonical.com> > --- > include/linux/sched.h | 11 +++++++++++ > mm/page_alloc.c | 10 +++++++++- > mm/vmscan.c | 13 +++++++++++++ > 3 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index f6961c9..c149ae7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1811,6 +1811,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ > +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ > @@ -1848,6 +1849,16 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) > #define used_math() tsk_used_math(current) > > +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) > +#define memalloc_noio_save(noio_flag) do { \ > + (noio_flag) = current->flags & PF_MEMALLOC_NOIO; \ > + current->flags |= PF_MEMALLOC_NOIO; \ > +} while (0) > +#define memalloc_noio_restore(noio_flag) do { \ > + if (!(noio_flag)) \ > + current->flags &= ~PF_MEMALLOC_NOIO; \ > +} while (0) > + > /* > * task->jobctl flags > */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8e1be1c..e3746dd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2630,10 +2630,18 @@ retry_cpuset: > page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, > zonelist, high_zoneidx, alloc_flags, > preferred_zone, migratetype); > - if (unlikely(!page)) > + if (unlikely(!page)) { > + /* > + * Resume, block IO and its error handling path > + * can deadlock because I/O on the device might not > + * complete. > + */ > + if (unlikely(memalloc_noio())) > + gfp_mask &= ~GFP_IOFS; > page = __alloc_pages_slowpath(gfp_mask, order, > zonelist, high_zoneidx, nodemask, > preferred_zone, migratetype); > + } > > trace_mm_page_alloc(page, order, gfp_mask, migratetype); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1e9aa66..6647805 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -3298,6 +3298,19 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) > }; > unsigned long nr_slab_pages0, nr_slab_pages1; > > + if (unlikely(memalloc_noio())) { > + sc.gfp_mask &= ~GFP_IOFS; > + shrink.gfp_mask = sc.gfp_mask; > + /* > + * We allow to reclaim only clean pages. > + * It can affect RECLAIM_SWAP and RECLAIM_WRITE mode > + * but this is really rare event and allocator can > + * fallback to other zones. > + */ > + sc.may_writepage = 0; > + sc.may_swap = 0; I think the idea is reasonable. I have a request. In current implemententation of vmscan.c, it seems sc.may_writepage, sc.may_swap are handled independent from gfp_mask. So, could you drop changes from this patch and handle these flags in another patch if these flags should be unset if ~GFP_IOFS ? I think try_to_free_page() path's sc.may_xxxx should be handled in the same way. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 17, 2012 at 1:14 PM, Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > I think the idea is reasonable. I have a request. Thanks for your comment. > > In current implemententation of vmscan.c, it seems sc.may_writepage, sc.may_swap > are handled independent from gfp_mask. > > So, could you drop changes from this patch and handle these flags in another patch > if these flags should be unset if ~GFP_IOFS ? OK, I agree. In theory, mm should make sure no I/O is involved if memory allocation users passes ~GFP_IOFS. > > I think try_to_free_page() path's sc.may_xxxx should be handled in the same way. Yes, alloc_page_buffers() and dma_alloc_from_contiguous may drop into the path, so gfp flag should be changed in try_to_free_page() too. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Oct 2012 09:54:09 +0800 Ming Lei <ming.lei@canonical.com> wrote: > On Wed, Oct 17, 2012 at 4:19 AM, Andrew Morton > <akpm@linux-foundation.org> wrote: > > > > The patch seems reasonable to me. I'd like to see some examples of > > these resume-time callsite which are performing the GFP_KERNEL > > allocations, please. You have found some kernel bugs, so those should > > be fully described. > > There are two examples on 2/3 and 3/3 of the patchset, see below link: > > http://marc.info/?l=linux-kernel&m=135040325717213&w=2 > http://marc.info/?l=linux-kernel&m=135040327317222&w=2 > > Sorry for not Cc them to linux-mm because I am afraid of making noise > in mm list. Don't worry about mailing list noise ;) > > > > This is just awful. Why oh why do we write code in macros when we have > > a nice C compiler? > > The two helpers are following style of local_irq_save() and > local_irq_restore(), so that people can use them easily, that is > why I define them as macro instead of inline. local_irq_save() and local_irq_restore() were mistakes :( It's silly to write what appears to be a C function and then have it operate like Pascal (warning: I last wrote some Pascal in 66 B.C.). > > > > These can all be done as nice, clean, type-safe, documented C > > functions. And if they can be done that way, they *should* be done > > that way! > > > > And I suggest that a better name for memalloc_noio_save() is > > memalloc_noio_set(). So this: > > IMO, renaming as memalloc_noio_set() might not be better than _save > because the _set name doesn't indicate that the flag should be stored first. You could add __must_check to the function definition to ensure that all callers save its return value. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 18, 2012 at 7:54 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > local_irq_save() and local_irq_restore() were mistakes :( It's silly to > write what appears to be a C function and then have it operate like > Pascal (warning: I last wrote some Pascal in 66 B.C.). Considered that spin_lock_irqsave/spin_unlock_irqrestore also follow the style, kernel guys have been accustomed to the usage, I am inclined to keep that as macro, :-) >> IMO, renaming as memalloc_noio_set() might not be better than _save >> because the _set name doesn't indicate that the flag should be stored first. > > You could add __must_check to the function definition to ensure that > all callers save its return value. Yes, we can do that, but the function name is not better than _save from readability. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/sched.h b/include/linux/sched.h index f6961c9..c149ae7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1811,6 +1811,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ @@ -1848,6 +1849,16 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) #define used_math() tsk_used_math(current) +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) +#define memalloc_noio_save(noio_flag) do { \ + (noio_flag) = current->flags & PF_MEMALLOC_NOIO; \ + current->flags |= PF_MEMALLOC_NOIO; \ +} while (0) +#define memalloc_noio_restore(noio_flag) do { \ + if (!(noio_flag)) \ + current->flags &= ~PF_MEMALLOC_NOIO; \ +} while (0) + /* * task->jobctl flags */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8e1be1c..e3746dd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2630,10 +2630,18 @@ retry_cpuset: page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, alloc_flags, preferred_zone, migratetype); - if (unlikely(!page)) + if (unlikely(!page)) { + /* + * Resume, block IO and its error handling path + * can deadlock because I/O on the device might not + * complete. + */ + if (unlikely(memalloc_noio())) + gfp_mask &= ~GFP_IOFS; page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); + } trace_mm_page_alloc(page, order, gfp_mask, migratetype); diff --git a/mm/vmscan.c b/mm/vmscan.c index 1e9aa66..6647805 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3298,6 +3298,19 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) }; unsigned long nr_slab_pages0, nr_slab_pages1; + if (unlikely(memalloc_noio())) { + sc.gfp_mask &= ~GFP_IOFS; + shrink.gfp_mask = sc.gfp_mask; + /* + * We allow to reclaim only clean pages. + * It can affect RECLAIM_SWAP and RECLAIM_WRITE mode + * but this is really rare event and allocator can + * fallback to other zones. + */ + sc.may_writepage = 0; + sc.may_swap = 0; + } + cond_resched(); /* * We need to be able to allocate from the reserves for RECLAIM_SWAP