diff mbox series

[1/1] mm/page_alloc: add a warning about high order allocations

Message ID 20181225153927.2873-2-khorenko@virtuozzo.com (mailing list archive)
State New, archived
Headers show
Series mm: add a warning about high order allocations | expand

Commit Message

Konstantin Khorenko Dec. 25, 2018, 3:39 p.m. UTC
Adds sysctl "vm.warn_high_order". If set it will warn about
allocations with order >= vm.warn_high_order.
Prints only 32 warnings at most and skips all __GFP_NOWARN allocations.

The code is under config option, disabled by default.
If enabled, default vm.warn_high_order is 3 (PAGE_ALLOC_COSTLY_ORDER).

Sysctl max value is set to 100 which should be enough to exceed maximum
order (normally MAX_ORDER == 11).

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
---
 kernel/sysctl.c | 15 +++++++++++++++
 mm/Kconfig      | 18 ++++++++++++++++++
 mm/page_alloc.c | 25 +++++++++++++++++++++++++
 3 files changed, 58 insertions(+)

Comments

Konstantin Khorenko Dec. 27, 2018, 4:05 p.m. UTC | #1
On 12/26/2018 11:40 AM, Michal Hocko wrote:
> Appart from general comments as a reply to the cover (btw. this all
> should be in the changelog because this is the _why_ part of the
> justification which should be _always_ part of the changelog).

Thank you, will add in the next version of the patch alltogether
with other changes if any.

> On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
> [...]
>> +config WARN_HIGH_ORDER
>> +	bool "Enable complains about high order memory allocations"
>> +	depends on !LOCKDEP
>
> Why?

LOCKDEP makes structures big, so if we see a high order allocation warning
on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
kernel performance is not our target.
i can remove !LOCKDEP dependence here, but then need to adjust default
warning level i think, or logs will be spammed.

>> +	default n
>> +	help
>> +	  Enables warnings on high order memory allocations. This allows to
>> +	  determine users of large memory chunks and rework them to decrease
>> +	  allocation latency. Note, some debug options make kernel structures
>> +	  fat.
>> +
>> +config WARN_HIGH_ORDER_LEVEL
>> +	int "Define page order level considered as too high"
>> +	depends on WARN_HIGH_ORDER
>> +	default 3
>> +	help
>> +	  Defines page order starting which the system to complain about.
>> +	  Default is current PAGE_ALLOC_COSTLY_ORDER.
>> +
>>  config HWPOISON_INJECT
>>  	tristate "HWPoison pages injector"
>>  	depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e95b5b7c9c3d..258892adb861 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4341,6 +4341,30 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
>>  					ac->high_zoneidx, ac->nodemask);
>>  }
>>
>> +#ifdef CONFIG_WARN_HIGH_ORDER
>> +int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
>> +
>> +/*
>> + * Complain if we allocate a high order page unless there is a __GFP_NOWARN
>> + * flag provided.
>> + *
>> + * Shuts up after 32 complains.
>> + */
>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>> +{
>> +	static atomic_t warn_count = ATOMIC_INIT(32);
>> +
>> +	if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
>> +		WARN(atomic_dec_if_positive(&warn_count) >= 0,
>> +		     "order %d >= %d, gfp 0x%x\n",
>> +		     order, warn_order, gfp_mask);
>> +}
>
> We do have ratelimit functionality, so why cannot you use it?

Well, my idea was to really shut up the warning after some number of messages
(if a node is in production and its uptime, say, a year, i don't want to see
many warnings in logs, first several is enough - let's fix them first).

If i use printk_ratelimited() i could get 24 days delay at most AFAIK,
but okay, i can switch to printk_ratelimited() if you prefer.

struct ratelimit_state {
         int             interval;

(gdb) p ((1LL<<31) -1)/1000/60/60/24
$11 = 24

>> +#else
>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>> +{
>> +}
>> +#endif
>> +
>>  /*
>>   * This is the 'heart' of the zoned buddy allocator.
>>   */
>> @@ -4361,6 +4385,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>>  		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
>>  		return NULL;
>>  	}
>> +	warn_high_order(order, gfp_mask);
>>
>>  	gfp_mask &= gfp_allowed_mask;
>>  	alloc_mask = gfp_mask;
>
> Why do you warn about all allocations in the hot path? I thought you
> want to catch expensive allocations so I would assume that you would
> stick that into a slow path after we are not able to allocate anything
> after the first round of compaction.

The idea is to catch big allocations soon and preferably during testing,
not on production nodes under high load, that's why i've chosen hot path.

And if we switch to the slow path, we'll have to run all tests under
additional serious load - to generate memory fragmentation.
Not so convenient.

> Also do you want to warn about opportunistic GFP_NOWAIT allocations that
> have a reasonable fallback?

Yes, i would like to. Sometimes allocation flags come from upper level and
it's not always evident if there is GFP_NOWAIT flag or not, so let's we
are noticed about such a case and verify if it's legal and not called often.
If yes - we'll just mark it with NO_WARN.


 > And forgot to mention other opportunistic allocations like THP of
 > course.

hugepages are allocated with NOWARN already, AFAIK.
alloc_fresh_huge_page_node(), __hugetlb_alloc_buddy_huge_page()


Thank you once again, Michal.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team
Michal Hocko Dec. 27, 2018, 4:50 p.m. UTC | #2
On Thu 27-12-18 16:05:18, Konstantin Khorenko wrote:
> On 12/26/2018 11:40 AM, Michal Hocko wrote:
> > Appart from general comments as a reply to the cover (btw. this all
> > should be in the changelog because this is the _why_ part of the
> > justification which should be _always_ part of the changelog).
> 
> Thank you, will add in the next version of the patch alltogether
> with other changes if any.
> 
> > On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
> > [...]
> >> +config WARN_HIGH_ORDER
> >> +	bool "Enable complains about high order memory allocations"
> >> +	depends on !LOCKDEP
> >
> > Why?
> 
> LOCKDEP makes structures big, so if we see a high order allocation warning
> on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
> kernel performance is not our target.
> i can remove !LOCKDEP dependence here, but then need to adjust default
> warning level i think, or logs will be spammed.

OK, I see but this just points to how this is not really a suitable
solution for the problem you are looking for.

> >> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> >> +{
> >> +	static atomic_t warn_count = ATOMIC_INIT(32);
> >> +
> >> +	if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
> >> +		WARN(atomic_dec_if_positive(&warn_count) >= 0,
> >> +		     "order %d >= %d, gfp 0x%x\n",
> >> +		     order, warn_order, gfp_mask);
> >> +}
> >
> > We do have ratelimit functionality, so why cannot you use it?
> 
> Well, my idea was to really shut up the warning after some number of messages
> (if a node is in production and its uptime, say, a year, i don't want to see
> many warnings in logs, first several is enough - let's fix them first).

OK, but it is quite likely that the system is perfectly healthy and
unfragmented after fresh boot when doing a large order allocations is
perfectly fine. Note that it is smaller order allocations that generate
fragmentation in general.
Konstantin Khorenko Dec. 28, 2018, 2:45 p.m. UTC | #3
On 12/27/2018 07:50 PM, Michal Hocko wrote:
> On Thu 27-12-18 16:05:18, Konstantin Khorenko wrote:
>> On 12/26/2018 11:40 AM, Michal Hocko wrote:
>>> Appart from general comments as a reply to the cover (btw. this all
>>> should be in the changelog because this is the _why_ part of the
>>> justification which should be _always_ part of the changelog).
>>
>> Thank you, will add in the next version of the patch alltogether
>> with other changes if any.
>>
>>> On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
>>> [...]
>>>> +config WARN_HIGH_ORDER
>>>> +	bool "Enable complains about high order memory allocations"
>>>> +	depends on !LOCKDEP
>>>
>>> Why?
>>
>> LOCKDEP makes structures big, so if we see a high order allocation warning
>> on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
>> kernel performance is not our target.
>> i can remove !LOCKDEP dependence here, but then need to adjust default
>> warning level i think, or logs will be spammed.
>
> OK, I see but this just points to how this is not really a suitable
> solution for the problem you are looking for.

i have to admit, yes, it is inconvenient to have such a dependency.
i would be really glad to hear alternatives...

Just to mention: tracepoints are for issues investigation (when we already have any),
and i'm thinking about issues prevention. Gathering places which might
lead to problem and rework to prevent possible issues.

>>>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>>>> +{
>>>> +	static atomic_t warn_count = ATOMIC_INIT(32);
>>>> +
>>>> +	if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
>>>> +		WARN(atomic_dec_if_positive(&warn_count) >= 0,
>>>> +		     "order %d >= %d, gfp 0x%x\n",
>>>> +		     order, warn_order, gfp_mask);
>>>> +}
>>>
>>> We do have ratelimit functionality, so why cannot you use it?
>>
>> Well, my idea was to really shut up the warning after some number of messages
>> (if a node is in production and its uptime, say, a year, i don't want to see
>> many warnings in logs, first several is enough - let's fix them first).
>
> OK, but it is quite likely that the system is perfectly healthy and
> unfragmented after fresh boot when doing a large order allocations is
> perfectly fine.

You are right again, Michal. Thus just switch those early allocations to
kvmalloc() and on boot on an unfragmented system same kmalloc() will be used.
And no new warning for that place. And no any chance this place ever trigger
compaction (in case we did not notice some way this allocation might also
happen later, not on a fresh node).

 > Note that it is smaller order allocations that generate
 > fragmentation in general.

And again - yes, that's true. Still i don't know how to avoid memory fragmentation,
but can avoid (ok, significantly decrease the number of) big allocations which
might trigger compaction because of that fragmentation.


--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team
diff mbox series

Patch

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..28a8ebfa7a1d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -176,6 +176,10 @@  extern int unaligned_dump_stack;
 extern int no_unaligned_warning;
 #endif
 
+#ifdef CONFIG_WARN_HIGH_ORDER
+extern int warn_order;
+#endif
+
 #ifdef CONFIG_PROC_SYSCTL
 
 /**
@@ -1675,6 +1679,17 @@  static struct ctl_table vm_table[] = {
 		.extra1		= (void *)&mmap_rnd_compat_bits_min,
 		.extra2		= (void *)&mmap_rnd_compat_bits_max,
 	},
+#endif
+#ifdef CONFIG_WARN_HIGH_ORDER
+	{
+		.procname	= "warn_high_order",
+		.data		= &warn_order,
+		.maxlen		= sizeof(warn_order),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 #endif
 	{ }
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index d85e39da47ae..4e2d613d52f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -336,6 +336,24 @@  config MEMORY_FAILURE
 	  even when some of its memory has uncorrected errors. This requires
 	  special hardware support and typically ECC memory.
 
+config WARN_HIGH_ORDER
+	bool "Enable complains about high order memory allocations"
+	depends on !LOCKDEP
+	default n
+	help
+	  Enables warnings on high order memory allocations. This allows to
+	  determine users of large memory chunks and rework them to decrease
+	  allocation latency. Note, some debug options make kernel structures
+	  fat.
+
+config WARN_HIGH_ORDER_LEVEL
+	int "Define page order level considered as too high"
+	depends on WARN_HIGH_ORDER
+	default 3
+	help
+	  Defines page order starting which the system to complain about.
+	  Default is current PAGE_ALLOC_COSTLY_ORDER.
+
 config HWPOISON_INJECT
 	tristate "HWPoison pages injector"
 	depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e95b5b7c9c3d..258892adb861 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4341,6 +4341,30 @@  static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
 					ac->high_zoneidx, ac->nodemask);
 }
 
+#ifdef CONFIG_WARN_HIGH_ORDER
+int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
+
+/*
+ * Complain if we allocate a high order page unless there is a __GFP_NOWARN
+ * flag provided.
+ *
+ * Shuts up after 32 complains.
+ */
+static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
+{
+	static atomic_t warn_count = ATOMIC_INIT(32);
+
+	if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
+		WARN(atomic_dec_if_positive(&warn_count) >= 0,
+		     "order %d >= %d, gfp 0x%x\n",
+		     order, warn_order, gfp_mask);
+}
+#else
+static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
+{
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -4361,6 +4385,7 @@  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
 		return NULL;
 	}
+	warn_high_order(order, gfp_mask);
 
 	gfp_mask &= gfp_allowed_mask;
 	alloc_mask = gfp_mask;