[git,pull] device mapper fixes for 6.7-rc2

Message ID ZVjgpLACW4/0NkBB@redhat.com (mailing list archive)
State New, archived

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-6.7/dm-fixes

Message

Mike Snitzer Nov. 18, 2023, 4:04 p.m. UTC
Hi Linus,

The following changes since commit b85ea95d086471afb4ad062012a4d73cd328fa86:

  Linux 6.7-rc1 (2023-11-12 16:19:07 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-6.7/dm-fixes

for you to fetch changes up to 13648e04a9b831b3dfa5cf3887dfa6cf8fe5fe69:

  dm-crypt: start allocating with MAX_ORDER (2023-11-17 14:41:15 -0500)

Please pull, thanks.
Mike

----------------------------------------------------------------
- Various fixes for the DM delay target to address regressions
  introduced during the 6.7 merge.

- Fixes to both DM bufio and the verity target for no-sleep mode, to
  address sleeping while atomic issues.

- Update DM crypt target in response to the treewide change that made
  MAX_ORDER inclusive.
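
  For context: with an inclusive MAX_ORDER, a "try the largest
  allocation first, then fall back" loop should start at MAX_ORDER
  itself rather than MAX_ORDER - 1. A minimal userspace sketch of that
  pattern (illustrative constants only, not the actual dm-crypt code):

  #include <stdio.h>

  #define PAGE_SHIFT 12	/* assume 4 KiB pages */
  #define MAX_ORDER  10	/* inclusive: orders 0..10 are all valid */

  /* Toy stand-in for a "big first, fall back to smaller" allocator loop. */
  static void alloc_buffer(unsigned long remaining_pages)
  {
	/* before the treewide change this had to start at MAX_ORDER - 1 */
	unsigned int order = MAX_ORDER;

	while (remaining_pages) {
		while ((1UL << order) > remaining_pages)
			order--;
		printf("allocate order %u (%lu bytes)\n",
		       order, (1UL << order) << PAGE_SHIFT);
		remaining_pages -= 1UL << order;
	}
  }

  int main(void)
  {
	alloc_buffer(1500);	/* e.g. a 1500-page buffer */
	return 0;
  }
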
-----BEGIN PGP SIGNATURE-----

iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmVY35gACgkQxSPxCi2d
A1pt/gf+NVN/3N0BMYlc/L129+QCIreDH/PuhoxVxMLrhZ8oERw7XkgGIu5JoR1C
Sec5OWJWV0+AZ3kSeW8WBoFFGDdfTYQTId2R3mF3a+34mpAvjTMqnvGkA1GrixCt
xvj2cQjQkBWwwbikAkHZOUoK3UEPFurk/a7oXTAl43kZaV6gWHEAY/bxet9Wuc7+
NWN6xEg7sHhdVgJYltGkOl8Pf9y3zPDUT8SqFT0J7dPEGytNHscozDdRJiLmvoS6
g4WkMkrg87Ude7g0/l7hCM9b3ElamtMsZtyy43cvaKp3UrwFiAzFMCprEl4TS5Br
LNhuwegIBWMVCklRiTDG+uHnbMxKZg==
=9Lfg
-----END PGP SIGNATURE-----

----------------------------------------------------------------
Mikulas Patocka (6):
      dm-delay: fix a race between delay_presuspend and delay_bio
      dm-delay: fix bugs introduced by kthread mode
      dm-delay: avoid duplicate logic
      dm-bufio: fix no-sleep mode
      dm-verity: don't use blocking calls from tasklets
      dm-crypt: start allocating with MAX_ORDER

 drivers/md/dm-bufio.c         |  87 ++++++++++++++++++++++----------
 drivers/md/dm-crypt.c         |   2 +-
 drivers/md/dm-delay.c         | 112 ++++++++++++++++++++----------------------
 drivers/md/dm-verity-fec.c    |   4 +-
 drivers/md/dm-verity-target.c |  23 ++++-----
 drivers/md/dm-verity.h        |   2 +-
 6 files changed, 130 insertions(+), 100 deletions(-)

Comments

Linus Torvalds Nov. 18, 2023, 6:33 p.m. UTC | #1
On Sat, 18 Nov 2023 at 08:04, Mike Snitzer <snitzer@kernel.org> wrote:
>
> - Update DM crypt target in response to the treewide change that made
>   MAX_ORDER inclusive.

Your fix is obviously correct; it addresses an unfortunate semantic
conflict that I (and, in my defense, apparently linux-next) missed this
merge window.

But now that I notice the mis-merge I also think the original change
may just have been wrong.

Not only did it change MAX_ORDER semantics from their historical
definition, the *argument* for changing it is bogus.

That commit claims that the traditional MAX_ORDER definition was
counter-intuitive, and that was really the *only* argument for the
change.

But the thing is, 0..MAX-1 is often the *norm* in the kernel, because
we count from 0 and often have max values determined by powers of two,
etc.

Just in the mm, we have MPOL_MAX, MAX_NUMNODES, KM_MAX_IDX,
MAX_SWAPFILES, MAX_NR_GENS, COMPACT_CLUSTER_MAX, MAX_LRU_BATCH, and
probably others.  And as far as I can tell, *none* of them are any
kind of "inclusive" max (that's from a very quick grep for 'MAX', I'm
sure I missed some, and maybe I missed some case where it was actually
inclusive).
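
IOW, the usual pattern is something like this (made-up value, just to
illustrate the convention, not a quote of any particular mm code):

  #include <stdio.h>

  #define MAX_SWAPFILES 32	/* illustrative value only */

  /* "Exclusive max" style: the constant is both the array size and the loop bound. */
  static int swap_prio[MAX_SWAPFILES];

  int main(void)
  {
	for (int i = 0; i < MAX_SWAPFILES; i++)	/* valid entries are 0 .. MAX-1 */
		swap_prio[i] = -1;
	printf("%d swap slots initialised\n", MAX_SWAPFILES);
	return 0;
  }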

So the dm fix in commit 13648e04a9b8 ("dm-crypt: start allocating with
MAX_ORDER") is clearly and obviously a fix, and I have pulled this,
but the more I look at it, the more I think that commit 23baf831a32c
("mm, treewide: redefine MAX_ORDER sanely") was just *complete*
garbage.

Calling the old MAX_ORDER insane (by implication: "redefine sanely")
and counter-intuitive is clearly bogus. It's neither unusual, insane,
_nor_ counter-intuitive, judging by all the other similar cases we have.

Andrew, Kirill - I'm inclined to just revert that commit (and the new
dm fix it resulted in), unless you can back it up with more than a
completely bogus commit message.

What are the alleged "number of bugs all over the kernel" that the old
MAX_ORDER definition had? In light of the other "MAX" definitions I
found from a second of grepping, I really think the argument for that
was just wrong.

And MAX_ORDER certainly is old. I went back in history, and the MAX_ORDER
define comes from 2.3.27pre6, where we renamed NR_MEM_LISTS to
MAX_ORDER.

And that was back in 1999.

So we have literally 24 years of MAX_ORDER that was upended for some
very questionable reasons. And changing the meaning of it
*immediately* caused a bug.

                   Linus
pr-tracker-bot@kernel.org Nov. 18, 2023, 6:36 p.m. UTC | #2
The pull request you sent on Sat, 18 Nov 2023 11:04:52 -0500:

> git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-6.7/dm-fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/05aa69b096a089dc85391e36ccdce76961694e22

Thank you!
Linus Torvalds Nov. 18, 2023, 7:14 p.m. UTC | #3
On Sat, 18 Nov 2023 at 10:33, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What are the alleged "number of bugs all over the kernel" that the old
> MAX_ORDER definition had? In light of the other "MAX" definitions I
> found from a second of grepping, I really think the argument for that
> was just wrong.

Bah. I looked at the commits leading up to it, and I really think that
"if that's the fallout from 24 years of history, then it wasn't a huge
deal".

Compared to the inevitable problems that the semantic change will
cause for backports (and already caused for this merge window), I
still think this was all likely a mistake.

In fact, the biggest takeaway from those patches is that we probably
should have introduced some kind of "this is the max _size_ you can
allocate", because a few of them were related to that particular
calculation.

The floppy driver is actually a good example, in that that is
*exactly* what it wanted, but it had just been done wrong as
  #define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT)

and variations of that are what half the fixes were...

IOW, I really think we should just have added a

  #define MAX_GFP_SIZE (1UL << (PAGE_SHIFT+MAX_ORDER-1))

and the rest were basically all just the trivial off-by-one things.
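
To make the off-by-one concrete, here's a quick sketch using the *old*
exclusive MAX_ORDER (values are illustrative, and MAX_GFP_SIZE is just
the helper name suggested above, not an existing macro):

  #include <stdio.h>

  #define PAGE_SHIFT 12	/* assume 4 KiB pages */
  #define MAX_ORDER  11	/* old exclusive definition: largest valid order is 10 */

  /* The floppy-style formula: a full order too big (no order-11 allocation exists). */
  #define MAX_LEN      (1UL << MAX_ORDER << PAGE_SHIFT)

  /* The suggested helper: the largest size one allocation can actually return. */
  #define MAX_GFP_SIZE (1UL << (PAGE_SHIFT + MAX_ORDER - 1))

  int main(void)
  {
	printf("MAX_LEN      = %lu bytes (8 MiB, wrong)\n", MAX_LEN);
	printf("MAX_GFP_SIZE = %lu bytes (4 MiB, the real order-10 maximum)\n",
	       MAX_GFP_SIZE);
	return 0;
  }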

The __iommu_dma_alloc_pages() code is just so odd that any MAX_ORDER
issues are more about "WTH is this code doing?". And the "fix" is just
more of the same.

The *whole* point of having the max for a power-of-two thing is that
you can do things like

        mask = (1<<MAX)-1;

and that's exactly what the iommu code was trying to do. Except for
some *unfathomable* reason the iommu code did

        order_mask &= (2U << MAX_ORDER) - 1;

and honestly, you can't blame MAX_ORDER for the confusion. What a
strange off-by-one. MAX_ORDER was *literally* designed to be easy for
this case, and people screwed it up.
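
A quick sketch of the two masks under the old exclusive MAX_ORDER
(illustrative, not the actual iommu code):

  #include <stdio.h>

  #define MAX_ORDER 11	/* old exclusive definition: valid orders are 0..10 */

  int main(void)
  {
	/* what the code presumably wanted: one bit per valid order, 0..MAX_ORDER-1 */
	unsigned int intended = (1U << MAX_ORDER) - 1;	/* 0x7ff */

	/* what __iommu_dma_alloc_pages() actually did: one extra, invalid bit */
	unsigned int actual   = (2U << MAX_ORDER) - 1;	/* 0xfff */

	printf("intended mask: %#x\n", intended);
	printf("iommu mask:    %#x\n", actual);
	return 0;
  }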

And no, it's *not* because "inclusive is more intuitive". As
mentioned, this whole "it's past the end" is the common case for MAX
values in the mm, for exactly these kinds of reasons. I already
listed several other cases.

The reason there were 9 bugs is because MAX_ORDER is an old thing, and
testing doesn't catch it because nobody triggers the max case. The
crazy iommu code goes back to 2016, for example.

That

    84 files changed, 223 insertions(+), 253 deletions(-)

really seems bogus for the 9 fixes that had accumulated in this area
in the last 24 years.

Oh well. Now we have places with 'MAX_ORDER + 1' instead of 'MAX_ORDER
- 1'. I'm not seeing that it's a win, and I do think "semantic change
after 24 years" is a loss and is going to cause problems.

               Linus
Kirill A . Shutemov Nov. 19, 2023, 3:21 p.m. UTC | #4
On Sat, Nov 18, 2023 at 11:14:25AM -0800, Linus Torvalds wrote:
> On Sat, 18 Nov 2023 at 10:33, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > What are the alleged "number of bugs all over the kernel" that the old
> > MAX_ORDER definition had? In light of the other "MAX" definitions I
> > found from a second of grepping, I really think the argument for that
> > was just wrong.
> 
> Bah. I looked at the commits leading up to it, and I really think that
> "if that's the fallout from 24 years of history, then it wasn't a huge
> deal".

That's the part that hit upstream and never got caught. I personally spent
some time debugging due to the MAX_ORDER-1 thing.

> Compared to the inevitable problems that the semantic change will
> cause for backports (and already caused for this merge window), I
> still think this was all likely a mistake.

I think the mistake I made was leaving the name the same. Just renaming it
to MAX_PAGE_ORDER or something would give a heads-up to not-yet-upstream
code.

> In fact, the biggest takeaway from those patches is that we probably
> should have introduced some kind of "this is the max _size_ you can
> allocate", because a few of them were related to that particular
> calculation.
> 
> The floppy driver is actually a good example, in that that is
> *exactly* what it wanted, but it had just been done wrong as
> 
>   #define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT)
> 
> and variations of that are what half the fixes were...
> 
> IOW, I really think we should just have added a
> 
>   #define MAX_GFP_SIZE (1UL << (PAGE_SHIFT+MAX_ORDER-1))
> 
> and the rest were basically all just the trivial off-by-one things.
> 
> The __iommu_dma_alloc_pages() code is just so odd that any MAX_ORDER
> issues are more about "WTH is this code doing?". And the "fix" is just
> more of the same.
> 
> The *whole* point of having the max for a power-of-two thing is that
> you can do things like
> 
>         mask = (1<<MAX)-1;
> 
> and that's exactly what the iommu code was trying to do. Except for
> some *unfathomable* reason the iommu code did
> 
>         order_mask &= (2U << MAX_ORDER) - 1;
> 
> and honestly, you can't blame MAX_ORDER for the confusion. What a
> strange off-by-one. MAX_ORDER was *literally* designed to be easy for
> this case, and people screwed it up.

Agreed.

> And no, it's *not* because "inclusive is more intuitive". As
> mentioned, this whole "it's past the end" is the common case for MAX
> values in the mm, for exactly these kinds of reasons. I already
> listed several other cases.
> 
> The reason there were 9 bugs is because MAX_ORDER is an old thing, and
> testing doesn't catch it because nobody triggers the max case. The
> crazy iommu code goes back to 2016, for example.
> 
> That
> 
>     84 files changed, 223 insertions(+), 253 deletions(-)
> 
> really seems bogus for the 9 fixes that had accumulated in this area
> in the last 24 years.
> 
> Oh well. Now we have places with 'MAX_ORDER + 1' instead of 'MAX_ORDER
> - 1'. I'm not seeing that it's a win, and I do think "semantic change
> after 24 years" is a loss and is going to cause problems.

I think it is worth introducing NR_ORDERS for MAX_ORDER + 1 to define how
many page orders the page allocator supports. Sounds good?
Kirill A . Shutemov Nov. 20, 2023, 1:31 p.m. UTC | #5
On Sun, Nov 19, 2023 at 06:21:36PM +0300, Kirill A. Shutemov wrote:
> > Oh well. Now we have places with 'MAX_ORDER + 1' instead of 'MAX_ORDER
> > - 1'. I'm not seeing that it's a win, and I do think "semantic change
> > after 24 years" is a loss and is going to cause problems.
> 
> I think it is worth introducing NR_ORDERS for MAX_ORDER + 1 to define how
> many page orders the page allocator supports. Sounds good?

Something like this:

From 124d43bddb274e97de0ca8b125c0a41030b3ac57 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 20 Nov 2023 15:14:29 +0300
Subject: [PATCH] MM: Introduce NR_ORDERS

NR_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.

NR_ORDERS assists in defining arrays of page orders and allows for more
natural iteration over them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 .../admin-guide/kdump/vmcoreinfo.rst          |  4 ++--
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |  2 +-
 arch/sparc/kernel/traps_64.c                  |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_device_test.c   |  2 +-
 drivers/gpu/drm/ttm/ttm_pool.c                | 20 +++++++++----------
 include/drm/ttm/ttm_pool.h                    |  2 +-
 include/linux/mmzone.h                        |  6 ++++--
 kernel/crash_core.c                           |  2 +-
 lib/test_meminit.c                            |  2 +-
 mm/compaction.c                               |  2 +-
 mm/kmsan/init.c                               |  2 +-
 mm/page_alloc.c                               | 13 ++++++------
 mm/page_reporting.c                           |  2 +-
 mm/show_mem.c                                 |  8 ++++----
 mm/vmstat.c                                   | 12 +++++------
 15 files changed, 41 insertions(+), 40 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 78e4d2e7ba14..e32d3214b4a0 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
 
-Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
+Each zone has a free_area structure array called free_area[NR_ORDERS].
 The free_list represents a linked list of free page blocks.
 
 (list_head, next|prev)
@@ -189,7 +189,7 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
 information. Makedumpfile gets the start address of the vmalloc region
 from this.
 
-(zone.free_area, MAX_ORDER + 1)
+(zone.free_area, NR_ORDERS)
 -------------------------------
 
 Free areas descriptor. User-space tools use this value to iterate the
diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
index fe5472a184a3..47305934b7ff 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -16,7 +16,7 @@ struct hyp_pool {
 	 * API at EL2.
 	 */
 	hyp_spinlock_t lock;
-	struct list_head free_area[MAX_ORDER + 1];
+	struct list_head free_area[NR_ORDERS];
 	phys_addr_t range_start;
 	phys_addr_t range_end;
 	unsigned short max_order;
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 08ffd17d5ec3..c127eab8283f 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)
 
 	/* Now allocate error trap reporting scoreboard. */
 	sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
-	for (order = 0; order <= MAX_ORDER; order++) {
+	for (order = 0; order < NR_ORDERS; order++) {
 		if ((PAGE_SIZE << order) >= sz)
 			break;
 	}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index b1b423b68cdf..f7e8c12ec696 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -175,7 +175,7 @@ static void ttm_device_init_pools(struct kunit *test)
 
 	if (params->pools_init_expected) {
 		for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
-			for (int j = 0; j <= MAX_ORDER; ++j) {
+			for (int j = 0; j < NR_ORDERS; ++j) {
 				pt = pool->caching[i].orders[j];
 				KUNIT_EXPECT_PTR_EQ(test, pt.pool, pool);
 				KUNIT_EXPECT_EQ(test, pt.caching, i);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index fe610a3cace0..701362f015ec 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);
 
 static atomic_long_t allocated_pages;
 
-static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
-static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type global_write_combined[NR_ORDERS];
+static struct ttm_pool_type global_uncached[NR_ORDERS];
 
-static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
-static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type global_dma32_write_combined[NR_ORDERS];
+static struct ttm_pool_type global_dma32_uncached[NR_ORDERS];
 
 static spinlock_t shrinker_lock;
 static struct list_head shrinker_list;
@@ -568,7 +568,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 
 	if (use_dma_alloc || nid != NUMA_NO_NODE) {
 		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-			for (j = 0; j <= MAX_ORDER; ++j)
+			for (j = 0; j < NR_ORDERS; ++j)
 				ttm_pool_type_init(&pool->caching[i].orders[j],
 						   pool, i, j);
 	}
@@ -601,7 +601,7 @@ void ttm_pool_fini(struct ttm_pool *pool)
 
 	if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {
 		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-			for (j = 0; j <= MAX_ORDER; ++j)
+			for (j = 0; j < NR_ORDERS; ++j)
 				ttm_pool_type_fini(&pool->caching[i].orders[j]);
 	}
 
@@ -656,7 +656,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
 	unsigned int i;
 
 	seq_puts(m, "\t ");
-	for (i = 0; i <= MAX_ORDER; ++i)
+	for (i = 0; i < NR_ORDERS; ++i)
 		seq_printf(m, " ---%2u---", i);
 	seq_puts(m, "\n");
 }
@@ -667,7 +667,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
 {
 	unsigned int i;
 
-	for (i = 0; i <= MAX_ORDER; ++i)
+	for (i = 0; i < NR_ORDERS; ++i)
 		seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
 	seq_puts(m, "\n");
 }
@@ -776,7 +776,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
-	for (i = 0; i <= MAX_ORDER; ++i) {
+	for (i = 0; i < NR_ORDERS; ++i) {
 		ttm_pool_type_init(&global_write_combined[i], NULL,
 				   ttm_write_combined, i);
 		ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@@ -816,7 +816,7 @@ void ttm_pool_mgr_fini(void)
 {
 	unsigned int i;
 
-	for (i = 0; i <= MAX_ORDER; ++i) {
+	for (i = 0; i < NR_ORDERS; ++i) {
 		ttm_pool_type_fini(&global_write_combined[i]);
 		ttm_pool_type_fini(&global_uncached[i]);
 
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 30a347e5aa11..8e54bf456981 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -74,7 +74,7 @@ struct ttm_pool {
 	bool use_dma32;
 
 	struct {
-		struct ttm_pool_type orders[MAX_ORDER + 1];
+		struct ttm_pool_type orders[NR_ORDERS];
 	} caching[TTM_NUM_CACHING_TYPES];
 };
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3c25226beeed..9f457ebdaa9a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -34,6 +34,8 @@
 
 #define IS_MAX_ORDER_ALIGNED(pfn) IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)
 
+#define NR_ORDERS (MAX_ORDER + 1)
+
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
@@ -95,7 +97,7 @@ static inline bool migratetype_is_mergeable(int mt)
 }
 
 #define for_each_migratetype_order(order, type) \
-	for (order = 0; order <= MAX_ORDER; order++) \
+	for (order = 0; order < NR_ORDERS; order++) \
 		for (type = 0; type < MIGRATE_TYPES; type++)
 
 extern int page_group_by_mobility_disabled;
@@ -943,7 +945,7 @@ struct zone {
 	CACHELINE_PADDING(_pad1_);
 
 	/* free areas of different sizes */
-	struct free_area	free_area[MAX_ORDER + 1];
+	struct free_area	free_area[NR_ORDERS];
 
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	/* Pages to be accepted. All pages on the list are MAX_ORDER */
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index efe87d501c8c..acd7201b4385 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -802,7 +802,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(list_head, prev);
 	VMCOREINFO_OFFSET(vmap_area, va_start);
 	VMCOREINFO_OFFSET(vmap_area, list);
-	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
+	VMCOREINFO_LENGTH(zone.free_area, NR_ORDERS);
 	log_buf_vmcoreinfo_setup();
 	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
 	VMCOREINFO_NUMBER(NR_FREE_PAGES);
diff --git a/lib/test_meminit.c b/lib/test_meminit.c
index 0ae35223d773..32f44bd61422 100644
--- a/lib/test_meminit.c
+++ b/lib/test_meminit.c
@@ -93,7 +93,7 @@ static int __init test_pages(int *total_failures)
 	int failures = 0, num_tests = 0;
 	int i;
 
-	for (i = 0; i <= MAX_ORDER; i++)
+	for (i = 0; i < NR_ORDERS; i++)
 		num_tests += do_alloc_pages_order(i, &failures);
 
 	REPORT_FAILURES_IN_FN();
diff --git a/mm/compaction.c b/mm/compaction.c
index 01ba298739dd..f819c51d856f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2226,7 +2226,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 
 	/* Direct compactor: Is a suitable page free? */
 	ret = COMPACT_NO_SUITABLE_PAGE;
-	for (order = cc->order; order <= MAX_ORDER; order++) {
+	for (order = cc->order; order < NR_ORDERS; order++) {
 		struct free_area *area = &cc->zone->free_area[order];
 		bool can_steal;
 
diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index ffedf4dbc49d..8aefbcd997f3 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -96,7 +96,7 @@ void __init kmsan_init_shadow(void)
 struct metadata_page_pair {
 	struct page *shadow, *origin;
 };
-static struct metadata_page_pair held_back[MAX_ORDER + 1] __initdata;
+static struct metadata_page_pair held_back[NR_ORDERS] __initdata;
 
 /*
  * Eager metadata allocation. When the memblock allocator is freeing pages to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..0bac7794d33e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1571,7 +1571,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	struct page *page;
 
 	/* Find a page of the appropriate size in the preferred list */
-	for (current_order = order; current_order <= MAX_ORDER; ++current_order) {
+	for (current_order = order; current_order < NR_ORDERS; ++current_order) {
 		area = &(zone->free_area[current_order]);
 		page = get_page_from_free_area(area, migratetype);
 		if (!page)
@@ -1941,7 +1941,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			continue;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		for (order = 0; order <= MAX_ORDER; order++) {
+		for (order = 0; order < NR_ORDERS; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
 			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
@@ -2051,8 +2051,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 	return false;
 
 find_smallest:
-	for (current_order = order; current_order <= MAX_ORDER;
-							current_order++) {
+	for (current_order = order; current_order < NR_ORDERS; current_order++) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
 				start_migratetype, false, &can_steal);
@@ -3007,7 +3006,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		return true;
 
 	/* For a high-order request, check at least one suitable page is free */
-	for (o = order; o <= MAX_ORDER; o++) {
+	for (o = order; o < NR_ORDERS; o++) {
 		struct free_area *area = &z->free_area[o];
 		int mt;
 
@@ -6635,7 +6634,7 @@ bool is_free_buddy_page(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	unsigned int order;
 
-	for (order = 0; order <= MAX_ORDER; order++) {
+	for (order = 0; order < NR_ORDERS; order++) {
 		struct page *page_head = page - (pfn & ((1 << order) - 1));
 
 		if (PageBuddy(page_head) &&
@@ -6690,7 +6689,7 @@ bool take_page_off_buddy(struct page *page)
 	bool ret = false;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order <= MAX_ORDER; order++) {
+	for (order = 0; order < NR_ORDERS; order++) {
 		struct page *page_head = page - (pfn & ((1 << order) - 1));
 		int page_order = buddy_order(page_head);
 
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index b021f482a4cb..14c0782a3711 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -276,7 +276,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 		return err;
 
 	/* Process each free list starting from lowest order/mt */
-	for (order = page_reporting_order; order <= MAX_ORDER; order++) {
+	for (order = page_reporting_order; order < NR_ORDERS; order++) {
 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
 			/* We do not pull pages from the isolate free list */
 			if (is_migrate_isolate(mt))
diff --git a/mm/show_mem.c b/mm/show_mem.c
index ba0808d6917f..77fa459962c6 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -352,8 +352,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 
 	for_each_populated_zone(zone) {
 		unsigned int order;
-		unsigned long nr[MAX_ORDER + 1], flags, total = 0;
-		unsigned char types[MAX_ORDER + 1];
+		unsigned long nr[NR_ORDERS], flags, total = 0;
+		unsigned char types[NR_ORDERS];
 
 		if (zone_idx(zone) > max_zone_idx)
 			continue;
@@ -363,7 +363,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 		printk(KERN_CONT "%s: ", zone->name);
 
 		spin_lock_irqsave(&zone->lock, flags);
-		for (order = 0; order <= MAX_ORDER; order++) {
+		for (order = 0; order < NR_ORDERS; order++) {
 			struct free_area *area = &zone->free_area[order];
 			int type;
 
@@ -377,7 +377,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			}
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
-		for (order = 0; order <= MAX_ORDER; order++) {
+		for (order = 0; order < NR_ORDERS; order++) {
 			printk(KERN_CONT "%lu*%lukB ",
 			       nr[order], K(1UL) << order);
 			if (nr[order])
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 359460deb377..f9ac978ef748 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1059,7 +1059,7 @@ static void fill_contig_page_info(struct zone *zone,
 	info->free_blocks_total = 0;
 	info->free_blocks_suitable = 0;
 
-	for (order = 0; order <= MAX_ORDER; order++) {
+	for (order = 0; order < NR_ORDERS; order++) {
 		unsigned long blocks;
 
 		/*
@@ -1475,7 +1475,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 	int order;
 
 	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-	for (order = 0; order <= MAX_ORDER; ++order)
+	for (order = 0; order < NR_ORDERS; ++order)
 		/*
 		 * Access to nr_free is lockless as nr_free is used only for
 		 * printing purposes. Use data_race to avoid KCSAN warning.
@@ -1504,7 +1504,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pgdat->node_id,
 					zone->name,
 					migratetype_names[mtype]);
-		for (order = 0; order <= MAX_ORDER; ++order) {
+		for (order = 0; order < NR_ORDERS; ++order) {
 			unsigned long freecount = 0;
 			struct free_area *area;
 			struct list_head *curr;
@@ -1544,7 +1544,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
 
 	/* Print header */
 	seq_printf(m, "%-43s ", "Free pages count per migrate type at order");
-	for (order = 0; order <= MAX_ORDER; ++order)
+	for (order = 0; order < NR_ORDERS; ++order)
 		seq_printf(m, "%6d ", order);
 	seq_putc(m, '\n');
 
@@ -2180,7 +2180,7 @@ static void unusable_show_print(struct seq_file *m,
 	seq_printf(m, "Node %d, zone %8s ",
 				pgdat->node_id,
 				zone->name);
-	for (order = 0; order <= MAX_ORDER; ++order) {
+	for (order = 0; order < NR_ORDERS; ++order) {
 		fill_contig_page_info(zone, order, &info);
 		index = unusable_free_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
@@ -2232,7 +2232,7 @@ static void extfrag_show_print(struct seq_file *m,
 	seq_printf(m, "Node %d, zone %8s ",
 				pgdat->node_id,
 				zone->name);
-	for (order = 0; order <= MAX_ORDER; ++order) {
+	for (order = 0; order < NR_ORDERS; ++order) {
 		fill_contig_page_info(zone, order, &info);
 		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%2d.%03d ", index / 1000, index % 1000);
Linus Torvalds Nov. 20, 2023, 5:36 p.m. UTC | #6
On Mon, 20 Nov 2023 at 05:31, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> NR_ORDERS defines the number of page orders supported by the page
> allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.
>
> NR_ORDERS assists in defining arrays of page orders and allows for more
> natural iteration over them.

Well, the thing is, I really think we need to rename or remove
"MAX_ORDER" entirely.

Because as-is, that #define now has completely different meaning than
it used to have for 24 years. Which is not only confusing to people
who might just remember the old semantics, it's a problem for
backporting (and for merging stuff, but that is a fairly temporary
pain and _maybe_ this one-time device mapper thing was the only time
it will ever happen)

               Linus
Andrew Morton Nov. 20, 2023, 5:40 p.m. UTC | #7
On Mon, 20 Nov 2023 09:36:46 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 20 Nov 2023 at 05:31, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > NR_ORDERS defines the number of page orders supported by the page
> > allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.
> >
> > NR_ORDERS assists in defining arrays of page orders and allows for more
> > natural iteration over them.
> 
> Well, the thing is, I really think we need to rename or remove
> "MAX_ORDER" entirely.
> 
> Because as-is, that #define now has completely different meaning than
> it used to have for 24 years. Which is not only confusing to people
> who might just remember the old semantics, it's a problem for
> backporting (and for merging stuff, but that is a fairly temporary
> pain and _maybe_ this one-time device mapper thing was the only time
> it will ever happen)
> 

Yes please.  MAX_ORDER was always a poor name - it implies that it's
inclusive.  Renaming it to NR_ORDERS makes it clearer that the usual
and expected "up to but not including" semantics apply.
Kirill A . Shutemov Nov. 20, 2023, 10:17 p.m. UTC | #8
On Mon, Nov 20, 2023 at 09:40:48AM -0800, Andrew Morton wrote:
> On Mon, 20 Nov 2023 09:36:46 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, 20 Nov 2023 at 05:31, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > NR_ORDERS defines the number of page orders supported by the page
> > > allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.
> > >
> > > NR_ORDERS assists in defining arrays of page orders and allows for more
> > > natural iteration over them.
> > 
> > Well, the thing is, I really think we need to rename or remove
> > "MAX_ORDER" entirely.
> > 
> > Because as-is, that #define now has completely different meaning than
> > it used to have for 24 years. Which is not only confusing to people
> > who might just remember the old semantics, it's a problem for
> > backporting (and for merging stuff, but that is a fairly temporary
> > pain and _maybe_ this one-time device mapper thing was the only time
> > it will ever happen)
> > 
> 
> Yes please.  MAX_ORDER was always a poor name - it implies that it's
> inclusive.  Renaming it to NR_ORDERS makes it clearer that the usual
> and expected "up to but not including" semantics apply.

I personally would prefer to have both values, for different use cases.
What about MAX_PAGE_ORDER and NR_PAGE_ORDERS?

If it is okay, I will prepare patches.
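
A minimal sketch of how the two would be used (names as proposed here,
value purely illustrative, not what is upstream at this point):

  #include <stdio.h>

  #define MAX_PAGE_ORDER 10			/* highest valid order, inclusive */
  #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)	/* number of orders: 0 .. MAX_PAGE_ORDER */

  /* array sizing naturally uses the count ... */
  static unsigned long nr_free[NR_PAGE_ORDERS];

  int main(void)
  {
	/* ... while "largest allocation" style code uses the inclusive maximum */
	for (int order = 0; order <= MAX_PAGE_ORDER; order++)
		nr_free[order] = 0;

	printf("orders 0..%d, %d in total\n", MAX_PAGE_ORDER, NR_PAGE_ORDERS);
	return 0;
  }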