[v2,1/2] mm: cma: allocate cma areas bottom-up

Message ID 20201217201214.3414100-1-guro@fb.com (mailing list archive)
State New, archived
Series [v2,1/2] mm: cma: allocate cma areas bottom-up

Commit Message

Roman Gushchin Dec. 17, 2020, 8:12 p.m. UTC
Currently cma areas without a fixed base are allocated close to the
end of the node. This placement is sub-optimal because of compaction:
it brings pages into the cma area. In particular, it can bring in hot
executable pages, even if there is plenty of free memory on the
machine. This results in cma allocation failures.

Instead, let's place cma areas close to the beginning of a node.
In this case compaction will help to free cma areas, resulting
in better cma allocation success rates.

If there is enough memory, let's try to allocate bottom-up starting
at 4GB to exclude any possible interference with DMA32. On smaller
machines, or in case of a failure, stick with the old behavior.

16GB vm, 2GB cma area:
With this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
[    0.002931] hugetlb_cma: reserved 2048 MiB on node 0

Without this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
[    0.002934] hugetlb_cma: reserved 2048 MiB on node 0

v2:
  - switched to memblock_set_bottom_up(true), by Mike
  - start with 4GB, by Mike

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/cma.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
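
Two details worth noting in the logs above: 0x0000000100000000 is exactly the
4GB mark (the new bottom-up placement), while 0x00000003c0000000 is 15GB,
near the end of the node's memory (the old top-down placement). A cma area
"without a fixed base" is one declared with base == 0, so the allocator picks
the placement itself; a minimal caller-side sketch, where the variable names
and the 2GB size are illustrative (loosely mirroring the hugetlb_cma=2G setup,
not copied from mm/hugetlb.c):

	struct cma *cma_area;
	int err;

	/*
	 * base == 0 means "no fixed base": cma_declare_contiguous_nid()
	 * chooses the placement. This is exactly the path whose direction
	 * the patch flips from top-down to bottom-up.
	 */
	err = cma_declare_contiguous_nid(0,		/* base: not fixed */
					 SZ_2G,		/* size */
					 0,		/* limit: none */
					 PAGE_SIZE,	/* alignment */
					 0,		/* order_per_bit */
					 false,		/* fixed */
					 "hugetlb",	/* name */
					 &cma_area,	/* res_cma */
					 nid);
	if (err)
		pr_warn("cma reservation failed: %d\n", err);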

Comments

Mike Rapoport Dec. 20, 2020, 6:48 a.m. UTC | #1
On Thu, Dec 17, 2020 at 12:12:13PM -0800, Roman Gushchin wrote:
> Currently cma areas without a fixed base are allocated close to the
> end of the node. This placement is sub-optimal because of compaction:
> it brings pages into the cma area. In particular, it can bring in hot
> executable pages, even if there is plenty of free memory on the
> machine. This results in cma allocation failures.
> 
> Instead, let's place cma areas close to the beginning of a node.
> In this case compaction will help to free cma areas, resulting
> in better cma allocation success rates.
> 
> If there is enough memory, let's try to allocate bottom-up starting
> at 4GB to exclude any possible interference with DMA32. On smaller
> machines, or in case of a failure, stick with the old behavior.
> 
> 16GB vm, 2GB cma area:
> With this patch:
> [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
> [    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
> [    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
> [    0.002931] hugetlb_cma: reserved 2048 MiB on node 0
> 
> Without this patch:
> [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
> [    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
> [    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
> [    0.002934] hugetlb_cma: reserved 2048 MiB on node 0
> 
> v2:
>   - switched to memblock_set_bottom_up(true), by Mike
>   - start with 4GB, by Mike
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

With one nit below 

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/cma.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 7f415d7cda9f..21fd40c092f0 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -337,6 +337,22 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
>  			limit = highmem_start;
>  		}
>  
> +		/*
> +		 * If there is enough memory, try a bottom-up allocation first.
> +		 * It will place the new cma area close to the start of the node
> +		 * and guarantee that the compaction is moving pages out of the
> +		 * cma area and not into it.
> +		 * Avoid using first 4GB to not interfere with constrained zones
> +		 * like DMA/DMA32.
> +		 */
> +		if (!memblock_bottom_up() &&
> +		    memblock_end >= SZ_4G + size) {

This seems short enough to fit on a single line

> +			memblock_set_bottom_up(true);
> +			addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
> +							limit, nid, true);
> +			memblock_set_bottom_up(false);
> +		}
> +
>  		if (!addr) {
>  			addr = memblock_alloc_range_nid(size, alignment, base,
>  					limit, nid, true);
> -- 
> 2.26.2
>
Roman Gushchin Dec. 21, 2020, 5:05 p.m. UTC | #2
On Sun, Dec 20, 2020 at 08:48:48AM +0200, Mike Rapoport wrote:
> On Thu, Dec 17, 2020 at 12:12:13PM -0800, Roman Gushchin wrote:
> > Currently cma areas without a fixed base are allocated close to the
> > end of the node. This placement is sub-optimal because of compaction:
> > it brings pages into the cma area. In particular, it can bring in hot
> > executable pages, even if there is plenty of free memory on the
> > machine. This results in cma allocation failures.
> > 
> > Instead, let's place cma areas close to the beginning of a node.
> > In this case compaction will help to free cma areas, resulting
> > in better cma allocation success rates.
> > 
> > If there is enough memory, let's try to allocate bottom-up starting
> > at 4GB to exclude any possible interference with DMA32. On smaller
> > machines, or in case of a failure, stick with the old behavior.
> > 
> > 16GB vm, 2GB cma area:
> > With this patch:
> > [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
> > [    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
> > [    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
> > [    0.002931] hugetlb_cma: reserved 2048 MiB on node 0
> > 
> > Without this patch:
> > [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
> > [    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
> > [    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
> > [    0.002934] hugetlb_cma: reserved 2048 MiB on node 0
> > 
> > v2:
> >   - switched to memblock_set_bottom_up(true), by Mike
> >   - start with 4GB, by Mike
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> With one nit below 
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> > ---
> >  mm/cma.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 7f415d7cda9f..21fd40c092f0 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -337,6 +337,22 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
> >  			limit = highmem_start;
> >  		}
> >  
> > +		/*
> > +		 * If there is enough memory, try a bottom-up allocation first.
> > +		 * It will place the new cma area close to the start of the node
> > +		 * and guarantee that the compaction is moving pages out of the
> > +		 * cma area and not into it.
> > +		 * Avoid using first 4GB to not interfere with constrained zones
> > +		 * like DMA/DMA32.
> > +		 */
> > +		if (!memblock_bottom_up() &&
> > +		    memblock_end >= SZ_4G + size) {
>

Hi Mike!

> This seems short enough to fit on a single line

Indeed. An updated version below.

Thank you for the review of the series!

I assume it's simpler to route both patches through the mm tree.
What do you think?

Thanks!

--

From f88bd0a425c7181bd26a4cf900e6924a7b521419 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Mon, 14 Dec 2020 20:20:52 -0800
Subject: [PATCH v3 1/2] mm: cma: allocate cma areas bottom-up

Currently cma areas without a fixed base are allocated close to the
end of the node. This placement is sub-optimal because of compaction:
it brings pages into the cma area. In particular, it can bring in hot
executable pages, even if there is plenty of free memory on the
machine. This results in cma allocation failures.

Instead, let's place cma areas close to the beginning of a node.
In this case compaction will help to free cma areas, resulting
in better cma allocation success rates.

If there is enough memory, let's try to allocate bottom-up starting
at 4GB to exclude any possible interference with DMA32. On smaller
machines, or in case of a failure, stick with the old behavior.

16GB vm, 2GB cma area:
With this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
[    0.002931] hugetlb_cma: reserved 2048 MiB on node 0

Without this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
[    0.002934] hugetlb_cma: reserved 2048 MiB on node 0

v3:
  - code alignment fix, by Mike
v2:
  - switched to memblock_set_bottom_up(true), by Mike
  - start with 4GB, by Mike

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/cma.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/cma.c b/mm/cma.c
index 20c4f6f40037..4fe74c9d83b0 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -336,6 +336,21 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
 			limit = highmem_start;
 		}
 
+		/*
+		 * If there is enough memory, try a bottom-up allocation first.
+		 * It will place the new cma area close to the start of the node
+		 * and guarantee that the compaction is moving pages out of the
+		 * cma area and not into it.
+		 * Avoid using first 4GB to not interfere with constrained zones
+		 * like DMA/DMA32.
+		 */
+		if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
+			memblock_set_bottom_up(true);
+			addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
+							limit, nid, true);
+			memblock_set_bottom_up(false);
+		}
+
 		if (!addr) {
 			addr = memblock_alloc_range_nid(size, alignment, base,
 					limit, nid, true);
Andrew Morton Dec. 23, 2020, 4:06 a.m. UTC | #3
On Mon, 21 Dec 2020 09:05:51 -0800 Roman Gushchin <guro@fb.com> wrote:

> Subject: [PATCH v3 1/2] mm: cma: allocate cma areas bottom-up

i386 allmodconfig:

In file included from ./include/vdso/const.h:5,
                 from ./include/linux/const.h:4,
                 from ./include/linux/bits.h:5,
                 from ./include/linux/bitops.h:6,
                 from ./include/linux/kernel.h:11,
                 from ./include/asm-generic/bug.h:20,
                 from ./arch/x86/include/asm/bug.h:93,
                 from ./include/linux/bug.h:5,
                 from ./include/linux/mmdebug.h:5,
                 from ./include/linux/mm.h:9,
                 from ./include/linux/memblock.h:13,
                 from mm/cma.c:24:
mm/cma.c: In function ‘cma_declare_contiguous_nid’:
./include/uapi/linux/const.h:20:19: warning: conversion from ‘long long unsigned int’ to ‘phys_addr_t’ {aka ‘unsigned int’} changes value from ‘4294967296’ to ‘0’ [-Woverflow]
 #define __AC(X,Y) (X##Y)
                   ^~~~~~
./include/uapi/linux/const.h:21:18: note: in expansion of macro ‘__AC’
 #define _AC(X,Y) __AC(X,Y)
                  ^~~~
./include/linux/sizes.h:46:18: note: in expansion of macro ‘_AC’
 #define SZ_4G    _AC(0x100000000, ULL)
                  ^~~
mm/cma.c:349:53: note: in expansion of macro ‘SZ_4G’
    addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
                                                     ^~~~~
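
The root cause: SZ_4G is an unsigned long long constant, while phys_addr_t in
this configuration is only 32 bits wide, so the implicit conversion at the
call site truncates 0x100000000 to 0. A stand-alone sketch of the same
diagnostic, with the typedef assumed to mirror i386 without 64-bit physical
addressing:

/* gcc -m32 -Woverflow -c demo.c reproduces the same warning. */
typedef unsigned int phys_addr_t;	/* i386, !CONFIG_PHYS_ADDR_T_64BIT */

#define SZ_4G	0x100000000ULL		/* as in include/linux/sizes.h */

extern phys_addr_t alloc_range(phys_addr_t start);

phys_addr_t demo(void)
{
	/*
	 * The conversion is diagnosed by the compiler front end while
	 * it type-checks the call, so even a statically-false branch
	 * around it would not silence -Woverflow.
	 */
	return alloc_range(SZ_4G);	/* 0x100000000 truncates to 0 */
}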
Roman Gushchin Dec. 23, 2020, 4:35 p.m. UTC | #4
On Tue, Dec 22, 2020 at 08:06:06PM -0800, Andrew Morton wrote:
> On Mon, 21 Dec 2020 09:05:51 -0800 Roman Gushchin <guro@fb.com> wrote:
> 
> > Subject: [PATCH v3 1/2] mm: cma: allocate cma areas bottom-up
> 
> i386 allmodconfig:
> 
> In file included from ./include/vdso/const.h:5,
>                  from ./include/linux/const.h:4,
>                  from ./include/linux/bits.h:5,
>                  from ./include/linux/bitops.h:6,
>                  from ./include/linux/kernel.h:11,
>                  from ./include/asm-generic/bug.h:20,
>                  from ./arch/x86/include/asm/bug.h:93,
>                  from ./include/linux/bug.h:5,
>                  from ./include/linux/mmdebug.h:5,
>                  from ./include/linux/mm.h:9,
>                  from ./include/linux/memblock.h:13,
>                  from mm/cma.c:24:
> mm/cma.c: In function ‘cma_declare_contiguous_nid’:
> ./include/uapi/linux/const.h:20:19: warning: conversion from ‘long long unsigned int’ to ‘phys_addr_t’ {aka ‘unsigned int’} changes value from ‘4294967296’ to ‘0’ [-Woverflow]
>  #define __AC(X,Y) (X##Y)
>                    ^~~~~~
> ./include/uapi/linux/const.h:21:18: note: in expansion of macro ‘__AC’
>  #define _AC(X,Y) __AC(X,Y)
>                   ^~~~
> ./include/linux/sizes.h:46:18: note: in expansion of macro ‘_AC’
>  #define SZ_4G    _AC(0x100000000, ULL)
>                   ^~~
> mm/cma.c:349:53: note: in expansion of macro ‘SZ_4G’
>     addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
>                                                      ^~~~~
> 

I thought that (!memblock_bottom_up() && memblock_end >= SZ_4G + size)
couldn't be true on a 32-bit platform, so the whole if clause would be
compiled out. Maybe it's because memblock_end can be equal to SZ_4G,
and size could be 0...

I have no better idea than wrapping everything into
#if BITS_PER_LONG > 32
#endif.

Thanks!

--

diff --git a/mm/cma.c b/mm/cma.c
index 4fe74c9d83b0..5d69b498603a 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -344,12 +344,14 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
                 * Avoid using first 4GB to not interfere with constrained zones
                 * like DMA/DMA32.
                 */
+#if BITS_PER_LONG > 32
                if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
                        memblock_set_bottom_up(true);
                        addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
                                                        limit, nid, true);
                        memblock_set_bottom_up(false);
                }
+#endif
 
                if (!addr) {
                        addr = memblock_alloc_range_nid(size, alignment, base,
Mike Rapoport Dec. 23, 2020, 10:10 p.m. UTC | #5
On Wed, Dec 23, 2020 at 08:35:37AM -0800, Roman Gushchin wrote:
> On Tue, Dec 22, 2020 at 08:06:06PM -0800, Andrew Morton wrote:
> > On Mon, 21 Dec 2020 09:05:51 -0800 Roman Gushchin <guro@fb.com> wrote:
> > 
> > > Subject: [PATCH v3 1/2] mm: cma: allocate cma areas bottom-up
> > 
> > i386 allmodconfig:
> > 
> > In file included from ./include/vdso/const.h:5,
> >                  from ./include/linux/const.h:4,
> >                  from ./include/linux/bits.h:5,
> >                  from ./include/linux/bitops.h:6,
> >                  from ./include/linux/kernel.h:11,
> >                  from ./include/asm-generic/bug.h:20,
> >                  from ./arch/x86/include/asm/bug.h:93,
> >                  from ./include/linux/bug.h:5,
> >                  from ./include/linux/mmdebug.h:5,
> >                  from ./include/linux/mm.h:9,
> >                  from ./include/linux/memblock.h:13,
> >                  from mm/cma.c:24:
> > mm/cma.c: In function ‘cma_declare_contiguous_nid’:
> > ./include/uapi/linux/const.h:20:19: warning: conversion from ‘long long unsigned int’ to ‘phys_addr_t’ {aka ‘unsigned int’} changes value from ‘4294967296’ to ‘0’ [-Woverflow]
> >  #define __AC(X,Y) (X##Y)
> >                    ^~~~~~
> > ./include/uapi/linux/const.h:21:18: note: in expansion of macro ‘__AC’
> >  #define _AC(X,Y) __AC(X,Y)
> >                   ^~~~
> > ./include/linux/sizes.h:46:18: note: in expansion of macro ‘_AC’
> >  #define SZ_4G    _AC(0x100000000, ULL)
> >                   ^~~
> > mm/cma.c:349:53: note: in expansion of macro ‘SZ_4G’
> >     addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
> >                                                      ^~~~~
> > 
> 
> I thought that (!memblock_bottom_up() && memblock_end >= SZ_4G + size)
> couldn't be true on a 32-bit platform, so the whole if clause would be
> compiled out. Maybe it's because memblock_end can be equal to SZ_4G,
> and size could be 0...
> 
> I have no better idea than wrapping everything into
> #if BITS_PER_LONG > 32
> #endif.

32-bit systems can have more than 32 bits in the physical address.
I think a better option would be to use CONFIG_PHYS_ADDR_T_64BIT.
 
> Thanks!
> 
> --
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 4fe74c9d83b0..5d69b498603a 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -344,12 +344,14 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
>                  * Avoid using first 4GB to not interfere with constrained zones
>                  * like DMA/DMA32.
>                  */
> +#if BITS_PER_LONG > 32
>                 if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
>                         memblock_set_bottom_up(true);
>                         addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
>                                                         limit, nid, true);
>                         memblock_set_bottom_up(false);
>                 }
> +#endif
>  
>                 if (!addr) {
>                         addr = memblock_alloc_range_nid(size, alignment, base,
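
For reference, phys_addr_t follows the configured physical address width
rather than the word size, which is why Mike's suggestion is more precise
than checking BITS_PER_LONG: i386 with PAE selects CONFIG_PHYS_ADDR_T_64BIT
and gets a 64-bit phys_addr_t despite BITS_PER_LONG == 32. A condensed
sketch of the typedef selection (assumed to match include/linux/types.h):

#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;	/* e.g. x86_64, or i386 with PAE */
#else
typedef u32 phys_addr_t;	/* plain 32-bit physical addressing */
#endif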
Roman Gushchin Dec. 28, 2020, 7:36 p.m. UTC | #6
On Thu, Dec 24, 2020 at 12:10:39AM +0200, Mike Rapoport wrote:
> On Wed, Dec 23, 2020 at 08:35:37AM -0800, Roman Gushchin wrote:
> > On Tue, Dec 22, 2020 at 08:06:06PM -0800, Andrew Morton wrote:
> > > On Mon, 21 Dec 2020 09:05:51 -0800 Roman Gushchin <guro@fb.com> wrote:
> > > 
> > > > Subject: [PATCH v3 1/2] mm: cma: allocate cma areas bottom-up
> > > 
> > > i386 allmodconfig:
> > > 
> > > In file included from ./include/vdso/const.h:5,
> > >                  from ./include/linux/const.h:4,
> > >                  from ./include/linux/bits.h:5,
> > >                  from ./include/linux/bitops.h:6,
> > >                  from ./include/linux/kernel.h:11,
> > >                  from ./include/asm-generic/bug.h:20,
> > >                  from ./arch/x86/include/asm/bug.h:93,
> > >                  from ./include/linux/bug.h:5,
> > >                  from ./include/linux/mmdebug.h:5,
> > >                  from ./include/linux/mm.h:9,
> > >                  from ./include/linux/memblock.h:13,
> > >                  from mm/cma.c:24:
> > > mm/cma.c: In function ‘cma_declare_contiguous_nid’:
> > > ./include/uapi/linux/const.h:20:19: warning: conversion from ‘long long unsigned int’ to ‘phys_addr_t’ {aka ‘unsigned int’} changes value from ‘4294967296’ to ‘0’ [-Woverflow]
> > >  #define __AC(X,Y) (X##Y)
> > >                    ^~~~~~
> > > ./include/uapi/linux/const.h:21:18: note: in expansion of macro ‘__AC’
> > >  #define _AC(X,Y) __AC(X,Y)
> > >                   ^~~~
> > > ./include/linux/sizes.h:46:18: note: in expansion of macro ‘_AC’
> > >  #define SZ_4G    _AC(0x100000000, ULL)
> > >                   ^~~
> > > mm/cma.c:349:53: note: in expansion of macro ‘SZ_4G’
> > >     addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
> > >                                                      ^~~~~
> > > 
> > 
> > I thought that (!memblock_bottom_up() && memblock_end >= SZ_4G + size)
> > couldn't be true on a 32-bit platform, so the whole if clause would be
> > compiled out. Maybe it's because memblock_end can be equal to SZ_4G,
> > and size could be 0...
> > 
> > I have no better idea than wrapping everything into
> > #if BITS_PER_LONG > 32
> > #endif.
> 
> 32-bit systems can have more than 32 bits in the physical address.
> I think a better option would be to use CONFIG_PHYS_ADDR_T_64BIT.

I agree. An updated fixup below.

Andrew, can you, please, replace the previous fixup with this one?

Thanks!

--

diff --git a/mm/cma.c b/mm/cma.c
index 4fe74c9d83b0..0ba69cd16aeb 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -344,12 +344,14 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
                 * Avoid using first 4GB to not interfere with constrained zones
                 * like DMA/DMA32.
                 */
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
                if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
                        memblock_set_bottom_up(true);
                        addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
                                                        limit, nid, true);
                        memblock_set_bottom_up(false);
                }
+#endif
 
                if (!addr) {
                        addr = memblock_alloc_range_nid(size, alignment, base,
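
Putting the v3 patch and this fixup together, the bottom-up attempt in
cma_declare_contiguous_nid() ends up looking roughly like this (a
reconstruction from the diffs in this thread, not a quote of the final
upstream commit; the rest of the function's fallback handling is omitted):

		/*
		 * If there is enough memory, try a bottom-up allocation first.
		 * It will place the new cma area close to the start of the node
		 * and guarantee that the compaction is moving pages out of the
		 * cma area and not into it.
		 * Avoid using first 4GB to not interfere with constrained zones
		 * like DMA/DMA32.
		 */
#ifdef CONFIG_PHYS_ADDR_T_64BIT
		if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
			memblock_set_bottom_up(true);
			addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
							limit, nid, true);
			memblock_set_bottom_up(false);
		}
#endif

		/* Existing top-down fallback, unchanged. */
		if (!addr)
			addr = memblock_alloc_range_nid(size, alignment, base,
							limit, nid, true);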

Patch

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7cda9f..21fd40c092f0 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -337,6 +337,22 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
 			limit = highmem_start;
 		}
 
+		/*
+		 * If there is enough memory, try a bottom-up allocation first.
+		 * It will place the new cma area close to the start of the node
+		 * and guarantee that the compaction is moving pages out of the
+		 * cma area and not into it.
+		 * Avoid using first 4GB to not interfere with constrained zones
+		 * like DMA/DMA32.
+		 */
+		if (!memblock_bottom_up() &&
+		    memblock_end >= SZ_4G + size) {
+			memblock_set_bottom_up(true);
+			addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
+							limit, nid, true);
+			memblock_set_bottom_up(false);
+		}
+
 		if (!addr) {
 			addr = memblock_alloc_range_nid(size, alignment, base,
 					limit, nid, true);