
mm/sparse: Fix flags overlap in section_mem_map

Message ID 20210427083019.110184-1-wangwensheng4@huawei.com (mailing list archive)
State New, archived
Series mm/sparse: Fix flags overlap in section_mem_map

Commit Message

Wang Wensheng April 27, 2021, 8:30 a.m. UTC
The section_mem_map member of struct mem_section stores some flags along
with the address of the struct page array of the mem_section.

Additionally, the node id of the mem_section is stored there during early
boot, when the struct page array has not yet been allocated. In other
words, the higher bits of section_mem_map are used for two purposes, and
the node id must be cleared properly after early boot.

Currently the node id field overlaps the flags field and cannot be
cleared properly. The overlapping bits are then treated as mem_section
flags and may lead to unexpected side effects.

Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
field is always located above the flags field. The overlap arose because
SECTION_NID_SHIFT was not increased when a new mem_section flag was
added.

Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
---
 include/linux/mmzone.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Hildenbrand April 27, 2021, 9:05 a.m. UTC | #1
On 27.04.21 10:30, Wang Wensheng wrote:
> The section_mem_map member of struct mem_section stores some flags along
> with the address of the struct page array of the mem_section.
> 
> Additionally, the node id of the mem_section is stored there during early
> boot, when the struct page array has not yet been allocated. In other
> words, the higher bits of section_mem_map are used for two purposes, and
> the node id must be cleared properly after early boot.
> 
> Currently the node id field overlaps the flags field and cannot be
> cleared properly. The overlapping bits are then treated as mem_section
> flags and may lead to unexpected side effects.
> 
> Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> field is always located above the flags field. The overlap arose because
> SECTION_NID_SHIFT was not increased when a new mem_section flag was
> added.
> 
> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
> ---
>   include/linux/mmzone.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 47946ce..b01694d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
>   #define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
>   #define SECTION_MAP_LAST_BIT		(1UL<<5)
>   #define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT		3
> +#define SECTION_NID_SHIFT		order_base_2(SECTION_MAP_LAST_BIT)
>   
>   static inline struct page *__section_mem_map_addr(struct mem_section *section)
>   {
> 

Well, all sections around during boot that have an early NID are early 
... so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, 
but not broken.

But it is an issue with SECTION_TAINT_ZONE_DEVICE, AFAICT. 
sparse_init_one_section() would leave the bit set if the nid happens to 
have that bit set (e.g., node 2 or 3). It's semi-broken then, because we 
force all pfn_to_online_page() calls through the slow path.


That whole section flag setting code is fragile.
HORIGUCHI NAOYA(堀口 直也) June 23, 2021, 11:09 p.m. UTC | #2
On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> On 27.04.21 10:30, Wang Wensheng wrote:
> > [...]
> 
> Well, all sections around during boot that have an early NID are early ...
> so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, but not
> broken.
> 
> But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
> sparse_init_one_section() would leave the bit set if the nid happens to have
> that bit set (e.g., node 2,3). It's semi-broken then, because we force all
> pfn_to_online_page() through the slow path.
> 
> 
> That whole section flag setting code is fragile.

Hi Wensheng, David,

This patch seems not to have been accepted or updated yet, so what's going on?

We recently saw this exact issue when testing crash utilities with the
latest kernels on a 4-NUMA-node system.  SECTION_TAINT_ZONE_DEVICE seems
to be set wrongly, and makedumpfile can fail because of it, so we need a fix.

I had thought of another approach, shown below, before finding this
thread, so if you are fine with it, I'd like to submit a patch along
these lines. And if you prefer the order_base_2() approach, I'm glad to
ack this patch.

  --- a/include/linux/mmzone.h
  +++ b/include/linux/mmzone.h
  @@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
    *      which results in PFN_SECTION_SHIFT equal 6.
    * To sum it up, at least 6 bits are available.
    */
  +#define SECTION_MAP_LAST_SHIFT         5
   #define SECTION_MARKED_PRESENT         (1UL<<0)
   #define SECTION_HAS_MEM_MAP            (1UL<<1)
   #define SECTION_IS_ONLINE              (1UL<<2)
   #define SECTION_IS_EARLY               (1UL<<3)
   #define SECTION_TAINT_ZONE_DEVICE      (1UL<<4)
  -#define SECTION_MAP_LAST_BIT           (1UL<<5)
  +#define SECTION_MAP_LAST_BIT           (1UL<<SECTION_MAP_LAST_SHIFT)
   #define SECTION_MAP_MASK               (~(SECTION_MAP_LAST_BIT-1))
  -#define SECTION_NID_SHIFT              3
  +#define SECTION_NID_SHIFT              SECTION_MAP_LAST_SHIFT
  
   static inline struct page *__section_mem_map_addr(struct mem_section *section)
   {

Thanks,
Naoya Horiguchi
Dan Williams June 25, 2021, 9:23 p.m. UTC | #3
On Wed, Jun 23, 2021 at 4:10 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> > On 27.04.21 10:30, Wang Wensheng wrote:
> > > [...]
> >
> > [...]
>
> Hi Wensheng, David,
>
> This patch seems not accepted or updated yet, so what's going on?
>
> We recently saw the exact issue on testing crash utilities with latest
> kernels on 4 NUMA node system.  SECTION_TAINT_ZONE_DEVICE seems to be
> set wrongly, and makedumpfile could fail due to this. So we need a fix.
>
> I thought of another approach like below before finding this thread,
> so if you are fine, I'd like to submit a patch with this. And if you
> like going with order_base_2() approach, I'm glad to ack this patch.
>
>   --- a/include/linux/mmzone.h
>   +++ b/include/linux/mmzone.h
>   @@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
>     *      which results in PFN_SECTION_SHIFT equal 6.
>     * To sum it up, at least 6 bits are available.
>     */
>   +#define SECTION_MAP_LAST_SHIFT         5
>    #define SECTION_MARKED_PRESENT         (1UL<<0)
>    #define SECTION_HAS_MEM_MAP            (1UL<<1)
>    #define SECTION_IS_ONLINE              (1UL<<2)
>    #define SECTION_IS_EARLY               (1UL<<3)
>    #define SECTION_TAINT_ZONE_DEVICE      (1UL<<4)
>   -#define SECTION_MAP_LAST_BIT           (1UL<<5)
>   +#define SECTION_MAP_LAST_BIT           (1UL<<SECTION_MAP_LAST_SHIFT)
>    #define SECTION_MAP_MASK               (~(SECTION_MAP_LAST_BIT-1))
>   -#define SECTION_NID_SHIFT              3
>   +#define SECTION_NID_SHIFT              SECTION_MAP_LAST_SHIFT

Rather than make it dynamic, why not just make it 6 directly, since
that matches the comment about the maximum number of flags available?

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946ce..b01694d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1325,7 +1325,7 @@  extern size_t mem_section_usage_size(void);
 #define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
 #define SECTION_MAP_LAST_BIT		(1UL<<5)
 #define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT		3
+#define SECTION_NID_SHIFT		order_base_2(SECTION_MAP_LAST_BIT)
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {