Message ID | 20240722163111.4766-2-dakr@kernel.org (mailing list archive)
---|---
State | New
Series | Align kvrealloc() with krealloc()
On 7/22/24 6:29 PM, Danilo Krummrich wrote:
> Implement vrealloc() analogous to krealloc().
>
> Currently, kvrealloc() requires the caller to pass the size of the
> previous memory allocation, which should instead be self-contained.
>
> We attempt to fix this in a subsequent patch which, in order to do so,
> requires vrealloc().
>
> Besides that, we need realloc() functions for kernel allocators in Rust
> too. With `Vec` or `KVec` respectively, potentially growing (and
> shrinking) data structures are rather common.
>
> Signed-off-by: Danilo Krummrich <dakr@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -4037,6 +4037,65 @@ void *vzalloc_node_noprof(unsigned long size, int node)
[...]
> +/**
> + * vrealloc - reallocate virtually contiguous memory; contents remain unchanged
> + * @p: object to reallocate memory for
> + * @size: the size to reallocate
> + * @flags: the flags for the page level allocator
> + *
> + * The contents of the object pointed to are preserved up to the lesser of the
> + * new and old size (__GFP_ZERO flag is effectively ignored).

Well, technically not correct as we don't shrink. Get 8 pages, kvrealloc to
4 pages, kvrealloc back to 8, and the last 4 are not zeroed. But it's not
new; kvrealloc() did the same before patch 2/2.

But it's also fundamentally not true for krealloc(), or for kvrealloc()
switching from a kmalloc to a vmalloc allocation. ksize() returns the size of
the kmalloc bucket, so we don't know what the exact prior allocation size
was. Worse, we started poisoning the padding in debug configurations, so even
a kmalloc(__GFP_ZERO) followed by krealloc(__GFP_ZERO) can give you
unexpected poison now...

I guess we should just document that __GFP_ZERO is not honored at all for
realloc, and maybe even start warning :/ Hopefully nobody relies on that.

[...]
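For concreteness, the shrink-then-grow sequence described above as a minimal
sketch (illustrative only; error handling is omitted, and the three-argument
kvrealloc() form is the one this series moves to):

        buf = kvmalloc(8 * PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
        /* ... caller writes into the last four pages ... */

        /* No actual shrink happens; pages 4..7 keep their contents. */
        buf = kvrealloc(buf, 4 * PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);

        /* Pages 4..7 become reachable again, but are never re-zeroed. */
        buf = kvrealloc(buf, 8 * PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);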
On Fri, Jul 26, 2024 at 04:37:43PM +0200, Vlastimil Babka wrote:
> On 7/22/24 6:29 PM, Danilo Krummrich wrote:
[...]
> > + * The contents of the object pointed to are preserved up to the lesser of the
> > + * new and old size (__GFP_ZERO flag is effectively ignored).
>
> Well, technically not correct as we don't shrink. Get 8 pages, kvrealloc to
> 4 pages, kvrealloc back to 8, and the last 4 are not zeroed. But it's not
> new; kvrealloc() did the same before patch 2/2.

Taken (too) literally, it's not wrong. The contents of the object pointed to
are indeed preserved up to the lesser of the new and old size. It's just that
the rest may be "preserved" as well.

I'm working on implementing shrink and grow for vrealloc(). In the meantime,
I think we could probably just memset() spare memory to zero. nommu would
still use krealloc() though...

> But it's also fundamentally not true for krealloc(), or for kvrealloc()
> switching from a kmalloc to a vmalloc allocation. ksize() returns the size
> of the kmalloc bucket, so we don't know what the exact prior allocation
> size was.

Probably a stupid question, but can't we just zero the full bucket initially
and make sure to memset() spare memory in the bucket to zero when krealloc()
is called with new_size < ksize()?

> Worse, we started poisoning the padding in debug configurations, so even a
> kmalloc(__GFP_ZERO) followed by krealloc(__GFP_ZERO) can give you
> unexpected poison now...

As in writing magics directly to the spare memory in the bucket? Which would
then also be copied over to a new buffer in __do_krealloc()?

> I guess we should just document that __GFP_ZERO is not honored at all for
> realloc, and maybe even start warning :/ Hopefully nobody relies on that.

I think it'd be great to make __GFP_ZERO work in all cases. However, if
that's really not possible, I'd prefer if we could at least guarantee that
*realloc(NULL, size, flags | __GFP_ZERO) is a valid call, i.e.
WARN_ON(p && flags & __GFP_ZERO).

[...]
On Fri, Jul 26, 2024 at 10:05:47PM +0200, Danilo Krummrich wrote:
> On Fri, Jul 26, 2024 at 04:37:43PM +0200, Vlastimil Babka wrote:
> > On 7/22/24 6:29 PM, Danilo Krummrich wrote:
[...]
> > Well, technically not correct as we don't shrink. Get 8 pages, kvrealloc
> > to 4 pages, kvrealloc back to 8, and the last 4 are not zeroed. But it's
> > not new; kvrealloc() did the same before patch 2/2.
>
> Taken (too) literally, it's not wrong. The contents of the object pointed
> to are indeed preserved up to the lesser of the new and old size. It's just
> that the rest may be "preserved" as well.
>
> I'm working on implementing shrink and grow for vrealloc(). In the
> meantime, I think we could probably just memset() spare memory to zero.

Probably, this was a bad idea. Even with shrinking implemented, we'd need to
memset() potential spare memory of the last page to zero when new_size <
old_size.

Analogously, the same would be true for krealloc() buckets. That's probably
not worth it.

I think we should indeed just document that __GFP_ZERO doesn't work for
re-allocating memory and start to warn about it. As already mentioned, I
think we should at least guarantee that *realloc(NULL, size, flags |
__GFP_ZERO) is valid, i.e. WARN_ON(p && flags & __GFP_ZERO).

[...]
On Mon, Jul 29, 2024 at 09:08:16PM +0200, Danilo Krummrich wrote:
> On Fri, Jul 26, 2024 at 10:05:47PM +0200, Danilo Krummrich wrote:
> > On Fri, Jul 26, 2024 at 04:37:43PM +0200, Vlastimil Babka wrote:
[...]
> I think we should indeed just document that __GFP_ZERO doesn't work for
> re-allocating memory and start to warn about it. As already mentioned, I
> think we should at least guarantee that *realloc(NULL, size, flags |
> __GFP_ZERO) is valid, i.e. WARN_ON(p && flags & __GFP_ZERO).

Maybe I spoke a bit too soon with this last paragraph. I think continuously
growing something with __GFP_ZERO is a legitimate use case. I just did a
quick grep for users of krealloc() with __GFP_ZERO and found 18 matches.

So, I think, at least for now, we should instead document that __GFP_ZERO is
only fully honored when the buffer is grown continuously (without
intermediate shrinking) and __GFP_ZERO is supplied in every iteration.

In case I'm missing something here, and not even this case is safe, it looks
like we have 18 broken users of krealloc().

[...]
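For reference, the grow-only pattern those users rely on, as a minimal sketch
(struct entry and the helper name are hypothetical): __GFP_ZERO is supplied
on every call and the buffer is never shrunk in between.

        static int ensure_capacity(struct entry **arr, size_t *cap, size_t need)
        {
                struct entry *tmp;
                size_t new_cap;

                if (need <= *cap)
                        return 0;

                new_cap = max_t(size_t, *cap * 2, need);
                /* Grow-only, with __GFP_ZERO on every call. */
                tmp = krealloc_array(*arr, new_cap, sizeof(**arr),
                                     GFP_KERNEL | __GFP_ZERO);
                if (!tmp)
                        return -ENOMEM;

                /* Entries [*cap, new_cap) are expected to read as zero. */
                *arr = tmp;
                *cap = new_cap;
                return 0;
        }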
On 7/30/24 3:35 AM, Danilo Krummrich wrote:
> On Mon, Jul 29, 2024 at 09:08:16PM +0200, Danilo Krummrich wrote:
>> On Fri, Jul 26, 2024 at 10:05:47PM +0200, Danilo Krummrich wrote:
[...]
>> Probably, this was a bad idea. Even with shrinking implemented, we'd need
>> to memset() potential spare memory of the last page to zero when new_size
>> < old_size.
>>
>> Analogously, the same would be true for krealloc() buckets. That's
>> probably not worth it.

I think it would remove unexpected bad surprises from the API, so why not do
it.

>> I think we should indeed just document that __GFP_ZERO doesn't work for
>> re-allocating memory and start to warn about it. As already mentioned, I
>> think we should at least guarantee that *realloc(NULL, size, flags |
>> __GFP_ZERO) is valid, i.e. WARN_ON(p && flags & __GFP_ZERO).
>
> Maybe I spoke a bit too soon with this last paragraph. I think continuously
> growing something with __GFP_ZERO is a legitimate use case. I just did a
> quick grep for users of krealloc() with __GFP_ZERO and found 18 matches.
>
> So, I think, at least for now, we should instead document that __GFP_ZERO
> is only fully honored when the buffer is grown continuously (without
> intermediate shrinking) and __GFP_ZERO is supplied in every iteration.
>
> In case I'm missing something here, and not even this case is safe, it
> looks like we have 18 broken users of krealloc().

+CC Feng Tang

Let's say we kmalloc(56, __GFP_ZERO); we get an object from the kmalloc-64
cache. Since commit 946fa0dbf2d89 ("mm/slub: extend redzone check to extra
allocated kmalloc space than requested") and preceding commits, if slub_debug
is enabled (red zoning or user tracking), only the 56 bytes will be zeroed.
The rest will be either unknown garbage or redzone.

Then we might e.g. krealloc(120) and get a kmalloc-128 object, and 64 bytes
(the result of ksize()) will be copied, including the garbage/redzone. I
think it's fixable, because when we do this in slub_debug we also store the
original size in the metadata, so we could read it back and adjust how many
bytes are copied.

Then we could guarantee that if __GFP_ZERO is used consistently on the
initial kmalloc() and on krealloc(), and the user doesn't corrupt the extra
space themselves (which is a bug anyway that the redzoning is supposed to
catch), all will be fine.

There might also be a KASAN side to this; I see poison_kmalloc_redzone() is
also redzoning the area between the requested size and the cache's
object_size?

[...]
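For concreteness, the sequence just described as a minimal sketch (the helper
name is invented; the sizes are the ones from the discussion):

        static void demo_krealloc_redzone_copy(void)
        {
                u8 *p, *tmp;

                /* kmalloc-64 object; under slub_debug only bytes [0, 56) are zeroed. */
                p = kmalloc(56, GFP_KERNEL | __GFP_ZERO);
                if (!p)
                        return;

                /*
                 * kmalloc-128 object; ksize() == 64 bytes get copied over, so
                 * bytes [56, 64) of the new buffer may hold redzone poison
                 * although __GFP_ZERO was requested both times.
                 */
                tmp = krealloc(p, 120, GFP_KERNEL | __GFP_ZERO);
                if (!tmp) {
                        kfree(p);
                        return;
                }

                kfree(tmp);
        }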
On Tue, Jul 30, 2024 at 02:15:34PM +0200, Vlastimil Babka wrote:
> On 7/30/24 3:35 AM, Danilo Krummrich wrote:
> > On Mon, Jul 29, 2024 at 09:08:16PM +0200, Danilo Krummrich wrote:
[...]
> >> Analogously, the same would be true for krealloc() buckets. That's
> >> probably not worth it.
>
> I think it would remove unexpected bad surprises from the API, so why not
> do it.

We'd either need to do it *every* time we shrink an allocation, just in case,
or only do it when shrinking with the __GFP_ZERO flag set, which might be a
bit counter-intuitive.

If we do it, I'd probably vote for the latter semantics. While it sounds more
error prone, it's less wasteful and enough to cover the most common case,
where the actual *realloc() call is always made with the same parameters but
a changing size.

> > Maybe I spoke a bit too soon with this last paragraph. I think
> > continuously growing something with __GFP_ZERO is a legitimate use case.
> > I just did a quick grep for users of krealloc() with __GFP_ZERO and found
> > 18 matches.
> >
> > So, I think, at least for now, we should instead document that __GFP_ZERO
> > is only fully honored when the buffer is grown continuously (without
> > intermediate shrinking) and __GFP_ZERO is supplied in every iteration.
> >
> > In case I'm missing something here, and not even this case is safe, it
> > looks like we have 18 broken users of krealloc().
>
> +CC Feng Tang
>
> Let's say we kmalloc(56, __GFP_ZERO); we get an object from the kmalloc-64
> cache. Since commit 946fa0dbf2d89 ("mm/slub: extend redzone check to extra
> allocated kmalloc space than requested") and preceding commits, if
> slub_debug is enabled (red zoning or user tracking), only the 56 bytes will
> be zeroed. The rest will be either unknown garbage or redzone.
>
> Then we might e.g. krealloc(120) and get a kmalloc-128 object, and 64 bytes
> (the result of ksize()) will be copied, including the garbage/redzone. I
> think it's fixable, because when we do this in slub_debug we also store the
> original size in the metadata, so we could read it back and adjust how many
> bytes are copied.
>
> Then we could guarantee that if __GFP_ZERO is used consistently on the
> initial kmalloc() and on krealloc(), and the user doesn't corrupt the extra
> space themselves (which is a bug anyway that the redzoning is supposed to
> catch), all will be fine.

Ok, so those 18 users are indeed currently broken, but only when slub_debug
is enabled (assuming that all of those are consistently growing with
__GFP_ZERO).

> There might also be a KASAN side to this; I see poison_kmalloc_redzone() is
> also redzoning the area between the requested size and the cache's
> object_size?

[...]
On 7/30/24 3:14 PM, Danilo Krummrich wrote:
> On Tue, Jul 30, 2024 at 02:15:34PM +0200, Vlastimil Babka wrote:
>> On 7/30/24 3:35 AM, Danilo Krummrich wrote:
[...]
>> I think it would remove unexpected bad surprises from the API, so why not
>> do it.
>
> We'd either need to do it *every* time we shrink an allocation, just in
> case, or only do it when shrinking with the __GFP_ZERO flag set, which
> might be a bit counter-intuitive.

I don't think it is that counter-intuitive.

> If we do it, I'd probably vote for the latter semantics. While it sounds
> more error prone, it's less wasteful and enough to cover the most common
> case, where the actual *realloc() call is always made with the same
> parameters but a changing size.

Yeah. Or with hardening enabled (init_on_alloc) it could be done always.
On Tue, Jul 30, 2024 at 03:58:25PM +0200, Vlastimil Babka wrote:
> On 7/30/24 3:14 PM, Danilo Krummrich wrote:
> > On Tue, Jul 30, 2024 at 02:15:34PM +0200, Vlastimil Babka wrote:
[...]
> > If we do it, I'd probably vote for the latter semantics. While it sounds
> > more error prone, it's less wasteful and enough to cover the most common
> > case, where the actual *realloc() call is always made with the same
> > parameters but a changing size.
>
> Yeah. Or with hardening enabled (init_on_alloc) it could be done always.

Ok, sounds good. Will go with that then.
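A minimal sketch of these agreed-upon semantics in vrealloc()'s shrink branch
(not the posted patch); note that want_init_on_alloc() already covers both
the __GFP_ZERO case and the init_on_alloc hardening case:

        if (size <= old_size) {
                /*
                 * Zero the tail that falls out of use, either because the
                 * caller passed __GFP_ZERO or because init_on_alloc
                 * hardening is enabled.
                 */
                if (want_init_on_alloc(flags))
                        memset((void *)p + size, 0, old_size - size);
                return (void *)p;
        }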
On Tue, Jul 30, 2024 at 08:15:34PM +0800, Vlastimil Babka wrote:
> On 7/30/24 3:35 AM, Danilo Krummrich wrote:
[...]
> > Maybe I spoke a bit too soon with this last paragraph. I think
> > continuously growing something with __GFP_ZERO is a legitimate use case.
> > I just did a quick grep for users of krealloc() with __GFP_ZERO and found
> > 18 matches.
> >
> > So, I think, at least for now, we should instead document that __GFP_ZERO
> > is only fully honored when the buffer is grown continuously (without
> > intermediate shrinking) and __GFP_ZERO is supplied in every iteration.
> >
> > In case I'm missing something here, and not even this case is safe, it
> > looks like we have 18 broken users of krealloc().
>
> +CC Feng Tang

Sorry for the late reply!

> Let's say we kmalloc(56, __GFP_ZERO); we get an object from the kmalloc-64
> cache. Since commit 946fa0dbf2d89 ("mm/slub: extend redzone check to extra
> allocated kmalloc space than requested") and preceding commits, if
> slub_debug is enabled (red zoning or user tracking), only the 56 bytes will
> be zeroed. The rest will be either unknown garbage or redzone.

Yes.

> Then we might e.g. krealloc(120) and get a kmalloc-128 object, and 64 bytes
> (the result of ksize()) will be copied, including the garbage/redzone. I
> think it's fixable, because when we do this in slub_debug we also store the
> original size in the metadata, so we could read it back and adjust how many
> bytes are copied.

krealloc() --> __do_krealloc() --> ksize()

When ksize() is called, as we don't know what the user will do with the extra
space ([57, 64] here), the orig_size check will be unset by __ksize() calling
skip_orig_size_check().

And if the new size is bigger than the old 'ksize', the 'orig_size' will be
correctly set for the newly allocated kmalloc object.

For the 'unstable' branch of the -mm tree, which has all the latest patches
from Danilo, I ran some basic tests and it seems to be fine.

> Then we could guarantee that if __GFP_ZERO is used consistently on the
> initial kmalloc() and on krealloc(), and the user doesn't corrupt the extra
> space themselves (which is a bug anyway that the redzoning is supposed to
> catch), all will be fine.
>
> There might also be a KASAN side to this; I see poison_kmalloc_redzone() is
> also redzoning the area between the requested size and the cache's
> object_size?

AFAIK, KASAN has 3 modes: generic, SW-tagged, and HW-tagged, while the latter
2 modes rely on arm64. For 'generic' mode, poison_kmalloc_redzone() only
redzones its own shadow memory, and not the kmalloc object data space
[orig_size + 1, ksize]. For the other 2 modes, I have no hardware to test,
but I guess they are also fine, otherwise there would already be some bug
report :), as normal kmalloc() may call it too.

Thanks,
Feng
On Mon, Sep 02, 2024 at 09:36:26AM +0800, Tang, Feng wrote:
> On Tue, Jul 30, 2024 at 08:15:34PM +0800, Vlastimil Babka wrote:
> > On 7/30/24 3:35 AM, Danilo Krummrich wrote:
[...]
> > Then we might e.g. krealloc(120) and get a kmalloc-128 object, and 64
> > bytes (the result of ksize()) will be copied, including the
> > garbage/redzone. I think it's fixable, because when we do this in
> > slub_debug we also store the original size in the metadata, so we could
> > read it back and adjust how many bytes are copied.
>
> krealloc() --> __do_krealloc() --> ksize()
>
> When ksize() is called, as we don't know what the user will do with the
> extra space ([57, 64] here), the orig_size check will be unset by __ksize()
> calling skip_orig_size_check().
>
> And if the new size is bigger than the old 'ksize', the 'orig_size' will be
> correctly set for the newly allocated kmalloc object.
>
> For the 'unstable' branch of the -mm tree, which has all the latest patches
> from Danilo, I ran some basic tests and it seems to be fine.

When doing more tests, I found one case matching Vlastimil's previous
concern: if we kzalloc a small object, and then krealloc with a slightly
bigger size which can still reuse the kmalloc object, some redzone will be
preserved.

With test code like:

        buf = kzalloc(36, GFP_KERNEL);
        memset(buf, 0xff, 36);

        buf = krealloc(buf, 48, GFP_KERNEL | __GFP_ZERO);

Data after kzalloc+memset:

        ffff88802189b040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff88802189b050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff88802189b060: ff ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc
        ffff88802189b070: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc

Data after krealloc:

        ffff88802189b040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff88802189b050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff88802189b060: ff ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc
        ffff88802189b070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

If we really want [37, 48] to be zeroed too, we can lift get_orig_size() from
slub.c to slab_common.c and use it as the start of zeroing in krealloc().

Thanks,
Feng
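A sketch of that suggestion (illustrative only; variable names follow
__do_krealloc(), the object's kmem_cache pointer s is assumed to be at hand,
and get_orig_size() is today internal to mm/slub.c, hence the lifting or
moving discussed in the follow-ups):

        if (ks >= new_size) {
                /*
                 * Start zeroing at the previously requested size rather than
                 * at new_size, so stale redzone bytes like [37, 48] above
                 * don't survive the reuse of the same object.
                 */
                size_t orig_size = get_orig_size(s, (void *)p);

                if (want_init_on_alloc(flags) && new_size > orig_size)
                        memset((void *)p + orig_size, 0, new_size - orig_size);
                return (void *)p;
        }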
On 9/2/24 09:04, Feng Tang wrote:
> On Mon, Sep 02, 2024 at 09:36:26AM +0800, Tang, Feng wrote:
>> On Tue, Jul 30, 2024 at 08:15:34PM +0800, Vlastimil Babka wrote:
>> > On 7/30/24 3:35 AM, Danilo Krummrich wrote:
> [...]
>> krealloc() --> __do_krealloc() --> ksize()
>>
>> When ksize() is called, as we don't know what the user will do with the
>> extra space ([57, 64] here), the orig_size check will be unset by
>> __ksize() calling skip_orig_size_check().
>>
>> And if the new size is bigger than the old 'ksize', the 'orig_size' will
>> be correctly set for the newly allocated kmalloc object.

Yes, but the memcpy() to the new object will be done using ksize(), and will
thus include the redzone, e.g. [57, 64].

>> For the 'unstable' branch of the -mm tree, which has all the latest
>> patches from Danilo, I ran some basic tests and it seems to be fine.

To test it, it would not always be enough to expect some slub_debug check to
fail; you'd e.g. have to kmalloc(48, GFP_KERNEL | __GFP_ZERO), krealloc(128,
GFP_KERNEL | __GFP_ZERO) and then verify there are zeroes from 48 to 128. I
suspect there won't be zeroes from 48 to 64 due to the redzone.

(this would have made a great lib/slub_kunit.c test :))

> When doing more tests, I found one case matching Vlastimil's previous
> concern: if we kzalloc a small object, and then krealloc with a slightly
> bigger size which can still reuse the kmalloc object, some redzone will be
> preserved.
>
> With test code like:
>
>         buf = kzalloc(36, GFP_KERNEL);
>         memset(buf, 0xff, 36);
>
>         buf = krealloc(buf, 48, GFP_KERNEL | __GFP_ZERO);
>
[...]
>
> If we really want [37, 48] to be zeroed too, we can lift get_orig_size()
> from slub.c to slab_common.c and use it as the start of zeroing in
> krealloc().

Or maybe just move krealloc() to mm/slub.c so there are no unnecessary calls
between the files.

We should also set a new orig_size in cases where we are shrinking or
enlarging within the same object (i.e. 48->40 or 48->64). In case of
shrinking, we also might need to redzone the shrunken area (i.e. [40, 48]),
or later checks will fail. But if the current object is from kfence, then we
should probably not do any of this... sigh, this gets complicated. And really
we need kunit tests for all the scenarios :/

> Thanks,
> Feng
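A hypothetical lib/slub_kunit.c case along those lines (the test name and
exact sizes are invented):

        static void test_krealloc_zeroing(struct kunit *test)
        {
                u8 *p;
                int i;

                p = kmalloc(48, GFP_KERNEL | __GFP_ZERO);
                KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p);

                p = krealloc(p, 128, GFP_KERNEL | __GFP_ZERO);
                KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p);

                /* If __GFP_ZERO is honored, bytes [48, 128) must read as zero. */
                for (i = 48; i < 128; i++)
                        KUNIT_EXPECT_EQ(test, p[i], (u8)0);

                kfree(p);
        }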
On Mon, Sep 02, 2024 at 10:56:57AM +0200, Vlastimil Babka wrote:
> On 9/2/24 09:04, Feng Tang wrote:
> > On Mon, Sep 02, 2024 at 09:36:26AM +0800, Tang, Feng wrote:
[...]
> Yes, but the memcpy() to the new object will be done using ksize(), and
> will thus include the redzone, e.g. [57, 64].

Right.

> To test it, it would not always be enough to expect some slub_debug check
> to fail; you'd e.g. have to kmalloc(48, GFP_KERNEL | __GFP_ZERO),
> krealloc(128, GFP_KERNEL | __GFP_ZERO) and then verify there are zeroes
> from 48 to 128. I suspect there won't be zeroes from 48 to 64 due to the
> redzone.

Yes, you are right.

> (this would have made a great lib/slub_kunit.c test :))

Agree.

> > If we really want [37, 48] to be zeroed too, we can lift get_orig_size()
> > from slub.c to slab_common.c and use it as the start of zeroing in
> > krealloc().
>
> Or maybe just move krealloc() to mm/slub.c so there are no unnecessary
> calls between the files.
>
> We should also set a new orig_size in cases where we are shrinking or
> enlarging within the same object (i.e. 48->40 or 48->64). In case of
> shrinking, we also might need to redzone the shrunken area (i.e. [40, 48]),
> or later checks will fail. But if the current object is from kfence, then
> we should probably not do any of this... sigh, this gets complicated. And
> really we need kunit tests for all the scenarios :/

Good point! Will think about it and try to implement it so that the
orig_size and kmalloc-redzone check setting is kept.

Thanks,
Feng
On Tue, Sep 03, 2024 at 11:18:48AM +0800, Tang, Feng wrote:
> On Mon, Sep 02, 2024 at 10:56:57AM +0200, Vlastimil Babka wrote:
[...]
> > Or maybe just move krealloc() to mm/slub.c so there are no unnecessary
> > calls between the files.
> >
> > We should also set a new orig_size in cases where we are shrinking or
> > enlarging within the same object (i.e. 48->40 or 48->64). In case of
> > shrinking, we also might need to redzone the shrunken area (i.e.
> > [40, 48]), or later checks will fail. But if the current object is from
> > kfence, then we should probably not do any of this... sigh, this gets
> > complicated. And really we need kunit tests for all the scenarios :/
>
> Good point! Will think about it and try to implement it so that the
> orig_size and kmalloc-redzone check setting is kept.

I checked this, and as you mentioned, there is some kfence and kasan stuff
which needs to be handled to manage the 'orig_size'. As this work depends on
patches in both the -slab tree and the -mm tree, I will base it against the
linux-next tree and send out the patches for review soon.

Thanks,
Feng
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e4a631ec430b..ad2ce7a6ab7a 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -189,6 +189,10 @@ extern void *__vcalloc_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1
 extern void *vcalloc_noprof(size_t n, size_t size) __alloc_size(1, 2);
 #define vcalloc(...) alloc_hooks(vcalloc_noprof(__VA_ARGS__))
 
+void * __must_check vrealloc_noprof(const void *p, size_t size, gfp_t flags)
+        __realloc_size(2);
+#define vrealloc(...) alloc_hooks(vrealloc_noprof(__VA_ARGS__))
+
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6b783baf12a1..caf032f0bd69 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4037,6 +4037,65 @@ void *vzalloc_node_noprof(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node_noprof);
 
+/**
+ * vrealloc - reallocate virtually contiguous memory; contents remain unchanged
+ * @p: object to reallocate memory for
+ * @size: the size to reallocate
+ * @flags: the flags for the page level allocator
+ *
+ * The contents of the object pointed to are preserved up to the lesser of the
+ * new and old size (__GFP_ZERO flag is effectively ignored).
+ *
+ * If @p is %NULL, vrealloc() behaves exactly like vmalloc(). If @size is 0 and
+ * @p is not a %NULL pointer, the object pointed to is freed.
+ *
+ * Return: pointer to the allocated memory; %NULL if @size is zero or in case of
+ * failure
+ */
+void *vrealloc_noprof(const void *p, size_t size, gfp_t flags)
+{
+        size_t old_size = 0;
+        void *n;
+
+        if (!size) {
+                vfree(p);
+                return NULL;
+        }
+
+        if (p) {
+                struct vm_struct *vm;
+
+                vm = find_vm_area(p);
+                if (unlikely(!vm)) {
+                        WARN(1, "Trying to vrealloc() nonexistent vm area (%p)\n", p);
+                        return NULL;
+                }
+
+                old_size = get_vm_area_size(vm);
+        }
+
+        if (size <= old_size) {
+                /*
+                 * TODO: Shrink the vm_area, i.e. unmap and free unused pages.
+                 * What would be a good heuristic for when to shrink the
+                 * vm_area?
+                 */
+                return (void *)p;
+        }
+
+        /* TODO: Grow the vm_area, i.e. allocate and map additional pages. */
+        n = __vmalloc_noprof(size, flags);
+        if (!n)
+                return NULL;
+
+        if (p) {
+                memcpy(n, p, old_size);
+                vfree(p);
+        }
+
+        return n;
+}
+
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
 #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
 #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
Implement vrealloc() analogous to krealloc().

Currently, kvrealloc() requires the caller to pass the size of the
previous memory allocation, which should instead be self-contained.

We attempt to fix this in a subsequent patch which, in order to do so,
requires vrealloc().

Besides that, we need realloc() functions for kernel allocators in Rust
too. With `Vec` or `KVec` respectively, potentially growing (and
shrinking) data structures are rather common.

Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
 include/linux/vmalloc.h |  4 +++
 mm/vmalloc.c            | 59 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)
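For illustration, a minimal usage sketch of the new API — grow_vbuf() is a
hypothetical helper, not part of the patch. Unlike with the old kvrealloc()
signature, no old-size bookkeeping is needed:

        static void *grow_vbuf(void *buf, size_t new_size)
        {
                void *tmp;

                tmp = vrealloc(buf, new_size, GFP_KERNEL);
                if (!tmp) {
                        /* On failure, the old allocation is left intact. */
                        vfree(buf);
                        return NULL;
                }

                return tmp;
        }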