
[v2] memcg: add charging of already allocated slab objects

Message ID 20240827235228.1591842-1-shakeel.butt@linux.dev (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [v2] memcg: add charging of already allocated slab objects

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Shakeel Butt Aug. 27, 2024, 11:52 p.m. UTC
At the moment, slab objects are charged to the memcg at allocation
time. However, there are cases where slab objects are allocated at a
time when the right target memcg to charge them to is not known. One
such case is the network sockets for incoming connections, which are
allocated in softirq context.

A couple hundred thousand connections are very normal on a large,
loaded server, and almost all of the sockets underlying those
connections get allocated in softirq context and thus are not charged
to any memcg. However, later, at accept() time, we know the right
target memcg to charge. Let's add a new API to charge already
allocated objects, so we can have better accounting of the memory
usage.

To measure the performance impact of this change, tcp_crr from the
neper [1] performance suite is used. Basically, it is a network
ping-pong test with a new connection for each ping-pong.

The server and the client are run inside a 3-level cgroup hierarchy
using the following commands:

Server:
 $ tcp_crr -6

Client:
 $ tcp_crr -6 -c -H ${server_ip}

If the client and server run on different machines with a 50 Gbps
NIC, there is no visible impact from the change.

For the same-machine experiment, with v6.11-rc5 as the base:

          base (throughput)     with-patch
tcp_crr   14545 (+- 80)         14463 (+- 56)

It seems like the performance impact is within the noise.

Link: https://github.com/google/neper [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
Changes since v1:
- Correctly handle large allocations which bypass slab
- Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds

RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
Changes since the RFC:
- Added check for already charged slab objects.
- Added performance results from neper's tcp_crr

 include/linux/slab.h            |  1 +
 mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
 net/ipv4/inet_connection_sock.c |  5 ++--
 3 files changed, 55 insertions(+), 2 deletions(-)

Comments

Yosry Ahmed Aug. 28, 2024, 12:34 a.m. UTC | #1
On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> At the moment, the slab objects are charged to the memcg at the
> allocation time. However there are cases where slab objects are
> allocated at the time where the right target memcg to charge it to is
> not known. One such case is the network sockets for the incoming
> connection which are allocated in the softirq context.
>
> Couple hundred thousand connections are very normal on large loaded
> server and almost all of those sockets underlying those connections get
> allocated in the softirq context and thus not charged to any memcg.
> However later at the accept() time we know the right target memcg to
> charge. Let's add new API to charge already allocated objects, so we can
> have better accounting of the memory usage.
>
> To measure the performance impact of this change, tcp_crr is used from
> the neper [1] performance suite. Basically it is a network ping pong
> test with new connection for each ping pong.
>
> The server and the client are run inside 3 level of cgroup hierarchy
> using the following commands:
>
> Server:
>  $ tcp_crr -6
>
> Client:
>  $ tcp_crr -6 -c -H ${server_ip}
>
> If the client and server run on different machines with 50 GBPS NIC,
> there is no visible impact of the change.
>
> For the same machine experiment with v6.11-rc5 as base.
>
>           base (throughput)     with-patch
> tcp_crr   14545 (+- 80)         14463 (+- 56)
>
> It seems like the performance impact is within the noise.
>
> Link: https://github.com/google/neper [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
> Changes since v1:
> - Correctly handle large allocations which bypass slab
> - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
>
> RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
> Changes since the RFC:
> - Added check for already charged slab objects.
> - Added performance results from neper's tcp_crr
>
>  include/linux/slab.h            |  1 +
>  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
>  net/ipv4/inet_connection_sock.c |  5 ++--
>  3 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index eb2bf4629157..05cfab107c72 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -547,6 +547,7 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>                             gfp_t gfpflags) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_lru(...)      alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
>
> +bool kmem_cache_charge(void *objp, gfp_t gfpflags);
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>
>  kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
> diff --git a/mm/slub.c b/mm/slub.c
> index c9d8a2497fd6..8265ea5f25be 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2185,6 +2185,43 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
>
>         __memcg_slab_free_hook(s, slab, p, objects, obj_exts);
>  }
> +
> +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> +
> +static __fastpath_inline
> +bool memcg_slab_post_charge(void *p, gfp_t flags)
> +{
> +       struct slabobj_ext *slab_exts;
> +       struct kmem_cache *s;
> +       struct folio *folio;
> +       struct slab *slab;
> +       unsigned long off;
> +
> +       folio = virt_to_folio(p);
> +       if (!folio_test_slab(folio)) {
> +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> +                                               folio_order(folio)) == 0;

Will this charge the folio again if it was already charged? It seems
like we avoid this for already charged slab objects below but not
here.

> +       }
> +
> +       slab = folio_slab(folio);
> +       s = slab->slab_cache;
> +
> +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> +               return true;

Would it be clearer to check if the slab cache is one of
kmalloc_caches[KMALLOC_NORMAL]? This should be doable by comparing the
address of the slab cache with the addresses of
kmalloc_caches[KMALLOC_NORMAL] (perhaps in a helper). I need to refer
to your reply to Roman to understand why this works.

> +
> +       /* Ignore already charged objects. */
> +       slab_exts = slab_obj_exts(slab);
> +       if (slab_exts) {
> +               off = obj_to_index(s, slab, p);
> +               if (unlikely(slab_exts[off].objcg))
> +                       return true;
> +       }
> +
> +       return __memcg_slab_post_alloc_hook(s, NULL, flags, 1, &p);
> +}
> +
>  #else /* CONFIG_MEMCG */
>  static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
>                                               struct list_lru *lru,
> @@ -2198,6 +2235,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>                                         void **p, int objects)
>  {
>  }
> +
> +static inline bool memcg_slab_post_charge(void *p, gfp_t flags)
> +{
> +       return true;
> +}
>  #endif /* CONFIG_MEMCG */
>
>  /*
> @@ -4062,6 +4104,15 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_lru_noprof);
>
> +bool kmem_cache_charge(void *objp, gfp_t gfpflags)
> +{
> +       if (!memcg_kmem_online())
> +               return true;
> +
> +       return memcg_slab_post_charge(objp, gfpflags);
> +}
> +EXPORT_SYMBOL(kmem_cache_charge);
> +
>  /**
>   * kmem_cache_alloc_node - Allocate an object on the specified node
>   * @s: The cache to allocate from.
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 64d07b842e73..3c13ca8c11fb 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -715,6 +715,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
>         release_sock(sk);
>         if (newsk && mem_cgroup_sockets_enabled) {
>                 int amt = 0;
> +               gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
>
>                 /* atomically get the memory usage, set and charge the
>                  * newsk->sk_memcg.
> @@ -731,8 +732,8 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
>                 }
>
>                 if (amt)
> -                       mem_cgroup_charge_skmem(newsk->sk_memcg, amt,
> -                                               GFP_KERNEL | __GFP_NOFAIL);
> +                       mem_cgroup_charge_skmem(newsk->sk_memcg, amt, gfp);
> +               kmem_cache_charge(newsk, gfp);
>
>                 release_sock(newsk);
>         }
> --
> 2.43.5
>
>
Shakeel Butt Aug. 28, 2024, 7:14 p.m. UTC | #2
On Tue, Aug 27, 2024 at 05:34:24PM GMT, Yosry Ahmed wrote:
> On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> > +
> > +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> > +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> > +
> > +static __fastpath_inline
> > +bool memcg_slab_post_charge(void *p, gfp_t flags)
> > +{
> > +       struct slabobj_ext *slab_exts;
> > +       struct kmem_cache *s;
> > +       struct folio *folio;
> > +       struct slab *slab;
> > +       unsigned long off;
> > +
> > +       folio = virt_to_folio(p);
> > +       if (!folio_test_slab(folio)) {
> > +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> > +                                               folio_order(folio)) == 0;
> 
> Will this charge the folio again if it was already charged? It seems
> like we avoid this for already charged slab objects below but not
> here.
> 

Thanks for catching this. It's an easy fix and will do in v3.
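
A minimal sketch of what such a fix could look like, assuming the
existing folio_memcg_kmem() helper is used to detect an already
charged large (non-slab) allocation; this is only an illustration,
not the actual v3 code:

	folio = virt_to_folio(p);
	if (!folio_test_slab(folio)) {
		/* Large allocation bypassing slab: skip if already charged. */
		if (folio_memcg_kmem(folio))
			return true;
		return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
						folio_order(folio)) == 0;
	}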

> > +       }
> > +
> > +       slab = folio_slab(folio);
> > +       s = slab->slab_cache;
> > +
> > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > +               return true;
> 
> Would it be clearer to check if the slab cache is one of
> kmalloc_caches[KMALLOC_NORMAL]? This should be doable by comparing the
> address of the slab cache with the addresses of
> kmalloc_cache[KMALLOC_NORMAL] (perhaps in a helper). I need to refer
> to your reply to Roman to understand why this works.
> 

Do you mean looping over kmalloc_caches[KMALLOC_NORMAL] and comparing
the given slab cache address? Nah man, why do a long loop of pointer
comparisons when we can simply check the flags of the given kmem
cache? Also, this array will grow with the recently proposed random
kmalloc caches.

Thanks,
Shakeel
Yosry Ahmed Aug. 28, 2024, 7:42 p.m. UTC | #3
On Wed, Aug 28, 2024 at 12:14 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 27, 2024 at 05:34:24PM GMT, Yosry Ahmed wrote:
> > On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> [...]
> > > +
> > > +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> > > +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> > > +
> > > +static __fastpath_inline
> > > +bool memcg_slab_post_charge(void *p, gfp_t flags)
> > > +{
> > > +       struct slabobj_ext *slab_exts;
> > > +       struct kmem_cache *s;
> > > +       struct folio *folio;
> > > +       struct slab *slab;
> > > +       unsigned long off;
> > > +
> > > +       folio = virt_to_folio(p);
> > > +       if (!folio_test_slab(folio)) {
> > > +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> > > +                                               folio_order(folio)) == 0;
> >
> > Will this charge the folio again if it was already charged? It seems
> > like we avoid this for already charged slab objects below but not
> > here.
> >
>
> Thanks for catchig this. It's an easy fix and will do in v3.
>
> > > +       }
> > > +
> > > +       slab = folio_slab(folio);
> > > +       s = slab->slab_cache;
> > > +
> > > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > > +               return true;
> >
> > Would it be clearer to check if the slab cache is one of
> > kmalloc_caches[KMALLOC_NORMAL]? This should be doable by comparing the
> > address of the slab cache with the addresses of
> > kmalloc_cache[KMALLOC_NORMAL] (perhaps in a helper). I need to refer
> > to your reply to Roman to understand why this works.
> >
>
> Do you mean looping over kmalloc_caches[KMALLOC_NORMAL] and comparing
> the given slab cache address? Nah man why do long loop of pointer
> comparisons when we can simply check the flag of the given kmem cache.
> Also this array will increase with the recent proposed random kmalloc
> caches.

Oh I thought kmalloc_caches[KMALLOC_NORMAL] is an array of the actual
struct kmem_cache objects, so I thought we can just check if:
s >= kmalloc_caches[KMALLOC_NORMAL][0] &&
s <= kmalloc_caches[KMALLOC_NORMAL][LAST_INDEX]

I just realized it's an array of pointers, so we would need to loop
and compare them.

I still find the flags comparisons unclear and not very future-proof
tbh. I think we can just store the type in struct kmem_cache? I think
there are multiple holes there.
Shakeel Butt Aug. 28, 2024, 8:16 p.m. UTC | #4
On Wed, Aug 28, 2024 at 12:42:02PM GMT, Yosry Ahmed wrote:
> On Wed, Aug 28, 2024 at 12:14 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Aug 27, 2024 at 05:34:24PM GMT, Yosry Ahmed wrote:
> > > On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > [...]
> > > > +
> > > > +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> > > > +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> > > > +
> > > > +static __fastpath_inline
> > > > +bool memcg_slab_post_charge(void *p, gfp_t flags)
> > > > +{
> > > > +       struct slabobj_ext *slab_exts;
> > > > +       struct kmem_cache *s;
> > > > +       struct folio *folio;
> > > > +       struct slab *slab;
> > > > +       unsigned long off;
> > > > +
> > > > +       folio = virt_to_folio(p);
> > > > +       if (!folio_test_slab(folio)) {
> > > > +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> > > > +                                               folio_order(folio)) == 0;
> > >
> > > Will this charge the folio again if it was already charged? It seems
> > > like we avoid this for already charged slab objects below but not
> > > here.
> > >
> >
> > Thanks for catchig this. It's an easy fix and will do in v3.
> >
> > > > +       }
> > > > +
> > > > +       slab = folio_slab(folio);
> > > > +       s = slab->slab_cache;
> > > > +
> > > > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > > > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > > > +               return true;
> > >
> > > Would it be clearer to check if the slab cache is one of
> > > kmalloc_caches[KMALLOC_NORMAL]? This should be doable by comparing the
> > > address of the slab cache with the addresses of
> > > kmalloc_cache[KMALLOC_NORMAL] (perhaps in a helper). I need to refer
> > > to your reply to Roman to understand why this works.
> > >
> >
> > Do you mean looping over kmalloc_caches[KMALLOC_NORMAL] and comparing
> > the given slab cache address? Nah man why do long loop of pointer
> > comparisons when we can simply check the flag of the given kmem cache.
> > Also this array will increase with the recent proposed random kmalloc
> > caches.
> 
> Oh I thought kmalloc_caches[KMALLOC_NORMAL] is an array of the actual
> struct kmem_cache objects, so I thought we can just check if:
> s >= kmalloc_caches[KMALLOC_NORMAL][0] &&
> s >= kmalloc_caches[KMALLOC_NORMAL][LAST_INDEX]
> 
> I just realized it's an array of pointers, so we would need to loop
> and compare them.
> 
> I still find the flags comparisons unclear and not very future-proof
> tbh. I think we can just store the type in struct kmem_cache? I think
> there are multiple holes there.

Do you mean adding a new SLAB_KMALLOC_NORMAL flag? I will wait for the
SLAB maintainers' opinion on that. BTW, this kind of check is common
in the kernel, particularly for gfp flags.
Yosry Ahmed Aug. 28, 2024, 10:10 p.m. UTC | #5
On Wed, Aug 28, 2024 at 1:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 28, 2024 at 12:42:02PM GMT, Yosry Ahmed wrote:
> > On Wed, Aug 28, 2024 at 12:14 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:34:24PM GMT, Yosry Ahmed wrote:
> > > > On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > [...]
> > > > > +
> > > > > +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> > > > > +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> > > > > +
> > > > > +static __fastpath_inline
> > > > > +bool memcg_slab_post_charge(void *p, gfp_t flags)
> > > > > +{
> > > > > +       struct slabobj_ext *slab_exts;
> > > > > +       struct kmem_cache *s;
> > > > > +       struct folio *folio;
> > > > > +       struct slab *slab;
> > > > > +       unsigned long off;
> > > > > +
> > > > > +       folio = virt_to_folio(p);
> > > > > +       if (!folio_test_slab(folio)) {
> > > > > +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> > > > > +                                               folio_order(folio)) == 0;
> > > >
> > > > Will this charge the folio again if it was already charged? It seems
> > > > like we avoid this for already charged slab objects below but not
> > > > here.
> > > >
> > >
> > > Thanks for catchig this. It's an easy fix and will do in v3.
> > >
> > > > > +       }
> > > > > +
> > > > > +       slab = folio_slab(folio);
> > > > > +       s = slab->slab_cache;
> > > > > +
> > > > > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > > > > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > > > > +               return true;
> > > >
> > > > Would it be clearer to check if the slab cache is one of
> > > > kmalloc_caches[KMALLOC_NORMAL]? This should be doable by comparing the
> > > > address of the slab cache with the addresses of
> > > > kmalloc_cache[KMALLOC_NORMAL] (perhaps in a helper). I need to refer
> > > > to your reply to Roman to understand why this works.
> > > >
> > >
> > > Do you mean looping over kmalloc_caches[KMALLOC_NORMAL] and comparing
> > > the given slab cache address? Nah man why do long loop of pointer
> > > comparisons when we can simply check the flag of the given kmem cache.
> > > Also this array will increase with the recent proposed random kmalloc
> > > caches.
> >
> > Oh I thought kmalloc_caches[KMALLOC_NORMAL] is an array of the actual
> > struct kmem_cache objects, so I thought we can just check if:
> > s >= kmalloc_caches[KMALLOC_NORMAL][0] &&
> > s >= kmalloc_caches[KMALLOC_NORMAL][LAST_INDEX]
> >
> > I just realized it's an array of pointers, so we would need to loop
> > and compare them.
> >
> > I still find the flags comparisons unclear and not very future-proof
> > tbh. I think we can just store the type in struct kmem_cache? I think
> > there are multiple holes there.
>
> Do you mean adding a new SLAB_KMALLOC_NORMAL? I will wait for SLAB
> maintainers for their opinion on that. BTW this kind of checks are in
> the kernel particularly for gfp flags.

I meant maybe in new_kmalloc_cache() we pass 'type' to
create_kmalloc_cache() and store it in struct kmem_cache (we'd want a
KMALLOC_NONE or similar for non-kmalloc caches). Having a new flag
like SLAB_KMALLOC_NORMAL would also work.

Or maybe using the flags to deduce the kmalloc cache type is fine, but
in this case I think a well-documented helper that takes in a
kmem_cache and returns a type based on the combination of flags would
be better.

I just think it's easy for the flags to change from under us here, and
the code is not very clear.

Hopefully the slab maintainers will tell us what they think here, my
concerns could very possibly be unfounded.
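
A rough, purely hypothetical sketch of the first option above; the
field name "kc_type" and the KMALLOC_NONE value are assumptions for
illustration and do not appear in any posted patch:

	/* Hypothetical new field in struct kmem_cache: record the kmalloc
	 * type at cache creation time instead of deducing it from flags.
	 */
	enum kmalloc_cache_type kc_type;	/* KMALLOC_NONE for non-kmalloc caches */

	/* in new_kmalloc_cache(), after the cache is created: */
	s->kc_type = type;

	/* in memcg_slab_post_charge(): */
	if (s->kc_type == KMALLOC_NORMAL)
		return true;
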
Yosry Ahmed Aug. 28, 2024, 11:25 p.m. UTC | #6
On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> At the moment, the slab objects are charged to the memcg at the
> allocation time. However there are cases where slab objects are
> allocated at the time where the right target memcg to charge it to is
> not known. One such case is the network sockets for the incoming
> connection which are allocated in the softirq context.
>
> Couple hundred thousand connections are very normal on large loaded
> server and almost all of those sockets underlying those connections get
> allocated in the softirq context and thus not charged to any memcg.
> However later at the accept() time we know the right target memcg to
> charge. Let's add new API to charge already allocated objects, so we can
> have better accounting of the memory usage.
>
> To measure the performance impact of this change, tcp_crr is used from
> the neper [1] performance suite. Basically it is a network ping pong
> test with new connection for each ping pong.
>
> The server and the client are run inside 3 level of cgroup hierarchy
> using the following commands:
>
> Server:
>  $ tcp_crr -6
>
> Client:
>  $ tcp_crr -6 -c -H ${server_ip}
>
> If the client and server run on different machines with 50 GBPS NIC,
> there is no visible impact of the change.
>
> For the same machine experiment with v6.11-rc5 as base.
>
>           base (throughput)     with-patch
> tcp_crr   14545 (+- 80)         14463 (+- 56)
>
> It seems like the performance impact is within the noise.
>
> Link: https://github.com/google/neper [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
> Changes since v1:
> - Correctly handle large allocations which bypass slab
> - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
>
> RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
> Changes since the RFC:
> - Added check for already charged slab objects.
> - Added performance results from neper's tcp_crr
>
>  include/linux/slab.h            |  1 +
>  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
>  net/ipv4/inet_connection_sock.c |  5 ++--
>  3 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index eb2bf4629157..05cfab107c72 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -547,6 +547,7 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>                             gfp_t gfpflags) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_lru(...)      alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
>
> +bool kmem_cache_charge(void *objp, gfp_t gfpflags);
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>
>  kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
> diff --git a/mm/slub.c b/mm/slub.c
> index c9d8a2497fd6..8265ea5f25be 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2185,6 +2185,43 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
>
>         __memcg_slab_free_hook(s, slab, p, objects, obj_exts);
>  }
> +
> +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> +                     SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> +
> +static __fastpath_inline
> +bool memcg_slab_post_charge(void *p, gfp_t flags)
> +{
> +       struct slabobj_ext *slab_exts;
> +       struct kmem_cache *s;
> +       struct folio *folio;
> +       struct slab *slab;
> +       unsigned long off;
> +
> +       folio = virt_to_folio(p);
> +       if (!folio_test_slab(folio)) {
> +               return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> +                                               folio_order(folio)) == 0;
> +       }
> +
> +       slab = folio_slab(folio);
> +       s = slab->slab_cache;
> +
> +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> +               return true;

Taking a step back here, why do we need this? Which circular
dependency are we avoiding here?
Shakeel Butt Aug. 29, 2024, 12:20 a.m. UTC | #7
On Wed, Aug 28, 2024 at 04:25:30PM GMT, Yosry Ahmed wrote:
> On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
[...]
> > +
> > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > +               return true;
> 
> Taking a step back here, why do we need this? Which circular
> dependency are we avoiding here?

commit 494c1dfe855ec1f70f89552fce5eadf4a1717552
Author: Waiman Long <longman@redhat.com>
Date:   Mon Jun 28 19:37:38 2021 -0700

    mm: memcg/slab: create a new set of kmalloc-cg-<n> caches

    There are currently two problems in the way the objcg pointer array
    (memcg_data) in the page structure is being allocated and freed.

    On its allocation, it is possible that the allocated objcg pointer
    array comes from the same slab that requires memory accounting. If this
    happens, the slab will never become empty again as there is at least
    one object left (the obj_cgroup array) in the slab.

    When it is freed, the objcg pointer array object may be the last one
    in its slab and hence causes kfree() to be called again. With the
    right workload, the slab cache may be set up in a way that allows the
    recursive kfree() calling loop to nest deep enough to cause a kernel
    stack overflow and panic the system.
    ...
Yosry Ahmed Aug. 29, 2024, 12:49 a.m. UTC | #8
On Wed, Aug 28, 2024 at 5:20 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 28, 2024 at 04:25:30PM GMT, Yosry Ahmed wrote:
> > On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> [...]
> > > +
> > > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> > > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> > > +               return true;
> >
> > Taking a step back here, why do we need this? Which circular
> > dependency are we avoiding here?
>
> commit 494c1dfe855ec1f70f89552fce5eadf4a1717552
> Author: Waiman Long <longman@redhat.com>
> Date:   Mon Jun 28 19:37:38 2021 -0700
>
>     mm: memcg/slab: create a new set of kmalloc-cg-<n> caches
>
>     There are currently two problems in the way the objcg pointer array
>     (memcg_data) in the page structure is being allocated and freed.
>
>     On its allocation, it is possible that the allocated objcg pointer
>     array comes from the same slab that requires memory accounting. If this
>     happens, the slab will never become empty again as there is at least
>     one object left (the obj_cgroup array) in the slab.
>
>     When it is freed, the objcg pointer array object may be the last one
>     in its slab and hence causes kfree() to be called again. With the
>     right workload, the slab cache may be set up in a way that allows the
>     recursive kfree() calling loop to nest deep enough to cause a kernel
>     stack overflow and panic the system.
>     ...

Thanks for the reference, this makes sense.

Wouldn't it be easier to special case the specific slab cache used for
the objcg vector or use a dedicated cache for it instead of using
kmalloc caches to begin with?

Anyway, I am fine with any approach you and/or the slab maintainers
prefer, as long as we make things clear. If you keep the following
approach as-is, please expand the comment or refer to the commit you
just referenced.

Personally, I prefer either explicitly special casing the slab cache
used for the objcgs vector, explicitly tagging KMALLOC_NORMAL
allocations, or having a dedicated documented helper that finds the
slab cache kmalloc type (if any) or checks if it is a KMALLOC_NORMAL
cache.
Vlastimil Babka Aug. 29, 2024, 8:42 a.m. UTC | #9
On 8/29/24 02:49, Yosry Ahmed wrote:
> On Wed, Aug 28, 2024 at 5:20 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Wed, Aug 28, 2024 at 04:25:30PM GMT, Yosry Ahmed wrote:
>> > On Tue, Aug 27, 2024 at 4:52 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> > >
>> [...]
>> > > +
>> > > +       /* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
>> > > +       if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
>> > > +               return true;
>> >
>> > Taking a step back here, why do we need this? Which circular
>> > dependency are we avoiding here?
>>
>> commit 494c1dfe855ec1f70f89552fce5eadf4a1717552
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Mon Jun 28 19:37:38 2021 -0700
>>
>>     mm: memcg/slab: create a new set of kmalloc-cg-<n> caches
>>
>>     There are currently two problems in the way the objcg pointer array
>>     (memcg_data) in the page structure is being allocated and freed.
>>
>>     On its allocation, it is possible that the allocated objcg pointer
>>     array comes from the same slab that requires memory accounting. If this
>>     happens, the slab will never become empty again as there is at least
>>     one object left (the obj_cgroup array) in the slab.
>>
>>     When it is freed, the objcg pointer array object may be the last one
>>     in its slab and hence causes kfree() to be called again. With the
>>     right workload, the slab cache may be set up in a way that allows the
>>     recursive kfree() calling loop to nest deep enough to cause a kernel
>>     stack overflow and panic the system.
>>     ...
> 
> Thanks for the reference, this makes sense.

Another reason is memory savings: if we have a small subset of objects
in KMALLOC_NORMAL caches accounted, there might be e.g. one vector per
slab just to account one object while the rest is unaccounted.
Separating between kmalloc and kmalloc-cg caches keeps the former with
no vectors and the latter with fully used vectors.

> Wouldn't it be easier to special case the specific slab cache used for
> the objcg vector or use a dedicated cache for it instead of using
> kmalloc caches to begin with?

The problem is the vector isn't a fixed size, it depends on how many objects
a particular slab (not even a particular cache) has.
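
For reference, the vector allocation in mm/slub.c looks roughly like
the sketch below (paraphrased, so treat the exact names as
approximate): its length is objs_per_slab() of that particular slab,
so a fixed-size dedicated cache would not fit.

	/* Rough sketch of alloc_slab_obj_exts(): one slabobj_ext entry per
	 * object in this particular slab, so the vector size varies per slab.
	 */
	unsigned int objects = objs_per_slab(s, slab);
	struct slabobj_ext *vec;

	vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
			   slab_nid(slab));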

> Anyway, I am fine with any approach you and/or the slab maintainers
> prefer, as long as we make things clear. If you keep the following
> approach as-is, please expand the comment or refer to the commit you
> just referenced.
> 
> Personally, I prefer either explicitly special casing the slab cache
> used for the objcgs vector, explicitly tagging KMALLOC_NORMAL
> allocations, or having a dedicated documented helper that finds the
> slab cache kmalloc type (if any) or checks if it is a KMALLOC_NORMAL
> cache.

A helper such as is_kmalloc_normal() would be better than defining
KMALLOC_TYPE and using it directly, yes. We don't need to handle any
other types until anyone needs them.
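
A minimal sketch of such a helper, equivalent to the KMALLOC_TYPE
check in the posted patch (the exact form in v3 may differ):

	/* A KMALLOC_NORMAL cache has SLAB_KMALLOC set and none of the flags
	 * that select the dma (SLAB_CACHE_DMA), cgroup-accounted
	 * (SLAB_ACCOUNT) or reclaimable (SLAB_RECLAIM_ACCOUNT) kmalloc
	 * variants.
	 */
	static inline bool is_kmalloc_normal(struct kmem_cache *s)
	{
		return (s->flags & (SLAB_KMALLOC | SLAB_CACHE_DMA |
				    SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)) ==
		       SLAB_KMALLOC;
	}
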
Vlastimil Babka Aug. 29, 2024, 9:42 a.m. UTC | #10
On 8/28/24 01:52, Shakeel Butt wrote:
> At the moment, the slab objects are charged to the memcg at the
> allocation time. However there are cases where slab objects are
> allocated at the time where the right target memcg to charge it to is
> not known. One such case is the network sockets for the incoming
> connection which are allocated in the softirq context.
> 
> Couple hundred thousand connections are very normal on large loaded
> server and almost all of those sockets underlying those connections get
> allocated in the softirq context and thus not charged to any memcg.
> However later at the accept() time we know the right target memcg to
> charge. Let's add new API to charge already allocated objects, so we can
> have better accounting of the memory usage.
> 
> To measure the performance impact of this change, tcp_crr is used from
> the neper [1] performance suite. Basically it is a network ping pong
> test with new connection for each ping pong.
> 
> The server and the client are run inside 3 level of cgroup hierarchy
> using the following commands:
> 
> Server:
>  $ tcp_crr -6
> 
> Client:
>  $ tcp_crr -6 -c -H ${server_ip}
> 
> If the client and server run on different machines with 50 GBPS NIC,
> there is no visible impact of the change.
> 
> For the same machine experiment with v6.11-rc5 as base.
> 
>           base (throughput)     with-patch
> tcp_crr   14545 (+- 80)         14463 (+- 56)
> 
> It seems like the performance impact is within the noise.
> 
> Link: https://github.com/google/neper [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
> Changes since v1:
> - Correctly handle large allocations which bypass slab
> - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
> 
> RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
> Changes since the RFC:
> - Added check for already charged slab objects.
> - Added performance results from neper's tcp_crr
> 
>  include/linux/slab.h            |  1 +
>  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
>  net/ipv4/inet_connection_sock.c |  5 ++--
>  3 files changed, 55 insertions(+), 2 deletions(-)

I can take the v3 in slab tree, if net people ack?

BTW, will this also be useful for Linus's idea of charging struct
files only after they exist? But IIRC there was also supposed to be a
part where we have a way to quickly determine if we're not over the
limit (while allowing some overcharge to make it quicker).

Because right now this just overcharges unconditionally, but that's
understandable when the irq context creating the socket can't know the memcg
upfront. In the open() situation this is different.

> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index eb2bf4629157..05cfab107c72 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -547,6 +547,7 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>  			    gfp_t gfpflags) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_lru(...)	alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
>  
> +bool kmem_cache_charge(void *objp, gfp_t gfpflags);
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>  
>  kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
> diff --git a/mm/slub.c b/mm/slub.c
> index c9d8a2497fd6..8265ea5f25be 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2185,6 +2185,43 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
>  
>  	__memcg_slab_free_hook(s, slab, p, objects, obj_exts);
>  }
> +
> +#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
> +		      SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
> +
> +static __fastpath_inline
> +bool memcg_slab_post_charge(void *p, gfp_t flags)
> +{
> +	struct slabobj_ext *slab_exts;
> +	struct kmem_cache *s;
> +	struct folio *folio;
> +	struct slab *slab;
> +	unsigned long off;
> +
> +	folio = virt_to_folio(p);
> +	if (!folio_test_slab(folio)) {
> +		return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
> +						folio_order(folio)) == 0;
> +	}
> +
> +	slab = folio_slab(folio);
> +	s = slab->slab_cache;
> +
> +	/* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
> +	if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
> +		return true;
> +
> +	/* Ignore already charged objects. */
> +	slab_exts = slab_obj_exts(slab);
> +	if (slab_exts) {
> +		off = obj_to_index(s, slab, p);
> +		if (unlikely(slab_exts[off].objcg))
> +			return true;
> +	}
> +
> +	return __memcg_slab_post_alloc_hook(s, NULL, flags, 1, &p);
> +}
> +
>  #else /* CONFIG_MEMCG */
>  static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
>  					      struct list_lru *lru,
> @@ -2198,6 +2235,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>  					void **p, int objects)
>  {
>  }
> +
> +static inline bool memcg_slab_post_charge(void *p, gfp_t flags)
> +{
> +	return true;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /*
> @@ -4062,6 +4104,15 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_lru_noprof);
>  
> +bool kmem_cache_charge(void *objp, gfp_t gfpflags)
> +{
> +	if (!memcg_kmem_online())
> +		return true;
> +
> +	return memcg_slab_post_charge(objp, gfpflags);
> +}
> +EXPORT_SYMBOL(kmem_cache_charge);
> +
>  /**
>   * kmem_cache_alloc_node - Allocate an object on the specified node
>   * @s: The cache to allocate from.
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 64d07b842e73..3c13ca8c11fb 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -715,6 +715,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
>  	release_sock(sk);
>  	if (newsk && mem_cgroup_sockets_enabled) {
>  		int amt = 0;
> +		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
>  
>  		/* atomically get the memory usage, set and charge the
>  		 * newsk->sk_memcg.
> @@ -731,8 +732,8 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
>  		}
>  
>  		if (amt)
> -			mem_cgroup_charge_skmem(newsk->sk_memcg, amt,
> -						GFP_KERNEL | __GFP_NOFAIL);
> +			mem_cgroup_charge_skmem(newsk->sk_memcg, amt, gfp);
> +		kmem_cache_charge(newsk, gfp);
>  
>  		release_sock(newsk);
>  	}
Shakeel Butt Aug. 29, 2024, 3:50 p.m. UTC | #11
On Thu, Aug 29, 2024 at 10:42:59AM GMT, Vlastimil Babka wrote:
> On 8/29/24 02:49, Yosry Ahmed wrote:
[...]
> > 
> > Personally, I prefer either explicitly special casing the slab cache
> > used for the objcgs vector, explicitly tagging KMALLOC_NORMAL
> > allocations, or having a dedicated documented helper that finds the
> > slab cache kmalloc type (if any) or checks if it is a KMALLOC_NORMAL
> > cache.
> 
> A helper to check is_kmalloc_normal() would be better than defining
> KMALLOC_TYPE and using it directly, yes. We don't need to handle any other
> types now until anyone needs those.
> 

Sounds good, will update in v3.
Shakeel Butt Aug. 29, 2024, 4:10 p.m. UTC | #12
On Thu, Aug 29, 2024 at 11:42:10AM GMT, Vlastimil Babka wrote:
> On 8/28/24 01:52, Shakeel Butt wrote:
> > At the moment, the slab objects are charged to the memcg at the
> > allocation time. However there are cases where slab objects are
> > allocated at the time where the right target memcg to charge it to is
> > not known. One such case is the network sockets for the incoming
> > connection which are allocated in the softirq context.
> > 
> > Couple hundred thousand connections are very normal on large loaded
> > server and almost all of those sockets underlying those connections get
> > allocated in the softirq context and thus not charged to any memcg.
> > However later at the accept() time we know the right target memcg to
> > charge. Let's add new API to charge already allocated objects, so we can
> > have better accounting of the memory usage.
> > 
> > To measure the performance impact of this change, tcp_crr is used from
> > the neper [1] performance suite. Basically it is a network ping pong
> > test with new connection for each ping pong.
> > 
> > The server and the client are run inside 3 level of cgroup hierarchy
> > using the following commands:
> > 
> > Server:
> >  $ tcp_crr -6
> > 
> > Client:
> >  $ tcp_crr -6 -c -H ${server_ip}
> > 
> > If the client and server run on different machines with 50 GBPS NIC,
> > there is no visible impact of the change.
> > 
> > For the same machine experiment with v6.11-rc5 as base.
> > 
> >           base (throughput)     with-patch
> > tcp_crr   14545 (+- 80)         14463 (+- 56)
> > 
> > It seems like the performance impact is within the noise.
> > 
> > Link: https://github.com/google/neper [1]
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> > v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
> > Changes since v1:
> > - Correctly handle large allocations which bypass slab
> > - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
> > 
> > RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
> > Changes since the RFC:
> > - Added check for already charged slab objects.
> > - Added performance results from neper's tcp_crr
> > 
> >  include/linux/slab.h            |  1 +
> >  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
> >  net/ipv4/inet_connection_sock.c |  5 ++--
> >  3 files changed, 55 insertions(+), 2 deletions(-)
> 
> I can take the v3 in slab tree, if net people ack?

Thanks.

> 
> BTW, will this be also useful for Linus's idea of charging struct files only
> after they exist? But IIRC there was supposed to be also a part where we
> have a way to quickly determine if we're not over limit (while allowing some
> overcharge to make it quicker).
>

Do you have a link to those discussions or pointers to the code? From
what you have described, I think this should work. We have the
relevant gfp flags to control the charging behavior (with some
caveats).

> Because right now this just overcharges unconditionally, but that's
> understandable when the irq context creating the socket can't know the memcg
> upfront. In the open() situation this is different.
> 

For networking we deliberately overcharge in the irq context (if
needed) and then course correct in the task context. However, the
networking stack is very robust against such situations thanks to
mechanisms like backoff and retransmit that handle packet drops,
allocation failures, congestion, etc. Other subsystems are not that
robust against ENOMEM. Once I have more detail I can follow up on the
struct files case.

thanks,
Shakeel
Roman Gushchin Aug. 29, 2024, 4:20 p.m. UTC | #13
On Thu, Aug 29, 2024 at 09:10:53AM -0700, Shakeel Butt wrote:
> On Thu, Aug 29, 2024 at 11:42:10AM GMT, Vlastimil Babka wrote:
> > On 8/28/24 01:52, Shakeel Butt wrote:
> > > At the moment, the slab objects are charged to the memcg at the
> > > allocation time. However there are cases where slab objects are
> > > allocated at the time where the right target memcg to charge it to is
> > > not known. One such case is the network sockets for the incoming
> > > connection which are allocated in the softirq context.
> > > 
> > > Couple hundred thousand connections are very normal on large loaded
> > > server and almost all of those sockets underlying those connections get
> > > allocated in the softirq context and thus not charged to any memcg.
> > > However later at the accept() time we know the right target memcg to
> > > charge. Let's add new API to charge already allocated objects, so we can
> > > have better accounting of the memory usage.
> > > 
> > > To measure the performance impact of this change, tcp_crr is used from
> > > the neper [1] performance suite. Basically it is a network ping pong
> > > test with new connection for each ping pong.
> > > 
> > > The server and the client are run inside 3 level of cgroup hierarchy
> > > using the following commands:
> > > 
> > > Server:
> > >  $ tcp_crr -6
> > > 
> > > Client:
> > >  $ tcp_crr -6 -c -H ${server_ip}
> > > 
> > > If the client and server run on different machines with 50 GBPS NIC,
> > > there is no visible impact of the change.
> > > 
> > > For the same machine experiment with v6.11-rc5 as base.
> > > 
> > >           base (throughput)     with-patch
> > > tcp_crr   14545 (+- 80)         14463 (+- 56)
> > > 
> > > It seems like the performance impact is within the noise.
> > > 
> > > Link: https://github.com/google/neper [1]
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > ---
> > > v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
> > > Changes since v1:
> > > - Correctly handle large allocations which bypass slab
> > > - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
> > > 
> > > RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
> > > Changes since the RFC:
> > > - Added check for already charged slab objects.
> > > - Added performance results from neper's tcp_crr
> > > 
> > >  include/linux/slab.h            |  1 +
> > >  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
> > >  net/ipv4/inet_connection_sock.c |  5 ++--
> > >  3 files changed, 55 insertions(+), 2 deletions(-)
> > 
> > I can take the v3 in slab tree, if net people ack?
> 
> Thanks.
> 
> > 
> > BTW, will this be also useful for Linus's idea of charging struct files only
> > after they exist? But IIRC there was supposed to be also a part where we
> > have a way to quickly determine if we're not over limit (while allowing some
> > overcharge to make it quicker).

It should work and speed up the case when we can drop the object before charging.
I'd suggest to implement it in a separate change though.

Thanks!
Vlastimil Babka Aug. 29, 2024, 5:39 p.m. UTC | #14
On 8/29/24 18:10, Shakeel Butt wrote:
> On Thu, Aug 29, 2024 at 11:42:10AM GMT, Vlastimil Babka wrote:
>> On 8/28/24 01:52, Shakeel Butt wrote:
>> > At the moment, the slab objects are charged to the memcg at the
>> > allocation time. However there are cases where slab objects are
>> > allocated at the time where the right target memcg to charge it to is
>> > not known. One such case is the network sockets for the incoming
>> > connection which are allocated in the softirq context.
>> > 
>> > Couple hundred thousand connections are very normal on large loaded
>> > server and almost all of those sockets underlying those connections get
>> > allocated in the softirq context and thus not charged to any memcg.
>> > However later at the accept() time we know the right target memcg to
>> > charge. Let's add new API to charge already allocated objects, so we can
>> > have better accounting of the memory usage.
>> > 
>> > To measure the performance impact of this change, tcp_crr is used from
>> > the neper [1] performance suite. Basically it is a network ping pong
>> > test with new connection for each ping pong.
>> > 
>> > The server and the client are run inside 3 level of cgroup hierarchy
>> > using the following commands:
>> > 
>> > Server:
>> >  $ tcp_crr -6
>> > 
>> > Client:
>> >  $ tcp_crr -6 -c -H ${server_ip}
>> > 
>> > If the client and server run on different machines with 50 GBPS NIC,
>> > there is no visible impact of the change.
>> > 
>> > For the same machine experiment with v6.11-rc5 as base.
>> > 
>> >           base (throughput)     with-patch
>> > tcp_crr   14545 (+- 80)         14463 (+- 56)
>> > 
>> > It seems like the performance impact is within the noise.
>> > 
>> > Link: https://github.com/google/neper [1]
>> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>> > ---
>> > v1: https://lore.kernel.org/all/20240826232908.4076417-1-shakeel.butt@linux.dev/
>> > Changes since v1:
>> > - Correctly handle large allocations which bypass slab
>> > - Rearrange code to avoid compilation errors for !CONFIG_MEMCG builds
>> > 
>> > RFC: https://lore.kernel.org/all/20240824010139.1293051-1-shakeel.butt@linux.dev/
>> > Changes since the RFC:
>> > - Added check for already charged slab objects.
>> > - Added performance results from neper's tcp_crr
>> > 
>> >  include/linux/slab.h            |  1 +
>> >  mm/slub.c                       | 51 +++++++++++++++++++++++++++++++++
>> >  net/ipv4/inet_connection_sock.c |  5 ++--
>> >  3 files changed, 55 insertions(+), 2 deletions(-)
>> 
>> I can take the v3 in slab tree, if net people ack?
> 
> Thanks.
> 
>> 
>> BTW, will this be also useful for Linus's idea of charging struct files only
>> after they exist? But IIRC there was supposed to be also a part where we
>> have a way to quickly determine if we're not over limit (while allowing some
>> overcharge to make it quicker).
>>
> 
> Do you have link to those discussions or pointers to the code? From what
> you have described, I think this should work. We have the relevant gfp
> flags to control the charging behavior (with some caveats).

I think this was the last part of the discussion, and in the cover letter of
that there are links to the older threads for more context

https://lore.kernel.org/all/CAHk-%3DwhgFtbTxCAg2CWQtDj7n6CEyzvdV1wcCj2qpMfpw0%3Dm1A@mail.gmail.com/

>> Because right now this just overcharges unconditionally, but that's
>> understandable when the irq context creating the socket can't know the memcg
>> upfront. In the open() situation this is different.
>> 
> 
> For networking we deliberately overcharges in the irq context (if
> needed) and the course correct in the task context. However networking
> stack is very robust due to mechanisms like backoff, retransmit to handle
> situations like packet drops, allocation failures, congestion etc. Other
> subsystem are not that robust against ENOMEM. Once I have more detail I
> can follow up on the struct files case.

Ack. Agreed with Roman that it would be a separate followup change.

> thanks,
> Shakeel
> 
>
Yosry Ahmed Aug. 29, 2024, 6:28 p.m. UTC | #15
[..]
>
> Another reason is memory savings, if we have a small subset of objects in
> KMALLOC_NORMAL caches accounted, there might be e.g. one vector per a slab
> just to account on object while the rest is unaccounted. Separating between
> kmalloc and kmalloc-cg caches keeps the former with no vectors and the
> latter with fully used vectors.

Makes sense.

>
> > Wouldn't it be easier to special case the specific slab cache used for
> > the objcg vector or use a dedicated cache for it instead of using
> > kmalloc caches to begin with?
>
> The problem is the vector isn't a fixed size, it depends on how many objects
> a particular slab (not even a particular cache) has.

Oh right, I missed that part. Thanks for pointing it out.

>
> > Anyway, I am fine with any approach you and/or the slab maintainers
> > prefer, as long as we make things clear. If you keep the following
> > approach as-is, please expand the comment or refer to the commit you
> > just referenced.
> >
> > Personally, I prefer either explicitly special casing the slab cache
> > used for the objcgs vector, explicitly tagging KMALLOC_NORMAL
> > allocations, or having a dedicated documented helper that finds the
> > slab cache kmalloc type (if any) or checks if it is a KMALLOC_NORMAL
> > cache.
>
> A helper to check is_kmalloc_normal() would be better than defining
> KMALLOC_TYPE and using it directly, yes. We don't need to handle any other
> types now until anyone needs those.

is_kmalloc_normal() sounds good to me.

Thanks, Vlastimil.
Roman Gushchin Aug. 30, 2024, 8:34 p.m. UTC | #16
On Tue, Aug 27, 2024 at 04:52:28PM -0700, Shakeel Butt wrote:
> At the moment, the slab objects are charged to the memcg at the
> allocation time. However there are cases where slab objects are
> allocated at the time where the right target memcg to charge it to is
> not known. One such case is the network sockets for the incoming
> connection which are allocated in the softirq context.
> 
> Couple hundred thousand connections are very normal on large loaded
> server and almost all of those sockets underlying those connections get
> allocated in the softirq context and thus not charged to any memcg.
> However later at the accept() time we know the right target memcg to
> charge. Let's add new API to charge already allocated objects, so we can
> have better accounting of the memory usage.
> 
> To measure the performance impact of this change, tcp_crr is used from
> the neper [1] performance suite. Basically it is a network ping pong
> test with new connection for each ping pong.
> 
> The server and the client are run inside 3 level of cgroup hierarchy
> using the following commands:
> 
> Server:
>  $ tcp_crr -6
> 
> Client:
>  $ tcp_crr -6 -c -H ${server_ip}
> 
> If the client and server run on different machines with 50 GBPS NIC,
> there is no visible impact of the change.
> 
> For the same machine experiment with v6.11-rc5 as base.
> 
>           base (throughput)     with-patch
> tcp_crr   14545 (+- 80)         14463 (+- 56)
> 
> It seems like the performance impact is within the noise.
> 
> Link: https://github.com/google/neper [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index eb2bf4629157..05cfab107c72 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -547,6 +547,7 @@  void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
 			    gfp_t gfpflags) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_lru(...)	alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
 
+bool kmem_cache_charge(void *objp, gfp_t gfpflags);
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
 kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
diff --git a/mm/slub.c b/mm/slub.c
index c9d8a2497fd6..8265ea5f25be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2185,6 +2185,43 @@  void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
 
 	__memcg_slab_free_hook(s, slab, p, objects, obj_exts);
 }
+
+#define KMALLOC_TYPE (SLAB_KMALLOC | SLAB_CACHE_DMA | \
+		      SLAB_ACCOUNT | SLAB_RECLAIM_ACCOUNT)
+
+static __fastpath_inline
+bool memcg_slab_post_charge(void *p, gfp_t flags)
+{
+	struct slabobj_ext *slab_exts;
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+	unsigned long off;
+
+	folio = virt_to_folio(p);
+	if (!folio_test_slab(folio)) {
+		return __memcg_kmem_charge_page(folio_page(folio, 0), flags,
+						folio_order(folio)) == 0;
+	}
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+
+	/* Ignore KMALLOC_NORMAL cache to avoid circular dependency. */
+	if ((s->flags & KMALLOC_TYPE) == SLAB_KMALLOC)
+		return true;
+
+	/* Ignore already charged objects. */
+	slab_exts = slab_obj_exts(slab);
+	if (slab_exts) {
+		off = obj_to_index(s, slab, p);
+		if (unlikely(slab_exts[off].objcg))
+			return true;
+	}
+
+	return __memcg_slab_post_alloc_hook(s, NULL, flags, 1, &p);
+}
+
 #else /* CONFIG_MEMCG */
 static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct list_lru *lru,
@@ -2198,6 +2235,11 @@  static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 					void **p, int objects)
 {
 }
+
+static inline bool memcg_slab_post_charge(void *p, gfp_t flags)
+{
+	return true;
+}
 #endif /* CONFIG_MEMCG */
 
 /*
@@ -4062,6 +4104,15 @@  void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_lru_noprof);
 
+bool kmem_cache_charge(void *objp, gfp_t gfpflags)
+{
+	if (!memcg_kmem_online())
+		return true;
+
+	return memcg_slab_post_charge(objp, gfpflags);
+}
+EXPORT_SYMBOL(kmem_cache_charge);
+
 /**
  * kmem_cache_alloc_node - Allocate an object on the specified node
  * @s: The cache to allocate from.
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 64d07b842e73..3c13ca8c11fb 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -715,6 +715,7 @@  struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 	release_sock(sk);
 	if (newsk && mem_cgroup_sockets_enabled) {
 		int amt = 0;
+		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
 
 		/* atomically get the memory usage, set and charge the
 		 * newsk->sk_memcg.
@@ -731,8 +732,8 @@  struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 		}
 
 		if (amt)
-			mem_cgroup_charge_skmem(newsk->sk_memcg, amt,
-						GFP_KERNEL | __GFP_NOFAIL);
+			mem_cgroup_charge_skmem(newsk->sk_memcg, amt, gfp);
+		kmem_cache_charge(newsk, gfp);
 
 		release_sock(newsk);
 	}