[1/2] audit: Vary struct audit_entry alignment

Message ID 20231212102857.803984-2-haakon.bugge@oracle.com (mailing list archive)
State Rejected
Delegated to: Paul Moore
Series: audit: Further reduce syscall latency

Commit Message

Haakon Bugge Dec. 12, 2023, 10:28 a.m. UTC
We allocate struct audit_entry using kzalloc(), which rounds the
allocation up to the kmalloc-512 slab and therefore aligns the
structure on a 512-byte boundary.

That means the nine low-order address bits are identical for all of
these allocations, which on architectures with a limited number of L1D
cache sets causes conflict misses.
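
As an illustration (assuming 64-byte cache lines and a 64-set L1D,
e.g. a 32 KiB 8-way cache; these numbers are assumptions for the
sketch, not the geometry of the test machine), a short userspace
sketch of the set-index math:

#include <stdio.h>

/* Illustrative sketch only: assumes 64-byte lines and 64 L1D sets
 * (e.g. a 32 KiB, 8-way L1D); the machine used for the measurements
 * below may differ.
 */
#define LINE_SHIFT	6	/* 64-byte cache lines */
#define NUM_SETS	64

static unsigned int l1d_set(unsigned long addr)
{
	return (addr >> LINE_SHIFT) & (NUM_SETS - 1);
}

int main(void)
{
	/* 512-byte-aligned base addresses have bits 6..8 clear, so every
	 * entry's first line maps to one of sets 0, 8, 16, ..., 56, and a
	 * hot field at a given offset lands in the same set for all
	 * entries -- hence the conflict misses.
	 */
	for (unsigned long addr = 0; addr < 8 * 512; addr += 512)
		printf("addr %#lx -> set %u\n", addr, l1d_set(addr));
	return 0;
}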

With STIG-compliant audit rules, up to 26 L1D misses per syscall have
been observed. This, of course, varies from boot to boot.

Fix this by using a kmem_cache aligned to the cacheline size, which
greatly increases the entropy in the low-order virtual-address bits.
With this fix, we observe a maximum of 0.8 misses per syscall using the
same set of audit rules.

Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
simple getpid() loop, running 100 million rounds. This test is run 10
times, with a reboot in between. After each reboot, we wait until
the last minute load is less than 1.0.

We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
amplify the changes in the audit system.
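
For reference, a minimal sketch of what the syscall_lat loop might
look like (the actual source is not included here; only the loop count
is taken from the description above):

/* syscall_lat.c -- reconstruction of the benchmark described above:
 * a tight getpid() loop, 100 million rounds, run under
 * "perf stat -d -r 5 ./syscall_lat".
 */
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	for (long i = 0; i < 100000000L; i++)
		syscall(SYS_getpid);	/* raw syscall, avoids any libc-level caching */
	return 0;
}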

Base vs. base + this commit gives:

ns per call:
  min avg max   pstdev
- 205 210 227 6.402000
+ 203 203 209 0.954149

L1d misses per syscall:
    min    avg    max   pstdev
- 3.147 12.017 26.045 6.568284
+ 0.012  0.103  0.817 0.238352

ipc:
    min    avg    max   pstdev
- 2.090  2.259  2.300 0.060075
+ 2.320  2.329  2.330 0.003000

All of the above metrics improve, with the L1D misses per syscall
showing a particularly significant reduction.

Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 kernel/auditfilter.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Comments

Paul Moore Dec. 13, 2023, 11:54 p.m. UTC | #1
On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>
> We allocate struct audit_entry using kzalloc() which aligns the
> structure at its natural boundary and so uses the kmalloc-512
> SLAB.
>
> That means that the lower order 9 bits are equal for these allocations.
> Which on architectures with limited L1D cache-sets width would cause
> conflict misses.
>
> With STIG compliant audit rules, up-to 26 L1D misses per syscall have
> been observed. This, of course, varies from boot to boot.
>
> Fix this by using a kmem_cache aligned at cacheline width, which
> greatly increases the entropy in the lower order VA bits. With this
> fix, we observe a maximum of 0.8 misses per syscall using the same set
> of audit rules.
>
> Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
> simple getpid() loop, running 100 million rounds. This test is run 10
> times, with a reboot in between. After each reboot, we wait until
> the last minute load is less than 1.0.
>
> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> amplify the changes in the audit system.
>
> Base vs. base + this commit gives:
>
> ns per call:
>   min avg max   pstdev
> - 205 210 227 6.402000
> + 203 203 209 0.954149
>
> L1d misses per syscall:
>     min    avg    max   pstdev
> - 3.147 12.017 26.045 6.568284
> + 0.012  0.103  0.817 0.238352
>
> ipc:
>     min    avg    max   pstdev
> - 2.090  2.259  2.300 0.060075
> + 2.320  2.329  2.330 0.003000
>
> The above metrics are all improved, and the L1D misses per syscall has
> a significant reduction.
>
> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  kernel/auditfilter.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index 8317a37dea0bb..b7597a63a3950 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -80,6 +80,7 @@ static void audit_free_lsm_field(struct audit_field *f)
>         }
>  }
>
> +static struct kmem_cache *entry_cache;
>  static inline void audit_free_rule(struct audit_entry *e)
>  {
>         int i;
> @@ -93,7 +94,7 @@ static inline void audit_free_rule(struct audit_entry *e)
>                         audit_free_lsm_field(&erule->fields[i]);
>         kfree(erule->fields);
>         kfree(erule->filterkey);
> -       kfree(e);
> +       kmem_cache_free(entry_cache, e);
>  }
>
>  void audit_free_rule_rcu(struct rcu_head *head)
> @@ -108,13 +109,20 @@ static inline struct audit_entry *audit_init_entry(u32 field_count)
>         struct audit_entry *entry;
>         struct audit_field *fields;
>
> -       entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> +       if (!entry_cache) {
> +               entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
> +                                               SLAB_HWCACHE_ALIGN, NULL);
> +               if (!entry_cache)
> +                       return NULL;

Two things:

1. If we are going to create a kmem_cache pool we shouldn't create it
here, it should be in its own audit_filter_init() function which is
called from audit_init().

2. I'm not sure it makes a lot of sense to create a kmem_cache pool
for audit filter entries, especially given the modest performance
gains.  Is there not some way to request cacheline alignment with
kmalloc() or similar?

> +       }
> +
> +       entry = kmem_cache_zalloc(entry_cache, GFP_KERNEL);
>         if (unlikely(!entry))
>                 return NULL;
>
>         fields = kcalloc(field_count, sizeof(*fields), GFP_KERNEL);
>         if (unlikely(!fields)) {
> -               kfree(entry);
> +               kmem_cache_free(entry_cache, entry);
>                 return NULL;
>         }
>         entry->rule.fields = fields;
> --
> 2.39.3
Haakon Bugge Dec. 16, 2023, 4:25 p.m. UTC | #2
> On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> 
> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote:
>> 
>> We allocate struct audit_entry using kzalloc() which aligns the
>> structure at its natural boundary and so uses the kmalloc-512
>> SLAB.
>> 
>> That means that the lower order 9 bits are equal for these allocations.
>> Which on architectures with limited L1D cache-sets width would cause
>> conflict misses.
>> 
>> With STIG compliant audit rules, up-to 26 L1D misses per syscall have
>> been observed. This, of course, varies from boot to boot.
>> 
>> Fix this by using a kmem_cache aligned at cacheline width, which
>> greatly increases the entropy in the lower order VA bits. With this
>> fix, we observe a maximum of 0.8 misses per syscall using the same set
>> of audit rules.
>> 
>> Testing: run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a
>> simple getpid() loop, running 100 million rounds. This test is run 10
>> times, with a reboot in between. After each reboot, we wait until
>> the last minute load is less than 1.0.
>> 
>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
>> amplify the changes in the audit system.
>> 
>> Base vs. base + this commit gives:
>> 
>> ns per call:
>>  min avg max   pstdev
>> - 205 210 227 6.402000
>> + 203 203 209 0.954149
>> 
>> L1d misses per syscall:
>>    min    avg    max   pstdev
>> - 3.147 12.017 26.045 6.568284
>> + 0.012  0.103  0.817 0.238352
>> 
>> ipc:
>>    min    avg    max   pstdev
>> - 2.090  2.259  2.300 0.060075
>> + 2.320  2.329  2.330 0.003000
>> 
>> The above metrics are all improved, and the L1D misses per syscall has
>> a significant reduction.
>> 
>> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>> ---
>> kernel/auditfilter.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>> 
>> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
>> index 8317a37dea0bb..b7597a63a3950 100644
>> --- a/kernel/auditfilter.c
>> +++ b/kernel/auditfilter.c
>> @@ -80,6 +80,7 @@ static void audit_free_lsm_field(struct audit_field *f)
>>        }
>> }
>> 
>> +static struct kmem_cache *entry_cache;
>> static inline void audit_free_rule(struct audit_entry *e)
>> {
>>        int i;
>> @@ -93,7 +94,7 @@ static inline void audit_free_rule(struct audit_entry *e)
>>                        audit_free_lsm_field(&erule->fields[i]);
>>        kfree(erule->fields);
>>        kfree(erule->filterkey);
>> -       kfree(e);
>> +       kmem_cache_free(entry_cache, e);
>> }
>> 
>> void audit_free_rule_rcu(struct rcu_head *head)
>> @@ -108,13 +109,20 @@ static inline struct audit_entry *audit_init_entry(u32 field_count)
>>        struct audit_entry *entry;
>>        struct audit_field *fields;
>> 
>> -       entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>> +       if (!entry_cache) {
>> +               entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
>> +                                               SLAB_HWCACHE_ALIGN, NULL);
>> +               if (!entry_cache)
>> +                       return NULL;
> 
> Two things:
> 
> 1. If we are going to create a kmem_cache pool we shouldn't create it
> here, it should be in its own audit_filter_init() function which is
> called from audit_init().

Understood. Will fix.
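
Something along these lines, I assume (a sketch only, following the
naming in your suggestion; the final error handling and call site may
differ):

/* Sketch: create the cache once at boot instead of lazily in
 * audit_init_entry().  audit_filter_init() and the call from
 * audit_init() follow the suggestion above; error handling here is
 * illustrative only.
 */
static struct kmem_cache *entry_cache;

int __init audit_filter_init(void)
{
	entry_cache = kmem_cache_create("audit_entry",
					sizeof(struct audit_entry), 0,
					SLAB_HWCACHE_ALIGN, NULL);
	return entry_cache ? 0 : -ENOMEM;
}

/* ... with audit_init() in kernel/audit.c calling audit_filter_init()
 * during initialization, and audit_init_entry() then using
 * kmem_cache_zalloc(entry_cache, GFP_KERNEL) unconditionally.
 */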

> 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> for audit filter entries, especially given the modest performance
> gains.  Is there not some way to request cacheline alignment with
> kmalloc() or similar?

The problem with today's kzalloc() is the lack of entropy in the low-order address bits, because the audit filter entries are all aligned on a 512B boundary. IOW, they are over-aligned. The increased entropy is exactly what we get from using a kmem_cache, which lets more L1D cache sets be used.

Although the performance gain is modest, the reduction in L1D cache misses is substantial, and that will improve performance on most archs that employ a virtually indexed L1D cache. This commit also acts as a prerequisite for the second commit in this series, avoiding high variability in its performance gain.


Thxs, Håkon
Paul Moore Dec. 18, 2023, 10:07 p.m. UTC | #3
On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> >
> > Two things:
> >
> > 1. If we are going to create a kmem_cache pool we shouldn't create it
> > here, it should be in its own audit_filter_init() function which is
> > called from audit_init().
>
> Understood. Will fix.
>
> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> > for audit filter entries, especially given the modest performance
> > gains.  Is there not some way to request cacheline alignment with
> > kmalloc() or similar?
>
> The problem with today's kzmalloc() is lack of entropy on the lower order address bits, because the audit filter entries are aligned on a 512B boundary. IOW, they are too much aligned. The increased entropy is exactly what we get from using a kmem_cache which yields more L1D cache sets to be used.
>
> Although the performance gain is modest, the reduction in L1D cache misses is substantial and that will improve performance on most archs that employ a virtually indexed L1D cache. And, this commit acts as a prerequisite to avoid high variability in performance gain from the second commit in this series.

My hesitation about using a kmem_cache object here remains; given the
relatively limited and static filter rule configuration, I would rather
use a k*malloc() based approach.
Ankur Arora Dec. 19, 2023, 9:07 p.m. UTC | #4
Paul Moore <paul@paul-moore.com> writes:

> On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
>> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
>> >
>> > Two things:
>> >
>> > 1. If we are going to create a kmem_cache pool we shouldn't create it
>> > here, it should be in its own audit_filter_init() function which is
>> > called from audit_init().
>>
>> Understood. Will fix.
>>
>> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
>> > for audit filter entries, especially given the modest performance
>> > gains.  Is there not some way to request cacheline alignment with
>> > kmalloc() or similar?
>>
>> The problem with today's kzmalloc() is lack of entropy on the lower order address bits, because the audit filter entries are aligned on a 512B boundary. IOW, they are too much aligned. The increased entropy is exactly what we get from using a kmem_cache which yields more L1D cache sets to be used.
>>
>> Although the performance gain is modest, the reduction in L1D cache misses is substantial and that will improve performance on most archs that employ a virtually indexed L1D cache. And, this commit acts as a prerequisite to avoid high variability in performance gain from the second commit in this series.
>
> My hesitation of using a kmem_cache object here remains, given the
> relatively limited and static filter rule configuration I would rather
> use a k*malloc() based approach.

AFAICT, kmalloc() and friends only allow a fixed, size-derived
alignment. From Documentation/core-api/memory-allocation.rst:

  The address of a chunk allocated with `kmalloc` is aligned to at least
  ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
  alignment is also guaranteed to be at least the respective size.
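
As a generic (non-audit) illustration of the resulting alignment for a
~500-byte object (structure name and sizes are hypothetical):

/* Illustrative only: a ~500-byte kzalloc() is served from the
 * kmalloc-512 bucket, so in practice (without slab debugging) the nine
 * low-order address bits come out as zero; kmalloc() has no parameter
 * to request a smaller, cacheline-only alignment instead.
 */
#include <linux/printk.h>
#include <linux/slab.h>

struct example {		/* hypothetical stand-in, not audit code */
	char buf[500];
};

static void kmalloc_alignment_demo(void)
{
	struct example *p = kzalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return;
	pr_info("low 9 address bits: 0x%lx\n", (unsigned long)p & 511);
	kfree(p);
}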

I had sent out a patch a while ago reducing the cost of this same
alignment issue. For me the most pernicious part was that syscall
latency was good or poor depending on how boot-time factors affected
the audit allocations.

So while I do agree with your hesitation about kmem_cache not being
quite the right interface for what are essentially static allocations,
I think it might be worth it given the cost.

Thanks
--
ankur
Paul Moore Dec. 19, 2023, 9:27 p.m. UTC | #5
On Tue, Dec 19, 2023 at 4:07 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Sat, Dec 16, 2023 at 11:25 AM Haakon Bugge <haakon.bugge@oracle.com> wrote:
> >> > On 14 Dec 2023, at 00:54, Paul Moore <paul@paul-moore.com> wrote:
> >> >
> >> > Two things:
> >> >
> >> > 1. If we are going to create a kmem_cache pool we shouldn't create it
> >> > here, it should be in its own audit_filter_init() function which is
> >> > called from audit_init().
> >>
> >> Understood. Will fix.
> >>
> >> > 2. I'm not sure it makes a lot of sense to create a kmem_cache pool
> >> > for audit filter entries, especially given the modest performance
> >> > gains.  Is there not some way to request cacheline alignment with
> >> > kmalloc() or similar?
> >>
> >> The problem with today's kzmalloc() is lack of entropy on the lower order address bits, because the audit filter entries are aligned on a 512B boundary. IOW, they are too much aligned. The increased entropy is exactly what we get from using a kmem_cache which yields more L1D cache sets to be used.
> >>
> >> Although the performance gain is modest, the reduction in L1D cache misses is substantial and that will improve performance on most archs that employ a virtually indexed L1D cache. And, this commit acts as a prerequisite to avoid high variability in performance gain from the second commit in this series.
> >
> > My hesitation of using a kmem_cache object here remains, given the
> > relatively limited and static filter rule configuration I would rather
> > use a k*malloc() based approach.
>
> AFAICT, kmalloc() etc only allows fixed alignment. From
> Documentation/core-api/memory-allocation.rst:
>
>   The address of a chunk allocated with `kmalloc` is aligned to at least
>   ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
>   alignment is also guaranteed to be at least the respective size.
>
> I had sent out a patch a while ago reducing the cost of the same
> alignment issue. For me the most pernicious part was the fact that
> syscall latency was good or poor based on how boot time factors
> affected audit allocations.
>
> So while I do agree with your hesitation on kmem_cache not being quite
> the right interface for what are static allocations, I think it might
> be worth it given the cost.

Once again, I appreciate the time and effort that you have put into
this patchset, but it is not something I want to accept at this point
in time.

Patch

diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 8317a37dea0bb..b7597a63a3950 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -80,6 +80,7 @@  static void audit_free_lsm_field(struct audit_field *f)
 	}
 }
 
+static struct kmem_cache *entry_cache;
 static inline void audit_free_rule(struct audit_entry *e)
 {
 	int i;
@@ -93,7 +94,7 @@  static inline void audit_free_rule(struct audit_entry *e)
 			audit_free_lsm_field(&erule->fields[i]);
 	kfree(erule->fields);
 	kfree(erule->filterkey);
-	kfree(e);
+	kmem_cache_free(entry_cache, e);
 }
 
 void audit_free_rule_rcu(struct rcu_head *head)
@@ -108,13 +109,20 @@  static inline struct audit_entry *audit_init_entry(u32 field_count)
 	struct audit_entry *entry;
 	struct audit_field *fields;
 
-	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry_cache) {
+		entry_cache = kmem_cache_create("audit_entry", sizeof(*entry), 0,
+						SLAB_HWCACHE_ALIGN, NULL);
+		if (!entry_cache)
+			return NULL;
+	}
+
+	entry = kmem_cache_zalloc(entry_cache, GFP_KERNEL);
 	if (unlikely(!entry))
 		return NULL;
 
 	fields = kcalloc(field_count, sizeof(*fields), GFP_KERNEL);
 	if (unlikely(!fields)) {
-		kfree(entry);
+		kmem_cache_free(entry_cache, entry);
 		return NULL;
 	}
 	entry->rule.fields = fields;