[RFC,0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Message ID: f184a2d6-7892-4e43-a0cd-cab638c3d5c2@amd.com

Neeraj Upadhyay Jan. 10, 2024, 11:11 a.m. UTC
Problem Statement
=================

Nginx performance testing with Apparmor enabled (with nginx
running in the unconfined profile), on kernel versions 6.1 and 6.5,
shows a significant drop in throughput scalability when Nginx
workers are scaled to a higher number of CPUs across various
L3 cache domains.

Below is one data sample of the throughput scalability loss,
based on results on an AMD Zen4 system with 96 cores and SMT
count 2; so, overall, 192 CPUs:

Config      Cache Domains     apparmor=off        apparmor=on
                             scaling eff (%)      scaling eff (%)
8C16T          1                  100%             100%
16C32T         2                   95%              94%
24C48T         3                   94%              93%
48C96T         6                   92%              88%
96C192T        12                  85%              68%

If we look at the above data, there is a significant drop in
scaling efficiency when we move to 96 cores/192 SMT threads.

The perf tool shows most of the contention coming from the two hooks below:
6.56%     nginx  [kernel.vmlinux]      [k] apparmor_current_getsecid_subj
6.22%     nginx  [kernel.vmlinux]      [k] apparmor_file_open

The majority of the CPU cycles are found to be due to memory contention
in the atomic_fetch_add and atomic_fetch_sub operations from kref_get()
and kref_put() calls on labels.

Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
from 6.7, alleviates the issue to an extent, but not completely:

Config      Cache Domains     apparmor=on        apparmor=on (patched)
                             scaling eff (%)      scaling eff (%)
8C16T          1                  100%             100%
16C32T         2                   97%              93%
24C48T         3                   94%              92%
48C96T         6                   88%              88%
96C192T        12                  65%              79%

This adverse impact gets more pronounced when we move to >192 CPUs.
The memory contention and its impact increase with high-frequency
label update operations, as labels are marked stale more frequently.


Label Refcount Management
=========================

Apparmor uses label objects (struct aa_label) to manage refcounts for
the below set of objects:

- Applicable profiles
- Namespaces (unconfined profile)
- Other non-profile references

These label references are acquired in various apparmor LSM hooks,
on operations such as file open, task kill, socket bind, and other
file, socket, and misc operations which use the current task's cred,
when the label for the current cred has been marked stale. This is
done to check these operations against the set of allowed operations
for the task performing them.
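
For reference, the hot path boils down to a kref get/put pair on a
shared label object. Below is a simplified sketch of that pattern,
modeled on AppArmor's aa_get_label()/aa_put_label(); the release
callback and most label fields are elided:

  #include <linux/kref.h>

  struct aa_label {
          struct kref count;
          /* ... profiles, namespace, flags elided ... */
  };

  static void aa_label_kref(struct kref *kref);  /* release callback (elided) */

  static inline struct aa_label *aa_get_label(struct aa_label *l)
  {
          if (l)
                  kref_get(&(l->count));  /* atomic_fetch_add on a shared cacheline */
          return l;
  }

  static inline void aa_put_label(struct aa_label *l)
  {
          if (l)
                  /* atomic_fetch_sub + zero check on the same cacheline */
                  kref_put(&l->count, aa_label_kref);
  }

Every hook invocation from every CPU bounces the cacheline holding
label->count, which is what the perf profile above shows.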

Use Percpu refcount for ref management?
=======================================

The ref put operations (percpu_ref_put()) of a percpu refcount,
in percpu (active) mode, do not check whether the ref count has
dropped to 0. The users of the percpu_ref need to explicitly invoke
a percpu_ref_kill() operation, to drop the initial reference,
at shutdown points. After the percpu_ref_kill() operation, the ref
switches to atomic mode and any new percpu_ref_put() operation
checks for the drop to 0 case and invokes the release operation
on that label.
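
Below is a minimal sketch of this lifecycle. The pcount member and
the function names are assumptions made for illustration, not the
patchset's actual naming:

  #include <linux/percpu-refcount.h>

  /* Assumed: struct aa_label gains a 'struct percpu_ref pcount' member. */

  static void aa_label_release(struct percpu_ref *ref)
  {
          /* Runs on the final put. In percpu mode this can only
           * happen after percpu_ref_kill() forces atomic mode. */
  }

  static int aa_label_ref_init(struct aa_label *label)
  {
          /* Start in atomic mode; percpu mode is enabled once live. */
          return percpu_ref_init(&label->pcount, aa_label_release,
                                 PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
  }

  static void aa_label_go_live(struct aa_label *label)
  {
          /* From here on, gets/puts are cheap percpu increments and
           * puts no longer check for a drop to 0. */
          percpu_ref_switch_to_percpu(&label->pcount);
  }

  static void aa_label_shutdown(struct aa_label *label)
  {
          /* Drops the initial ref and switches to atomic mode; the
           * put that reaches 0 then invokes aa_label_release(). */
          percpu_ref_kill(&label->pcount);
  }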

Labels are marked stale in situations like profile removal and
profile updates. For a namespace, the unconfined label reference
is dropped when the namespace is destroyed. These points
are potential shutdown points for labels. However, killing
the percpu ref at these points has a few issues:

- The label could still be referenced by tasks, which are
  still holding the reference to the now stale label.
  Killing the label ref while these operations are in progress
  makes all subsequent ref-put operations on the stale label
  atomic, which would still result in memory contention.
  Also, any new reference to the stale label, acquired via
  the elevated refcount, will have atomic-op contention.

- The label is marked stale using a non-atomic write operation.
  It is possible that new operations do not observe this flag
  and keep referencing the label for quite some time (see the
  sketch after this list).

- Explicitly tracking the shutdown points might not be maintainable
  at a per-label granularity, as there are various paths where a
  label reference could get dropped, such as before the label has
  gone live - object initialization error paths. Also, tracking
  the shutdown points for labels which reference other labels -
  subprofiles, merged labels - requires careful analysis, and adds
  a heavy burden of ensuring that memory contention is not introduced
  by these ref kill points.
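
To make the stale-flag visibility issue concrete, below is a small
sketch of the marking, modeled on AppArmor's FLAG_STALE usage
(simplified; the flag value here is illustrative). Both sides are
plain memory accesses, so nothing forces readers to observe the
flag promptly:

  #define FLAG_STALE 2  /* illustrative value; the real flag lives in label.h */

  static inline void aa_label_make_stale(struct aa_label *label)
  {
          label->flags |= FLAG_STALE;        /* plain, non-atomic RMW */
  }

  static inline bool aa_label_is_stale(struct aa_label *label)
  {
          return label->flags & FLAG_STALE;  /* plain load; may lag the store */
  }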


Proposed Solution
=================

One potential solution to the refcount scalability problem is to
convert the label refcount to a percpu refcount, and manage
the initial reference from a kworker context. The kworker
keeps an extra reference to the label, periodically scans
labels, and releases them if their refcount drops to 0.

Below is the sequence of operations, which shows the refcount
management with this approach:

1. During label initialization, the percpu ref is initialized in
   atomic mode. This handles cases where the label hasn't gone
   live (->ns isn't assigned), mostly in initialization error
   paths, where a final put must still drop the count to 0 and
   trigger the release without an explicit kill operation.

2. Labels are switched to percpu mode at various points -
   when a label is added to the labelset tree, and when an
   unconfined profile has been assigned a namespace.

3. As part of the initial prototype, only the in-tree labels
   are managed by the kworker. These labels are added to a lockless
   list. The unconfined labels invoke a percpu_ref_kill() operation
   when the namespace is destroyed.

4. The kworker does a periodic scan of all the labels in the
   llist, using the below sequence of operations (a sketch of the
   scan loop follows this list):

   a. Enqueue a dummy node to mark the start of the scan. This dummy
      node is used as the start point of the scan and ensures that
      no additional synchronization is required with new
      label node additions to the llist. Any new labels will
      be processed in the next run of the kworker.

                      SCAN START PTR
                          |
                          v
      +----------+     +------+    +------+    +------+
      |          |     |      |    |      |    |      |
      |   head   ------> dummy|--->|label |--->| label|--->NULL
      |          |     | node |    |      |    |      |
      +----------+     +------+    +------+    +------+


      New label addition:

                            SCAN START PTR
                                 |
                                 v
      +----------+  +------+  +------+    +------+    +------+
      |          |  |      |  |      |    |      |    |      |
      |   head   |--> label|--> dummy|--->|label |--->| label|--->NULL
      |          |  |      |  | node |    |      |    |      |
      +----------+  +------+  +------+    +------+    +------+

    b. Traverse the llist, starting from dummy->next.
       If the node is a dummy node, mark it free.
       If the node is a label node, do:

       i) Switch the label ref to atomic mode. The ref switch waits
          for the existing percpu_ref_get() and percpu_ref_put()
          operations to complete, by waiting for an RCU grace period.

          Once the switch is complete, from this point onwards, any
          percpu_ref_get(), percpu_ref_put() operations use
          atomic operations.

      ii) Drop the initial reference, which was taken while adding
          the label node to the llist.

     iii) Use a percpu_ref_tryget() increment operation on the
          ref, to see if we dropped the last ref count. If we
          dropped the last count, we remove the node from the llist.

          All of these operations are done inside a RCU critical
          section, to avoid race with the release operations,
          which can potentially trigger, as soon as we drop
          the initial ref count.

      iv) If we didn't drop the last ref, switch back the counter
          to percpu mode.
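
Putting steps 4.a and 4.b together, below is a loose sketch of the
scan loop. The node type, list head, and pcount field are assumptions
for illustration; dummy-node recycling, scan limits, and error
handling are elided:

  #include <linux/llist.h>
  #include <linux/percpu-refcount.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct aa_label_node {
          struct llist_node llnode;
          bool is_dummy;
          struct aa_label *label;         /* NULL for dummy nodes */
  };

  static LLIST_HEAD(aa_label_reclaim_head);

  static void aa_label_reclaim_scan(void)
  {
          struct aa_label_node *dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
          struct llist_node *pos, *prev;

          if (!dummy)
                  return;
          dummy->is_dummy = true;
          /* 4.a: the dummy marks the scan start; nodes pushed after it
           * are left for the next scan, so no extra synchronization. */
          llist_add(&dummy->llnode, &aa_label_reclaim_head);

          prev = &dummy->llnode;
          while ((pos = prev->next)) {
                  struct aa_label_node *n =
                          container_of(pos, struct aa_label_node, llnode);

                  if (n->is_dummy) {      /* dummy from an earlier scan */
                          prev->next = pos->next;
                          kfree(n);
                          continue;
                  }

                  /* 4.b.i: wait for in-flight percpu gets/puts (RCU GP) */
                  percpu_ref_switch_to_atomic_sync(&n->label->pcount);

                  rcu_read_lock();
                  /* 4.b.ii: drop the initial ref taken at list-add time */
                  percpu_ref_put(&n->label->pcount);
                  /* 4.b.iii: did that put drop the last reference? */
                  if (!percpu_ref_tryget(&n->label->pcount)) {
                          /* release has run; unlink our node */
                          prev->next = pos->next;
                          rcu_read_unlock();
                          kfree(n);
                          continue;
                  }
                  rcu_read_unlock();
                  /* 4.b.iv: still referenced; the tryget ref becomes the
                   * list's reference, and the label goes percpu again. */
                  percpu_ref_switch_to_percpu(&n->label->pcount);
                  prev = pos;
          }
  }

Interior unlinking via prev->next is safe here because producers only
push at the list head, in front of the dummy node.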

Using this approach to move the atomic refcount manipulation out of the
contended paths, there is a significant scalability improvement on the
nginx test, and scaling efficiency is close to the apparmor-off case.

Config      Cache Domains     apparmor=on (percpuref)
                               scaling eff (%)
8C16T          1                  100%
16C32T         2                   96%
24C48T         3                   94%
48C96T         6                   93%
96C192T        12                  90%

Limitations
===========

1. Switching to percpu refcount increases memory size overhead, as
   percpu memory is allocated for all labels.

2. Deferring label reclaim could potentially result in memory
   pressure when there is a high frequency of label update operations.

3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
   This can impact energy efficiency, due to back-to-back hurry
   callbacks. Using a deferrable workqueue partly mitigates this.
   However, deferring the kworker can delay reclaims.

4. Back-to-back label switches can delay other percpu-ref users, as
   there is a single global switch spinlock used by the percpu refcount
   lib.

5. A long-running kworker can delay other use cases like system suspend.
   This is mitigated by using a freezable workqueue and limiting node
   scans to a max count (see the workqueue sketch after this list).

6. There is a window where the label operates in atomic mode, while its
   counter is being checked for zero. This can potentially result
   in high memory contention during this window, which spans an RCU
   grace period (plus callback execution). For example, when
   scanning the label corresponding to the unconfined profile, all
   applications which use the unconfined profile would be using
   atomic ref increment and decrement operations.

   There are a few approaches which were tried to mitigate this issue:

   a. At a lower time interval, check whether the scanned label's counter
      has changed since the start of the label scan. If there is a change
      in count, terminate the switch to atomic mode. Below is a sketch of
      this approach using rcuwait.

      /* aa_reclaim_rcuwait, aa_atomic_switch_complete, and the
       * percpu_ref_count_read()/rcuwait_wait_event_timeout() helpers
       * are part of this prototype. */
      static struct rcuwait aa_reclaim_rcuwait;
      static bool aa_atomic_switch_complete;

      static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
      {
         WRITE_ONCE(aa_atomic_switch_complete, true);
         rcuwait_wake_up(&aa_reclaim_rcuwait);
      }

      rcuwait_init(&aa_reclaim_rcuwait);
      percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);

      atomic_count = percpu_ref_count_read(&label->count);
      do {
         rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
                            (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
                            TASK_IDLE,
                            msecs_to_jiffies(5));
         /* bail out if the label saw new get/put activity meanwhile */
         if (percpu_ref_count_read(&label->count) != atomic_count)
                 break;
      } while (!READ_ONCE(switch_complete));

       However, this approach does not work, as the percpu refcount lib does
       not allow termination of an ongoing switch operation. Also, the counter
       can return to its original value through a set of get() and put()
       operations before we check the current value.

   b. Approaches that notify the reclaim kworker from the ref get and put
      operations can potentially disturb cache line state between the
      various CPU contexts which are referencing the label, and can
      potentially impact scalability again.

   c. Switch the label to an immortal percpu ref, while the scan operates
      on the current counter.

      Below is the sequence of operations to do this:

      1. Ensure that both immortal ref and label ref are in percpu mode.
         Reinit the immortal ref in percpu mode.

         Swap percpu and atomic counters of label refcount and immortal ref
	                          percpu-ref
      	                  +-------------------+                
      +-------+           |  percpu-ctr-addr1 |                
      | label | --------->|-------------------|    +----------------+ 
      +-------+           |   data            |--->| Atomic counter1| 
                          +-------------------+    +----------------+ 
      +-------+           +-------------------+                
      |ImmLbl |---------->|  percpu-ctr-addr2 |    +----------------+
      +-------+           |-------------------|--->| Atomic counter2|
                          |    data           |    +----------------+
                          +-------------------+                

          label ->percpu-ctr-addr  = percpu-ctr-addr2
          ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
          label ->data->count      = Atomic counter2
          ImmLbl ->data->count     = Atomic counter1
  
  
      2. Check the counters collected in the immortal label, by switching
         it to atomic mode.

      3. If the count is 0, do:
         a. Switch the immortal counter to percpu again, giving it an
            initial count of 1.
         b. Swap the label and immortal counters again. The immortal
            ref now holds the counter values from new percpu ref get
            and put operations on the label ref, accumulated since
            the initial swap operation.
         c. Transfer the percpu counts in immortal ref to atomic
            counter of label percpu refcount.
         d. Kill immortal ref, for reinit on next iteration.
         e. Switch label percpu ref to atomic mode.
         f. If the counter is 1, drop the initial ref.

       4. If the count is not 0, re-swap the counters:
          a. Switch the immortal counter to percpu again, giving it an
             initial count of 1.
          b. Swap the label and immortal counters again. The immortal
             ref now holds the counter values from new percpu ref get
             and put operations on the label ref, accumulated since
             the initial swap operation.
          c. Transfer the percpu counts in immortal ref to atomic
             counter of label percpu refcount.
          d. Kill immortal ref, for reinit on next iteration.


          Using this approach, we ensure that label ref users do not switch
          to atomic mode while there are active references on the label.
          However, this approach requires multiple percpu ref mode switches
          and adds high overhead and complexity to the scanning code.
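
As referenced in limitations 3 and 5 above, a deferrable, freezable work
item can drive the periodic scan. Below is a minimal sketch under assumed
names; the interval is a placeholder and the scan body is the loop
sketched earlier:

  #include <linux/workqueue.h>
  #include <linux/jiffies.h>

  #define AA_LABEL_RECLAIM_INTERVAL_MS 5000  /* assumed scan period */

  static void aa_label_reclaim_work_fn(struct work_struct *work);
  /* Deferrable: the re-arm timer does not wake an otherwise idle CPU. */
  static DECLARE_DEFERRABLE_WORK(aa_label_reclaim_work, aa_label_reclaim_work_fn);

  static void aa_label_reclaim_work_fn(struct work_struct *work)
  {
          aa_label_reclaim_scan();  /* bounded scan, as sketched earlier */
          /* Freezable workqueue: the worker is frozen during suspend,
           * so a long-running scan does not delay system suspend. */
          queue_delayed_work(system_freezable_wq, &aa_label_reclaim_work,
                             msecs_to_jiffies(AA_LABEL_RECLAIM_INTERVAL_MS));
  }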

Extended/Future Work
====================

1. Look for ways to fix the limitations, as described in the "Limitations"
   section.

2. Generalize the approach to a percpu rcuref, which is used for contexts
   where the release path uses an RCU grace period for release operations.
   Patch 7 creates an initial prototype for this.

3. Explore hazard pointers for scalable refcounting of labels.

Highly appreciate any feedback/suggestions on the design approach.

The patches of this patchset introduce the following changes:

1.      Documentation of Apparmor Refcount management.

2.      Switch labels to percpu refcount in atomic mode.

        Use percpu refcount for apparmor labels. Initial patch to init
        the percpu ref in atomic mode, to evaluate the potential impact
        of percpu ref on top of the kref-based implementation.

3.      Switch unconfined namespaces refcount to percpu mode.

        Switch unconfined ns labels to percpu mode, and kill the
        initial refcount from namespace destroy path.

4.      Add infrastructure to reclaim percpu labels.

        Add a label reclaim infrastructure for labels which are
        in percpu mode, for managing their initial refcount.

5.      Switch intree labels to percpu mode.

        Use the label reclaim infrastructure to manage in-tree labels.

6.      Initial prototype for optimizing ref switch.

        Prototype for reducing the time window when a label
        scan switches the label ref to atomic mode.

7.      percpu-rcuref: Add basic infrastructure.

        Prototype for percpu refcounts for objects which protect
        their reclaim using an RCU grace period.

8.      Switch labels to percpu rcurefcount in unmanaged mode.

        Use percpu rcuref for labels. Start with unmanaged/atomic
        mode.

9.      Switch unconfined and in tree labels to managed ref mode.

        Use percpu mode with manager worker for unconfined and intree
        labels.


------------------------------------------------------------------------

 b/Documentation/admin-guide/LSM/ApparmorRefcount.rst |  351 ++++++++++++++++++++++++++++++++++++++++++++++++++
 b/Documentation/admin-guide/LSM/index.rst            |    1
 b/Documentation/admin-guide/kernel-parameters.txt    |    8 +
 b/include/linux/percpu-rcurefcount.h                 |  115 ++++++++++++++++
 b/include/linux/percpu-refcount.h                    |    2
 b/lib/Makefile                                       |    2
 b/lib/percpu-rcurefcount.c                           |  336 +++++++++++++++++++++++++++++++++++++++++++++++
 b/lib/percpu-refcount.c                              |   93 +++++++++++++
 b/security/apparmor/include/label.h                  |   16 +-
 b/security/apparmor/include/policy.h                 |    8 -
 b/security/apparmor/include/policy_ns.h              |   24 +++
 b/security/apparmor/label.c                          |   11 +
 b/security/apparmor/lsm.c                            |  145 ++++++++++++++++++++
 b/security/apparmor/policy_ns.c                      |    6
 include/linux/percpu-refcount.h                      |    2
 lib/percpu-refcount.c                                |   93 -------------
 security/apparmor/include/label.h                    |   17 +-
 security/apparmor/include/policy.h                   |   56 +++----
 security/apparmor/include/policy_ns.h                |   24 ---
 security/apparmor/label.c                            |   11 -
 security/apparmor/lsm.c                              |  325 ++++++++++++----------------------------------
 security/apparmor/policy_ns.c                        |    8 -
 22 files changed, 1237 insertions(+), 417 deletions(-)

base-commit: ab27740f7665

Comments

Neeraj Upadhyay Feb. 7, 2024, 4:40 a.m. UTC | #1
Gentle ping.
 
John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. Approach taken to fix the scalability issue is valid/correct?


Thanks
Neeraj

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
> [ full cover letter quoted; trimmed ]
John Johansen Feb. 9, 2024, 5:33 p.m. UTC | #2
On 2/6/24 20:40, Neeraj Upadhyay wrote:
> Gentle ping.
>   
> John,
> 
> Could you please confirm that:
> 
> a. The AppArmor refcount usage described in the RFC is correct?
> b. Approach taken to fix the scalability issue is valid/correct?
> 

Hi Neeraj,

I know your patchset has been waiting on review for a long time.
Unfortunately I have been very, very busy lately. I will try to
get to it this weekend, but I can't promise that I will be able
to get the review fully done.

john


> 
> On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
>> [ full cover letter quoted; trimmed ]
Mateusz Guzik March 2, 2024, 10:23 a.m. UTC | #3
On 2/9/24, John Johansen <john.johansen@canonical.com> wrote:
> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>> Gentle ping.
>>
>> John,
>>
>> Could you please confirm that:
>>
>> a. The AppArmor refcount usage described in the RFC is correct?
>> b. Approach taken to fix the scalability issue is valid/correct?
>>
>
> Hi Neeraj,
>
> I know your patchset has been waiting on review for a long time.
> Unfortunately I have been very, very busy lately. I will try to
> get to it this weekend, but I can't promise that I will be able
> to get the review fully done.
>

Gentle prod.

Any chances of this getting reviewed in the foreseeable future? Would
be a real bummer if the patchset fell through the cracks.
John Johansen March 8, 2024, 8:09 p.m. UTC | #4
On 3/2/24 02:23, Mateusz Guzik wrote:
> [...]
> 
> Gentle prod.
> 
> Any chances of this getting reviewed in the foreseeable future? Would
> be a real bummer if the patchset fell through the cracks.
> 

yes, sorry I have been unavailable for the last couple of weeks. I am
now back; I have a rather large backlog to try catching up on, but this
has an entry on the list.
Mateusz Guzik May 24, 2024, 9:10 p.m. UTC | #5
On Fri, Mar 8, 2024 at 9:09 PM John Johansen
<john.johansen@canonical.com> wrote:
> [...]
>
> yes, sorry I have been unavailable for the last couple of weeks. I am
> now back; I have a rather large backlog to try catching up on, but this
> has an entry on the list.
>

So where do we stand here?
John Johansen May 24, 2024, 9:51 p.m. UTC | #6
On 5/24/24 14:10, Mateusz Guzik wrote:
> [...]
>
> So where do we stand here?
> 
sorry, I am still trying to dig out of my backlog. I will look at this
this weekend.
Mateusz Guzik May 28, 2024, 1:29 p.m. UTC | #7
On Fri, May 24, 2024 at 11:52 PM John Johansen
<john.johansen@canonical.com> wrote:
> [...]
>
> sorry, I am still trying to dig out of my backlog. I will look at this
> this weekend.
>

How was the weekend? ;)
John Johansen May 29, 2024, 12:37 a.m. UTC | #8
On 5/28/24 06:29, Mateusz Guzik wrote:
> [...]
>
> How was the weekend? ;)
> 

Let's say it was busy. Have I looked at this? Yes. I am still digesting it.
I don't have objections to moving towards percpu refcounts, but the overhead
of a percpu struct per label is a problem when we have thousands of labels
on the system. That is to say, this would have to be a config option. We
moved buffers from kmalloc to percpu to reduce memory overhead and
contention, then from percpu to a global pool because the percpu overhead
was too high for some machines, and then from a global pool to a hybrid
scheme because of global lock contention. I don't see a way of doing that
with the label, which means a config would be the next best thing.

Not part of your patch, but something to be considered is that the label tree
needs a rework: its locking needs to move to a read-side lockless scheme,
and the plan is to also make it use a linked list such that new labels are
always queued at the end, allowing dynamically created labels to be lazily
added to the tree.

I see the use of the kworker as problematic as well, especially if we are
talking about using kconfig to switch reference counting modes. I am
futzing with some ideas on how to deal with this.

Like I said, I am still digesting.
Mateusz Guzik May 29, 2024, 9:38 a.m. UTC | #9
On Wed, May 29, 2024 at 2:37 AM John Johansen
<john.johansen@canonical.com> wrote:
> I don't have objections to moving towards percpu refcounts, but the overhead
> of a percpu struct per label is a problem when we have thousands of labels
> on the system. That is to say, this would have to be a config option. We
> moved buffers from kmalloc to percpu to reduce memory overhead and
> contention, then from percpu to a global pool because the percpu overhead
> was too high for some machines, and then from a global pool to a hybrid
> scheme because of global lock contention. I don't see a way of doing that
> with the label, which means a config would be the next best thing.
>

There was a patchset somewhere which adds counters starting as atomic
and automagically converting themselves to per-cpu if there is enough
load applied to them. The general point being, it is plausible this may
autotune itself.

Another option would be a boot-time tunable.
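
A toy sketch of the autotuning idea, purely illustrative (this is not the
patchset I mean, the names are made up, and teardown/summing is omitted --
that is the hard part a real implementation has to solve):

    struct dyn_ref {
            atomic_long_t           count;         /* atomic-mode count */
            unsigned long __percpu  *percpu_count; /* NULL until upgraded */
    };

    #define DYN_REF_UPGRADE_AT      1024

    static void dyn_ref_get(struct dyn_ref *r)
    {
            unsigned long __percpu *pc = READ_ONCE(r->percpu_count);

            if (pc) {
                    this_cpu_inc(*pc);
                    return;
            }

            if (atomic_long_inc_return(&r->count) == DYN_REF_UPGRADE_AT) {
                    pc = alloc_percpu(unsigned long);
                    /* lose the upgrade race gracefully */
                    if (pc && cmpxchg(&r->percpu_count, NULL, pc) != NULL)
                            free_percpu(pc);
            }
    }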

> Not part of your patch, but something to be considered is that the label tree
> needs a rework: its locking needs to move to a read-side lockless scheme,
> and the plan is to also make it use a linked list such that new labels are
> always queued at the end, allowing dynamically created labels to be lazily
> added to the tree.
>

It's not *my* patchset. ;)

> I see the use of the kworker as problematic as well, especially if we are
> talking about using kconfig to switch reference counting modes. I am
> futzing with some ideas on how to deal with this.
>

Thanks for the update. Hopefully this is going to get sorted out in
the foreseeable future.
Neeraj Upadhyay May 30, 2024, 4:19 a.m. UTC | #10
Hi John,

Thanks for taking a look at the series!

On 5/29/2024 6:07 AM, John Johansen wrote:
> [...]
>
> Let's say it was busy. Have I looked at this? Yes. I am still digesting it.
> I don't have objections to moving towards percpu refcounts, but the overhead
> of a percpu struct per label is a problem when we have thousands of labels
> on the system. That is to say, this would have to be a config option. We
> moved buffers from kmalloc to percpu to reduce memory overhead and
> contention, then from percpu to a global pool because the percpu overhead
> was too high for some machines, and then from a global pool to a hybrid
> scheme because of global lock contention. I don't see a way of doing that
> with the label, which means a config would be the next best thing.
> 

For the buffers, roughly what was the percpu overhead? For thousands of
labels, I think the extra memory overhead would be in the range of a few
MBs (it needs to be profiled though). Would this extra label overhead be
considered high on the machines where the percpu buffer overhead was
considered high?

Please correct me if I am wrong here: you are proposing that we use a
kconfig to select either a 'struct percpu_ref' or a 'struct kref' (using
a union, maybe) inside 'struct aa_label', and update the refcount
operations accordingly? If yes, I will work on a patch with this
kconfig-based selection of refcounting mode to see how it pans out.
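
Something along these lines is what I have in mind; a rough sketch, where
CONFIG_SECURITY_APPARMOR_PERCPU_REF and the helper name are placeholders
I made up for illustration:

    #include <linux/kref.h>
    #include <linux/percpu-refcount.h>

    struct aa_label {
    #ifdef CONFIG_SECURITY_APPARMOR_PERCPU_REF
            struct percpu_ref count;
    #else
            struct kref count;
    #endif
            /* ... rest of the label ... */
    };

    static inline void aa_label_ref_get(struct aa_label *l)
    {
    #ifdef CONFIG_SECURITY_APPARMOR_PERCPU_REF
            percpu_ref_get(&l->count);
    #else
            kref_get(&l->count);
    #endif
    }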

@Mateusz can you share the dynamic switching counter mode patch series please?

In addition, for the long term, there is ongoing work (by Paul, Boqun and
myself) on implementing hazard pointers as a scalable refcounting scheme [1]
in the kernel, which would not have the memory usage overhead of percpu
refcount. At this point the API design/implementation is in an early
prototype stage.


[1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing

> Not part of your patch, but something to be considered is that the label tree
> needs a rework: its locking needs to move to a read-side lockless scheme,
> and the plan is to also make it use a linked list such that new labels are
> always queued at the end, allowing dynamically created labels to be lazily
> added to the tree.
> 

Would the read side be RCU read-lock protected in this scheme? Would the
linked list store the dynamically created compound labels? What is the
advantage of this lazy addition to the tree: do we optimize label search
and addition/deletion for dynamic labels? Is the lazy addition to the tree
done when a label find operation on the list succeeds?

> I see the use of the kworker as problematic as well, especially if we are
> talking about using kconfig to switch reference counting modes. I am
> futzing with some ideas on how to deal with this.
> 

Could we disable queuing of the label reclaim work for the non-percpu case?

> Like I said, I am still digesting.
> 

Thank you!


Thanks
Neeraj
John Johansen May 30, 2024, 5:59 a.m. UTC | #11
On 5/29/24 21:19, Neeraj Upadhyay wrote:
> Hi John,
> 
> Thanks for taking a look at the series!
> 
> On 5/29/2024 6:07 AM, John Johansen wrote:
>> [...]
>>
>> Let's say it was busy. Have I looked at this? Yes. I am still digesting it.
>> I don't have objections to moving towards percpu refcounts, but the overhead
>> of a percpu struct per label is a problem when we have thousands of labels
>> on the system. That is to say, this would have to be a config option. We
>> moved buffers from kmalloc to percpu to reduce memory overhead and
>> contention, then from percpu to a global pool because the percpu overhead
>> was too high for some machines, and then from a global pool to a hybrid
>> scheme because of global lock contention. I don't see a way of doing that
>> with the label, which means a config would be the next best thing.
>>
> 
> For the buffers, roughly what was the percpu overhead? For thousands of
> labels, I think the extra memory overhead would be in the range of a few
> MBs (it needs to be profiled though). Would this extra label overhead be
> considered high on the machines where the percpu buffer overhead was
> considered high?
> 

It of course varies. It was fixed at 2-8K per cpu core depending on the buffer
size. So on a 192 cpu machine we are talking a couple of MBs. Obviously more
on bigger machines. The problem here is that the percpu refcount, while
smaller per label, will cost more in situations with lots of cpus. Which is
fine if that is what it needs to be, but for other use cases tuning it to be
smaller would be nice.
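
(For rough scale: struct percpu_ref keeps one unsigned long per possible
CPU, so at 8 bytes x 192 CPUs that is about 1.5K of percpu memory per
label, or around 15MB for 10,000 labels, before any percpu allocator
overhead. Back-of-envelope only.)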


> Please correct me if I am wrong here: you are proposing that we use a
> kconfig to select either a 'struct percpu_ref' or a 'struct kref' (using
> a union, maybe) inside 'struct aa_label', and update the refcount
> operations accordingly? If yes, I will work on a patch with this
> kconfig-based selection of refcounting mode to see how it pans out.
> 
possibly, I am still mulling over how we want to approach this

> @Mateusz can you share the dynamic switching counter mode patch series please?
> 
yes I am interested in looking at this as well.

> In addition, for the long term, there is ongoing work (by Paul, Boqun and
> myself) on implementing hazard pointers as a scalable refcounting scheme [1]
> in the kernel, which would not have the memory usage overhead of percpu
> refcount. At this point the API design/implementation is in an early
> prototype stage.
> 
> 
> [1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing

okay, I will take a look

> 
>> Not part of your patch, but something to be considered is that the label tree
>> needs a rework: its locking needs to move to a read-side lockless scheme,
>> and the plan is to also make it use a linked list such that new labels are
>> always queued at the end, allowing dynamically created labels to be lazily
>> added to the tree.
>>
> 
> Would the read side be RCU read-lock protected in this scheme? Would the
> linked list store the dynamically created compound labels? What is the
> advantage of this lazy addition to the tree: do we optimize label search
> and addition/deletion for dynamic labels? Is the lazy addition to the tree
> done when a label find operation on the list succeeds?
> 
there are contexts where we are creating labels and do not want to wait on
some of the longer tree-walk profile updates/replacements. If a replacement
is ongoing, the idea is to just add the label to the end of a list and let
the process that is doing the tree update take the hit of inserting and
rebalancing the tree.


>> I see the use of the kworker as problematic as well, especially if we are
>> talking about using kconfig to switch reference counting modes. I am
>> futzing with some ideas on how to deal with this.
>>
> 
> Could we disable queuing of the label reclaim work for the non-percpu case?
> 
maybe, I am pondering ways we can deal with this. I have been pondering
whether we might be able to leverage a seqlock here, but I will also take
a look at hazard pointers.

>> Like I said, I am still digesting.
>>
> 
> Thank you!
> 
> 
> Thanks
> Neeraj
>
Mateusz Guzik May 30, 2024, 9:47 a.m. UTC | #12
On Thu, May 30, 2024 at 7:59 AM John Johansen
<john.johansen@canonical.com> wrote:
>
> [...]
>
> > @Mateusz can you share the dynamic switching counter mode patch series please?
> >
> yes I am interested in looking at this as well.
>

https://lore.kernel.org/lkml/1356573611-18590-26-git-send-email-koverstreet@google.com/

> [...]
>
> maybe, I am pondering ways we can deal with this. I have been pondering
> whether we might be able to leverage a seqlock here, but I will also take
> a look at hazard pointers.
>

Since there is some elaborate talk going about this, let me throw in
my own $0,03 -- I may happen to have a simple solution which will sort
it out and it boils down to storing local ref updates in task_struct.

Here is the context: creds are always refed and unrefed when creating
and destroying a file object. Should you have one instance of
credentials for a given user across different processes, they would
serialize on updating the ref. Linux mostly dodges the problem by
always creating a copy on fork, thus only serializing within threads
of a given process. Even then that induces avoidable overhead, if only
from a single-threaded standpoint -- that's 2 atomics slapped on every
open/close cycle.

$elsewhere I devised an idea where cred ref updates to creds matching
current->cred only modify a local counter. They get rolled up when
current->cred is changed. That is to say, there are 0 atomics or
modifications of memory seen by other cpus as long as the process does
not change creds, which almost never happens compared to how often
refing and unrefing is performed.

In struct cred, apart from regular refs, you would have a "user" counter
and a "distributed" counter. Switching user to > 0 grabs a normal ref on
the creds; the value of the "distributed" counter is arbitrary as long as
user > 0. Users going back to 0 means we can release the special ref
held for that purpose.

I was considering implementing this for Linux. In my original code all
cred handling is augmented like this, but for Linux one could add a
dedicated get_cred_localref or similar machinery.
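
Rough sketch of what I mean, with made-up names (cred_local is not an
existing task_struct field and get_cred_localref does not exist today):

    struct cred_local {
            const struct cred *cred;   /* creds the local balance is for */
            long               delta;  /* local get/put balance, no atomics */
    };

    static inline void get_cred_localref(const struct cred *c)
    {
            if (c == current->cred) {
                    current->cred_local.delta++;    /* plain increment */
                    return;
            }
            get_cred(c);                            /* slow path: atomic ref */
    }

    static inline void put_cred_localref(const struct cred *c)
    {
            if (c == current->cred) {
                    current->cred_local.delta--;
                    return;
            }
            put_cred(c);
    }

    /* On a cred switch (e.g. commit_creds()), fold .delta into the
     * shared refcount with a single atomic op. */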

Skimming through apparmor suggests the bit which does cause
performance problems can be sorted out in the same manner.

Maybe I'll hack it up as a demo just for apparmor.

This would come with some extra space usage in task_struct, which on
the surface may sound like a non-starter. However, should you take a
look at the struct with pahole, you will find it is riddled with
padding. If I wanted to, I could add all the fields I need to the struct
and not grow it on LP64.
John Johansen May 31, 2024, 12:17 a.m. UTC | #13
On 5/30/24 02:47, Mateusz Guzik wrote:
> [...]
> 
> Since there is some elaborate talk going about this, let me throw in
> my own $0,03 -- I may happen to have a simple solution which will sort
> it out and it boils down to storing local ref updates in task_struct.
>
> Here is the context: creds are always refed and unrefed when creating
> and destroying a file object. Should you have one instance of
> credentials for a given user across different processes, they would
> serialize on updating the ref. Linux mostly dodges the problem by
> always creating a copy on fork, thus only serializing within threads
> of a given process. Even then that induces avoidable overhead, if only
> from a single-threaded standpoint -- that's 2 atomics slapped on every
> open/close cycle.
> 
so the apparmor label can and will be updated beyond the open/close cycle.
Yes, they are used in the cred as well, but for more than that. The
apparmor label on a file can be updated by other tasks, for various
reasons.

> $elsewhere I devised an idea where cred ref updates to creds matching
> current->cred only modify a local counter. They get rolled up when
> current->cred is changed. That is to say, there are 0 atomics or
> modifications of memory seen by other cpus as long as the process does
> not change creds, which almost never happens compared to how often
> refing and unrefing is performed.
> 
right, we do something like this for the task cred, with a crit section
marked out by

label = begin_current_label_crit_section()

end_current_label_crit_section(label);

if everything works out, no reference counts are taken. The purpose
of the fns is to deal with the cases where, for one reason or another,
a refcount needs to be taken (generally because of live policy
replacement, and the task hasn't been able to update its cred yet).
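
The typical shape of the pattern in the lsm hooks is below; the hook body
is illustrative, not a real hook:

    static int apparmor_example_hook(void)
    {
            struct aa_label *label;
            int error = 0;

            label = begin_current_label_crit_section();
            if (!unconfined(label))
                    error = do_policy_check(label);  /* illustrative */
            end_current_label_crit_section(label);

            return error;
    }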

> In struct cred, apart from regular refs, you would have a "user" counter
> and a "distributed" counter. Switching user to > 0 grabs a normal ref on
> the creds; the value of the "distributed" counter is arbitrary as long as
> user > 0. Users going back to 0 means we can release the special ref
> held for that purpose.
> 
So I don't see how this will generally help for labels which exist
on many different objects.

> I was considering implementing this for Linux. In my original code all
> cred handling is augmented like this, but for Linux one could add a
> dedicated get_cred_localref or similar machinery.
> 

sure, but I am not sure it's needed. The rule for task creds is that only
the task can update its cred. The task can look at its cred and do most
things without having to take a count. Most cred refs should just be
taken for objects.

> Skimming through apparmor suggests the bit which does cause
> performance problems can be sorted out in the same manner.
> 
I don't see it. The file cred is very much updated live, async to
the task cred. And while currently it always starts as the task
cred, that won't even be true much longer.

> Maybe I'll hack it up as a demo just for apparmor.
> 
> This would come with some extra space usage in task_struct, which on
> the surface may sound like a non-starter. However, should you take a
> look at the struct with pahole, you will find it is riddled with
> padding. If I wanted to, I could add all the fields I need to the struct
> and not grow it on LP64.
>