@@ -7,17 +7,203 @@
/* The root SGX EPC cgroup */
static struct sgx_cgroup sgx_cg_root;
+/*
+ * Return the next descendant in a preorder walk of the subtree rooted at
+ * @root, starting the walk from @cg. Return @root if no descendants are left
+ * in this walk; otherwise, return the next descendant with its refcount
+ * incremented.
+ */
+static struct sgx_cgroup *sgx_cgroup_next_descendant_pre(struct sgx_cgroup *root,
+ struct sgx_cgroup *cg)
+{
+ struct cgroup_subsys_state *next = &cg->cg->css;
+
+ rcu_read_lock();
+ for (;;) {
+ next = css_next_descendant_pre(next, &root->cg->css);
+ if (!next) {
+ next = &root->cg->css;
+ break;
+ }
+
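+ /* Skip a dying descendant whose refcount cannot be taken and keep walking */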
+ if (css_tryget(next))
+ break;
+ }
+ rcu_read_unlock();
+
+ return sgx_cgroup_from_misc_cg(css_misc(next));
+}
+
+/*
+ * For a given root, @root, if a given cgroup, @cg, is the next cgroup to
+ * reclaim pages from, i.e., referenced by @root->next_cg, then advance
+ * @root->next_cg to the next valid cgroup in a preorder walk, or to @root if
+ * no more descendants are left to walk.
+ *
+ * Called from sgx_cgroup_free() when @cg is about to be freed and thus can no
+ * longer be used as 'next_cg'.
+ */
+static inline void sgx_cgroup_next_skip(struct sgx_cgroup *root, struct sgx_cgroup *cg)
+{
+ struct sgx_cgroup *p;
+
+ spin_lock(&root->next_cg_lock);
+ p = root->next_cg;
+ spin_unlock(&root->next_cg_lock);
+
+ /* Already advanced by another thread, no need to update */
+ if (cg != p)
+ return;
+
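+ /*
+ * Find the replacement outside the lock, then recheck under the lock
+ * below in case another thread has already advanced next_cg.
+ */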
+ p = sgx_cgroup_next_descendant_pre(root, cg);
+
+ spin_lock(&root->next_cg_lock);
+ if (root->next_cg == cg)
+ root->next_cg = p;
+ spin_unlock(&root->next_cg_lock);
+
+ /* Decrement refcnt so cgroup pointed to by p can be released */
+ if (p != cg && p != root)
+ sgx_put_cg(p);
+}
+
+/*
+ * Return the cgroup currently referenced by @root->next_cg and advance
+ * @root->next_cg to the next descendant, or to @root. The returned cgroup has
+ * its refcount incremented if it is not @root, and the caller must release
+ * that reference.
+ */
+static inline struct sgx_cgroup *sgx_cgroup_next_get(struct sgx_cgroup *root)
+{
+ struct sgx_cgroup *p;
+
+ /*
+ * Acquire a reference on the to-be-returned cgroup and advance next_cg
+ * under the lock so the same cgroup is not returned to two threads.
+ */
+ spin_lock(&root->next_cg_lock);
+
+ p = root->next_cg;
+
+ /* Advance to the next descendant if the to-be-returned cgroup is dying */
+ if (p != root && !css_tryget(&p->cg->css))
+ p = sgx_cgroup_next_descendant_pre(root, p);
+
+ /* Advance next_cg */
+ root->next_cg = sgx_cgroup_next_descendant_pre(root, p);
+
+ /* Drop the ref so the new next_cg can be released by the cgroup subsystem */
+ if (root->next_cg != root)
+ sgx_put_cg(root->next_cg);
+
+ spin_unlock(&root->next_cg_lock);
+
+ /* p is root or refcnt incremented */
+ return p;
+}
+
/**
- * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: Root of the tree to check
*
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_cgroup *sgx_cg;
+ bool ret = true;
+
+ /*
+ * The caller must hold a reference to css_root.
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
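+ /* Skip dying cgroups whose refcount cannot be taken */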
+ if (!css_tryget(pos))
+ continue;
+
+ rcu_read_unlock();
+
+ sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&sgx_cg->lru.lock);
+ ret = list_empty(&sgx_cg->lru.reclaimable);
+ spin_unlock(&sgx_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/*
+ * Scan at least @nr_to_scan pages and attempt to reclaim them from the subtree of @root.
+ */
+static inline void sgx_cgroup_reclaim_pages(struct sgx_cgroup *root,
+ unsigned int nr_to_scan)
+{
+ struct sgx_cgroup *next_cg = NULL;
+ unsigned int cnt = 0;
+
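+ /* Round-robin over the subtree: sgx_cgroup_next_get() advances next_cg */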
+ while (!sgx_cgroup_lru_empty(root->cg) && cnt < nr_to_scan) {
+ next_cg = sgx_cgroup_next_get(root);
+ cnt += sgx_reclaim_pages(&next_cg->lru);
+ if (next_cg != root)
+ sgx_put_cg(next_cg);
+ }
+}
+
+static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
+{
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
+ return 0;
+
+ /* No reclaimable pages left in the cgroup subtree */
+ if (sgx_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
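+ /* Let a pending signal interrupt repeated charge attempts */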
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ return -EBUSY;
+}
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @sgx_cg: The EPC cgroup to be charged for the page.
+ * @reclaim: Whether or not synchronous EPC reclaim is allowed.
* Return:
* * %0 - If successfully charged.
* * -errno - for failures.
*/
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
- return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+ int ret;
+
+ for (;;) {
+ ret = __sgx_cgroup_try_charge(sgx_cg);
+
+ if (ret != -EBUSY)
+ goto out;
+
+ if (reclaim == SGX_NO_RECLAIM) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
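+ /* Over the limit but the subtree has reclaimable pages: reclaim and retry */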
+ sgx_cgroup_reclaim_pages(sgx_cg, 1);
+
+ cond_resched();
+ }
+
+out:
+ return ret;
}
/**
@@ -32,18 +218,34 @@ void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
static void sgx_cgroup_free(struct misc_cg *cg)
{
struct sgx_cgroup *sgx_cg;
+ struct misc_cg *p;
sgx_cg = sgx_cgroup_from_misc_cg(cg);
if (!sgx_cg)
return;
+ /*
+ * Notify ancestors not to reclaim from this dying cgroup.
+ * Don't start from this cgroup itself because at this point no references
+ * to this cgroup are held, i.e., all pages in this cgroup have been freed
+ * and its LRU is empty, so no reclamation is possible.
+ */
+ p = misc_cg_parent(cg);
+ while (p) {
+ sgx_cgroup_next_skip(sgx_cgroup_from_misc_cg(p), sgx_cg);
+ p = misc_cg_parent(p);
+ }
+
kfree(sgx_cg);
}
static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
{
+ sgx_lru_init(&sgx_cg->lru);
cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
sgx_cg->cg = cg;
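+ /* Per-cgroup reclamation in this subtree starts from the cgroup itself */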
+ sgx_cg->next_cg = sgx_cg;
+ spin_lock_init(&sgx_cg->next_cg_lock);
}
static int sgx_cgroup_alloc(struct misc_cg *cg)
@@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void)
static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
return 0;
}
@@ -38,6 +38,20 @@ static inline void __init sgx_cgroup_register(void) { }
struct sgx_cgroup {
struct misc_cg *cg;
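+ /* Reclaimable EPC pages tracked for this cgroup */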
+ struct sgx_epc_lru_list lru;
+ /*
+ * Pointer to the next cgroup to scan the next time per-cgroup
+ * reclamation is triggered. It does not hold a reference, so it does not
+ * prevent the misc cgroup subsystem from releasing and freeing the
+ * cgroup as needed, e.g., when the admin deletes the cgroup. When the
+ * cgroup it points to is being freed, sgx_cgroup_next_skip() is invoked
+ * to advance the pointer to the next accessible cgroup in a preorder
+ * walk of the subtree of the same root.
+ */
+ struct sgx_cgroup *next_cg;
+ /* Lock to protect concurrent access to @next_cg */
+ spinlock_t next_cg_lock;
};
static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -68,7 +82,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
put_misc_cg(sgx_cg->cg);
}
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
int __init sgx_cgroup_init(void);
void __init sgx_cgroup_register(void);
@@ -301,7 +301,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}
-/*
+/**
+ * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
+ * @lru: The LRU from which pages are reclaimed.
+ *
* Take a fixed number of pages from the head of a given LRU and reclaim them to
* the enclave's private shmem files. Skip the pages, which have been accessed
* since the last scan. Move those pages to the tail of the list so that the
@@ -313,8 +316,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* + EWB) but not sufficiently. Reclaiming one page at a time would also be
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
+ *
+ * Return: Number of pages for which reclamation was attempted.
*/
-static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -393,6 +398,8 @@ static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
sgx_free_epc_page(epc_page);
}
+
+ return cnt;
}
static bool sgx_should_reclaim_global(unsigned long watermark)
@@ -591,7 +598,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
int ret;
sgx_cg = sgx_get_current_cg();
- ret = sgx_cgroup_try_charge(sgx_cg);
+ ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
if (ret) {
sgx_put_cg(sgx_cg);
return ERR_PTR(ret);
@@ -619,6 +626,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
+ /*
+ * At this point, the usage within this cgroup is under its
+ * limit but there is no physical page left for allocation.
+ * Perform a global reclaim to get some pages released from any
+ * cgroup with reclaimable pages.
+ */
sgx_reclaim_pages_global();
cond_resched();
}
@@ -139,6 +139,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
void sgx_ipi_cb(void *info);