From patchwork Wed Jun 14 17:37:35 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sean Christopherson <sean.j.christopherson@intel.com>
X-Patchwork-Id: 9787119
From: Sean Christopherson <sean.j.christopherson@intel.com>
To: intel-sgx-kernel-dev@lists.01.org
Date: Wed, 14 Jun 2017 10:37:35 -0700
Message-Id: 
 <1497461858-20309-10-git-send-email-sean.j.christopherson@intel.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: 
 <1497461858-20309-1-git-send-email-sean.j.christopherson@intel.com>
References: 
 <1497461858-20309-1-git-send-email-sean.j.christopherson@intel.com>
Subject: [intel-sgx-kernel-dev] [RFC][PATCH 09/12] cgroup,
	intel_sgx: add SGX EPC cgroup controller

Implement a cgroup controller, sgx_epc, which regulates distribution of
SGX EPC memory.  EPC memory is independent from normal system memory;
it is managed by the SGX subsystem and is not accounted by the memory
controller.  This approach is necessary as EPC memory must be reserved
at boot, i.e. memory cannot be converted between EPC and normal memory
while the system is running.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC
to their backing store (normal system memory allocated via shmem).
The SGX EPC subsystem is analogous to the memory subsystem, and the
SGX EPC controller is in turn analogous to the memory controller;
it implements limit and protection models for EPC memory.

"sgx_epc.high" and "sgx_epc.low" are the main mechanisms to control
EPC usage, while "sgx_epc.max" is a last line of defense mechanism.
"sgx_epc.high" is a best-effort limit of EPC usage.  "sgx_epc.low"
is a best-effort protection of EPC usage.  "sgx_epc.max" is a hard
limit of EPC usage.
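
As an illustration only, the interaction of the three knobs on a charge
can be modeled in userspace roughly as follows.  The struct, enum and
function names are hypothetical and do not exist in the patch; the
comparisons follow __sgx_epc_cgroup_try_charge() and
__sgx_epc_cgroup_is_low(), with all values in pages.  The model ignores
the root cgroup (which is always charged) and atomic allocations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one cgroup's EPC knobs, in pages. */
struct epc_knobs {
	unsigned long low;	/* sgx_epc.low:  best-effort protection */
	unsigned long high;	/* sgx_epc.high: best-effort limit */
	unsigned long max;	/* sgx_epc.max:  hard limit */
};

enum epc_verdict {
	EPC_CHARGE_OK,		/* charge fits, no reclaim needed */
	EPC_CHARGE_RECLAIM,	/* charge fits, usage reaches high: reclaim */
	EPC_CHARGE_BLOCKED,	/* charge would exceed max: swap first */
};

/* Classify a charge of @nr_pages against a cgroup currently at @cur. */
static enum epc_verdict epc_classify_charge(const struct epc_knobs *k,
					    unsigned long cur,
					    unsigned long nr_pages)
{
	assert(cur <= k->max);	/* the hard limit is never exceeded */

	if (nr_pages > k->max - cur)
		return EPC_CHARGE_BLOCKED;
	if (cur + nr_pages >= k->high)
		return EPC_CHARGE_RECLAIM;
	return EPC_CHARGE_OK;
}

/* A cgroup is protected only while it is below all three thresholds. */
static bool epc_is_protected(const struct epc_knobs *k, unsigned long cur)
{
	return cur < k->low && cur < k->high && cur < k->max;
}
```

E.g. with low=16, high=64 and max=128, charging 4 pages at a usage of
60 succeeds but triggers reclaim, while charging 16 pages at a usage of
120 cannot succeed until pages have been swapped out.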

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
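Reviewer note, not part of the commit: a minimal userspace sketch of the
sgx_epc_cgroup_is_low() walk, using the A/B/C example from its
kernel-doc.  struct epc_node and both helpers are hypothetical stand-ins
for the kernel types, and the sketch assumes @root is an ancestor of the
node being checked.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sgx_epc_cgroup; values are in pages. */
struct epc_node {
	const struct epc_node *parent;
	unsigned long cur, low, high, max;
};

/* Same comparison as __sgx_epc_cgroup_is_low() in the patch. */
static bool node_is_low(const struct epc_node *n)
{
	return n->cur < n->low && n->cur < n->high && n->cur < n->max;
}

/* True if @n and every ancestor below @root are low; @root is skipped. */
static bool subtree_is_low(const struct epc_node *n,
			   const struct epc_node *root)
{
	if (n == root)
		return false;

	for (; n != root; n = n->parent) {
		assert(n != NULL);	/* @root must be an ancestor of @n */
		if (!node_is_low(n))
			return false;
	}
	return true;
}
```

With A as @root, B (10 pages, low of 20) counts as protected and C
(90 pages, low of 20) does not, so reclaim is steered towards C.  If the
walk started from a root above A, A itself would be checked, and because
A has no low protection configured, B would no longer be considered low.
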
 drivers/platform/x86/intel_sgx/Makefile         |   2 +
 drivers/platform/x86/intel_sgx/sgx_epc_cgroup.c | 759 ++++++++++++++++++++++++
 drivers/platform/x86/intel_sgx/sgx_epc_cgroup.h |  81 +++
 include/linux/cgroup_subsys.h                   |   4 +
 init/Kconfig                                    |  12 +
 5 files changed, 858 insertions(+)
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_epc_cgroup.c
 create mode 100644 drivers/platform/x86/intel_sgx/sgx_epc_cgroup.h
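
Reviewer note, not part of the commit: the reclaim sizing used by
sgx_epc_cgroup_swap_pages() and sgx_epc_cgroup_reclaim_work_func() can
be sketched in userspace as below.  The function and macro names are
hypothetical; the constants and arithmetic are taken from the patch.

```c
#include <assert.h>

#define SWAP_MIN_PAGES	16UL	/* SGX_EPC_SWAP_MIN_PAGES in the patch */
#define SWAP_MAX_PAGES	64UL	/* SGX_EPC_SWAP_MAX_PAGES in the patch */

/* Clamp a reclaim request to [SWAP_MIN_PAGES, SWAP_MAX_PAGES]. */
static unsigned long swap_batch(unsigned long nr_pages)
{
	assert(SWAP_MIN_PAGES <= SWAP_MAX_PAGES);

	if (nr_pages < SWAP_MIN_PAGES)
		return SWAP_MIN_PAGES;
	if (nr_pages > SWAP_MAX_PAGES)
		return SWAP_MAX_PAGES;
	return nr_pages;
}

/* Usage level the deferred worker reclaims down to for a hard limit. */
static unsigned long worker_target(unsigned long limit)
{
	unsigned long target = limit;

	if (target)
		target--;			/* room for one faulting page */
	if (target > SWAP_MAX_PAGES)
		target -= SWAP_MIN_PAGES / 2;	/* start swapping early */
	return target;
}
```

A request for 3 pages is therefore rounded up to 16 and a request for
1000 pages is capped at 64 per call.  For a limit of 1024 pages the
worker keeps reclaiming until usage drops to 1015 pages, whereas small
limits (65 pages or fewer) only get the one-page adjustment.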

diff --git a/drivers/platform/x86/intel_sgx/Makefile b/drivers/platform/x86/intel_sgx/Makefile
index b87f4e1..f7bcb9d 100644
--- a/drivers/platform/x86/intel_sgx/Makefile
+++ b/drivers/platform/x86/intel_sgx/Makefile
@@ -10,3 +10,5 @@ intel_sgx-$(CONFIG_INTEL_SGX) += \
 	sgx_page_cache.o \
 	sgx_util.o \
 	sgx_vma.o \
+
+obj-$(CONFIG_CGROUP_SGX_EPC) +=	sgx_epc_cgroup.o
diff --git a/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.c b/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.c
new file mode 100644
index 0000000..273555c
--- /dev/null
+++ b/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.c
@@ -0,0 +1,759 @@
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * Sean Christopherson <sean.j.christopherson@intel.com>
+ */
+
+#include "sgx_epc_cgroup.h"
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#define SGX_EPC_SWAP_MIN_PAGES		16
+#define SGX_EPC_SWAP_MAX_PAGES		64
+#define SGX_EPC_SWAP_FAIL_TRIGGER	5
+
+static struct sgx_epc_cgroup *root_epc_cgroup __read_mostly;
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(sgx_epc_cgrp_subsys);
+}
+
+static struct sgx_epc_cgroup *sgx_epc_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct sgx_epc_cgroup, css);
+}
+
+static struct sgx_epc_cgroup *sgx_epc_cgroup_from_task(struct task_struct *task)
+{
+	if (unlikely(!task))
+		return NULL;
+	return sgx_epc_cgroup_from_css(task_css(task, sgx_epc_cgrp_id));
+}
+
+static struct sgx_epc_cgroup *sgx_epc_cgroup_from_mm(struct mm_struct *mm)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	rcu_read_lock();
+	do {
+		epc_cg = sgx_epc_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!epc_cg))
+			epc_cg = root_epc_cgroup;
+	} while (!css_tryget_online(&epc_cg->css));
+	rcu_read_unlock();
+
+	return epc_cg;
+}
+
+static struct sgx_epc_cgroup *parent_epc_cgroup(struct sgx_epc_cgroup *epc_cg)
+{
+	return sgx_epc_cgroup_from_css(epc_cg->css.parent);
+}
+
+/**
+ * sgx_epc_cgroup_iter - iterate over the EPC cgroup hierarchy
+ * @prev:	previously returned epc_cg, NULL on first invocation
+ * @root:	hierarchy root
+ * @reclaim:	cookie for shared reclaim walks, NULL for full walks
+ *
+ * Returns references to children of the hierarchy below @root, or
+ * @root itself, or %NULL after a full round-trip.
+ *
+ * Caller must pass the return value in @prev on subsequent invocations
+ * for reference counting, or use sgx_epc_cgroup_iter_break() to cancel
+ * a hierarchy walk before the round-trip is complete.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_iter(struct sgx_epc_cgroup *prev,
+					   struct sgx_epc_cgroup *root,
+					   struct sgx_epc_reclaim *reclaim)
+{
+	struct cgroup_subsys_state *css = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	struct sgx_epc_cgroup *pos = NULL;
+	bool inc_epoch = false;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	if (!root)
+		root = root_epc_cgroup;
+
+	if (prev && !reclaim)
+		pos = prev;
+
+	rcu_read_lock();
+
+start:
+	if (reclaim) {
+		/*
+		 * Abort the walk if a reclaimer working from the same root has
+		 * started a new walk after this reclaimer has already scanned
+		 * at least one cgroup.
+		 */
+		if (prev && reclaim->epoch != root->epoch)
+			goto out;
+
+		while (1) {
+			pos = READ_ONCE(root->reclaim_iter);
+			if (!pos || css_tryget(&pos->css))
+				break;
+
+			/*
+			 * The css is dying, clear the reclaim_iter immediately
+			 * instead of waiting for ->css_released to be called.
+			 * Busy waiting serves no purpose and attempting to wait
+			 * for ->css_released may actually block it from being
+			 * called.
+			 */
+			(void)cmpxchg(&root->reclaim_iter, pos, NULL);
+		}
+	}
+
+	if (pos)
+		css = &pos->css;
+
+	while (!epc_cg) {
+		css = css_next_descendant_pre(css, &root->css);
+		if (!css) {
+			/*
+			 * Increment the epoch as we've reached the end of the
+			 * tree and the next call to css_next_descendant_pre
+			 * will restart at root.  Do not update root->epoch
+			 * directly as we should only do so if we update the
+			 * reclaim_iter, i.e. a different thread may win the
+			 * race and update the epoch for us.
+			 */
+			inc_epoch = true;
+
+			/*
+			 * Reclaimers share the hierarchy walk, and a new one
+			 * might jump in at the end of the hierarchy.  Restart
+			 * at root so that we don't return NULL on a thread's
+			 * initial call.
+			 */
+			if (!prev)
+				continue;
+			break;
+		}
+
+		/*
+		 * Verify the css and acquire a reference.  Don't take an
+		 * extra reference to root as it's either the global root
+		 * or is provided by the caller and so is guaranteed to be
+		 * alive.  Keep walking if this css is dying.
+		 */
+		if (css != &root->css && !css_tryget(css))
+			continue;
+
+		epc_cg = sgx_epc_cgroup_from_css(css);
+	}
+
+	if (reclaim) {
+		/*
+		 * reclaim_iter could have already been updated by a competing
+		 * thread; check that the value hasn't changed since we read
+		 * it to avoid reclaiming from the same cgroup twice.  If the
+		 * value did change, put all of our references and restart the
+		 * entire process, for all intents and purposes we're making a
+		 * new call.
+		 */
+		if (cmpxchg(&root->reclaim_iter, pos, epc_cg) != pos) {
+			if (epc_cg && epc_cg != root)
+				css_put(&epc_cg->css);
+			if (pos)
+				css_put(&pos->css);
+			css = NULL;
+			epc_cg = NULL;
+			inc_epoch = false;
+			goto start;
+		}
+
+		if (inc_epoch)
+			root->epoch++;
+		if (!prev)
+			reclaim->epoch = root->epoch;
+
+		if (pos)
+			css_put(&pos->css);
+	}
+
+out:
+	rcu_read_unlock();
+	if (prev && prev != root)
+		css_put(&prev->css);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_iter_break - abort a hierarchy walk prematurely
+ * @prev:	last visited cgroup as returned by sgx_epc_cgroup_iter()
+ * @root:	hierarchy root
+ */
+void sgx_epc_cgroup_iter_break(struct sgx_epc_cgroup *prev,
+			       struct sgx_epc_cgroup *root)
+{
+	if (!root)
+		root = root_epc_cgroup;
+	if (prev && prev != root)
+		css_put(&prev->css);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
+ * @root:	root of the tree to check
+ *
+ * Returns true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released.
+ */
+static bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+	     epc_cg;
+	     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+		if (!list_empty(&epc_cg->lru.active_lru)) {
+			sgx_epc_cgroup_iter_break(epc_cg, root);
+			return false;
+		}
+	}
+	return true;
+}
+
+static unsigned long sgx_epc_cgroup_swap_pages(unsigned long nr_pages,
+					       unsigned int flags,
+					       struct sgx_epc_cgroup *epc_cg)
+{
+	/*
+	 * Ensure sgx_swap_pages is called with a minimum and maximum
+	 * number of pages.  Attempting to swap only a few pages will
+	 * often fail and is inefficient, while swapping a huge number
+	 * of pages can result in soft lockups due to holding various
+	 * locks for an extended duration.
+	 */
+	nr_pages = max(nr_pages, (unsigned long)SGX_EPC_SWAP_MIN_PAGES);
+	nr_pages = min(nr_pages, (unsigned long)SGX_EPC_SWAP_MAX_PAGES);
+	return sgx_swap_pages(nr_pages, flags, epc_cg);
+}
+
+static inline unsigned long sgx_epc_cgroup_swap_max(unsigned long nr_pages,
+						    unsigned int flags,
+						    struct sgx_epc_cgroup *epc_cg)
+{
+	return sgx_epc_cgroup_swap_pages(nr_pages, flags, epc_cg);
+}
+
+static inline unsigned long sgx_epc_cgroup_swap_high(unsigned long nr_pages,
+						     unsigned int flags,
+						     struct sgx_epc_cgroup *epc_cg)
+{
+	return sgx_epc_cgroup_swap_pages(nr_pages, flags, epc_cg);
+}
+
+static void sgx_epc_cgroup_reclaim_high(struct sgx_epc_cgroup *epc_cg)
+{
+	unsigned long cur, high;
+	unsigned int nr_fails;
+
+	for (; epc_cg; epc_cg = parent_epc_cgroup(epc_cg)) {
+		nr_fails = 0;
+
+		for (;;) {
+			high = READ_ONCE(epc_cg->high);
+
+			cur = page_counter_read(&epc_cg->pc);
+			if (cur <= high)
+				break;
+
+			if (!sgx_epc_cgroup_swap_high(cur - high, 0, epc_cg)) {
+				if (++nr_fails >= SGX_EPC_SWAP_FAIL_TRIGGER)
+					break;
+			}
+		}
+	}
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup, either when the cgroup is at/near its maximum capacity or
+ * when the cgroup is above its high threshold.
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	unsigned long cur, max;
+	unsigned int flags = 0, nr_fails = 0;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	for (;;) {
+		max = READ_ONCE(epc_cg->pc.limit);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Force the cgroup to swap at least once if it's operating near
+		 * its max limit by adjusting max down by half the min swap size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge when
+		 * it cannot directly swap due to an atomic allocation, e.g. the
+		 * allocation is being done in a fault handler.  Waiting to swap
+		 * until the cgroup is actually at its limit is less performant
+		 * as it means the faulting program is effectively blocked until
+		 * a worker makes its way through the global work queue.  Don't
+		 * force the swap if the limit is extremely low as doing so will
+		 * probably do more harm than good.
+		 */
+		if (max > SGX_EPC_SWAP_MAX_PAGES)
+			max -= (SGX_EPC_SWAP_MIN_PAGES/2);
+
+		cur = page_counter_read(&epc_cg->pc);
+		if (cur <= max)
+			break;
+
+		if (!sgx_epc_cgroup_swap_max(cur - max, flags, epc_cg)) {
+			if (++nr_fails < SGX_EPC_SWAP_FAIL_TRIGGER)
+				continue;
+
+			flags |= SGX_SWAP_IGNORE_LRU;
+
+			if (sgx_epc_cgroup_lru_empty(epc_cg))
+				break;
+		}
+	}
+
+	sgx_epc_cgroup_reclaim_high(epc_cg);
+}
+
+static inline int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+					      unsigned int alloc_flags,
+					      unsigned long nr_pages)
+{
+	unsigned long cur, max, over;
+	unsigned int flags = 0, nr_fails = 0;
+	struct page_counter *fail;
+	struct sgx_epc_cgroup *fail_cg;
+
+	if (epc_cg == root_epc_cgroup) {
+		page_counter_charge(&epc_cg->pc, nr_pages);
+		return 0;
+	}
+
+	for (;;) {
+		if (page_counter_try_charge(&epc_cg->pc, nr_pages, &fail))
+			break;
+
+		fail_cg = container_of(fail, struct sgx_epc_cgroup, pc);
+		max = READ_ONCE(fail_cg->pc.limit);
+		if (nr_pages > max)
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (alloc_flags & SGX_ALLOC_ATOMIC) {
+			schedule_work(&fail_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		cur = page_counter_read(&fail_cg->pc);
+		over = ((cur + nr_pages) > max) ?
+			    (cur + nr_pages) - max : SGX_EPC_SWAP_MIN_PAGES;
+
+		if (!sgx_epc_cgroup_swap_max(over, flags, fail_cg)) {
+			if (++nr_fails < SGX_EPC_SWAP_FAIL_TRIGGER)
+				continue;
+
+			flags |= SGX_SWAP_IGNORE_LRU;
+
+			if (sgx_epc_cgroup_lru_empty(fail_cg))
+				schedule();
+		}
+	}
+
+	css_get_many(&epc_cg->css, nr_pages);
+
+	for (; epc_cg; epc_cg = parent_epc_cgroup(epc_cg)) {
+		if (page_counter_read(&epc_cg->pc) >= epc_cg->high) {
+			if (alloc_flags & SGX_ALLOC_ATOMIC)
+				schedule_work(&epc_cg->reclaim_work);
+			else
+				sgx_epc_cgroup_reclaim_high(epc_cg);
+			break;
+		}
+	}
+	return 0;
+}
+
+
+/**
+ * sgx_epc_cgroup_try_charge - hierarchically try to charge the EPC count
+ * @mm:			the mm_struct of the process to charge
+ * @alloc_flags:	allocation flags specified by the original caller
+ * @nr_pages:		the number of epc pages to charge
+ * @epc_cg_ptr:		out parameter for the charged EPC cgroup
+ *
+ * Returns 0 on success, -ENOMEM, -ERESTARTSYS or -EBUSY on failure.
+ */
+int sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+			      unsigned int alloc_flags,
+			      unsigned long nr_pages,
+			      struct sgx_epc_cgroup **epc_cg_ptr)
+{
+	int ret;
+	struct sgx_epc_cgroup *epc_cg;
+
+	*epc_cg_ptr = NULL;
+
+	if (sgx_epc_cgroup_disabled())
+		return 0;
+
+	epc_cg = sgx_epc_cgroup_from_mm(mm);
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, alloc_flags, nr_pages);
+	css_put(&epc_cg->css);
+
+	if (!ret)
+		*epc_cg_ptr = epc_cg;
+	return ret;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ * @nr_pages:	the number of pages to uncharge
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg,
+			     unsigned long nr_pages)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	if (!epc_cg)
+		return;
+
+	page_counter_uncharge(&epc_cg->pc, nr_pages);
+
+	if (epc_cg != root_epc_cgroup)
+		css_put_many(&epc_cg->css, nr_pages);
+}
+
+static inline bool __sgx_epc_cgroup_is_low(struct sgx_epc_cgroup *epc_cg)
+{
+	unsigned long cur = page_counter_read(&epc_cg->pc);
+	return (cur < epc_cg->low &&
+		cur < epc_cg->high &&
+		cur < epc_cg->pc.limit);
+}
+
+/**
+ * sgx_epc_cgroup_is_low - check if EPC consumption is below the normal range
+ * @epc_cg:	the EPC cgroup to check
+ * @root:	the top ancestor of the sub-tree being checked
+ *
+ * Returns %true if EPC consumption of @epc_cg, and that of all
+ * ancestors up to (but not including) @root, is below the normal range.
+ *
+ * @root is exclusive; it is never low when looked at directly and isn't
+ * checked when traversing the hierarchy.
+ *
+ * Excluding @root enables using sgx_epc.low to prioritize EPC usage
+ * between cgroups within a subtree of the hierarchy that is limited
+ * by sgx_epc.high or sgx_epc.max.
+ *
+ * For example, given cgroup A with children B and C:
+ *
+ *    A
+ *   / \
+ *  B   C
+ *
+ * and
+ *
+ *  1. A/sgx_epc.current > A/sgx_epc.high
+ *  2. A/B/sgx_epc.current < A/B/sgx_epc.low
+ *  3. A/C/sgx_epc.current >= A/C/sgx_epc.low
+ *
+ * As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
+ * should reclaim from 'C' until 'A' is no longer high or until we can
+ * no longer reclaim from 'C'.  If 'A', i.e. @root, isn't excluded
+ * when reclaiming from 'A', then 'B' will not be considered low and we
+ * will reclaim indiscriminately from both 'B' and 'C'.
+ */
+bool sgx_epc_cgroup_is_low(struct sgx_epc_cgroup *epc_cg,
+			   struct sgx_epc_cgroup *root)
+{
+	if (sgx_epc_cgroup_disabled())
+		return false;
+
+	if (!root)
+		root = root_epc_cgroup;
+	if (epc_cg == root)
+		return false;
+
+	for (; epc_cg != root; epc_cg = parent_epc_cgroup(epc_cg)) {
+		if (!__sgx_epc_cgroup_is_low(epc_cg))
+			return false;
+	}
+
+	return true;
+}
+
+/**
+ * sgx_epc_cgroup_all_in_use_are_low - check if all cgroups in a tree are low
+ * @root:	the root EPC cgroup of the hierarchy to check
+ *
+ * Returns true if all cgroups in a hierarchy are either low or
+ * do not have any pages on their LRU.
+ */
+bool sgx_epc_cgroup_all_in_use_are_low(struct sgx_epc_cgroup *root)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	if (sgx_epc_cgroup_disabled())
+		return false;
+
+	for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+	     epc_cg;
+	     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+		if (!list_empty(&epc_cg->lru.active_lru) &&
+		    !__sgx_epc_cgroup_is_low(epc_cg)) {
+			sgx_epc_cgroup_iter_break(epc_cg, root);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static struct cgroup_subsys_state *
+sgx_epc_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct sgx_epc_cgroup *parent = sgx_epc_cgroup_from_css(parent_css);
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(struct sgx_epc_cgroup), GFP_KERNEL);
+	if (!epc_cg)
+		return ERR_PTR(-ENOMEM);
+
+	if (!parent)
+		root_epc_cgroup = epc_cg;
+
+	epc_cg->high = PAGE_COUNTER_MAX;
+	sgx_lru_init(&epc_cg->lru);
+	page_counter_init(&epc_cg->pc, parent ? &parent->pc : NULL);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+
+	return &epc_cg->css;
+}
+
+static void sgx_epc_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(css);
+	struct sgx_epc_cgroup *dead_cg = epc_cg;
+
+	while ((epc_cg = parent_epc_cgroup(epc_cg)))
+		cmpxchg(&epc_cg->reclaim_iter, dead_cg, NULL);
+}
+
+static void sgx_epc_cgroup_css_free(struct cgroup_subsys_state *css)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(css);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static u64 sgx_epc_current_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(css);
+	return (u64)page_counter_read(&epc_cg->pc) * PAGE_SIZE;
+}
+
+static int sgx_epc_low_show(struct seq_file *m, void *v)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(seq_css(m));
+	unsigned long low = READ_ONCE(epc_cg->low);
+
+	if (low == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t sgx_epc_low_write(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes, loff_t off)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(of_css(of));
+	unsigned long low;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &low);
+	if (err)
+		return err;
+
+	epc_cg->low = low;
+
+	return nbytes;
+}
+
+static int sgx_epc_high_show(struct seq_file *m, void *v)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(seq_css(m));
+	unsigned long high = READ_ONCE(epc_cg->high);
+
+	if (high == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t sgx_epc_high_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(of_css(of));
+	unsigned long cur, high;
+	unsigned int flags = 0, nr_fails = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	epc_cg->high = high;
+
+	for (;;) {
+		cur = page_counter_read(&epc_cg->pc);
+		if (cur <= high)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		if (!sgx_epc_cgroup_swap_high(cur - high, flags, epc_cg)) {
+			if (++nr_fails < SGX_EPC_SWAP_FAIL_TRIGGER)
+				continue;
+
+			flags |= SGX_SWAP_IGNORE_LRU;
+
+			if (sgx_epc_cgroup_lru_empty(epc_cg))
+				break;
+		}
+	}
+
+	return nbytes;
+}
+
+static int sgx_epc_max_show(struct seq_file *m, void *v)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(seq_css(m));
+	unsigned long max = READ_ONCE(epc_cg->pc.limit);
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+
+static ssize_t sgx_epc_max_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_css(of_css(of));
+	unsigned long cur, max;
+	unsigned int flags = 0, nr_fails = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &max);
+	if (err)
+		return err;
+
+	xchg(&epc_cg->pc.limit, max);
+
+	for (;;) {
+		cur = page_counter_read(&epc_cg->pc);
+		if (cur <= max)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		if (!sgx_epc_cgroup_swap_max(cur - max, flags, epc_cg)) {
+			if (++nr_fails < SGX_EPC_SWAP_FAIL_TRIGGER)
+				continue;
+
+			flags |= SGX_SWAP_IGNORE_LRU;
+
+			if (sgx_epc_cgroup_lru_empty(epc_cg))
+				schedule();
+		}
+	}
+
+	return nbytes;
+}
+
+static struct cftype sgx_epc_cgroup_files[] = {
+	{
+		.name = "current",
+		.read_u64 = sgx_epc_current_read,
+	},
+	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = sgx_epc_low_show,
+		.write = sgx_epc_low_write,
+	},
+	{
+		.name = "high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = sgx_epc_high_show,
+		.write = sgx_epc_high_write,
+	},
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = sgx_epc_max_show,
+		.write = sgx_epc_max_write,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys sgx_epc_cgrp_subsys = {
+	.css_alloc	= sgx_epc_cgroup_css_alloc,
+	.css_free	= sgx_epc_cgroup_css_free,
+	.css_released	= sgx_epc_cgroup_css_released,
+
+	.legacy_cftypes	= sgx_epc_cgroup_files,
+	.dfl_cftypes	= sgx_epc_cgroup_files,
+};
diff --git a/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.h b/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.h
new file mode 100644
index 0000000..40ba7fc
--- /dev/null
+++ b/drivers/platform/x86/intel_sgx/sgx_epc_cgroup.h
@@ -0,0 +1,81 @@
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * Sean Christopherson <sean.j.christopherson@intel.com>
+ */
+
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include "sgx.h"
+
+#include <linux/cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#ifdef CONFIG_CGROUP_SGX_EPC
+
+struct sgx_epc_cgroup {
+	struct cgroup_subsys_state	css;
+
+	struct page_counter	pc;
+	unsigned long		low;
+	unsigned long		high;
+
+	struct sgx_epc_lru	lru;
+	struct sgx_epc_cgroup	*reclaim_iter;
+	struct work_struct	reclaim_work;
+	unsigned int		epoch;
+};
+
+struct sgx_epc_reclaim {
+	unsigned int		epoch;
+};
+
+int sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+			      unsigned int flags,
+			      unsigned long nr_pages,
+			      struct sgx_epc_cgroup **epc_cg_ptr);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg,
+			     unsigned long nr_pages);
+struct sgx_epc_cgroup *sgx_epc_cgroup_iter(struct sgx_epc_cgroup *prev,
+					   struct sgx_epc_cgroup *root,
+					   struct sgx_epc_reclaim *reclaim);
+void sgx_epc_cgroup_iter_break(struct sgx_epc_cgroup *prev,
+			       struct sgx_epc_cgroup *root);
+bool sgx_epc_cgroup_is_low(struct sgx_epc_cgroup *epc_cg,
+			   struct sgx_epc_cgroup *root);
+bool sgx_epc_cgroup_all_in_use_are_low(struct sgx_epc_cgroup *root);
+
+#else
+
+struct sgx_epc_cgroup;
+
+static inline int sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+					    unsigned int flags,
+					    unsigned long nr_pages,
+					    struct sgx_epc_cgroup **epc_cg_ptr)
+{
+	*epc_cg_ptr = NULL;
+	return 0;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg,
+					   unsigned long nr_pages)
+{
+
+}
+
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index d0e597c..c1ebc91 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,6 +60,10 @@ SUBSYS(pids)
 SUBSYS(rdma)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_SGX_EPC)
+SUBSYS(sgx_epc)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 1d3475f..b6cb0aa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1204,6 +1204,18 @@ config CGROUP_BPF
 	  BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
 	  inet sockets.
 
+config CGROUP_SGX_EPC
+	bool "Enclave Page Cache (EPC) controller for Intel SGX"
+	depends on INTEL_SGX && MEMCG
+	select PAGE_COUNTER
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup.
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n