
[07/13] memcontrol: schedule throttling if we are congested

Message ID 20180529211724.4531-8-josef@toxicpanda.com (mailing list archive)
State New, archived

Commit Message

Josef Bacik May 29, 2018, 9:17 p.m. UTC
From: Tejun Heo <tj@kernel.org>

Memory allocations can induce swapping via kswapd or direct reclaim.  If
kswapd is doing the IO for us and we never enter direct reclaim, we may
never get scheduled for throttling.  So instead check whether our cgroup
is congested and, if so, schedule the throttling.  On the way back to
user space, the throttling code will only throttle if we actually
required it.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h | 13 +++++++++++++
 mm/huge_memory.c           |  6 +++---
 mm/memcontrol.c            | 24 ++++++++++++++++++++++++
 mm/memory.c                | 11 ++++++-----
 mm/shmem.c                 | 10 +++++-----
 5 files changed, 51 insertions(+), 13 deletions(-)
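
For readers following the series, here is roughly how the deferred
throttle referred to above fits together.  This is a sketch pieced from
the blkcg patches earlier in the series; the helper and field names are
approximations, not verbatim kernel code:

	/*
	 * blkcg_schedule_throttle(q, use_memdelay) only arms the throttle:
	 * it records the request_queue on the current task and requests a
	 * notify-resume.  Nothing sleeps at this point.
	 */
	current->throttle_queue = q;		/* queue to charge the delay to */
	current->use_memdelay = use_memdelay;	/* account the wait as memory delay */
	set_notify_resume(current);		/* revisit on return to user space */

	/*
	 * On the way back to user space, blkcg_maybe_throttle_current()
	 * looks up the blkcg behind current->throttle_queue and sleeps only
	 * if that cgroup has actually built up delay debt; otherwise it is a
	 * no-op, which is what "will only throttle if we actually required
	 * it" means above.
	 */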

Comments

Johannes Weiner May 30, 2018, 2:15 p.m. UTC | #1
On Tue, May 29, 2018 at 05:17:18PM -0400, Josef Bacik wrote:
> @@ -5458,6 +5458,30 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  	return ret;
>  }
>  
> +int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
> +			  bool compound)
> +{
> +	struct mem_cgroup *memcg;
> +	struct block_device *bdev;
> +	int ret;
> +
> +	ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
> +	memcg = *memcgp;
> +
> +	if (!(gfp_mask & __GFP_IO) || !memcg)
> +		return ret;
> +#if defined(CONFIG_BLOCK) && defined(CONFIG_SWAP)
> +	if (atomic_read(&memcg->css.cgroup->congestion_count) &&
> +	    has_usable_swap()) {
> +		map_swap_page(page, &bdev);

This doesn't work, unfortunately - or only works by accident.

It goes through page_private(), which is only valid for pages in the
swapcache. The newly allocated pages you call it against aren't in the
swapcache, but their page_private() is 0, which is incorrectly
interpreted as "first swap slot on the first swap device" - which
happens to make sense if you have only one swap device.
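
Roughly what that lookup boils down to (a paraphrase of the
map_swap_page() path around this kernel version, not verbatim code):

	swp_entry_t entry = { .val = page_private(page) };	/* 0 for a fresh page */

	/* swp_type(entry) == 0 selects swap_info[0], the first swap device */
	/* swp_offset(entry) == 0 is the first slot on that device */
	*bdev = swap_info[swp_type(entry)]->bdev;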

> +		blkcg_schedule_throttle(bdev_get_queue(bdev), true);

By the time we allocate, we simply cannot know which swap device the
page will end up on.  However, we know what's likely: swap_avail_heads
is sorted in the order in which we try to allocate swap slots; the first
device on there is where swap IO will go.  If we walk this list and
throttle on the first device that has built-up delay debt, we'll
throttle against the device that probably gets the current bulk of the
swap writes.

Also, if we have two swap devices with the same priority, swap
allocation will re-order the list for us automatically in order to do
round-robin loading of the devices. See get_swap_pages(). That should
work out nicely for throttling as well.

You can use page_to_nid() on the newly allocated page to index into
swap_avail_heads[].
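
A rough sketch of that walk, assuming the mm/swapfile.c internals of
this era (swap_avail_heads[], swap_avail_lock and the per-node
avail_lists[] plist are static to that file, so a helper like this
would have to live there).  The function name is made up for
illustration; this shows the idea, not the eventual patch:

	#if defined(CONFIG_BLOCK) && defined(CONFIG_SWAP)
	/* Hypothetical helper: throttle against the most likely swap device. */
	static void throttle_on_likely_swap_device(struct page *page)
	{
		struct swap_info_struct *si, *next;
		int nid = page_to_nid(page);

		spin_lock(&swap_avail_lock);
		plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
					  avail_lists[nid]) {
			/*
			 * The plist is kept in allocation order, so the first
			 * entry with a block device is where new swap IO is
			 * most likely headed.
			 */
			if (si->bdev) {
				blkcg_schedule_throttle(bdev_get_queue(si->bdev),
							true);
				break;
			}
		}
		spin_unlock(&swap_avail_lock);
	}
	#endif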

On an unrelated note, mem_cgroup_try_charge_delay() isn't the most
descriptive name.  Since it's not really page specific, we might want to
move the throttling part out of the charge function and do something
similar to a stand-alone balance_dirty_pages() function.

mem_cgroup_balance_anon_pages()?

mem_cgroup_throttle_swaprate()?

mem_cgroup_anon_throttle()?

mem_cgroup_anon_allocwait()?

Something like that.  I personally like balance_anon_pages the best;
not because it is the best name by itself, but because in the MM it
carries the notion of throttling the creation of IO liabilities to the
write rate, which is what we're doing here as well.

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..4d2e7f35f2dc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -290,6 +290,9 @@  bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
 			  bool compound);
+int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare, bool compound);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
@@ -745,6 +748,16 @@  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mem_cgroup_try_charge_delay(struct page *page,
+					      struct mm_struct *mm,
+					      gfp_t gfp_mask,
+					      struct mem_cgroup **memcgp,
+					      bool compound)
+{
+	*memcgp = NULL;
+	return 0;
+}
+
 static inline void mem_cgroup_commit_charge(struct page *page,
 					    struct mem_cgroup *memcg,
 					    bool lrucare, bool compound)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..9812ddad9961 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,7 +555,7 @@  static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
+	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg, true)) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1145,7 +1145,7 @@  static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma,
 					       vmf->address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_try_charge(pages[i], vma->vm_mm,
+			     mem_cgroup_try_charge_delay(pages[i], vma->vm_mm,
 				     GFP_KERNEL, &memcg, false))) {
 			if (pages[i])
 				put_page(pages[i]);
@@ -1315,7 +1315,7 @@  int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+	if (unlikely(mem_cgroup_try_charge_delay(new_page, vma->vm_mm,
 					huge_gfp, &memcg, true))) {
 		put_page(new_page);
 		split_huge_pmd(vma, vmf->pmd, vmf->address);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..81f61b66d27a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5458,6 +5458,30 @@  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 	return ret;
 }
 
+int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound)
+{
+	struct mem_cgroup *memcg;
+	struct block_device *bdev;
+	int ret;
+
+	ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
+	memcg = *memcgp;
+
+	if (!(gfp_mask & __GFP_IO) || !memcg)
+		return ret;
+#if defined(CONFIG_BLOCK) && defined(CONFIG_SWAP)
+	if (atomic_read(&memcg->css.cgroup->congestion_count) &&
+	    has_usable_swap()) {
+		map_swap_page(page, &bdev);
+
+		blkcg_schedule_throttle(bdev_get_queue(bdev), true);
+	}
+#endif
+	return ret;
+}
+
 /**
  * mem_cgroup_commit_charge - commit a page charge
  * @page: page to charge
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464e0fd2..d0eea6d33b18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2494,7 +2494,7 @@  static int wp_page_copy(struct vm_fault *vmf)
 		cow_user_page(new_page, old_page, vmf->address, vma);
 	}
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
+	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_new;
 
 	__SetPageUptodate(new_page);
@@ -2994,8 +2994,8 @@  int do_swap_page(struct vm_fault *vmf)
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
-				&memcg, false)) {
+	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
+					&memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -3156,7 +3156,8 @@  static int do_anonymous_page(struct vm_fault *vmf)
 	if (!page)
 		goto oom;
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg,
+					false))
 		goto oom_free_page;
 
 	/*
@@ -3652,7 +3653,7 @@  static int do_cow_fault(struct vm_fault *vmf)
 	if (!vmf->cow_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
+	if (mem_cgroup_try_charge_delay(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
 				&vmf->memcg, false)) {
 		put_page(vmf->cow_page);
 		return VM_FAULT_OOM;
diff --git a/mm/shmem.c b/mm/shmem.c
index 9d6c7e595415..a96af5690864 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1219,8 +1219,8 @@  int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
-			false);
+	error = mem_cgroup_try_charge_delay(page, current->mm, GFP_KERNEL,
+					    &memcg, false);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1697,7 +1697,7 @@  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 				goto failed;
 		}
 
-		error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
+		error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
 				false);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
@@ -1803,7 +1803,7 @@  alloc_nohuge:		page = shmem_alloc_and_acct_page(gfp, inode,
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 
-		error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
+		error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
 				PageTransHuge(page));
 		if (error)
 			goto unacct;
@@ -2276,7 +2276,7 @@  static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	__SetPageSwapBacked(page);
 	__SetPageUptodate(page);
 
-	ret = mem_cgroup_try_charge(page, dst_mm, gfp, &memcg, false);
+	ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg, false);
 	if (ret)
 		goto out_release;