Patchwork: Detecting page cache thrashing state

Submitter Johannes Weiner
Date Oct. 25, 2017, 5:54 p.m.
Message ID <20171025175424.GA14039@cmpxchg.org>
Permalink /patch/10027103/
State New

Comments

Johannes Weiner - Oct. 25, 2017, 5:54 p.m.
Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> I was able to rebase the patch on top of v4.9.26 (the latest version we
> currently support) and test it a bit.
> The overall idea definitely looks promising, although I have one
> question on usage.
> Will it be able to account for the time which processes spend handling
> major page faults (including fs and iowait time) of refaulting pages?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits
happen on a cache miss. If those are refaults, they'll be counted.
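
To make that concrete, the series brackets the wait itself with the
memdelay hooks. A rough sketch, not the literal call site:
memdelay_enter()/memdelay_leave() are the helpers from the series, and
is_page_refault() here is just a stand-in for however the refault is
detected after your rebase:

static void wait_on_page_locked_memdelay(struct page *page)
{
	unsigned long mdflags;
	bool refault = false;

	/*
	 * Only stalls on pages that were recently reclaimed count as
	 * thrashing; is_page_refault() is a placeholder for the
	 * series' actual refault test.
	 */
	if (!PageUptodate(page) && is_page_refault(page)) {
		memdelay_enter(&mdflags);
		refault = true;
	}

	wait_on_page_locked(page);

	if (refault)
		memdelay_leave(&mdflags);
}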

> We have one big application whose code occupies a large amount of the
> page cache. When the system, under heavy memory usage, reclaims some of
> it, the application starts thrashing constantly. Since its code is
> placed on squashfs, it spends the whole CPU time decompressing pages,
> and it seems the memdelay counters are not detecting this situation.
> Here are some counters to indicate this:
> Here are some counters to indicate this:
> 
> 19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00
>
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00
> 
> And as nobody is actively allocating memory anymore, it looks like the
> memdelay counters are not being actively incremented:
> 
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? It only counts thrashing time of refaults it can actively
detect.
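
If you want to watch those counters continuously, a trivial sampler
over /proc/vmstat is enough. A minimal userspace sketch (it assumes
only the workingset_* names shown in your vmstat, and does no real
error handling):

/* wsdelta.c - print per-second deltas of the workingset_* counters.
 * Build: cc -o wsdelta wsdelta.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *keys[] = {
	"workingset_refault",
	"workingset_activate",
	"workingset_restore",
	"workingset_nodereclaim",
};
#define NKEYS (sizeof(keys) / sizeof(keys[0]))

/* Read the current values of the tracked counters from /proc/vmstat */
static void snap(unsigned long long v[])
{
	char line[256];
	unsigned int i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		for (i = 0; i < NKEYS; i++) {
			size_t n = strlen(keys[i]);

			if (!strncmp(line, keys[i], n) && line[n] == ' ')
				sscanf(line + n + 1, "%llu", &v[i]);
		}
	fclose(f);
}

int main(void)
{
	unsigned long long prev[NKEYS] = {0}, cur[NKEYS] = {0};
	unsigned int i;

	snap(prev);
	for (;;) {
		sleep(1);
		snap(cur);
		/* Print how much each counter grew in the last second */
		for (i = 0; i < NKEYS; i++)
			printf("%-24s %8llu\n", keys[i], cur[i] - prev[i]);
		putchar('\n');
		memcpy(prev, cur, sizeof(prev));
	}
	return 0;
}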

Btw, how many CPUs does this system have? There is a bug in this
version on how idle time is aggregated across multiple CPUs. The error
compounds with the number of CPUs in the system.

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.
Ruslan Ruslichenko - Oct. 27, 2017, 8:19 p.m.
Hi Johannes,

On 10/25/2017 08:54 PM, Johannes Weiner wrote:
> [...]
> How does it correlate with /proc/vmstat::workingset_activate during
> that time? It only counts thrashing time of refaults it can actively
> detect.
The workingset counters are growing quite actively too. Here are
some numbers per second:

workingset_refault        8201
workingset_activate        389
workingset_restore         187
workingset_nodereclaim     313

> Btw, how many CPUs does this system have? There is a bug in this
> version on how idle time is aggregated across multiple CPUs. The error
> compounds with the number of CPUs in the system.
The system has 2 CPU cores.
> I'm attaching 3 bugfixes that go on top of what you have. There might
> be some conflicts, but they should be minor variable naming issues.
>
I will test with your patches and get back to you.

Thanks,
Ruslan

Patch

From ea663e42b24871d370b6ddbfbf47c1775a2f9f09 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <jweiner@fb.com>
Date: Thu, 28 Sep 2017 10:36:39 -0700
Subject: [PATCH 3/3] mm: memdelay: drop IO as productive time

Counting IO as productive time distorts the sense of pressure with
workloads that are naturally IO-bound. Memory pressure can cause IO,
and thus cause "productive" IO to slow down - yet we don't attribute
this slowdown properly to a shortage of memory.

Disregard IO time altogether, and use CPU time alone as the baseline.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memdelay.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memdelay.c b/mm/memdelay.c
index ea5ede79f044..fbce1d4ba142 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -113,12 +113,11 @@  static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	 * one running the workload, the domain is considered fully
 	 * blocked 50% of the time.
 	 */
-	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+	if (mdc->tasks[MTS_DELAYED_ACTIVE])
 		state = MDS_FULL;
 	else if (mdc->tasks[MTS_DELAYED])
-		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
-			MDS_PART : MDS_FULL;
-	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
+		state = mdc->tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
+	else if (mdc->tasks[MTS_RUNNABLE])
 		state = MDS_NONE;
 	else
 		state = MDS_IDLE;
-- 
2.14.2
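
For reference, here is the per-CPU classification after this change,
restated as a standalone function. MTS_* and MDS_* mirror the enums in
the series; this is illustrative, not the literal kernel code:

/* Task states tracked per CPU (mirrors the series' enum) */
enum { MTS_IOWAIT, MTS_RUNNABLE, MTS_DELAYED, MTS_DELAYED_ACTIVE, NR_MTS };
/* Resulting domain states */
enum { MDS_NONE, MDS_PART, MDS_FULL, MDS_IDLE };

/* After this patch, MTS_IOWAIT no longer counts as productive time. */
static int domain_state(const int tasks[NR_MTS])
{
	if (tasks[MTS_DELAYED_ACTIVE])
		return MDS_FULL;	/* a delayed task is burning CPU */
	if (tasks[MTS_DELAYED])
		/* any productive CPU time running alongside the stall? */
		return tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
	if (tasks[MTS_RUNNABLE])
		return MDS_NONE;	/* productive time, no delays */
	return MDS_IDLE;		/* nothing to account */
}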