
kernel: fs: drop_caches: add dds drop_caches_count

Message ID 1455308080-27238-1-git-send-email-danielwa@cisco.com (mailing list archive)
State New, archived

Commit Message

Daniel Walker (danielwa) Feb. 12, 2016, 8:14 p.m. UTC
From: Khalid Mughal <khalidm@cisco.com>

Currently there is no way to figure out the droppable pagecache size
from the meminfo output. The MemFree size can shrink during normal
system operation, when some of the memory pages get cached and are
reflected in the "Cached" field. Similarly, for file operations, some of
the buffer memory gets cached and is reflected in the "Buffers" field.
The kernel automatically reclaims all this cached & buffered memory
when it is needed elsewhere on the system. The only way to manually
reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
this can have a performance impact. Since it discards cached objects,
it may cause high CPU & I/O utilization to recreate the dropped
objects during heavy system load.
This patch computes the droppable pagecache count, using the same
algorithm as "vm/drop_caches". It is non-destructive and does not
drop any pages. Therefore it has no impact on system
performance. The computation does not include the size of
reclaimable slab.

Cc: xe-kernel@external.cisco.com
Cc: dave.hansen@intel.com
Cc: hannes@cmpxchg.org
Cc: riel@redhat.com
Signed-off-by: Khalid Mughal <khalidm@cisco.com>
Signed-off-by: Daniel Walker <danielwa@cisco.com>
---
 Documentation/sysctl/vm.txt | 12 +++++++
 fs/drop_caches.c            | 80 +++++++++++++++++++++++++++++++++++++++++++--
 include/linux/mm.h          |  3 ++
 kernel/sysctl.c             |  7 ++++
 4 files changed, 100 insertions(+), 2 deletions(-)

Comments

Dave Chinner Feb. 14, 2016, 9:18 p.m. UTC | #1
On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
> From: Khalid Mughal <khalidm@cisco.com>
> 
> Currently there is no way to figure out the droppable pagecache size
> from the meminfo output. The MemFree size can shrink during normal
> system operation, when some of the memory pages get cached and is
> reflected in "Cached" field. Similarly for file operations some of
> the buffer memory gets cached and it is reflected in "Buffers" field.
> The kernel automatically reclaims all this cached & buffered memory,
> when it is needed elsewhere on the system. The only way to manually
> reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
> this can have performance impact. Since it discards cached objects,
> it may cause high CPU & I/O utilization to recreate the dropped
> objects during heavy system load.
> This patch computes the droppable pagecache count, using same
> algorithm as "vm/drop_caches". It is non-destructive and does not
> drop any pages. Therefore it does not have any impact on system
> performance. The computation does not include the size of
> reclaimable slab.

Why, exactly, do you need this? You've described what the patch
does (i.e. redundant, because we can read the code), and described
that the kernel already accounts this reclaimable memory elsewhere
and you can already read that and infer the amount of reclaimable
memory from it. So why isn't that accounting sufficient?

As to the code, I think it is a horrible hack - the calculation
does not come for free. Forcing iteration over all the inodes in the
inode cache is not something we should allow users to do - what's to
stop someone just doing this 100 times in parallel and DoSing the
machine?

Or what happens when someone does 'grep "" /proc/sys/vm/*' to see
what all the VM settings are on a machine with a couple of TB of
page cache spread across a couple of hundred million cached inodes?
It a) takes a long time, b) adds sustained load to an already
contended lock (sb->s_inode_list_lock), and c) isn't configuration
information.

Cheers,

Dave.
Daniel Walker (danielwa) Feb. 15, 2016, 6:19 p.m. UTC | #2
On 02/14/2016 01:18 PM, Dave Chinner wrote:
> On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
>> From: Khalid Mughal <khalidm@cisco.com>
>>
>> Currently there is no way to figure out the droppable pagecache size
>> from the meminfo output. The MemFree size can shrink during normal
>> system operation, when some of the memory pages get cached and is
>> reflected in "Cached" field. Similarly for file operations some of
>> the buffer memory gets cached and it is reflected in "Buffers" field.
>> The kernel automatically reclaims all this cached & buffered memory,
>> when it is needed elsewhere on the system. The only way to manually
>> reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
>> this can have performance impact. Since it discards cached objects,
>> it may cause high CPU & I/O utilization to recreate the dropped
>> objects during heavy system load.
>> This patch computes the droppable pagecache count, using same
>> algorithm as "vm/drop_caches". It is non-destructive and does not
>> drop any pages. Therefore it does not have any impact on system
>> performance. The computation does not include the size of
>> reclaimable slab.
> Why, exactly, do you need this? You've described what the patch
> does (i.e. redundant, because we can read the code), and described
> that the kernel already accounts this reclaimable memory elsewhere
> and you can already read that and infer the amount of reclaimable
> memory from it. So why isn't that accounting sufficient?

We need it to determine accurately what the free memory in the system
is. If you know where we can get this information already, please tell us;
we aren't aware of it. For instance, /proc/meminfo isn't accurate enough.

> As to the code, I think it is a horrible hack - the calculation
> does not come for free. Forcing iteration all the inodes in the
> inode cache is not something we should allow users to do - what's to
> stop someone just doing this 100 times in parallel and DOSing the
> machine?

Yes it is costly.

>
> Or what happens when someone does 'grep "" /proc/sys/vm/*" to see
> what all the VM settings are on a machine with a couple of TB of
> page cache spread across a couple of hundred million cached inodes?
> It a) takes a long time, b) adds sustained load to an already
> contended lock (sb->s_inode_list_lock), and c) isn't configuration
> information.
>

We could make it "echo 4 > /proc/sys/vm/drop_caches", then you "cat
/proc/sys/vm/drop_caches_count"; that would make the person executing the
command responsible for the latency, so grep wouldn't trigger it.

Daniel
Dave Chinner Feb. 15, 2016, 11:05 p.m. UTC | #3
On Mon, Feb 15, 2016 at 10:19:54AM -0800, Daniel Walker wrote:
> On 02/14/2016 01:18 PM, Dave Chinner wrote:
> >On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
> >>From: Khalid Mughal <khalidm@cisco.com>
> >>
> >>Currently there is no way to figure out the droppable pagecache size
> >>from the meminfo output. The MemFree size can shrink during normal
> >>system operation, when some of the memory pages get cached and is
> >>reflected in "Cached" field. Similarly for file operations some of
> >>the buffer memory gets cached and it is reflected in "Buffers" field.
> >>The kernel automatically reclaims all this cached & buffered memory,
> >>when it is needed elsewhere on the system. The only way to manually
> >>reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
> >>this can have performance impact. Since it discards cached objects,
> >>it may cause high CPU & I/O utilization to recreate the dropped
> >>objects during heavy system load.
> >>This patch computes the droppable pagecache count, using same
> >>algorithm as "vm/drop_caches". It is non-destructive and does not
> >>drop any pages. Therefore it does not have any impact on system
> >>performance. The computation does not include the size of
> >>reclaimable slab.
> >Why, exactly, do you need this? You've described what the patch
> >does (i.e. redundant, because we can read the code), and described
> >that the kernel already accounts this reclaimable memory elsewhere
> >and you can already read that and infer the amount of reclaimable
> >memory from it. So why isn't that accounting sufficient?
> 
> We need it to determine accurately what the free memory in the
> system is. If you know where we can get this information already
> please tell, we aren't aware of it. For instance /proc/meminfo isn't
> accurate enough.

What you are proposing isn't accurate, either, because it will be
stale by the time the inode cache traversal is completed and the
count returned to userspace. e.g. pages that have already been
accounted as droppable can be reclaimed or marked dirty and hence
"unreclaimable".

IOWs, the best you are going to get is an approximate point-in-time
indication of how much memory is available for immediate reclaim.
We're never going to get an accurate measure in userspace unless we
accurately account for it in the kernel itself. Which, I think it
has already been pointed out, is prohibitively expensive so isn't
done.

As for a replacement, looking at what pages you consider "droppable",
it is really only file pages that are not dirty or under
writeback. i.e. from /proc/meminfo:

Active(file):     220128 kB
Inactive(file):    60232 kB
Dirty:                 0 kB
Writeback:             0 kB

i.e. reclaimable file cache = Active + inactive - dirty - writeback.

And while you are there, when you drop slab caches:

SReclaimable:      66632 kB

some amount of that may be freed. No guarantees can be made about
the amount, though.

Cheers,

Dave.
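
A minimal userspace sketch of that estimate (assuming the standard
/proc/meminfo field names, with all values in kB; as discussed above it is
only a point-in-time approximation):

#include <stdio.h>
#include <string.h>

/* Return the value of one "Field:   <n> kB" entry from /proc/meminfo. */
static long meminfo_kb(const char *field)
{
	FILE *f = fopen("/proc/meminfo", "r");
	size_t len = strlen(field);
	char line[256];
	long val = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, field, len) && line[len] == ':') {
			sscanf(line + len + 1, "%ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long active    = meminfo_kb("Active(file)");
	long inactive  = meminfo_kb("Inactive(file)");
	long dirty     = meminfo_kb("Dirty");
	long writeback = meminfo_kb("Writeback");

	/* Dave's estimate: clean, non-writeback file cache is droppable. */
	printf("approx. reclaimable file cache: %ld kB\n",
	       active + inactive - dirty - writeback);
	/* Upper bound on what dropping reclaimable slab may add. */
	printf("reclaimable slab (upper bound):  %ld kB\n",
	       meminfo_kb("SReclaimable"));
	return 0;
}

Run against the figures quoted above, this prints roughly 280360 kB
(220128 + 60232 - 0 - 0).
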
Daniel Walker (danielwa) Feb. 15, 2016, 11:52 p.m. UTC | #4
On 02/15/2016 03:05 PM, Dave Chinner wrote:
> On Mon, Feb 15, 2016 at 10:19:54AM -0800, Daniel Walker wrote:
>> On 02/14/2016 01:18 PM, Dave Chinner wrote:
>>> On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
>>>> From: Khalid Mughal <khalidm@cisco.com>
>>>>
>>>> Currently there is no way to figure out the droppable pagecache size
>>> >from the meminfo output. The MemFree size can shrink during normal
>>>> system operation, when some of the memory pages get cached and is
>>>> reflected in "Cached" field. Similarly for file operations some of
>>>> the buffer memory gets cached and it is reflected in "Buffers" field.
>>>> The kernel automatically reclaims all this cached & buffered memory,
>>>> when it is needed elsewhere on the system. The only way to manually
>>>> reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
>>>> this can have performance impact. Since it discards cached objects,
>>>> it may cause high CPU & I/O utilization to recreate the dropped
>>>> objects during heavy system load.
>>>> This patch computes the droppable pagecache count, using same
>>>> algorithm as "vm/drop_caches". It is non-destructive and does not
>>>> drop any pages. Therefore it does not have any impact on system
>>>> performance. The computation does not include the size of
>>>> reclaimable slab.
>>> Why, exactly, do you need this? You've described what the patch
>>> does (i.e. redundant, because we can read the code), and described
>>> that the kernel already accounts this reclaimable memory elsewhere
>>> and you can already read that and infer the amount of reclaimable
>>> memory from it. So why isn't that accounting sufficient?
>> We need it to determine accurately what the free memory in the
>> system is. If you know where we can get this information already
>> please tell, we aren't aware of it. For instance /proc/meminfo isn't
>> accurate enough.
> What you are proposing isn't accurate, either, because it will be
> stale by the time the inode cache traversal is completed and the
> count returned to userspace. e.g. pages that have already been
> accounted as droppable can be reclaimed or marked dirty and hence
> "unreclaimable".
>
> IOWs, the best you are going to get is an approximate point-in-time
> indication of how much memory is available for immediate reclaim.
> We're never going to get an accurate measure in userspace unless we
> accurately account for it in the kernel itself. Which, I think it
> has already been pointed out, is prohibitively expensive so isn't
> done.
>
> As for a replacement, looking at what pages you consider "droppable"
> is really only file pages that are not under dirty or under
> writeback. i.e. from /proc/meminfo:
>
> Active(file):     220128 kB
> Inactive(file):    60232 kB
> Dirty:                 0 kB
> Writeback:             0 kB
>
> i.e. reclaimable file cache = Active + inactive - dirty - writeback.
>
> And while you are there, when you drop slab caches:
>
> SReclaimable:      66632 kB
>
> some amount of that may be freed. No guarantees can be made about
> the amount, though.

I got this response from another engineer here at Cisco (Nag, who is
also CC'd),

"

Approximate point-in-time indication is an accurate characterization of what we are doing. This is good enough for us. No matter what we do, we are never going to be able to address the "time of check to time of use" window.  But, this approximation works reasonably well for our use case.

As to his other suggestion of estimating the droppable cache, I have considered it but found it unusable. The problem is that the inactive file pages count a whole lot more pages than the droppable pages.

See the value of these, before and [after] dropping reclaimable pages.

Before:

Active(file):     183488 kB
Inactive(file):   180504 kB

After (the drop caches):
Active(file):      89468 kB
Inactive(file):    32016 kB

The dirty and the writeback counts are mostly 0 kB under our workload, as we are
mostly dealing with the read-only file pages of binaries
(programs/libraries).
"


Theodore Ts'o Feb. 16, 2016, 12:45 a.m. UTC | #5
On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
> >>We need it to determine accurately what the free memory in the
> >>system is. If you know where we can get this information already
> >>please tell, we aren't aware of it. For instance /proc/meminfo isn't
> >>accurate enough.
> 
> Approximate point-in-time indication is an accurate characterization
> of what we are doing. This is good enough for us. NO matter what we
> do, we are never going to be able to address the "time of check to
> time of use” window.  But, this approximation works reasonably well
> for our use case.

Why do you need such accuracy, and what do you consider "good enough"?
Having something which iterates over all of the inodes in the system
is something that really shouldn't be in a general production kernel.
At the very least it should only be accessible by root (so now only a
careless system administrator can DOS attack the system), but
Dave's original question still stands.  Why do you need a certain
level of accuracy regarding how much memory is available after
dropping all of the caches?  What problem are you trying to
solve/avoid?

It may be that you are going about things completely the wrong way,
which is why understanding the higher order problem you are trying to
solve might be helpful in finding something which is safer,
architecturally cleaner, and something that could go into the upstream
kernel.

Cheers,

						- Ted
Nag Avadhanam (nag) Feb. 16, 2016, 2:58 a.m. UTC | #6
We have a class of platforms that are essentially swap-less embedded
systems that have limited memory resources (2GB and less).

There is a need to implement early alerts (before the OOM killer kicks in)
based on the current memory usage so admins can take appropriate steps (do
not initiate provisioning operations but support existing services,
de-provision certain services, etc. based on the extent of memory usage in
the system).

There is also a general need to let end users know the available memory so
they can determine if they can enable new services (helps in planning).

These two depend upon knowing the approximate (accurate to within a few
tens of MB) memory usage within the system. We want to alert admins before
the system exhibits any thrashing behaviors.

We find the source of accounting anomalies to be the page cache
accounting. Anonymous page accounting is fine. Page cache usage on our
system can be attributed to these: file system cache, shared memory store
(non-reclaimable) and the in-memory file systems (non-reclaimable). We
know the sizes of the shared memory stores and the in-memory file system
sizes.

If we can determine the amount of reclaimable file system cache (+/- a few
tens of MB), we can improve the serviceability of these systems.
 
Total - (# of bytes of anon pages + # of bytes of shared memory/tmpfs
pages + # of bytes of non-reclaimable file system cache pages) gives us a
measure of the available memory.


It's the calculation of the # of bytes of non-reclaimable file system cache
pages that has been troubling us. We do not want to count inactive file
pages (of programs/binaries) that were once mapped by any process in the
system as reclaimable, because that might lead to thrashing under memory
pressure (we want to alert admins before the system starts dropping text
pages).

From our experiments, we determined that running a VM scan looking for
droppable pages came close to establishing that number. If there are
cheaper ways of determining this stat, please let us know.

Thanks,
nag 


On 2/15/16, 4:45 PM, "Theodore Ts'o" <tytso@mit.edu> wrote:

>On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
>> >>We need it to determine accurately what the free memory in the
>> >>system is. If you know where we can get this information already
>> >>please tell, we aren't aware of it. For instance /proc/meminfo isn't
>> >>accurate enough.
>> 
>> Approximate point-in-time indication is an accurate characterization
>> of what we are doing. This is good enough for us. NO matter what we
>> do, we are never going to be able to address the "time of check to
>> time of use" window.  But, this approximation works reasonably well
>> for our use case.
>
>Why do you need such accuracy, and what do you consider "good enough".
>Having something which iterates over all of the inodes in the system
>is something that really shouldn't be in a general production kernel
>At the very least it should only be accessible by root (so now only a
>careless system administrator can DOS attack the system) but the
>Dave's original question still stands.  Why do you need a certain
>level of accuracy regarding how much memory is available after
>dropping all of the caches?  What problem are you trying to
>solve/avoid?
>
>It may be that you are going about things completely the wrong way,
>which is why understanding the higher order problem you are trying to
>solve might be helpful in finding something which is safer,
>architecturally cleaner, and something that could go into the upstream
>kernel.
>
>Cheers,
>
>						- Ted

Dave Chinner Feb. 16, 2016, 5:28 a.m. UTC | #7
On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
> On 02/15/2016 03:05 PM, Dave Chinner wrote:
> >What you are proposing isn't accurate, either, because it will be
> >stale by the time the inode cache traversal is completed and the
> >count returned to userspace. e.g. pages that have already been
> >accounted as droppable can be reclaimed or marked dirty and hence
> >"unreclaimable".
> >
> >IOWs, the best you are going to get is an approximate point-in-time
> >indication of how much memory is available for immediate reclaim.
> >We're never going to get an accurate measure in userspace unless we
> >accurately account for it in the kernel itself. Which, I think it
> >has already been pointed out, is prohibitively expensive so isn't
> >done.
> >
> >As for a replacement, looking at what pages you consider "droppable"
> >is really only file pages that are not under dirty or under
> >writeback. i.e. from /proc/meminfo:
> >
> >Active(file):     220128 kB
> >Inactive(file):    60232 kB
> >Dirty:                 0 kB
> >Writeback:             0 kB
> >
> >i.e. reclaimable file cache = Active + inactive - dirty - writeback.
.....

> Approximate point-in-time indication is an accurate
> characterization of what we are doing. This is good enough for us.
> NO matter what we do, we are never going to be able to address the
> "time of check to time of use” window.  But, this
> approximation works reasonably well for our use case.
> 
> As to his other suggestion of estimating the droppable cache, I
> have considered it but found it unusable. The problem is the
> inactive file pages count a whole lot pages more than the
> droppable pages.

inactive file pages are supposed to be exactly that - inactive. i.e.
they have not been referenced recently, and are unlikely to be dirty.
They should be immediately reclaimable.

> See the value of these, before and [after] dropping reclaimable
> pages.
> 
> Before:
> 
> Active(file):     183488 kB
> Inactive(file):   180504 kB
> 
> After (the drop caches):
>
> Active(file):      89468 kB
> Inactive(file):    32016 kB
>
> The dirty and the write back are mostly 0KB under our workload as
> we are mostly dealing with the readonly file pages of binaries
> (programs/libraries)..  "

if the pages are read-only, then they are clean. If they are on the
LRUs, then they should be immediately reclaimable.

Let's go back to your counting criteria of all those file pages:

+static int is_page_droppable(struct page *page)
+{
+       struct address_space *mapping = page_mapping(page);
+
+       if (!mapping)
+               return 0;

invalidated page, should be none.

+       if (PageDirty(page))
+               return 0;

Dirty pages get ignored; they are accounted as "Dirty" in /proc/meminfo.

+       if (PageWriteback(page))
+               return 0;

Writeback pages get ignored; they are accounted as "Writeback" in /proc/meminfo.

+       if (page_mapped(page))
+               return 0;

Clean pages mapped into userspace get ignored; they are accounted as "Mapped" in /proc/meminfo.

+       if (page->mapping != mapping)
+               return 0;

Invalidation race, should be none.

+       if (page_has_private(page))
+               return 0;

That's simply wrong. For XFS inodes, that will skip *every page on
every inode* because XFS attaches bufferheads to every page, even on
read. ext4 behaviour will depend on mount options and whether the
page has been dirtied or not. IOWs, this turns the number of
reclaimable pages in the inode cache into garbage because it counts
clean, reclaimable pages with attached bufferheads as non-reclaimable.

But let's ignore that by assuming you have read-only pages without
bufferheads (e.g. ext4, blocksize = pagesize, nobh mode on read-only
pages). That means the only thing that makes a difference to the
count returned is mapped pages, a count of which is also in
/proc/meminfo.

So, to pick a random active server here:

		before		after
Active(file):   12103200 kB	24060 kB
Inactive(file):  5976676 kB	 1380 kB
Mapped:            31308 kB	31308 kB

How much was not reclaimed? Roughly the same number of pages as the
Mapped count, and that's exactly what we'd expect to see from the
above page walk counting code. Hence a slightly better approximation
of the pages that dropping caches will reclaim is:

reclaimable pages = active + inactive - dirty - writeback - mapped

Cheers,

Dave.
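
As a quick check of that estimate against the numbers quoted above: before
the drop, Active(file) + Inactive(file) = 12103200 + 5976676 = 18079876 kB;
afterwards 24060 + 1380 = 25440 kB remain, which is in the same ballpark as
the 31308 kB Mapped count. The refined estimate, 18079876 - 0 - 0 - 31308 =
18048568 kB, is therefore close to the 18054436 kB that was actually
reclaimed.
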
Dave Chinner Feb. 16, 2016, 5:38 a.m. UTC | #8
On Tue, Feb 16, 2016 at 02:58:04AM +0000, Nag Avadhanam (nag) wrote:
> Its the calculation of the # of bytes of non-reclaimable file system cache
> pages that has been troubling us. We do not want to count inactive file
> pages (of programs/binaries) that were once mapped by any process in the
> system as reclaimable because that might lead to thrashing under memory
> pressure (we want to alert admins before system starts dropping text
> pages).

The code presented does not match your requirements. It only considers
whether pages are currently mapped into ptes. Hence it will tell you
that once-used and now-unmapped binary pages are reclaimable, and
drop_caches will reclaim them, so they'll need to be fetched from
disk again if they are faulted in again after a drop_caches run.

Cheers,

Dave.
Nag Avadhanam (nag) Feb. 16, 2016, 5:57 a.m. UTC | #9
On Mon, 15 Feb 2016, Dave Chinner wrote:

> On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
>> On 02/15/2016 03:05 PM, Dave Chinner wrote:
>>> What you are proposing isn't accurate, either, because it will be
>>> stale by the time the inode cache traversal is completed and the
>>> count returned to userspace. e.g. pages that have already been
>>> accounted as droppable can be reclaimed or marked dirty and hence
>>> "unreclaimable".
>>>
>>> IOWs, the best you are going to get is an approximate point-in-time
>>> indication of how much memory is available for immediate reclaim.
>>> We're never going to get an accurate measure in userspace unless we
>>> accurately account for it in the kernel itself. Which, I think it
>>> has already been pointed out, is prohibitively expensive so isn't
>>> done.
>>>
>>> As for a replacement, looking at what pages you consider "droppable"
>>> is really only file pages that are not under dirty or under
>>> writeback. i.e. from /proc/meminfo:
>>>
>>> Active(file):     220128 kB
>>> Inactive(file):    60232 kB
>>> Dirty:                 0 kB
>>> Writeback:             0 kB
>>>
>>> i.e. reclaimable file cache = Active + inactive - dirty - writeback.
> .....
>
>> Approximate point-in-time indication is an accurate
>> characterization of what we are doing. This is good enough for us.
>> NO matter what we do, we are never going to be able to address the
>> "time of check to time of use” window.  But, this
>> approximation works reasonably well for our use case.
>>
>> As to his other suggestion of estimating the droppable cache, I
>> have considered it but found it unusable. The problem is the
>> inactive file pages count a whole lot pages more than the
>> droppable pages.
>
> inactive file pages are supposed to be exactly that - inactive. i.e.
> the have not been referenced recently, and are unlikely to be dirty.
> They should be immediately reclaimable.
>
>> See the value of these, before and [after] dropping reclaimable
>> pages.
>>
>> Before:
>>
>> Active(file):     183488 kB
>> Inactive(file):   180504 kB
>>
>> After (the drop caches):
>>
>> Active(file):      89468 kB
>> Inactive(file):    32016 kB
>>
>> The dirty and the write back are mostly 0KB under our workload as
>> we are mostly dealing with the readonly file pages of binaries
>> (programs/libraries)..  "
>
> if the pages are read-only, then they are clean. If they are on the
> LRUs, then they should be immediately reclaimable.
>
> Let's go back to your counting criteria of all those file pages:
>
> +static int is_page_droppable(struct page *page)
> +{
> +       struct address_space *mapping = page_mapping(page);
> +
> +       if (!mapping)
> +               return 0;
>
> invalidated page, should be none.
>
> +       if (PageDirty(page))
> +               return 0;
>
> Dirty get ignored, in /proc/meminfo.
>
> +       if (PageWriteback(page))
> +               return 0;
>
> Writeback ignored, in /proc/meminfo.
>
> +       if (page_mapped(page))
> +               return 0;
>
> Clean page, mapped into userspace get ignored, in /proc/meminfo.
>
> +       if (page->mapping != mapping)
> +               return 0;
>
> Invalidation race, should be none.
>
> +       if (page_has_private(page))
> +               return 0;
>
> That's simply wrong. For XFs inodes, that will skip *every page on
> every inode* because it attachs bufferheads to every page, even on
> read. ext4 behaviour will depend on mount options and whether the
> page has been dirtied or not. IOWs, this turns the number of
> reclaimable pages in the inode cache into garbage because it counts
> clean, reclaimable pages with attached bufferheads as non-reclaimable.
>
> But let's ignore that by assuming you have read-only pages without
> bufferheads (e.g. ext4, blocksize = pagesize, nobh mode on read-only
> pages). That means the only thing that makes a difference to the
> count returned is mapped pages, a count of which is also in
> /proc/meminfo.
>
> So, to pick a random active server here:
>
> 		before		after
> Active(file):   12103200 kB	24060 kB
> Inactive(file):  5976676 kB	 1380 kB
> Mapped:            31308 kB	31308 kB
>
> How much was not reclaimed? Roughly the same number of pages as the
> Mapped count, and that's exactly what we'd expect to see from the
> above page walk counting code. Hence a slightly better approximation
> of the pages that dropping caches will reclaim is:
>
> reclaimable pages = active + inactive - dirty - writeback - mapped

Thanks Dave. I considered that, but see this.

The Mapped page count below is much higher than
(Active(file) + Inactive(file)).

Mapped seems to include all page cache pages mapped into process memory,
including the shared memory pages, file pages and a few other types of
mapping.

I suppose the above can be rewritten as (mapped is still high):

reclaimable pages = active + inactive + shmem - dirty - writeback - mapped

What about kernel pages mapped into user address space? Does "Mapped"
include those pages as well? How do we exclude them? What about device 
mappings? Are these excluded from the "Mapped" pages calculation?

MemTotal:        1025444 kB
MemFree:          264712 kB
Buffers:            1220 kB
Cached:           212736 kB
SwapCached:            0 kB
Active:           398232 kB
Inactive:         240892 kB
Active(anon):     307588 kB
Inactive(anon):   204860 kB
Active(file):      90644 kB
Inactive(file):    36032 kB
Unevictable:       22672 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                24 kB
Writeback:             0 kB
AnonPages:        447848 kB
Mapped:           202624 kB
Shmem:             64608 kB
Slab:              29632 kB
SReclaimable:      10996 kB
SUnreclaim:        18636 kB
KernelStack:        2528 kB
PageTables:         7936 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      512720 kB
Committed_AS:     973504 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      140060 kB
VmallocChunk:   34359595388 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10240 kB
DirectMap2M:     1042432 kB

thanks,
nag
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
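
Plugging the /proc/meminfo dump above into that formula: Active(file) +
Inactive(file) = 90644 + 36032 = 126676 kB, while Mapped is 202624 kB, so
"active + inactive - dirty - writeback - mapped" goes negative on this
system; even adding Shmem back in (126676 + 64608 = 191284 kB) still falls
short of the Mapped figure, which is the "Mapped is still high" problem
Nag describes above.
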
Nag Avadhanam (nag) Feb. 16, 2016, 7:14 a.m. UTC | #10
On Mon, 15 Feb 2016, Dave Chinner wrote:

> On Tue, Feb 16, 2016 at 02:58:04AM +0000, Nag Avadhanam (nag) wrote:
>> Its the calculation of the # of bytes of non-reclaimable file system cache
>> pages that has been troubling us. We do not want to count inactive file
>> pages (of programs/binaries) that were once mapped by any process in the
>> system as reclaimable because that might lead to thrashing under memory
>> pressure (we want to alert admins before system starts dropping text
>> pages).
>
> The code presented does not match your requirements. It only counts
> pages that are currently mapped into ptes. hence it will tell you
> that once-used and now unmapped binary pages are reclaimable, and
> drop caches will reclaim them. hence they'll need to be fetched from
> disk again if they are faulted in again after a drop_caches run.

Will the inactive binary pages be automatically unmapped even if the process
into whose address space they are mapped is still around? I thought they
are left mapped until such time as there is memory pressure.

We only care about binary pages (active and inactive) mapped into the address
spaces of live processes. It's okay to aggressively reclaim inactive
pages once mapped into processes that are no longer around.

thanks,
nag

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Dave Chinner Feb. 16, 2016, 8:22 a.m. UTC | #11
On Mon, Feb 15, 2016 at 09:57:42PM -0800, Nag Avadhanam wrote:
> On Mon, 15 Feb 2016, Dave Chinner wrote:
> >So, to pick a random active server here:
> >
> >		before		after
> >Active(file):   12103200 kB	24060 kB
> >Inactive(file):  5976676 kB	 1380 kB
> >Mapped:            31308 kB	31308 kB
> >
> >How much was not reclaimed? Roughly the same number of pages as the
> >Mapped count, and that's exactly what we'd expect to see from the
> >above page walk counting code. Hence a slightly better approximation
> >of the pages that dropping caches will reclaim is:
> >
> >reclaimable pages = active + inactive - dirty - writeback - mapped
> 
> Thanks Dave. I considered that, but see this.
> 
> Mapped page count below is much higher than the (active(file) +
> inactive (file)).

Yes. It's all unreclaimable from drop_caches, though.

> Mapped seems to include all page cache pages mapped into the process
> memory, including the shared memory pages, file pages and few other
> type
> mappings.
> 
> I suppose the above can be rewritten as (mapped is still high):
> 
> reclaimable pages = active + inactive + shmem - dirty - writeback - mapped
> 
> What about kernel pages mapped into user address space? Does "Mapped"
> include those pages as well? How do we exclude them? What about
> device mappings? Are these excluded in the "Mapped" pages
> calculation?

/me shrugs.

I have no idea - I really don't care about what pages are accounted
as mapped. I assumed that the patch proposed addressed your
requirements and so I suggested an alternative that provided almost
exactly the same information but erred on the side of
underestimation and hence solves your problem of drop_caches not
freeing as much memory as you expected....

Cheers,

Dave.
Dave Chinner Feb. 16, 2016, 8:35 a.m. UTC | #12
On Mon, Feb 15, 2016 at 11:14:13PM -0800, Nag Avadhanam wrote:
> On Mon, 15 Feb 2016, Dave Chinner wrote:
> 
> >On Tue, Feb 16, 2016 at 02:58:04AM +0000, Nag Avadhanam (nag) wrote:
> >>Its the calculation of the # of bytes of non-reclaimable file system cache
> >>pages that has been troubling us. We do not want to count inactive file
> >>pages (of programs/binaries) that were once mapped by any process in the
> >>system as reclaimable because that might lead to thrashing under memory
> >>pressure (we want to alert admins before system starts dropping text
> >>pages).
> >
> >The code presented does not match your requirements. It only counts
> >pages that are currently mapped into ptes. hence it will tell you
> >that once-used and now unmapped binary pages are reclaimable, and
> >drop caches will reclaim them. hence they'll need to be fetched from
> >disk again if they are faulted in again after a drop_caches run.
> 
> Will the inactive binary pages be automatically unmapped even if the process
> into whose address space they are mapped is still around? I thought they
> are left mapped until such time there is memory pressure.

Right, page reclaim via memory pressure can unmap mapped pages in
order to reclaim them. Drop caches will skip them.

> We only care for binary pages (active and inactive) mapped into the
> address spaces of live processes. Its okay to aggressively reclaim
> inactive
> pages once mapped into processes that are no longer around.

Ok, if you're only concerned about live processes then drop caches
should behave as you want.

Cheers,

Dave.
Vladimir Davydov Feb. 16, 2016, 8:43 a.m. UTC | #13
On Tue, Feb 16, 2016 at 02:58:04AM +0000, Nag Avadhanam (nag) wrote:
> We have a class of platforms that are essentially swap-less embedded
> systems that have limited memory resources (2GB and less).
> 
> There is a need to implement early alerts (before the OOM killer kicks in)
> based on the current memory usage so admins can take appropriate steps (do
> not initiate provisioning operations but support existing services,
> de-provision certain services, etc. based on the extent of memory usage in
> the system) . 
> 
> There is also a general need to let end users know the available memory so
> they can determine if they can enable new services (helps in planning).
> 
> These two depend upon knowing approximate (accurate within few 10s of MB)
> memory usage within the system. We want to alert admins before system
> exhibits any thrashing behaviors.

Have you considered using /proc/kpageflags for counting such pages? It
should already export all information about memory pages you might need,
e.g. which pages are mapped, which are anonymous, which are inactive,
basically all page flags and even more. Moreover, you can even determine
the set of pages that are really read/written by processes - see
/sys/kernel/mm/page_idle/bitmap. On such a small machine scanning the
whole pfn range should be pretty cheap, so you might find this API
acceptable.

Thanks,
Vladimir
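
A rough sketch of the kind of scan Vladimir suggests (it assumes root
access, the KPF_* bit numbers exported by <linux/kernel-page-flags.h>, and
a hypothetical "droppable" test - an LRU page that is not anonymous,
swap-backed, dirty, under writeback, mapped or slab - which only
approximates the drop_caches criteria):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h>

#define BIT(nr) (1ULL << (nr))

int main(void)
{
	uint64_t flags[4096];	/* one 64-bit flag word per page frame */
	long droppable = 0;
	ssize_t n;
	int i;
	int fd = open("/proc/kpageflags", O_RDONLY);

	if (fd < 0) {
		perror("open /proc/kpageflags (needs root)");
		return 1;
	}

	while ((n = read(fd, flags, sizeof(flags))) > 0) {
		for (i = 0; i < n / 8; i++) {
			uint64_t f = flags[i];

			if (!(f & BIT(KPF_LRU)))
				continue;	/* not on the LRU at all */
			if (f & (BIT(KPF_ANON) | BIT(KPF_SWAPBACKED) |
				 BIT(KPF_DIRTY) | BIT(KPF_WRITEBACK) |
				 BIT(KPF_MMAP) | BIT(KPF_SLAB)))
				continue;	/* not immediately droppable */
			droppable++;
		}
	}
	close(fd);

	printf("approx. droppable page cache: %ld kB\n",
	       droppable * (sysconf(_SC_PAGESIZE) / 1024));
	return 0;
}

On a 2 GB machine this reads roughly 4 MB of flag data (8 bytes per 4 kB
page), so the whole scan stays cheap, as Vladimir notes.
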
Rik van Riel Feb. 16, 2016, 4:12 p.m. UTC | #14
On Tue, 2016-02-16 at 16:28 +1100, Dave Chinner wrote:
> On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
> > On 02/15/2016 03:05 PM, Dave Chinner wrote:
> > > 
> > > As for a replacement, looking at what pages you consider
> > > "droppable"
> > > is really only file pages that are not under dirty or under
> > > writeback. i.e. from /proc/meminfo:
> > > 
> > > Active(file):     220128 kB
> > > Inactive(file):    60232 kB
> > > Dirty:                 0 kB
> > > Writeback:             0 kB
> > > 
> > > i.e. reclaimable file cache = Active + inactive - dirty -
> > > writeback.
> .....

> > As to his other suggestion of estimating the droppable cache, I
> > have considered it but found it unusable. The problem is the
> > inactive file pages count a whole lot pages more than the
> > droppable pages.
> 
> inactive file pages are supposed to be exactly that - inactive. i.e.
> the have not been referenced recently, and are unlikely to be dirty.
> They should be immediately reclaimable.

Inactive file pages can still be mapped by processes.

The reason we do not unmap file pages when moving them to the inactive
list is that some workloads fill essentially all of memory with mmapped
file pages.

Given that the inactive list is generally a considerable fraction of
file memory, unmapping pages that get deactivated could create a lot of
churn and unnecessary page faults for that kind of workload.

-- 
All rights reversed
Nag Avadhanam (nag) Feb. 16, 2016, 6:37 p.m. UTC | #15
On Tue, 16 Feb 2016, Vladimir Davydov wrote:

> On Tue, Feb 16, 2016 at 02:58:04AM +0000, Nag Avadhanam (nag) wrote:
>> We have a class of platforms that are essentially swap-less embedded
>> systems that have limited memory resources (2GB and less).
>>
>> There is a need to implement early alerts (before the OOM killer kicks in)
>> based on the current memory usage so admins can take appropriate steps (do
>> not initiate provisioning operations but support existing services,
>> de-provision certain services, etc. based on the extent of memory usage in
>> the system) .
>>
>> There is also a general need to let end users know the available memory so
>> they can determine if they can enable new services (helps in planning).
>>
>> These two depend upon knowing approximate (accurate within few 10s of MB)
>> memory usage within the system. We want to alert admins before system
>> exhibits any thrashing behaviors.
>
> Have you considered using /proc/kpageflags for counting such pages? It
> should already export all information about memory pages you might need,
> e.g. which pages are mapped, which are anonymous, which are inactive,
> basically all page flags and even more. Moreover, you can even determine
> the set of pages that are really read/written by processes - see
> /sys/kernel/mm/page_idle/bitmap. On such a small machine scanning the
> whole pfn range should be pretty cheap, so you might find this API
> acceptable.

Thanks Vladimir. I came across the pagemap interface some time ago. I
was not sure if it's mainstream. I think this should allow a userspace
VM scan (scans might take a bit longer). Will try it.

We could avoid the scans altogether.

The need, plainly put, is to inform the admins of these swapless embedded
systems of the available memory.

If we can reliably and efficiently maintain counts of file pages
(inactive and active) mapped into the address spaces of active user space
processes, this need can be met. "Mapped" in /proc/meminfo does not seem
to be a direct fit for this purpose (I need to understand this better).
If I know for sure that "Mapped" does not count device and kernel pages
mapped into user space, then I can employ it gainfully for this need.

(Cached - Shmem - <mapped file/binary pages of active processes>) gives me
the reclaimable file pages. If I can determine that, then I can add it to
MemFree and determine the available memory.

Thanks,
nag

>
> Thanks,
> Vladimir
>


Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 89a887c..13a501c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -29,6 +29,7 @@  Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- drop_caches_count
 - extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
@@ -224,6 +225,17 @@  with your system.  To disable them, echo 4 (bit 3) into drop_caches.
 
 ==============================================================
 
+drop_caches_count
+
+The amount of droppable pagecache (in kilobytes). Reading this file
+performs the same calculation as writing 1 to /proc/sys/vm/drop_caches.
+The actual pages are not dropped during computation of this value.
+
+To read the value:
+	cat /proc/sys/vm/drop_caches_count
+
+==============================================================
+
 extfrag_threshold
 
 This parameter affects whether the kernel will compact memory or direct
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..0cb2186 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -8,12 +8,73 @@ 
 #include <linux/writeback.h>
 #include <linux/sysctl.h>
 #include <linux/gfp.h>
+#include <linux/init.h>
+#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/vmstat.h>
+#include <linux/blkdev.h>
+
 #include "internal.h"
 
 /* A global variable is a bit ugly, but it keeps the code simple */
+
 int sysctl_drop_caches;
+unsigned int sysctl_drop_caches_count;
+
+static int is_page_droppable(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return 0;
+	if (PageDirty(page))
+		return 0;
+	if (PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	if (page->mapping != mapping)
+		return 0;
+	if (page_has_private(page))
+		return 0;
+	return 1;
+}
+
+static unsigned long count_unlocked_pages(struct address_space *mapping)
+{
+	struct pagevec pvec;
+	pgoff_t start = 0;
+	pgoff_t end = -1;
+	unsigned long count = 0;
+	int i;
+	int rc;
+
+	pagevec_init(&pvec, 0);
+	while (start <= end && pagevec_lookup(&pvec, mapping, start,
+		min(end - start, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			start = page->index;
+			if (start > end)
+				break;
+			if (!trylock_page(page))
+				continue;
+			WARN_ON(page->index != start);
+			rc = is_page_droppable(page);
+			unlock_page(page);
+			count += rc;
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+		start++;
+	}
+	return count;
+}
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static void drop_pagecache_sb(struct super_block *sb, void *count)
 {
 	struct inode *inode, *toput_inode = NULL;
 
@@ -29,7 +90,11 @@  static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inode_list_lock);
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
+		if (count)
+			sysctl_drop_caches_count += count_unlocked_pages(inode->i_mapping);
+		else
+			invalidate_mapping_pages(inode->i_mapping, 0, -1);
+
 		iput(toput_inode);
 		toput_inode = inode;
 
@@ -67,3 +132,14 @@  int drop_caches_sysctl_handler(struct ctl_table *table, int write,
 	}
 	return 0;
 }
+
+int drop_caches_count_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int ret = 0;
+	sysctl_drop_caches_count = nr_blockdev_pages();
+	iterate_supers(drop_pagecache_sb, &sysctl_drop_caches_count);
+	sysctl_drop_caches_count <<= (PAGE_SHIFT - 10); /* count in KBytes */
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	return ret;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cd22f..02ebd41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2220,8 +2220,11 @@  static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
+extern unsigned int sysctl_drop_caches_count;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int drop_caches_count_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 #endif
 
 void drop_slab(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd..c043175 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1356,6 +1356,13 @@  static struct ctl_table vm_table[] = {
 		.extra1		= &one,
 		.extra2		= &four,
 	},
+	{
+		.procname	= "drop_caches_count",
+		.data		= &sysctl_drop_caches_count,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0444,
+		.proc_handler	= drop_caches_count_sysctl_handler,
+	},
 #ifdef CONFIG_COMPACTION
 	{
 		.procname	= "compact_memory",