diff mbox

[RFC,v2] Add /proc/pid/smaps_rollup

Message ID 20170810001557.147285-1-dancol@google.com (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Colascione Aug. 10, 2017, 12:15 a.m. UTC
/proc/pid/smaps_rollup is a new proc file that improves the
performance of user programs that determine aggregate memory
statistics (e.g., total PSS) of a process.

Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes. This sampling process involves
opening /proc/pid/smaps and summing certain fields. For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.

smaps_rollup improves the situation. It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process. In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all
of the real-smaps VMAs. Using a common format for smaps_rollup and
smaps allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.

By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a
large process's potentially very numerous memory mappings. For
sampling system_server's PSS in Android, we measured a 12x speedup,
representing a savings of several hundred milliseconds.

One alternative to a new per-process proc file would have been
including PSS information in /proc/pid/status. We considered this
option but thought that PSS would be too expensive (by a few orders of
magnitude) to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.

The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for
each VMA.  In this way, summation happens automatically.  We let
seq_file walk over the VMAs just as it does for regular smaps and just
emit nothing to the seq_file until we hit the last VMA.

Patch changelog:

v2: Fix typo in commit message
    Add ABI documentation as requested by gregkh

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 Documentation/ABI/testing/procfs-smaps_rollup |  34 +++++
 fs/proc/base.c                                |   2 +
 fs/proc/internal.h                            |   3 +
 fs/proc/task_mmu.c                            | 196 ++++++++++++++++++--------
 4 files changed, 173 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/ABI/testing/procfs-smaps_rollup

Comments

Randy Dunlap Aug. 10, 2017, 1:24 a.m. UTC | #1
On 08/09/2017 05:15 PM, Daniel Colascione wrote:
> 
> diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup
> new file mode 100644
> index 000000000000..fd5a3699edf1
> --- /dev/null
> +++ b/Documentation/ABI/testing/procfs-smaps_rollup
> @@ -0,0 +1,34 @@
> +What:		/proc/pid/smaps_Rollup

        		          smaps_rollup

\although I would prefer smaps_summary. whatever.

> +Date:		August 2017
> +Contact:	Daniel Colascione <dancol@google.com>
> +Description:
> +		This file provides pre-summed memory information for a
> +		process.  The format is identical to /proc/pid/smaps,
> +		except instead of an entry for each VMA in a process,
> +		smaps_rollup has a single entry (tagged "[rollup]")
> +		for which each field is the sum of the corresponding
> +		fields from all the maps in /proc/pid/smaps.
> +		For more details, see the procfs man page.
Minchan Kim Aug. 10, 2017, 4:38 a.m. UTC | #2
On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
> /proc/pid/smaps_rollup is a new proc file that improves the
> performance of user programs that determine aggregate memory
> statistics (e.g., total PSS) of a process.
> 
> Android regularly "samples" the memory usage of various processes in
> order to balance its memory pool sizes. This sampling process involves
> opening /proc/pid/smaps and summing certain fields. For very large
> processes, sampling memory use this way can take several hundred
> milliseconds, due mostly to the overhead of the seq_printf calls in
> task_mmu.c.
> 
> smaps_rollup improves the situation. It contains most of the fields of
> /proc/pid/smaps, but instead of a set of fields for each VMA,
> smaps_rollup instead contains one synthetic smaps-format entry
> representing the whole process. In the single smaps_rollup synthetic
> entry, each field is the summation of the corresponding field in all
> of the real-smaps VMAs. Using a common format for smaps_rollup and
> smaps allows userspace parsers to repurpose parsers meant for use with
> non-rollup smaps for smaps_rollup, and it allows userspace to switch
> between smaps_rollup and smaps at runtime (say, based on the
> availability of smaps_rollup in a given kernel) with minimal fuss.
> 
> By using smaps_rollup instead of smaps, a caller can avoid the
> significant overhead of formatting, reading, and parsing each of a
> large process's potentially very numerous memory mappings. For
> sampling system_server's PSS in Android, we measured a 12x speedup,
> representing a savings of several hundred milliseconds.
> 
> One alternative to a new per-process proc file would have been
> including PSS information in /proc/pid/status. We considered this
> option but thought that PSS would be too expensive (by a few orders of
> magnitude) to collect relative to what's already emitted as part of
> /proc/pid/status, and slowing every user of /proc/pid/status for the
> sake of readers that happen to want PSS feels wrong.
> 
> The code itself works by reusing the existing VMA-walking framework we
> use for regular smaps generation and keeping the mem_size_stats
> structure around between VMA walks instead of using a fresh one for
> each VMA.  In this way, summation happens automatically.  We let
> seq_file walk over the VMAs just as it does for regular smaps and just
> emit nothing to the seq_file until we hit the last VMA.
> 
> Patch changelog:
> 
> v2: Fix typo in commit message
>     Add ABI documentation as requested by gregkh
> 
> Signed-off-by: Daniel Colascione <dancol@google.com>

I love this.

FYI, there was trial but got failed at that time so in this time,
https://marc.info/?l=linux-kernel&m=147310650003277&w=2
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html

I really hope we merge this patch.
Michal Hocko Aug. 10, 2017, 8:46 a.m. UTC | #3
[CC linux-api - the patch was posted here
http://lkml.kernel.org/r/20170810001557.147285-1-dancol@google.com]

On Thu 10-08-17 13:38:31, Minchan Kim wrote:
> On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
> > /proc/pid/smaps_rollup is a new proc file that improves the
> > performance of user programs that determine aggregate memory
> > statistics (e.g., total PSS) of a process.
> > 
> > Android regularly "samples" the memory usage of various processes in
> > order to balance its memory pool sizes. This sampling process involves
> > opening /proc/pid/smaps and summing certain fields. For very large
> > processes, sampling memory use this way can take several hundred
> > milliseconds, due mostly to the overhead of the seq_printf calls in
> > task_mmu.c.

Have you tried to reduce that overhead? E.g. by replacing seq_printf by
something more simple
http://lkml.kernel.org/r/20160817130320.GC20703@dhcp22.suse.cz?
How often you you need to read this information?

> > smaps_rollup improves the situation. It contains most of the fields of
> > /proc/pid/smaps, but instead of a set of fields for each VMA,
> > smaps_rollup instead contains one synthetic smaps-format entry
> > representing the whole process. In the single smaps_rollup synthetic
> > entry, each field is the summation of the corresponding field in all
> > of the real-smaps VMAs. Using a common format for smaps_rollup and
> > smaps allows userspace parsers to repurpose parsers meant for use with
> > non-rollup smaps for smaps_rollup, and it allows userspace to switch
> > between smaps_rollup and smaps at runtime (say, based on the
> > availability of smaps_rollup in a given kernel) with minimal fuss.
> > 
> > By using smaps_rollup instead of smaps, a caller can avoid the
> > significant overhead of formatting, reading, and parsing each of a
> > large process's potentially very numerous memory mappings. For
> > sampling system_server's PSS in Android, we measured a 12x speedup,
> > representing a savings of several hundred milliseconds.

By a large process you mean a process with many VMAs right? How many
vmas are we talking about?

> > One alternative to a new per-process proc file would have been
> > including PSS information in /proc/pid/status. We considered this
> > option but thought that PSS would be too expensive (by a few orders of
> > magnitude) to collect relative to what's already emitted as part of
> > /proc/pid/status, and slowing every user of /proc/pid/status for the
> > sake of readers that happen to want PSS feels wrong.
> > 
> > The code itself works by reusing the existing VMA-walking framework we
> > use for regular smaps generation and keeping the mem_size_stats
> > structure around between VMA walks instead of using a fresh one for
> > each VMA.  In this way, summation happens automatically.  We let
> > seq_file walk over the VMAs just as it does for regular smaps and just
> > emit nothing to the seq_file until we hit the last VMA.
> > 
> > Patch changelog:
> > 
> > v2: Fix typo in commit message
> >     Add ABI documentation as requested by gregkh
> > 
> > Signed-off-by: Daniel Colascione <dancol@google.com>
> 
> I love this.
> 
> FYI, there was trial but got failed at that time so in this time,
> https://marc.info/?l=linux-kernel&m=147310650003277&w=2
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html

Yes I really disliked the previous attempt and this one is not all that
better. The primary unanswered question back then was a relevant
usecase. Back then it was argued [1] that PSS was useful for userspace
OOM handling but arguments were rather dubious. Follow up questions [2]
shown that the useage of PSS was very workload specific. Minchan has
noted some usecase as well but not very specific either.

So let's start with a clear use case description. Then let's make it
clear that even optimizing the current implementation is not sufficient
to meat goals and only then try to add one more user visible API which
we will have to maintain for ever.

[1] http://lkml.kernel.org/r/CAPz6YkW3Ph4mi++qY4cJiQ1PwhnxLr5=E4oCHjf5nYJHMhRcew@mail.gmail.com
[2] http://lkml.kernel.org/r/20160819075910.GB32619@dhcp22.suse.cz
Daniel Colascione Aug. 10, 2017, 10:23 a.m. UTC | #4
Thanks for taking a look at the patch!

On Thu, Aug 10 2017, Michal Hocko wrote:
> [CC linux-api - the patch was posted here
> http://lkml.kernel.org/r/20170810001557.147285-1-dancol@google.com]
>
> On Thu 10-08-17 13:38:31, Minchan Kim wrote:
>> On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
>> > /proc/pid/smaps_rollup is a new proc file that improves the
>> > performance of user programs that determine aggregate memory
>> > statistics (e.g., total PSS) of a process.
>> > 
>> > Android regularly "samples" the memory usage of various processes in
>> > order to balance its memory pool sizes. This sampling process involves
>> > opening /proc/pid/smaps and summing certain fields. For very large
>> > processes, sampling memory use this way can take several hundred
>> > milliseconds, due mostly to the overhead of the seq_printf calls in
>> > task_mmu.c.
>
> Have you tried to reduce that overhead? E.g. by replacing seq_printf by
> something more simple
> http://lkml.kernel.org/r/20160817130320.GC20703@dhcp22.suse.cz?

I haven't tried that yet, but if I'm reading that thread correctly, it
looks like using more efficient printing primitives gives us a 7%
speedup. The smaps_rollup patch gives us a much bigger speedup while
reusing almost all the smaps code, so it seems easier and simpler than a
bunch of incremental improvements to smaps. And even an efficient smaps
would have to push 2MB through seq_file for the 3000-VMA process case.

> How often you you need to read this information?

It varies depending on how often processes change state.  We sample a
short time (tens of seconds) after processes change state (e.g., enters
foreground) and every few minutes thereafter. We're particularly
concerned from an energy perspective about needlessly burning CPU on
background samples.

>> > smaps_rollup improves the situation. It contains most of the fields of
>> > /proc/pid/smaps, but instead of a set of fields for each VMA,
>> > smaps_rollup instead contains one synthetic smaps-format entry
>> > representing the whole process. In the single smaps_rollup synthetic
>> > entry, each field is the summation of the corresponding field in all
>> > of the real-smaps VMAs. Using a common format for smaps_rollup and
>> > smaps allows userspace parsers to repurpose parsers meant for use with
>> > non-rollup smaps for smaps_rollup, and it allows userspace to switch
>> > between smaps_rollup and smaps at runtime (say, based on the
>> > availability of smaps_rollup in a given kernel) with minimal fuss.
>> > 
>> > By using smaps_rollup instead of smaps, a caller can avoid the
>> > significant overhead of formatting, reading, and parsing each of a
>> > large process's potentially very numerous memory mappings. For
>> > sampling system_server's PSS in Android, we measured a 12x speedup,
>> > representing a savings of several hundred milliseconds.
>
> By a large process you mean a process with many VMAs right? How many
> vmas are we talking about?

Yes: ~3000 VMAs is one I'd consider large in this context.

>> > One alternative to a new per-process proc file would have been
>> > including PSS information in /proc/pid/status. We considered this
>> > option but thought that PSS would be too expensive (by a few orders of
>> > magnitude) to collect relative to what's already emitted as part of
>> > /proc/pid/status, and slowing every user of /proc/pid/status for the
>> > sake of readers that happen to want PSS feels wrong.
>> > 
>> > The code itself works by reusing the existing VMA-walking framework we
>> > use for regular smaps generation and keeping the mem_size_stats
>> > structure around between VMA walks instead of using a fresh one for
>> > each VMA.  In this way, summation happens automatically.  We let
>> > seq_file walk over the VMAs just as it does for regular smaps and just
>> > emit nothing to the seq_file until we hit the last VMA.
>> > 
>> > Patch changelog:
>> > 
>> > v2: Fix typo in commit message
>> >     Add ABI documentation as requested by gregkh
>> > 
>> > Signed-off-by: Daniel Colascione <dancol@google.com>
>> 
>> I love this.
>> 
>> FYI, there was trial but got failed at that time so in this time,
>> https://marc.info/?l=linux-kernel&m=147310650003277&w=2
>> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html
>
> Yes I really disliked the previous attempt and this one is not all that
> better. The primary unanswered question back then was a relevant
> usecase. Back then it was argued [1] that PSS was useful for userspace
> OOM handling but arguments were rather dubious. Follow up questions [2]
> shown that the useage of PSS was very workload specific. Minchan has
> noted some usecase as well but not very specific either.

Anyway, I see what you mean about PSS being iffy for user-space OOM
processing (because PSS doesn't tell you how much memory you get back in
exchange for killing a given process at a particular moment). We're not
using it like that.

Instead, we're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows. Android has been using PSS for these purposes for a
long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more. IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.

The existence of an independent attempt to add this functionality
suggests that it might be generally useful too.

> So let's start with a clear use case description. Then let's make it
> clear that even optimizing the current implementation is not sufficient
> to meat goals and only then try to add one more user visible API which
> we will have to maintain for ever.

Adding a new API shouldn't be anyone's first choice, but the efficiency
wins are hard to ignore. We can cut ~2MB of smaps data to a few KB this
way. I'm definitely open to other ideas for getting similar wins, but
right now, I'd like to explore approaches that let us keep using the
existing PSS metric.

Thanks again!
Michal Hocko Aug. 10, 2017, 10:58 a.m. UTC | #5
On Thu 10-08-17 03:23:23, Daniel Colascione wrote:
> Thanks for taking a look at the patch!
> 
> On Thu, Aug 10 2017, Michal Hocko wrote:
> > [CC linux-api - the patch was posted here
> > http://lkml.kernel.org/r/20170810001557.147285-1-dancol@google.com]
> >
> > On Thu 10-08-17 13:38:31, Minchan Kim wrote:
> >> On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
> >> > /proc/pid/smaps_rollup is a new proc file that improves the
> >> > performance of user programs that determine aggregate memory
> >> > statistics (e.g., total PSS) of a process.
> >> > 
> >> > Android regularly "samples" the memory usage of various processes in
> >> > order to balance its memory pool sizes. This sampling process involves
> >> > opening /proc/pid/smaps and summing certain fields. For very large
> >> > processes, sampling memory use this way can take several hundred
> >> > milliseconds, due mostly to the overhead of the seq_printf calls in
> >> > task_mmu.c.
> >
> > Have you tried to reduce that overhead? E.g. by replacing seq_printf by
> > something more simple
> > http://lkml.kernel.org/r/20160817130320.GC20703@dhcp22.suse.cz?
> 
> I haven't tried that yet, but if I'm reading that thread correctly, it
> looks like using more efficient printing primitives gives us a 7%
> speedup. The smaps_rollup patch gives us a much bigger speedup while
> reusing almost all the smaps code, so it seems easier and simpler than a
> bunch of incremental improvements to smaps. And even an efficient smaps
> would have to push 2MB through seq_file for the 3000-VMA process case.

The thing is that more users would benefit from a more efficient
/proc/pid/smaps call. Maybe we can use some caching tricks etc...  We
should make sure that existing options should be attempted before a new
user visible interface is added. It is kind of sad that the real work
(pte walk) is less expensive than formating the output and copying it to
the userspace...

> > How often you you need to read this information?
> 
> It varies depending on how often processes change state.  We sample a
> short time (tens of seconds) after processes change state (e.g., enters
> foreground) and every few minutes thereafter. We're particularly
> concerned from an energy perspective about needlessly burning CPU on
> background samples.

Please make sure this is documented in the patch along with some numbers
ideally.

[...]

> >> FYI, there was trial but got failed at that time so in this time,
> >> https://marc.info/?l=linux-kernel&m=147310650003277&w=2
> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html
> >
> > Yes I really disliked the previous attempt and this one is not all that
> > better. The primary unanswered question back then was a relevant
> > usecase. Back then it was argued [1] that PSS was useful for userspace
> > OOM handling but arguments were rather dubious. Follow up questions [2]
> > shown that the useage of PSS was very workload specific. Minchan has
> > noted some usecase as well but not very specific either.
> 
> Anyway, I see what you mean about PSS being iffy for user-space OOM
> processing (because PSS doesn't tell you how much memory you get back in
> exchange for killing a given process at a particular moment). We're not
> using it like that.
> 
> Instead, we're using the PSS samples we collect asynchronously for
> system-management tasks like fine-tuning oom_adj_score, memory use
> tracking for debugging, application-level memory-use attribution, and
> deciding whether we want to kill large processes during system idle
> maintenance windows. Android has been using PSS for these purposes for a
> long time; as the average process VMA count has increased and and
> devices become more efficiency-conscious, PSS-collection inefficiency
> has started to matter more. IMHO, it'd be a lot safer to optimize the
> existing PSS-collection model, which has been fine-tuned over the years,
> instead of changing the memory tracking approach entirely to work around
> smaps-generation inefficiency.

This is really vague. Please be more specific.
Sonny Rao Aug. 10, 2017, 6:56 p.m. UTC | #6
On Thu, Aug 10, 2017 at 3:58 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 10-08-17 03:23:23, Daniel Colascione wrote:
>> Thanks for taking a look at the patch!
>>
>> On Thu, Aug 10 2017, Michal Hocko wrote:
>> > [CC linux-api - the patch was posted here
>> > http://lkml.kernel.org/r/20170810001557.147285-1-dancol@google.com]
>> >
>> > On Thu 10-08-17 13:38:31, Minchan Kim wrote:
>> >> On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
>> >> > /proc/pid/smaps_rollup is a new proc file that improves the
>> >> > performance of user programs that determine aggregate memory
>> >> > statistics (e.g., total PSS) of a process.
>> >> >
>> >> > Android regularly "samples" the memory usage of various processes in
>> >> > order to balance its memory pool sizes. This sampling process involves
>> >> > opening /proc/pid/smaps and summing certain fields. For very large
>> >> > processes, sampling memory use this way can take several hundred
>> >> > milliseconds, due mostly to the overhead of the seq_printf calls in
>> >> > task_mmu.c.
>> >
>> > Have you tried to reduce that overhead? E.g. by replacing seq_printf by
>> > something more simple
>> > http://lkml.kernel.org/r/20160817130320.GC20703@dhcp22.suse.cz?
>>
>> I haven't tried that yet, but if I'm reading that thread correctly, it
>> looks like using more efficient printing primitives gives us a 7%
>> speedup. The smaps_rollup patch gives us a much bigger speedup while
>> reusing almost all the smaps code, so it seems easier and simpler than a
>> bunch of incremental improvements to smaps. And even an efficient smaps
>> would have to push 2MB through seq_file for the 3000-VMA process case.
>
> The thing is that more users would benefit from a more efficient
> /proc/pid/smaps call. Maybe we can use some caching tricks etc...  We
> should make sure that existing options should be attempted before a new
> user visible interface is added. It is kind of sad that the real work
> (pte walk) is less expensive than formating the output and copying it to
> the userspace...
>
>> > How often you you need to read this information?
>>
>> It varies depending on how often processes change state.  We sample a
>> short time (tens of seconds) after processes change state (e.g., enters
>> foreground) and every few minutes thereafter. We're particularly
>> concerned from an energy perspective about needlessly burning CPU on
>> background samples.
>
> Please make sure this is documented in the patch along with some numbers
> ideally.
>
> [...]
>
>> >> FYI, there was trial but got failed at that time so in this time,
>> >> https://marc.info/?l=linux-kernel&m=147310650003277&w=2
>> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html
>> >
>> > Yes I really disliked the previous attempt and this one is not all that
>> > better. The primary unanswered question back then was a relevant
>> > usecase. Back then it was argued [1] that PSS was useful for userspace
>> > OOM handling but arguments were rather dubious. Follow up questions [2]
>> > shown that the useage of PSS was very workload specific. Minchan has
>> > noted some usecase as well but not very specific either.
>>
>> Anyway, I see what you mean about PSS being iffy for user-space OOM
>> processing (because PSS doesn't tell you how much memory you get back in
>> exchange for killing a given process at a particular moment). We're not
>> using it like that.
>>
>> Instead, we're using the PSS samples we collect asynchronously for
>> system-management tasks like fine-tuning oom_adj_score, memory use
>> tracking for debugging, application-level memory-use attribution, and
>> deciding whether we want to kill large processes during system idle
>> maintenance windows. Android has been using PSS for these purposes for a
>> long time; as the average process VMA count has increased and and
>> devices become more efficiency-conscious, PSS-collection inefficiency
>> has started to matter more. IMHO, it'd be a lot safer to optimize the
>> existing PSS-collection model, which has been fine-tuned over the years,
>> instead of changing the memory tracking approach entirely to work around
>> smaps-generation inefficiency.
>
> This is really vague. Please be more specific.

I actually think this is really similar to the Chrome OS use case --
we need to do proper accounting of memory from user space, and we need
something more accurate than what we have now (usually RSS) to figure
it out.  I'm not sure what is vague about that statement?

PSS is not perfect but in closed systems where we have some knowledge
about what is being shared amongst process, PSS is much better than
RSS and readily available.  So, I disagree that this is a dubious
usage -- if there's a better metric for making this kind of decision,
please share it.

Also I realized there's another argument for presenting this
information outside of smaps which is that we expose far less
information about a process and it's address space via something like
this, so it's much better for isolation to have a separate file with
different permissions.  Right now the process in charge of accounting
for memory usage also gains knowledge about each process's address
space which is unnecessary.

IMHO, the fact that multiple folks have independently asked for this
seems like an argument that something like this is needed.


> --
> Michal Hocko
> SUSE Labs
Tim Murray Aug. 10, 2017, 7:17 p.m. UTC | #7
I've looked into this a fair bit on the Android side, so I can provide
some context. There are two main reasons why Android gathers PSS
information:

1. Android devices can show the user the amount of memory used per
application via the settings app. This is a less important use case.
2. We log PSS to help identify leaks in applications. We have found an
enormous number of bugs (in the Android platform, in Google's own
apps, and in third-party applications) using this data.

To do this, system_server (the main process in Android userspace) will
sample the PSS of a process three seconds after it changes state (for
example, app is launched and becomes the foreground application) and
about every ten minutes after that. The net result is that PSS
collection is regularly running on at least one process in the system
(usually a few times a minute while the screen is on, less when screen
is off due to suspend). PSS of a process is an incredibly useful stat
to track, and we aren't going to get rid of it. We've looked at some
very hacky approaches using RSS ("take the RSS of the target process,
subtract the RSS of the zygote process that is the parent of all
Android apps") to reduce the accounting time, but it regularly
overestimated the memory used by 20+ percent. Accordingly, I don't
think that there's a good alternative to using PSS.

We started looking into PSS collection performance after we noticed
random frequency spikes while a phone's screen was off; occasionally,
one of the CPU clusters would ramp to a high frequency because there
was 200-300ms of constant CPU work from a single thread in the main
Android userspace process. The work causing the spike (which is
reasonable governor behavior given the amount of CPU time needed) was
always PSS collection. As a result, Android is burning more power than
we should be on PSS collection.

The other issue (and why I'm less sure about improving smaps as a
long-term solution) is that the number of VMAs per process has
increased significantly from release to release. After trying to
figure out why we were seeing these 200-300ms PSS collection times on
Android O but had not noticed it in previous versions, we found that
the number of VMAs in the main system process increased by 50% from
Android N to Android O (from ~1800 to ~2700) and varying increases in
every userspace process. Android M to N also had an increase in the
number of VMAs, although not as much. I'm not sure why this is
increasing so much over time, but thinking about ASLR and ways to make
ASLR better, I expect that this will continue to increase going
forward. I would not be surprised if we hit 5000 VMAs on the main
Android process (system_server) by 2020.

If we assume that the number of VMAs is going to increase over time,
then doing anything we can do to reduce the overhead of each VMA
during PSS collection seems like the right way to go, and that means
outputting an aggregate statistic (to avoid whatever overhead there is
per line in writing smaps and in reading each line from userspace).

Also, Dan sent me some numbers from his benchmark measuring PSS on
system_server (the big Android process) using smaps vs smaps_rollup:

using smaps:
iterations:1000 pid:1163 pss:220023808
 0m29.46s real 0m08.28s user 0m20.98s system

using smaps_rollup:
iterations:1000 pid:1163 pss:220702720
 0m04.39s real 0m00.03s user 0m04.31s system

On Thu, Aug 10, 2017 at 11:56 AM, Sonny Rao <sonnyrao@chromium.org> wrote:
> On Thu, Aug 10, 2017 at 3:58 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> On Thu 10-08-17 03:23:23, Daniel Colascione wrote:
>>> Thanks for taking a look at the patch!
>>>
>>> On Thu, Aug 10 2017, Michal Hocko wrote:
>>> > [CC linux-api - the patch was posted here
>>> > http://lkml.kernel.org/r/20170810001557.147285-1-dancol@google.com]
>>> >
>>> > On Thu 10-08-17 13:38:31, Minchan Kim wrote:
>>> >> On Wed, Aug 09, 2017 at 05:15:57PM -0700, Daniel Colascione wrote:
>>> >> > /proc/pid/smaps_rollup is a new proc file that improves the
>>> >> > performance of user programs that determine aggregate memory
>>> >> > statistics (e.g., total PSS) of a process.
>>> >> >
>>> >> > Android regularly "samples" the memory usage of various processes in
>>> >> > order to balance its memory pool sizes. This sampling process involves
>>> >> > opening /proc/pid/smaps and summing certain fields. For very large
>>> >> > processes, sampling memory use this way can take several hundred
>>> >> > milliseconds, due mostly to the overhead of the seq_printf calls in
>>> >> > task_mmu.c.
>>> >
>>> > Have you tried to reduce that overhead? E.g. by replacing seq_printf by
>>> > something more simple
>>> > http://lkml.kernel.org/r/20160817130320.GC20703@dhcp22.suse.cz?
>>>
>>> I haven't tried that yet, but if I'm reading that thread correctly, it
>>> looks like using more efficient printing primitives gives us a 7%
>>> speedup. The smaps_rollup patch gives us a much bigger speedup while
>>> reusing almost all the smaps code, so it seems easier and simpler than a
>>> bunch of incremental improvements to smaps. And even an efficient smaps
>>> would have to push 2MB through seq_file for the 3000-VMA process case.
>>
>> The thing is that more users would benefit from a more efficient
>> /proc/pid/smaps call. Maybe we can use some caching tricks etc...  We
>> should make sure that existing options should be attempted before a new
>> user visible interface is added. It is kind of sad that the real work
>> (pte walk) is less expensive than formating the output and copying it to
>> the userspace...
>>
>>> > How often you you need to read this information?
>>>
>>> It varies depending on how often processes change state.  We sample a
>>> short time (tens of seconds) after processes change state (e.g., enters
>>> foreground) and every few minutes thereafter. We're particularly
>>> concerned from an energy perspective about needlessly burning CPU on
>>> background samples.
>>
>> Please make sure this is documented in the patch along with some numbers
>> ideally.
>>
>> [...]
>>
>>> >> FYI, there was trial but got failed at that time so in this time,
>>> >> https://marc.info/?l=linux-kernel&m=147310650003277&w=2
>>> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1229163.html
>>> >
>>> > Yes I really disliked the previous attempt and this one is not all that
>>> > better. The primary unanswered question back then was a relevant
>>> > usecase. Back then it was argued [1] that PSS was useful for userspace
>>> > OOM handling but arguments were rather dubious. Follow up questions [2]
>>> > shown that the useage of PSS was very workload specific. Minchan has
>>> > noted some usecase as well but not very specific either.
>>>
>>> Anyway, I see what you mean about PSS being iffy for user-space OOM
>>> processing (because PSS doesn't tell you how much memory you get back in
>>> exchange for killing a given process at a particular moment). We're not
>>> using it like that.
>>>
>>> Instead, we're using the PSS samples we collect asynchronously for
>>> system-management tasks like fine-tuning oom_adj_score, memory use
>>> tracking for debugging, application-level memory-use attribution, and
>>> deciding whether we want to kill large processes during system idle
>>> maintenance windows. Android has been using PSS for these purposes for a
>>> long time; as the average process VMA count has increased and and
>>> devices become more efficiency-conscious, PSS-collection inefficiency
>>> has started to matter more. IMHO, it'd be a lot safer to optimize the
>>> existing PSS-collection model, which has been fine-tuned over the years,
>>> instead of changing the memory tracking approach entirely to work around
>>> smaps-generation inefficiency.
>>
>> This is really vague. Please be more specific.
>
> I actually think this is really similar to the Chrome OS use case --
> we need to do proper accounting of memory from user space, and we need
> something more accurate than what we have now (usually RSS) to figure
> it out.  I'm not sure what is vague about that statement?
>
> PSS is not perfect but in closed systems where we have some knowledge
> about what is being shared amongst process, PSS is much better than
> RSS and readily available.  So, I disagree that this is a dubious
> usage -- if there's a better metric for making this kind of decision,
> please share it.
>
> Also I realized there's another argument for presenting this
> information outside of smaps which is that we expose far less
> information about a process and it's address space via something like
> this, so it's much better for isolation to have a separate file with
> different permissions.  Right now the process in charge of accounting
> for memory usage also gains knowledge about each process's address
> space which is unnecessary.
>
> IMHO, the fact that multiple folks have independently asked for this
> seems like an argument that something like this is needed.
>
>
>> --
>> Michal Hocko
>> SUSE Labs
Michal Hocko Aug. 24, 2017, 8:55 a.m. UTC | #8
Sorry for a late reply

On Thu 10-08-17 12:17:07, Tim Murray wrote:
> I've looked into this a fair bit on the Android side, so I can provide
> some context. There are two main reasons why Android gathers PSS
> information:
> 
> 1. Android devices can show the user the amount of memory used per
> application via the settings app. This is a less important use case.

yes

> 2. We log PSS to help identify leaks in applications. We have found an
> enormous number of bugs (in the Android platform, in Google's own
> apps, and in third-party applications) using this data.
> 
> To do this, system_server (the main process in Android userspace) will
> sample the PSS of a process three seconds after it changes state (for
> example, app is launched and becomes the foreground application) and
> about every ten minutes after that. The net result is that PSS
> collection is regularly running on at least one process in the system
> (usually a few times a minute while the screen is on, less when screen
> is off due to suspend). PSS of a process is an incredibly useful stat
> to track, and we aren't going to get rid of it. We've looked at some
> very hacky approaches using RSS ("take the RSS of the target process,
> subtract the RSS of the zygote process that is the parent of all
> Android apps") to reduce the accounting time, but it regularly
> overestimated the memory used by 20+ percent. Accordingly, I don't
> think that there's a good alternative to using PSS.

Even if the RSS overestimates this shouldn't hide a memory leak, no?

> We started looking into PSS collection performance after we noticed
> random frequency spikes while a phone's screen was off; occasionally,
> one of the CPU clusters would ramp to a high frequency because there
> was 200-300ms of constant CPU work from a single thread in the main
> Android userspace process. The work causing the spike (which is
> reasonable governor behavior given the amount of CPU time needed) was
> always PSS collection. As a result, Android is burning more power than
> we should be on PSS collection.

Yes, this really sucks but we are revolving around the same point. It
really sucks that we burn so much time just copying the output to the
userspace when the real stuff (vma walk and pte walk) has to be done
anyway. AFAIR I could reduce the overhead by using more appropriate
seq_* functions but maybe we can do even better.

> The other issue (and why I'm less sure about improving smaps as a
> long-term solution) is that the number of VMAs per process has
> increased significantly from release to release. After trying to
> figure out why we were seeing these 200-300ms PSS collection times on
> Android O but had not noticed it in previous versions, we found that
> the number of VMAs in the main system process increased by 50% from
> Android N to Android O (from ~1800 to ~2700) and varying increases in
> every userspace process. Android M to N also had an increase in the
> number of VMAs, although not as much. I'm not sure why this is
> increasing so much over time, but thinking about ASLR and ways to make
> ASLR better, I expect that this will continue to increase going
> forward. I would not be surprised if we hit 5000 VMAs on the main
> Android process (system_server) by 2020.

The thing is, however, that the larger amount of VMAs will also mean
more work on the kernel side. The data collection has to be done anyway.
 
> If we assume that the number of VMAs is going to increase over time,
> then doing anything we can do to reduce the overhead of each VMA
> during PSS collection seems like the right way to go, and that means
> outputting an aggregate statistic (to avoid whatever overhead there is
> per line in writing smaps and in reading each line from userspace).
> 
> Also, Dan sent me some numbers from his benchmark measuring PSS on
> system_server (the big Android process) using smaps vs smaps_rollup:
> 
> using smaps:
> iterations:1000 pid:1163 pss:220023808
>  0m29.46s real 0m08.28s user 0m20.98s system
> 
> using smaps_rollup:
> iterations:1000 pid:1163 pss:220702720
>  0m04.39s real 0m00.03s user 0m04.31s system

I would assume we would do all we can to reduce this kernel->user
overhead first before considering a new user visible file. I haven't
seen any attempts except from the low hanging fruid I have tried.
Andrew Morton Aug. 25, 2017, 9:16 p.m. UTC | #9
On Thu, 24 Aug 2017 10:55:53 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > If we assume that the number of VMAs is going to increase over time,
> > then doing anything we can do to reduce the overhead of each VMA
> > during PSS collection seems like the right way to go, and that means
> > outputting an aggregate statistic (to avoid whatever overhead there is
> > per line in writing smaps and in reading each line from userspace).
> > 
> > Also, Dan sent me some numbers from his benchmark measuring PSS on
> > system_server (the big Android process) using smaps vs smaps_rollup:
> > 
> > using smaps:
> > iterations:1000 pid:1163 pss:220023808
> >  0m29.46s real 0m08.28s user 0m20.98s system
> > 
> > using smaps_rollup:
> > iterations:1000 pid:1163 pss:220702720
> >  0m04.39s real 0m00.03s user 0m04.31s system
> 
> I would assume we would do all we can to reduce this kernel->user
> overhead first before considering a new user visible file. I haven't
> seen any attempts except from the low hanging fruid I have tried.

It's hard to believe that we'll get anything like a 5x speedup via
optimization of the existing code?
Michal Hocko Aug. 28, 2017, 11:30 a.m. UTC | #10
On Fri 25-08-17 14:16:37, Andrew Morton wrote:
> On Thu, 24 Aug 2017 10:55:53 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > If we assume that the number of VMAs is going to increase over time,
> > > then doing anything we can do to reduce the overhead of each VMA
> > > during PSS collection seems like the right way to go, and that means
> > > outputting an aggregate statistic (to avoid whatever overhead there is
> > > per line in writing smaps and in reading each line from userspace).
> > > 
> > > Also, Dan sent me some numbers from his benchmark measuring PSS on
> > > system_server (the big Android process) using smaps vs smaps_rollup:
> > > 
> > > using smaps:
> > > iterations:1000 pid:1163 pss:220023808
> > >  0m29.46s real 0m08.28s user 0m20.98s system
> > > 
> > > using smaps_rollup:
> > > iterations:1000 pid:1163 pss:220702720
> > >  0m04.39s real 0m00.03s user 0m04.31s system
> > 
> > I would assume we would do all we can to reduce this kernel->user
> > overhead first before considering a new user visible file. I haven't
> > seen any attempts except from the low hanging fruid I have tried.
> 
> It's hard to believe that we'll get anything like a 5x speedup via
> optimization of the existing code?

Maybe we will not get that much of a boost but having misleading numbers
really quick is not something we should aim for. Just try to think what
the cumulative numbers actually mean. How can you even consider
cumulative PSS when you have no idea about mappings that were
considered?
diff mbox

Patch

diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup
new file mode 100644
index 000000000000..fd5a3699edf1
--- /dev/null
+++ b/Documentation/ABI/testing/procfs-smaps_rollup
@@ -0,0 +1,34 @@ 
+What:		/proc/pid/smaps_Rollup
+Date:		August 2017
+Contact:	Daniel Colascione <dancol@google.com>
+Description:
+		This file provides pre-summed memory information for a
+		process.  The format is identical to /proc/pid/smaps,
+		except instead of an entry for each VMA in a process,
+		smaps_rollup has a single entry (tagged "[rollup]")
+		for which each field is the sum of the corresponding
+		fields from all the maps in /proc/pid/smaps.
+		For more details, see the procfs man page.
+
+		Typical output looks like this:
+
+		00100000-ff709000 ---p 00000000 00:00 0		 [rollup]
+		Rss:		     884 kB
+		Pss:		     385 kB
+		Shared_Clean:	     696 kB
+		Shared_Dirty:	       0 kB
+		Private_Clean:	     120 kB
+		Private_Dirty:	      68 kB
+		Referenced:	     884 kB
+		Anonymous:	      68 kB
+		LazyFree:	       0 kB
+		AnonHugePages:	       0 kB
+		ShmemPmdMapped:	       0 kB
+		Shared_Hugetlb:	       0 kB
+		Private_Hugetlb:       0 kB
+		Swap:		       0 kB
+		SwapPss:	       0 kB
+		Locked:		     385 kB
+
+
+		
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 719c2e943ea1..a9587b9cace5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2930,6 +2930,7 @@  static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PAGE_MONITOR
 	REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
+	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
 #endif
 #ifdef CONFIG_SECURITY
@@ -3323,6 +3324,7 @@  static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_PROC_PAGE_MONITOR
 	REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
 	REG("smaps",     S_IRUGO, proc_tid_smaps_operations),
+	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
 #endif
 #ifdef CONFIG_SECURITY
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa2b89071630..2cbfcd32e884 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -269,10 +269,12 @@  extern int proc_remount(struct super_block *, int *, char *);
 /*
  * task_[no]mmu.c
  */
+struct mem_size_stats;
 struct proc_maps_private {
 	struct inode *inode;
 	struct task_struct *task;
 	struct mm_struct *mm;
+	struct mem_size_stats *rollup;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
@@ -288,6 +290,7 @@  extern const struct file_operations proc_tid_maps_operations;
 extern const struct file_operations proc_pid_numa_maps_operations;
 extern const struct file_operations proc_tid_numa_maps_operations;
 extern const struct file_operations proc_pid_smaps_operations;
+extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..02b55df7291c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -252,6 +252,7 @@  static int proc_map_release(struct inode *inode, struct file *file)
 	if (priv->mm)
 		mmdrop(priv->mm);
 
+	kfree(priv->rollup);
 	return seq_release_private(inode, file);
 }
 
@@ -278,6 +279,23 @@  static int is_stack(struct proc_maps_private *priv,
 		vma->vm_end >= vma->vm_mm->start_stack;
 }
 
+static void show_vma_header_prefix(struct seq_file *m,
+				   unsigned long start, unsigned long end,
+				   vm_flags_t flags, unsigned long long pgoff,
+				   dev_t dev, unsigned long ino)
+{
+	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
+	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+		   start,
+		   end,
+		   flags & VM_READ ? 'r' : '-',
+		   flags & VM_WRITE ? 'w' : '-',
+		   flags & VM_EXEC ? 'x' : '-',
+		   flags & VM_MAYSHARE ? 's' : 'p',
+		   pgoff,
+		   MAJOR(dev), MINOR(dev), ino);
+}
+
 static void
 show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 {
@@ -300,17 +318,7 @@  show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 
 	start = vma->vm_start;
 	end = vma->vm_end;
-
-	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
-	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
-			start,
-			end,
-			flags & VM_READ ? 'r' : '-',
-			flags & VM_WRITE ? 'w' : '-',
-			flags & VM_EXEC ? 'x' : '-',
-			flags & VM_MAYSHARE ? 's' : 'p',
-			pgoff,
-			MAJOR(dev), MINOR(dev), ino);
+	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
 
 	/*
 	 * Print the dentry name for named mappings, and a
@@ -429,6 +437,7 @@  const struct file_operations proc_tid_maps_operations = {
 
 #ifdef CONFIG_PROC_PAGE_MONITOR
 struct mem_size_stats {
+	bool first;
 	unsigned long resident;
 	unsigned long shared_clean;
 	unsigned long shared_dirty;
@@ -442,7 +451,9 @@  struct mem_size_stats {
 	unsigned long swap;
 	unsigned long shared_hugetlb;
 	unsigned long private_hugetlb;
+	unsigned long first_vma_start;
 	u64 pss;
+	u64 pss_locked;
 	u64 swap_pss;
 	bool check_shmem_swap;
 };
@@ -718,18 +729,36 @@  void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
 
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
+	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
-	struct mem_size_stats mss;
+	struct mem_size_stats mss_stack;
+	struct mem_size_stats *mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
 #ifdef CONFIG_HUGETLB_PAGE
 		.hugetlb_entry = smaps_hugetlb_range,
 #endif
 		.mm = vma->vm_mm,
-		.private = &mss,
 	};
+	int ret = 0;
+	bool rollup_mode;
+	bool last_vma;
+
+	if (priv->rollup) {
+		rollup_mode = true;
+		mss = priv->rollup;
+		if (mss->first) {
+			mss->first_vma_start = vma->vm_start;
+			mss->first = false;
+		}
+		last_vma = !m_next_vma(priv, vma);
+	} else {
+		rollup_mode = false;
+		memset(&mss_stack, 0, sizeof(mss_stack));
+		mss = &mss_stack;
+	}
 
-	memset(&mss, 0, sizeof mss);
+	smaps_walk.private = mss;
 
 #ifdef CONFIG_SHMEM
 	if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
@@ -747,9 +776,9 @@  static int show_smap(struct seq_file *m, void *v, int is_pid)
 
 		if (!shmem_swapped || (vma->vm_flags & VM_SHARED) ||
 					!(vma->vm_flags & VM_WRITE)) {
-			mss.swap = shmem_swapped;
+			mss->swap = shmem_swapped;
 		} else {
-			mss.check_shmem_swap = true;
+			mss->check_shmem_swap = true;
 			smaps_walk.pte_hole = smaps_pte_hole;
 		}
 	}
@@ -757,54 +786,71 @@  static int show_smap(struct seq_file *m, void *v, int is_pid)
 
 	/* mmap_sem is held in m_start */
 	walk_page_vma(vma, &smaps_walk);
+	if (vma->vm_flags & VM_LOCKED)
+		mss->pss_locked += mss->pss;
+
+	if (!rollup_mode) {
+		show_map_vma(m, vma, is_pid);
+	} else if (last_vma) {
+		show_vma_header_prefix(
+			m, mss->first_vma_start, vma->vm_end, 0, 0, 0, 0);
+		seq_pad(m, ' ');
+		seq_puts(m, "[rollup]\n");
+	} else {
+		ret = SEQ_SKIP;
+	}
 
-	show_map_vma(m, vma, is_pid);
-
-	seq_printf(m,
-		   "Size:           %8lu kB\n"
-		   "Rss:            %8lu kB\n"
-		   "Pss:            %8lu kB\n"
-		   "Shared_Clean:   %8lu kB\n"
-		   "Shared_Dirty:   %8lu kB\n"
-		   "Private_Clean:  %8lu kB\n"
-		   "Private_Dirty:  %8lu kB\n"
-		   "Referenced:     %8lu kB\n"
-		   "Anonymous:      %8lu kB\n"
-		   "LazyFree:       %8lu kB\n"
-		   "AnonHugePages:  %8lu kB\n"
-		   "ShmemPmdMapped: %8lu kB\n"
-		   "Shared_Hugetlb: %8lu kB\n"
-		   "Private_Hugetlb: %7lu kB\n"
-		   "Swap:           %8lu kB\n"
-		   "SwapPss:        %8lu kB\n"
-		   "KernelPageSize: %8lu kB\n"
-		   "MMUPageSize:    %8lu kB\n"
-		   "Locked:         %8lu kB\n",
-		   (vma->vm_end - vma->vm_start) >> 10,
-		   mss.resident >> 10,
-		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
-		   mss.shared_clean  >> 10,
-		   mss.shared_dirty  >> 10,
-		   mss.private_clean >> 10,
-		   mss.private_dirty >> 10,
-		   mss.referenced >> 10,
-		   mss.anonymous >> 10,
-		   mss.lazyfree >> 10,
-		   mss.anonymous_thp >> 10,
-		   mss.shmem_thp >> 10,
-		   mss.shared_hugetlb >> 10,
-		   mss.private_hugetlb >> 10,
-		   mss.swap >> 10,
-		   (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
-		   vma_kernel_pagesize(vma) >> 10,
-		   vma_mmu_pagesize(vma) >> 10,
-		   (vma->vm_flags & VM_LOCKED) ?
-			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
-
-	arch_show_smap(m, vma);
-	show_smap_vma_flags(m, vma);
+	if (!rollup_mode)
+		seq_printf(m,
+			   "Size:           %8lu kB\n"
+			   "KernelPageSize: %8lu kB\n"
+			   "MMUPageSize:    %8lu kB\n",
+			   (vma->vm_end - vma->vm_start) >> 10,
+			   vma_kernel_pagesize(vma) >> 10,
+			   vma_mmu_pagesize(vma) >> 10);
+
+
+	if (!rollup_mode || last_vma)
+		seq_printf(m,
+			   "Rss:            %8lu kB\n"
+			   "Pss:            %8lu kB\n"
+			   "Shared_Clean:   %8lu kB\n"
+			   "Shared_Dirty:   %8lu kB\n"
+			   "Private_Clean:  %8lu kB\n"
+			   "Private_Dirty:  %8lu kB\n"
+			   "Referenced:     %8lu kB\n"
+			   "Anonymous:      %8lu kB\n"
+			   "LazyFree:       %8lu kB\n"
+			   "AnonHugePages:  %8lu kB\n"
+			   "ShmemPmdMapped: %8lu kB\n"
+			   "Shared_Hugetlb: %8lu kB\n"
+			   "Private_Hugetlb: %7lu kB\n"
+			   "Swap:           %8lu kB\n"
+			   "SwapPss:        %8lu kB\n"
+			   "Locked:         %8lu kB\n",
+			   mss->resident >> 10,
+			   (unsigned long)(mss->pss >> (10 + PSS_SHIFT)),
+			   mss->shared_clean  >> 10,
+			   mss->shared_dirty  >> 10,
+			   mss->private_clean >> 10,
+			   mss->private_dirty >> 10,
+			   mss->referenced >> 10,
+			   mss->anonymous >> 10,
+			   mss->lazyfree >> 10,
+			   mss->anonymous_thp >> 10,
+			   mss->shmem_thp >> 10,
+			   mss->shared_hugetlb >> 10,
+			   mss->private_hugetlb >> 10,
+			   mss->swap >> 10,
+			   (unsigned long)(mss->swap_pss >> (10 + PSS_SHIFT)),
+			   (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));
+
+	if (!rollup_mode) {
+		arch_show_smap(m, vma);
+		show_smap_vma_flags(m, vma);
+	}
 	m_cache_vma(m, vma);
-	return 0;
+	return ret;
 }
 
 static int show_pid_smap(struct seq_file *m, void *v)
@@ -836,6 +882,25 @@  static int pid_smaps_open(struct inode *inode, struct file *file)
 	return do_maps_open(inode, file, &proc_pid_smaps_op);
 }
 
+static int pid_smaps_rollup_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq;
+	struct proc_maps_private *priv;
+	int ret = do_maps_open(inode, file, &proc_pid_smaps_op);
+
+	if (ret < 0)
+		return ret;
+	seq = file->private_data;
+	priv = seq->private;
+	priv->rollup = kzalloc(sizeof(*priv->rollup), GFP_KERNEL);
+	if (!priv->rollup) {
+		proc_map_release(inode, file);
+		return -ENOMEM;
+	}
+	priv->rollup->first = true;
+	return 0;
+}
+
 static int tid_smaps_open(struct inode *inode, struct file *file)
 {
 	return do_maps_open(inode, file, &proc_tid_smaps_op);
@@ -848,6 +913,13 @@  const struct file_operations proc_pid_smaps_operations = {
 	.release	= proc_map_release,
 };
 
+const struct file_operations proc_pid_smaps_rollup_operations = {
+	.open		= pid_smaps_rollup_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_map_release,
+};
+
 const struct file_operations proc_tid_smaps_operations = {
 	.open		= tid_smaps_open,
 	.read		= seq_read,