proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING

Message ID 20201012114940.1317510-1-jannh@google.com (mailing list archive)
State New, archived
Series proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING

Commit Message

Jann Horn Oct. 12, 2020, 11:49 a.m. UTC
Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
the per-mm counters. With a 4K page size, that means that you can end up
with the counters off by up to 252KiB per thread.
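
The batching works roughly like the following minimal userspace model
(illustrative only, not kernel code; the real logic is behind
SPLIT_RSS_COUNTING in mm/memory.c, with TASK_RSS_EVENTS_THRESH = 64):

/*
 * Minimal userspace model of the SPLIT_RSS_COUNTING batching; illustrative
 * only, not kernel code. Page faults bump a per-thread cache, and the cache
 * is only folded into the shared per-mm counter (what VmRSS reports) after
 * the thread has accumulated more than 64 fault events.
 */
#include <stdio.h>

#define TASK_RSS_EVENTS_THRESH 64	/* same threshold as the kernel */

static long mm_rss;			/* shared per-mm counter ("VmRSS") */
static long thread_rss_cache;		/* per-thread cached pages */
static int  thread_rss_events;

static void model_page_fault(void)
{
	thread_rss_cache++;		/* fast path: no shared-counter write */
	if (thread_rss_events++ > TASK_RSS_EVENTS_THRESH) {
		mm_rss += thread_rss_cache;	/* the "sync_mm_rss()" step */
		thread_rss_cache = 0;
		thread_rss_events = 0;
	}
}

int main(void)
{
	/* touch 80 pages, like the reproducer below */
	for (int i = 1; i <= 80; i++) {
		model_page_fault();
		printf("faults=%2d  reported=%2ld pages  drift=%2ld pages (%ld KiB)\n",
		       i, mm_rss, thread_rss_cache, thread_rss_cache * 4);
	}
	return 0;
}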

Example:

$ cat rsstest.c
#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=0; i<40; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=40; i<80; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    exit(0);
  }

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  exit(0);
}
$ gcc -o rsstest rsstest.c && ./rsstest
VmRSS:	      68 kB
Rss:                 616 kB

VmRSS:	      68 kB
Rss:                 776 kB

VmRSS:	     812 kB
Rss:                 936 kB

$
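
For anyone who needs an accurate number in the meantime, smaps_rollup
(which the patch below points readers to) can be parsed directly; a minimal
sketch, assuming the "Rss:" line format shown above:

/*
 * Illustrative sketch (not part of the patch): read the accurate "Rss:"
 * value from /proc/[pid]/smaps_rollup instead of the cached VmRSS value
 * in /proc/[pid]/status.
 */
#include <stdio.h>
#include <unistd.h>

static long rss_kb_from_smaps_rollup(pid_t pid)
{
	char path[64], line[256];
	long kb = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

int main(void)
{
	printf("accurate Rss: %ld kB\n", rss_kb_from_smaps_rollup(getpid()));
	return 0;
}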


Let's document that those counters aren't entirely accurate.

Reported-by: Mark Mossberg <mark.mossberg@gmail.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 man5/proc.5 | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)


base-commit: 92e4056a29156598d057045ad25f59d44fcd1bb5

Comments

Jann Horn Oct. 12, 2020, 2:52 p.m. UTC | #1
On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> the per-mm counters. With a 4K page size, that means that you can end up
> with the counters off by up to 252KiB per thread.

Actually, as Mark Mossberg pointed out to me off-thread, the counters
can be off by many times more... this can be reproduced with e.g.
the following:

#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    for (int i=0; 1; i++) {
      eventfd_write(child_wait, 1);
      eventfd_read(child_resume, &dummy);
      if (i == 80) break;
      mapping[0x1000 * i] = 1;
    }
    exit(0);
  }

  for (int i=0; i<81; i++) {
    eventfd_read(child_wait, &dummy);
    dump(child);
    eventfd_write(child_resume, 1);
  }

  exit(0);
}


I'm not entirely sure why though.
Michal Hocko Oct. 12, 2020, 3:07 p.m. UTC | #2
On Mon 12-10-20 13:49:40, Jann Horn wrote:
> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> the per-mm counters. With a 4K page size, that means that you can end up
> with the counters off by up to 252KiB per thread.

Do we actually have any strong case to keep this exception to the
accounting?
Jann Horn Oct. 12, 2020, 3:20 p.m. UTC | #3
On Mon, Oct 12, 2020 at 5:07 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 12-10-20 13:49:40, Jann Horn wrote:
> > Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > the per-mm counters. With a 4K page size, that means that you can end up
> > with the counters off by up to 252KiB per thread.
>
> Do we actually have any strong case to keep this exception to the
> accounting?

I have no clue. The concept of "concurrently modified cache lines are
bad" seemed vaguely reasonable to me... but I have no idea how much
impact this actually has on massively multithreaded processes.
Michal Hocko Oct. 12, 2020, 3:33 p.m. UTC | #4
On Mon 12-10-20 17:20:08, Jann Horn wrote:
> On Mon, Oct 12, 2020 at 5:07 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 12-10-20 13:49:40, Jann Horn wrote:
> > > Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > > v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > > the per-mm counters. With a 4K page size, that means that you can end up
> > > with the counters off by up to 252KiB per thread.
> >
> > Do we actually have any strong case to keep this exception to the
> > accounting?
> 
> I have no clue. The concept of "concurrently modified cache lines are
> bad" seemed vaguely reasonable to me... but I have no idea how much
> impact this actually has on massively multithreaded processes.

I do remember some discussion when imprecision turned out to be a real
problem (Android?).

Anyway, I have to say that 34e55232e59f ("mm: avoid false sharing of
mm_counter") sounds quite dubious to me and it begs for re-evaluation.

Btw. thanks for trying to document this weird behavior. This is
certainly useful but I am suspecting that dropping it might be even
better.
Michael Kerrisk (man-pages) Oct. 27, 2020, 7:05 a.m. UTC | #5
Hi Jann,

On 10/12/20 4:52 PM, Jann Horn wrote:
> On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
>> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
>> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
>> the per-mm counters. With a 4K page size, that means that you can end up
>> with the counters off by up to 252KiB per thread.
> 
> Actually, as Mark Mossberg pointed out to me off-thread, the counters
> can actually be off by many times more... 

So, does your patch to proc.5 need tweaking, or can I just apply as is?

Thanks,

Michael
Jann Horn Oct. 27, 2020, 10:35 a.m. UTC | #6
On Tue, Oct 27, 2020 at 8:05 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 10/12/20 4:52 PM, Jann Horn wrote:
> > On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
> >> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> >> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> >> the per-mm counters. With a 4K page size, that means that you can end up
> >> with the counters off by up to 252KiB per thread.
> >
> > Actually, as Mark Mossberg pointed out to me off-thread, the counters
> > can actually be off by many times more...
>
> So, does your patch to proc.5 need tweaking, or can I just apply as is?

The "(up to 63 pages per thread)" would have to go, the rest should be correct.

But as Michal said, if someone volunteers to get rid of this
optimization, maybe we don't need the documentation after all? But
that would probably require actually doing some careful
heavily-multithreaded benchmarking on a big machine with a few dozen
cores, or something like that, so that we know whether this
optimization actually is unimportant enough that we can just get rid
of it...
Michal Hocko Oct. 27, 2020, 12:18 p.m. UTC | #7
On Tue 27-10-20 11:35:35, Jann Horn wrote:
> On Tue, Oct 27, 2020 at 8:05 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > On 10/12/20 4:52 PM, Jann Horn wrote:
> > > On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
> > >> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > >> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > >> the per-mm counters. With a 4K page size, that means that you can end up
> > >> with the counters off by up to 252KiB per thread.
> > >
> > > Actually, as Mark Mossberg pointed out to me off-thread, the counters
> > > can actually be off by many times more...
> >
> > So, does your patch to proc.5 need tweaking, or can I just apply as is?
> 
> The "(up to 63 pages per thread)" would have to go, the rest should be correct.
> 
> But as Michal said, if someone volunteers to get rid of this
> optimization, maybe we don't need the documentation after all? But
> that would probably require actually doing some careful
> heavily-multithreaded benchmarking on a big machine with a few dozen
> cores, or something like that, so that we know whether this
> optimization actually is unimportant enough that we can just get rid
> of it...

Well, the original micro-optimization didn't really come with any solid
numbers based on real workloads. Artificial workloads are likely not
very representative for this case because any potential counter overhead
normally gets dispersed.

I think this is a case where the benefit is so unclear that I would
simply revert the whole thing and then tune for real-life workloads
that actually see a regression.
Michael Kerrisk (man-pages) Oct. 27, 2020, 1:49 p.m. UTC | #8
On 10/27/20 11:35 AM, Jann Horn wrote:
> On Tue, Oct 27, 2020 at 8:05 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 10/12/20 4:52 PM, Jann Horn wrote:
>>> On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
>>>> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
>>>> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
>>>> the per-mm counters. With a 4K page size, that means that you can end up
>>>> with the counters off by up to 252KiB per thread.
>>>
>>> Actually, as Mark Mossberg pointed out to me off-thread, the counters
>>> can actually be off by many times more...
>>
>> So, does your patch to proc.5 need tweaking, or can I just apply as is?
> 
> The "(up to 63 pages per thread)" would have to go, the rest should be correct.
> 
> But as Michal said, if someone volunteers to get rid of this
> optimization, maybe we don't need the documentation after all? But
> that would probably require actually doing some careful
> heavily-multithreaded benchmarking on a big machine with a few dozen
> cores, or something like that, so that we know whether this
> optimization actually is unimportant enough that we can just get rid
> of it...

Okay -- in the meantime, I've applied your patch, with your 
suggested edit, to remove "(up to 63 pages per thread)".

Thanks!

Cheers,

Michael
Michal Hocko Oct. 27, 2020, 1:49 p.m. UTC | #9
On Tue 27-10-20 13:18:17, Michal Hocko wrote:
> On Tue 27-10-20 11:35:35, Jann Horn wrote:
> > On Tue, Oct 27, 2020 at 8:05 AM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > On 10/12/20 4:52 PM, Jann Horn wrote:
> > > > On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
> > > >> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > > >> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > > >> the per-mm counters. With a 4K page size, that means that you can end up
> > > >> with the counters off by up to 252KiB per thread.
> > > >
> > > > Actually, as Mark Mossberg pointed out to me off-thread, the counters
> > > > can actually be off by many times more...
> > >
> > > So, does your patch to proc.5 need tweaking, or can I just apply as is?
> > 
> > The "(up to 63 pages per thread)" would have to go, the rest should be correct.
> > 
> > But as Michal said, if someone volunteers to get rid of this
> > optimization, maybe we don't need the documentation after all? But
> > that would probably require actually doing some careful
> > heavily-multithreaded benchmarking on a big machine with a few dozen
> > cores, or something like that, so that we know whether this
> > optimization actually is unimportant enough that we can just get rid
> > of it...
> 
> Well, the original micro-optimization didn't really come with any solid
> numbers based on real workloads. Artificial workloads are likely not
> very representative for this case because any potential counter overhead
> normally gets dispersed.
> 
> I think this is a case where the benefit is so unclear that I would
> simply revert the whole thing and then tune for real-life workloads
> that actually see a regression.

And here we go with an RFC. Just from a look at the diffstat this looks
interesting. Please note I haven't tested this at all, so it is mostly to
show how much code we really need for a historical optimization which is
not really backed by any real data.

I think this is worth it just from the code maintenance point of view.
sync_mm_rss hooks are quite arbitrary, to say the least. I also
remember that some Android folks cared about this because they couldn't
get the data they needed with sufficient precision.

--- 
 fs/exec.c                     |  2 --
 include/linux/mm.h            | 16 --------------
 include/linux/mm_types_task.h |  9 --------
 include/linux/sched.h         |  3 ---
 kernel/exit.c                 |  4 ----
 kernel/fork.c                 |  4 ----
 kernel/kthread.c              |  1 -
 mm/madvise.c                  |  6 +-----
 mm/memory.c                   | 49 -------------------------------------------
 9 files changed, 1 insertion(+), 93 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..a8d8d5578ba3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1107,8 +1107,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
-	if (old_mm)
-		sync_mm_rss(old_mm);
 
 	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
 	if (ret)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..2a945f044f1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1864,14 +1864,6 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
 	long val = atomic_long_read(&mm->rss_stat.count[member]);
 
-#ifdef SPLIT_RSS_COUNTING
-	/*
-	 * counter is updated in asynchronous manner and may go to minus.
-	 * But it's never be expected number for users.
-	 */
-	if (val < 0)
-		val = 0;
-#endif
 	return (unsigned long)val;
 }
 
@@ -1958,14 +1950,6 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
 		*maxrss = hiwater_rss;
 }
 
-#if defined(SPLIT_RSS_COUNTING)
-void sync_mm_rss(struct mm_struct *mm);
-#else
-static inline void sync_mm_rss(struct mm_struct *mm)
-{
-}
-#endif
-
 #ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
 static inline int pte_special(pte_t pte)
 {
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index c1bc6731125c..a00327c663db 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -48,15 +48,6 @@ enum {
 	NR_MM_COUNTERS
 };
 
-#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
-#define SPLIT_RSS_COUNTING
-/* per-thread cached information, */
-struct task_rss_stat {
-	int events;	/* for synchronization threshold */
-	int count[NR_MM_COUNTERS];
-};
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..46fbe466767f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -751,9 +751,6 @@ struct task_struct {
 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
 
-#ifdef SPLIT_RSS_COUNTING
-	struct task_rss_stat		rss_stat;
-#endif
 	int				exit_state;
 	int				exit_code;
 	int				exit_signal;
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..ae861b977368 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -438,7 +438,6 @@ static void exit_mm(void)
 	exit_mm_release(current, mm);
 	if (!mm)
 		return;
-	sync_mm_rss(mm);
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold mmap_lock around checking core_state
@@ -761,9 +760,6 @@ void __noreturn do_exit(long code)
 
 	exit_signals(tsk);  /* sets PF_EXITING */
 
-	/* sync mm's RSS info before statistics gathering */
-	if (tsk->mm)
-		sync_mm_rss(tsk->mm);
 	acct_update_integrals(tsk);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..aa4c22e2b51c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1983,10 +1983,6 @@ static __latent_entropy struct task_struct *copy_process(
 	p->vtime.state = VTIME_INACTIVE;
 #endif
 
-#if defined(SPLIT_RSS_COUNTING)
-	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
-#endif
-
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
 #ifdef CONFIG_PSI
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3edaa380dc7b..c421e54dabe7 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1276,7 +1276,6 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
-	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
 	/* active_mm is still 'mm' */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0e0d61003fc6..cf779cf9c15c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -680,12 +680,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		mark_page_lazyfree(page);
 	}
 out:
-	if (nr_swap) {
-		if (current->mm == mm)
-			sync_mm_rss(mm);
-
+	if (nr_swap)
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
-	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
diff --git a/mm/memory.c b/mm/memory.c
index fcfc4ca36eba..5cef79be41a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -162,53 +162,9 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
 	trace_rss_stat(mm, member, count);
 }
 
-#if defined(SPLIT_RSS_COUNTING)
-
-void sync_mm_rss(struct mm_struct *mm)
-{
-	int i;
-
-	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		if (current->rss_stat.count[i]) {
-			add_mm_counter(mm, i, current->rss_stat.count[i]);
-			current->rss_stat.count[i] = 0;
-		}
-	}
-	current->rss_stat.events = 0;
-}
-
-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-	struct task_struct *task = current;
-
-	if (likely(task->mm == mm))
-		task->rss_stat.count[member] += val;
-	else
-		add_mm_counter(mm, member, val);
-}
-#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
-#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
-
-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH	(64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-	if (unlikely(task != current))
-		return;
-	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-		sync_mm_rss(task->mm);
-}
-#else /* SPLIT_RSS_COUNTING */
-
 #define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
 #define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
 
-static void check_sync_rss_stat(struct task_struct *task)
-{
-}
-
-#endif /* SPLIT_RSS_COUNTING */
-
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
@@ -485,8 +441,6 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 {
 	int i;
 
-	if (current->mm == mm)
-		sync_mm_rss(mm);
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		if (rss[i])
 			add_mm_counter(mm, i, rss[i]);
@@ -4612,9 +4566,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	count_vm_event(PGFAULT);
 	count_memcg_event_mm(vma->vm_mm, PGFAULT);
 
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
Vlastimil Babka Oct. 27, 2020, 6:56 p.m. UTC | #10
On 10/12/20 5:33 PM, Michal Hocko wrote:
> On Mon 12-10-20 17:20:08, Jann Horn wrote:
>> On Mon, Oct 12, 2020 at 5:07 PM Michal Hocko <mhocko@suse.com> wrote:
>> > On Mon 12-10-20 13:49:40, Jann Horn wrote:
>> > > Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
>> > > v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
>> > > the per-mm counters. With a 4K page size, that means that you can end up
>> > > with the counters off by up to 252KiB per thread.
>> >
>> > Do we actually have any strong case to keep this exception to the
>> > accounting?
>> 
>> I have no clue. The concept of "concurrently modified cache lines are
>> bad" seemed vaguely reasonable to me... but I have no idea how much
>> impact this actually has on massively multithreaded processes.
> 
> I do remember some discussion when imprecision turned out to be a real
> problem (Android?).
> 
> Anyway, I have to say that 34e55232e59f ("mm: avoid false sharing of
> mm_counter") sounds quite dubious to me and it begs for re-evaluation.

Agreed.
- false sharing? No: false sharing is when unrelated things share a cache line,
and this is real sharing of a counter, AFAICS. If the problem is really
exacerbated by false sharing of the counter with something else, then the fix is
to move the counter or the other thing to a different cache line.
- the evaluation showed 4.5 cache misses per fault reduced to 4; I think 0.5
cache misses is negligible compared to a page fault
- "Anyway, the most contended object is mmap_sem if the number of threads
grows." - and surprise surprise, 10 years later this is still true :)


> Btw. thanks for trying to document this weird behavior. This is
> certainly useful but I am suspecting that dropping it might be even
> better.
>

Patch

diff --git a/man5/proc.5 b/man5/proc.5
index ed309380b53b..13208811efb0 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -2265,6 +2265,9 @@  This is just the pages which
 count toward text, data, or stack space.
 This does not include pages
 which have not been demand-loaded in, or which are swapped out.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+below.
 .TP
 (25) \fIrsslim\fP \ %lu
 Current soft limit in bytes on the rss of the process;
@@ -2409,9 +2412,9 @@  The columns are:
 size       (1) total program size
            (same as VmSize in \fI/proc/[pid]/status\fP)
 resident   (2) resident set size
-           (same as VmRSS in \fI/proc/[pid]/status\fP)
+           (inaccurate; same as VmRSS in \fI/proc/[pid]/status\fP)
 shared     (3) number of resident shared pages (i.e., backed by a file)
-           (same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
+           (inaccurate; same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
 text       (4) text (code)
 .\" (not including libs; broken, includes data segment)
 lib        (5) library (unused since Linux 2.6; always 0)
@@ -2420,6 +2423,16 @@  data       (6) data + stack
 dt         (7) dirty pages (unused since Linux 2.6; always 0)
 .EE
 .in
+.IP
+.\" See SPLIT_RSS_COUNTING in the kernel.
+.\" Inaccuracy is bounded by TASK_RSS_EVENTS_THRESH.
+Some of these values are somewhat inaccurate (up to 63 pages per thread) because
+of a kernel-internal scalability optimization.
+If accurate values are required, use
+.I /proc/[pid]/smaps
+or
+.I /proc/[pid]/smaps_rollup
+instead, which are much slower but provide accurate, detailed information.
 .TP
 .I /proc/[pid]/status
 Provides much of the information in
@@ -2596,6 +2609,9 @@  directly access physical memory.
 .IP *
 .IR VmHWM :
 Peak resident set size ("high water mark").
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR VmRSS :
 Resident set size.
@@ -2604,16 +2620,25 @@  Note that the value here is the sum of
 .IR RssFile ,
 and
 .IR RssShmem .
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssAnon :
 Size of resident anonymous memory.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssFile :
 Size of resident file mappings.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssShmem :
 Size of resident shared memory (includes System V shared memory,
@@ -2622,6 +2647,9 @@  mappings from
 and shared anonymous mappings).
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR VmData ", " VmStk ", " VmExe :
 Size of data, stack, and text segments.
@@ -2640,6 +2668,9 @@  Size of second-level page tables (added in Linux 4.0; removed in Linux 4.15).
 .\" commit b084d4353ff99d824d3bc5a5c2c22c70b1fba722
 Swapped-out virtual memory size by anonymous private pages;
 shmem swap usage is not included (since Linux 2.6.34).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR HugetlbPages :
 Size of hugetlb memory portions