diff mbox series

[v4,3/4] cachestat: implement cachestat syscall

Message ID 20221216192149.3902877-4-nphamcs@gmail.com (mailing list archive)
State New
Headers show
Series cachestat: a new syscall for page cache state of files | expand

Commit Message

Nhat Pham Dec. 16, 2022, 7:21 p.m. UTC
Implement a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in
a given range.

NAME
    cachestat - query the page cache status of a file.

SYNOPSIS
    #include <sys/mman.h>

    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };

    int cachestat(unsigned int fd, off_t off, size_t len,
          size_t cstat_size, struct cachestat *cstat,
          unsigned int flags);

DESCRIPTION
    cachestat() queries the number of cached pages, number of dirty
    pages, number of pages marked for writeback, number of (recently)
    evicted pages, in the bytes range given by `off` and `len`.

    These values are returned in a cachestat struct, whose address is
    given by the `cstat` argument.

    The `off` and `len` arguments must be non-negative integers. If
    `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
    0, we will query in the range from `off` to the end of the file.

    `cstat_size` allows users to obtain partial results. The syscall
    will copy the first `csstat_size` bytes to the specified userspace
    memory. `cstat_size` must be a non-negative value that is no larger
    than the current size of the cachestat struct.

    The `flags` argument is unused for now, but is included for future
    extensibility. User should pass 0 (i.e no flag specified).

RETURN VALUE
    On success, cachestat returns 0. On error, -1 is returned, and errno
    is set to indicate the error.

ERRORS
    EFAULT cstat points to an invalid address.

    EINVAL invalid `cstat_size` or `flags`

    EBADF  invalid file descriptor.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 include/linux/fs.h                          |   3 +
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/mman.h                   |   9 ++
 init/Kconfig                                |  10 ++
 kernel/sys_ni.c                             |   1 +
 mm/filemap.c                                | 137 ++++++++++++++++++++
 20 files changed, 180 insertions(+), 1 deletion(-)

Comments

Andrew Morton Dec. 16, 2022, 9:48 p.m. UTC | #1
On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> Implement a new syscall that queries cache state of a file and
> summarizes the number of cached pages, number of dirty pages, number of
> pages marked for writeback, number of (recently) evicted pages, etc. in
> a given range.
> 
> NAME
>     cachestat - query the page cache status of a file.
> 
> SYNOPSIS
>     #include <sys/mman.h>
> 
>     struct cachestat {
>         __u64 nr_cache;
>         __u64 nr_dirty;
>         __u64 nr_writeback;
>         __u64 nr_evicted;
>         __u64 nr_recently_evicted;
>     };
> 
>     int cachestat(unsigned int fd, off_t off, size_t len,
>           size_t cstat_size, struct cachestat *cstat,
>           unsigned int flags);
> 
> DESCRIPTION
>     cachestat() queries the number of cached pages, number of dirty
>     pages, number of pages marked for writeback, number of (recently)
>     evicted pages, in the bytes range given by `off` and `len`.

I suggest this be spelled out better: "number of evicted and number or
recently evicted pages".

I suggest this clearly tell readers what an "evicted" page is - they
aren't kernel programmers!

What is the benefit of the "recently evicted" pages?  "recently" seems
very vague - what use is this to anyone?

>     These values are returned in a cachestat struct, whose address is
>     given by the `cstat` argument.
> 
>     The `off` and `len` arguments must be non-negative integers. If
>     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
>     0, we will query in the range from `off` to the end of the file.
> 
>     `cstat_size` allows users to obtain partial results. The syscall
>     will copy the first `csstat_size` bytes to the specified userspace
>     memory. `cstat_size` must be a non-negative value that is no larger
>     than the current size of the cachestat struct.
> 
>     The `flags` argument is unused for now, but is included for future
>     extensibility. User should pass 0 (i.e no flag specified).

Why is `flags' here?  We could add an unused flags arg to any syscall,
but we don't.  What's the plan?


Are there security implications?  If I know that some process has a
file open, I can use cachestat() to infer which parts of that file
they're looking at (like mincore(), I guess).  And I can infer which
parts they're writing to, unlike mincore().


I suggest the [patch 1/4] fixup be separated from this series.
kernel test robot Dec. 16, 2022, 10:26 p.m. UTC | #2
Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: s390-allmodconfig
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
        git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from include/uapi/asm-generic/mman.h:5,
                    from ./arch/s390/include/generated/uapi/asm/mman.h:1,
                    from include/uapi/linux/mman.h:5,
                    from include/linux/fs.h:49,
                    from include/linux/huge_mm.h:8,
                    from include/linux/mm.h:737,
                    from include/linux/kvm_host.h:16,
                    from arch/s390/kvm/kvm-s390.h:17,
                    from arch/s390/kvm/gaccess.c:16:
>> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant
      16 | #define PROT_NONE       0x0             /* page can not be accessed */
         |                         ^~~
   arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE'
     493 |         PROT_NONE,
         |         ^~~~~~~~~
   arch/s390/kvm/gaccess.c: In function 'trans_exc_ending':
>> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value
     516 |                 case PROT_TYPE_LA:
         |                 ^~~~
   arch/s390/kvm/gaccess.c:509:17: note: previously used here
     509 |                 case PROT_NONE:
         |                 ^~~~
   In file included from include/linux/compiler_types.h:79,
                    from <command-line>:
>> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label
     223 | # define fallthrough                    __attribute__((__fallthrough__))
         |                                         ^~~~~~~~~~~~~
   arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough'
     515 |                         fallthrough;
         |                         ^~~~~~~~~~~


vim +16 include/uapi/asm-generic/mman-common.h

5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15   9  
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  10  #define PROT_READ	0x1		/* page can be read */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  11  #define PROT_WRITE	0x2		/* page can be written */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  12  #define PROT_EXEC	0x4		/* page can be executed */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  13  #define PROT_SEM	0x8		/* page may be used for atomic ops */
d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin        2019-12-11  14  /*			0x10		   reserved for arch-specific use */
d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin        2019-12-11  15  /*			0x20		   reserved for arch-specific use */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15 @16  #define PROT_NONE	0x0		/* page can not be accessed */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  17  #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  18  #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  19
kernel test robot Dec. 16, 2022, 10:46 p.m. UTC | #3
Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: arm64-randconfig-r035-20221216
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 98b13979fb05f3ed288a900deb843e7b27589e58)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
        git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 prepare

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/arm64/kernel/asm-offsets.c:10:
   In file included from include/linux/arm_sdei.h:8:
   In file included from include/acpi/ghes.h:5:
   In file included from include/acpi/apei.h:9:
   In file included from include/linux/acpi.h:15:
   In file included from include/linux/device.h:32:
   In file included from include/linux/device/driver.h:21:
   In file included from include/linux/module.h:19:
   In file included from include/linux/elf.h:6:
   In file included from arch/arm64/include/asm/elf.h:141:
   In file included from include/linux/fs.h:49:
   In file included from include/uapi/linux/mman.h:5:
>> arch/arm64/include/asm/mman.h:15:10: error: use of undeclared identifier 'VM_ARM64_BTI'
                   ret |= VM_ARM64_BTI;
                          ^
>> arch/arm64/include/asm/mman.h:18:10: error: use of undeclared identifier 'VM_MTE'
                   ret |= VM_MTE;
                          ^
>> arch/arm64/include/asm/mman.h:32:10: error: use of undeclared identifier 'VM_MTE_ALLOWED'
                   return VM_MTE_ALLOWED;
                          ^
   arch/arm64/include/asm/mman.h:59:22: error: use of undeclared identifier 'VM_MTE'
           return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
                               ^
   arch/arm64/include/asm/mman.h:59:45: error: use of undeclared identifier 'VM_MTE_ALLOWED'
           return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
                                                      ^
   5 errors generated.
   make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:1270: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:231: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +/VM_ARM64_BTI +15 arch/arm64/include/asm/mman.h

8ef8f360cf30be Dave Martin     2020-03-16   8  
8ef8f360cf30be Dave Martin     2020-03-16   9  static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
8ef8f360cf30be Dave Martin     2020-03-16  10  	unsigned long pkey __always_unused)
8ef8f360cf30be Dave Martin     2020-03-16  11  {
9f3419315f3cdc Catalin Marinas 2019-11-27  12  	unsigned long ret = 0;
9f3419315f3cdc Catalin Marinas 2019-11-27  13  
8ef8f360cf30be Dave Martin     2020-03-16  14  	if (system_supports_bti() && (prot & PROT_BTI))
9f3419315f3cdc Catalin Marinas 2019-11-27 @15  		ret |= VM_ARM64_BTI;
8ef8f360cf30be Dave Martin     2020-03-16  16  
9f3419315f3cdc Catalin Marinas 2019-11-27  17  	if (system_supports_mte() && (prot & PROT_MTE))
9f3419315f3cdc Catalin Marinas 2019-11-27 @18  		ret |= VM_MTE;
9f3419315f3cdc Catalin Marinas 2019-11-27  19  
9f3419315f3cdc Catalin Marinas 2019-11-27  20  	return ret;
8ef8f360cf30be Dave Martin     2020-03-16  21  }
8ef8f360cf30be Dave Martin     2020-03-16  22  #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
8ef8f360cf30be Dave Martin     2020-03-16  23  
9f3419315f3cdc Catalin Marinas 2019-11-27  24  static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
9f3419315f3cdc Catalin Marinas 2019-11-27  25  {
9f3419315f3cdc Catalin Marinas 2019-11-27  26  	/*
9f3419315f3cdc Catalin Marinas 2019-11-27  27  	 * Only allow MTE on anonymous mappings as these are guaranteed to be
9f3419315f3cdc Catalin Marinas 2019-11-27  28  	 * backed by tags-capable memory. The vm_flags may be overridden by a
9f3419315f3cdc Catalin Marinas 2019-11-27  29  	 * filesystem supporting MTE (RAM-based).
9f3419315f3cdc Catalin Marinas 2019-11-27  30  	 */
9f3419315f3cdc Catalin Marinas 2019-11-27  31  	if (system_supports_mte() && (flags & MAP_ANONYMOUS))
9f3419315f3cdc Catalin Marinas 2019-11-27 @32  		return VM_MTE_ALLOWED;
9f3419315f3cdc Catalin Marinas 2019-11-27  33  
9f3419315f3cdc Catalin Marinas 2019-11-27  34  	return 0;
9f3419315f3cdc Catalin Marinas 2019-11-27  35  }
9f3419315f3cdc Catalin Marinas 2019-11-27  36  #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
9f3419315f3cdc Catalin Marinas 2019-11-27  37
kernel test robot Dec. 16, 2022, 10:46 p.m. UTC | #4
Hi Nhat,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on shuah-kselftest/next]
[also build test WARNING on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: alpha-randconfig-r036-20221216
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
        git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash mm/damon/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/mman.h:5,
                    from include/linux/fs.h:49,
                    from include/linux/highmem.h:5,
                    from mm/damon/vaddr.c:11:
>> arch/alpha/include/uapi/asm/mman.h:15: warning: "MAP_FIXED" redefined
      15 | #define MAP_FIXED       0x100           /* Interpret addr exactly */
         | 
   In file included from mm/damon/vaddr.c:10:
   include/uapi/asm-generic/mman-common.h:22: note: this is the location of the previous definition
      22 | #define MAP_FIXED       0x10            /* Interpret addr exactly */
         | 
>> arch/alpha/include/uapi/asm/mman.h:16: warning: "MAP_ANONYMOUS" redefined
      16 | #define MAP_ANONYMOUS   0x10            /* don't use a file */
         | 
   include/uapi/asm-generic/mman-common.h:23: note: this is the location of the previous definition
      23 | #define MAP_ANONYMOUS   0x20            /* don't use a file */
         | 
>> arch/alpha/include/uapi/asm/mman.h:29: warning: "MAP_POPULATE" redefined
      29 | #define MAP_POPULATE    0x20000         /* populate (prefault) pagetables */
         | 
   include/uapi/asm-generic/mman-common.h:26: note: this is the location of the previous definition
      26 | #define MAP_POPULATE            0x008000        /* populate (prefault) pagetables */
         | 
>> arch/alpha/include/uapi/asm/mman.h:30: warning: "MAP_NONBLOCK" redefined
      30 | #define MAP_NONBLOCK    0x40000         /* do not block on IO */
         | 
   include/uapi/asm-generic/mman-common.h:27: note: this is the location of the previous definition
      27 | #define MAP_NONBLOCK            0x010000        /* do not block on IO */
         | 
>> arch/alpha/include/uapi/asm/mman.h:31: warning: "MAP_STACK" redefined
      31 | #define MAP_STACK       0x80000         /* give out an address that is best suited for process/thread stacks */
         | 
   include/uapi/asm-generic/mman-common.h:28: note: this is the location of the previous definition
      28 | #define MAP_STACK               0x020000        /* give out an address that is best suited for process/thread stacks */
         | 
>> arch/alpha/include/uapi/asm/mman.h:32: warning: "MAP_HUGETLB" redefined
      32 | #define MAP_HUGETLB     0x100000        /* create a huge page mapping */
         | 
   include/uapi/asm-generic/mman-common.h:29: note: this is the location of the previous definition
      29 | #define MAP_HUGETLB             0x040000        /* create a huge page mapping */
         | 
>> arch/alpha/include/uapi/asm/mman.h:33: warning: "MAP_FIXED_NOREPLACE" redefined
      33 | #define MAP_FIXED_NOREPLACE     0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
         | 
   include/uapi/asm-generic/mman-common.h:31: note: this is the location of the previous definition
      31 | #define MAP_FIXED_NOREPLACE     0x100000        /* MAP_FIXED which doesn't unmap underlying mapping */
         | 
>> arch/alpha/include/uapi/asm/mman.h:36: warning: "MS_SYNC" redefined
      36 | #define MS_SYNC         2               /* synchronous memory sync */
         | 
   include/uapi/asm-generic/mman-common.h:43: note: this is the location of the previous definition
      43 | #define MS_SYNC         4               /* synchronous memory sync */
         | 
>> arch/alpha/include/uapi/asm/mman.h:37: warning: "MS_INVALIDATE" redefined
      37 | #define MS_INVALIDATE   4               /* invalidate the caches */
         | 
   include/uapi/asm-generic/mman-common.h:42: note: this is the location of the previous definition
      42 | #define MS_INVALIDATE   2               /* invalidate the caches */
         | 
>> arch/alpha/include/uapi/asm/mman.h:50: warning: "MADV_DONTNEED" redefined
      50 | #define MADV_DONTNEED   6               /* don't need these pages */
         | 
   include/uapi/asm-generic/mman-common.h:49: note: this is the location of the previous definition
      49 | #define MADV_DONTNEED   4               /* don't need these pages */
         | 


vim +/MAP_FIXED +15 arch/alpha/include/uapi/asm/mman.h

^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  12  
746c9398f5ac2b arch/alpha/include/uapi/asm/mman.h Michael S. Tsirkin 2019-02-08  13  /* 0x01 - 0x03 are defined in linux/mman.h */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  14  #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @15  #define MAP_FIXED	0x100		/* Interpret addr exactly */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @16  #define MAP_ANONYMOUS	0x10		/* don't use a file */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  17  
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  18  /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  19  #define _MAP_HASSEMAPHORE 0x0200
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  20  #define _MAP_INHERIT	0x0400
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  21  #define _MAP_UNALIGNED	0x0800
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  22  
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  23  /* These are linux-specific */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  24  #define MAP_GROWSDOWN	0x01000		/* stack-like segment */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  25  #define MAP_DENYWRITE	0x02000		/* ETXTBSY */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  26  #define MAP_EXECUTABLE	0x04000		/* mark it as an executable */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  27  #define MAP_LOCKED	0x08000		/* lock the mapping */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  28  #define MAP_NORESERVE	0x10000		/* don't check for reservations */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @29  #define MAP_POPULATE	0x20000		/* populate (prefault) pagetables */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @30  #define MAP_NONBLOCK	0x40000		/* do not block on IO */
90f72aa58bbf07 arch/alpha/include/asm/mman.h      Arnd Bergmann      2009-09-21 @31  #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
90f72aa58bbf07 arch/alpha/include/asm/mman.h      Arnd Bergmann      2009-09-21 @32  #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
a4ff8e8620d3f4 arch/alpha/include/uapi/asm/mman.h Michal Hocko       2018-04-10 @33  #define MAP_FIXED_NOREPLACE	0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  34  
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  35  #define MS_ASYNC	1		/* sync memory asynchronously */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @36  #define MS_SYNC		2		/* synchronous memory sync */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @37  #define MS_INVALIDATE	4		/* invalidate the caches */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  38  
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  39  #define MCL_CURRENT	 8192		/* lock all currently mapped pages */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  40  #define MCL_FUTURE	16384		/* lock all additions to address space */
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson      2015-11-05  41  #define MCL_ONFAULT	32768		/* lock all pages that are faulted in */
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson      2015-11-05  42  
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson      2015-11-05  43  #define MLOCK_ONFAULT	0x01		/* Lock pages in range after they are faulted in, do not prefault */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  44  
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  45  #define MADV_NORMAL	0		/* no further special treatment */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  46  #define MADV_RANDOM	1		/* expect random page references */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  47  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  48  #define MADV_WILLNEED	3		/* will need these pages */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16  49  #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
^1da177e4c3f41 include/asm-alpha/mman.h           Linus Torvalds     2005-04-16 @50  #define MADV_DONTNEED	6		/* don't need these pages */
5f6164f3092832 include/asm-alpha/mman.h           Michael S. Tsirkin 2006-02-15  51
kernel test robot Dec. 17, 2022, 1:58 a.m. UTC | #5
Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: arm64-allyesconfig
compiler: aarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
        git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 prepare

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/uapi/linux/mman.h:5,
                    from include/linux/fs.h:49,
                    from arch/arm64/include/asm/elf.h:141,
                    from include/linux/elf.h:6,
                    from include/linux/module.h:19,
                    from include/linux/device/driver.h:21,
                    from include/linux/device.h:32,
                    from include/linux/acpi.h:15,
                    from include/acpi/apei.h:9,
                    from include/acpi/ghes.h:5,
                    from include/linux/arm_sdei.h:8,
                    from arch/arm64/kernel/asm-offsets.c:10:
   arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_prot_bits':
>> arch/arm64/include/asm/mman.h:15:24: error: 'VM_ARM64_BTI' undeclared (first use in this function); did you mean 'ARM64_BTI'?
      15 |                 ret |= VM_ARM64_BTI;
         |                        ^~~~~~~~~~~~
         |                        ARM64_BTI
   arch/arm64/include/asm/mman.h:15:24: note: each undeclared identifier is reported only once for each function it appears in
>> arch/arm64/include/asm/mman.h:18:24: error: 'VM_MTE' undeclared (first use in this function)
      18 |                 ret |= VM_MTE;
         |                        ^~~~~~
   arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_flag_bits':
>> arch/arm64/include/asm/mman.h:32:24: error: 'VM_MTE_ALLOWED' undeclared (first use in this function)
      32 |                 return VM_MTE_ALLOWED;
         |                        ^~~~~~~~~~~~~~
   arch/arm64/include/asm/mman.h: In function 'arch_validate_flags':
   arch/arm64/include/asm/mman.h:59:29: error: 'VM_MTE' undeclared (first use in this function)
      59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
         |                             ^~~~~~
   arch/arm64/include/asm/mman.h:59:52: error: 'VM_MTE_ALLOWED' undeclared (first use in this function)
      59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
         |                                                    ^~~~~~~~~~~~~~
   make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:1270: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:231: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +15 arch/arm64/include/asm/mman.h

8ef8f360cf30be Dave Martin     2020-03-16   8  
8ef8f360cf30be Dave Martin     2020-03-16   9  static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
8ef8f360cf30be Dave Martin     2020-03-16  10  	unsigned long pkey __always_unused)
8ef8f360cf30be Dave Martin     2020-03-16  11  {
9f3419315f3cdc Catalin Marinas 2019-11-27  12  	unsigned long ret = 0;
9f3419315f3cdc Catalin Marinas 2019-11-27  13  
8ef8f360cf30be Dave Martin     2020-03-16  14  	if (system_supports_bti() && (prot & PROT_BTI))
9f3419315f3cdc Catalin Marinas 2019-11-27 @15  		ret |= VM_ARM64_BTI;
8ef8f360cf30be Dave Martin     2020-03-16  16  
9f3419315f3cdc Catalin Marinas 2019-11-27  17  	if (system_supports_mte() && (prot & PROT_MTE))
9f3419315f3cdc Catalin Marinas 2019-11-27 @18  		ret |= VM_MTE;
9f3419315f3cdc Catalin Marinas 2019-11-27  19  
9f3419315f3cdc Catalin Marinas 2019-11-27  20  	return ret;
8ef8f360cf30be Dave Martin     2020-03-16  21  }
8ef8f360cf30be Dave Martin     2020-03-16  22  #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
8ef8f360cf30be Dave Martin     2020-03-16  23  
9f3419315f3cdc Catalin Marinas 2019-11-27  24  static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
9f3419315f3cdc Catalin Marinas 2019-11-27  25  {
9f3419315f3cdc Catalin Marinas 2019-11-27  26  	/*
9f3419315f3cdc Catalin Marinas 2019-11-27  27  	 * Only allow MTE on anonymous mappings as these are guaranteed to be
9f3419315f3cdc Catalin Marinas 2019-11-27  28  	 * backed by tags-capable memory. The vm_flags may be overridden by a
9f3419315f3cdc Catalin Marinas 2019-11-27  29  	 * filesystem supporting MTE (RAM-based).
9f3419315f3cdc Catalin Marinas 2019-11-27  30  	 */
9f3419315f3cdc Catalin Marinas 2019-11-27  31  	if (system_supports_mte() && (flags & MAP_ANONYMOUS))
9f3419315f3cdc Catalin Marinas 2019-11-27 @32  		return VM_MTE_ALLOWED;
9f3419315f3cdc Catalin Marinas 2019-11-27  33  
9f3419315f3cdc Catalin Marinas 2019-11-27  34  	return 0;
9f3419315f3cdc Catalin Marinas 2019-11-27  35  }
9f3419315f3cdc Catalin Marinas 2019-11-27  36  #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
9f3419315f3cdc Catalin Marinas 2019-11-27  37
Brian Foster Dec. 20, 2022, 2:37 p.m. UTC | #6
On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> Implement a new syscall that queries cache state of a file and
> summarizes the number of cached pages, number of dirty pages, number of
> pages marked for writeback, number of (recently) evicted pages, etc. in
> a given range.
> 
> NAME
>     cachestat - query the page cache status of a file.
> 
> SYNOPSIS
>     #include <sys/mman.h>
> 
>     struct cachestat {
>         __u64 nr_cache;
>         __u64 nr_dirty;
>         __u64 nr_writeback;
>         __u64 nr_evicted;
>         __u64 nr_recently_evicted;
>     };
> 
>     int cachestat(unsigned int fd, off_t off, size_t len,
>           size_t cstat_size, struct cachestat *cstat,
>           unsigned int flags);
> 
> DESCRIPTION
>     cachestat() queries the number of cached pages, number of dirty
>     pages, number of pages marked for writeback, number of (recently)
>     evicted pages, in the bytes range given by `off` and `len`.
> 
>     These values are returned in a cachestat struct, whose address is
>     given by the `cstat` argument.
> 
>     The `off` and `len` arguments must be non-negative integers. If
>     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
>     0, we will query in the range from `off` to the end of the file.
> 
>     `cstat_size` allows users to obtain partial results. The syscall
>     will copy the first `csstat_size` bytes to the specified userspace
>     memory. `cstat_size` must be a non-negative value that is no larger
>     than the current size of the cachestat struct.
> 
>     The `flags` argument is unused for now, but is included for future
>     extensibility. User should pass 0 (i.e no flag specified).
> 
> RETURN VALUE
>     On success, cachestat returns 0. On error, -1 is returned, and errno
>     is set to indicate the error.
> 
> ERRORS
>     EFAULT cstat points to an invalid address.
> 
>     EINVAL invalid `cstat_size` or `flags`
> 
>     EBADF  invalid file descriptor.
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---

Hi Nhat,

Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
Just a couple more random nitty things, if you happen to care about
them..

>  arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
>  arch/arm/tools/syscall.tbl                  |   1 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
>  include/linux/fs.h                          |   3 +
>  include/linux/syscalls.h                    |   3 +
>  include/uapi/asm-generic/unistd.h           |   5 +-
>  include/uapi/linux/mman.h                   |   9 ++
>  init/Kconfig                                |  10 ++
>  kernel/sys_ni.c                             |   1 +
>  mm/filemap.c                                | 137 ++++++++++++++++++++
>  20 files changed, 180 insertions(+), 1 deletion(-)
> 
...
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 08341616ae7a..29ffb906caa4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
...
> @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
>  	return try_to_free_buffers(folio);
>  }
>  EXPORT_SYMBOL(filemap_release_folio);
> +
> +#ifdef CONFIG_CACHESTAT
> +/**
> + * filemap_cachestat() - compute the page cache statistics of a mapping
> + * @mapping:	The mapping to compute the statistics for.
> + * @first_index:	The starting page cache index.
> + * @last_index:	The final page index (inclusive).
> + * @cs:	the cachestat struct to write the result to.
> + *
> + * This will query the page cache statistics of a mapping in the
> + * page range of [first_index, last_index] (inclusive). THe statistics
> + * queried include: number of dirty pages, number of pages marked for
> + * writeback, and the number of (recently) evicted pages.
> + */
> +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> +		pgoff_t last_index, struct cachestat *cs)
> +{

Not sure why the internal helper needs to be wrapped in a config option?
Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
wrap the syscall bits with that..?

I would also think it might make things simpler to split out all the
syscall bits into a separate patch from filemap_cachestat().

> +	XA_STATE(xas, &mapping->i_pages, first_index);
> +	struct folio *folio;
> +
> +	rcu_read_lock();
> +	xas_for_each(&xas, folio, last_index) {
> +		unsigned long nr_pages;
> +		pgoff_t folio_first_index, folio_last_index;
> +
> +		if (xas_retry(&xas, folio))
> +			continue;
> +
> +		nr_pages = folio_nr_pages(folio);
> +		folio_first_index = folio_pgoff(folio);
> +		folio_last_index = folio_first_index + nr_pages - 1;
> +
> +		/* Folios might straddle the range boundaries, only count covered subpages */
> +		if (folio_first_index < first_index)
> +			nr_pages -= first_index - folio_first_index;
> +
> +		if (folio_last_index > last_index)
> +			nr_pages -= folio_last_index - last_index;
> +
> +		if (xa_is_value(folio)) {
> +			/* page is evicted */
> +			void *shadow = (void *)folio;
> +			bool workingset; /* not used */
> +
> +			cs->nr_evicted += nr_pages;
> +
> +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> +			if (shmem_mapping(mapping)) {
> +				/* shmem file - in swap cache */
> +				swp_entry_t swp = radix_to_swp_entry(folio);
> +
> +				shadow = get_shadow_from_swap_cache(swp);
> +			}
> +#endif
> +			if (workingset_test_recent(shadow, true, &workingset))
> +				cs->nr_recently_evicted += nr_pages;
> +
> +			continue;
> +		}
> +
> +		/* page is in cache */
> +		cs->nr_cache += nr_pages;
> +
> +		if (folio_test_dirty(folio))
> +			cs->nr_dirty += nr_pages;
> +
> +		if (folio_test_writeback(folio))
> +			cs->nr_writeback += nr_pages;

I'm not sure this is an issue right now (or if it will ever be for your
use cases), but from a filesystem perspective it is possible to have
large or variable sized folios in cache. At the moment I think the fs
(or XFS+iomap at least) manages dirty/writeback state at folio
granularity, but that may change in the near future if/when iomap
sub-page dirty tracking comes along. I suspect that means it may become
possible to have a large folio of some N number of pages where only a
subset of those pages are actually in dirty/writeback state, and thus
introduces some inaccuracy here because this assumes that folio state
applies to folio_nr_pages() worth of pages. Just something to be aware
of..

Brian
 
> +	}
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_cachestat);
> +
> +/*
> + * The cachestat(5) system call.
> + *
> + * cachestat() returns the page cache statistics of a file in the
> + * bytes specified by `off` and `len`: number of cached pages,
> + * number of dirty pages, number of pages marked for writeback,
> + * number of (recently) evicted pages.
> + *
> + * `off` and `len` must be non-negative integers. If `len` > 0,
> + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> + * we will query in the range from `off` to the end of the file.
> + *
> + * `cstat_size` allows users to obtain partial results. The syscall
> + * will copy the first `csstat_size` bytes to the specified userspace
> + * memory. It also makes the cachestat struct extensible - new fields
> + * can be added in the future without breaking existing usage.
> + * `cstat_size` must be a non-negative value that is no larger than
> + * the current size of the cachestat struct.
> + *
> + * The `flags` argument is unused for now, but is included for future
> + * extensibility. User should pass 0 (i.e no flag specified).
> + *
> + * Because the status of a page can change after cachestat() checks it
> + * but before it returns to the application, the returned values may
> + * contain stale information.
> + *
> + * return values:
> + *  zero    - success
> + *  -EFAULT - cstat points to an illegal address
> + *  -EINVAL - invalid arguments
> + *  -EBADF	- invalid file descriptor
> + */
> +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> +		size_t, cstat_size, struct cachestat __user *, cstat,
> +		unsigned int, flags)
> +{
> +	struct fd f = fdget(fd);
> +	struct address_space *mapping;
> +	struct cachestat cs;
> +	pgoff_t first_index = off >> PAGE_SHIFT;
> +	pgoff_t last_index =
> +		len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> +
> +	if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> +		return -EINVAL;
> +
> +	if (!f.file)
> +		return -EBADF;
> +
> +	memset(&cs, 0, sizeof(struct cachestat));
> +	mapping = f.file->f_mapping;
> +	filemap_cachestat(mapping, first_index, last_index, &cs);
> +	fdput(f);
> +
> +	if (copy_to_user(cstat, &cs, cstat_size))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_CACHESTAT */
> -- 
> 2.30.2
>
Nhat Pham Dec. 22, 2022, 12:37 a.m. UTC | #7
On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote:
>
> > Implement a new syscall that queries cache state of a file and
> > summarizes the number of cached pages, number of dirty pages, number of
> > pages marked for writeback, number of (recently) evicted pages, etc. in
> > a given range.
> >
> > NAME
> >     cachestat - query the page cache status of a file.
> >
> > SYNOPSIS
> >     #include <sys/mman.h>
> >
> >     struct cachestat {
> >         __u64 nr_cache;
> >         __u64 nr_dirty;
> >         __u64 nr_writeback;
> >         __u64 nr_evicted;
> >         __u64 nr_recently_evicted;
> >     };
> >
> >     int cachestat(unsigned int fd, off_t off, size_t len,
> >           size_t cstat_size, struct cachestat *cstat,
> >           unsigned int flags);
> >
> > DESCRIPTION
> >     cachestat() queries the number of cached pages, number of dirty
> >     pages, number of pages marked for writeback, number of (recently)
> >     evicted pages, in the bytes range given by `off` and `len`.
>
> I suggest this be spelled out better: "number of evicted and number or
> recently evicted pages".
>
> I suggest this clearly tell readers what an "evicted" page is - they
> aren't kernel programmers!

Valid points - I'll try to explain this more clearly in the future
versions of this patch series, especially in the draft man page.

>
> What is the benefit of the "recently evicted" pages?  "recently" seems
> very vague - what use is this to anyone?

This eviction recency semantics comes from the LRU's refault
computation. Users of cachestat might be interested in two very
different questions:

1. How many pages are not resident in the page cache.
2. How many pages are recently evicted (recently enough that
    their refault will be construed as memory pressure).

The first question is answered with nr_evicted, whereas the second
is answered with nr_recently_evicted.

I will figure out a way to explain this better in the next version. Users
definitely do not need to know the nitty gritty details of LRU logic,
but they should know the general idea of each field at least.

>
> >     These values are returned in a cachestat struct, whose address is
> >     given by the `cstat` argument.
> >
> >     The `off` and `len` arguments must be non-negative integers. If
> >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> >     0, we will query in the range from `off` to the end of the file.
> >
> >     `cstat_size` allows users to obtain partial results. The syscall
> >     will copy the first `csstat_size` bytes to the specified userspace
> >     memory. `cstat_size` must be a non-negative value that is no larger
> >     than the current size of the cachestat struct.
> >
> >     The `flags` argument is unused for now, but is included for future
> >     extensibility. User should pass 0 (i.e no flag specified).
>
> Why is `flags' here?  We could add an unused flags arg to any syscall,
> but we don't.  What's the plan?

I included this field to ensure that cachestat can be extended safely,
especially when different users might want different things out of it.

For instance, in the future there might be new fields/computations
that are too heavy for certain use cases - a flag could be used to
disable/skip such fields/computations.

Another thing it might be used for is the huge page counting -
we have not implemented this in this version yet, but it might
introduce murky semantics to new/existing fields in struct
cachestat. Or maybe not - but worst case scenario we can
leave this decision to the users to decide through flags.

I'm sure there are more potential pitfalls that the flags could
save us from, but these are the two on top of my head.

>
> Are there security implications?  If I know that some process has a
> file open, I can use cachestat() to infer which parts of that file
> they're looking at (like mincore(), I guess).  And I can infer which
> parts they're writing to, unlike mincore().

This one, I'm not 100% sure, but it is a valid concern. Let me think
about it and discuss with more security-oriented minds before
responding to this.

>
> I suggest the [patch 1/4] fixup be separated from this series.

Sounds good! I'll loop Johannes in about this breakup as well.
Nhat Pham Dec. 22, 2022, 9:50 p.m. UTC | #8
On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > Implement a new syscall that queries cache state of a file and
> > summarizes the number of cached pages, number of dirty pages, number of
> > pages marked for writeback, number of (recently) evicted pages, etc. in
> > a given range.
> >
> > NAME
> >     cachestat - query the page cache status of a file.
> >
> > SYNOPSIS
> >     #include <sys/mman.h>
> >
> >     struct cachestat {
> >         __u64 nr_cache;
> >         __u64 nr_dirty;
> >         __u64 nr_writeback;
> >         __u64 nr_evicted;
> >         __u64 nr_recently_evicted;
> >     };
> >
> >     int cachestat(unsigned int fd, off_t off, size_t len,
> >           size_t cstat_size, struct cachestat *cstat,
> >           unsigned int flags);
> >
> > DESCRIPTION
> >     cachestat() queries the number of cached pages, number of dirty
> >     pages, number of pages marked for writeback, number of (recently)
> >     evicted pages, in the bytes range given by `off` and `len`.
> >
> >     These values are returned in a cachestat struct, whose address is
> >     given by the `cstat` argument.
> >
> >     The `off` and `len` arguments must be non-negative integers. If
> >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> >     0, we will query in the range from `off` to the end of the file.
> >
> >     `cstat_size` allows users to obtain partial results. The syscall
> >     will copy the first `csstat_size` bytes to the specified userspace
> >     memory. `cstat_size` must be a non-negative value that is no larger
> >     than the current size of the cachestat struct.
> >
> >     The `flags` argument is unused for now, but is included for future
> >     extensibility. User should pass 0 (i.e no flag specified).
> >
> > RETURN VALUE
> >     On success, cachestat returns 0. On error, -1 is returned, and errno
> >     is set to indicate the error.
> >
> > ERRORS
> >     EFAULT cstat points to an invalid address.
> >
> >     EINVAL invalid `cstat_size` or `flags`
> >
> >     EBADF  invalid file descriptor.
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
>
> Hi Nhat,
>
> Thanks for the tweaks. FWIW, this by and large looks reasonable to me.

Thanks for suggesting the tweaks! I tried refactoring it out and it
did look cleaner to me (and one fewer file that I have to keep
track of...)

> Just a couple more random nitty things, if you happen to care about
> them..
>
> >  arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
> >  arch/arm/tools/syscall.tbl                  |   1 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |   1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |   1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
> >  include/linux/fs.h                          |   3 +
> >  include/linux/syscalls.h                    |   3 +
> >  include/uapi/asm-generic/unistd.h           |   5 +-
> >  include/uapi/linux/mman.h                   |   9 ++
> >  init/Kconfig                                |  10 ++
> >  kernel/sys_ni.c                             |   1 +
> >  mm/filemap.c                                | 137 ++++++++++++++++++++
> >  20 files changed, 180 insertions(+), 1 deletion(-)
> >
> ...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 08341616ae7a..29ffb906caa4 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> ...
> > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> >       return try_to_free_buffers(folio);
> >  }
> >  EXPORT_SYMBOL(filemap_release_folio);
> > +
> > +#ifdef CONFIG_CACHESTAT
> > +/**
> > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > + * @mapping: The mapping to compute the statistics for.
> > + * @first_index:     The starting page cache index.
> > + * @last_index:      The final page index (inclusive).
> > + * @cs:      the cachestat struct to write the result to.
> > + *
> > + * This will query the page cache statistics of a mapping in the
> > + * page range of [first_index, last_index] (inclusive). THe statistics
> > + * queried include: number of dirty pages, number of pages marked for
> > + * writeback, and the number of (recently) evicted pages.
> > + */
> > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > +             pgoff_t last_index, struct cachestat *cs)
> > +{
>
> Not sure why the internal helper needs to be wrapped in a config option?
> Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> wrap the syscall bits with that..?

No particularly strong reasons - I was just bundling the two together
because I was not entirely sure if it's interesting or has any use
case outside of the syscall itself.

Do you have something in mind?

>
> I would also think it might make things simpler to split out all the
> syscall bits into a separate patch from filemap_cachestat().

Same as above.

>
> > +     XA_STATE(xas, &mapping->i_pages, first_index);
> > +     struct folio *folio;
> > +
> > +     rcu_read_lock();
> > +     xas_for_each(&xas, folio, last_index) {
> > +             unsigned long nr_pages;
> > +             pgoff_t folio_first_index, folio_last_index;
> > +
> > +             if (xas_retry(&xas, folio))
> > +                     continue;
> > +
> > +             nr_pages = folio_nr_pages(folio);
> > +             folio_first_index = folio_pgoff(folio);
> > +             folio_last_index = folio_first_index + nr_pages - 1;
> > +
> > +             /* Folios might straddle the range boundaries, only count covered subpages */
> > +             if (folio_first_index < first_index)
> > +                     nr_pages -= first_index - folio_first_index;
> > +
> > +             if (folio_last_index > last_index)
> > +                     nr_pages -= folio_last_index - last_index;
> > +
> > +             if (xa_is_value(folio)) {
> > +                     /* page is evicted */
> > +                     void *shadow = (void *)folio;
> > +                     bool workingset; /* not used */
> > +
> > +                     cs->nr_evicted += nr_pages;
> > +
> > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > +                     if (shmem_mapping(mapping)) {
> > +                             /* shmem file - in swap cache */
> > +                             swp_entry_t swp = radix_to_swp_entry(folio);
> > +
> > +                             shadow = get_shadow_from_swap_cache(swp);
> > +                     }
> > +#endif
> > +                     if (workingset_test_recent(shadow, true, &workingset))
> > +                             cs->nr_recently_evicted += nr_pages;
> > +
> > +                     continue;
> > +             }
> > +
> > +             /* page is in cache */
> > +             cs->nr_cache += nr_pages;
> > +
> > +             if (folio_test_dirty(folio))
> > +                     cs->nr_dirty += nr_pages;
> > +
> > +             if (folio_test_writeback(folio))
> > +                     cs->nr_writeback += nr_pages;
>
> I'm not sure this is an issue right now (or if it will ever be for your
> use cases), but from a filesystem perspective it is possible to have
> large or variable sized folios in cache. At the moment I think the fs
> (or XFS+iomap at least) manages dirty/writeback state at folio
> granularity, but that may change in the near future if/when iomap
> sub-page dirty tracking comes along. I suspect that means it may become
> possible to have a large folio of some N number of pages where only a
> subset of those pages are actually in dirty/writeback state, and thus
> introduces some inaccuracy here because this assumes that folio state
> applies to folio_nr_pages() worth of pages. Just something to be aware
> of..
>
> Brian

Oof, I coded this with the mental framework of folio-as-a-unit, and
assumed that the dirty/writeback state is managed at the granularity of folio.
Thanks for bringing this up Brian! I'll have to watch out for this as iomap
evolves (and the subpage tracking becomes a thing).

>
> > +     }
> > +     rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL(filemap_cachestat);
> > +
> > +/*
> > + * The cachestat(5) system call.
> > + *
> > + * cachestat() returns the page cache statistics of a file in the
> > + * bytes specified by `off` and `len`: number of cached pages,
> > + * number of dirty pages, number of pages marked for writeback,
> > + * number of (recently) evicted pages.
> > + *
> > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > + * we will query in the range from `off` to the end of the file.
> > + *
> > + * `cstat_size` allows users to obtain partial results. The syscall
> > + * will copy the first `csstat_size` bytes to the specified userspace
> > + * memory. It also makes the cachestat struct extensible - new fields
> > + * can be added in the future without breaking existing usage.
> > + * `cstat_size` must be a non-negative value that is no larger than
> > + * the current size of the cachestat struct.
> > + *
> > + * The `flags` argument is unused for now, but is included for future
> > + * extensibility. User should pass 0 (i.e no flag specified).
> > + *
> > + * Because the status of a page can change after cachestat() checks it
> > + * but before it returns to the application, the returned values may
> > + * contain stale information.
> > + *
> > + * return values:
> > + *  zero    - success
> > + *  -EFAULT - cstat points to an illegal address
> > + *  -EINVAL - invalid arguments
> > + *  -EBADF   - invalid file descriptor
> > + */
> > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > +             size_t, cstat_size, struct cachestat __user *, cstat,
> > +             unsigned int, flags)
> > +{
> > +     struct fd f = fdget(fd);
> > +     struct address_space *mapping;
> > +     struct cachestat cs;
> > +     pgoff_t first_index = off >> PAGE_SHIFT;
> > +     pgoff_t last_index =
> > +             len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > +
> > +     if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > +             return -EINVAL;
> > +
> > +     if (!f.file)
> > +             return -EBADF;
> > +
> > +     memset(&cs, 0, sizeof(struct cachestat));
> > +     mapping = f.file->f_mapping;
> > +     filemap_cachestat(mapping, first_index, last_index, &cs);
> > +     fdput(f);
> > +
> > +     if (copy_to_user(cstat, &cs, cstat_size))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +#endif /* CONFIG_CACHESTAT */
> > --
> > 2.30.2
> >
>
Nhat Pham Dec. 31, 2022, 8:35 a.m. UTC | #9
On Fri, Dec 16, 2022 at 2:27 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Nhat,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on shuah-kselftest/next]
> [also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
> patch link:    https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
> patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
> config: s390-allmodconfig
> compiler: s390-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
>         git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
>
> All error/warnings (new ones prefixed by >>):
>
>    In file included from include/uapi/asm-generic/mman.h:5,
>                     from ./arch/s390/include/generated/uapi/asm/mman.h:1,
>                     from include/uapi/linux/mman.h:5,
>                     from include/linux/fs.h:49,
>                     from include/linux/huge_mm.h:8,
>                     from include/linux/mm.h:737,
>                     from include/linux/kvm_host.h:16,
>                     from arch/s390/kvm/kvm-s390.h:17,
>                     from arch/s390/kvm/gaccess.c:16:
> >> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant
>       16 | #define PROT_NONE       0x0             /* page can not be accessed */
>          |                         ^~~
>    arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE'
>      493 |         PROT_NONE,
>          |         ^~~~~~~~~
>    arch/s390/kvm/gaccess.c: In function 'trans_exc_ending':
> >> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value
>      516 |                 case PROT_TYPE_LA:
>          |                 ^~~~
>    arch/s390/kvm/gaccess.c:509:17: note: previously used here
>      509 |                 case PROT_NONE:
>          |                 ^~~~
>    In file included from include/linux/compiler_types.h:79,
>                     from <command-line>:
> >> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label
>      223 | # define fallthrough                    __attribute__((__fallthrough__))
>          |                                         ^~~~~~~~~~~~~
>    arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough'
>      515 |                         fallthrough;
>          |                         ^~~~~~~~~~~
>
>
> vim +16 include/uapi/asm-generic/mman-common.h
>
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15   9
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  10  #define PROT_READ      0x1             /* page can be read */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  11  #define PROT_WRITE     0x2             /* page can be written */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  12  #define PROT_EXEC      0x4             /* page can be executed */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  13  #define PROT_SEM       0x8             /* page may be used for atomic ops */
> d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin        2019-12-11  14  /*                     0x10               reserved for arch-specific use */
> d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin        2019-12-11  15  /*                     0x20               reserved for arch-specific use */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15 @16  #define PROT_NONE      0x0             /* page can not be accessed */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  17  #define PROT_GROWSDOWN 0x01000000      /* mprotect flag: extend change to start of growsdown vma */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  18  #define PROT_GROWSUP   0x02000000      /* mprotect flag: extend change to end of growsup vma */
> 5f6164f3092832 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  19
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp

These build errors (and the ones reported by the later kernel test robot emails)
seem to come from my refactoring, which adds:

#include <uapi/linux/mman.h>

to include/linux/fs.h. Several other files (which chain-include fs.h)
also define
several same constants elsewhere. This include seems unnecessary, as
I can just replace it with:

struct cachestat;

This should be fixed in the next version of the patch series.
Brian Foster Jan. 3, 2023, 6:43 p.m. UTC | #10
On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote:
> On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > > Implement a new syscall that queries cache state of a file and
> > > summarizes the number of cached pages, number of dirty pages, number of
> > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > a given range.
> > >
> > > NAME
> > >     cachestat - query the page cache status of a file.
> > >
> > > SYNOPSIS
> > >     #include <sys/mman.h>
> > >
> > >     struct cachestat {
> > >         __u64 nr_cache;
> > >         __u64 nr_dirty;
> > >         __u64 nr_writeback;
> > >         __u64 nr_evicted;
> > >         __u64 nr_recently_evicted;
> > >     };
> > >
> > >     int cachestat(unsigned int fd, off_t off, size_t len,
> > >           size_t cstat_size, struct cachestat *cstat,
> > >           unsigned int flags);
> > >
> > > DESCRIPTION
> > >     cachestat() queries the number of cached pages, number of dirty
> > >     pages, number of pages marked for writeback, number of (recently)
> > >     evicted pages, in the bytes range given by `off` and `len`.
> > >
> > >     These values are returned in a cachestat struct, whose address is
> > >     given by the `cstat` argument.
> > >
> > >     The `off` and `len` arguments must be non-negative integers. If
> > >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > >     0, we will query in the range from `off` to the end of the file.
> > >
> > >     `cstat_size` allows users to obtain partial results. The syscall
> > >     will copy the first `csstat_size` bytes to the specified userspace
> > >     memory. `cstat_size` must be a non-negative value that is no larger
> > >     than the current size of the cachestat struct.
> > >
> > >     The `flags` argument is unused for now, but is included for future
> > >     extensibility. User should pass 0 (i.e no flag specified).
> > >
> > > RETURN VALUE
> > >     On success, cachestat returns 0. On error, -1 is returned, and errno
> > >     is set to indicate the error.
> > >
> > > ERRORS
> > >     EFAULT cstat points to an invalid address.
> > >
> > >     EINVAL invalid `cstat_size` or `flags`
> > >
> > >     EBADF  invalid file descriptor.
> > >
> > > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > > ---
> >
> > Hi Nhat,
> >
> > Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
> 
> Thanks for suggesting the tweaks! I tried refactoring it out and it
> did look cleaner to me (and one fewer file that I have to keep
> track of...)
> 
> > Just a couple more random nitty things, if you happen to care about
> > them..
> >
> > >  arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
> > >  arch/arm/tools/syscall.tbl                  |   1 +
> > >  arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
> > >  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
> > >  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
> > >  arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
> > >  arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
> > >  arch/s390/kernel/syscalls/syscall.tbl       |   1 +
> > >  arch/sh/kernel/syscalls/syscall.tbl         |   1 +
> > >  arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
> > >  arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
> > >  arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
> > >  arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
> > >  include/linux/fs.h                          |   3 +
> > >  include/linux/syscalls.h                    |   3 +
> > >  include/uapi/asm-generic/unistd.h           |   5 +-
> > >  include/uapi/linux/mman.h                   |   9 ++
> > >  init/Kconfig                                |  10 ++
> > >  kernel/sys_ni.c                             |   1 +
> > >  mm/filemap.c                                | 137 ++++++++++++++++++++
> > >  20 files changed, 180 insertions(+), 1 deletion(-)
> > >
> > ...
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..29ffb906caa4 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > ...
> > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> > >       return try_to_free_buffers(folio);
> > >  }
> > >  EXPORT_SYMBOL(filemap_release_folio);
> > > +
> > > +#ifdef CONFIG_CACHESTAT
> > > +/**
> > > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > > + * @mapping: The mapping to compute the statistics for.
> > > + * @first_index:     The starting page cache index.
> > > + * @last_index:      The final page index (inclusive).
> > > + * @cs:      the cachestat struct to write the result to.
> > > + *
> > > + * This will query the page cache statistics of a mapping in the
> > > + * page range of [first_index, last_index] (inclusive). THe statistics
> > > + * queried include: number of dirty pages, number of pages marked for
> > > + * writeback, and the number of (recently) evicted pages.
> > > + */
> > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > > +             pgoff_t last_index, struct cachestat *cs)
> > > +{
> >
> > Not sure why the internal helper needs to be wrapped in a config option?
> > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> > wrap the syscall bits with that..?
> 
> No particularly strong reasons - I was just bundling the two together
> because I was not entirely sure if it's interesting or has any use
> case outside of the syscall itself.
> 
> Do you have something in mind?
> 

Not immediately, though it looks like a straightforward and potentially
useful helper. It wouldn't surprise me if it grew more in-kernel users
eventually. :)

That said, I only suggested this for cleanup/consistency reasons. I.e.,
it seems there is plenty of precedent for CONFIG_*_SYSCALL config
options, and the one or two I happened to peek at looked like they only
wrapped the syscall bits as opposed to the entire implementation (which
also seems like a clean and elegant approach to me). Somebody could
easily come along and make that change later if/when they might want to
use the helper (though it might be annoying to change the name of the
config option), so I have no strong opinion on either suggestion.

Brian

> >
> > I would also think it might make things simpler to split out all the
> > syscall bits into a separate patch from filemap_cachestat().
> 
> Same as above.
> 
> >
> > > +     XA_STATE(xas, &mapping->i_pages, first_index);
> > > +     struct folio *folio;
> > > +
> > > +     rcu_read_lock();
> > > +     xas_for_each(&xas, folio, last_index) {
> > > +             unsigned long nr_pages;
> > > +             pgoff_t folio_first_index, folio_last_index;
> > > +
> > > +             if (xas_retry(&xas, folio))
> > > +                     continue;
> > > +
> > > +             nr_pages = folio_nr_pages(folio);
> > > +             folio_first_index = folio_pgoff(folio);
> > > +             folio_last_index = folio_first_index + nr_pages - 1;
> > > +
> > > +             /* Folios might straddle the range boundaries, only count covered subpages */
> > > +             if (folio_first_index < first_index)
> > > +                     nr_pages -= first_index - folio_first_index;
> > > +
> > > +             if (folio_last_index > last_index)
> > > +                     nr_pages -= folio_last_index - last_index;
> > > +
> > > +             if (xa_is_value(folio)) {
> > > +                     /* page is evicted */
> > > +                     void *shadow = (void *)folio;
> > > +                     bool workingset; /* not used */
> > > +
> > > +                     cs->nr_evicted += nr_pages;
> > > +
> > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > > +                     if (shmem_mapping(mapping)) {
> > > +                             /* shmem file - in swap cache */
> > > +                             swp_entry_t swp = radix_to_swp_entry(folio);
> > > +
> > > +                             shadow = get_shadow_from_swap_cache(swp);
> > > +                     }
> > > +#endif
> > > +                     if (workingset_test_recent(shadow, true, &workingset))
> > > +                             cs->nr_recently_evicted += nr_pages;
> > > +
> > > +                     continue;
> > > +             }
> > > +
> > > +             /* page is in cache */
> > > +             cs->nr_cache += nr_pages;
> > > +
> > > +             if (folio_test_dirty(folio))
> > > +                     cs->nr_dirty += nr_pages;
> > > +
> > > +             if (folio_test_writeback(folio))
> > > +                     cs->nr_writeback += nr_pages;
> >
> > I'm not sure this is an issue right now (or if it will ever be for your
> > use cases), but from a filesystem perspective it is possible to have
> > large or variable sized folios in cache. At the moment I think the fs
> > (or XFS+iomap at least) manages dirty/writeback state at folio
> > granularity, but that may change in the near future if/when iomap
> > sub-page dirty tracking comes along. I suspect that means it may become
> > possible to have a large folio of some N number of pages where only a
> > subset of those pages are actually in dirty/writeback state, and thus
> > introduces some inaccuracy here because this assumes that folio state
> > applies to folio_nr_pages() worth of pages. Just something to be aware
> > of..
> >
> > Brian
> 
> Oof, I coded this with the mental framework of folio-as-a-unit, and
> assumed that the dirty/writeback state is managed at the granularity of folio.
> Thanks for bringing this up Brian! I'll have to watch out for this as iomap
> evolves (and the subpage tracking becomes a thing).
> 
> >
> > > +     }
> > > +     rcu_read_unlock();
> > > +}
> > > +EXPORT_SYMBOL(filemap_cachestat);
> > > +
> > > +/*
> > > + * The cachestat(5) system call.
> > > + *
> > > + * cachestat() returns the page cache statistics of a file in the
> > > + * bytes specified by `off` and `len`: number of cached pages,
> > > + * number of dirty pages, number of pages marked for writeback,
> > > + * number of (recently) evicted pages.
> > > + *
> > > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > > + * we will query in the range from `off` to the end of the file.
> > > + *
> > > + * `cstat_size` allows users to obtain partial results. The syscall
> > > + * will copy the first `csstat_size` bytes to the specified userspace
> > > + * memory. It also makes the cachestat struct extensible - new fields
> > > + * can be added in the future without breaking existing usage.
> > > + * `cstat_size` must be a non-negative value that is no larger than
> > > + * the current size of the cachestat struct.
> > > + *
> > > + * The `flags` argument is unused for now, but is included for future
> > > + * extensibility. User should pass 0 (i.e no flag specified).
> > > + *
> > > + * Because the status of a page can change after cachestat() checks it
> > > + * but before it returns to the application, the returned values may
> > > + * contain stale information.
> > > + *
> > > + * return values:
> > > + *  zero    - success
> > > + *  -EFAULT - cstat points to an illegal address
> > > + *  -EINVAL - invalid arguments
> > > + *  -EBADF   - invalid file descriptor
> > > + */
> > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > > +             size_t, cstat_size, struct cachestat __user *, cstat,
> > > +             unsigned int, flags)
> > > +{
> > > +     struct fd f = fdget(fd);
> > > +     struct address_space *mapping;
> > > +     struct cachestat cs;
> > > +     pgoff_t first_index = off >> PAGE_SHIFT;
> > > +     pgoff_t last_index =
> > > +             len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > > +
> > > +     if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > > +             return -EINVAL;
> > > +
> > > +     if (!f.file)
> > > +             return -EBADF;
> > > +
> > > +     memset(&cs, 0, sizeof(struct cachestat));
> > > +     mapping = f.file->f_mapping;
> > > +     filemap_cachestat(mapping, first_index, last_index, &cs);
> > > +     fdput(f);
> > > +
> > > +     if (copy_to_user(cstat, &cs, cstat_size))
> > > +             return -EFAULT;
> > > +
> > > +     return 0;
> > > +}
> > > +#endif /* CONFIG_CACHESTAT */
> > > --
> > > 2.30.2
> > >
> >
>
Nhat Pham Jan. 4, 2023, 2:18 p.m. UTC | #11
On Tue, Jan 3, 2023 at 10:42 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote:
> > On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > > > Implement a new syscall that queries cache state of a file and
> > > > summarizes the number of cached pages, number of dirty pages, number of
> > > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > > a given range.
> > > >
> > > > NAME
> > > >     cachestat - query the page cache status of a file.
> > > >
> > > > SYNOPSIS
> > > >     #include <sys/mman.h>
> > > >
> > > >     struct cachestat {
> > > >         __u64 nr_cache;
> > > >         __u64 nr_dirty;
> > > >         __u64 nr_writeback;
> > > >         __u64 nr_evicted;
> > > >         __u64 nr_recently_evicted;
> > > >     };
> > > >
> > > >     int cachestat(unsigned int fd, off_t off, size_t len,
> > > >           size_t cstat_size, struct cachestat *cstat,
> > > >           unsigned int flags);
> > > >
> > > > DESCRIPTION
> > > >     cachestat() queries the number of cached pages, number of dirty
> > > >     pages, number of pages marked for writeback, number of (recently)
> > > >     evicted pages, in the bytes range given by `off` and `len`.
> > > >
> > > >     These values are returned in a cachestat struct, whose address is
> > > >     given by the `cstat` argument.
> > > >
> > > >     The `off` and `len` arguments must be non-negative integers. If
> > > >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > > >     0, we will query in the range from `off` to the end of the file.
> > > >
> > > >     `cstat_size` allows users to obtain partial results. The syscall
> > > >     will copy the first `csstat_size` bytes to the specified userspace
> > > >     memory. `cstat_size` must be a non-negative value that is no larger
> > > >     than the current size of the cachestat struct.
> > > >
> > > >     The `flags` argument is unused for now, but is included for future
> > > >     extensibility. User should pass 0 (i.e no flag specified).
> > > >
> > > > RETURN VALUE
> > > >     On success, cachestat returns 0. On error, -1 is returned, and errno
> > > >     is set to indicate the error.
> > > >
> > > > ERRORS
> > > >     EFAULT cstat points to an invalid address.
> > > >
> > > >     EINVAL invalid `cstat_size` or `flags`
> > > >
> > > >     EBADF  invalid file descriptor.
> > > >
> > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > > > ---
> > >
> > > Hi Nhat,
> > >
> > > Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
> >
> > Thanks for suggesting the tweaks! I tried refactoring it out and it
> > did look cleaner to me (and one fewer file that I have to keep
> > track of...)
> >
> > > Just a couple more random nitty things, if you happen to care about
> > > them..
> > >
> > > >  arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
> > > >  arch/arm/tools/syscall.tbl                  |   1 +
> > > >  arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
> > > >  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
> > > >  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
> > > >  arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
> > > >  arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
> > > >  arch/s390/kernel/syscalls/syscall.tbl       |   1 +
> > > >  arch/sh/kernel/syscalls/syscall.tbl         |   1 +
> > > >  arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
> > > >  arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
> > > >  arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
> > > >  arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
> > > >  include/linux/fs.h                          |   3 +
> > > >  include/linux/syscalls.h                    |   3 +
> > > >  include/uapi/asm-generic/unistd.h           |   5 +-
> > > >  include/uapi/linux/mman.h                   |   9 ++
> > > >  init/Kconfig                                |  10 ++
> > > >  kernel/sys_ni.c                             |   1 +
> > > >  mm/filemap.c                                | 137 ++++++++++++++++++++
> > > >  20 files changed, 180 insertions(+), 1 deletion(-)
> > > >
> > > ...
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 08341616ae7a..29ffb906caa4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > ...
> > > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> > > >       return try_to_free_buffers(folio);
> > > >  }
> > > >  EXPORT_SYMBOL(filemap_release_folio);
> > > > +
> > > > +#ifdef CONFIG_CACHESTAT
> > > > +/**
> > > > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > > > + * @mapping: The mapping to compute the statistics for.
> > > > + * @first_index:     The starting page cache index.
> > > > + * @last_index:      The final page index (inclusive).
> > > > + * @cs:      the cachestat struct to write the result to.
> > > > + *
> > > > + * This will query the page cache statistics of a mapping in the
> > > > + * page range of [first_index, last_index] (inclusive). THe statistics
> > > > + * queried include: number of dirty pages, number of pages marked for
> > > > + * writeback, and the number of (recently) evicted pages.
> > > > + */
> > > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > > > +             pgoff_t last_index, struct cachestat *cs)
> > > > +{
> > >
> > > Not sure why the internal helper needs to be wrapped in a config option?
> > > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> > > wrap the syscall bits with that..?
> >
> > No particularly strong reasons - I was just bundling the two together
> > because I was not entirely sure if it's interesting or has any use
> > case outside of the syscall itself.
> >
> > Do you have something in mind?
> >
>
> Not immediately, though it looks like a straightforward and potentially
> useful helper. It wouldn't surprise me if it grew more in-kernel users
> eventually. :)

Sounds good! I'll move it out in the next version, unless there is some
unexpected issue with that.

> That said, I only suggested this for cleanup/consistency reasons. I.e.,
> it seems there is plenty of precedent for CONFIG_*_SYSCALL config
> options, and the one or two I happened to peek at looked like they only
> wrapped the syscall bits as opposed to the entire implementation (which
> also seems like a clean and elegant approach to me). Somebody could
> easily come along and make that change later if/when they might want to
> use the helper (though it might be annoying to change the name of the
> config option), so I have no strong opinion on either suggestion.
>
> Brian
>
> > >
> > > I would also think it might make things simpler to split out all the
> > > syscall bits into a separate patch from filemap_cachestat().
> >
> > Same as above.
> >
> > >
> > > > +     XA_STATE(xas, &mapping->i_pages, first_index);
> > > > +     struct folio *folio;
> > > > +
> > > > +     rcu_read_lock();
> > > > +     xas_for_each(&xas, folio, last_index) {
> > > > +             unsigned long nr_pages;
> > > > +             pgoff_t folio_first_index, folio_last_index;
> > > > +
> > > > +             if (xas_retry(&xas, folio))
> > > > +                     continue;
> > > > +
> > > > +             nr_pages = folio_nr_pages(folio);
> > > > +             folio_first_index = folio_pgoff(folio);
> > > > +             folio_last_index = folio_first_index + nr_pages - 1;
> > > > +
> > > > +             /* Folios might straddle the range boundaries, only count covered subpages */
> > > > +             if (folio_first_index < first_index)
> > > > +                     nr_pages -= first_index - folio_first_index;
> > > > +
> > > > +             if (folio_last_index > last_index)
> > > > +                     nr_pages -= folio_last_index - last_index;
> > > > +
> > > > +             if (xa_is_value(folio)) {
> > > > +                     /* page is evicted */
> > > > +                     void *shadow = (void *)folio;
> > > > +                     bool workingset; /* not used */
> > > > +
> > > > +                     cs->nr_evicted += nr_pages;
> > > > +
> > > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > > > +                     if (shmem_mapping(mapping)) {
> > > > +                             /* shmem file - in swap cache */
> > > > +                             swp_entry_t swp = radix_to_swp_entry(folio);
> > > > +
> > > > +                             shadow = get_shadow_from_swap_cache(swp);
> > > > +                     }
> > > > +#endif
> > > > +                     if (workingset_test_recent(shadow, true, &workingset))
> > > > +                             cs->nr_recently_evicted += nr_pages;
> > > > +
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > +             /* page is in cache */
> > > > +             cs->nr_cache += nr_pages;
> > > > +
> > > > +             if (folio_test_dirty(folio))
> > > > +                     cs->nr_dirty += nr_pages;
> > > > +
> > > > +             if (folio_test_writeback(folio))
> > > > +                     cs->nr_writeback += nr_pages;
> > >
> > > I'm not sure this is an issue right now (or if it will ever be for your
> > > use cases), but from a filesystem perspective it is possible to have
> > > large or variable sized folios in cache. At the moment I think the fs
> > > (or XFS+iomap at least) manages dirty/writeback state at folio
> > > granularity, but that may change in the near future if/when iomap
> > > sub-page dirty tracking comes along. I suspect that means it may become
> > > possible to have a large folio of some N number of pages where only a
> > > subset of those pages are actually in dirty/writeback state, and thus
> > > introduces some inaccuracy here because this assumes that folio state
> > > applies to folio_nr_pages() worth of pages. Just something to be aware
> > > of..
> > >
> > > Brian
> >
> > Oof, I coded this with the mental framework of folio-as-a-unit, and
> > assumed that the dirty/writeback state is managed at the granularity of folio.
> > Thanks for bringing this up Brian! I'll have to watch out for this as iomap
> > evolves (and the subpage tracking becomes a thing).
> >
> > >
> > > > +     }
> > > > +     rcu_read_unlock();
> > > > +}
> > > > +EXPORT_SYMBOL(filemap_cachestat);
> > > > +
> > > > +/*
> > > > + * The cachestat(5) system call.
> > > > + *
> > > > + * cachestat() returns the page cache statistics of a file in the
> > > > + * bytes specified by `off` and `len`: number of cached pages,
> > > > + * number of dirty pages, number of pages marked for writeback,
> > > > + * number of (recently) evicted pages.
> > > > + *
> > > > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > > > + * we will query in the range from `off` to the end of the file.
> > > > + *
> > > > + * `cstat_size` allows users to obtain partial results. The syscall
> > > > + * will copy the first `csstat_size` bytes to the specified userspace
> > > > + * memory. It also makes the cachestat struct extensible - new fields
> > > > + * can be added in the future without breaking existing usage.
> > > > + * `cstat_size` must be a non-negative value that is no larger than
> > > > + * the current size of the cachestat struct.
> > > > + *
> > > > + * The `flags` argument is unused for now, but is included for future
> > > > + * extensibility. User should pass 0 (i.e no flag specified).
> > > > + *
> > > > + * Because the status of a page can change after cachestat() checks it
> > > > + * but before it returns to the application, the returned values may
> > > > + * contain stale information.
> > > > + *
> > > > + * return values:
> > > > + *  zero    - success
> > > > + *  -EFAULT - cstat points to an illegal address
> > > > + *  -EINVAL - invalid arguments
> > > > + *  -EBADF   - invalid file descriptor
> > > > + */
> > > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > > > +             size_t, cstat_size, struct cachestat __user *, cstat,
> > > > +             unsigned int, flags)
> > > > +{
> > > > +     struct fd f = fdget(fd);
> > > > +     struct address_space *mapping;
> > > > +     struct cachestat cs;
> > > > +     pgoff_t first_index = off >> PAGE_SHIFT;
> > > > +     pgoff_t last_index =
> > > > +             len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > > > +
> > > > +     if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > > > +             return -EINVAL;
> > > > +
> > > > +     if (!f.file)
> > > > +             return -EBADF;
> > > > +
> > > > +     memset(&cs, 0, sizeof(struct cachestat));
> > > > +     mapping = f.file->f_mapping;
> > > > +     filemap_cachestat(mapping, first_index, last_index, &cs);
> > > > +     fdput(f);
> > > > +
> > > > +     if (copy_to_user(cstat, &cs, cstat_size))
> > > > +             return -EFAULT;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +#endif /* CONFIG_CACHESTAT */
> > > > --
> > > > 2.30.2
> > > >
> > >
> >
>
Nhat Pham Jan. 11, 2023, 6 p.m. UTC | #12
On Wed, Dec 21, 2022 at 4:37 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > > Implement a new syscall that queries cache state of a file and
> > > summarizes the number of cached pages, number of dirty pages, number of
> > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > a given range.
> > >
> > > NAME
> > >     cachestat - query the page cache status of a file.
> > >
> > > SYNOPSIS
> > >     #include <sys/mman.h>
> > >
> > >     struct cachestat {
> > >         __u64 nr_cache;
> > >         __u64 nr_dirty;
> > >         __u64 nr_writeback;
> > >         __u64 nr_evicted;
> > >         __u64 nr_recently_evicted;
> > >     };
> > >
> > >     int cachestat(unsigned int fd, off_t off, size_t len,
> > >           size_t cstat_size, struct cachestat *cstat,
> > >           unsigned int flags);
> > >
> > > DESCRIPTION
> > >     cachestat() queries the number of cached pages, number of dirty
> > >     pages, number of pages marked for writeback, number of (recently)
> > >     evicted pages, in the bytes range given by `off` and `len`.
> >
> > I suggest this be spelled out better: "number of evicted and number or
> > recently evicted pages".
> >
> > I suggest this clearly tell readers what an "evicted" page is - they
> > aren't kernel programmers!
>
> Valid points - I'll try to explain this more clearly in the future
> versions of this patch series, especially in the draft man page.
>
> >
> > What is the benefit of the "recently evicted" pages?  "recently" seems
> > very vague - what use is this to anyone?
>
> This eviction recency semantics comes from the LRU's refault
> computation. Users of cachestat might be interested in two very
> different questions:
>
> 1. How many pages are not resident in the page cache.
> 2. How many pages are recently evicted (recently enough that
>     their refault will be construed as memory pressure).
>
> The first question is answered with nr_evicted, whereas the second
> is answered with nr_recently_evicted.
>
> I will figure out a way to explain this better in the next version. Users
> definitely do not need to know the nitty gritty details of LRU logic,
> but they should know the general idea of each field at least.
>
> >
> > >     These values are returned in a cachestat struct, whose address is
> > >     given by the `cstat` argument.
> > >
> > >     The `off` and `len` arguments must be non-negative integers. If
> > >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > >     0, we will query in the range from `off` to the end of the file.
> > >
> > >     `cstat_size` allows users to obtain partial results. The syscall
> > >     will copy the first `csstat_size` bytes to the specified userspace
> > >     memory. `cstat_size` must be a non-negative value that is no larger
> > >     than the current size of the cachestat struct.
> > >
> > >     The `flags` argument is unused for now, but is included for future
> > >     extensibility. User should pass 0 (i.e no flag specified).
> >
> > Why is `flags' here?  We could add an unused flags arg to any syscall,
> > but we don't.  What's the plan?
>
> I included this field to ensure that cachestat can be extended safely,
> especially when different users might want different things out of it.
>
> For instance, in the future there might be new fields/computations
> that are too heavy for certain use cases - a flag could be used to
> disable/skip such fields/computations.
>
> Another thing it might be used for is the huge page counting -
> we have not implemented this in this version yet, but it might
> introduce murky semantics to new/existing fields in struct
> cachestat. Or maybe not - but worst case scenario we can
> leave this decision to the users to decide through flags.
>
> I'm sure there are more potential pitfalls that the flags could
> save us from, but these are the two on top of my head.
>
> >
> > Are there security implications?  If I know that some process has a
> > file open, I can use cachestat() to infer which parts of that file
> > they're looking at (like mincore(), I guess).  And I can infer which
> > parts they're writing to, unlike mincore().
>
> This one, I'm not 100% sure, but it is a valid concern. Let me think
> about it and discuss with more security-oriented minds before
> responding to this.

Hmm I've given it some more thought. The syscall does not seem to
expose any extra security issue, given that the user already has
read permission to the file itself (since they have an fd to that file).
This means that the user can already know the underlying data in its
entirety, which seems like much more information (and as a result,
security risk) than the cache status itself.

Do you have something concrete in mind that I might have missed?

>
> >
> > I suggest the [patch 1/4] fixup be separated from this series.
>
> Sounds good! I'll loop Johannes in about this breakup as well.
diff mbox series

Patch

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..1f13995d00d7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@ 
 558	common	process_mrelease		sys_process_mrelease
 559	common  futex_waitv                     sys_futex_waitv
 560	common	set_mempolicy_home_node		sys_ni_syscall
+561	common	cachestat			sys_cachestat
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..8ebed8a13874 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..f8c74ffeeefb 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..4f504783371f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..858d22bf275c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 8a99c998da9b..7c84a72306d1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -448,3 +448,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 2bca64f96164..937460f0a8ec 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -530,3 +530,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 799147658dee..7df0329d46cb 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@ 
 448  common	process_mrelease	sys_process_mrelease		sys_process_mrelease
 449  common	futex_waitv		sys_futex_waitv			sys_futex_waitv
 450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
+451  common	cachestat		sys_cachestat			sys_cachestat
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 2de85c977f54..97377e8c5025 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4398cc6fb68d..faa835f3c54a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..bc0a3c941b35 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@ 
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	cachestat		sys_cachestat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..227538b0ce80 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	cachestat		sys_cachestat
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 52c94ab5c205..2b69c3c035b6 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -421,3 +421,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	cachestat			sys_cachestat
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..8902799121c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -46,6 +46,7 @@ 
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
+#include <uapi/linux/mman.h>
 
 struct backing_dev_info;
 struct bdi_writeback;
@@ -830,6 +831,8 @@  void filemap_invalidate_lock_two(struct address_space *mapping1,
 				 struct address_space *mapping2);
 void filemap_invalidate_unlock_two(struct address_space *mapping1,
 				   struct address_space *mapping2);
+void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
+		pgoff_t last_index, struct cachestat *cs);
 
 
 /*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..50f8c6999d99 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@  struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct cachestat;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1056,6 +1057,8 @@  asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_cachestat(unsigned int fd, off_t off, size_t len,
+		size_t cstat_size, struct cachestat __user *cstat, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..cd639fae9086 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_cachestat 451
+__SYSCALL(__NR_cachestat, sys_cachestat)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..fe03ed0b7587 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -4,6 +4,7 @@ 
 
 #include <asm/mman.h>
 #include <asm-generic/hugetlb_encode.h>
+#include <linux/types.h>
 
 #define MREMAP_MAYMOVE		1
 #define MREMAP_FIXED		2
@@ -41,4 +42,12 @@ 
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+struct cachestat {
+	__u64 nr_cache;
+	__u64 nr_dirty;
+	__u64 nr_writeback;
+	__u64 nr_evicted;
+	__u64 nr_recently_evicted;
+};
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/init/Kconfig b/init/Kconfig
index 694f7c160c9c..ecc4f781dd6c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1798,6 +1798,16 @@  config RSEQ
 
 	  If unsure, say Y.
 
+config CACHESTAT
+	bool "Enable cachestat() system call" if EXPERT
+	default y
+	help
+	  Enable the cachestat system call, which queries the page cache
+	  statistics of a file (number of cached pages, dirty pages,
+	  pages marked for writeback, (recently) evicted pages).
+
+	  If unsure say Y here.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enabled debugging of rseq() system call" if EXPERT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..04bfb1e4d377 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -299,6 +299,7 @@  COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
+COND_SYSCALL(cachestat);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..29ffb906caa4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -22,6 +22,7 @@ 
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/syscalls.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
@@ -55,6 +56,9 @@ 
 #include <linux/buffer_head.h> /* for try_to_free_buffers */
 
 #include <asm/mman.h>
+#include <uapi/linux/mman.h>
+
+#include "swap.h"
 
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -3949,3 +3953,136 @@  bool filemap_release_folio(struct folio *folio, gfp_t gfp)
 	return try_to_free_buffers(folio);
 }
 EXPORT_SYMBOL(filemap_release_folio);
+
+#ifdef CONFIG_CACHESTAT
+/**
+ * filemap_cachestat() - compute the page cache statistics of a mapping
+ * @mapping:	The mapping to compute the statistics for.
+ * @first_index:	The starting page cache index.
+ * @last_index:	The final page index (inclusive).
+ * @cs:	the cachestat struct to write the result to.
+ *
+ * This will query the page cache statistics of a mapping in the
+ * page range of [first_index, last_index] (inclusive). THe statistics
+ * queried include: number of dirty pages, number of pages marked for
+ * writeback, and the number of (recently) evicted pages.
+ */
+void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
+		pgoff_t last_index, struct cachestat *cs)
+{
+	XA_STATE(xas, &mapping->i_pages, first_index);
+	struct folio *folio;
+
+	rcu_read_lock();
+	xas_for_each(&xas, folio, last_index) {
+		unsigned long nr_pages;
+		pgoff_t folio_first_index, folio_last_index;
+
+		if (xas_retry(&xas, folio))
+			continue;
+
+		nr_pages = folio_nr_pages(folio);
+		folio_first_index = folio_pgoff(folio);
+		folio_last_index = folio_first_index + nr_pages - 1;
+
+		/* Folios might straddle the range boundaries, only count covered subpages */
+		if (folio_first_index < first_index)
+			nr_pages -= first_index - folio_first_index;
+
+		if (folio_last_index > last_index)
+			nr_pages -= folio_last_index - last_index;
+
+		if (xa_is_value(folio)) {
+			/* page is evicted */
+			void *shadow = (void *)folio;
+			bool workingset; /* not used */
+
+			cs->nr_evicted += nr_pages;
+
+#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
+			if (shmem_mapping(mapping)) {
+				/* shmem file - in swap cache */
+				swp_entry_t swp = radix_to_swp_entry(folio);
+
+				shadow = get_shadow_from_swap_cache(swp);
+			}
+#endif
+			if (workingset_test_recent(shadow, true, &workingset))
+				cs->nr_recently_evicted += nr_pages;
+
+			continue;
+		}
+
+		/* page is in cache */
+		cs->nr_cache += nr_pages;
+
+		if (folio_test_dirty(folio))
+			cs->nr_dirty += nr_pages;
+
+		if (folio_test_writeback(folio))
+			cs->nr_writeback += nr_pages;
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(filemap_cachestat);
+
+/*
+ * The cachestat(5) system call.
+ *
+ * cachestat() returns the page cache statistics of a file in the
+ * bytes specified by `off` and `len`: number of cached pages,
+ * number of dirty pages, number of pages marked for writeback,
+ * number of (recently) evicted pages.
+ *
+ * `off` and `len` must be non-negative integers. If `len` > 0,
+ * the queried range is [`off`, `off` + `len`]. If `len` == 0,
+ * we will query in the range from `off` to the end of the file.
+ *
+ * `cstat_size` allows users to obtain partial results. The syscall
+ * will copy the first `csstat_size` bytes to the specified userspace
+ * memory. It also makes the cachestat struct extensible - new fields
+ * can be added in the future without breaking existing usage.
+ * `cstat_size` must be a non-negative value that is no larger than
+ * the current size of the cachestat struct.
+ *
+ * The `flags` argument is unused for now, but is included for future
+ * extensibility. User should pass 0 (i.e no flag specified).
+ *
+ * Because the status of a page can change after cachestat() checks it
+ * but before it returns to the application, the returned values may
+ * contain stale information.
+ *
+ * return values:
+ *  zero    - success
+ *  -EFAULT - cstat points to an illegal address
+ *  -EINVAL - invalid arguments
+ *  -EBADF	- invalid file descriptor
+ */
+SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
+		size_t, cstat_size, struct cachestat __user *, cstat,
+		unsigned int, flags)
+{
+	struct fd f = fdget(fd);
+	struct address_space *mapping;
+	struct cachestat cs;
+	pgoff_t first_index = off >> PAGE_SHIFT;
+	pgoff_t last_index =
+		len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
+
+	if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
+		return -EINVAL;
+
+	if (!f.file)
+		return -EBADF;
+
+	memset(&cs, 0, sizeof(struct cachestat));
+	mapping = f.file->f_mapping;
+	filemap_cachestat(mapping, first_index, last_index, &cs);
+	fdput(f);
+
+	if (copy_to_user(cstat, &cs, cstat_size))
+		return -EFAULT;
+
+	return 0;
+}
+#endif /* CONFIG_CACHESTAT */