[RFC,v2,00/43] PKRAM: Preserved-over-Kexec RAM

Message ID 1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com (mailing list archive)

Anthony Yznaga March 30, 2021, 9:35 p.m. UTC
This patchset implements preserved-over-kexec memory storage, or PKRAM, as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API and implement its use within tmpfs, allowing
tmpfs files to be preserved across kexec.

One use case for PKRAM is preserving guest memory and/or auxiliary supporting
data (e.g. iommu data) across kexec in support of VMM Fast Restart[2].
VMM Fast Restart is currently using PKRAM to support preserving "Keep Alive
State" across reboot[3].  PKRAM provides a flexible way of doing this
without requiring that the amount of memory used be of a fixed size determined
a priori.  Another use case is for databases to preserve their block caches
in shared memory across reboot.

Changes since RFC v1
  - Rebased onto 5.12-rc4
  - Refined the API to reduce the number of calls
    and better support multithreading.
  - Allow preserving byte data of arbitrary length
    (was previously limited to one page).
  - Build a new memblock reserved list with the
    preserved ranges and then substitute it for
    the existing one. (Mike Rapoport)
  - Use mem_avoid_overlap() to avoid kaslr stepping
    on preserved ranges. (Kees Cook)

-- Usage --

 1) Mount tmpfs with 'pkram=NAME' option.

    NAME is an arbitrary string specifying a preserved memory node.
    Different tmpfs trees may be saved to PKRAM if different names are
    passed.

    # mkdir -p /mnt
    # mount -t tmpfs -o pkram=mytmpfs none /mnt

 2) Populate a file under /mnt

    # head -c 2G /dev/urandom > /mnt/testfile
    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile

 3) Remount tmpfs to preserve files.

    # mount -o remount,preserve,ro /mnt

 4) Load the new kernel image.

    Pass the PKRAM super block pfn via the 'pkram' boot option. The pfn is
    exported via the sysfs file /sys/kernel/pkram.

    # kexec -s -l /boot/vmlinuz-$kernel --initrd=/boot/initramfs-$kernel.img \
            --append="$(cat /proc/cmdline|sed -e 's/pkram=[^ ]*//g') pkram=$(cat /sys/kernel/pkram)"

 5) Boot to the new kernel.

    # systemctl kexec

 6) Mount tmpfs with 'pkram=NAME' option.

    It should find the PKRAM node with the tmpfs tree saved by the earlier
    'preserve' remount and restore it.

    # mount -t tmpfs -o pkram=mytmpfs none /mnt

 7) Use the restored file under /mnt

    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile
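
A VMM or management daemon that drives step 4 itself, rather than through a
shell, needs the same command-line rewrite that the sed expression performs.
The following is a minimal sketch of that rewrite, not code from the patchset:
build_pkram_cmdline() is a hypothetical helper, and the caller is assumed to
have already read /proc/cmdline and /sys/kernel/pkram into strings. Unlike the
sed expression, it only drops whole tokens that start with "pkram=" and it
collapses repeated whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy every whitespace-separated token of `cmdline` except those that
 * start with "pkram=", then append "pkram=<pfn>". `pfn` is the text read
 * from /sys/kernel/pkram; anything after its first whitespace character
 * (such as a trailing newline) is ignored.
 * Returns 0 on success, -1 if `out` is too small.
 * Hypothetical helper for illustration only.
 */
int build_pkram_cmdline(const char *cmdline, const char *pfn,
                        char *out, size_t outsz)
{
    const char *p = cmdline;
    size_t used = 0;
    int n;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*p) {
        size_t len;

        p += strspn(p, " \t\n");        /* skip separators */
        len = strcspn(p, " \t\n");      /* length of this token */
        if (len == 0)
            break;
        if (strncmp(p, "pkram=", 6) != 0) {
            /* keep the token, with one space between kept tokens */
            n = snprintf(out + used, outsz - used, "%s%.*s",
                         used ? " " : "", (int)len, p);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;
            used += (size_t)n;
        }
        p += len;
    }

    /* append the new pkram= option for the next kernel */
    n = snprintf(out + used, outsz - used, "%spkram=%.*s",
                 used ? " " : "", (int)strcspn(pfn, " \t\n"), pfn);
    if (n < 0 || (size_t)n >= outsz - used)
        return -1;
    return 0;
}
```

The resulting string is what would be passed to kexec via --append.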


-- Implementation details --

 * When a tmpfs filesystem is mounted the first time with the 'pkram=NAME'
   option, a shmem_pkram_info is allocated to record NAME. The shmem_pkram_info
   and whether the filesystem is in the preserved state are tracked by
   shmem_sb_info.

 * A PKRAM-enabled tmpfs filesystem is saved to PKRAM on remount when the
   'preserve' mount option is specified and the filesystem is read-only.

 * Saving a file to PKRAM is done by walking the pages of the file and
   building a list of the pages and attributes needed to restore them later.
   The pages containing this metadata as well as the target file pages have
   their refcount incremented to prevent them from being freed even after
   the last user puts the pages (e.g. when the filesystem is unmounted).

 * To aid in quickly finding contiguous ranges of memory containing
   preserved pages, a pseudo physical mapping pagetable is populated
   with pages as they are preserved.

 * If a page to be preserved is found to be in a range of memory that was
   previously reserved during early boot or in a range of memory where the
   kernel will be loaded on kexec, the page will be copied to a page
   outside of those ranges and the new page will be preserved. A compound
   page will be copied to and preserved as individual base pages.

 * A single page is allocated for the PKRAM super block. For the next kernel
   to find the preserved memory metadata after kexec, the pfn of the PKRAM
   super block, which is exported via /sys/kernel/pkram, is passed in the
   'pkram' boot option.

 * In the newly booted kernel, PKRAM adds all preserved pages to the memblock
   reserved list during early boot so that they will not be recycled.

 * Since kexec may load the new kernel code to any memory region, it could
   destroy preserved memory. When the kernel selects the memory region
   (kexec_file_load syscall), kexec will avoid preserved pages.  When the
   user selects the kexec memory region to use (kexec_load syscall), the
   kexec load will fail if there is a conflict with preserved pages. Pages
   preserved after a kexec kernel is loaded will be relocated if they
   conflict with the selected memory region.
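
Two of the points above, finding contiguous preserved ranges and rejecting a
kexec segment that conflicts with them, can be modelled in a few lines of
userspace C. This is only a sketch under simplifying assumptions: it uses a
flat, sorted array of pfns instead of the pagetable-like structure the
patchset builds, and struct pfn_range, coalesce_pfns() and
range_has_preserved() are hypothetical names that do not appear in the
patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A half-open range of page frame numbers: [start, end). */
struct pfn_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Coalesce an ascending, duplicate-free array of preserved pfns into
 * contiguous ranges. Returns the number of ranges written to `out`,
 * or -1 if `out` cannot hold them all.
 */
int coalesce_pfns(const uint64_t *pfns, size_t n,
                  struct pfn_range *out, size_t max_out)
{
    size_t nr = 0;

    for (size_t i = 0; i < n; i++) {
        if (nr > 0 && out[nr - 1].end == pfns[i]) {
            out[nr - 1].end++;      /* pfn extends the current range */
            continue;
        }
        if (nr == max_out)
            return -1;
        out[nr].start = pfns[i];    /* pfn starts a new range */
        out[nr].end = pfns[i] + 1;
        nr++;
    }
    return (int)nr;
}

/*
 * Return 1 if [start, end) intersects any preserved range, else 0.
 * A loader could use a check like this to refuse a user-selected
 * segment, or to decide that a page must be relocated.
 */
int range_has_preserved(const struct pfn_range *r, size_t nr,
                        uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < nr; i++)
        if (start < r[i].end && r[i].start < end)
            return 1;
    return 0;
}
```

A linear scan is enough to show the idea; the point of a pagetable-like
index is to avoid walking every preserved page when answering the same
question.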

The current implementation has some restrictions:

 * Only regular tmpfs files without multiple hard links can be preserved.
   Save to PKRAM will abort and log an error if a directory or other file
   type is encountered.

 * Pages for PKRAM-enabled files are prevented from swapping out to avoid
   the performance penalty of swapping in and the possibility of insufficient
   memory.
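
Because a save aborts on the first unsupported file, userspace may want to
check the tree before remounting with 'preserve'. The sketch below expresses
the first restriction as a predicate on a struct stat, using only the POSIX
S_ISREG() macro and the st_nlink field. The function name is hypothetical and
the check is an assumption about what a userspace pre-check might look like;
the kernel performs its own validation during the save.

```c
/* Expose S_IFREG/S_IFDIR under strict -std= modes for hand-built stats. */
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <sys/stat.h>

/*
 * Return 1 if the file described by `st` satisfies the stated
 * restriction (a regular file with exactly one hard link), else 0.
 * Hypothetical userspace pre-check, not part of the patchset.
 */
int pkram_file_is_preservable(const struct stat *st)
{
    return S_ISREG(st->st_mode) && st->st_nlink == 1;
}
```

In practice the struct stat would come from lstat() or from a tree walk over
the tmpfs mount.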


-- Patches --

The patches are broken down into the following groups:

Patches 1-22 implement the API and supporting functionality.

Patches 23-27 implement the use of PKRAM within tmpfs.

The remaining patches implement optimizations to the initialization of
preserved pages and to the preservation and restoration of shmem pages.

To give an idea of the improvement in performance, here is an example
comparison with and without these patches when saving and loading a 100G
file:

  Save a 100G file:

              | No optimizations | Optimized (16 cpus) |
  ------------------------------------------------------
  huge=never  |     2265ms       |       232ms         |
  ------------------------------------------------------
  huge=always |       58ms       |        22ms         |


  Load a 100G file:

              | No optimizations | Optimized (16 cpus) |
  ------------------------------------------------------
  huge=never  |     8833ms       |       516ms         |
  ------------------------------------------------------
  huge=always |      752ms       |       105ms         |
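
For readers who prefer ratios, the tables above work out to speedups of
roughly 9.8x and 2.6x for save, and 17.1x and 7.2x for load (huge=never and
huge=always respectively). The sketch below just reproduces that arithmetic
from the numbers in the tables; the struct and function names are
illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the tables above; times are in milliseconds. */
struct pkram_timing {
    const char *op;     /* "save" or "load" */
    const char *huge;   /* tmpfs huge= mount option */
    double base_ms;     /* no optimizations */
    double opt_ms;      /* optimized, 16 cpus */
};

static const struct pkram_timing timings[] = {
    { "save", "never",  2265.0, 232.0 },
    { "save", "always",   58.0,  22.0 },
    { "load", "never",  8833.0, 516.0 },
    { "load", "always",  752.0, 105.0 },
};

/* Ratio of unoptimized to optimized time; larger means faster. */
double speedup(const struct pkram_timing *t)
{
    return t->base_ms / t->opt_ms;
}

/* Print one line per table row with the speedup to one decimal place. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof(timings) / sizeof(timings[0]); i++)
        printf("%-4s huge=%-6s %5.1fx\n", timings[i].op,
               timings[i].huge, speedup(&timings[i]));
}
```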


Patches 28-31: Defer initialization of page structs for preserved pages.

Patches 32-34: Implement multi-threading of shmem page preservation and
restoration.

Patches 35-37: Implement and use an API for inserting shmem pages in bulk.

Patches 38-39: Reduce contention on the LRU lock by staging and adding pages
in bulk to the LRU.

Patches 40-43: Reduce contention on the pagecache xarray lock by inserting
pages in bulk in certain cases.

[1] https://lkml.org/lkml/2013/7/1/211

[2] https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

[3] https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://static.sched.com/hosted_files/kvmforum2020/10/Device-Keepalive-State-KVMForum2020.pdf

Anthony Yznaga (43):
  mm: add PKRAM API stubs and Kconfig
  mm: PKRAM: implement node load and save functions
  mm: PKRAM: implement object load and save functions
  mm: PKRAM: implement page stream operations
  mm: PKRAM: support preserving transparent hugepages
  mm: PKRAM: implement byte stream operations
  mm: PKRAM: link nodes by pfn before reboot
  mm: PKRAM: introduce super block
  PKRAM: track preserved pages in a physical mapping pagetable
  PKRAM: pass a list of preserved ranges to the next kernel
  PKRAM: prepare for adding preserved ranges to memblock reserved
  mm: PKRAM: reserve preserved memory at boot
  PKRAM: free the preserved ranges list
  PKRAM: prevent inadvertent use of a stale superblock
  PKRAM: provide a way to ban pages from use by PKRAM
  kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  PKRAM: provide a way to check if a memory range has preserved pages
  kexec: PKRAM: avoid clobbering already preserved pages
  mm: PKRAM: allow preserved memory to be freed from userspace
  PKRAM: disable feature when running the kdump kernel
  x86/KASLR: PKRAM: support physical kaslr
  x86/boot/compressed/64: use 1GB pages for mappings
  mm: shmem: introduce shmem_insert_page
  mm: shmem: enable saving to PKRAM
  mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages
  mm: shmem: specify the mm to use when inserting pages
  mm: shmem: when inserting, handle pages already charged to a memcg
  x86/mm/numa: add numa_isolate_memblocks()
  PKRAM: ensure memblocks with preserved pages init'd for numa
  memblock: PKRAM: mark memblocks that contain preserved pages
  memblock, mm: defer initialization of preserved pages
  shmem: preserve shmem files a chunk at a time
  PKRAM: atomically add and remove link pages
  shmem: PKRAM: multithread preserving and restoring shmem pages
  shmem: introduce shmem_insert_pages()
  PKRAM: add support for loading pages in bulk
  shmem: PKRAM: enable bulk loading of preserved pages into shmem
  mm: implement splicing a list of pages to the LRU
  shmem: optimize adding pages to the LRU in shmem_insert_pages()
  shmem: initial support for adding multiple pages to pagecache
  XArray: add xas_export_node() and xas_import_node()
  shmem: reduce time holding xa_lock when inserting pages
  PKRAM: improve index alignment of pkram_link entries

 Documentation/core-api/xarray.rst       |    8 +
 arch/x86/boot/compressed/Makefile       |    3 +
 arch/x86/boot/compressed/ident_map_64.c |    9 +-
 arch/x86/boot/compressed/kaslr.c        |   10 +-
 arch/x86/boot/compressed/misc.h         |   10 +
 arch/x86/boot/compressed/pkram.c        |  109 ++
 arch/x86/include/asm/numa.h             |    4 +
 arch/x86/kernel/setup.c                 |    3 +
 arch/x86/mm/init_64.c                   |    2 +
 arch/x86/mm/numa.c                      |   32 +-
 include/linux/memblock.h                |    6 +
 include/linux/mm.h                      |    2 +-
 include/linux/pkram.h                   |  120 ++
 include/linux/shmem_fs.h                |   28 +
 include/linux/swap.h                    |   13 +
 include/linux/xarray.h                  |    2 +
 kernel/kexec.c                          |    9 +
 kernel/kexec_core.c                     |    3 +
 kernel/kexec_file.c                     |   15 +
 lib/test_xarray.c                       |   45 +
 lib/xarray.c                            |  100 ++
 mm/Kconfig                              |    9 +
 mm/Makefile                             |    1 +
 mm/memblock.c                           |   11 +-
 mm/page_alloc.c                         |   55 +-
 mm/pkram.c                              | 1808 +++++++++++++++++++++++++++++++
 mm/pkram_pagetable.c                    |  376 +++++++
 mm/shmem.c                              |  494 ++++++++-
 mm/shmem_pkram.c                        |  530 +++++++++
 mm/swap.c                               |   86 ++
 30 files changed, 3869 insertions(+), 34 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c
 create mode 100644 mm/pkram_pagetable.c
 create mode 100644 mm/shmem_pkram.c

Comments

Pasha Tatashin June 5, 2021, 1:39 p.m. UTC | #1
On 3/30/21 5:35 PM, Anthony Yznaga wrote:
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API and implement its use within tmpfs, allowing
> tmpfs files to be preserved across kexec.
> 
> One use case for PKRAM is preserving guest memory and/or auxiliary supporting
> data (e.g. iommu data) across kexec in support of VMM Fast Restart[2].
> VMM Fast Restart is currently using PKRAM to support preserving "Keep Alive
> State" across reboot[3].  PKRAM provides a flexible way of doing this
> without requiring that the amount of memory used be of a fixed size determined
> a priori.  Another use case is for databases to preserve their block caches
> in shared memory across reboot.

Hi Anthony,

I have several concerns about preserving arbitrary, not pre-reserved segments across reboot.

1. PKRAM does not work across firmware reboots
With emulated persistent memory it is possible to reboot through firmware and not lose the preserved memory. The firmware can be modified to mark the required page ranges as PRAM, and Linux will treat them as such. The benefit of this is that it works for both cases: kexec and reboot through firmware. The disadvantage is that you have to know in advance how much memory needs to be preserved. However, with the ability to hot-plug/hot-remove PMEM, that disadvantage becomes moot, as it is possible to mark a large chunk of memory as PMEM if needed. I have designed something like this for one of our projects, and it is already being used in the fleet. Rebooting through firmware allows us to service firmware in addition to the kernel.

2. Boot failures due to memory fragmentation
We also considered using PRAM instead of PMEM. PRAM was one of the previous attempts to do the persistent memory thing via a tmpfs flag: "mount -t tmpfs -o pram=mytmpfs none /mnt/crdump"; that project was never upstreamed. However, we gave up on that idea because, in addition to losing the possibility of rebooting through the firmware, it also adds memory fragmentation. For example, if the new kernel requires larger contiguous memory chunks to be allocated during boot than the previous kernel (e.g. the next kernel has new drivers or some debug feature enabled), the boot might simply fail because of the extra memory ranges being reserved.

3. New inter-kernel dependencies
Kexec reboot is when one Linux kernel works as a bootloader for the next one. Currently, there is very little information that is passed from the old kernel to the next kernel. Adding more information that two independent kernels must know about each other is not a very good thing from an architectural point of view. It limits the flexibility of kexec.
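
The fragmentation concern in point 2 can be made concrete with a toy model.
The sketch below is an illustration under stated assumptions, not a
measurement: memory is modelled as pfns 0..total-1, the reserved pfns are
given in ascending order without duplicates, and largest_free_run() is a
hypothetical name. In a 1024-page model, three reserved pages at pfns 255,
511 and 767 cut the largest contiguous free run to 256 pages, while the same
three pages clustered at pfns 0-2 leave a run of 1021 pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the length, in pages, of the largest run of free pages in
 * [0, total), given an ascending, duplicate-free array of reserved
 * pfns that are all below `total`.
 */
uint64_t largest_free_run(const uint64_t *reserved, size_t n,
                          uint64_t total)
{
    uint64_t best = 0;
    uint64_t next_free = 0;     /* first pfn after the last reserved one */

    for (size_t i = 0; i < n; i++) {
        uint64_t gap = reserved[i] - next_free;

        if (gap > best)
            best = gap;
        next_free = reserved[i] + 1;
    }
    /* free tail after the last reserved page */
    if (total - next_free > best)
        best = total - next_free;
    return best;
}
```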

However, we do need PKRAM and the ability to preserve kernel memory across reboot for fast hypervisor updates and the like. User pages can already be preserved across reboot on emulated or real persistent memory. The easiest way is via DAXFS placed on that memory.
The kernel cannot currently preserve its own memory on PMEM across reboot. However, the functionality can be extended so that kernel memory can be preserved on either emulated or real persistent memory. PKRAM could provide an interface to save kernel data to a file, and that file could be placed on any filesystem, including DAXFS. When placed on DAXFS, that file can be used as iommu data, as it is actually located in physical memory and does not move. It is preserved across firmware/kexec reboot, with the devices surviving the reboot with their state intact. During boot, the device drivers that use the PKRAM preserve functionality would map the saved files from DAXFS in order to have IOMMU functionality working again.

Thank you,
Pasha