[064/146] docs/vm: add vmalloced-kernel-stacks document


Message ID: 20220114220626.RQe8Ln1aD%akpm@linux-foundation.org
State: New

Headers:

Date: Fri, 14 Jan 2022 14:06:26 -0800
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, corbet@lwn.net, linux-mm@kvack.org,
 luto@kernel.org, mm-commits@vger.kernel.org, skhan@linuxfoundation.org,
 torvalds@linux-foundation.org
Subject: [patch 064/146] docs/vm: add vmalloced-kernel-stacks
 document
Message-ID: <20220114220626.RQe8Ln1aD%akpm@linux-foundation.org>
In-Reply-To: <20220114140222.6b14f0061194d3200000c52d@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton Jan. 14, 2022, 10:06 p.m. UTC

From: Shuah Khan <skhan@linuxfoundation.org>
Subject: docs/vm: add vmalloced-kernel-stacks document

Add a new document to explain Virtually Mapped Kernel Stack Support.  This
is a compilation of information from the code and original patch series
that introduced the Virtually Mapped Kernel Stacks feature.

This document summarizes the feature and provides details on allocation,
free, and stack overflow handling.  It also references the available tests.

Link: https://lkml.kernel.org/r/20211215002004.47981-1-skhan@linuxfoundation.org
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/index.rst                   |    1 
 Documentation/vm/vmalloced-kernel-stacks.rst |  153 +++++++++++++++++
 2 files changed, 154 insertions(+)

--- a/Documentation/vm/index.rst~docs-vm-add-vmalloced-kernel-stacks-document
+++ a/Documentation/vm/index.rst
@@ -36,5 +36,6 @@  algorithms.  If you are looking for advi
    split_page_table_lock
    transhuge
    unevictable-lru
+   vmalloced-kernel-stacks
    z3fold
    zsmalloc
--- /dev/null
+++ a/Documentation/vm/vmalloced-kernel-stacks.rst
@@ -0,0 +1,153 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Virtually Mapped Kernel Stack Support
+=====================================
+
+:Author: Shuah Khan <skhan@linuxfoundation.org>
+
+.. contents:: :local:
+
+Overview
+--------
+
+This is a compilation of information from the code and the original patch
+series that introduced the `Virtually Mapped Kernel Stacks feature
+<https://lwn.net/Articles/694348/>`_.
+
+Introduction
+------------
+
+Kernel stack overflows are often hard to debug and make the kernel
+susceptible to exploits. Problems may show up only at a later time, making
+them difficult to isolate and root-cause.
+
+Virtually-mapped kernel stacks with guard pages cause kernel stack
+overflows to be caught immediately rather than causing
+difficult-to-diagnose corruption.
+
+The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
+support for virtually mapped stacks with guard pages. This feature
+causes reliable faults when the stack overflows. The usability of
+the stack trace after an overflow and the response to the overflow itself
+are architecture-dependent.
+
+.. note::
+        As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
+        support for VMAP_STACK.
+
+HAVE_ARCH_VMAP_STACK
+--------------------
+
+Architectures that can support Virtually Mapped Kernel Stacks should
+enable this bool configuration option. The requirements are:
+
+- vmalloc space must be large enough to hold many kernel stacks. This
+  may rule out many 32-bit architectures.
+- Stacks in vmalloc space need to work reliably.  For example, if
+  vmap page tables are created on demand, either this mechanism
+  needs to work while the stack points to a virtual address with
+  unpopulated page tables or arch code (switch_to() and switch_mm(),
+  most likely) needs to ensure that the stack's page table entries
+  are populated before running on a possibly unpopulated stack.
+- If the stack overflows into a guard page, something reasonable
+  should happen. The definition of "reasonable" is flexible, but
+  instantly rebooting without logging anything would be unfriendly.
+
+VMAP_STACK
+----------
+
+When enabled, the VMAP_STACK bool configuration option allocates virtually
+mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.
+
+- Enable this if you want to use virtually-mapped kernel stacks
+  with guard pages. This causes kernel stack overflows to be caught
+  immediately rather than causing difficult-to-diagnose corruption.
+
+.. note::
+
+        Using this feature with KASAN requires architecture support
+        for backing virtual mappings with real shadow memory, and
+        KASAN_VMALLOC must be enabled.
+
+.. note::
+
+        When VMAP_STACK is enabled, it is not possible to run DMA on
+        stack-allocated data.
+
+Kernel configuration options and dependencies keep changing. Refer to
+the latest code base:
+
+`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_
+
+Allocation
+-----------
+
+When a new kernel thread is created, its stack is allocated as virtually
+contiguous memory backed by pages from the page-level allocator. These
+pages are mapped into contiguous kernel virtual space with PAGE_KERNEL
+protections.
+
+alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the stack
+with PAGE_KERNEL protections.
+
+- Allocated stacks are cached and later reused by new threads, so memcg
+  accounting is performed manually on assigning/releasing stacks to tasks.
+  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
+- The vm_struct is cached so that it can be found when the thread stack is
+  freed from interrupt context; free_thread_stack() can be called in
+  interrupt context.
+- On arm64, all VMAP'd stacks need to have the same alignment to ensure
+  that VMAP'd stack overflow detection works correctly. The arch-specific
+  vmap stack allocator takes care of this detail.
+- According to the original patch, this does not address interrupt stacks.
+
+Thread stack allocation is initiated from clone(), fork(), vfork(), and
+kernel_thread() via kernel_clone(). These names are useful hints for
+searching the code base to understand when and how the thread stack is allocated.
+
+The bulk of the code is in
+`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.
+
+The stack_vm_area pointer in task_struct keeps track of the virtually allocated
+stack, and a non-null stack_vm_area pointer serves as an indication that
+virtually mapped kernel stacks are enabled.
+
+::
+
+        struct vm_struct *stack_vm_area;
+
+Stack overflow handling
+-----------------------
+
+Leading and trailing guard pages help detect stack overflows. When the stack
+overflows into the guard pages, handlers have to be careful not to overflow
+the stack again. When handlers are called, it is likely that very little
+stack space is left.
+
+On x86, this is done by handling the page fault indicating the kernel
+stack overflow on the double-fault stack.
+
+Testing VMAP allocation with guard pages
+----------------------------------------
+
+How do we ensure that VMAP_STACK is actually allocating with a leading
+and trailing guard page? The following lkdtm tests can help detect any
+regressions.
+
+::
+
+        void lkdtm_STACK_GUARD_PAGE_LEADING()
+        void lkdtm_STACK_GUARD_PAGE_TRAILING()
+
+Conclusions
+-----------
+
+- A percpu cache of vmalloced stacks appears to be a bit faster than a
+  high-order stack allocation, at least when the cache hits.
+- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
+  simply embeds the thread_info (containing only flags) and 'int cpu' into
+  task_struct.
+- The thread stack can be freed as soon as the task is dead (without
+  waiting for RCU) and then, if vmapped stacks are in use, the entire
+  stack can be cached for reuse on the same cpu.
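
Illustration (not part of the patch): the guard-page idea described under "Stack overflow handling" can be approximated in user space. The sketch below is only an analogy, assuming a Linux/POSIX system. It uses mmap() and mprotect() rather than the kernel's __vmalloc_node_range(), and a SIGSEGV handler running on an alternate signal stack stands in for the x86 double-fault stack. The names alloc_guarded_stack(), write_faults(), and install_fault_handler() are hypothetical helpers made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Fault handler: runs on the alternate stack and jumps back to the prober. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/*
 * Hypothetical helper: map `usable` bytes of read/write memory with one
 * inaccessible (PROT_NONE) guard page before and after it. Returns a
 * pointer to the usable region, or NULL on failure.
 */
static char *alloc_guarded_stack(size_t usable, size_t page)
{
    char *base;

    assert(page != 0 && usable % page == 0);

    /* Reserve everything as inaccessible, then open up the middle. */
    base = mmap(NULL, usable + 2 * page, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base + page, usable, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, usable + 2 * page);
        return NULL;
    }
    return base + page;
}

/*
 * Hypothetical helper: install the fault handler on a separate signal
 * stack, so handling a fault does not depend on the faulting stack.
 */
static int install_fault_handler(void)
{
    static char altstack[64 * 1024];
    stack_t ss;
    struct sigaction sa;

    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = altstack;
    ss.ss_size = sizeof(altstack);
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_fault;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Hypothetical helper: return 1 if a one-byte write to addr faults, else 0. */
static int write_faults(volatile char *addr)
{
    /* savemask=1 so SIGSEGV is unblocked again after siglongjmp(). */
    if (sigsetjmp(fault_env, 1))
        return 1;
    *addr = 0x5a;
    return 0;
}
```

After install_fault_handler() succeeds, probing the first and last usable bytes with write_faults() should succeed, while probing one byte below or one byte above the usable region should fault immediately. That is the property the lkdtm guard-page tests check for real kernel stacks.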
