Message ID | 20220407031525.2368067-15-yuzhao@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Multi-Gen LRU Framework | expand |
Hi Zhao Yu, On Wed, Apr 06, 2022 at 09:15:26PM -0600, Yu Zhao wrote: > [EXTERNAL EMAIL NOTICE: This email originated from an external sender. Please be mindful of safe email handling and proprietary information protection practices.] > > > Add a design doc. > > Signed-off-by: Yu Zhao <yuzhao@google.com> > Acked-by: Brian Geffon <bgeffon@google.com> > Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> > Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> > Acked-by: Steven Barrett <steven@liquorix.net> > Acked-by: Suleiman Souhlal <suleiman@google.com> > Tested-by: Daniel Byrne <djbyrne@mtu.edu> > Tested-by: Donald Carr <d@chaos-reins.com> > Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> > Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> > Tested-by: Sofia Trinh <sofia.trinh@edi.works> > Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> > --- > Documentation/vm/index.rst | 1 + > Documentation/vm/multigen_lru.rst | 160 ++++++++++++++++++++++++++++++ > 2 files changed, 161 insertions(+) > create mode 100644 Documentation/vm/multigen_lru.rst > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > index 44365c4574a3..b48434300226 100644 > --- a/Documentation/vm/index.rst > +++ b/Documentation/vm/index.rst > @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the > ksm > memory-model > mmu_notifier > + multigen_lru > numa > overcommit-accounting > page_migration > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst > new file mode 100644 > index 000000000000..9b29b87e1435 > --- /dev/null > +++ b/Documentation/vm/multigen_lru.rst > @@ -0,0 +1,160 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +============= > +Multi-Gen LRU > +============= > +The multi-gen LRU is an alternative LRU implementation that optimizes > +page reclaim and improves performance under memory pressure. Page > +reclaim decides the kernel's caching policy and ability to overcommit > +memory. It directly impacts the kswapd CPU usage and RAM efficiency. > + > +Design overview > +=============== > +Objectives > +---------- > +The design objectives are: > + > +* Good representation of access recency > +* Try to profit from spatial locality > +* Fast paths to make obvious choices > +* Simple self-correcting heuristics > + > +The representation of access recency is at the core of all LRU > +implementations. In the multi-gen LRU, each generation represents a > +group of pages with similar access recency. Generations establish a > +common frame of reference and therefore help make better choices, > +e.g., between different memcgs on a computer or different computers in > +a data center (for job scheduling). > + > +Exploiting spatial locality improves efficiency when gathering the > +accessed bit. A rmap walk targets a single page and does not try to > +profit from discovering a young PTE. A page table walk can sweep all > +the young PTEs in an address space, but the address space can be too > +large to make a profit. The key is to optimize both methods and use > +them in combination. > + > +Fast paths reduce code complexity and runtime overhead. Unmapped pages > +do not require TLB flushes; clean pages do not require writeback. > +These facts are only helpful when other conditions, e.g., access > +recency, are similar. With generations as a common frame of reference, > +additional factors stand out. But obvious choices might not be good > +choices; thus self-correction is required. > + > +The benefits of simple self-correcting heuristics are self-evident. > +Again, with generations as a common frame of reference, this becomes > +attainable. Specifically, pages in the same generation can be > +categorized based on additional factors, and a feedback loop can > +statistically compare the refault percentages across those categories > +and infer which of them are better choices. It is better if we add a picture here to show the overview.. > + > +Assumptions > +----------- > +The protection of hot pages and the selection of cold pages are based > +on page access channels and patterns. There are two access channels: > + > +* Accesses through page tables > +* Accesses through file descriptors > + > +The protection of the former channel is by design stronger because: > + > +1. The uncertainty in determining the access patterns of the former > + channel is higher due to the approximation of the accessed bit. > +2. The cost of evicting the former channel is higher due to the TLB > + flushes required and the likelihood of encountering the dirty bit. > +3. The penalty of underprotecting the former channel is higher because > + applications usually do not prepare themselves for major page > + faults like they do for blocked I/O. E.g., GUI applications > + commonly use dedicated I/O threads to avoid blocking the rendering > + threads. > + > +There are also two access patterns: > + > +* Accesses exhibiting temporal locality > +* Accesses not exhibiting temporal locality > + > +For the reasons listed above, the former channel is assumed to follow > +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is > +present, and the latter channel is assumed to follow the latter > +pattern unless outlying refaults have been observed. > + > +Workflow overview > +================= > +Evictable pages are divided into multiple generations for each > +``lruvec``. The youngest generation number is stored in > +``lrugen->max_seq`` for both anon and file types as they are aged on > +an equal footing. The oldest generation numbers are stored in > +``lrugen->min_seq[]`` separately for anon and file types as clean file > +pages can be evicted regardless of swap constraints. These three > +variables are monotonically increasing. > + > +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` > +bits in order to fit into the gen counter in ``folio->flags``. Each > +truncated generation number is an index to ``lrugen->lists[]``. The > +sliding window technique is used to track at least ``MIN_NR_GENS`` and > +at most ``MAX_NR_GENS`` generations. The gen counter stores a value > +within ``[1, MAX_NR_GENS]`` while a page is on one of > +``lrugen->lists[]``; otherwise it stores zero. > + > +Each generation is divided into multiple tiers. Tiers represent > +different ranges of numbers of accesses through file descriptors. A > +page accessed ``N`` times through file descriptors is in tier > +``order_base_2(N)``. In contrast to moving across generations, which > +requires the LRU lock, moving across tiers only requires operations on > +``folio->flags`` and therefore has a negligible cost. A feedback loop > +modeled after the PID controller monitors refaults over all the tiers > +from anon and file types and decides which tiers from which types to > +evict or protect. > + > +There are two conceptually independent procedures: the aging and the > +eviction. They form a closed-loop system, i.e., the page reclaim. ditto. > + > +Aging > +----- > +The aging produces young generations. Given an ``lruvec``, it > +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches > +``MIN_NR_GENS``. The aging promotes hot pages to the youngest > +generation when it finds them accessed through page tables; the > +demotion of cold pages happens consequently when it increments > +``max_seq``. The aging uses page table walks and rmap walks to find > +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` > +and calls ``walk_page_range()`` with each ``mm_struct`` on this list > +to scan PTEs. On finding a young PTE, it clears the accessed bit and > +updates the gen counter of the page mapped by this PTE to > +``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it > +increments ``max_seq``. For the latter, when the eviction walks the > +rmap and finds a young PTE, the aging scans the adjacent PTEs and > +follows the same steps just described. > + > +Eviction > +-------- > +The eviction consumes old generations. Given an ``lruvec``, it > +increments ``min_seq`` when ``lrugen->lists[]`` indexed by > +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to > +evict from, it first compares ``min_seq[]`` to select the older type. > +If both types are equally old, it selects the one whose first tier has > +a lower refault percentage. The first tier contains single-use > +unmapped clean pages, which are the best bet. The eviction sorts a > +page according to the gen counter if the aging has found this page > +accessed through page tables and updated the gen counter. It also > +moves a page to the next generation, i.e., ``min_seq+1``, if this page > +was accessed multiple times through file descriptors and the feedback > +loop has detected outlying refaults from the tier this page is in. To > +do this, the feedback loop uses the first tier as the baseline, for > +the reason stated earlier. > + Thanks Huang Shijie
On 07/04/22 10.15, Yu Zhao wrote: > Add a design doc. > Why is this design documentation added?
Bagas Sanjaya <bagasdotme@gmail.com> writes: > On 07/04/22 10.15, Yu Zhao wrote: >> Add a design doc. >> > > Why is this design documentation added? ...because perhaps other developers might want to understand the design of this complex mechanism? Bagas, it is hard enough to get people to write useful documentation as it is. Could I ask you to please stop adding useless friction to the process? Thanks, jon
On 07/04/22 19.52, Jonathan Corbet wrote: > Bagas, it is hard enough to get people to write useful documentation as > it is. Could I ask you to please stop adding useless friction to the > process? > OK, thanks for reminding me.
On Wed, 6 Apr 2022 21:15:26 -0600 Yu Zhao <yuzhao@google.com> wrote: > Add a design doc. > > > ... > > +Design overview > +=============== > +Objectives > +---------- > +The design objectives are: > + > +* Good representation of access recency > +* Try to profit from spatial locality > +* Fast paths to make obvious choices > +* Simple self-correcting heuristics > + > +The representation of access recency is at the core of all LRU > +implementations. In the multi-gen LRU, each generation represents a > +group of pages with similar access recency. Generations establish a > +common frame of reference and therefore help make better choices, > +e.g., between different memcgs on a computer or different computers in > +a data center (for job scheduling). Does MGLRU have any special treatment for used-once pages? > +Exploiting spatial locality improves efficiency when gathering the > +accessed bit. A rmap walk targets a single page and does not try to > +profit from discovering a young PTE. A page table walk can sweep all > +the young PTEs in an address space, but the address space can be too > +large to make a profit. The key is to optimize both methods and use > +them in combination. > + > +Fast paths reduce code complexity and runtime overhead. Unmapped pages > +do not require TLB flushes; clean pages do not require writeback. > +These facts are only helpful when other conditions, e.g., access > +recency, are similar. With generations as a common frame of reference, > +additional factors stand out. But obvious choices might not be good > +choices; thus self-correction is required. > + > +The benefits of simple self-correcting heuristics are self-evident. > +Again, with generations as a common frame of reference, this becomes > +attainable. Specifically, pages in the same generation can be > +categorized based on additional factors, and a feedback loop can > +statistically compare the refault percentages across those categories > +and infer which of them are better choices. > + > +Assumptions > +----------- > +The protection of hot pages and the selection of cold pages are based > +on page access channels and patterns. There are two access channels: > + > +* Accesses through page tables > +* Accesses through file descriptors > + > +The protection of the former channel is by design stronger because: > + > +1. The uncertainty in determining the access patterns of the former > + channel is higher due to the approximation of the accessed bit. > +2. The cost of evicting the former channel is higher due to the TLB > + flushes required and the likelihood of encountering the dirty bit. > +3. The penalty of underprotecting the former channel is higher because > + applications usually do not prepare themselves for major page > + faults like they do for blocked I/O. E.g., GUI applications > + commonly use dedicated I/O threads to avoid blocking the rendering > + threads. > + > +There are also two access patterns: > + > +* Accesses exhibiting temporal locality > +* Accesses not exhibiting temporal locality > + > +For the reasons listed above, the former channel is assumed to follow > +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is > +present, and the latter channel is assumed to follow the latter > +pattern unless outlying refaults have been observed. What about MADV_SEQUENTIAL? Or did we propogate that into the fd? > +Workflow overview > +================= > +Evictable pages are divided into multiple generations for each > +``lruvec``. The youngest generation number is stored in > +``lrugen->max_seq`` for both anon and file types as they are aged on > +an equal footing. The oldest generation numbers are stored in > +``lrugen->min_seq[]`` separately for anon and file types as clean file > +pages can be evicted regardless of swap constraints. These three > +variables are monotonically increasing. > + > > ... > > +Summary > +------- > +The multi-gen LRU can be disassembled into the following parts: > + > +* Generations > +* Page table walks > +* Rmap walks > +* Bloom filters > +* The PID controller > + > +The aging and the eviction is a producer-consumer model; specifically, > +the latter drives the former by the sliding window over generations. > +Within the aging, rmap walks drive page table walks by inserting hot > +densely populated page tables to the Bloom filters. Within the > +eviction, the PID controller uses refaults as the feedback to select > +types to evict and tiers to protect. It's cool to see a PID controller in there. How do we know that it converges well, that it doesn't exhibit instability, etc?
On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 6 Apr 2022 21:15:26 -0600 Yu Zhao <yuzhao@google.com> wrote: > > > Add a design doc. > > > > > > ... > > > > +Design overview > > +=============== > > +Objectives > > +---------- > > +The design objectives are: > > + > > +* Good representation of access recency > > +* Try to profit from spatial locality > > +* Fast paths to make obvious choices > > +* Simple self-correcting heuristics > > + > > +The representation of access recency is at the core of all LRU > > +implementations. In the multi-gen LRU, each generation represents a > > +group of pages with similar access recency. Generations establish a > > +common frame of reference and therefore help make better choices, > > +e.g., between different memcgs on a computer or different computers in > > +a data center (for job scheduling). > > Does MGLRU have any special treatment for used-once pages? Yes. Single-use pages are assumed to be the cheapest to reclaim: "The first tier contains single-use unmapped clean pages, which are the best bet." > > +Exploiting spatial locality improves efficiency when gathering the > > +accessed bit. A rmap walk targets a single page and does not try to > > +profit from discovering a young PTE. A page table walk can sweep all > > +the young PTEs in an address space, but the address space can be too > > +large to make a profit. The key is to optimize both methods and use > > +them in combination. > > + > > +Fast paths reduce code complexity and runtime overhead. Unmapped pages > > +do not require TLB flushes; clean pages do not require writeback. > > +These facts are only helpful when other conditions, e.g., access > > +recency, are similar. With generations as a common frame of reference, > > +additional factors stand out. But obvious choices might not be good > > +choices; thus self-correction is required. > > + > > +The benefits of simple self-correcting heuristics are self-evident. > > +Again, with generations as a common frame of reference, this becomes > > +attainable. Specifically, pages in the same generation can be > > +categorized based on additional factors, and a feedback loop can > > +statistically compare the refault percentages across those categories > > +and infer which of them are better choices. > > + > > +Assumptions > > +----------- > > +The protection of hot pages and the selection of cold pages are based > > +on page access channels and patterns. There are two access channels: > > + > > +* Accesses through page tables > > +* Accesses through file descriptors > > + > > +The protection of the former channel is by design stronger because: > > + > > +1. The uncertainty in determining the access patterns of the former > > + channel is higher due to the approximation of the accessed bit. > > +2. The cost of evicting the former channel is higher due to the TLB > > + flushes required and the likelihood of encountering the dirty bit. > > +3. The penalty of underprotecting the former channel is higher because > > + applications usually do not prepare themselves for major page > > + faults like they do for blocked I/O. E.g., GUI applications > > + commonly use dedicated I/O threads to avoid blocking the rendering > > + threads. > > + > > +There are also two access patterns: > > + > > +* Accesses exhibiting temporal locality > > +* Accesses not exhibiting temporal locality > > + > > +For the reasons listed above, the former channel is assumed to follow > > +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is > > +present, and the latter channel is assumed to follow the latter > > +pattern unless outlying refaults have been observed. > > What about MADV_SEQUENTIAL? Or did we propogate that into the fd? MADV_SEQUENTIAL is VM_SEQ_READ. I think you mean POSIX_FADV_SEQUENTIAL. We don't need to propagate it to the fd. Pages sequentially accessed through fd, i.e., detected by mark_page_accessed(), fall into the single-use category. > > +Workflow overview > > +================= > > +Evictable pages are divided into multiple generations for each > > +``lruvec``. The youngest generation number is stored in > > +``lrugen->max_seq`` for both anon and file types as they are aged on > > +an equal footing. The oldest generation numbers are stored in > > +``lrugen->min_seq[]`` separately for anon and file types as clean file > > +pages can be evicted regardless of swap constraints. These three > > +variables are monotonically increasing. > > + > > > > ... > > > > +Summary > > +------- > > +The multi-gen LRU can be disassembled into the following parts: > > + > > +* Generations > > +* Page table walks > > +* Rmap walks > > +* Bloom filters > > +* The PID controller > > + > > +The aging and the eviction is a producer-consumer model; specifically, > > +the latter drives the former by the sliding window over generations. > > +Within the aging, rmap walks drive page table walks by inserting hot > > +densely populated page tables to the Bloom filters. Within the > > +eviction, the PID controller uses refaults as the feedback to select > > +types to evict and tiers to protect. > > It's cool to see a PID controller in there. > > How do we know that it converges well, that it doesn't exhibit instability, etc? Divergence can be easily avoided by using sane parameters. A more specific problem for our use case is how to establish a (relative) time domain. We can't just use wall clock since different CPUs run at the different speeds. Even for the same CPU, it scans pages at different rates under different memory pressure. So generations serve as a time dimension. For each new generation, we refresh the PID controller to minimize the chance of it locking in a bad state. It may make poor choices now and then, but it should never make poor choices all the time.
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 44365c4574a3..b48434300226 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..9b29b87e1435 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,160 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +common frame of reference and therefore help make better choices, +e.g., between different memcgs on a computer or different computers in +a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +large to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is required. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking the rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. Tiers represent +different ranges of numbers of accesses through file descriptors. A +page accessed ``N`` times through file descriptors is in tier +``order_base_2(N)``. In contrast to moving across generations, which +requires the LRU lock, moving across tiers only requires operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs. On finding a young PTE, it clears the accessed bit and +updates the gen counter of the page mapped by this PTE to +``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it +increments ``max_seq``. For the latter, when the eviction walks the +rmap and finds a young PTE, the aging scans the adjacent PTEs and +follows the same steps just described. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to the gen counter if the aging has found this page +accessed through page tables and updated the gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +do this, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Page table walks +* Rmap walks +* Bloom filters +* The PID controller + +The aging and the eviction is a producer-consumer model; specifically, +the latter drives the former by the sliding window over generations. +Within the aging, rmap walks drive page table walks by inserting hot +densely populated page tables to the Bloom filters. Within the +eviction, the PID controller uses refaults as the feedback to select +types to evict and tiers to protect.