Message ID: 1532110430-115278-1-git-send-email-yang.shi@linux.alibaba.com
State: New, archived
On Sat, 21 Jul 2018 02:13:50 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> Digging into the original review, it looks like the use_zero_page sysfs
> knob was added to help ease of testing and to give users a way to
> mitigate the refcounting overhead.
>
> It has been a few years since the knob was first added, and I think we
> can be confident that it is stable enough. Also, since commit
> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> the refcounting overhead has been reduced significantly.
>
> Other than the above, the value of the knob is always 1 (enabled by
> default); I suppose very few people turn it off.
>
> So it does not seem worth keeping this knob around.

Probably OK.  Might not be OK, nobody knows.

It's been there for seven years, so another six months won't kill us.
How about, as an intermediate step, we add a printk("use_zero_page is
scheduled for removal. Please contact linux-mm@kvack.org if you need
it")?
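For illustration, a minimal sketch of what that intermediate step could look like if dropped into the existing use_zero_page_store() handler in mm/huge_memory.c (hypothetical; no such patch was posted in this thread):

	static ssize_t use_zero_page_store(struct kobject *kobj,
			struct kobj_attribute *attr, const char *buf, size_t count)
	{
		/* Warn once per boot that the knob is slated for removal. */
		pr_warn_once("use_zero_page is scheduled for removal. Please contact linux-mm@kvack.org if you need it.\n");

		return single_hugepage_flag_store(kobj, attr, buf, count,
					TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
	}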
On Fri, 20 Jul 2018, Andrew Morton wrote:

> > Digging into the original review, it looks like the use_zero_page sysfs
> > knob was added to help ease of testing and to give users a way to
> > mitigate the refcounting overhead.
> >
> > It has been a few years since the knob was first added, and I think we
> > can be confident that it is stable enough. Also, since commit
> > 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> > the refcounting overhead has been reduced significantly.
> >
> > Other than the above, the value of the knob is always 1 (enabled by
> > default); I suppose very few people turn it off.
> >
> > So it does not seem worth keeping this knob around.
>
> Probably OK.  Might not be OK, nobody knows.
>
> It's been there for seven years, so another six months won't kill us.
> How about, as an intermediate step, we add a printk("use_zero_page is
> scheduled for removal. Please contact linux-mm@kvack.org if you need
> it")?
>

We disable the huge zero page through this interface; there were issues
related to the huge zero page shrinker (it is probably best to never free
a per-node huge zero page after it is allocated) and CVE-2017-1000405 for
the huge dirty COW.
On 7/20/18 1:02 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Andrew Morton wrote:
>
>>> Digging into the original review, it looks like the use_zero_page sysfs
>>> knob was added to help ease of testing and to give users a way to
>>> mitigate the refcounting overhead.
>>>
>>> It has been a few years since the knob was first added, and I think we
>>> can be confident that it is stable enough. Also, since commit
>>> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
>>> the refcounting overhead has been reduced significantly.
>>>
>>> Other than the above, the value of the knob is always 1 (enabled by
>>> default); I suppose very few people turn it off.
>>>
>>> So it does not seem worth keeping this knob around.
>> Probably OK.  Might not be OK, nobody knows.
>>
>> It's been there for seven years, so another six months won't kill us.
>> How about, as an intermediate step, we add a printk("use_zero_page is
>> scheduled for removal. Please contact linux-mm@kvack.org if you need
>> it")?
>>
> We disable the huge zero page through this interface; there were issues
> related to the huge zero page shrinker (it is probably best to never free
> a per-node huge zero page after it is allocated) and CVE-2017-1000405 for
> the huge dirty COW.

Thanks for the information. It looks like the CVE has been resolved by
commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make
page table dirty unconditionally in touch_p[mu]d()"), which is in 4.15
already.

What was the shrinker-related issue? I suppose it has been resolved,
right?
On Fri, 20 Jul 2018, Yang Shi wrote:

> > We disable the huge zero page through this interface; there were issues
> > related to the huge zero page shrinker (it is probably best to never free
> > a per-node huge zero page after it is allocated) and CVE-2017-1000405 for
> > the huge dirty COW.
>
> Thanks for the information. It looks like the CVE has been resolved by
> commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make
> page table dirty unconditionally in touch_p[mu]d()"), which is in 4.15
> already.
>

For users who run kernels earlier than 4.15, they may choose to mitigate
the CVE by using this tunable.  It's not something we permanently need to
have, but removing it now is likely too early.

> What was the shrinker-related issue? I suppose it has been resolved,
> right?
>

The huge zero page can be reclaimed under memory pressure and, if it is,
it is attempted to be allocated again with gfp flags that attempt memory
compaction, which can become expensive.  If we are constantly under memory
pressure, it gets freed and reallocated millions of times, always trying
to compact memory both directly and by kicking kcompactd in the
background.

It likely should also be per node.
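For readers unfamiliar with the mechanism being described, this is roughly the interplay in mm/huge_memory.c at the time -- a condensed paraphrase, not a verbatim excerpt (the publish step and error handling of the allocation path are omitted):

	/*
	 * Fault path: if the shrinker freed the huge zero page, it must be
	 * reallocated, and GFP_TRANSHUGE permits direct compaction -- the
	 * expensive part when memory is fragmented.
	 */
	static bool get_huge_zero_page(void)
	{
		struct page *zero_page;

		if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
			return true;	/* already allocated, just take a reference */

		zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
					HPAGE_PMD_ORDER);
		/* publish zero_page and set the refcount (omitted) */
		return zero_page != NULL;
	}

	/*
	 * Shrinker: under reclaim the last reference is dropped and the page is
	 * freed, so the next read fault pays the allocation/compaction cost again.
	 */
	static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
							struct shrink_control *sc)
	{
		if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
			struct page *zero_page = xchg(&huge_zero_page, NULL);

			__free_pages(zero_page, compound_order(zero_page));
			return HPAGE_PMD_NR;
		}
		return 0;
	}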
On Sat, Jul 21, 2018 at 02:13:50AM +0800, Yang Shi wrote:
> Digging into the original review, it looks like the use_zero_page sysfs
> knob was added to help ease of testing and to give users a way to
> mitigate the refcounting overhead.
>
> It has been a few years since the knob was first added, and I think we
> can be confident that it is stable enough. Also, since commit
> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> the refcounting overhead has been reduced significantly.
>
> Other than the above, the value of the knob is always 1 (enabled by
> default); I suppose very few people turn it off.
>
> So it does not seem worth keeping this knob around.

I don't think that having the knob around is a huge maintenance burden.
And since it helped to work around a security bug relatively recently, I
would rather keep it.
On 7/20/18 2:05 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Yang Shi wrote:
>
>>> We disable the huge zero page through this interface; there were issues
>>> related to the huge zero page shrinker (it is probably best to never free
>>> a per-node huge zero page after it is allocated) and CVE-2017-1000405 for
>>> the huge dirty COW.
>> Thanks for the information. It looks like the CVE has been resolved by
>> commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make
>> page table dirty unconditionally in touch_p[mu]d()"), which is in 4.15
>> already.
>>
> For users who run kernels earlier than 4.15, they may choose to mitigate
> the CVE by using this tunable.  It's not something we permanently need to
> have, but removing it now is likely too early.

Yes, it might be good to keep it around for a while.

>
>> What was the shrinker-related issue? I suppose it has been resolved,
>> right?
>>
> The huge zero page can be reclaimed under memory pressure and, if it is,
> it is attempted to be allocated again with gfp flags that attempt memory
> compaction, which can become expensive.  If we are constantly under memory
> pressure, it gets freed and reallocated millions of times, always trying
> to compact memory both directly and by kicking kcompactd in the
> background.

Even if we don't use the huge zero page, we may still run into a similar
issue under memory pressure.  Disabling it just saves the cost of calling
the huge zero page shrinker, and that cost does not sound high anyway.

>
> It likely should also be per node.
On 7/20/18 2:06 PM, Kirill A. Shutemov wrote:
> On Sat, Jul 21, 2018 at 02:13:50AM +0800, Yang Shi wrote:
>> Digging into the original review, it looks like the use_zero_page sysfs
>> knob was added to help ease of testing and to give users a way to
>> mitigate the refcounting overhead.
>>
>> It has been a few years since the knob was first added, and I think we
>> can be confident that it is stable enough. Also, since commit
>> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
>> the refcounting overhead has been reduced significantly.
>>
>> Other than the above, the value of the knob is always 1 (enabled by
>> default); I suppose very few people turn it off.
>>
>> So it does not seem worth keeping this knob around.
> I don't think that having the knob around is a huge maintenance burden.
> And since it helped to work around a security bug relatively recently, I
> would rather keep it.

I agree with keeping it for a while to let that security bug cool down;
however, if there are no users anymore, it seems pointless to keep a dead
knob around.
On Fri, Jul 20, 2018 at 02:05:52PM -0700, David Rientjes wrote:
> The huge zero page can be reclaimed under memory pressure and, if it is,
> it is attempted to be allocated again with gfp flags that attempt memory
> compaction, which can become expensive.  If we are constantly under memory
> pressure, it gets freed and reallocated millions of times, always trying
> to compact memory both directly and by kicking kcompactd in the
> background.
>
> It likely should also be per node.

Have you benchmarked making the non-huge zero page per-node?
On Sat, 21 Jul 2018, Matthew Wilcox wrote:

> > The huge zero page can be reclaimed under memory pressure and, if it is,
> > it is attempted to be allocated again with gfp flags that attempt memory
> > compaction, which can become expensive.  If we are constantly under memory
> > pressure, it gets freed and reallocated millions of times, always trying
> > to compact memory both directly and by kicking kcompactd in the
> > background.
> >
> > It likely should also be per node.
>
> Have you benchmarked making the non-huge zero page per-node?
>

Not since we disable it :)  I will, though.  The more concerning issue for
us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
compacting memory to allocate the hzp in real time after it has been
reclaimed.  We've observed this happening tens or hundreds of thousands
of times on some systems.  It will be 2MB per node on x86 if the data
suggests we should make it NUMA aware; I don't think the cost is too high
to leave it persistently available even under memory pressure if
use_zero_page is enabled.
On Fri, 20 Jul 2018, Yang Shi wrote:

> I agree with keeping it for a while to let that security bug cool down;
> however, if there are no users anymore, it seems pointless to keep a dead
> knob around.
>

It's not a dead knob.  We use it, and for reasons other than
CVE-2017-1000405.  To mitigate the cost of constantly compacting memory to
allocate it after it has been freed due to memory pressure, we can either
continue to disable it, allow it to be persistently available, or use a
new value for use_zero_page to specify that it should be persistently
available.
On Mon, 23 Jul 2018, David Rientjes wrote:

> > > The huge zero page can be reclaimed under memory pressure and, if it is,
> > > it is attempted to be allocated again with gfp flags that attempt memory
> > > compaction, which can become expensive.  If we are constantly under memory
> > > pressure, it gets freed and reallocated millions of times, always trying
> > > to compact memory both directly and by kicking kcompactd in the
> > > background.
> > >
> > > It likely should also be per node.
> >
> > Have you benchmarked making the non-huge zero page per-node?
> >
>
> Not since we disable it :)  I will, though.  The more concerning issue for
> us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
> compacting memory to allocate the hzp in real time after it has been
> reclaimed.  We've observed this happening tens or hundreds of thousands
> of times on some systems.  It will be 2MB per node on x86 if the data
> suggests we should make it NUMA aware; I don't think the cost is too high
> to leave it persistently available even under memory pressure if
> use_zero_page is enabled.
>

Measuring access latency to 4GB of memory on Naples, I observe ~6.7%
slower access latency intrasocket and ~14% slower intersocket.

use_zero_page is currently a simple thp flag, meaning it rejects writes
where val != !!val, so perhaps it would be best to overload it with
additional options?  I can imagine 0x2 defining persistent allocation so
that the hzp is not freed when the refcount goes to 0, and 0x4 defining
whether the hzp should be per node.  Implementing persistent allocation
fixes our concern with it, so I'd like to start there.  Comments?
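For readers who want to reproduce this kind of local-versus-remote comparison, here is a rough userspace sketch using libnuma; the node numbers, buffer size, and stride are arbitrary assumptions and this is not the methodology David actually used:

	/* Build: gcc -O2 -o numa_read numa_read.c -lnuma */
	#include <numa.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define SZ	(4UL << 30)	/* 4GB, matching the measurement above */

	static double read_pass(volatile uint64_t *buf)
	{
		struct timespec t0, t1;
		size_t i;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < SZ / sizeof(*buf); i += 8)
			(void)buf[i];	/* volatile read, one per cacheline */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	}

	int main(void)
	{
		int node;

		if (numa_available() < 0) {
			fprintf(stderr, "NUMA not supported\n");
			return 1;
		}
		numa_run_on_node(0);	/* run the reader on node 0 */

		for (node = 0; node <= 1; node++) {	/* 0 = local, 1 = remote */
			volatile uint64_t *buf = numa_alloc_onnode(SZ, node);

			if (!buf) {
				perror("numa_alloc_onnode");
				return 1;
			}
			read_pass(buf);	/* warm-up: fault the pages in first */
			printf("node %d: %.3f s\n", node, read_pass(buf));
			numa_free((void *)buf, SZ);
		}
		return 0;
	}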
On 7/23/18 2:33 PM, David Rientjes wrote:
> On Mon, 23 Jul 2018, David Rientjes wrote:
>
>>>> The huge zero page can be reclaimed under memory pressure and, if it is,
>>>> it is attempted to be allocated again with gfp flags that attempt memory
>>>> compaction, which can become expensive.  If we are constantly under memory
>>>> pressure, it gets freed and reallocated millions of times, always trying
>>>> to compact memory both directly and by kicking kcompactd in the
>>>> background.
>>>>
>>>> It likely should also be per node.
>>> Have you benchmarked making the non-huge zero page per-node?
>>>
>> Not since we disable it :)  I will, though.  The more concerning issue for
>> us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
>> compacting memory to allocate the hzp in real time after it has been
>> reclaimed.  We've observed this happening tens or hundreds of thousands
>> of times on some systems.  It will be 2MB per node on x86 if the data
>> suggests we should make it NUMA aware; I don't think the cost is too high
>> to leave it persistently available even under memory pressure if
>> use_zero_page is enabled.
>>
> Measuring access latency to 4GB of memory on Naples, I observe ~6.7%
> slower access latency intrasocket and ~14% slower intersocket.
>
> use_zero_page is currently a simple thp flag, meaning it rejects writes
> where val != !!val, so perhaps it would be best to overload it with
> additional options?  I can imagine 0x2 defining persistent allocation so
> that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> whether the hzp should be per node.  Implementing persistent allocation
> fixes our concern with it, so I'd like to start there.  Comments?

Sounds worth trying to me :-)

It might be worth making it persistent by default.  Keeping 2MB of memory
unreclaimable does not sound harmful for use cases that prefer THP.
On 7/23/18 1:31 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Yang Shi wrote:
>
>> I agree with keeping it for a while to let that security bug cool down;
>> however, if there are no users anymore, it seems pointless to keep a dead
>> knob around.
>>
> It's not a dead knob.  We use it, and for reasons other than
> CVE-2017-1000405.  To mitigate the cost of constantly compacting memory to
> allocate it after it has been freed due to memory pressure, we can either
> continue to disable it, allow it to be persistently available, or use a
> new value for use_zero_page to specify that it should be persistently
> available.

My understanding is that the cost of memory compaction is *not* unique to
the huge zero page, right?  It is expected under memory pressure even when
the huge zero page is disabled.
On Mon, 23 Jul 2018, Yang Shi wrote:

> > > I agree with keeping it for a while to let that security bug cool down;
> > > however, if there are no users anymore, it seems pointless to keep a dead
> > > knob around.
> > >
> > It's not a dead knob.  We use it, and for reasons other than
> > CVE-2017-1000405.  To mitigate the cost of constantly compacting memory to
> > allocate it after it has been freed due to memory pressure, we can either
> > continue to disable it, allow it to be persistently available, or use a
> > new value for use_zero_page to specify that it should be persistently
> > available.
>
> My understanding is that the cost of memory compaction is *not* unique to
> the huge zero page, right?  It is expected under memory pressure even when
> the huge zero page is disabled.
>

It's caused by fragmentation, not necessarily memory pressure.  We've
disabled it because compacting for tens of thousands of huge zero page
allocations in the background has a noticeable impact on cpu.
Additionally, if the hzp cannot be allocated at runtime, it increases the
rss of applications that map it, making it unpredictable.  Making it
persistent, as I've been suggesting, fixes these issues.
On Mon, Jul 23, 2018 at 02:33:08PM -0700, David Rientjes wrote:
> On Mon, 23 Jul 2018, David Rientjes wrote:
>
> > > > The huge zero page can be reclaimed under memory pressure and, if it is,
> > > > it is attempted to be allocated again with gfp flags that attempt memory
> > > > compaction, which can become expensive.  If we are constantly under memory
> > > > pressure, it gets freed and reallocated millions of times, always trying
> > > > to compact memory both directly and by kicking kcompactd in the
> > > > background.
> > > >
> > > > It likely should also be per node.
> > >
> > > Have you benchmarked making the non-huge zero page per-node?
> > >
> >
> > Not since we disable it :)  I will, though.  The more concerning issue for
> > us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
> > compacting memory to allocate the hzp in real time after it has been
> > reclaimed.  We've observed this happening tens or hundreds of thousands
> > of times on some systems.  It will be 2MB per node on x86 if the data
> > suggests we should make it NUMA aware; I don't think the cost is too high
> > to leave it persistently available even under memory pressure if
> > use_zero_page is enabled.
> >
>
> Measuring access latency to 4GB of memory on Naples, I observe ~6.7%
> slower access latency intrasocket and ~14% slower intersocket.
>
> use_zero_page is currently a simple thp flag, meaning it rejects writes
> where val != !!val, so perhaps it would be best to overload it with
> additional options?  I can imagine 0x2 defining persistent allocation so
> that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> whether the hzp should be per node.  Implementing persistent allocation
> fixes our concern with it, so I'd like to start there.  Comments?

Why not separate files?
On Tue, 24 Jul 2018, Kirill A. Shutemov wrote:

> > use_zero_page is currently a simple thp flag, meaning it rejects writes
> > where val != !!val, so perhaps it would be best to overload it with
> > additional options?  I can imagine 0x2 defining persistent allocation so
> > that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> > whether the hzp should be per node.  Implementing persistent allocation
> > fixes our concern with it, so I'd like to start there.  Comments?
>
> Why not separate files?
>

That works as well.  I'll write a patch for persistent allocation first
to address our most immediate need.
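A rough sketch of the direction agreed on here -- a separate sysfs file that makes the shrinker leave the huge zero page alone. The flag bit, file name, and handlers below are hypothetical illustrations, not the patch David went on to write:

	/*
	 * Hypothetical: a separate file,
	 * /sys/kernel/mm/transparent_hugepage/persistent_zero_page,
	 * backed by a new flag bit, alongside the existing use_zero_page.
	 */
	static ssize_t persistent_zero_page_show(struct kobject *kobj,
			struct kobj_attribute *attr, char *buf)
	{
		return single_hugepage_flag_show(kobj, attr, buf,
				TRANSPARENT_HUGEPAGE_PERSISTENT_ZERO_PAGE_FLAG);
	}
	static ssize_t persistent_zero_page_store(struct kobject *kobj,
			struct kobj_attribute *attr, const char *buf, size_t count)
	{
		return single_hugepage_flag_store(kobj, attr, buf, count,
				TRANSPARENT_HUGEPAGE_PERSISTENT_ZERO_PAGE_FLAG);
	}
	static struct kobj_attribute persistent_zero_page_attr =
		__ATTR(persistent_zero_page, 0644,
		       persistent_zero_page_show, persistent_zero_page_store);

	/* The shrinker then simply refuses to report the page as reclaimable. */
	static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
							 struct shrink_control *sc)
	{
		if (transparent_hugepage_flags &
		    (1 << TRANSPARENT_HUGEPAGE_PERSISTENT_ZERO_PAGE_FLAG))
			return 0;
		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
	}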
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8..d471ce8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -148,13 +148,6 @@ madvise
 never
 	should be self-explanatory.
 
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
-
-	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-
 Some userspace (such as a test program, or an optimized memory allocation
 library) may want to know the size (in bytes) of a transparent hugepage::
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a1262..0ea7808 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -58,7 +58,6 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
-	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
@@ -116,9 +115,6 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 	return false;
 }
 
-#define transparent_hugepage_use_zero_page()				\
-	(transparent_hugepage_flags &					\
-	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 #ifdef CONFIG_DEBUG_VM
 #define transparent_hugepage_debug_cow()				\
 	(transparent_hugepage_flags &					\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a..0d4cf87 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -54,8 +54,7 @@
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
-	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
-	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
 
 static struct shrinker deferred_split_shrinker;
 
@@ -273,21 +272,6 @@ static ssize_t defrag_store(struct kobject *kobj,
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
 
-static ssize_t use_zero_page_show(struct kobject *kobj,
-		struct kobj_attribute *attr, char *buf)
-{
-	return single_hugepage_flag_show(kobj, attr, buf,
-				TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
-}
-static ssize_t use_zero_page_store(struct kobject *kobj,
-		struct kobj_attribute *attr, const char *buf, size_t count)
-{
-	return single_hugepage_flag_store(kobj, attr, buf, count,
-				TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
-}
-static struct kobj_attribute use_zero_page_attr =
-	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
-
 static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -317,7 +301,6 @@ static ssize_t debug_cow_store(struct kobject *kobj,
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
-	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
 	&shmem_enabled_attr.attr,
@@ -677,8 +660,7 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
-			!mm_forbids_zeropage(vma->vm_mm) &&
-			transparent_hugepage_use_zero_page()) {
+			!mm_forbids_zeropage(vma->vm_mm)) {
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
Digging into the original review, it looks like the use_zero_page sysfs
knob was added to help ease of testing and to give users a way to
mitigate the refcounting overhead.

It has been a few years since the knob was first added, and I think we
can be confident that it is stable enough. Also, since commit
6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
the refcounting overhead has been reduced significantly.

Other than the above, the value of the knob is always 1 (enabled by
default); I suppose very few people turn it off.

So it does not seem worth keeping this knob around.

Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  7 -------
 include/linux/huge_mm.h                    |  4 ----
 mm/huge_memory.c                           | 22 ++--------------------
 3 files changed, 2 insertions(+), 31 deletions(-)