Message ID | 20241216165105.56185-13-dev.jain@arm.com (mailing list archive)
---|---
State | New
Series | khugepaged: Asynchronous mTHP collapse
On 16/12/2024 16:51, Dev Jain wrote:
> One of the testcases triggers a CoW on the 255th page (0-indexed) with
> max_ptes_shared = 256. This leads to pages 0-254 (255 in number) being
> unshared, and 257 pages shared, exceeding the constraint. Suppose we run
> the test as ./khugepaged -s 2. khugepaged then starts collapsing the range
> to order-2 folios, since PMD collapse will fail due to the constraint.
> When the scan reaches the 254-257 PTE range, because at least one PTE in
> this range is writable, with the other 3 being read-only, khugepaged
> collapses it into an order-2 mTHP, resulting in 3 extra PTEs getting
> unshared. After this, we encounter a 4-sized chunk of read-only PTEs, and
> mTHP collapse stops according to the scaled constraint, but the number of
> shared PTEs has now come under the constraint for PMD-sized THPs.
> Therefore, the next scan of khugepaged will be able to collapse this range
> into a PMD-mapped hugepage, leading to failure of this subtest. Fix this
> by reducing the CoW range.

Is this description essentially saying that it's now possible to creep
towards collapsing to a full PMD-size block over successive scans due to
rounding errors in the scaling? Or is this just exercising an edge case,
and the problem doesn't generalize?

[...]
On 18/12/24 2:33 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> One of the testcases triggers a CoW on the 255th page (0-indexed) with
>> max_ptes_shared = 256. [...]
>
> Is this description essentially saying that it's now possible to creep
> towards collapsing to a full PMD-size block over successive scans due to
> rounding errors in the scaling? Or is this just exercising an edge case,
> and the problem doesn't generalize?

For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, with no
rounding error. We cannot really get a rounding problem, because we round
down: the restriction either stays the same or becomes stricter as we go
down the orders.

But thinking again, this behaviour may generalize: suppose the distribution
of none vs filled PTEs is very skewed at the PMD scale. In a local region
the distribution may not be skewed, so an mTHP collapse will occur, making
that entire region uniform. Over time this may keep happening, until the
region becomes globally uniform enough to come under the PMD constraint on
max_ptes_none, and eventually PMD collapse will occur. Which raises the
question of whether we want to detach khugepaged orders from the mTHP sysfs
settings.

[...]
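To make the scaling Dev quotes concrete, here is a minimal userspace sketch
(plain C, not kernel code; it assumes HPAGE_PMD_ORDER = 9, i.e. 4K pages
with 2M PMDs, and the round-down shift scaling described above; all names
are illustrative):

    #include <stdio.h>

    #define HPAGE_PMD_ORDER 9	/* 512 PTEs per PMD with 4K pages */

    /* Scale a PMD-level khugepaged constraint down to a smaller order. */
    static int scaled_limit(int pmd_limit, int order)
    {
    	return pmd_limit >> (HPAGE_PMD_ORDER - order);
    }

    int main(void)
    {
    	int max_ptes_shared = 256;
    	int order;

    	for (order = HPAGE_PMD_ORDER; order >= 1; order--)
    		printf("order-%d: limit %d of %d PTEs\n", order,
    		       scaled_limit(max_ptes_shared, order), 1 << order);
    	/* Because the shift rounds down, the fraction
    	 * limit / (1 << order) can only stay the same or shrink as
    	 * the order decreases, as Dev argues above. */
    	return 0;
    }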
On 18/12/2024 09:50, Dev Jain wrote:
> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>> On 16/12/2024 16:51, Dev Jain wrote:
>>> One of the testcases triggers a CoW on the 255th page (0-indexed) with
>>> max_ptes_shared = 256. [...]
>>
>> Is this description essentially saying that it's now possible to creep
>> towards collapsing to a full PMD-size block over successive scans due to
>> rounding errors in the scaling? Or is this just exercising an edge case,
>> and the problem doesn't generalize?
>
> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, with no
> rounding error. We cannot really get a rounding problem, because we round
> down: the restriction either stays the same or becomes stricter as we go
> down the orders.
>
> But thinking again, this behaviour may generalize: suppose the
> distribution of none vs filled PTEs is very skewed at the PMD scale. In a
> local region the distribution may not be skewed, so an mTHP collapse will
> occur, making that entire region uniform. Over time this may keep
> happening, until the region becomes globally uniform enough to come under
> the PMD constraint on max_ptes_none, and eventually PMD collapse will
> occur. Which raises the question of whether we want to detach khugepaged
> orders from the mTHP sysfs settings.

We want to avoid new user controls at all costs, I think.

I think an example of the problem you are describing is: let's say we start
off with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have
a 2M VMA aligned over a single PMD. The first 2 4K pages of the VMA are
allocated.

khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
single order-2 (16K) folio; that's allowed because 50% of the PTEs were
none. But now on the next scan, 50% of the first 8 PTEs are none, so it
will collapse to 32K. Then on the next scan it will collapse to 64K, and so
on, all the way to 2M. So by faulting in 2 pages originally we have now
collapsed to 2M despite the control trying to prevent it, and we have done
it in a very inefficient way.

If max_ptes_none were 75% you would only need every other order enabled (I
think?).

In practice perhaps it's not a problem, because you are only likely to have
1 or 2 mTHP sizes enabled. But I wonder if we need to think about how to
protect against this "creep"?

Perhaps only consider a large folio for collapse into a larger folio if it
wasn't originally collapsed by khugepaged in the first place? That would
need a folio flag... and I suspect that will cause other edge case issues
if we think about it for 5 mins...

Another way of thinking about it is: if all the same mTHP orders were
enabled at fault time (and the allocation succeeded), we would have
allocated the largest order anyway, so the end states are the same. But the
large number of incremental collapses that khugepaged will perform feels
like a problem.

I'm not quite sure what the answer is.

Thanks,
Ryan

[...]
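Ryan's creep scenario can be modeled in a few lines of C (a toy simulation
under the assumption that each scan collapses the largest order whose
none-PTE count fits the scaled max_ptes_none, and that a collapse makes the
whole range present; illustrative only, not kernel logic):

    #include <stdio.h>

    #define HPAGE_PMD_ORDER 9

    int main(void)
    {
    	int max_ptes_none = 256;	/* 50% of 512 */
    	int present = 2;		/* 2 pages faulted in at offset 0 */
    	int order, scan = 0;

    	for (;;) {
    		int collapsed = 0;

    		/* One scan: take the biggest collapse currently allowed. */
    		for (order = HPAGE_PMD_ORDER; order >= 2; order--) {
    			int nr = 1 << order;
    			int none = nr > present ? nr - present : 0;

    			if (none <= (max_ptes_none >> (HPAGE_PMD_ORDER - order))) {
    				present = nr;	/* collapse fills the none PTEs */
    				collapsed = 1;
    				printf("scan %d: collapsed to order-%d (%d PTEs)\n",
    				       ++scan, order, nr);
    				break;
    			}
    		}
    		if (!collapsed || present == 1 << HPAGE_PMD_ORDER)
    			break;
    	}
    	return 0;
    }

Running this prints one collapse per scan, order-2 through order-9: the
incremental march from 16K to 2M that the example describes.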
On 20/12/24 4:35 pm, Ryan Roberts wrote:
> On 18/12/2024 09:50, Dev Jain wrote:
>> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>>> [...]
>>
>> [...] Which raises the question of whether we want to detach khugepaged
>> orders from the mTHP sysfs settings.
>
> We want to avoid new user controls at all costs, I think.
>
> I think an example of the problem you are describing is: let's say we
> start off with all mTHP orders enabled and max_ptes_none is 50% (so 256).
> We have a 2M VMA aligned over a single PMD. The first 2 4K pages of the
> VMA are allocated.
>
> khugepaged will scan this VMA and decide to collapse the first 4 PTEs to
> a single order-2 (16K) folio; that's allowed because 50% of the PTEs were
> none. But now on the next scan, 50% of the first 8 PTEs are none, so it
> will collapse to 32K. Then on the next scan it will collapse to 64K, and
> so on, all the way to 2M. So by faulting in 2 pages originally we have
> now collapsed to 2M despite the control trying to prevent it, and we have
> done it in a very inefficient way.
>
> [...]
>
> I'm not quite sure what the answer is.

Can't really think of anything else apart from decoupling the khugepaged
sysfs from the mTHP sysfs...

[...]
On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
> On 20/12/24 4:35 pm, Ryan Roberts wrote:
>> On 18/12/2024 09:50, Dev Jain wrote:
>>> [...] Which raises the question of whether we want to detach khugepaged
>>> orders from the mTHP sysfs settings.
>>
>> We want to avoid new user controls at all costs, I think.
>>
>> [...] So by faulting in 2 pages originally we have now collapsed to 2M
>> despite the control trying to prevent it, and we have done it in a very
>> inefficient way.
>>
>> In practice perhaps it's not a problem, because you are only likely to
>> have 1 or 2 mTHP sizes enabled. But I wonder if we need to think about
>> how to protect against this "creep"?
>>
>> Perhaps only consider a large folio for collapse into a larger folio if
>> it wasn't originally collapsed by khugepaged in the first place? That
>> would need a folio flag... and I suspect that will cause other edge case
>> issues if we think about it for 5 mins...
>>
>> [...]
>>
>> I'm not quite sure what the answer is.
>
> Can't really think of anything else apart from decoupling the khugepaged
> sysfs from the mTHP sysfs...

One (not so effective) workaround is to add a VMA flag to make khugepaged
skip scanning a VMA that khugepaged has collapsed before, and reset the
flag in a future page fault. This would prevent khugepaged from doing this
"creep" collapse behavior until the page tables covered by the VMA are
changed. This is not perfect, since the page faults might not change the
aforementioned region, and later khugepaged can still perform the "creep".
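A rough userspace model of the state machine Zi is suggesting (all names
here are hypothetical; a real implementation would be a VM_* flag in the
kernel's vm_area_struct, which this sketch does not attempt):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-VMA state for the suggested workaround. */
    struct vma_model {
    	bool skip_khugepaged;	/* set after a khugepaged collapse */
    };

    static void khugepaged_scan(struct vma_model *vma)
    {
    	if (vma->skip_khugepaged) {
    		printf("khugepaged: skipping previously collapsed VMA\n");
    		return;
    	}
    	printf("khugepaged: scanning and collapsing\n");
    	vma->skip_khugepaged = true;	/* don't rescan until a fault */
    }

    static void page_fault(struct vma_model *vma)
    {
    	/* A fault changes the page tables, so allow rescanning. */
    	vma->skip_khugepaged = false;
    }

    int main(void)
    {
    	struct vma_model vma = { .skip_khugepaged = false };

    	khugepaged_scan(&vma);	/* collapses, sets the flag */
    	khugepaged_scan(&vma);	/* skipped: no creep without new faults */
    	page_fault(&vma);
    	khugepaged_scan(&vma);	/* rescanned after the fault */
    	return 0;
    }

As Zi notes, the weakness is the last step: the fault that clears the flag
may land nowhere near the region that would creep.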
On 30/12/2024 16:36, Zi Yan wrote:
> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>> On 20/12/24 4:35 pm, Ryan Roberts wrote:
>>> [...]
>>>
>>> I'm not quite sure what the answer is.
>>
>> Can't really think of anything else apart from decoupling the khugepaged
>> sysfs from the mTHP sysfs...
>
> One (not so effective) workaround is to add a VMA flag to make khugepaged
> skip scanning a VMA that khugepaged has collapsed before, and reset the
> flag in a future page fault. This would prevent khugepaged from doing
> this "creep" collapse behavior until the page tables covered by the VMA
> are changed. This is not perfect, since the page faults might not change
> the aforementioned region, and later khugepaged can still perform the
> "creep".

I guess what you really want is a bitmap with a bit per page in the VMA to
tell you whether it exists due to fault or collapse. But... yuk...

It looks like Ubuntu at least is not modifying the default max_ptes_none,
which is 511. So for general purpose distros I guess we won't see this
issue in practice, because we will always collapse to the largest enabled
size.

So with that in mind, perhaps Zi's suggested single VM flag idea will be
good enough?

Thanks,
Ryan
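For reference, Ryan's point about the default follows from the same
round-down shift: with max_ptes_none = 511, the scaled limit at every order
is (1 << order) - 1, so a single present page already permits collapse at
any enabled size. A quick illustrative check:

    #include <stdio.h>

    #define HPAGE_PMD_ORDER 9

    int main(void)
    {
    	int order;

    	/* 511 >> 1 = 255 of 256, 511 >> 2 = 127 of 128, ...:
    	 * always one less than the number of PTEs at that order. */
    	for (order = HPAGE_PMD_ORDER; order >= 2; order--)
    		printf("order-%d: %d none PTEs allowed of %d\n", order,
    		       511 >> (HPAGE_PMD_ORDER - order), 1 << order);
    	return 0;
    }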
On 02/01/25 5:13 pm, Ryan Roberts wrote:
> On 30/12/2024 16:36, Zi Yan wrote:
>> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>>> [...]
>>
>> One (not so effective) workaround is to add a VMA flag to make
>> khugepaged skip scanning a VMA that khugepaged has collapsed before, and
>> reset the flag in a future page fault. [...]
>
> I guess what you really want is a bitmap with a bit per page in the VMA
> to tell you whether it exists due to fault or collapse. But... yuk...
>
> It looks like Ubuntu at least is not modifying the default max_ptes_none,
> which is 511. So for general purpose distros I guess we won't see this
> issue in practice, because we will always collapse to the largest enabled
> size.

We will still see this problem for max_ptes_shared = 256... We can set it
to 255, so that the fraction progresses as follows:

255/512 -> 127/256 -> 63/128 -> 31/64 -> 15/32 -> 7/16 -> 3/8 -> 1/4 -> 0/2

This is the best possible fractional decrease we can get, since we always
land on an odd number and lose 1 to rounding. The only real solution seems
to be to track whether the PTE/page we have is original or collapsed.

> So with that in mind, perhaps Zi's suggested single VM flag idea will be
> good enough?
>
> Thanks,
> Ryan
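The progression Dev lists can be reproduced with the same round-down shift
(illustrative C, assuming HPAGE_PMD_ORDER = 9):

    #include <stdio.h>

    #define HPAGE_PMD_ORDER 9

    int main(void)
    {
    	int order;

    	/* max_ptes_shared = 255: each halving lands on an odd number,
    	 * so one PTE of allowance is lost to rounding at every order. */
    	for (order = HPAGE_PMD_ORDER; order >= 1; order--)
    		printf("%d/%d\n", 255 >> (HPAGE_PMD_ORDER - order),
    		       1 << order);
    	return 0;
    }

This prints exactly the sequence above, from 255/512 down to 0/2.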
On 30/12/24 10:06 pm, Zi Yan wrote:
> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>> [...]
>>
>> Can't really think of anything else apart from decoupling the khugepaged
>> sysfs from the mTHP sysfs...
>
> One (not so effective) workaround is to add a VMA flag to make khugepaged
> skip scanning a VMA that khugepaged has collapsed before, and reset the
> flag in a future page fault.

Currently the code scans a VMA, and if it is able to collapse the first
PMD-sized chunk, it goes to the next VMA: if (!mmap_locked) -> goto
breakouterloop_mmap_lock. So in your method, suppose we have a VMA of
200MB: only the first 2MB chunk will be scanned for this VMA, and the
remaining 198MB will never be scanned. So it boils down to remembering
which ranges (not abstractable by the complete VMA) got collapsed and
skipping them, which basically leads us back to maintaining a folio flag...

I wonder why we go to the next VMA immediately after the first collapse
success. I guess it is so that khugepaged does not trouble the same process
for too long, and gives other processes a chance? If we can change this
behaviour, scan the complete VMA, and then mark it as collapsed, then your
method works; but again, as you mention below, the page fault may operate
somewhere else in the VMA...

> This would prevent khugepaged from doing this "creep" collapse behavior
> until the page tables covered by the VMA are changed. This is not
> perfect, since the page faults might not change the aforementioned
> region, and later khugepaged can still perform the "creep".
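One way to picture the per-range bookkeeping being discussed (purely a
userspace sketch, not a proposal for kernel data structures: one bit per
PMD-sized chunk, set on collapse and cleared on fault, so a full-VMA scan
could skip already-collapsed chunks; all names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CHUNKS 100	/* e.g. a 200MB VMA has 100 2MB chunks */

    static unsigned char collapsed[NR_CHUNKS / 8 + 1];	/* 1 bit/chunk */

    static void set_collapsed(int i)   { collapsed[i / 8] |= 1u << (i % 8); }
    static void clear_collapsed(int i) { collapsed[i / 8] &= ~(1u << (i % 8)); }
    static bool is_collapsed(int i)    { return collapsed[i / 8] & (1u << (i % 8)); }

    int main(void)
    {
    	int i, scanned = 0;

    	set_collapsed(0);	/* khugepaged collapsed chunk 0 earlier */

    	/* A later scan covers the whole VMA but skips collapsed chunks. */
    	for (i = 0; i < NR_CHUNKS; i++) {
    		if (is_collapsed(i))
    			continue;
    		scanned++;	/* would scan/collapse this 2MB chunk */
    	}
    	printf("scanned %d of %d chunks\n", scanned, NR_CHUNKS);

    	clear_collapsed(0);	/* a page fault in chunk 0 re-arms it */
    	return 0;
    }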
One of the testcases triggers a CoW on the 255th page (0-indexed) with
max_ptes_shared = 256. This leads to pages 0-254 (255 in number) being
unshared, and 257 pages shared, exceeding the constraint. Suppose we run
the test as ./khugepaged -s 2. khugepaged then starts collapsing the range
to order-2 folios, since PMD collapse will fail due to the constraint.
When the scan reaches the 254-257 PTE range, because at least one PTE in
this range is writable, with the other 3 being read-only, khugepaged
collapses it into an order-2 mTHP, resulting in 3 extra PTEs getting
unshared. After this, we encounter a 4-sized chunk of read-only PTEs, and
mTHP collapse stops according to the scaled constraint, but the number of
shared PTEs has now come under the constraint for PMD-sized THPs.
Therefore, the next scan of khugepaged will be able to collapse this range
into a PMD-mapped hugepage, leading to failure of this subtest. Fix this by
reducing the CoW range.

Note: The only objective of this patch is to make the test work for the
PMD case; no extension has been made for testing mTHPs.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 tools/testing/selftests/mm/khugepaged.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 8a4d34cce36b..143c4ad9f6a1 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -981,6 +981,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
 static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
 {
 	int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
+	int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
 	int wstatus;
 	void *p;
 
@@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops
 		fail("Fail");
 
 	printf("Trigger CoW on page %d of %d...",
-	       hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
-	ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+	       hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
+	ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) * page_size);
 	if (ops->check_huge(p, 0))
 		success("OK");
 	else
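To see the arithmetic behind the fix, compare the CoW range before and
after the patch (a small check assuming hpage_pmd_nr = 512,
max_ptes_shared = 256 and anon_order = 2, as in the commit message):

    #include <stdio.h>

    int main(void)
    {
    	int hpage_pmd_nr = 512, max_ptes_shared = 256;
    	int anon_order = 2, fault_nr_pages = 1 << anon_order;

    	/* Before: 255 pages CoWed -> 257 shared, just 1 over the limit;
    	 * one mixed order-2 collapse unshares 3 more and defeats the
    	 * test, per the commit message. */
    	printf("old CoW range: %d pages\n",
    	       hpage_pmd_nr - max_ptes_shared - 1);

    	/* After: 252 pages CoWed (order-2 aligned) -> 260 shared; no
    	 * mixed order-2 chunk remains to collapse, so the shared count
    	 * stays above max_ptes_shared. */
    	printf("new CoW range: %d pages\n",
    	       hpage_pmd_nr - max_ptes_shared - fault_nr_pages);
    	return 0;
    }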