diff mbox series

[v7,05/11] mm: split a folio in minimum folio order chunks

Message ID 20240607145902.1137853-6-kernel@pankajraghav.com (mailing list archive)
State New
Headers show
Series enable bs > ps in XFS | expand

Commit Message

Pankaj Raghav (Samsung) June 7, 2024, 2:58 p.m. UTC
From: Luis Chamberlain <mcgrof@kernel.org>

split_folio() and split_folio_to_list() assume order 0. To support
minorder for non-anonymous folios, we must expand these to check the
folio mapping order and use that.

Set new_order to at least the minimum folio order, if one is set, in
split_huge_page_to_list() so that we can maintain the minimum folio
order requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/huge_mm.h | 14 ++++++++---
 mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 61 insertions(+), 8 deletions(-)

Comments

Zi Yan June 7, 2024, 4:58 p.m. UTC | #1
Hi Pankaj,

Can you use ziy@nvidia.com instead of zi.yan@sent.com? I only use the latter
to send patches. Thanks.

On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:

> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/huge_mm.h | 14 ++++++++---
>  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 61 insertions(+), 8 deletions(-)
>

<snip>

>
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (!folio_test_anon(folio)) {
> +		if (!folio->mapping) {
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);

You should only increase this counter when the input folio is a THP, namely
folio_test_pmd_mappable(folio) is true. For other large folios, we will
need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
See enum mthp_stat_item in include/linux/huge_mm.h.


--
Best Regards,
Yan, Zi
Matthew Wilcox June 7, 2024, 5:01 p.m. UTC | #2
On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > +{
> > +	unsigned int min_order = 0;
> > +
> > +	if (!folio_test_anon(folio)) {
> > +		if (!folio->mapping) {
> > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> 
> You should only increase this counter when the input folio is a THP, namely
> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> See enum mthp_stat_item in include/linux/huge_mm.h.

Also, why should this count as a split failure?  If we see a NULL
mapping, the folio has been truncated and so no longer needs to be
split.  I understand we currently count it as a failure, but I
don't think we should.
Pankaj Raghav (Samsung) June 7, 2024, 8:30 p.m. UTC | #3
On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> Hi Pankaj,
> 
> Can you use ziy@nvidia.com instead of zi.yan@sent.com? Since I just use the latter
> to send patches. Thanks.

Got it!

> 
> On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:
> 
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0, to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain minimum folio order
> > requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  include/linux/huge_mm.h | 14 ++++++++---
> >  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 61 insertions(+), 8 deletions(-)
> >
> 
> <snip>
> 
> >
> > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > +{
> > +	unsigned int min_order = 0;
> > +
> > +	if (!folio_test_anon(folio)) {
> > +		if (!folio->mapping) {
> > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> 
> You should only increase this counter when the input folio is a THP, namely
> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> See enum mthp_stat_item in include/linux/huge_mm.h.
> 
Hmm, but we don't have mTHP support for non-anonymous memory, right? In
that case, it won't be applicable to file-backed memory?

I am not an expert here, so correct me if I am wrong.

--
Regards,
Pankaj
Pankaj Raghav (Samsung) June 7, 2024, 8:45 p.m. UTC | #4
On Fri, Jun 07, 2024 at 06:01:56PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> > > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > > +{
> > > +	unsigned int min_order = 0;
> > > +
> > > +	if (!folio_test_anon(folio)) {
> > > +		if (!folio->mapping) {
> > > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> > 
> > You should only increase this counter when the input folio is a THP, namely
> > folio_test_pmd_mappable(folio) is true. For other large folios, we will
> > need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> > See enum mthp_stat_item in include/linux/huge_mm.h.
> 
> Also, why should this count as a split failure?  If we see a NULL
> mapping, the folio has been truncated and so no longer needs to be
> split.  I understand we currently count it as a failure, but I
> don't think we should.

I also thought about this: if the folio was under writeback, we don't
account it as a failure, but we do if it was truncated?

I can remove the accounting that we added as part of this series in
the next version, and address the upstream changes [1] in a separate
standalone patch. I prefer to address these kinds of open-discussion
upstream changes separately so that we don't delay this series.

Let me know what you think. CCing Kirill as he made those changes.

[1]
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 399a4f5125c7..21f2dd5eb4c5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3152,10 +3152,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                mapping = folio->mapping;
 
                /* Truncated ? */
-               if (!mapping) {
+               if (!mapping)
                        ret = -EBUSY;
-                       goto out;
-               }
Zi Yan June 7, 2024, 8:51 p.m. UTC | #5
On 7 Jun 2024, at 16:30, Pankaj Raghav (Samsung) wrote:

> On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
>> Hi Pankaj,
>>
>> Can you use ziy@nvidia.com instead of zi.yan@sent.com? Since I just use the latter
>> to send patches. Thanks.
>
> Got it!
>
>>
>> On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:
>>
>>> From: Luis Chamberlain <mcgrof@kernel.org>
>>>
>>> split_folio() and split_folio_to_list() assume order 0, to support
>>> minorder for non-anonymous folios, we must expand these to check the
>>> folio mapping order and use that.
>>>
>>> Set new_order to be at least minimum folio order if it is set in
>>> split_huge_page_to_list() so that we can maintain minimum folio order
>>> requirement in the page cache.
>>>
>>> Update the debugfs write files used for testing to ensure the order
>>> is respected as well. We simply enforce the min order when a file
>>> mapping is used.
>>>
>>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>>> ---
>>>  include/linux/huge_mm.h | 14 ++++++++---
>>>  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>>>  2 files changed, 61 insertions(+), 8 deletions(-)
>>>
>>
>> <snip>
>>
>>>
>>> +int split_folio_to_list(struct folio *folio, struct list_head *list)
>>> +{
>>> +	unsigned int min_order = 0;
>>> +
>>> +	if (!folio_test_anon(folio)) {
>>> +		if (!folio->mapping) {
>>> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
>>
>> You should only increase this counter when the input folio is a THP, namely
>> folio_test_pmd_mappable(folio) is true. For other large folios, we will
>> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
>> See enum mthp_stat_item in include/linux/huge_mm.h.
>>
> Hmm, but we don't have mTHP support for non-anonymous memory right? In
> that case it won't be applicable for file backed memory?

Large folio support in the page cache precedes mTHP (large anonymous
folios), thanks to willy's work; mTHP is more like a subset of large
folios. There are no specific counters for page cache large folios. If
you think it is worth tracking folios with orders between 0 and 9
(exclusive), you can add counters. Matthew, what is your take on this?


--
Best Regards,
Yan, Zi
Pankaj Raghav (Samsung) June 10, 2024, 7:26 a.m. UTC | #6
On Fri, Jun 07, 2024 at 04:51:04PM -0400, Zi Yan wrote:
> On 7 Jun 2024, at 16:30, Pankaj Raghav (Samsung) wrote:
> >>> +		if (!folio->mapping) {
> >>> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> >>
> >> You should only increase this counter when the input folio is a THP, namely
> >> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> >> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> >> See enum mthp_stat_item in include/linux/huge_mm.h.
> >>
> > Hmm, but we don't have mTHP support for non-anonymous memory right? In
> > that case it won't be applicable for file backed memory?
> 
> Large folio support in page cache precedes mTHP (large anonymous folio),
> thanks to willy's work. mTHP is more like a subset of large folio.
> There is no specific counters for page cache large folio. If you think
> it is worth tracking folios with orders between 0 and 9 (exclusive),
> you can add counters. Matthew, what is your take on this?

Got it. I think this is out of scope for this series but something we
could consider as a future enhancement?

In any case, we need to decide whether to count truncation as a VM
event or not.
Hannes Reinecke June 12, 2024, 9:02 a.m. UTC | #7
On 6/7/24 16:58, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
> 
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
> 
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/huge_mm.h | 14 ++++++++---
>   mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>   2 files changed, 61 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 020e2344eb86..15caa4e7b00e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -88,6 +88,8 @@  extern struct kobj_attribute shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -307,9 +309,10 @@  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -474,6 +477,12 @@  static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -578,7 +587,4 @@  static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..399a4f5125c7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3068,6 +3068,9 @@  bool can_split_folio(struct folio *folio, int *pextra_pins)
  * released, or if some unexpected race happened (e.g., anon VMA disappeared,
  * truncation).
  *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
@@ -3143,6 +3146,7 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3153,6 +3157,14 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3264,6 +3276,21 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (!folio_test_anon(folio)) {
+		if (!folio->mapping) {
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+			return -EBUSY;
+		}
+		min_order = mapping_min_folio_order(folio->mapping);
+	}
+
+	return split_huge_page_to_list_to_order(&folio->page, list, min_order);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3493,6 +3520,8 @@  static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct page *page;
 		struct folio *folio;
+		struct address_space *mapping;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3513,7 +3542,13 @@  static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
-		if (new_order >= folio_order(folio))
+		if (!folio_test_anon(folio)) {
+			mapping = folio->mapping;
+			target_order = max(new_order,
+					   mapping_min_folio_order(mapping));
+		}
+
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		total++;
@@ -3529,9 +3564,13 @@  static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio) && folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
@@ -3556,6 +3595,7 @@  static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	pgoff_t index;
 	int nr_pages = 1;
 	unsigned long total = 0, split = 0;
+	unsigned int min_order;
 
 	file = getname_kernel(file_path);
 	if (IS_ERR(file))
@@ -3569,9 +3609,11 @@  static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		 file_path, off_start, off_end);
 
 	mapping = candidate->f_mapping;
+	min_order = mapping_min_folio_order(mapping);
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
+		unsigned int target_order = new_order;
 
 		nr_pages = 1;
 		if (IS_ERR(folio))
@@ -3580,18 +3622,23 @@  static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (!folio_test_large(folio))
 			goto next;
 
+		target_order = max(new_order, min_order);
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
-		if (new_order >= folio_order(folio))
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);