
[v18,06/32] mm/thp: narrow lru locking

Message ID 1598273705-69124-7-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series per memcg lru_lock

Commit Message

Alex Shi Aug. 24, 2020, 12:54 p.m. UTC
lru_lock and page cache xa_lock have no reason, with the current sequence,
to be taken together; holding them at the same time isn't necessary. Let's
narrow the lru locking, but leave the local_irq_disable() to block interrupt
re-entry and protect the statistics updates.

Hugh Dickins' point: split_huge_page_to_list() was already silly to be
using the _irqsave variant: it's just been taking sleeping locks, so
would already be broken if entered with interrupts enabled. So we can
save passing the flags argument down to __split_huge_page().

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)
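
In outline, the locking sequence after this patch becomes the following (a simplified sketch based on the full patch at the bottom of this page; the swap-cache branch and all failure paths are omitted):

	local_irq_disable();			/* block interrupt re-entry */
	xa_lock(&mapping->i_pages);		/* file/shmem-backed case */

	spin_lock(&ds_queue->split_queue_lock);	/* freeze refcount, drop from deferred queue */
	spin_unlock(&ds_queue->split_queue_lock);
	/* __dec_node_page_state(head, NR_SHMEM_THPS / NR_FILE_THPS) */

	/* in __split_huge_page(): lru_lock is now narrowed to the tail split */
	spin_lock(&pgdat->lru_lock);
	/* __split_huge_page_tail() for each tail page, ClearPageCompound() */
	spin_unlock(&pgdat->lru_lock);

	xa_unlock(&mapping->i_pages);
	local_irq_enable();
	remap_page(head);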

Comments

Matthew Wilcox Sept. 10, 2020, 1:49 p.m. UTC | #1
On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> lru_lock and page cache xa_lock have no reason with current sequence,
> put them together isn't necessary. let's narrow the lru locking, but
> left the local_irq_disable to block interrupt re-entry and statistic update.

What stats are you talking about here?

> +++ b/mm/huge_memory.c
> @@ -2397,7 +2397,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>  
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +			      pgoff_t end)

Please don't change this whitespace.  It's really annoying having to
adjust the whitespace when renaming a function.  Just two tabs indentation
to give a clear separation of arguments from code is fine.


How about this patch instead?  It occurred to me we already have
perfectly good infrastructure to track whether or not interrupts are
already disabled, and so we should use that instead of ensuring that
interrupts are disabled, or tracking that ourselves.

But I may have missed something else that's relying on having
interrupts disabled.  Please check carefully.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ccff8472cd4..74cae6c032f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2376,17 +2376,16 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
+	unsigned long flags;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2395,9 +2394,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 		offset = swp_offset(entry);
 		swap_cache = swap_address_space(entry);
-		xa_lock(&swap_cache->i_pages);
+		xa_lock_irq(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irqsave(&pgdat->lru_lock, flags);
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2417,6 +2420,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2425,18 +2429,16 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Additional pin to swap cache */
 		if (PageSwapCache(head)) {
 			page_ref_add(head, 2);
-			xa_unlock(&swap_cache->i_pages);
+			xa_unlock_irq(&swap_cache->i_pages);
 		} else {
 			page_ref_inc(head);
 		}
 	} else {
 		/* Additional pin to page cache */
 		page_ref_add(head, 2);
-		xa_unlock(&head->mapping->i_pages);
+		xa_unlock_irq(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-
 	remap_page(head);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -2574,7 +2576,6 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
@@ -2640,9 +2641,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2650,13 +2648,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 * Check if the head page is present in page cache.
 		 * We assume all tail are present too, if head is there.
 		 */
-		xa_lock(&mapping->i_pages);
+		xa_lock_irq(&mapping->i_pages);
 		if (xas_load(&xas) != head)
 			goto fail;
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
-	spin_lock(&ds_queue->split_queue_lock);
+	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	count = page_count(head);
 	mapcount = total_mapcount(head);
 	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
@@ -2664,7 +2662,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 		if (mapping) {
 			if (PageSwapBacked(head))
 				__dec_node_page_state(head, NR_SHMEM_THPS);
@@ -2672,7 +2670,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2688,10 +2686,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			dump_page(page, "total_mapcount(head) > 0");
 			BUG();
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 fail:		if (mapping)
-			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+			xa_unlock_irq(&mapping->i_pages);
 		remap_page(head);
 		ret = -EBUSY;
 	}
Alex Shi Sept. 11, 2020, 3:37 a.m. UTC | #2
On 2020/9/10 21:49, Matthew Wilcox wrote:
> On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
>> lru_lock and page cache xa_lock have no reason with current sequence,
>> put them together isn't necessary. let's narrow the lru locking, but
>> left the local_irq_disable to block interrupt re-entry and statistic update.
> 
> What stats are you talking about here?

Hi Matthew,

Thanks for comments!

For example, __dec_node_page_state(head, NR_SHMEM_THPS); would otherwise trigger a preemption warning...
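
For context, the double-underscore vmstat helpers are the non-irq-safe variants and expect the caller to have interrupts (or at least preemption) already disabled; a minimal contrast (the dec_node_page_state() line is illustration only, not from either patch):

	/* non-irq-safe: assumes the caller already disabled interrupts */
	__dec_node_page_state(head, NR_SHMEM_THPS);

	/* irq-safe variant: handles interrupt protection internally */
	dec_node_page_state(head, NR_SHMEM_THPS);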

> 
>> +++ b/mm/huge_memory.c
>> @@ -2397,7 +2397,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  }
>>  
>>  static void __split_huge_page(struct page *page, struct list_head *list,
>> -		pgoff_t end, unsigned long flags)
>> +			      pgoff_t end)
> 
> Please don't change this whitespace.  It's really annoying having to
> adjust the whitespace when renaming a function.  Just two tabs indentation
> to give a clear separation of arguments from code is fine.
> 
> 
> How about this patch instead?  It occurred to me we already have
> perfectly good infrastructure to track whether or not interrupts are
> already disabled, and so we should use that instead of ensuring that
> interrupts are disabled, or tracking that ourselves.

So your proposal looks like:
1, xa_lock_irq(&mapping->i_pages); (optional)
2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
3, spin_lock_irqsave(&pgdat->lru_lock, flags);

Is the flags argument meaningful for the 2nd and 3rd locks?

IIRC, I had a similar proposal to yours, with the flags used in xa_lock_irqsave(),
but it was objected to by Hugh.

Thanks
Alex

> 
> But I may have missed something else that's relying on having
> interrupts disabled.  Please check carefully.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2ccff8472cd4..74cae6c032f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2376,17 +2376,16 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>  
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +		pgoff_t end)
>  {
>  	struct page *head = compound_head(page);
>  	pg_data_t *pgdat = page_pgdat(head);
>  	struct lruvec *lruvec;
>  	struct address_space *swap_cache = NULL;
>  	unsigned long offset = 0;
> +	unsigned long flags;
>  	int i;
>  
> -	lruvec = mem_cgroup_page_lruvec(head, pgdat);
> -
>  	/* complete memcg works before add pages to LRU */
>  	mem_cgroup_split_huge_fixup(head);
>  
> @@ -2395,9 +2394,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  
>  		offset = swp_offset(entry);
>  		swap_cache = swap_address_space(entry);
> -		xa_lock(&swap_cache->i_pages);
> +		xa_lock_irq(&swap_cache->i_pages);
>  	}
>  
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irqsave(&pgdat->lru_lock, flags);
> +	lruvec = mem_cgroup_page_lruvec(head, pgdat);
> +
>  	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>  		__split_huge_page_tail(head, i, lruvec, list);
>  		/* Some pages can be beyond i_size: drop them from page cache */
> @@ -2417,6 +2420,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	}
>  
>  	ClearPageCompound(head);
> +	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>  
>  	split_page_owner(head, HPAGE_PMD_ORDER);
>  
> @@ -2425,18 +2429,16 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		/* Additional pin to swap cache */
>  		if (PageSwapCache(head)) {
>  			page_ref_add(head, 2);
> -			xa_unlock(&swap_cache->i_pages);
> +			xa_unlock_irq(&swap_cache->i_pages);
>  		} else {
>  			page_ref_inc(head);
>  		}
>  	} else {
>  		/* Additional pin to page cache */
>  		page_ref_add(head, 2);
> -		xa_unlock(&head->mapping->i_pages);
> +		xa_unlock_irq(&head->mapping->i_pages);
>  	}
>  
> -	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -
>  	remap_page(head);
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> @@ -2574,7 +2576,6 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>  int split_huge_page_to_list(struct page *page, struct list_head *list)
>  {
>  	struct page *head = compound_head(page);
> -	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>  	struct deferred_split *ds_queue = get_deferred_split_queue(head);
>  	struct anon_vma *anon_vma = NULL;
>  	struct address_space *mapping = NULL;
> @@ -2640,9 +2641,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	unmap_page(head);
>  	VM_BUG_ON_PAGE(compound_mapcount(head), head);
>  
> -	/* prevent PageLRU to go away from under us, and freeze lru stats */
> -	spin_lock_irqsave(&pgdata->lru_lock, flags);
> -
>  	if (mapping) {
>  		XA_STATE(xas, &mapping->i_pages, page_index(head));
>  
> @@ -2650,13 +2648,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		 * Check if the head page is present in page cache.
>  		 * We assume all tail are present too, if head is there.
>  		 */
> -		xa_lock(&mapping->i_pages);
> +		xa_lock_irq(&mapping->i_pages);
>  		if (xas_load(&xas) != head)
>  			goto fail;
>  	}
>  
>  	/* Prevent deferred_split_scan() touching ->_refcount */
> -	spin_lock(&ds_queue->split_queue_lock);
> +	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	count = page_count(head);
>  	mapcount = total_mapcount(head);
>  	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
> @@ -2664,7 +2662,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			ds_queue->split_queue_len--;
>  			list_del(page_deferred_list(head));
>  		}
> -		spin_unlock(&ds_queue->split_queue_lock);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  		if (mapping) {
>  			if (PageSwapBacked(head))
>  				__dec_node_page_state(head, NR_SHMEM_THPS);
> @@ -2672,7 +2670,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  				__dec_node_page_state(head, NR_FILE_THPS);
>  		}
>  
> -		__split_huge_page(page, list, end, flags);
> +		__split_huge_page(page, list, end);
>  		if (PageSwapCache(head)) {
>  			swp_entry_t entry = { .val = page_private(head) };
>  
> @@ -2688,10 +2686,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			dump_page(page, "total_mapcount(head) > 0");
>  			BUG();
>  		}
> -		spin_unlock(&ds_queue->split_queue_lock);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  fail:		if (mapping)
> -			xa_unlock(&mapping->i_pages);
> -		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
> +			xa_unlock_irq(&mapping->i_pages);
>  		remap_page(head);
>  		ret = -EBUSY;
>  	}
>
Matthew Wilcox Sept. 13, 2020, 3:27 p.m. UTC | #3
On Fri, Sep 11, 2020 at 11:37:50AM +0800, Alex Shi wrote:
> 
> 
> On 2020/9/10 21:49, Matthew Wilcox wrote:
> > On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> >> lru_lock and page cache xa_lock have no reason with current sequence,
> >> put them together isn't necessary. let's narrow the lru locking, but
> >> left the local_irq_disable to block interrupt re-entry and statistic update.
> > 
> > What stats are you talking about here?
> 
> Hi Matthew,
> 
> Thanks for comments!
> 
> like __dec_node_page_state(head, NR_SHMEM_THPS); will have preemptive warning...

OK, but those stats are guarded by 'if (mapping)', so this patch doesn't
produce that warning because we'll have taken the xarray lock and disabled
interrupts.

> > How about this patch instead?  It occurred to me we already have
> > perfectly good infrastructure to track whether or not interrupts are
> > already disabled, and so we should use that instead of ensuring that
> > interrupts are disabled, or tracking that ourselves.
> 
> So your proposal looks like;
> 1, xa_lock_irq(&mapping->i_pages); (optional)
> 2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> 3, spin_lock_irqsave(&pgdat->lru_lock, flags);
> 
> Is there meaningful for the 2nd and 3rd flags?

Yes.  We want to avoid doing:

	if (mapping)
		spin_lock(&ds_queue->split_queue_lock);
	else
		spin_lock_irq(&ds_queue->split_queue_lock);
...
	if (mapping)
		spin_unlock(&ds_queue->split_queue_lock);
	else
		spin_unlock_irq(&ds_queue->split_queue_lock);

Just using _irqsave has the same effect and is easier to reason about.
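
With _irqsave, the same section collapses to one unconditional pair, as in the patch above:

	unsigned long flags;

	/*
	 * Works whether or not the caller (here, an earlier xa_lock_irq())
	 * already disabled interrupts: the previous irq state is saved in
	 * 'flags' and restored on unlock.
	 */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	/* ... count/mapcount checks, page_ref_freeze() ... */
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);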

> IIRC, I had a similar proposal as your, the flags used in xa_lock_irqsave(),
> but objected by Hugh.

I imagine Hugh's objection was that we know it's safe to disable/enable
interrupts here because we're in a sleepable context.  But for the
other two locks, we'd rather not track whether we've already disabled
interrupts or not.

Maybe you could dig up the email from Hugh?  I can't find it.
Hugh Dickins Sept. 19, 2020, 1 a.m. UTC | #4
On Sun, 13 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 11, 2020 at 11:37:50AM +0800, Alex Shi wrote:
> > On 2020/9/10 21:49, Matthew Wilcox wrote:
> > > On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> > >> lru_lock and page cache xa_lock have no reason with current sequence,
> > >> put them together isn't necessary. let's narrow the lru locking, but
> > >> left the local_irq_disable to block interrupt re-entry and statistic update.
> > > 
> > > What stats are you talking about here?
> > 
> > Hi Matthew,
> > 
> > Thanks for comments!
> > 
> > like __dec_node_page_state(head, NR_SHMEM_THPS); will have preemptive warning...
> 
> OK, but those stats are guarded by 'if (mapping)', so this patch doesn't
> produce that warning because we'll have taken the xarray lock and disabled
> interrupts.
> 
> > > How about this patch instead?  It occurred to me we already have
> > > perfectly good infrastructure to track whether or not interrupts are
> > > already disabled, and so we should use that instead of ensuring that
> > > interrupts are disabled, or tracking that ourselves.
> > 
> > So your proposal looks like;
> > 1, xa_lock_irq(&mapping->i_pages); (optional)
> > 2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > 3, spin_lock_irqsave(&pgdat->lru_lock, flags);
> > 
> > Is there meaningful for the 2nd and 3rd flags?
> 
> Yes.  We want to avoid doing:
> 
> 	if (mapping)
> 		spin_lock(&ds_queue->split_queue_lock);
> 	else
> 		spin_lock_irq(&ds_queue->split_queue_lock);
> ...
> 	if (mapping)
> 		spin_unlock(&ds_queue->split_queue_lock);
> 	else
> 		spin_unlock_irq(&ds_queue->split_queue_lock);
> 
> Just using _irqsave has the same effect and is easier to reason about.
> 
> > IIRC, I had a similar proposal as your, the flags used in xa_lock_irqsave(),
> > but objected by Hugh.
> 
> I imagine Hugh's objection was that we know it's safe to disable/enable
> interrupts here because we're in a sleepable context.  But for the
> other two locks, we'd rather not track whether we've already disabled
> interrupts or not.
> 
> Maybe you could dig up the email from Hugh?  I can't find it.

I did not find exactly the objection Alex seems to be remembering, but
I have certainly expressed frustration with the lack of a reason for
the THP split lock reordering, and in private mail in June while I was
testing and sending back fixes: "I'd prefer that you never got into this:
it looks like an unrelated and debatable cleanup, and I can see more
such cleanup to make there, that we'd better not get into right now."

I've several times toyed with just leaving this patch out of the series:
but each time ended up, for better or worse, deciding we'd better keep
it in - partly because we've never tested without it, and it cannot be
dropped without making some other change (to stabilize the memcg in
the !list case) - easily doable, but already done by this patch.

Alex asked me to improve his commit message to satisfy my objections,
here's what I sent him last night:

===
lru_lock and page cache xa_lock have no obvious reason to be taken
one way round or the other: until now, lru_lock has been taken before
page cache xa_lock, when splitting a THP; but nothing else takes them
together.  Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins point: split_huge_page_to_list() was already silly, to be
using the _irqsave variant: it's just been taking sleeping locks, so
would already be broken if entered with interrupts enabled.  So we
can save passing flags argument down to __split_huge_page().

Why change the lock ordering here? That was hard to decide. One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock.  That is not the
mechanism used in this series, but it is an option we want to keep open.
===
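
A rough sketch of the alternative nesting kept open by the last paragraph above (this is not what the series does; the lru_lock placement and irq handling here are only illustrative):

	unsigned long flags;

	lock_page_memcg(page);		/* may take memcg->move_lock; stabilizes page memcg */
	spin_lock_irqsave(&pgdat->lru_lock, flags);	/* lru_lock nests inside move_lock */
	/* ... LRU manipulation with a stable page->mem_cgroup ... */
	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
	unlock_page_memcg(page);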

It's still the case that I want to avoid further cleanups and
bikeshedding here for now.  I took an open-minded look at Alex's
patch versus Matthew's patch, and do prefer Alex's: largely because
it's simple and explicit about where the irq disabling and enabling
is done (exactly where it was done before), and doesn't need irqsave
clutter in between.  If this were to be the only local_irq_disable()
in mm I'd NAK it, but that's not so - and as I said before, I don't
take the RT THP case very seriously anyway.

One slight worry in Matthew's version:

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	count = page_count(head);
	mapcount = total_mapcount(head);
	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
		if (!list_empty(page_deferred_list(head))) {
			ds_queue->split_queue_len--;
			list_del(page_deferred_list(head));
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
		if (mapping) {
			if (PageSwapBacked(head))
				__dec_node_page_state(head, NR_SHMEM_THPS);
			else
				__dec_node_page_state(head, NR_FILE_THPS);
		}
		__split_huge_page(page, list, end);

In the Anon case, interrupts are enabled when calling __split_huge_page()
there, but head's refcount is frozen: I'm uneasy about preemption when a
refcount is frozen.  But I'd worry much more if it were the mapping case:
no, that has interrupts safely disabled at that point (as does Anon in
the current kernel, and with Alex's patch).

Hugh

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 247f53def87b..0132d363253e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2397,7 +2397,7 @@  static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2406,8 +2406,6 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2419,6 +2417,11 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2438,6 +2441,8 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2455,8 +2460,7 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head);
 
@@ -2595,12 +2599,10 @@  bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2661,9 +2663,8 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2693,7 +2694,7 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2712,7 +2713,7 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head);
 		ret = -EBUSY;
 	}