
[RFC] migrate_pages: Never block waiting for the page lock

Message ID 20230413182313.RFC.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid (mailing list archive)
State New
Series: [RFC] migrate_pages: Never block waiting for the page lock

Commit Message

Doug Anderson April 14, 2023, 1:23 a.m. UTC
Currently when we try to do page migration and we're in "synchronous"
mode (and not doing direct compaction) then we'll wait an infinite
amount of time for a page lock. This does not appear to be a great
idea.

One issue can be seen when I put a device under extreme memory
pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
swap). I ran the browser along with Android (which runs from a
loopback mounted 128K block-size squashfs "disk"). I then manually ran
the mmm_donut memory pressure tool [1]. The system is completely
unusable both with and without this patch since there are 8 processes
completely thrashing memory, but it was still interesting to look at
how migration was behaving. I put some timing code in and I could see
that we sometimes waited over 25 seconds (in the context of
kcompactd0) for a page lock to become available. Although the 25
seconds was the high mark, it was easy to see tens, hundreds, or
thousands of milliseconds spent waiting on the lock.

Instead of waiting, if I bailed out right away (as this patch does), I
could see kcompactd0 move forward to successfully migrate other
pages instead. This seems like a better use of kcompactd's time.

Thus, even though this didn't make the system any more usable in my
absurd test case, it still seemed to make migration behave better and
that feels like a win. It also makes the code simpler since we have
one fewer special case.

The second issue that this patch attempts to fix is one that I haven't
managed to reproduce yet. We have crash reports from the field that
report that kcompactd0 was blocked for more than ~120 seconds on this
exact lock. These crash reports are on devices running older kernels
(5.10 mostly). In most of these crash reports the device is under
memory pressure and many tasks were blocked in squashfs code, ext4
code, or memory allocation code. While I don't know if unblocking
kcompactd would actually have helped relieve the memory pressure, at
least there was a chance that it could have helped a little bit.

[1] https://chromium.googlesource.com/chromiumos/platform/microbenchmarks/+/refs/heads/main/mmm_donut.py

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 mm/migrate.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)

Comments

Huang, Ying April 14, 2023, 3:09 a.m. UTC | #1
Douglas Anderson <dianders@chromium.org> writes:

> Currently when we try to do page migration and we're in "synchronous"
> mode (and not doing direct compaction) then we'll wait an infinite
> amount of time for a page lock. This does not appear to be a great
> idea.
>
> One issue can be seen when I put a device under extreme memory
> pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
> swap). I ran the browser along with Android (which runs from a
> loopback mounted 128K block-size squashfs "disk"). I then manually ran
> the mmm_donut memory pressure tool [1]. The system is completely
> unusable both with and without this patch since there are 8 processes
> completely thrashing memory, but it was still interesting to look at
> how migration was behaving. I put some timing code in and I could see
> that we sometimes waited over 25 seconds (in the context of
> kcompactd0) for a page lock to become available. Although the 25
> seconds was the high mark, it was easy to see tens, hundreds, or
> thousands of milliseconds spent waiting on the lock.
>
> Instead of waiting, if I bailed out right away (as this patch does), I
> could see kcompactd0 move forward to successfully to migrate other
> pages instead. This seems like a better use of kcompactd's time.
>
> Thus, even though this didn't make the system any more usable in my
> absurd test case, it still seemed to make migration behave better and
> that feels like a win. It also makes the code simpler since we have
> one fewer special case.

TBH, the test case is too extreme for me.

And, we have multiple "sync" mode to deal with latency requirement, for
example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
latency.  If you have latency requirement for some users, you may
consider to add new "sync" mode.
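
For reference, the existing modes are defined in include/linux/migrate_mode.h.
Roughly (paraphrased from mainline around the time of this thread, so check
the tree for the exact definition):

/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant
 * MIGRATE_SYNC will block when migrating pages
 * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy
 *	pages with the CPU (the copy is expected to happen elsewhere, e.g.
 *	via a DMA engine)
 */
enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY,
};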

Best Regards,
Huang, Ying

> The second issue that this patch attempts to fix is one that I haven't
> managed to reproduce yet. We have crash reports from the field that
> report that kcompactd0 was blocked for more than ~120 seconds on this
> exact lock. These crash reports are on devices running older kernels
> (5.10 mostly). In most of these crash reports the device is under
> memory pressure and many tasks were blocked in squashfs code, ext4
> code, or memory allocation code. While I don't know if unblocking
> kcompactd would actually have helped relieve the memory pressure, at
> least there was a chance that it could have helped a little bit.
>
> [1] https://chromium.googlesource.com/chromiumos/platform/microbenchmarks/+/refs/heads/main/mmm_donut.py
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
>
>  mm/migrate.c | 25 +++++++------------------
>  1 file changed, 7 insertions(+), 18 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index db3f154446af..dfb0a6944181 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1143,26 +1143,15 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
>  	dst->private = NULL;
>  
>  	if (!folio_trylock(src)) {
> -		if (mode == MIGRATE_ASYNC)
> -			goto out;
> -
>  		/*
> -		 * It's not safe for direct compaction to call lock_page.
> -		 * For example, during page readahead pages are added locked
> -		 * to the LRU. Later, when the IO completes the pages are
> -		 * marked uptodate and unlocked. However, the queueing
> -		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readahead). If an allocation happens for the
> -		 * second or third page, the process can end up locking
> -		 * the same page twice and deadlocking. Rather than
> -		 * trying to be clever about what pages can be locked,
> -		 * avoid the use of lock_page for direct compaction
> -		 * altogether.
> +		 * While there are some modes we could be running in where we
> +		 * could block here waiting for the lock (specifically
> +		 * modes other than MIGRATE_ASYNC and when we're not in
> +		 * direct compaction), it's not worth the wait. Instead of
> +		 * waiting, we'll bail. This will let the caller try to
> +		 * migrate some other pages that aren't contended.
>  		 */
> -		if (current->flags & PF_MEMALLOC)
> -			goto out;
> -
> -		folio_lock(src);
> +		goto out;
>  	}
>  	locked = true;
Doug Anderson April 14, 2023, 3:25 p.m. UTC | #2
Hi,

On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Douglas Anderson <dianders@chromium.org> writes:
>
> > Currently when we try to do page migration and we're in "synchronous"
> > mode (and not doing direct compaction) then we'll wait an infinite
> > amount of time for a page lock. This does not appear to be a great
> > idea.
> >
> > One issue can be seen when I put a device under extreme memory
> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
> > swap). I ran the browser along with Android (which runs from a
> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
> > the mmm_donut memory pressure tool [1]. The system is completely
> > unusable both with and without this patch since there are 8 processes
> > completely thrashing memory, but it was still interesting to look at
> > how migration was behaving. I put some timing code in and I could see
> > that we sometimes waited over 25 seconds (in the context of
> > kcompactd0) for a page lock to become available. Although the 25
> > seconds was the high mark, it was easy to see tens, hundreds, or
> > thousands of milliseconds spent waiting on the lock.
> >
> > Instead of waiting, if I bailed out right away (as this patch does), I
> > could see kcompactd0 move forward to successfully to migrate other
> > pages instead. This seems like a better use of kcompactd's time.
> >
> > Thus, even though this didn't make the system any more usable in my
> > absurd test case, it still seemed to make migration behave better and
> > that feels like a win. It also makes the code simpler since we have
> > one fewer special case.
>
> TBH, the test case is too extreme for me.

That's fair. That being said, I guess the point I was trying to make
is that waiting for this lock could take an unbounded amount of time.
Other parts of the system sometimes hold a page lock and then do a
blocking operation. At least in the case of kcompactd there are better
uses of its time than waiting for any given page.

> And, we have multiple "sync" mode to deal with latency requirement, for
> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
> latency.  If you have latency requirement for some users, you may
> consider to add new "sync" mode.

Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
guess my first thought would be to avoid adding a new mode and make
MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
wait for all the pages to be migrated can use the heavier sync modes.
It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
want to block for an unbounded amount of time here. What do you think?

-Doug
Huang, Ying April 17, 2023, 1:14 a.m. UTC | #3
Doug Anderson <dianders@chromium.org> writes:

> Hi,
>
> On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Douglas Anderson <dianders@chromium.org> writes:
>>
>> > Currently when we try to do page migration and we're in "synchronous"
>> > mode (and not doing direct compaction) then we'll wait an infinite
>> > amount of time for a page lock. This does not appear to be a great
>> > idea.
>> >
>> > One issue can be seen when I put a device under extreme memory
>> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
>> > swap). I ran the browser along with Android (which runs from a
>> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
>> > the mmm_donut memory pressure tool [1]. The system is completely
>> > unusable both with and without this patch since there are 8 processes
>> > completely thrashing memory, but it was still interesting to look at
>> > how migration was behaving. I put some timing code in and I could see
>> > that we sometimes waited over 25 seconds (in the context of
>> > kcompactd0) for a page lock to become available. Although the 25
>> > seconds was the high mark, it was easy to see tens, hundreds, or
>> > thousands of milliseconds spent waiting on the lock.
>> >
>> > Instead of waiting, if I bailed out right away (as this patch does), I
>> > could see kcompactd0 move forward to successfully to migrate other
>> > pages instead. This seems like a better use of kcompactd's time.
>> >
>> > Thus, even though this didn't make the system any more usable in my
>> > absurd test case, it still seemed to make migration behave better and
>> > that feels like a win. It also makes the code simpler since we have
>> > one fewer special case.
>>
>> TBH, the test case is too extreme for me.
>
> That's fair. That being said, I guess the point I was trying to make
> is that waiting for this lock could take an unbounded amount of time.
> Other parts of the system sometimes hold a page lock and then do a
> blocking operation. At least in the case of kcompactd there are better
> uses of its time than waiting for any given page.
>
>> And, we have multiple "sync" mode to deal with latency requirement, for
>> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
>> latency.  If you have latency requirement for some users, you may
>> consider to add new "sync" mode.
>
> Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
> guess my first thought would be to avoid adding a new mode and make
> MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
> wait for all the pages to be migrated can use the heavier sync modes.
> It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
> want to block for an unbounded amount of time here. What do you think?

It appears that you can just use MIGRATE_ASYNC if you think the correct
behavior is to "NOT block at all".  I found that there are more
fine-grained controls on this in the compaction code; please take a look at
"enum compact_priority" and its comments.

Best Regards,
Huang, Ying
Doug Anderson April 17, 2023, 2:28 p.m. UTC | #4
Hi,

On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Doug Anderson <dianders@chromium.org> writes:
>
> > Hi,
> >
> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Douglas Anderson <dianders@chromium.org> writes:
> >>
> >> > Currently when we try to do page migration and we're in "synchronous"
> >> > mode (and not doing direct compaction) then we'll wait an infinite
> >> > amount of time for a page lock. This does not appear to be a great
> >> > idea.
> >> >
> >> > One issue can be seen when I put a device under extreme memory
> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
> >> > swap). I ran the browser along with Android (which runs from a
> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
> >> > the mmm_donut memory pressure tool [1]. The system is completely
> >> > unusable both with and without this patch since there are 8 processes
> >> > completely thrashing memory, but it was still interesting to look at
> >> > how migration was behaving. I put some timing code in and I could see
> >> > that we sometimes waited over 25 seconds (in the context of
> >> > kcompactd0) for a page lock to become available. Although the 25
> >> > seconds was the high mark, it was easy to see tens, hundreds, or
> >> > thousands of milliseconds spent waiting on the lock.
> >> >
> >> > Instead of waiting, if I bailed out right away (as this patch does), I
> >> > could see kcompactd0 move forward to successfully to migrate other
> >> > pages instead. This seems like a better use of kcompactd's time.
> >> >
> >> > Thus, even though this didn't make the system any more usable in my
> >> > absurd test case, it still seemed to make migration behave better and
> >> > that feels like a win. It also makes the code simpler since we have
> >> > one fewer special case.
> >>
> >> TBH, the test case is too extreme for me.
> >
> > That's fair. That being said, I guess the point I was trying to make
> > is that waiting for this lock could take an unbounded amount of time.
> > Other parts of the system sometimes hold a page lock and then do a
> > blocking operation. At least in the case of kcompactd there are better
> > uses of its time than waiting for any given page.
> >
> >> And, we have multiple "sync" mode to deal with latency requirement, for
> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
> >> latency.  If you have latency requirement for some users, you may
> >> consider to add new "sync" mode.
> >
> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
> > guess my first thought would be to avoid adding a new mode and make
> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
> > wait for all the pages to be migrated can use the heavier sync modes.
> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
> > want to block for an unbounded amount of time here. What do you think?
>
> It appears that you can just use MIGRATE_ASYNC if you think the correct
> behavior is "NOT block at all".  I found that there are more
> fine-grained controls on this in compaction code, please take a look at
> "enum compact_priority" and its comments.

Actually, the more I think about it the more I think the right answer
is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
some blocking but we don't want long / unbounded blocking. Reading the
comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
too much. It's entirely plausible that someone else holding the lock
is doing something as slow as writepage() and thus waiting on the lock
can be just as bad for latency.

I'll try to send out a v2 with this approach today and we can see what
people think.

-Doug
Huang, Ying April 18, 2023, 3:17 a.m. UTC | #5
Doug Anderson <dianders@chromium.org> writes:

> Hi,
>
> On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Doug Anderson <dianders@chromium.org> writes:
>>
>> > Hi,
>> >
>> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Douglas Anderson <dianders@chromium.org> writes:
>> >>
>> >> > Currently when we try to do page migration and we're in "synchronous"
>> >> > mode (and not doing direct compaction) then we'll wait an infinite
>> >> > amount of time for a page lock. This does not appear to be a great
>> >> > idea.
>> >> >
>> >> > One issue can be seen when I put a device under extreme memory
>> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
>> >> > swap). I ran the browser along with Android (which runs from a
>> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
>> >> > the mmm_donut memory pressure tool [1]. The system is completely
>> >> > unusable both with and without this patch since there are 8 processes
>> >> > completely thrashing memory, but it was still interesting to look at
>> >> > how migration was behaving. I put some timing code in and I could see
>> >> > that we sometimes waited over 25 seconds (in the context of
>> >> > kcompactd0) for a page lock to become available. Although the 25
>> >> > seconds was the high mark, it was easy to see tens, hundreds, or
>> >> > thousands of milliseconds spent waiting on the lock.
>> >> >
>> >> > Instead of waiting, if I bailed out right away (as this patch does), I
>> >> > could see kcompactd0 move forward to successfully to migrate other
>> >> > pages instead. This seems like a better use of kcompactd's time.
>> >> >
>> >> > Thus, even though this didn't make the system any more usable in my
>> >> > absurd test case, it still seemed to make migration behave better and
>> >> > that feels like a win. It also makes the code simpler since we have
>> >> > one fewer special case.
>> >>
>> >> TBH, the test case is too extreme for me.
>> >
>> > That's fair. That being said, I guess the point I was trying to make
>> > is that waiting for this lock could take an unbounded amount of time.
>> > Other parts of the system sometimes hold a page lock and then do a
>> > blocking operation. At least in the case of kcompactd there are better
>> > uses of its time than waiting for any given page.
>> >
>> >> And, we have multiple "sync" mode to deal with latency requirement, for
>> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
>> >> latency.  If you have latency requirement for some users, you may
>> >> consider to add new "sync" mode.
>> >
>> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
>> > guess my first thought would be to avoid adding a new mode and make
>> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
>> > wait for all the pages to be migrated can use the heavier sync modes.
>> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
>> > want to block for an unbounded amount of time here. What do you think?
>>
>> It appears that you can just use MIGRATE_ASYNC if you think the correct
>> behavior is "NOT block at all".  I found that there are more
>> fine-grained controls on this in compaction code, please take a look at
>> "enum compact_priority" and its comments.
>
> Actually, the more I think about it the more I think the right answer
> is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> MIGRATE_SYNC_LIGHT not block on the folio lock.

Then, what is the difference between MIGRATE_SYNC_LIGHT and
MIGRATE_ASYNC?

> kcompactd can accept some blocking but we don't want long / unbounded
> blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems
> like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time
> of writepage() is too much. It's entirely plausible that someone else
> holding the lock is doing something as slow as writepage() and thus
> waiting on the lock can be just as bad for latency.

IIUC, during writepage(), the page/folio will be unlocked.

But, during page reading, the page/folio will be locked.  I don't really
understand why we can wait for page reading but cannot wait for page
writeback.

Best Regards,
Huang, Ying

> I'll try to send out a v2 with this approach today and we can see what
> people think.
>
> -Doug
Vlastimil Babka April 18, 2023, 9:17 a.m. UTC | #6
On 4/17/23 16:28, Doug Anderson wrote:
> Hi,
> 
> On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Doug Anderson <dianders@chromium.org> writes:
>>
>> > Hi,
>> >
>> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Douglas Anderson <dianders@chromium.org> writes:
>> >>
>> >> > Currently when we try to do page migration and we're in "synchronous"
>> >> > mode (and not doing direct compaction) then we'll wait an infinite
>> >> > amount of time for a page lock. This does not appear to be a great
>> >> > idea.
>> >> >
>> >> > One issue can be seen when I put a device under extreme memory
>> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
>> >> > swap). I ran the browser along with Android (which runs from a
>> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
>> >> > the mmm_donut memory pressure tool [1]. The system is completely
>> >> > unusable both with and without this patch since there are 8 processes
>> >> > completely thrashing memory, but it was still interesting to look at
>> >> > how migration was behaving. I put some timing code in and I could see
>> >> > that we sometimes waited over 25 seconds (in the context of
>> >> > kcompactd0) for a page lock to become available. Although the 25
>> >> > seconds was the high mark, it was easy to see tens, hundreds, or
>> >> > thousands of milliseconds spent waiting on the lock.
>> >> >
>> >> > Instead of waiting, if I bailed out right away (as this patch does), I
>> >> > could see kcompactd0 move forward to successfully to migrate other
>> >> > pages instead. This seems like a better use of kcompactd's time.
>> >> >
>> >> > Thus, even though this didn't make the system any more usable in my
>> >> > absurd test case, it still seemed to make migration behave better and
>> >> > that feels like a win. It also makes the code simpler since we have
>> >> > one fewer special case.
>> >>
>> >> TBH, the test case is too extreme for me.
>> >
>> > That's fair. That being said, I guess the point I was trying to make
>> > is that waiting for this lock could take an unbounded amount of time.
>> > Other parts of the system sometimes hold a page lock and then do a
>> > blocking operation. At least in the case of kcompactd there are better
>> > uses of its time than waiting for any given page.
>> >
>> >> And, we have multiple "sync" mode to deal with latency requirement, for
>> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
>> >> latency.  If you have latency requirement for some users, you may
>> >> consider to add new "sync" mode.
>> >
>> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
>> > guess my first thought would be to avoid adding a new mode and make
>> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
>> > wait for all the pages to be migrated can use the heavier sync modes.
>> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
>> > want to block for an unbounded amount of time here. What do you think?
>>
>> It appears that you can just use MIGRATE_ASYNC if you think the correct
>> behavior is "NOT block at all".  I found that there are more
>> fine-grained controls on this in compaction code, please take a look at
>> "enum compact_priority" and its comments.
> 
> Actually, the more I think about it the more I think the right answer
> is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
> some blocking but we don't want long / unbounded blocking. Reading the
> comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
> well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
> too much. It's entirely plausible that someone else holding the lock
> is doing something as slow as writepage() and thus waiting on the lock
> can be just as bad for latency.

+Cc Mel for potential insights. Sounds like a good compromise at first
glance, but it's a tricky area.
Also there are callers of migration other than compaction, and we should
make sure we are not breaking them unexpectedly.

> I'll try to send out a v2 with this approach today and we can see what
> people think.

Please Cc Mel and myself for further versions.

> -Doug
>
Doug Anderson April 18, 2023, 4:19 p.m. UTC | #7
Hi,

On Mon, Apr 17, 2023 at 8:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Doug Anderson <dianders@chromium.org> writes:
>
> > Hi,
> >
> > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Doug Anderson <dianders@chromium.org> writes:
> >>
> >> > Hi,
> >> >
> >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Douglas Anderson <dianders@chromium.org> writes:
> >> >>
> >> >> > Currently when we try to do page migration and we're in "synchronous"
> >> >> > mode (and not doing direct compaction) then we'll wait an infinite
> >> >> > amount of time for a page lock. This does not appear to be a great
> >> >> > idea.
> >> >> >
> >> >> > One issue can be seen when I put a device under extreme memory
> >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
> >> >> > swap). I ran the browser along with Android (which runs from a
> >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
> >> >> > the mmm_donut memory pressure tool [1]. The system is completely
> >> >> > unusable both with and without this patch since there are 8 processes
> >> >> > completely thrashing memory, but it was still interesting to look at
> >> >> > how migration was behaving. I put some timing code in and I could see
> >> >> > that we sometimes waited over 25 seconds (in the context of
> >> >> > kcompactd0) for a page lock to become available. Although the 25
> >> >> > seconds was the high mark, it was easy to see tens, hundreds, or
> >> >> > thousands of milliseconds spent waiting on the lock.
> >> >> >
> >> >> > Instead of waiting, if I bailed out right away (as this patch does), I
> >> >> > could see kcompactd0 move forward to successfully to migrate other
> >> >> > pages instead. This seems like a better use of kcompactd's time.
> >> >> >
> >> >> > Thus, even though this didn't make the system any more usable in my
> >> >> > absurd test case, it still seemed to make migration behave better and
> >> >> > that feels like a win. It also makes the code simpler since we have
> >> >> > one fewer special case.
> >> >>
> >> >> TBH, the test case is too extreme for me.
> >> >
> >> > That's fair. That being said, I guess the point I was trying to make
> >> > is that waiting for this lock could take an unbounded amount of time.
> >> > Other parts of the system sometimes hold a page lock and then do a
> >> > blocking operation. At least in the case of kcompactd there are better
> >> > uses of its time than waiting for any given page.
> >> >
> >> >> And, we have multiple "sync" mode to deal with latency requirement, for
> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
> >> >> latency.  If you have latency requirement for some users, you may
> >> >> consider to add new "sync" mode.
> >> >
> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
> >> > guess my first thought would be to avoid adding a new mode and make
> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
> >> > wait for all the pages to be migrated can use the heavier sync modes.
> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
> >> > want to block for an unbounded amount of time here. What do you think?
> >>
> >> It appears that you can just use MIGRATE_ASYNC if you think the correct
> >> behavior is "NOT block at all".  I found that there are more
> >> fine-grained controls on this in compaction code, please take a look at
> >> "enum compact_priority" and its comments.
> >
> > Actually, the more I think about it the more I think the right answer
> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> > MIGRATE_SYNC_LIGHT not block on the folio lock.
>
> Then, what is the difference between MIGRATE_SYNC_LIGHT and
> MIGRATE_ASYNC?

Aren't there still some differences even if we remove blocking this
one lock? ...or maybe your point is that maybe the other differences
have similar properties?

OK, so let's think about just using MIGRATE_ASYNC and either leaving
MIGRATE_SYNC_LIGHT alone or deleting it (if there are no users left).
The nice thing is that the only users of MIGRATE_SYNC_LIGHT are in
"compaction.c" and there are only 3 places where it's specified.

1. kcompactd_do_work() - This is what I was analyzing and where I
argued that indefinite blocking is less useful than simply trying to
compact a different page. So sure, moving this to MIGRATE_ASYNC seems
like it would be OK?

2. proactive_compact_node() - Just like kcompactd_do_work(), this is
called from kcompactd and thus probably should have the same mode.

3. compact_zone_order() - This explicitly chooses between
MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC, so I guess I'd keep
MIGRATE_SYNC_LIGHT just for this use case. It looks as if
compact_zone_order() is called for direct compaction and thus making
it synchronous can make sense, especially since it seems to go to the
synchronous case after it failed with the async case. Ironically,
though, the exact lock I was proposing to not wait on _isn't_ ever
waited on in direct reclaim (see the comment in migrate_folio_unmap()
about deadlock), but the other differences between SYNC_LIGHT and
ASYNC come into play.

If the above sounds correct then I'm OK w/ moving #1 and #2 to
MIGRATE_ASYNC and leaving #3 as the sole user of MIGRATE_SYNC_LIGHT.

> > kcompactd can accept some blocking but we don't want long / unbounded
> > blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems
> > like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time
> > of writepage() is too much. It's entirely plausible that someone else
> > holding the lock is doing something as slow as writepage() and thus
> > waiting on the lock can be just as bad for latency.
>
> IIUC, during writepage(), the page/folio will be unlocked.
>
> But, during page reading, the page/folio will be locked.  I don't really
> understand why we can wait for page reading but cannot wait for page
> writeback.

I'm not sure I totally got your point here. It sorta sounds as if
you're making the same point that I was? IIUC by waiting on the lock
we may be implicitly waiting for someone to finish reading which seems
as bad as waiting for writing. That was why I was arguing that with
MIGRATE_SYNC_LIGHT (which says that waiting for the write was too
slow) that we shouldn't wait for the lock (which may be blocking on a
read).

-Doug
Huang, Ying April 19, 2023, 12:33 a.m. UTC | #8
Doug Anderson <dianders@chromium.org> writes:

> Hi,
>
> On Mon, Apr 17, 2023 at 8:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Doug Anderson <dianders@chromium.org> writes:
>>
>> > Hi,
>> >
>> > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Doug Anderson <dianders@chromium.org> writes:
>> >>
>> >> > Hi,
>> >> >
>> >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Douglas Anderson <dianders@chromium.org> writes:
>> >> >>
>> >> >> > Currently when we try to do page migration and we're in "synchronous"
>> >> >> > mode (and not doing direct compaction) then we'll wait an infinite
>> >> >> > amount of time for a page lock. This does not appear to be a great
>> >> >> > idea.
>> >> >> >
>> >> >> > One issue can be seen when I put a device under extreme memory
>> >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
>> >> >> > swap). I ran the browser along with Android (which runs from a
>> >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran
>> >> >> > the mmm_donut memory pressure tool [1]. The system is completely
>> >> >> > unusable both with and without this patch since there are 8 processes
>> >> >> > completely thrashing memory, but it was still interesting to look at
>> >> >> > how migration was behaving. I put some timing code in and I could see
>> >> >> > that we sometimes waited over 25 seconds (in the context of
>> >> >> > kcompactd0) for a page lock to become available. Although the 25
>> >> >> > seconds was the high mark, it was easy to see tens, hundreds, or
>> >> >> > thousands of milliseconds spent waiting on the lock.
>> >> >> >
>> >> >> > Instead of waiting, if I bailed out right away (as this patch does), I
>> >> >> > could see kcompactd0 move forward to successfully to migrate other
>> >> >> > pages instead. This seems like a better use of kcompactd's time.
>> >> >> >
>> >> >> > Thus, even though this didn't make the system any more usable in my
>> >> >> > absurd test case, it still seemed to make migration behave better and
>> >> >> > that feels like a win. It also makes the code simpler since we have
>> >> >> > one fewer special case.
>> >> >>
>> >> >> TBH, the test case is too extreme for me.
>> >> >
>> >> > That's fair. That being said, I guess the point I was trying to make
>> >> > is that waiting for this lock could take an unbounded amount of time.
>> >> > Other parts of the system sometimes hold a page lock and then do a
>> >> > blocking operation. At least in the case of kcompactd there are better
>> >> > uses of its time than waiting for any given page.
>> >> >
>> >> >> And, we have multiple "sync" mode to deal with latency requirement, for
>> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
>> >> >> latency.  If you have latency requirement for some users, you may
>> >> >> consider to add new "sync" mode.
>> >> >
>> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
>> >> > guess my first thought would be to avoid adding a new mode and make
>> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
>> >> > wait for all the pages to be migrated can use the heavier sync modes.
>> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
>> >> > want to block for an unbounded amount of time here. What do you think?
>> >>
>> >> It appears that you can just use MIGRATE_ASYNC if you think the correct
>> >> behavior is "NOT block at all".  I found that there are more
>> >> fine-grained controls on this in compaction code, please take a look at
>> >> "enum compact_priority" and its comments.
>> >
>> > Actually, the more I think about it the more I think the right answer
>> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
>> > MIGRATE_SYNC_LIGHT not block on the folio lock.
>>
>> Then, what is the difference between MIGRATE_SYNC_LIGHT and
>> MIGRATE_ASYNC?
>
> Aren't there still some differences even if we remove blocking this
> one lock? ...or maybe your point is that maybe the other differences
> have similar properties?

Sorry for the confusing words.  Here, I was asking you to list the
implementation differences between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT
after your proposed changes.  Which operations are waited on in
MIGRATE_SYNC_LIGHT but not in MIGRATE_ASYNC?

> OK, so let's think about just using MIGRATE_ASYNC and either leaving
> MIGRATE_SYNC_LIGHT alone or deleting it (if there are no users left).
> The nice thing is that the only users of MIGRATE_SYNC_LIGHT are in
> "compaction.c" and there are only 3 places where it's specified.
>
> 1. kcompactd_do_work() - This is what I was analyzing and where I
> argued that indefinite blocking is less useful than simply trying to
> compact a different page. So sure, moving this to MIGRATE_ASYNC seems
> like it would be OK?
>
> 2. proactive_compact_node() - Just like kcompactd_do_work(), this is
> called from kcompactd and thus probably should have the same mode.
>
> 3. compact_zone_order() - This explicitly chooses between
> MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC, so I guess I'd keep
> MIGRATE_SYNC_LIGHT just for this use case. It looks as if
> compact_zone_order() is called for direct compaction and thus making
> it synchronous can make sense, especially since it seems to go to the
> synchronous case after it failed with the async case. Ironically,
> though, the exact lock I was proposing to not wait on _isn't_ ever
> waited on in direct reclaim (see the comment in migrate_folio_unmap()
> about deadlock), but the other differences between SYNC_LIGHT and
> ASYNC come into play.
>
> If the above sounds correct then I'm OK w/ moving #1 and #2 to
> MIGRATE_ASYNC and leaving #3 as the sole user or MIGRATE_SYNC_LIGHT.
>
>> > kcompactd can accept some blocking but we don't want long / unbounded
>> > blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems
>> > like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time
>> > of writepage() is too much. It's entirely plausible that someone else
>> > holding the lock is doing something as slow as writepage() and thus
>> > waiting on the lock can be just as bad for latency.
>>
>> IIUC, during writepage(), the page/folio will be unlocked.
>>
>> But, during page reading, the page/folio will be locked.  I don't really
>> understand why we can wait for page reading but cannot wait for page
>> writeback.
>
> I'm not sure I totally got your point here. It sorta sounds as if
> you're making the same point that I was?

Yes, kind of.  It is a question, not a conclusion.

> IIUC by waiting on the lock
> we may be implicitly waiting for someone to finish reading which seems
> as bad as waiting for writing. That was why I was arguing that with
> MIGRATE_SYNC_LIGHT (which says that waiting for the write was too
> slow) that we shouldn't wait for the lock (which may be blocking on a
> read).

Best Regards,
Huang, Ying
Doug Anderson April 19, 2023, 7:30 p.m. UTC | #9
Hi,

On Tue, Apr 18, 2023 at 5:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> >> >> >> TBH, the test case is too extreme for me.
> >> >> >
> >> >> > That's fair. That being said, I guess the point I was trying to make
> >> >> > is that waiting for this lock could take an unbounded amount of time.
> >> >> > Other parts of the system sometimes hold a page lock and then do a
> >> >> > blocking operation. At least in the case of kcompactd there are better
> >> >> > uses of its time than waiting for any given page.
> >> >> >
> >> >> >> And, we have multiple "sync" mode to deal with latency requirement, for
> >> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
> >> >> >> latency.  If you have latency requirement for some users, you may
> >> >> >> consider to add new "sync" mode.
> >> >> >
> >> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
> >> >> > guess my first thought would be to avoid adding a new mode and make
> >> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
> >> >> > wait for all the pages to be migrated can use the heavier sync modes.
> >> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
> >> >> > want to block for an unbounded amount of time here. What do you think?
> >> >>
> >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct
> >> >> behavior is "NOT block at all".  I found that there are more
> >> >> fine-grained controls on this in compaction code, please take a look at
> >> >> "enum compact_priority" and its comments.
> >> >
> >> > Actually, the more I think about it the more I think the right answer
> >> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> >> > MIGRATE_SYNC_LIGHT not block on the folio lock.
> >>
> >> Then, what is the difference between MIGRATE_SYNC_LIGHT and
> >> MIGRATE_ASYNC?
> >
> > Aren't there still some differences even if we remove blocking this
> > one lock? ...or maybe your point is that maybe the other differences
> > have similar properties?
>
> Sorry for confusing words.  Here, I asked you to list the implementation
> difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT after your
> proposed changes.  Which are waited in MIGRATE_SYNC_LIGHT but not in
> MIGRATE_ASYNC?

Ah, got it! It's not always the easiest to follow all the code paths,
but let's see what I can find.

I guess to start with, though, I will assert that someone seems to
have believed that there was an important difference between
MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT besides waiting on the lock in
migrate_folio_unmap() since (as I found in my previous digging) the
"direct reclaim" path never grabs this lock but explicitly chooses
MIGRATE_ASYNC some times and MIGRATE_SYNC_LIGHT other times.

OK, so here's a look at mainline Linux comparing differences in behavior
between SYNC_LIGHT and ASYNC, with thoughts about which one should be
used for kcompactd. Note that I won't go _too_ deep into all the
differences...

--

In nfs.c:

1. We will wait for the fscache if SYNC_LIGHT but not ASYNC. No idea
what would be the most ideal for calls from kcompactd.

In compaction.c:

2. We will update the non-async "compact_cached_migrate_pfn" for
SYNC_LIGHT but not ASYNC since we keep track of sync and async
progress separately.

3. compact_lock_irqsave() notes contention for ASYNC but not
SYNC_LIGHT and causes an earlier abort. Seems like kcompactd would want
the SYNC_LIGHT behavior since this isn't about things indefinitely
blocking.

4. isolate_migratepages_block() will bail if too_many_isolated() for
ASYNC but not SYNC_LIGHT. My hunch is that kcompactd wants the
SYNC_LIGHT behavior here.

5. If in direct compaction, isolate_migratepages_block() sets
"skip_on_failure" for ASYNC but not SYNC_LIGHT. My hunch is that
kcompactd wants the SYNC_LIGHT behavior here.

6. suitable_migration_source() considers more things suitable
migration sources when SYNC_LIGHT but not (ASYNC+direct_compaction).
Doesn't matter since kcompactd isn't direct compaction and non-direct
compaction is the same.

7. fast_isolate_around() does less scanning when SYNC_LIGHT but not
(ASYNC+direct_compaction). Again, it doesn't matter for kcompactd.

8. isolate_freepages() uses a different stride with SYNC_LIGHT vs.
ASYNC. My hunch is that kcompactd wants the SYNC_LIGHT behavior here.

In migrate.c:

9. buffer_migrate_lock_buffers() will block waiting to lock buffers
with SYNC_LIGHT but not ASYNC. I don't know for sure, but this feels
like something we _wouldn't_ want to block on in kcompactd and instead
should look for easier pickings.

10. migrate_folio_unmap() has the case we've already talked about.

11. migrate_folio_unmap() does batch flushes for async because it
doesn't need to worry about a class of deadlock. Somewhat recent code
actually ends up running this code first even for sync modes to get
the batch.

12. We'll retry a few more times for SYNC_LIGHT than ASYNC. Seems like
the more retries won't really hurt for kcompactd.

--

So from looking at all the above, I'll say that kcompactd should stick
with SYNC_LIGHT and we should fix #10. In other words, like my
original patch except that we keep blocking on the lock in the full
SYNC modes.
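
A minimal sketch of what that could look like in migrate_folio_unmap(),
assuming the trylock fallback keeps its current structure (the actual v2
patch may well differ):

if (!folio_trylock(src)) {
	/*
	 * Sketch only: bail for MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT so the
	 * caller can move on to uncontended folios; only the full
	 * MIGRATE_SYNC modes keep waiting for the lock.
	 */
	if (mode == MIGRATE_ASYNC || mode == MIGRATE_SYNC_LIGHT)
		goto out;

	/* Direct compaction must never sleep on the folio lock. */
	if (current->flags & PF_MEMALLOC)
		goto out;

	folio_lock(src);
}
locked = true;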

It's possible that we should also change case #9 I listed above. Do
you know if locking buffers is likely to block on something as slow as
page reading/writing?

-Doug
Huang, Ying April 20, 2023, 1:39 a.m. UTC | #10
Doug Anderson <dianders@chromium.org> writes:

> Hi,
>
> On Tue, Apr 18, 2023 at 5:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> >> >> >> TBH, the test case is too extreme for me.
>> >> >> >
>> >> >> > That's fair. That being said, I guess the point I was trying to make
>> >> >> > is that waiting for this lock could take an unbounded amount of time.
>> >> >> > Other parts of the system sometimes hold a page lock and then do a
>> >> >> > blocking operation. At least in the case of kcompactd there are better
>> >> >> > uses of its time than waiting for any given page.
>> >> >> >
>> >> >> >> And, we have multiple "sync" mode to deal with latency requirement, for
>> >> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long
>> >> >> >> latency.  If you have latency requirement for some users, you may
>> >> >> >> consider to add new "sync" mode.
>> >> >> >
>> >> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I
>> >> >> > guess my first thought would be to avoid adding a new mode and make
>> >> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to
>> >> >> > wait for all the pages to be migrated can use the heavier sync modes.
>> >> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not
>> >> >> > want to block for an unbounded amount of time here. What do you think?
>> >> >>
>> >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct
>> >> >> behavior is "NOT block at all".  I found that there are more
>> >> >> fine-grained controls on this in compaction code, please take a look at
>> >> >> "enum compact_priority" and its comments.
>> >> >
>> >> > Actually, the more I think about it the more I think the right answer
>> >> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
>> >> > MIGRATE_SYNC_LIGHT not block on the folio lock.
>> >>
>> >> Then, what is the difference between MIGRATE_SYNC_LIGHT and
>> >> MIGRATE_ASYNC?
>> >
>> > Aren't there still some differences even if we remove blocking this
>> > one lock? ...or maybe your point is that maybe the other differences
>> > have similar properties?
>>
>> Sorry for confusing words.  Here, I asked you to list the implementation
>> difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT after your
>> proposed changes.  Which are waited in MIGRATE_SYNC_LIGHT but not in
>> MIGRATE_ASYNC?
>
> Ah, got it! It's not always the easiest to follow all the code paths,
> but let's see what I can find.
>
> I guess to start with, though, I will assert that someone seems to
> have believed that there was an important difference between
> MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT besides waiting on the lock in
> migrate_folio_unmapt() since (as I found in my previous digging) the
> "direct reclaim" path never grabs this lock but explicitly sometimes
> chooses MIGRATE_ASYNC some times and MIGRATE_SYNC_LIGHT other times.
>
> OK, so looking at mainline Linux and comparing differences in behavior
> between SYNC_LIGHT and ASYNC and thoughts about which one should be
> used for kcompactd. Note that I won't go _too_ deep into all the
> differences...
>
> --
>
> In nfs.c:
>
> 1. We will wait for the fscache if SYNC_LIGHT but not ASYNC. No idea
> what would be the most ideal for calls from kcompactd.

This appears to be something like disk writing.

> In compaction.c:
>
> 2. We will update the non-async "compact_cached_migrate_pfn" for
> SYNC_LIGHT but not ASYNC since we keep track of sync and async
> progress separately.
>
> 3. compact_lock_irqsave() note contentions for ASYNC but not
> SYNC_LIGHT and cause an earlier abort. Seems like kcompactd would want
> the SYNC_LIGHT behavior since this isn't about things indefinitely
> blocking.
>
> 4. isolate_migratepages_block() will bail if too_many_isolated() for
> ASYNC but not SYNC_LIGHT. My hunch is that kcompactd wants the
> SYNC_LIGHT behavior for kcompact.
>
> 5. If in direct compaction, isolate_migratepages_block() sets
> "skip_on_failure" for ASYNC but not SYNC_LIGHT. My hunch is that
> kcompactd wants the SYNC_LIGHT behavior for kcompact.
>
> 6. suitable_migration_source() considers more things suitable
> migration sources when SYNC_LIGHT but not (ASYNC+direct_compaction).
> Doesn't matter since kcompactd isn't direct compaction and non-direct
> compaction is the same.
>
> 7. fast_isolate_around() does less scanning when SYNC_LIGHT but not
> (ASYNC+direct_compaction). Again, it doesn't matter for kcompactd.
>
> 8. isolate_freepages() uses a different stride with SYNC_LIGHT vs.
> ASYNC. My hunch is that kcompactd wants the SYNC_LIGHT behavior for
> kcompact.
>
> In migrate.c:
>
> 9. buffer_migrate_lock_buffers() will block waiting to lock buffers
> with SYNC_LIGHT but not ASYNC. I don't know for sure, but this feels
> like something we _wouldn't_ want to block on in kcompactd and instead
> should look for easier pickings.

IIUC, this is similar to the page lock.

> 10. migrate_folio_unmap() has the case we've already talked about
>
> 11. migrate_folio_unmap() does batch flushes for async because it
> doesn't need to worry about a class of deadlock. Somewhat recent code
> actually ends up running this code first even for sync modes to get
> the batch.
>
> 12. We'll retry a few more times for SYNC_LIGHT than ASYNC. Seems like
> the more retries won't really hurt for kcompactd.
>
> --
>
> So from looking at all the above, I'll say that kcompactd should stick
> with SYNC_LIGHT and we should fix #10. In other words, like my
> original patch except that we keep blocking on the lock in the full
> SYNC modes.
>
> It's possible that we should also change case #9 I listed above. Do
> you know if locking buffers is likely to block on something as slow as
> page reading/writing?

IIUC, this is related to page reading/writing.  Buffer heads are used by
ext2/4 to read/write.

Thank you very much for your research.  It looks like ASYNC isn't
appropriate for kcompactd.

From the comments of SYNC_LIGHT,

 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant

To make SYNC_LIGHT block on fewer operations than before, I guess that
you need to prove that the stall time for the operation can be long with
not-so-extreme test cases.

Best Regards,
Huang, Ying
Mel Gorman April 20, 2023, 10:23 a.m. UTC | #11
On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote:
> > Actually, the more I think about it the more I think the right answer
> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> > MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
> > some blocking but we don't want long / unbounded blocking. Reading the
> > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
> > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
> > too much. It's entirely plausible that someone else holding the lock
> > is doing something as slow as writepage() and thus waiting on the lock
> > can be just as bad for latency.
> 
> +Cc Mel for potential insights. Sounds like a good compromise at first
> glance, but it's a tricky area.
> Also there are other callers of migration than compaction, and we should
> make sure we are not breaking them unexpectedly.
> 

It's tricky because part of the point of SYNC_LIGHT was to block on
some operations but not for too long. writepage was considered to be an
exception because it can be very slow for a variety of reasons. I think
at the time, writeback from reclaim context was possible and it was
very inefficient, much more inefficient than reads.  Storage can also be
very slow (e.g. USB stick plugged in) and there can be large differences
between read/write performance (SMR, some ssd etc). Pages read were generally
clean and could be migrated, pages for write may be redirtied etc. It was
assumed that while both read/write could lock the page for a long time,
write had worse hold times and most other users of lock page were transient.

A compromise for SYNC_LIGHT or even SYNC on lock page would be to try
locking with a timeout. I don't think there is a specific helper but it
should be possible to trylock, wait on the folio_waitqueue and attempt
once to get the lock again. I didn't look very closely but it would
doing something similar to folio_wait_bit_common() with
io_schedule_timeout instead of io_schedule. This will have false
positives because the folio waitqueue may be woken for unrelated pages
and obviously it can race with other wait queues.
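
A very rough sketch of such a helper, purely to illustrate the shape of it
(folio_wait_bit_timeout() is hypothetical here; none of this is existing
mainline API):

static bool folio_trylock_timeout(struct folio *folio, long timeout)
{
	if (folio_trylock(folio))
		return true;

	/*
	 * Hypothetical: sleep on the folio wait queue until PG_locked is
	 * cleared or the timeout expires, i.e. roughly
	 * folio_wait_bit_common() with io_schedule_timeout() instead of
	 * io_schedule(). May wake early due to unrelated wakeups on the
	 * shared wait queue.
	 */
	folio_wait_bit_timeout(folio, PG_locked, timeout);

	/* One more attempt; the lock may have been re-taken meanwhile. */
	return folio_trylock(folio);
}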

kcompactd is an out-of-line wait and can afford to wait for a long time
without user-visible impact but 120 seconds or any potentially unbounded
length of time is ridiculous and unhelpful. I would still be wary about
adding new sync modes or making major modifications to kcompactd because
detecting application stalls due to a kcompactd modification is difficult.

There is another approach -- kcompactd or proactive compaction under heavy
memory pressure is probably a waste of CPU time and resources and should
avoid or minimise effort when under pressure. While direct compaction
can capture a page for immediate use, kcompactd and proactive reclaim
are just shuffling memory around for *potential* users and may be making
the overall memory pressure even worse. If memory pressure detection was
better and proactive/kcompactd reclaim bailed then the unbounded time to
lock a page is mitigated or completely avoided.

The two options are orthogonal. trylock_page_timeout() would mitigate
worst case behaviour while memory pressure detection and bailing on SYNC*
compaction side-steps the issue in extreme cases. Both have value as
pressure detection would be best-effort and trylock_page_timeout() would
limit worst-case behaviour.

> > I'll try to send out a v2 with this approach today and we can see what
> > people think.
> 
> Please Cc Mel and myself for further versions.
> 

Yes please.
Doug Anderson April 21, 2023, 12:36 a.m. UTC | #12
Hi,

On Thu, Apr 20, 2023 at 3:23 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote:
> > > Actually, the more I think about it the more I think the right answer
> > > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> > > MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
> > > some blocking but we don't want long / unbounded blocking. Reading the
> > > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
> > > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
> > > too much. It's entirely plausible that someone else holding the lock
> > > is doing something as slow as writepage() and thus waiting on the lock
> > > can be just as bad for latency.
> >
> > +Cc Mel for potential insights. Sounds like a good compromise at first
> > glance, but it's a tricky area.
> > Also there are other callers of migration than compaction, and we should
> > make sure we are not breaking them unexpectedly.
> >
>
> It's tricky because part of the point of SYNC_LIGHT was to block on
> some operations but not for too long. writepage was considered to be an
> exception because it can be very slow for a variety of reasons. I think
> At the time that writeback from reclaim context was possible and it was
> very inefficient, more more inefficient than reads.  Storage can also be
> very slow (e.g. USB stick plugged in) and there can be large differences
> between read/write performance (SMR, some ssd etc). Pages read were generally
> clean and could be migrated, pages for write may be redirtied etc. It was
> assumed that while both read/write could lock the page for a long time,
> write had worse hold times and most other users of lock page were transient.

I think some of the slowness comes from the complex ways that systems
like ChromeOS currently work. As mentioned in the commit
message of my RFC patch, ChromeOS currently runs Android programs out
of a 128K-block, zlib-compressed squashfs disk. That squashfs disk is
actually a loopback mounted file on the main ext4 filesystem which is
stored on something like eMMC.

If I understand everything correctly, if we get a page fault on
memory backed by this squashfs filesystem, then we can end up holding
a page/folio lock and then trying to read a pile of pages (enough to
decompress the whole 128K block). ...but we don't read them directly,
we instead block on ext4 which might need to allocate memory and then
finally blocks on the block driver completing the task. This whole
sequence of things is not necessarily fast. I think this is
responsible for some of the very large numbers that were part of my
original patch description.

Without the above squashfs setup, we can still run into slow times but
not quite as bad. I tried running a ChromeOS "memory pressure" test
against a mainline kernel plus _just_ the browser (Android disabled).
The test eventually opened over 90 tabs on my 4GB system and the
system got pretty janky, but still barely usable. While running the
test, I saw dozens of cases of folio_lock() taking over 10 ms and
quite a few (~10?) of it taking over 100 ms. The peak I saw was
~380ms. I also happened to time buffer locking. That was much less bad
with the worst case topping out at ~70ms. I'm not sure what timeout
you'd consider to be bad. 10 ms? 20 ms?
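
For reference, the ad-hoc instrumentation behind numbers like these could
look roughly like the following, wrapped around the blocking folio_lock(src)
call in migrate_folio_unmap() (not part of any patch; the 10 ms threshold
and message are made up for illustration):

u64 start_ns, wait_ns;

start_ns = ktime_get_ns();
folio_lock(src);
wait_ns = ktime_get_ns() - start_ns;

/* Arbitrary reporting threshold, purely for illustration. */
if (wait_ns > 10 * NSEC_PER_MSEC)
	pr_warn("%s: waited %llu ms for folio lock\n",
		current->comm, div_u64(wait_ns, NSEC_PER_MSEC));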

Also as a side note: I ran the same memory pressure test but _with_ Android
running (though it doesn't actually stress Android, it's just running
in the background). To do this I had to run a downstream kernel. Here
it was easy to see a ~1.7 ms wait on the page lock without any
ridiculous amount of stressing. ...and a ~1.5 second wait for the
buffer lock, too.


> A compromise for SYNC_LIGHT or even SYNC on lock page would be to try
> locking with a timeout. I don't think there is a specific helper but it
> should be possible to trylock, wait on the folio_waitqueue and attempt
> once to get the lock again. I didn't look very closely but it would
> doing something similar to folio_wait_bit_common() with
> io_schedule_timeout instead of io_schedule. This will have false
> positives because the folio waitqueue may be woken for unrelated pages
> and obviously it can race with other wait queues.
>
> kcompactd is an out-of-line wait and can afford to wait for a long time
> without user-visible impact but 120 seconds or any potentially unbounded
> length of time is ridiculous and unhelpful. I would still be wary about
> adding new sync modes or making major modifications to kcompactd because
> detecting application stalls due to a kcompactd modification is difficult.

OK, I'll give this a shot. It doesn't look too hard, but we'll see.


> There is another approach -- kcompactd or proactive under heavy memory
> pressure is probably a waste of CPU time and resources and should
> avoid or minimise effort when under pressure. While direct compaction
> can capture a page for immediate use, kcompactd and proactive reclaim
> are just shuffling memory around for *potential* users and may be making
> the overall memory pressure even worse. If memory pressure detection was
> better and proactive/kcompactd reclaim bailed then the unbounded time to
> lock a page is mitigated or completely avoided.

I probably won't try to take this on, though it does sound like a good
idea for future research.

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af..dfb0a6944181 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1143,26 +1143,15 @@  static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 	dst->private = NULL;
 
 	if (!folio_trylock(src)) {
-		if (mode == MIGRATE_ASYNC)
-			goto out;
-
 		/*
-		 * It's not safe for direct compaction to call lock_page.
-		 * For example, during page readahead pages are added locked
-		 * to the LRU. Later, when the IO completes the pages are
-		 * marked uptodate and unlocked. However, the queueing
-		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readahead). If an allocation happens for the
-		 * second or third page, the process can end up locking
-		 * the same page twice and deadlocking. Rather than
-		 * trying to be clever about what pages can be locked,
-		 * avoid the use of lock_page for direct compaction
-		 * altogether.
+		 * While there are some modes we could be running in where we
+		 * could block here waiting for the lock (specifically
+		 * modes other than MIGRATE_ASYNC and when we're not in
+		 * direct compaction), it's not worth the wait. Instead of
+		 * waiting, we'll bail. This will let the caller try to
+		 * migrate some other pages that aren't contended.
 		 */
-		if (current->flags & PF_MEMALLOC)
-			goto out;
-
-		folio_lock(src);
+		goto out;
 	}
 	locked = true;