diff mbox series

[v4] mm: Enable suspend-only swap spaces

Message ID 20210726171106.v4.1.I09866d90c6de14f21223a03e9e6a31f8a02ecbaf@changeid (mailing list archive)
State New
Headers show
Series [v4] mm: Enable suspend-only swap spaces | expand

Commit Message

Evan Green July 27, 2021, 12:12 a.m. UTC
Add a new SWAP_FLAG_HIBERNATE_ONLY that adds a swap region but refuses
to allow generic swapping to it. This region can still be wired up for
use in suspend-to-disk activities, but will never have regular pages
swapped to it. This flag will be passed in by utilities like swapon(8),
usage would probably look something like: swapon -o hibernate /dev/sda2.

Currently it's not possible to enable hibernation without also enabling
generic swap for a given area. One semi-workaround for this is to delay
the call to swapon() until just before attempting to hibernate, and then
call swapoff() just after hibernate completes. This is somewhat kludgy,
and also doesn't really work to keep swap out of the hibernate region.
When hibernate begins, it starts by allocating a large chunk of memory
for itself. This often ends up forcing a lot of data out into swap. By
this time the hibernate region is eligible for generic swap, so swap
ends up leaking into the hibernate region even with the workaround.

There are a few reasons why usermode might want to be able to
exclusively steer swap and hibernate. One reason relates to SSD wearing.
Hibernate's endurance and speed requirements are different from swap.
It may for instance be advantageous to keep hibernate in primary
storage, but put swap in an SLC namespace. These namespaces are faster
and have better endurance, but cost 3-4x in terms of capacity.
Exclusively steering hibernate and swap enables system designers to
accurately partition their storage without either wearing out their
primary storage, or overprovisioning their fast swap area.

Another reason to allow exclusive steering has to do with security.
The requirements for designing systems with resilience against
offline attacks are different between swap and hibernate. Swap
effectively requires a dictionary of hashes, as pages can be added and
removed arbitrarily, whereas hibernate only needs a single hash for the
entire image. If you've set up block-level integrity for swap and
image-level integrity for hibernate, then allowing swap blocks to
possibly leak out to the hibernate region is problematic, since it
creates swap pages not protected by any integrity.

Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in
/proc/meminfo under SwapTotal and SwapFree, since they are not usable as
general swap. These regions do still appear in /proc/swaps.

Signed-off-by: Evan Green <evgreen@chromium.org>
---

Changes in v4:
 - Rework commit message to summarize workaround discussion [David]
 - Rename flag from SWAP_FLAG_SWAPON to SWAP_FLAG_HIBERNATE_ONLY [David]
 - Reject invalid flags when HIBERNATE_ONLY is set. [David]
 - Reject HIBERNATE_ONLY if CONFIG_HIBERNATION is not set. [David]
 - Relax VM_BUG_ON() since HIBERNATE_ONLY regions may directly free
   pages. [Evan]

Changes in v3:
 - Updated commit message with additional explanation [Andrew]

Changes in v2:
 - NOSWAP regions should not contribute to Swap stats in /proc/meminfo.
   [David]
 - Adjusted comment of SWAP_FLAG_NOSWAP [Pavel]
 - Note: Opted not to take Pavel's tag since enough has changed in this
   revision to warrant another look.
 - Call swap_entry_free() in swap_free to avoid NOSWAP leaks back into
   the general pool via swap_slots_cache [me].

 include/linux/swap.h | 10 ++++++-
 mm/swapfile.c        | 67 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 60 insertions(+), 17 deletions(-)

Comments

David Hildenbrand July 27, 2021, 9:48 a.m. UTC | #1
On 27.07.21 02:12, Evan Green wrote:
> Add a new SWAP_FLAG_HIBERNATE_ONLY that adds a swap region but refuses
> to allow generic swapping to it. This region can still be wired up for
> use in suspend-to-disk activities, but will never have regular pages
> swapped to it. This flag will be passed in by utilities like swapon(8),
> usage would probably look something like: swapon -o hibernate /dev/sda2.
> 
> Currently it's not possible to enable hibernation without also enabling
> generic swap for a given area. One semi-workaround for this is to delay
> the call to swapon() until just before attempting to hibernate, and then
> call swapoff() just after hibernate completes. This is somewhat kludgy,
> and also doesn't really work to keep swap out of the hibernate region.
> When hibernate begins, it starts by allocating a large chunk of memory
> for itself. This often ends up forcing a lot of data out into swap. By
> this time the hibernate region is eligible for generic swap, so swap
> ends up leaking into the hibernate region even with the workaround.
> 
> There are a few reasons why usermode might want to be able to
> exclusively steer swap and hibernate. One reason relates to SSD wearing.
> Hibernate's endurance and speed requirements are different from swap.
> It may for instance be advantageous to keep hibernate in primary
> storage, but put swap in an SLC namespace. These namespaces are faster
> and have better endurance, but cost 3-4x in terms of capacity.
> Exclusively steering hibernate and swap enables system designers to
> accurately partition their storage without either wearing out their
> primary storage, or overprovisioning their fast swap area.
> 
> Another reason to allow exclusive steering has to do with security.
> The requirements for designing systems with resilience against
> offline attacks are different between swap and hibernate. Swap
> effectively requires a dictionary of hashes, as pages can be added and
> removed arbitrarily, whereas hibernate only needs a single hash for the
> entire image. If you've set up block-level integrity for swap and
> image-level integrity for hibernate, then allowing swap blocks to
> possibly leak out to the hibernate region is problematic, since it
> creates swap pages not protected by any integrity.
> 
> Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in
> /proc/meminfo under SwapTotal and SwapFree, since they are not usable as
> general swap. These regions do still appear in /proc/swaps.

Right, and they also don't account towards the memory overcommit 
calculations.

Thanks for extending the patch description!

[...]

> +	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) {
> +		if (IS_ENABLED(CONFIG_HIBERNATION)) {
> +			if (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)
> +				return -EINVAL;
> +
> +		} else {
> +			return -EINVAL;
> +		}
> +	}

We could do short

if ((swap_flags & SWAP_FLAG_HIBERNATE_ONLY) &&
      (!IS_ENABLED(CONFIG_HIBERNATION) ||
       (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)))
	return -EINVAL;

or

if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY))
	if (!IS_ENABLED(CONFIG_HIBERNATION) ||
             (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS))
		return -EINVAL;

> +
>   	if (!capable(CAP_SYS_ADMIN))
>   		return -EPERM;
>   
> @@ -3335,16 +3366,20 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>   	if (swap_flags & SWAP_FLAG_PREFER)
>   		prio =
>   		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
> +
> +	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY)
> +		p->flags |= SWP_HIBERNATE_ONLY;
>   	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
>   
> -	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
> +	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
>   		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
>   		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
>   		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
>   		(p->flags & SWP_DISCARDABLE) ? "D" : "",
>   		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
>   		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
> -		(frontswap_map) ? "FS" : "");
> +		(frontswap_map) ? "FS" : "",
> +		(p->flags & SWP_HIBERNATE_ONLY) ? "H" : "");
>   
>   	mutex_unlock(&swapon_mutex);
>   	atomic_inc(&proc_poll_event);
> 

Looks like the cleanest alternative to me, as long as we don't want to 
invent new interfaces.

Acked-by: David Hildenbrand <david@redhat.com>
David Hildenbrand July 27, 2021, 12:21 p.m. UTC | #2
On 27.07.21 11:48, David Hildenbrand wrote:
> On 27.07.21 02:12, Evan Green wrote:
>> Add a new SWAP_FLAG_HIBERNATE_ONLY that adds a swap region but refuses
>> to allow generic swapping to it. This region can still be wired up for
>> use in suspend-to-disk activities, but will never have regular pages
>> swapped to it. This flag will be passed in by utilities like swapon(8),
>> usage would probably look something like: swapon -o hibernate /dev/sda2.
>>
>> Currently it's not possible to enable hibernation without also enabling
>> generic swap for a given area. One semi-workaround for this is to delay
>> the call to swapon() until just before attempting to hibernate, and then
>> call swapoff() just after hibernate completes. This is somewhat kludgy,
>> and also doesn't really work to keep swap out of the hibernate region.
>> When hibernate begins, it starts by allocating a large chunk of memory
>> for itself. This often ends up forcing a lot of data out into swap. By
>> this time the hibernate region is eligible for generic swap, so swap
>> ends up leaking into the hibernate region even with the workaround.
>>
>> There are a few reasons why usermode might want to be able to
>> exclusively steer swap and hibernate. One reason relates to SSD wearing.
>> Hibernate's endurance and speed requirements are different from swap.
>> It may for instance be advantageous to keep hibernate in primary
>> storage, but put swap in an SLC namespace. These namespaces are faster
>> and have better endurance, but cost 3-4x in terms of capacity.
>> Exclusively steering hibernate and swap enables system designers to
>> accurately partition their storage without either wearing out their
>> primary storage, or overprovisioning their fast swap area.
>>
>> Another reason to allow exclusive steering has to do with security.
>> The requirements for designing systems with resilience against
>> offline attacks are different between swap and hibernate. Swap
>> effectively requires a dictionary of hashes, as pages can be added and
>> removed arbitrarily, whereas hibernate only needs a single hash for the
>> entire image. If you've set up block-level integrity for swap and
>> image-level integrity for hibernate, then allowing swap blocks to
>> possibly leak out to the hibernate region is problematic, since it
>> creates swap pages not protected by any integrity.
>>
>> Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in
>> /proc/meminfo under SwapTotal and SwapFree, since they are not usable as
>> general swap. These regions do still appear in /proc/swaps.
> 
> Right, and they also don't account towards the memory overcommit
> calculations.
> 
> Thanks for extending the patch description!
> 
> [...]
> 
>> +	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) {
>> +		if (IS_ENABLED(CONFIG_HIBERNATION)) {
>> +			if (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)
>> +				return -EINVAL;
>> +
>> +		} else {
>> +			return -EINVAL;
>> +		}
>> +	}
> 
> We could do short
> 
> if ((swap_flags & SWAP_FLAG_HIBERNATE_ONLY) &&
>        (!IS_ENABLED(CONFIG_HIBERNATION) ||
>         (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)))
> 	return -EINVAL;
> 
> or
> 
> if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY))
> 	if (!IS_ENABLED(CONFIG_HIBERNATION) ||
>               (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS))
> 		return -EINVAL;
> 
>> +
>>    	if (!capable(CAP_SYS_ADMIN))
>>    		return -EPERM;
>>    
>> @@ -3335,16 +3366,20 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>    	if (swap_flags & SWAP_FLAG_PREFER)
>>    		prio =
>>    		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
>> +
>> +	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY)
>> +		p->flags |= SWP_HIBERNATE_ONLY;
>>    	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
>>    
>> -	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
>> +	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
>>    		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
>>    		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
>>    		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
>>    		(p->flags & SWP_DISCARDABLE) ? "D" : "",
>>    		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
>>    		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
>> -		(frontswap_map) ? "FS" : "");
>> +		(frontswap_map) ? "FS" : "",
>> +		(p->flags & SWP_HIBERNATE_ONLY) ? "H" : "");
>>    
>>    	mutex_unlock(&swapon_mutex);
>>    	atomic_inc(&proc_poll_event);
>>
> 
> Looks like the cleanest alternative to me, as long as we don't want to
> invent new interfaces.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 

Pavel just mentioned uswsusp, and I wonder if it would be a possible 
alternative to this patch.
Evan Green July 27, 2021, 4:31 p.m. UTC | #3
On Tue, Jul 27, 2021 at 5:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.07.21 11:48, David Hildenbrand wrote:
> > On 27.07.21 02:12, Evan Green wrote:
> >> Add a new SWAP_FLAG_HIBERNATE_ONLY that adds a swap region but refuses
> >> to allow generic swapping to it. This region can still be wired up for
> >> use in suspend-to-disk activities, but will never have regular pages
> >> swapped to it. This flag will be passed in by utilities like swapon(8),
> >> usage would probably look something like: swapon -o hibernate /dev/sda2.
> >>
> >> Currently it's not possible to enable hibernation without also enabling
> >> generic swap for a given area. One semi-workaround for this is to delay
> >> the call to swapon() until just before attempting to hibernate, and then
> >> call swapoff() just after hibernate completes. This is somewhat kludgy,
> >> and also doesn't really work to keep swap out of the hibernate region.
> >> When hibernate begins, it starts by allocating a large chunk of memory
> >> for itself. This often ends up forcing a lot of data out into swap. By
> >> this time the hibernate region is eligible for generic swap, so swap
> >> ends up leaking into the hibernate region even with the workaround.
> >>
> >> There are a few reasons why usermode might want to be able to
> >> exclusively steer swap and hibernate. One reason relates to SSD wearing.
> >> Hibernate's endurance and speed requirements are different from swap.
> >> It may for instance be advantageous to keep hibernate in primary
> >> storage, but put swap in an SLC namespace. These namespaces are faster
> >> and have better endurance, but cost 3-4x in terms of capacity.
> >> Exclusively steering hibernate and swap enables system designers to
> >> accurately partition their storage without either wearing out their
> >> primary storage, or overprovisioning their fast swap area.
> >>
> >> Another reason to allow exclusive steering has to do with security.
> >> The requirements for designing systems with resilience against
> >> offline attacks are different between swap and hibernate. Swap
> >> effectively requires a dictionary of hashes, as pages can be added and
> >> removed arbitrarily, whereas hibernate only needs a single hash for the
> >> entire image. If you've set up block-level integrity for swap and
> >> image-level integrity for hibernate, then allowing swap blocks to
> >> possibly leak out to the hibernate region is problematic, since it
> >> creates swap pages not protected by any integrity.
> >>
> >> Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in
> >> /proc/meminfo under SwapTotal and SwapFree, since they are not usable as
> >> general swap. These regions do still appear in /proc/swaps.
> >
> > Right, and they also don't account towards the memory overcommit
> > calculations.
> >
> > Thanks for extending the patch description!

No problem, thanks for all the brainwaves directed at this.

> >
> > [...]
> >
> >> +    if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) {
> >> +            if (IS_ENABLED(CONFIG_HIBERNATION)) {
> >> +                    if (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)
> >> +                            return -EINVAL;
> >> +
> >> +            } else {
> >> +                    return -EINVAL;
> >> +            }
> >> +    }
> >
> > We could do short
> >
> > if ((swap_flags & SWAP_FLAG_HIBERNATE_ONLY) &&
> >        (!IS_ENABLED(CONFIG_HIBERNATION) ||
> >         (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)))
> >       return -EINVAL;
> >
> > or
> >
> > if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY))
> >       if (!IS_ENABLED(CONFIG_HIBERNATION) ||
> >               (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS))
> >               return -EINVAL;
> >
> >> +
> >>      if (!capable(CAP_SYS_ADMIN))
> >>              return -EPERM;
> >>
> >> @@ -3335,16 +3366,20 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >>      if (swap_flags & SWAP_FLAG_PREFER)
> >>              prio =
> >>                (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
> >> +
> >> +    if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY)
> >> +            p->flags |= SWP_HIBERNATE_ONLY;
> >>      enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
> >>
> >> -    pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
> >> +    pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
> >>              p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
> >>              nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
> >>              (p->flags & SWP_SOLIDSTATE) ? "SS" : "",
> >>              (p->flags & SWP_DISCARDABLE) ? "D" : "",
> >>              (p->flags & SWP_AREA_DISCARD) ? "s" : "",
> >>              (p->flags & SWP_PAGE_DISCARD) ? "c" : "",
> >> -            (frontswap_map) ? "FS" : "");
> >> +            (frontswap_map) ? "FS" : "",
> >> +            (p->flags & SWP_HIBERNATE_ONLY) ? "H" : "");
> >>
> >>      mutex_unlock(&swapon_mutex);
> >>      atomic_inc(&proc_poll_event);
> >>
> >
> > Looks like the cleanest alternative to me, as long as we don't want to
> > invent new interfaces.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> >
>
> Pavel just mentioned uswsusp, and I wonder if it would be a possible
> alternative to this patch.

I think you're right that it would be possible to isolate the
hibernate image with uswsusp if you avoid using the SNAPSHOT_*SWAP*
ioctls. But I'd expect performance to suffer noticeably, since now
every page is making a round trip out to usermode and back. I'd still
very much use the HIBERNATE_ONLY flag if it were accepted, I think
there's value to it.

-Evan
Andrew Morton July 27, 2021, 9:18 p.m. UTC | #4
On Tue, 27 Jul 2021 09:31:33 -0700 Evan Green <evgreen@chromium.org> wrote:

> > Pavel just mentioned uswsusp, and I wonder if it would be a possible
> > alternative to this patch.
> 
> I think you're right that it would be possible to isolate the
> hibernate image with uswsusp if you avoid using the SNAPSHOT_*SWAP*
> ioctls. But I'd expect performance to suffer noticeably, since now
> every page is making a round trip out to usermode and back. I'd still
> very much use the HIBERNATE_ONLY flag if it were accepted, I think
> there's value to it.

The uswsusp option makes your patch a performance optimization rather
than a feature-add.  And we do like to see quantitative testing results
when considering a performance optimization.  Especially when the
performance optimization is a bit icky, putting special-case testing
all over the place, maintenance cost, additional testing effort, etc.

I do think that diligence demands that we quantify the difference.  Is
this a thing you can help with?
Karel Zak July 30, 2021, 1:15 p.m. UTC | #5
On Mon, Jul 26, 2021 at 05:12:46PM -0700, Evan Green wrote:
> Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in
> /proc/meminfo under SwapTotal and SwapFree, since they are not usable as
> general swap. These regions do still appear in /proc/swaps.

Off-topic, /proc/swaps is in the same situation like /proc/mounts years ago ...

It does not provide all important information like SWAP_FLAG_DISCARD_PAGES
SWAP_FLAG_DISCARD_ONCE and SWAP_FLAG_HIBERNATE_ONLY flags. 

Users will not able to differentiate between regular and hibernate-only
devices in "swapon" or "cat /proc/swaps" output ;-(

It would be nice to have /proc/swapsinfo with extendible "flags" column.

 Karel
Evan Green Aug. 3, 2021, 6 p.m. UTC | #6
On Tue, Jul 27, 2021 at 2:18 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 27 Jul 2021 09:31:33 -0700 Evan Green <evgreen@chromium.org> wrote:
>
> > > Pavel just mentioned uswsusp, and I wonder if it would be a possible
> > > alternative to this patch.
> >
> > I think you're right that it would be possible to isolate the
> > hibernate image with uswsusp if you avoid using the SNAPSHOT_*SWAP*
> > ioctls. But I'd expect performance to suffer noticeably, since now
> > every page is making a round trip out to usermode and back. I'd still
> > very much use the HIBERNATE_ONLY flag if it were accepted, I think
> > there's value to it.
>
> The uswsusp option makes your patch a performance optimization rather
> than a feature-add.  And we do like to see quantitative testing results
> when considering a performance optimization.  Especially when the
> performance optimization is a bit icky, putting special-case testing
> all over the place, maintenance cost, additional testing effort, etc.
>
> I do think that diligence demands that we quantify the difference.  Is
> this a thing you can help with?

I'm wrong about the performance. Uswsusp is just as fast, and possibly
faster in my use case than kernel-driven hibernate. What's more,
uswsusp also helps me solve several additional problems I hadn't
tackled yet that were looming in front of me. Thanks all for your
patience and thoughtful review on this.

Patch withdrawn.
-Evan
Pavel Machek Aug. 17, 2021, 6:32 p.m. UTC | #7
Hi!

> > > > Pavel just mentioned uswsusp, and I wonder if it would be a possible
> > > > alternative to this patch.
> > >
> > > I think you're right that it would be possible to isolate the
> > > hibernate image with uswsusp if you avoid using the SNAPSHOT_*SWAP*
> > > ioctls. But I'd expect performance to suffer noticeably, since now
> > > every page is making a round trip out to usermode and back. I'd still
> > > very much use the HIBERNATE_ONLY flag if it were accepted, I think
> > > there's value to it.
> >
> > The uswsusp option makes your patch a performance optimization rather
> > than a feature-add.  And we do like to see quantitative testing results
> > when considering a performance optimization.  Especially when the
> > performance optimization is a bit icky, putting special-case testing
> > all over the place, maintenance cost, additional testing effort, etc.
> >
> > I do think that diligence demands that we quantify the difference.  Is
> > this a thing you can help with?
> 
> I'm wrong about the performance. Uswsusp is just as fast, and possibly
> faster in my use case than kernel-driven hibernate. What's more,
> uswsusp also helps me solve several additional problems I hadn't
> tackled yet that were looming in front of me. Thanks all for your
> patience and thoughtful review on this.

Great to see uswsusp being used :-).

Thanks,
								Pavel
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cdf0957a88a49a..0d922daac94cd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -28,10 +28,17 @@  struct pagevec;
 #define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
 #define SWAP_FLAG_DISCARD_ONCE	0x20000 /* discard swap area at swapon-time */
 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
+#define SWAP_FLAG_HIBERNATE_ONLY 0x80000 /* use only for hibernate, not swap */
 
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
-				 SWAP_FLAG_DISCARD_PAGES)
+				 SWAP_FLAG_DISCARD_PAGES | \
+				 SWAP_FLAG_HIBERNATE_ONLY)
+
+/* Valid flags when SWAP_FLAG_HIBERNATE_ONLY is set */
+#define SWAP_HIBERNATE_ONLY_VALID_FLAGS \
+	(SWAP_FLAG_HIBERNATE_ONLY | SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE)
+
 #define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
@@ -182,6 +189,7 @@  enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_HIBERNATE_ONLY = (1 << 13),	/* use only for hibernate, not swap */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3dcaeecc50f54..0c782e12328d25 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -697,7 +697,8 @@  static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 	if (si->inuse_pages == si->pages) {
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
-		del_from_avail_list(si);
+		if (!(si->flags & SWP_HIBERNATE_ONLY))
+			del_from_avail_list(si);
 	}
 }
 
@@ -726,10 +727,13 @@  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		bool was_full = !si->highest_bit;
 
 		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
+		if (was_full &&
+		    ((si->flags & (SWP_WRITEOK | SWP_HIBERNATE_ONLY)) ==
+		     SWP_WRITEOK))
 			add_to_avail_list(si);
 	}
-	atomic_long_add(nr_entries, &nr_swap_pages);
+	if (!(si->flags & SWP_HIBERNATE_ONLY))
+		atomic_long_add(nr_entries, &nr_swap_pages);
 	si->inuse_pages -= nr_entries;
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
@@ -1078,6 +1082,9 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
+			WARN((si->flags & SWP_HIBERNATE_ONLY),
+			     "swap_info %d in list but SWP_HIBERNATE_ONLY\n",
+			     si->type);
 			__del_from_avail_list(si);
 			spin_unlock(&si->lock);
 			goto nextsi;
@@ -1320,7 +1327,9 @@  static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 
 	ci = lock_cluster(p, offset);
 	count = p->swap_map[offset];
-	VM_BUG_ON(count != SWAP_HAS_CACHE);
+	/* Pages are only freed by the cache or directly by hibernate. */
+	VM_BUG_ON((count != SWAP_HAS_CACHE) &&
+		  !((p->flags & SWP_HIBERNATE_ONLY) && (count == 1)));
 	p->swap_map[offset] = 0;
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
@@ -1338,8 +1347,12 @@  void swap_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
-	if (p)
-		__swap_entry_free(p, entry);
+	if (p) {
+		if (p->flags & SWP_HIBERNATE_ONLY)
+			swap_entry_free(p, entry);
+		else
+			__swap_entry_free(p, entry);
+	}
 }
 
 /*
@@ -1783,8 +1796,10 @@  swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
-		atomic_long_dec(&nr_swap_pages);
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) {
+		if (!(si->flags & SWP_HIBERNATE_ONLY))
+			atomic_long_dec(&nr_swap_pages);
+	}
 	spin_unlock(&si->lock);
 fail:
 	return entry;
@@ -2454,8 +2469,6 @@  static void setup_swap_info(struct swap_info_struct *p, int prio,
 static void _enable_swap_info(struct swap_info_struct *p)
 {
 	p->flags |= SWP_WRITEOK;
-	atomic_long_add(p->pages, &nr_swap_pages);
-	total_swap_pages += p->pages;
 
 	assert_spin_locked(&swap_lock);
 	/*
@@ -2469,7 +2482,11 @@  static void _enable_swap_info(struct swap_info_struct *p)
 	 * swap_info_struct.
 	 */
 	plist_add(&p->list, &swap_active_head);
-	add_to_avail_list(p);
+	if (!(p->flags & SWP_HIBERNATE_ONLY)) {
+		atomic_long_add(p->pages, &nr_swap_pages);
+		total_swap_pages += p->pages;
+		add_to_avail_list(p);
+	}
 }
 
 static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -2564,7 +2581,9 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
+	if (!(p->flags & SWP_HIBERNATE_ONLY))
+		del_from_avail_list(p);
+
 	spin_lock(&p->lock);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
@@ -2581,8 +2600,10 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		least_priority++;
 	}
 	plist_del(&p->list, &swap_active_head);
-	atomic_long_sub(p->pages, &nr_swap_pages);
-	total_swap_pages -= p->pages;
+	if (!(p->flags & SWP_HIBERNATE_ONLY)) {
+		atomic_long_sub(p->pages, &nr_swap_pages);
+		total_swap_pages -= p->pages;
+	}
 	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
@@ -3147,6 +3168,16 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (swap_flags & ~SWAP_FLAGS_VALID)
 		return -EINVAL;
 
+	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) {
+		if (IS_ENABLED(CONFIG_HIBERNATION)) {
+			if (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)
+				return -EINVAL;
+
+		} else {
+			return -EINVAL;
+		}
+	}
+
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
@@ -3335,16 +3366,20 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio =
 		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
+
+	if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY)
+		p->flags |= SWP_HIBERNATE_ONLY;
 	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
 
-	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
+	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
 		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
 		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
 		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
 		(p->flags & SWP_DISCARDABLE) ? "D" : "",
 		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
 		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
-		(frontswap_map) ? "FS" : "");
+		(frontswap_map) ? "FS" : "",
+		(p->flags & SWP_HIBERNATE_ONLY) ? "H" : "");
 
 	mutex_unlock(&swapon_mutex);
 	atomic_inc(&proc_poll_event);