diff mbox series

[v2] mm: Enable suspend-only swap spaces

Message ID 20210709105012.v2.1.I09866d90c6de14f21223a03e9e6a31f8a02ecbaf@changeid (mailing list archive)
State New
Headers show
Series [v2] mm: Enable suspend-only swap spaces | expand

Commit Message

Evan Green July 9, 2021, 5:50 p.m. UTC
Currently it's not possible to enable hibernation without also enabling
generic swap for a given swap area. These two use cases are not the
same. For example there may be users who want to enable hibernation,
but whose drives don't have the write endurance for generic swap
activities.

Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
generic swapping to it. This region can still be wired up for use in
suspend-to-disk activities, but will never have regular pages swapped to
it.

Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
under SwapTotal and SwapFree, since they are not usable as general swap.

Signed-off-by: Evan Green <evgreen@chromium.org>
---

Changes in v2:
 - NOSWAP regions should not contribute to Swap stats in /proc/meminfo.
   [David]
 - Adjusted comment of SWAP_FLAG_NOSWAP [Pavel]
 - Note: Opted not to take Pavel's tag since enough has changed in this
   revision to warrant another look.
 - Call swap_entry_free() in swap_free to avoid NOSWAP leaks back into
   the general pool via swap_slots_cache [me].

 include/linux/swap.h |  4 +++-
 mm/swapfile.c        | 52 +++++++++++++++++++++++++++++++-------------
 2 files changed, 40 insertions(+), 16 deletions(-)

Comments

Andrew Morton July 9, 2021, 10:20 p.m. UTC | #1
On Fri,  9 Jul 2021 10:50:48 -0700 Evan Green <evgreen@chromium.org> wrote:

> Currently it's not possible to enable hibernation without also enabling
> generic swap for a given swap area. These two use cases are not the
> same. For example there may be users who want to enable hibernation,
> but whose drives don't have the write endurance for generic swap
> activities.
> 
> Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> generic swapping to it. This region can still be wired up for use in
> suspend-to-disk activities, but will never have regular pages swapped to
> it.
> 
> Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
> under SwapTotal and SwapFree, since they are not usable as general swap.
> 

This patch doesn't appear to set SWAP_FLAG_NOSWAP anywhere.  Perhaps
there's another patch somewhere which changes the hibernation code?  If
so, can we please have both patches in a series?

Once we have a description of how this thing gets set, please let's
discuss what happens if someone tries to enable generic swap onto that
device after hibernation has set SWAP_FLAG_NOSWAP (I'm basically
guessing now).  Will it work?  Is there a backward-compatibility issue
here?
Evan Green July 9, 2021, 11:23 p.m. UTC | #2
On Fri, Jul 9, 2021 at 3:20 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  9 Jul 2021 10:50:48 -0700 Evan Green <evgreen@chromium.org> wrote:
>
> > Currently it's not possible to enable hibernation without also enabling
> > generic swap for a given swap area. These two use cases are not the
> > same. For example there may be users who want to enable hibernation,
> > but whose drives don't have the write endurance for generic swap
> > activities.
> >
> > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > generic swapping to it. This region can still be wired up for use in
> > suspend-to-disk activities, but will never have regular pages swapped to
> > it.
> >
> > Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
> > under SwapTotal and SwapFree, since they are not usable as general swap.
> >
>
> This patch doesn't appear to set SWAP_FLAG_NOSWAP anywhere.  Perhaps
> there's another patch somewhere which changes the hibernation code?  If
> so, can we please have both patches in a series?

There's no other patch, in the kernel at least. SWAP_FLAG_* is exposed
to usermode, which would set it when calling swapon(2). Once this
patch is accepted, I'll have to add the option into util-linux [1], so
that I can use it in my init scripts.

Said a different way, this patch isn't about altering how hibernate
behaves, but about giving usermode the freedom to set up hibernate and
swap independently.

[1] https://github.com/karelzak/util-linux/blob/b4533177aeac287e0b0563cd1b9ee61bce29ee88/sys-utils/swapon.c#L710

>
> Once we have a description of how this thing gets set, please let's
> discuss what happens if someone tries to enable generic swap onto that
> device after hibernation has set SWAP_FLAG_NOSWAP (I'm basically
> guessing now).  Will it work?  Is there a backward-compatibility issue
> here?

The above paragraph maybe cleared this up. The hibernate code will
still happily do I/O to any region handed to it by the swap code. That
could be a region already peppered with active swap (the normal case
today), or a NOSWAP region which swap otherwise stays out of (but
still manages). If we did swapoff and swapon with a different setting
for NOSWAP, that should all work fine, since hibernate leaves it to
the swapfile code to be in charge of sector allocs/frees.

There shouldn't be any backward compatibility issues because
SWAP_FLAGS_VALID enforced that usermode had been keeping the
unallocated bits as 0.

-Evan
Andrew Morton July 9, 2021, 11:33 p.m. UTC | #3
On Fri, 9 Jul 2021 16:23:18 -0700 Evan Green <evgreen@chromium.org> wrote:

> On Fri, Jul 9, 2021 at 3:20 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri,  9 Jul 2021 10:50:48 -0700 Evan Green <evgreen@chromium.org> wrote:
> >
> > > Currently it's not possible to enable hibernation without also enabling
> > > generic swap for a given swap area. These two use cases are not the
> > > same. For example there may be users who want to enable hibernation,
> > > but whose drives don't have the write endurance for generic swap
> > > activities.
> > >
> > > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > > generic swapping to it. This region can still be wired up for use in
> > > suspend-to-disk activities, but will never have regular pages swapped to
> > > it.
> > >
> > > Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
> > > under SwapTotal and SwapFree, since they are not usable as general swap.
> > >
> >
> > This patch doesn't appear to set SWAP_FLAG_NOSWAP anywhere.  Perhaps
> > there's another patch somewhere which changes the hibernation code?  If
> > so, can we please have both patches in a series?
> 
> There's no other patch, in the kernel at least. SWAP_FLAG_* is exposed
> to usermode, which would set it when calling swapon(2). Once this
> patch is accepted, I'll have to add the option into util-linux [1], so
> that I can use it in my init scripts.
> 
> Said a different way, this patch isn't about altering how hibernate
> behaves, but about giving usermode the freedom to set up hibernate and
> swap independently.

OK, can we please get this into the changelog?  And it would be helpful
to describe how this will be invoked via swapon(8).

And I expect an update to the swapon syscall's manpage will be in order.
Michal Hocko July 12, 2021, 7:03 a.m. UTC | #4
[Cc linux-api]

On Fri 09-07-21 10:50:48, Evan Green wrote:
> Currently it's not possible to enable hibernation without also enabling
> generic swap for a given swap area. These two use cases are not the
> same. For example there may be users who want to enable hibernation,
> but whose drives don't have the write endurance for generic swap
> activities.
> 
> Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> generic swapping to it. This region can still be wired up for use in
> suspend-to-disk activities, but will never have regular pages swapped to
> it.

Could you expand some more on why a strict exclusion is really
necessary? I do understand that one might not want to have swap storage
available all the time but considering that swapon is really a light
operation so something like the following should be a reasonable
workaround, no?
	swapon storage/file
	s2disk
	swapoff storage

> Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
> under SwapTotal and SwapFree, since they are not usable as general swap.
> 
> Signed-off-by: Evan Green <evgreen@chromium.org>
> ---
> 
> Changes in v2:
>  - NOSWAP regions should not contribute to Swap stats in /proc/meminfo.
>    [David]
>  - Adjusted comment of SWAP_FLAG_NOSWAP [Pavel]
>  - Note: Opted not to take Pavel's tag since enough has changed in this
>    revision to warrant another look.
>  - Call swap_entry_free() in swap_free to avoid NOSWAP leaks back into
>    the general pool via swap_slots_cache [me].
> 
>  include/linux/swap.h |  4 +++-
>  mm/swapfile.c        | 52 +++++++++++++++++++++++++++++++-------------
>  2 files changed, 40 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 6f5a43251593c8..995466c9f16a76 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -28,10 +28,11 @@ struct pagevec;
>  #define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
>  #define SWAP_FLAG_DISCARD_ONCE	0x20000 /* discard swap area at swapon-time */
>  #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
> +#define SWAP_FLAG_NOSWAP	0x80000 /* use only for hibernate, not swap */
>  
>  #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
>  				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
> -				 SWAP_FLAG_DISCARD_PAGES)
> +				 SWAP_FLAG_DISCARD_PAGES | SWAP_FLAG_NOSWAP)
>  #define SWAP_BATCH 64
>  
>  static inline int current_is_kswapd(void)
> @@ -182,6 +183,7 @@ enum {
>  	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
> +	SWP_NOSWAP	= (1 << 13),	/* use only for suspend, not swap */
>  					/* add others here before... */
>  	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1e07d1c776f2ae..d9ab39d32e55ec 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -697,7 +697,8 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
>  	if (si->inuse_pages == si->pages) {
>  		si->lowest_bit = si->max;
>  		si->highest_bit = 0;
> -		del_from_avail_list(si);
> +		if (!(si->flags & SWP_NOSWAP))
> +			del_from_avail_list(si);
>  	}
>  }
>  
> @@ -726,10 +727,12 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>  		bool was_full = !si->highest_bit;
>  
>  		WRITE_ONCE(si->highest_bit, end);
> -		if (was_full && (si->flags & SWP_WRITEOK))
> +		if (was_full &&
> +		    ((si->flags & (SWP_WRITEOK | SWP_NOSWAP)) == SWP_WRITEOK))
>  			add_to_avail_list(si);
>  	}
> -	atomic_long_add(nr_entries, &nr_swap_pages);
> +	if (!(si->flags & SWP_NOSWAP))
> +		atomic_long_add(nr_entries, &nr_swap_pages);
>  	si->inuse_pages -= nr_entries;
>  	if (si->flags & SWP_BLKDEV)
>  		swap_slot_free_notify =
> @@ -1078,6 +1081,9 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			WARN(!(si->flags & SWP_WRITEOK),
>  			     "swap_info %d in list but !SWP_WRITEOK\n",
>  			     si->type);
> +			WARN((si->flags & SWP_NOSWAP),
> +			     "swap_info %d in list but SWP_NOSWAP\n",
> +			     si->type);
>  			__del_from_avail_list(si);
>  			spin_unlock(&si->lock);
>  			goto nextsi;
> @@ -1338,8 +1344,12 @@ void swap_free(swp_entry_t entry)
>  	struct swap_info_struct *p;
>  
>  	p = _swap_info_get(entry);
> -	if (p)
> -		__swap_entry_free(p, entry);
> +	if (p) {
> +		if (p->flags & SWP_NOSWAP)
> +			swap_entry_free(p, entry);
> +		else
> +			__swap_entry_free(p, entry);
> +	}
>  }
>  
>  /*
> @@ -1783,8 +1793,10 @@ swp_entry_t get_swap_page_of_type(int type)
>  
>  	/* This is called for allocating swap entry, not cache */
>  	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
> -		atomic_long_dec(&nr_swap_pages);
> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) {
> +		if (!(si->flags & SWP_NOSWAP))
> +			atomic_long_dec(&nr_swap_pages);
> +	}
>  	spin_unlock(&si->lock);
>  fail:
>  	return entry;
> @@ -2454,8 +2466,6 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,
>  static void _enable_swap_info(struct swap_info_struct *p)
>  {
>  	p->flags |= SWP_WRITEOK;
> -	atomic_long_add(p->pages, &nr_swap_pages);
> -	total_swap_pages += p->pages;
>  
>  	assert_spin_locked(&swap_lock);
>  	/*
> @@ -2469,7 +2479,11 @@ static void _enable_swap_info(struct swap_info_struct *p)
>  	 * swap_info_struct.
>  	 */
>  	plist_add(&p->list, &swap_active_head);
> -	add_to_avail_list(p);
> +	if (!(p->flags & SWP_NOSWAP)) {
> +		atomic_long_add(p->pages, &nr_swap_pages);
> +		total_swap_pages += p->pages;
> +		add_to_avail_list(p);
> +	}
>  }
>  
>  static void enable_swap_info(struct swap_info_struct *p, int prio,
> @@ -2564,7 +2578,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		spin_unlock(&swap_lock);
>  		goto out_dput;
>  	}
> -	del_from_avail_list(p);
> +	if (!(p->flags & SWP_NOSWAP))
> +		del_from_avail_list(p);
> +
>  	spin_lock(&p->lock);
>  	if (p->prio < 0) {
>  		struct swap_info_struct *si = p;
> @@ -2581,8 +2597,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		least_priority++;
>  	}
>  	plist_del(&p->list, &swap_active_head);
> -	atomic_long_sub(p->pages, &nr_swap_pages);
> -	total_swap_pages -= p->pages;
> +	if (!(p->flags & SWP_NOSWAP)) {
> +		atomic_long_sub(p->pages, &nr_swap_pages);
> +		total_swap_pages -= p->pages;
> +	}
>  	p->flags &= ~SWP_WRITEOK;
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
> @@ -3329,16 +3347,20 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  	if (swap_flags & SWAP_FLAG_PREFER)
>  		prio =
>  		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
> +
> +	if (swap_flags & SWAP_FLAG_NOSWAP)
> +		p->flags |= SWP_NOSWAP;
>  	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
>  
> -	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
> +	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
>  		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
>  		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
>  		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
>  		(p->flags & SWP_DISCARDABLE) ? "D" : "",
>  		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
>  		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
> -		(frontswap_map) ? "FS" : "");
> +		(frontswap_map) ? "FS" : "",
> +		(p->flags & SWP_NOSWAP) ? "N" : "");
>  
>  	mutex_unlock(&swapon_mutex);
>  	atomic_inc(&proc_poll_event);
> -- 
> 2.31.0
David Hildenbrand July 12, 2021, 7:41 a.m. UTC | #5
On 12.07.21 09:03, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Fri 09-07-21 10:50:48, Evan Green wrote:
>> Currently it's not possible to enable hibernation without also enabling
>> generic swap for a given swap area. These two use cases are not the
>> same. For example there may be users who want to enable hibernation,
>> but whose drives don't have the write endurance for generic swap
>> activities.
>>
>> Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
>> generic swapping to it. This region can still be wired up for use in
>> suspend-to-disk activities, but will never have regular pages swapped to
>> it.
> 
> Could you expand some more on why a strict exclusion is really
> necessary? I do understand that one might not want to have swap storage
> available all the time but considering that swapon is really a light
> operation so something like the following should be a reasonable
> workaround, no?
> 	swapon storage/file
> 	s2disk
> 	swapoff storage

I'm certainly not a hibernation expert, but I'd guess this can also be 
triggered by HW events, so from the kernel and not only from user space 
where your workaround would apply.
Evan Green July 12, 2021, 9:32 p.m. UTC | #6
On Mon, Jul 12, 2021 at 12:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [Cc linux-api]
>
> On Fri 09-07-21 10:50:48, Evan Green wrote:
> > Currently it's not possible to enable hibernation without also enabling
> > generic swap for a given swap area. These two use cases are not the
> > same. For example there may be users who want to enable hibernation,
> > but whose drives don't have the write endurance for generic swap
> > activities.
> >
> > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > generic swapping to it. This region can still be wired up for use in
> > suspend-to-disk activities, but will never have regular pages swapped to
> > it.
>
> Could you expand some more on why a strict exclusion is really
> necessary? I do understand that one might not want to have swap storage
> available all the time but considering that swapon is really a light
> operation so something like the following should be a reasonable
> workaround, no?
>         swapon storage/file
>         s2disk
>         swapoff storage

Broadly, it seemed like a reasonable thing for the kernel to be able
to do. The workaround you suggest does work for some use cases, but it
seems like a gap the kernel could more naturally fill.

Without getting too off into the weeds, there a handful of factors
that make this change particularly useful to me:

 * Slicing off part of your SSD to be SLC (single level cell) is
expensive. From what I understand you gain endurance and speed at the
cost of 3-4x capacity. In other words for every 1GB of SLC space you
need for swap, it costs you 3-4GB of storage space out of the primary
namespace. So I'm incentivized to size this region as small as
possible. Hibernate's speed/endurance requirements are not quite as
harsh as regular swap. Steering them separately gives me the ability
to put the hibernate image in regular storage, and not be forced to
oversize expensive/fast swap space.

 * Even with the workaround, swap can end up in the hibernate region.
Hibernate starts by allocating its giant 50%-of-memory region, which
is often the forcing function for pushing things into swap. With the
workaround, even if my hibernate region is in last priority, there's
still a reasonable chance I'll end up swapping into it. If I have
different security designs for swap space and hibernate, then even a
chance of some swap leaking into this region is a problem.

 * I also want to limit the online attack surface that swap presents.
I can make headway here by disallowing open() calls on active swap
regions (via an LSM), and permanently disabling swapon/swapoff system
calls after early init. The workaround isn't great for me because I
want to set everything up at early init time and then not touch it. By
suspend time, on my system I no longer have the ability to make
swapon/swapoff calls.

-Evan
Evan Green July 12, 2021, 9:36 p.m. UTC | #7
On Fri, Jul 9, 2021 at 4:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 9 Jul 2021 16:23:18 -0700 Evan Green <evgreen@chromium.org> wrote:
>
> > On Fri, Jul 9, 2021 at 3:20 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri,  9 Jul 2021 10:50:48 -0700 Evan Green <evgreen@chromium.org> wrote:
> > >
> > > > Currently it's not possible to enable hibernation without also enabling
> > > > generic swap for a given swap area. These two use cases are not the
> > > > same. For example there may be users who want to enable hibernation,
> > > > but whose drives don't have the write endurance for generic swap
> > > > activities.
> > > >
> > > > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > > > generic swapping to it. This region can still be wired up for use in
> > > > suspend-to-disk activities, but will never have regular pages swapped to
> > > > it.
> > > >
> > > > Swap regions with SWAP_FLAG_NOSWAP set will not appear in /proc/meminfo
> > > > under SwapTotal and SwapFree, since they are not usable as general swap.
> > > >
> > >
> > > This patch doesn't appear to set SWAP_FLAG_NOSWAP anywhere.  Perhaps
> > > there's another patch somewhere which changes the hibernation code?  If
> > > so, can we please have both patches in a series?
> >
> > There's no other patch, in the kernel at least. SWAP_FLAG_* is exposed
> > to usermode, which would set it when calling swapon(2). Once this
> > patch is accepted, I'll have to add the option into util-linux [1], so
> > that I can use it in my init scripts.
> >
> > Said a different way, this patch isn't about altering how hibernate
> > behaves, but about giving usermode the freedom to set up hibernate and
> > swap independently.
>
> OK, can we please get this into the changelog?  And it would be helpful
> to describe how this will be invoked via swapon(8).

Sure, I can augment the commit text to include some of this, and what
it would likely look like from the commandline. I'll send a v3 for
that.

>
> And I expect an update to the swapon syscall's manpage will be in order.
>

Yes! I was originally planning to do that once this was accepted, but
can also spin it up in parallel if requested.
-Evan
Michal Hocko July 14, 2021, 5:41 a.m. UTC | #8
On Mon 12-07-21 14:32:05, Evan Green wrote:
> On Mon, Jul 12, 2021 at 12:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [Cc linux-api]
> >
> > On Fri 09-07-21 10:50:48, Evan Green wrote:
> > > Currently it's not possible to enable hibernation without also enabling
> > > generic swap for a given swap area. These two use cases are not the
> > > same. For example there may be users who want to enable hibernation,
> > > but whose drives don't have the write endurance for generic swap
> > > activities.
> > >
> > > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > > generic swapping to it. This region can still be wired up for use in
> > > suspend-to-disk activities, but will never have regular pages swapped to
> > > it.
> >
> > Could you expand some more on why a strict exclusion is really
> > necessary? I do understand that one might not want to have swap storage
> > available all the time but considering that swapon is really a light
> > operation so something like the following should be a reasonable
> > workaround, no?
> >         swapon storage/file
> >         s2disk
> >         swapoff storage
> 
> Broadly, it seemed like a reasonable thing for the kernel to be able
> to do. The workaround you suggest does work for some use cases, but it
> seems like a gap the kernel could more naturally fill.
> 
> Without getting too off into the weeds, there a handful of factors
> that make this change particularly useful to me:
> 
>  * Slicing off part of your SSD to be SLC (single level cell) is
> expensive. From what I understand you gain endurance and speed at the
> cost of 3-4x capacity. In other words for every 1GB of SLC space you
> need for swap, it costs you 3-4GB of storage space out of the primary
> namespace. So I'm incentivized to size this region as small as
> possible. Hibernate's speed/endurance requirements are not quite as
> harsh as regular swap. Steering them separately gives me the ability
> to put the hibernate image in regular storage, and not be forced to
> oversize expensive/fast swap space.

OK, this is likely true but it doesn't really explain/justify a
dedicated swap storage for hibernation.

>  * Even with the workaround, swap can end up in the hibernate region.
> Hibernate starts by allocating its giant 50%-of-memory region, which
> is often the forcing function for pushing things into swap. With the
> workaround, even if my hibernate region is in last priority, there's
> still a reasonable chance I'll end up swapping into it.

Right there is no guarantee but why does that matter at all. From the
kernel point of view it doesn't really makes much difference what was
the source of the swapout.

> If I have
> different security designs for swap space and hibernate, then even a
> chance of some swap leaking into this region is a problem.

Could you expand some more about the this part please?

>  * I also want to limit the online attack surface that swap presents.
> I can make headway here by disallowing open() calls on active swap
> regions (via an LSM), and permanently disabling swapon/swapoff system
> calls after early init. The workaround isn't great for me because I
> want to set everything up at early init time and then not touch it. By
> suspend time, on my system I no longer have the ability to make
> swapon/swapoff calls.

This is clearly a policy call.

All that being said, I am still missing any justification for the
dedicated swap storage. This is an ABI thing so the reasoning should be
really solid.
Michal Hocko July 14, 2021, 5:43 a.m. UTC | #9
On Mon 12-07-21 09:41:26, David Hildenbrand wrote:
> On 12.07.21 09:03, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Fri 09-07-21 10:50:48, Evan Green wrote:
> > > Currently it's not possible to enable hibernation without also enabling
> > > generic swap for a given swap area. These two use cases are not the
> > > same. For example there may be users who want to enable hibernation,
> > > but whose drives don't have the write endurance for generic swap
> > > activities.
> > > 
> > > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > > generic swapping to it. This region can still be wired up for use in
> > > suspend-to-disk activities, but will never have regular pages swapped to
> > > it.
> > 
> > Could you expand some more on why a strict exclusion is really
> > necessary? I do understand that one might not want to have swap storage
> > available all the time but considering that swapon is really a light
> > operation so something like the following should be a reasonable
> > workaround, no?
> > 	swapon storage/file
> > 	s2disk
> > 	swapoff storage
> 
> I'm certainly not a hibernation expert, but I'd guess this can also be
> triggered by HW events, so from the kernel and not only from user space
> where your workaround would apply.

Is there anything preventing such a HW event doing the equivalent of the
above?
David Hildenbrand July 14, 2021, 7:51 a.m. UTC | #10
On 14.07.21 07:43, Michal Hocko wrote:
> On Mon 12-07-21 09:41:26, David Hildenbrand wrote:
>> On 12.07.21 09:03, Michal Hocko wrote:
>>> [Cc linux-api]
>>>
>>> On Fri 09-07-21 10:50:48, Evan Green wrote:
>>>> Currently it's not possible to enable hibernation without also enabling
>>>> generic swap for a given swap area. These two use cases are not the
>>>> same. For example there may be users who want to enable hibernation,
>>>> but whose drives don't have the write endurance for generic swap
>>>> activities.
>>>>
>>>> Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
>>>> generic swapping to it. This region can still be wired up for use in
>>>> suspend-to-disk activities, but will never have regular pages swapped to
>>>> it.
>>>
>>> Could you expand some more on why a strict exclusion is really
>>> necessary? I do understand that one might not want to have swap storage
>>> available all the time but considering that swapon is really a light
>>> operation so something like the following should be a reasonable
>>> workaround, no?
>>> 	swapon storage/file
>>> 	s2disk
>>> 	swapoff storage
>>
>> I'm certainly not a hibernation expert, but I'd guess this can also be
>> triggered by HW events, so from the kernel and not only from user space
>> where your workaround would apply.
> 
> Is there anything preventing such a HW event doing the equivalent of the
> above?
> 

Let's take a look at hibernate() callers:

drivers/mfd/tps65010.c: calls hibernate() from IRQ contex, based on HW
                         event
kernel/power/autosleep.c: calls hibernate() when it thinks it might be
  			  a good time to go to sleep
kernel/power/main.c: calls hibernate() triggered by userspace
kernel/reboot.c: calls hibernate() triggered by userspace

So on two paths, hibernate() is not under user space control and the 
sequence you propose does not apply. The kernel itself has no idea which 
swap space to activate before hibernating, that's a user space decision. 
And without this patch, user space cannot communicate that decision to 
the kernel without eventually also swapping.

However, I assume in most cases (e.g., ACPI events, power button 
pressed, ...) we always notify user space, which in return decides which 
action to take. Doing it under kernel control is more of a corner case I 
guess, so I do wonder if we really care about these setups.

Anyhow, the proposal here does not sound completely crazy to me, 
although it's unfortunate how we decided to mangle hibernation and 
swapping into the same mechanism originally; a different interface to 
active "hibernation only backends" would be cleaner than doing a "swapon 
..." without swapping. However, that will require much more work and I 
am not sure if it's worth it ...
Michal Hocko July 14, 2021, 8 a.m. UTC | #11
On Wed 14-07-21 09:51:13, David Hildenbrand wrote:
[...]
> Anyhow, the proposal here does not sound completely crazy to me, although
> it's unfortunate how we decided to mangle hibernation and swapping into the
> same mechanism originally; a different interface to active "hibernation only
> backends" would be cleaner than doing a "swapon ..." without swapping.

Completely agreed! And I suspect that a special swap flag just digs that
hole even deeper. While the flag might look simple enough now I am a bit
worried this will open traps in the future.

I am not saying the idea is crazy either, it is just a hack on top of
the existing hack and as such it requires a very good reasoning. So far
I have heard rather vague justification and I am especially curious
about the "no mixing with the regular swapout" concern. It might be very
well the case that there are more usecases which would benefit from it.
Evan Green July 14, 2021, 10:39 p.m. UTC | #12
On Tue, Jul 13, 2021 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 12-07-21 14:32:05, Evan Green wrote:
> > On Mon, Jul 12, 2021 at 12:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [Cc linux-api]
> > >
> > > On Fri 09-07-21 10:50:48, Evan Green wrote:
> > > > Currently it's not possible to enable hibernation without also enabling
> > > > generic swap for a given swap area. These two use cases are not the
> > > > same. For example there may be users who want to enable hibernation,
> > > > but whose drives don't have the write endurance for generic swap
> > > > activities.
> > > >
> > > > Add a new SWAP_FLAG_NOSWAP that adds a swap region but refuses to allow
> > > > generic swapping to it. This region can still be wired up for use in
> > > > suspend-to-disk activities, but will never have regular pages swapped to
> > > > it.
> > >
> > > Could you expand some more on why a strict exclusion is really
> > > necessary? I do understand that one might not want to have swap storage
> > > available all the time but considering that swapon is really a light
> > > operation so something like the following should be a reasonable
> > > workaround, no?
> > >         swapon storage/file
> > >         s2disk
> > >         swapoff storage
> >
> > Broadly, it seemed like a reasonable thing for the kernel to be able
> > to do. The workaround you suggest does work for some use cases, but it
> > seems like a gap the kernel could more naturally fill.
> >
> > Without getting too off into the weeds, there a handful of factors
> > that make this change particularly useful to me:
> >
> >  * Slicing off part of your SSD to be SLC (single level cell) is
> > expensive. From what I understand you gain endurance and speed at the
> > cost of 3-4x capacity. In other words for every 1GB of SLC space you
> > need for swap, it costs you 3-4GB of storage space out of the primary
> > namespace. So I'm incentivized to size this region as small as
> > possible. Hibernate's speed/endurance requirements are not quite as
> > harsh as regular swap. Steering them separately gives me the ability
> > to put the hibernate image in regular storage, and not be forced to
> > oversize expensive/fast swap space.
>
> OK, this is likely true but it doesn't really explain/justify a
> dedicated swap storage for hibernation.

Wait, yes it does. Hibernation has less stringent write endurance and
speed requirements than swap, so it makes sense to point it at storage
that doesn't pay the 3x capacity penalty, and save the fancy fast
stuff for swap. The exclusivity makes sense since you're trying not to
wear out your higher capacity storage with unnecessary writes. I'd
argue the API addition is worth it for this reason by itself. Usermode
has valid reasons for wanting to disentangle these.

>
> >  * Even with the workaround, swap can end up in the hibernate region.
> > Hibernate starts by allocating its giant 50%-of-memory region, which
> > is often the forcing function for pushing things into swap. With the
> > workaround, even if my hibernate region is in last priority, there's
> > still a reasonable chance I'll end up swapping into it.
>
> Right there is no guarantee but why does that matter at all. From the
> kernel point of view it doesn't really makes much difference what was
> the source of the swapout.
>
> > If I have
> > different security designs for swap space and hibernate, then even a
> > chance of some swap leaking into this region is a problem.
>
> Could you expand some more about the this part please?

Offline attacks (ie manipulating storage from underneath the machine)
are a major concern when enabling both swap and hibernate. But the
approach of adding integrity to mitigate offline attacks may differ
between swap and hibernate in the interest of performance. Swap for
instance essentially needs a per-page dictionary of hashes for
integrity, since pages can be added and removed arbitrarily. Hibernate
however just needs a single hash across the entire image to provide
integrity. If you have swap leaking onto a region where you don't have
integrity enabled (because say you handled integrity at the image
level for hibernate, and at the block layer for swap), your swap
integrity story is compromised.

There's a (likely defunct) series from Matthew Garrett that expounds a
bit on some of this, though it's also partially tangential:
https://lore.kernel.org/lkml/20210220013255.1083202-1-matthewgarrett@google.com/

>
> >  * I also want to limit the online attack surface that swap presents.
> > I can make headway here by disallowing open() calls on active swap
> > regions (via an LSM), and permanently disabling swapon/swapoff system
> > calls after early init. The workaround isn't great for me because I
> > want to set everything up at early init time and then not touch it. By
> > suspend time, on my system I no longer have the ability to make
> > swapon/swapoff calls.
>
> This is clearly a policy call.

The goal was to show examples of why the workaround was insufficient.
Yes, the response to any particular example could be "just don't
choose to do that", but I'm hoping to show examples from several
different angles of how the flag is a valuable knob for usermode to
have.

>
> All that being said, I am still missing any justification for the
> dedicated swap storage. This is an ABI thing so the reasoning should be
> really solid.

I'm hoping it is. I sympathize with the awkwardness of "swapon, but
don't swap!". But from what I can there is no other route that
wouldn't be hugely disruptive and risk breaking compatibility for
folks who want to continue to combine their hibernate and swap
regions.

I don't think this digs the design hole deeper. Yes, the ship on this
design has long ago sailed. But if we ever did try to dig ourselves
out of the swap/hibernate hole by providing new APIs to handle them
separately, this flag would serve as a good cutover to divert out of
the swap code and into the new shiny hibernate-only code. The APIs are
never going to be totally disentangled, so a clean cutover opportunity
is the best one can hope for.

-Evan
Pavel Machek July 27, 2021, 12:01 p.m. UTC | #13
Hi!

> > Could you expand some more on why a strict exclusion is really
> > necessary? I do understand that one might not want to have swap storage
> > available all the time but considering that swapon is really a light
> > operation so something like the following should be a reasonable
> > workaround, no?
> > 	swapon storage/file
> > 	s2disk
> > 	swapoff storage
> 
> I'm certainly not a hibernation expert, but I'd guess this can also be
> triggered by HW events, so from the kernel and not only from user space
> where your workaround would apply.

I should know about hibernation, and I'm not aware of any case where
we do hibernation from kernel. Userspace should be always involved.

Best regards,
								Pavel
Pavel Machek July 27, 2021, 12:04 p.m. UTC | #14
Hi!

> Let's take a look at hibernate() callers:
> 
> drivers/mfd/tps65010.c: calls hibernate() from IRQ contex, based on HW
>                         event

No it does not. Look again.

> kernel/power/autosleep.c: calls hibernate() when it thinks it might be
>  			  a good time to go to sleep

Ok, you are right, it is there. But I don't believe anyone uses this
configuration.

Best regards,
								Pavel
Pavel Machek July 27, 2021, 12:10 p.m. UTC | #15
Hi!

> > > If I have
> > > different security designs for swap space and hibernate, then even a
> > > chance of some swap leaking into this region is a problem.
> >
> > Could you expand some more about the this part please?
> 
> Offline attacks (ie manipulating storage from underneath the machine)
> are a major concern when enabling both swap and hibernate. But the
> approach of adding integrity to mitigate offline attacks may differ
> between swap and hibernate in the interest of performance. Swap for
> instance essentially needs a per-page dictionary of hashes for
> integrity, since pages can be added and removed arbitrarily. Hibernate
> however just needs a single hash across the entire image to provide
> integrity. If you have swap leaking onto a region where you don't have
> integrity enabled (because say you handled integrity at the image
> level for hibernate, and at the block layer for swap), your swap
> integrity story is compromised.

If you want to encrypt/sign the hibernation, you likely should use
uswsusp, and that means you can store hibernation image where (and
how) you want it, without modifying kernel.

See kernel/power/user.c .

> I don't think this digs the design hole deeper. Yes, the ship on this
> design has long ago sailed. But if we ever did try to dig ourselves
> out of the swap/hibernate hole by providing new APIs to handle them
> separately, this flag would serve as a good cutover to divert out of
> the swap code and into the new shiny hibernate-only code. The APIs are
> never going to be totally disentangled, so a clean cutover opportunity
> is the best one can hope for.

Is uswsusp the place that should provide clean cutover?

Anyway, I acked the patch before, but it looks like it was
mistake. I withdraw the ack.

Best regards,
								Pavel
David Hildenbrand July 27, 2021, 12:19 p.m. UTC | #16
On 27.07.21 14:04, Pavel Machek wrote:
> Hi!
> 
>> Let's take a look at hibernate() callers:
>>
>> drivers/mfd/tps65010.c: calls hibernate() from IRQ contex, based on HW
>>                          event
> 
> No it does not. Look again.
> 

Oh, dead code :)

>> kernel/power/autosleep.c: calls hibernate() when it thinks it might be
>>   			  a good time to go to sleep
> 
> Ok, you are right, it is there. But I don't believe anyone uses this
> configuration.

If it's dead code, we might want to look into deprecating and removing 
it. It was introduced around 2012:

https://lwn.net/Articles/479841/
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6f5a43251593c8..995466c9f16a76 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -28,10 +28,11 @@  struct pagevec;
 #define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
 #define SWAP_FLAG_DISCARD_ONCE	0x20000 /* discard swap area at swapon-time */
 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
+#define SWAP_FLAG_NOSWAP	0x80000 /* use only for hibernate, not swap */
 
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
-				 SWAP_FLAG_DISCARD_PAGES)
+				 SWAP_FLAG_DISCARD_PAGES | SWAP_FLAG_NOSWAP)
 #define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
@@ -182,6 +183,7 @@  enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_NOSWAP	= (1 << 13),	/* use only for suspend, not swap */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e07d1c776f2ae..d9ab39d32e55ec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -697,7 +697,8 @@  static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 	if (si->inuse_pages == si->pages) {
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
-		del_from_avail_list(si);
+		if (!(si->flags & SWP_NOSWAP))
+			del_from_avail_list(si);
 	}
 }
 
@@ -726,10 +727,12 @@  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		bool was_full = !si->highest_bit;
 
 		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
+		if (was_full &&
+		    ((si->flags & (SWP_WRITEOK | SWP_NOSWAP)) == SWP_WRITEOK))
 			add_to_avail_list(si);
 	}
-	atomic_long_add(nr_entries, &nr_swap_pages);
+	if (!(si->flags & SWP_NOSWAP))
+		atomic_long_add(nr_entries, &nr_swap_pages);
 	si->inuse_pages -= nr_entries;
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
@@ -1078,6 +1081,9 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
+			WARN((si->flags & SWP_NOSWAP),
+			     "swap_info %d in list but SWP_NOSWAP\n",
+			     si->type);
 			__del_from_avail_list(si);
 			spin_unlock(&si->lock);
 			goto nextsi;
@@ -1338,8 +1344,12 @@  void swap_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
-	if (p)
-		__swap_entry_free(p, entry);
+	if (p) {
+		if (p->flags & SWP_NOSWAP)
+			swap_entry_free(p, entry);
+		else
+			__swap_entry_free(p, entry);
+	}
 }
 
 /*
@@ -1783,8 +1793,10 @@  swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
-		atomic_long_dec(&nr_swap_pages);
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) {
+		if (!(si->flags & SWP_NOSWAP))
+			atomic_long_dec(&nr_swap_pages);
+	}
 	spin_unlock(&si->lock);
 fail:
 	return entry;
@@ -2454,8 +2466,6 @@  static void setup_swap_info(struct swap_info_struct *p, int prio,
 static void _enable_swap_info(struct swap_info_struct *p)
 {
 	p->flags |= SWP_WRITEOK;
-	atomic_long_add(p->pages, &nr_swap_pages);
-	total_swap_pages += p->pages;
 
 	assert_spin_locked(&swap_lock);
 	/*
@@ -2469,7 +2479,11 @@  static void _enable_swap_info(struct swap_info_struct *p)
 	 * swap_info_struct.
 	 */
 	plist_add(&p->list, &swap_active_head);
-	add_to_avail_list(p);
+	if (!(p->flags & SWP_NOSWAP)) {
+		atomic_long_add(p->pages, &nr_swap_pages);
+		total_swap_pages += p->pages;
+		add_to_avail_list(p);
+	}
 }
 
 static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -2564,7 +2578,9 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
+	if (!(p->flags & SWP_NOSWAP))
+		del_from_avail_list(p);
+
 	spin_lock(&p->lock);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
@@ -2581,8 +2597,10 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		least_priority++;
 	}
 	plist_del(&p->list, &swap_active_head);
-	atomic_long_sub(p->pages, &nr_swap_pages);
-	total_swap_pages -= p->pages;
+	if (!(p->flags & SWP_NOSWAP)) {
+		atomic_long_sub(p->pages, &nr_swap_pages);
+		total_swap_pages -= p->pages;
+	}
 	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
@@ -3329,16 +3347,20 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio =
 		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
+
+	if (swap_flags & SWAP_FLAG_NOSWAP)
+		p->flags |= SWP_NOSWAP;
 	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
 
-	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s\n",
+	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n",
 		p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
 		nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
 		(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
 		(p->flags & SWP_DISCARDABLE) ? "D" : "",
 		(p->flags & SWP_AREA_DISCARD) ? "s" : "",
 		(p->flags & SWP_PAGE_DISCARD) ? "c" : "",
-		(frontswap_map) ? "FS" : "");
+		(frontswap_map) ? "FS" : "",
+		(p->flags & SWP_NOSWAP) ? "N" : "");
 
 	mutex_unlock(&swapon_mutex);
 	atomic_inc(&proc_poll_event);