[RFC,V2,17/17] x86/entry: Preserve PKRS MSR across exceptions
diff mbox series

Message ID 20200717072056.73134-18-ira.weiny@intel.com
State New
Headers show
Series
  • PKS: Add Protection Keys Supervisor (PKS) support
Related show

Commit Message

Ira Weiny July 17, 2020, 7:20 a.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  It is already preserved through a
context switch but this support leaves exception handling code open to
memory accesses which the interrupted process has allowed.

Close this hole by preserve the current task's PKRS MSR, reset the PKRS
MSR value on exception entry, and then restore the state on exception
exit.

Notice that, in addition to the MSR value itself, the saved task state
must also include the device memory protection reference count which is
maintained to keep kmap re-entrant.  The following shows how this works:

...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()   // ref += 1 ==> disable protection
				dev_access_disable()  // ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()  // all good here
	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
...

Nested exceptions should also operate the same way with each exception
storing the previous reference count all the way down.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
RFC NOTES:

First I'm not sure if adding this state to idtentry_state and having
that state copied is the right way to go.  It seems like we should start
passing this by reference instead of value.  But for now this works as
an RFC.  Comments?

Second, I'm not 100% happy with having to save the reference count in
the exception handler.  It seems like a very ugly layering violation but
I don't see a way around it at the moment.

Third, this patch has gone through a couple of revisions as I've had
crashes which just don't make sense to me.  One particular issue I've
had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
The code path was a pmem copy and the ref count should have been
elevated due to dev_access_enable() but why was
idtentry_enter()->idt_save_pkrs() not called I don't know.

Finally, it looks like the entry/exit code is being refactored into
common code.  So likely this is best handled somewhat there.  Because
this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
in a generic fashion.  But that is a ways off I think.

This patch depends on:
	https://lore.kernel.org/lkml/159411021855.4006.13113751062324360868.tip-bot2@tip-bot2/
	Which has yet to land upstream.  I just pulled it into my test
	branch which is based on 5.8-rc5 to get the exception state
	tracking.  Hopefully it is self contained enough I'm not causing
	other issues with my tests.
---
 arch/x86/entry/common.c               | 78 ++++++++++++++++++++++++++-
 arch/x86/include/asm/idtentry.h       |  2 +
 arch/x86/include/asm/pkeys_internal.h |  3 +-
 arch/x86/kernel/process.c             |  1 -
 arch/x86/mm/pkeys.c                   |  2 +-
 5 files changed, 82 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra July 17, 2020, 9:30 a.m. UTC | #1
On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> +static void noinstr idt_save_pkrs(idtentry_state_t state)

noinstr goes in the same place you would normally put inline, that's
before the return type, not after it.
Peter Zijlstra July 17, 2020, 9:34 a.m. UTC | #2
On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> +/* Define as macros to prevent conflict of inline and noinstr */
> +#define idt_save_pkrs(state)
> +#define idt_restore_pkrs(state)

Use __always_inline
Peter Zijlstra July 17, 2020, 10:06 a.m. UTC | #3
On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> First I'm not sure if adding this state to idtentry_state and having
> that state copied is the right way to go.  It seems like we should start
> passing this by reference instead of value.  But for now this works as
> an RFC.  Comments?

As long as you keep sizeof(struct idtentry_state_t) <= sizeof(u64) or
possibly 2*sizeof(unsigned long), code gen shouldn't be too horrid IIRC.
You'll have to look at what the compiler makes of it.

> Second, I'm not 100% happy with having to save the reference count in
> the exception handler.  It seems like a very ugly layering violation but
> I don't see a way around it at the moment.

So I've been struggling with that API, all the way from
pks_update_protection() to that dev_access_{en,dis}able(). I _really_
hate it, but I see how you ended up with it.

I wanted to propose something like:

u32 current_pkey_save(int pkey, unsigned flags)
{
	u32 *lpkr = get_cpu_ptr(&local_pkr);
	u32 pkr, saved = *lpkr;

	pkr = update_pkey_reg(saved, pkey, flags);
	if (pkr != saved)
		wrpkr(pkr);

	put_cpu_ptr(&local_pkr);
	return saved;
}

void current_pkey_restore(u32 pkr)
{
	u32 *lpkr = get_cpu_ptr(&local_pkr);
	if (*lpkr != pkr)
		wrpkr(pkr);
	put_cpu_ptr(&local_pkr);
}

Together with:

void pkey_switch(struct task_struct *prev, struct task_struct *next)
{
	prev->pkr = this_cpu_read(local_pkr);
	if (prev->pkr != next->pkr)
		wrpkr(next->pkr);
}

But that's actually hard to frob into the kmap() model :-( The upside is
that you only have 1 word of state, instead of the 2 you have now.

> Third, this patch has gone through a couple of revisions as I've had
> crashes which just don't make sense to me.  One particular issue I've
> had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> The code path was a pmem copy and the ref count should have been
> elevated due to dev_access_enable() but why was
> idtentry_enter()->idt_save_pkrs() not called I don't know.

Because MCEs are NMI-like and don't go through the normal interrupt
path. MCEs are an abomination, please wear all the protective devices
you can lay hands on when delving into that.
Ira Weiny July 21, 2020, 6:01 p.m. UTC | #4
On Fri, Jul 17, 2020 at 11:30:41AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> 
> noinstr goes in the same place you would normally put inline, that's
> before the return type, not after it.
>

Sorry about that.  But that does not look to be consistent.

10:57:35 > git grep 'noinstr' arch/x86/entry/common.c
...
arch/x86/entry/common.c:idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
arch/x86/entry/common.c:void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
arch/x86/entry/common.c:void noinstr idtentry_enter_user(struct pt_regs *regs)
arch/x86/entry/common.c:void noinstr idtentry_exit_user(struct pt_regs *regs)
...

Are the above 'wrong'?  Is it worth me sending a patch?

I've changed my code,
Ira
Peter Zijlstra July 21, 2020, 7:11 p.m. UTC | #5
On Tue, Jul 21, 2020 at 11:01:34AM -0700, Ira Weiny wrote:
> On Fri, Jul 17, 2020 at 11:30:41AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> > 
> > noinstr goes in the same place you would normally put inline, that's
> > before the return type, not after it.
> >
> 
> Sorry about that.  But that does not look to be consistent.
> 
> 10:57:35 > git grep 'noinstr' arch/x86/entry/common.c
> ...
> arch/x86/entry/common.c:idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
> arch/x86/entry/common.c:void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
> arch/x86/entry/common.c:void noinstr idtentry_enter_user(struct pt_regs *regs)
> arch/x86/entry/common.c:void noinstr idtentry_exit_user(struct pt_regs *regs)
> ...
> 
> Are the above 'wrong'?  Is it worth me sending a patch?

I think I already fixed all those, please check tip/master.
Ira Weiny July 22, 2020, 5:27 a.m. UTC | #6
On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> > First I'm not sure if adding this state to idtentry_state and having
> > that state copied is the right way to go.  It seems like we should start
> > passing this by reference instead of value.  But for now this works as
> > an RFC.  Comments?
> 
> As long as you keep sizeof(struct idtentry_state_t) <= sizeof(u64) or
> possibly 2*sizeof(unsigned long), code gen shouldn't be too horrid IIRC.
> You'll have to look at what the compiler makes of it.
> 
> > Second, I'm not 100% happy with having to save the reference count in
> > the exception handler.  It seems like a very ugly layering violation but
> > I don't see a way around it at the moment.
> 
> So I've been struggling with that API, all the way from
> pks_update_protection() to that dev_access_{en,dis}able(). I _really_
> hate it, but I see how you ended up with it.
> 
> I wanted to propose something like:
> 
> u32 current_pkey_save(int pkey, unsigned flags)
> {
> 	u32 *lpkr = get_cpu_ptr(&local_pkr);
> 	u32 pkr, saved = *lpkr;
> 
> 	pkr = update_pkey_reg(saved, pkey, flags);
> 	if (pkr != saved)
> 		wrpkr(pkr);
> 
> 	put_cpu_ptr(&local_pkr);
> 	return saved;
> }
> 
> void current_pkey_restore(u32 pkr)
> {
> 	u32 *lpkr = get_cpu_ptr(&local_pkr);
> 	if (*lpkr != pkr)
> 		wrpkr(pkr);
> 	put_cpu_ptr(&local_pkr);
> }
> 
> Together with:
> 
> void pkey_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	prev->pkr = this_cpu_read(local_pkr);
> 	if (prev->pkr != next->pkr)
> 		wrpkr(next->pkr);
> }
> 
> But that's actually hard to frob into the kmap() model :-( The upside is
> that you only have 1 word of state, instead of the 2 you have now.

I'm still mulling over if what you have really helps us.  It seems like we may
be able to remove the percpu pkrs_cache variable...  But I'm hesitant to do so.

Regardless that is not the big issue right now...

> 
> > Third, this patch has gone through a couple of revisions as I've had
> > crashes which just don't make sense to me.  One particular issue I've
> > had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> > The code path was a pmem copy and the ref count should have been
> > elevated due to dev_access_enable() but why was
> > idtentry_enter()->idt_save_pkrs() not called I don't know.
> 
> Because MCEs are NMI-like and don't go through the normal interrupt
> path. MCEs are an abomination, please wear all the protective devices
> you can lay hands on when delving into that.

I've been really digging into this today and I'm very concerned that I'm
completely missing something WRT idtentry_enter() and idtentry_exit().

I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
with trace_printk()'s.

With this debug code, I have found an instance where it seems like
idtentry_enter() is called without a corresponding idtentry_exit().  This has
left the thread ref counter at 0 which results in very bad things happening
when __dev_access_disable() is called and the ref count goes negative.

Effectively this seems to be happening:

...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		// exception  (which one I don't know)
			idtentry_enter()
				// ref = 0
				_handler() // or whatever code...
			// *_exit() not called [at least there is no trace_printk() output]...
			// Regardless of trace output, the ref is left at 0
	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
	(Bad stuff is bound to happen now...)
...

The ref count ends up completely messed up after this sequence...  and the PKRS
register gets out of sync as well.  This is starting to make some sense of how
I was getting what seemed like random crashes before.

Unfortunately I don't understand the idt entry/exit code well enough to see
clearly what is going on.  Is there any reason idtentry_exit() is not called
after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
path I'm not seeing?  In my searches I see a corresponding exit for every
enter.  But MCE is pretty hard to follow.

Also is there any chance that the process could be getting scheduled and that
is causing an issue?

Thanks,
Ira
Peter Zijlstra July 22, 2020, 9:48 a.m. UTC | #7
On Tue, Jul 21, 2020 at 10:27:09PM -0700, Ira Weiny wrote:

> I've been really digging into this today and I'm very concerned that I'm
> completely missing something WRT idtentry_enter() and idtentry_exit().
> 
> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> with trace_printk()'s.
> 
> With this debug code, I have found an instance where it seems like
> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> left the thread ref counter at 0 which results in very bad things happening
> when __dev_access_disable() is called and the ref count goes negative.
> 
> Effectively this seems to be happening:
> 
> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		// exception  (which one I don't know)
> 			idtentry_enter()
> 				// ref = 0
> 				_handler() // or whatever code...
> 			// *_exit() not called [at least there is no trace_printk() output]...
> 			// Regardless of trace output, the ref is left at 0
> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> 	(Bad stuff is bound to happen now...)
> ...
> 
> The ref count ends up completely messed up after this sequence...  and the PKRS
> register gets out of sync as well.  This is starting to make some sense of how
> I was getting what seemed like random crashes before.
> 
> Unfortunately I don't understand the idt entry/exit code well enough to see
> clearly what is going on.  Is there any reason idtentry_exit() is not called
> after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
> path I'm not seeing?  In my searches I see a corresponding exit for every
> enter.  But MCE is pretty hard to follow.
> 
> Also is there any chance that the process could be getting scheduled and that
> is causing an issue?

Ooh, I think I see the problem. We also use idtentry_enter() for #PF,
and #PF can schedule, exactly as you observed. Argh!

This then means you need to go frob something in
arch/x86/include/asm/idtentry.h, which is somewhat unfortunate.

Thomas, would it make sense to add a type parameter to
idtentry_{enter,exit}() ?
Andy Lutomirski July 22, 2020, 4:21 p.m. UTC | #8
On Fri, Jul 17, 2020 at 12:21 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> The PKRS MSR is not managed by XSAVE.  It is already preserved through a
> context switch but this support leaves exception handling code open to
> memory accesses which the interrupted process has allowed.
>
> Close this hole by preserve the current task's PKRS MSR, reset the PKRS
> MSR value on exception entry, and then restore the state on exception
> exit.

Should this live in pt_regs?

--Andy
Ira Weiny July 22, 2020, 9:24 p.m. UTC | #9
On Wed, Jul 22, 2020 at 11:48:53AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 21, 2020 at 10:27:09PM -0700, Ira Weiny wrote:
> 
> > I've been really digging into this today and I'm very concerned that I'm
> > completely missing something WRT idtentry_enter() and idtentry_exit().
> > 
> > I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> > with trace_printk()'s.
> > 
> > With this debug code, I have found an instance where it seems like
> > idtentry_enter() is called without a corresponding idtentry_exit().  This has
> > left the thread ref counter at 0 which results in very bad things happening
> > when __dev_access_disable() is called and the ref count goes negative.
> > 
> > Effectively this seems to be happening:
> > 
> > ...
> > 	// ref == 0
> > 	dev_access_enable()  // ref += 1 ==> disable protection
> > 		// exception  (which one I don't know)
> > 			idtentry_enter()
> > 				// ref = 0
> > 				_handler() // or whatever code...
> > 			// *_exit() not called [at least there is no trace_printk() output]...
> > 			// Regardless of trace output, the ref is left at 0
> > 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> > 	(Bad stuff is bound to happen now...)
> > ...
> > 
> > The ref count ends up completely messed up after this sequence...  and the PKRS
> > register gets out of sync as well.  This is starting to make some sense of how
> > I was getting what seemed like random crashes before.
> > 
> > Unfortunately I don't understand the idt entry/exit code well enough to see
> > clearly what is going on.  Is there any reason idtentry_exit() is not called
> > after idtentry_enter()?  Perhaps some NMI/MCE or 'not normal' exception code
> > path I'm not seeing?  In my searches I see a corresponding exit for every
> > enter.  But MCE is pretty hard to follow.
> > 
> > Also is there any chance that the process could be getting scheduled and that
> > is causing an issue?
> 
> Ooh, I think I see the problem. We also use idtentry_enter() for #PF,
> and #PF can schedule, exactly as you observed. Argh!

I dug more and I don't see this?  I threw a WARN_ON in the idt_save_pkrs().

Its showing most of these are coming from asm_sysvec_acpi_timer_interrupt().[1]

I don't see that call schedule() between idtentry_enter() and *_exit()...

I do see page faults.  But that is not the first instance where the count gets
messed up.

> 
> This then means you need to go frob something in
> arch/x86/include/asm/idtentry.h, which is somewhat unfortunate.
> 
> Thomas, would it make sense to add a type parameter to
> idtentry_{enter,exit}() ?

Ira

[1] Example trace.

[   30.289514] ------------[ cut here ]------------^M
[   30.290706] WARNING: CPU: 0 PID: 485 at arch/x86/entry/common.c:596 idt_save_pkrs.isra.0.constprop.0+0x45/0xc0^M
[   30.293178] Modules linked in: kvm_intel(+) dax_pmem bochs_drm dax_pmem_core snd_hwdep kvm drm_vram_helper nd_pmem(+) snd_seq nd_btt snd_seq_device irqbypass snd_pcm drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul drm_kms_helper snd_timer nd_e820 ghash_clmulni_intel libnvdimm cec snd drm pcspkr joydev soundcore i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables crc32c_intel e1000e wmi pinctrl_sunrisepoint pinctrl_intel fuse^M
[   30.303437] CPU: 0 PID: 485 Comm: systemd-udevd Not tainted 5.8.0-rc5+ #7^M
[   30.305132] Hardware name: Intel Corporation XXXX/YYYY, BIOS EGSDCRB1.86B.0034.D09.2005061015 05/06/2020^M
[   30.307774] RIP: 0010:idt_save_pkrs.isra.0.constprop.0+0x45/0xc0^M
[   30.309281] Code: 85 c0 74 2d 45 8b 88 58 22 00 00 48 8b 35 bb 65 dd 00 ba 51 02 00 00 48 c7 c7 1f 97 a6 a8 65 8b 0d 20 7c 5a 57 e8 cb ba 77 ff <0f> 0b bb 01 00 00 00 65 48 8b 04 25 c0 7b 01 00 8b a8 58 22 00 00^M
[   30.313879] RSP: 0018:ff5ba73f008e73f0 EFLAGS: 00010046^M
[   30.315196] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000^M
[   30.316975] RDX: ff24c78be0c17400 RSI: 0000000000000080 RDI: ff24c78be0c01300^M
[   30.318754] RBP: ff5ba73f008e7430 R08: ff24c78be0e3dd00 R09: ff24c78be0e71000^M
[   30.320533] R10: ff24c78be0e7110c R11: 0000000000000000 R12: ff5ba73f008e7468^M
[   30.322312] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
[   30.324091] FS:  00007efd21acb940(0000) GS:ff24c78be4000000(0000) knlGS:0000000000000000^M
[   30.326102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[   30.327544] CR2: 0000559609d33940 CR3: 0000000833b70003 CR4: 0000000001761ef0^M
[   30.332593] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
[   30.337638] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400^M
[   30.342632] PKRU: 55555554^M
[   30.346545] Call Trace:^M
[   30.350342]  idtentry_enter+0x1a/0xa0^M
[   30.354389]  sysvec_apic_timer_interrupt+0x12/0xa0^M
[   30.358681]  ? __dev_access_enable+0x82/0xe0^M
[   30.362846]  asm_sysvec_apic_timer_interrupt+0x12/0x20^M
[   30.367225] RIP: 0010:__dev_access_enable+0xad/0xe0^M
[   30.371517] Code: 8b 88 58 22 00 00 48 8b 35 d8 62 51 01 ba 86 00 00 00 48 c7 c7 22 9a 32 a8 65 8b 0d 1d 79 ce 57 e8 c8 b7 eb ff 48 89 df 57 9d <0f> 1f 44 00 00 5b 5d c3 8b bd 1c 23 00 00 8b 35 0f 8a 3d 01 31 d2^M
[   30.382320] RSP: 0018:ff5ba73f008e7510 EFLAGS: 00000246^M
[   30.386760] RAX: 0000000000000005 RBX: 0000000000000246 RCX: 0000000000000000^M
[   30.391651] RDX: ff24c78be0c17400 RSI: 0000000000000080 RDI: 0000000000000246^M
[   30.396525] RBP: ff24c78bb4fe4000 R08: ff24c78be0e3dd00 R09: ff24c78be0e71000^M
[   30.401393] R10: ff24c78be0e710dc R11: ff24c78bdbb81468 R12: ffd1de5920d38040^M
[   30.406246] R13: 0000000000000008 R14: 0000000000001000 R15: 0000000000000000^M
[   30.411101]  pmem_do_read+0x1a4/0x230 [nd_pmem]^M
[   30.415329]  pmem_make_request+0x147/0x2a0 [nd_pmem]^M
[   30.419678]  generic_make_request+0xeb/0x350^M
[   30.423846]  submit_bio+0x4b/0x1a0^M
[   30.427818]  ? bio_add_page+0x62/0x90^M
[   30.431825]  submit_bh_wbc+0x16a/0x190^M
[   30.435806]  block_read_full_page+0x344/0x460^M
[   30.439915]  ? blkdev_direct_IO+0x4a0/0x4a0^M
[   30.443960]  ? memcg_drain_all_list_lrus+0x1d0/0x1d0^M
[   30.448170]  do_read_cache_page+0x61e/0x890^M
[   30.452163]  ? do_read_cache_page+0x2af/0x890^M
[   30.456177]  read_part_sector+0x32/0xf7^M

...
Fenghua Yu July 23, 2020, 4:18 p.m. UTC | #10
On Wed, Jul 22, 2020 at 09:21:43AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2020 at 12:21 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > The PKRS MSR is not managed by XSAVE.  It is already preserved through a
> > context switch but this support leaves exception handling code open to
> > memory accesses which the interrupted process has allowed.
> >
> > Close this hole by preserve the current task's PKRS MSR, reset the PKRS
> > MSR value on exception entry, and then restore the state on exception
> > exit.
> 
> Should this live in pt_regs?

The PKRS MSR has been preserved in thread_info during kernel entry. We
don't need to preserve it in another place (i.e. idtentry_state).

To avoid confusion, I think we need to change the above commit message to:

"Exception handling code is open to memory accesses which the interrupted
process has allowed.

Close this hole by reset the PKRS MSR value on exception entry and restore
the state on exception exit. The MSR was preserved in thread_info."

The patch needs to be changed accordingly, I think:

1. No need to define "pks" in struct idtentry_state because the MSR is
   already preserved in thread_info.
2. idt_save_pkrs() could be renamed as idt_reset_pkrs() to reset
   the MSR (no need to save it). "state.pkrs" can be replaced by
   "current->thread_info.pkrs" now.
3. The "pkrs_ref" could be defined in thread_info as well. But I'm not
   sure if it's better than defined in idtentry_state.

Thanks.

-Fenghua
Dave Hansen July 23, 2020, 4:23 p.m. UTC | #11
On 7/23/20 9:18 AM, Fenghua Yu wrote:
> The PKRS MSR has been preserved in thread_info during kernel entry. We
> don't need to preserve it in another place (i.e. idtentry_state).

I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
explain the mechanism by which this happens and point to the code
implementing it, please?
Fenghua Yu July 23, 2020, 4:52 p.m. UTC | #12
Hi, Dave,

On Thu, Jul 23, 2020 at 09:23:13AM -0700, Dave Hansen wrote:
> On 7/23/20 9:18 AM, Fenghua Yu wrote:
> > The PKRS MSR has been preserved in thread_info during kernel entry. We
> > don't need to preserve it in another place (i.e. idtentry_state).
> 
> I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
> explain the mechanism by which this happens and point to the code
> implementing it, please?

[Sorry, my mistake: I mean "thread_struct" instead of "thread_info".
Hopefully the typo doesn't change the essential part in my last email.]

The "saved_pkrs" is defined in thread_struct and context switched in
patch 04/17:
https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/

Because there is no XSAVE support the PKRS MSR, we preserve it in
"saved_pkrs" in thread_struct. It's initialized as 0 (init state, no
protection key) in fork() or exec(). It's updated to a right protection
value when a driver calls the updating API. The PKRS MSR is context
switched by "saved_pkrs" when switching to a task (unless optimized if the
cached MSR is the same as the saved one).

Thanks.

-Fenghua
Andy Lutomirski July 23, 2020, 5:08 p.m. UTC | #13
> On Jul 23, 2020, at 9:52 AM, Fenghua Yu <fenghua.yu@intel.com> wrote:
> 
> Hi, Dave,
> 
>> On Thu, Jul 23, 2020 at 09:23:13AM -0700, Dave Hansen wrote:
>>> On 7/23/20 9:18 AM, Fenghua Yu wrote:
>>> The PKRS MSR has been preserved in thread_info during kernel entry. We
>>> don't need to preserve it in another place (i.e. idtentry_state).
>> 
>> I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
>> explain the mechanism by which this happens and point to the code
>> implementing it, please?
> 
> [Sorry, my mistake: I mean "thread_struct" instead of "thread_info".
> Hopefully the typo doesn't change the essential part in my last email.]
> 
> The "saved_pkrs" is defined in thread_struct and context switched in
> patch 04/17:
> https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/
> 
> Because there is no XSAVE support the PKRS MSR, we preserve it in
> "saved_pkrs" in thread_struct. It's initialized as 0 (init state, no
> protection key) in fork() or exec(). It's updated to a right protection
> value when a driver calls the updating API. The PKRS MSR is context
> switched by "saved_pkrs" when switching to a task (unless optimized if the
> cached MSR is the same as the saved one).
> 
> 

Suppose some kernel code (a syscall or
kernel thread) changes PKRS then takes a page fault. The page fault handler needs a fresh PKRS. Then the page fault handler (say a VMA’s .fault handler) changes PKRS.  The we get an interrupt. The interrupt *also* needs a fresh PKRS and the page fault value needs to be saved somewhere.

So we have more than one saved value per thread, and thread_struct isn’t going to solve this problem.

But idtentry_state is also not great for a couple reasons.  Not all entries have idtentry_state, and the unwinder can’t find it for debugging. For that matter, the page fault logic probably wants to know the previous PKRS, so it should either be stashed somewhere findable or it should be explicitly passed around.

My suggestion is to enlarge pt_regs.  The save and restore logic can probably be in C, but pt_regs is the logical place to put a register that is saved and restored across all entries.

Whoever does this work will have the delightful job of figuring out whether BPF thinks that the layout of pt_regs is ABI and, if so, fixing the resulting mess.

The fact the new fields will go at the beginning of pt_regs will make this an entertaining prospect.
Dave Hansen July 23, 2020, 5:30 p.m. UTC | #14
On 7/23/20 10:08 AM, Andy Lutomirski wrote:
> Suppose some kernel code (a syscall or kernel thread) changes PKRS
> then takes a page fault. The page fault handler needs a fresh PKRS.
> Then the page fault handler (say a VMA’s .fault handler) changes
> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
> PKRS and the page fault value needs to be saved somewhere.
> 
> So we have more than one saved value per thread, and thread_struct
> isn’t going to solve this problem.

Taking a step back...  This is all true only if we decide that we want
protection keys to provide protection during exceptions and interrupts.
 Right now, the code supports nesting:

	kmap(foo);
		kmap(bar);
		kunmap(bar);
	kunmap(foo);

with a reference count.  So, the nested kmap() will see the count
elevated and do nothing.

I'm generally OK with this getting merged without extending PKS
protection to interrupts and exceptions.  Ira and Fenghua should
certainly give it a go, but I'd like to see it as an add-on feature and
we can judge the benefits versus complexity separately.

Ira was looking at adding it because it _looked_ simple.  Andy has me
really scared about it now. :)
Thomas Gleixner July 23, 2020, 7:53 p.m. UTC | #15
Ira,

ira.weiny@intel.com writes:

> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		irq()
> 			// enable protection
> 			// ref = 0
> 			_handler()
> 				dev_access_enable()   // ref += 1 ==> disable protection
> 				dev_access_disable()  // ref -= 1 ==> enable protection
> 			// WARN_ON(ref != 0)
> 			// disable protection
> 	do_pmem_thing()  // all good here
> 	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection

...

> First I'm not sure if adding this state to idtentry_state and having
> that state copied is the right way to go.

Adding the state to idtentry_state is fine at least for most interrupts
and exceptions. Emphasis on most.

#PF does not work because #PF can schedule.

> It seems like we should start passing this by reference instead of
> value.  But for now this works as an RFC.  Comments?

Works as in compiles, right?

static void noinstr idt_save_pkrs(idtentry_state_t state)
{
        state.foo = 1;
}

How is that supposed to change the caller state? C programming basics.

> Second, I'm not 100% happy with having to save the reference count in
> the exception handler.  It seems like a very ugly layering violation but
> I don't see a way around it at the moment.

That state is strict per task, right? So why do you want to store it
anywhere else that in task/thread storage. That solves your problem of
#PF scheduling nicely.

> Third, this patch has gone through a couple of revisions as I've had
> crashes which just don't make sense to me.  One particular issue I've
> had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> The code path was a pmem copy and the ref count should have been
> elevated due to dev_access_enable() but why was
> idtentry_enter()->idt_save_pkrs() not called I don't know.

Because #MC does not go through idtentry_enter(). Neither do #NMI, #DB, #BP.

> Finally, it looks like the entry/exit code is being refactored into
> common code.  So likely this is best handled somewhat there.  Because
> this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
> in a generic fashion.  But that is a ways off I think.

The invocation of save/restore might be placed in generic code at least
for the common exception and interrupt entries.

> +static void noinstr idt_save_pkrs(idtentry_state_t state)

*state. See above.

> +#else
> +/* Define as macros to prevent conflict of inline and noinstr */
> +#define idt_save_pkrs(state)
> +#define idt_restore_pkrs(state)

Empty inlines do not need noinstr because they are optimized out. If you
want inlines in a noinstr section then use __always_inline

>  /**
>   * idtentry_enter - Handle state tracking on ordinary idtentries
>   * @regs:	Pointer to pt_regs of interrupted context
> @@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
>  		return ret;
>  	}
>  
> +	idt_save_pkrs(ret);

No. This really has no business to be called before the state is
established. It's not something horribly urgent and write_pkrs() is NOT
noinstr and invokes wrmsrl() which is subject to tracing.

> +
> +	idt_restore_pkrs(state);

This one is placed correctly.

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 8:08 p.m. UTC | #16
Ira Weiny <ira.weiny@intel.com> writes:
> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> I've been really digging into this today and I'm very concerned that I'm
> completely missing something WRT idtentry_enter() and idtentry_exit().
>
> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> with trace_printk()'s.
>
> With this debug code, I have found an instance where it seems like
> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> left the thread ref counter at 0 which results in very bad things happening
> when __dev_access_disable() is called and the ref count goes negative.
>
> Effectively this seems to be happening:
>
> ...
> 	// ref == 0
> 	dev_access_enable()  // ref += 1 ==> disable protection
> 		// exception  (which one I don't know)
> 			idtentry_enter()
> 				// ref = 0
> 				_handler() // or whatever code...
> 			// *_exit() not called [at least there is no trace_printk() output]...
> 			// Regardless of trace output, the ref is left at 0
> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> 	(Bad stuff is bound to happen now...)

Well, if any exception which calls idtentry_enter() would return without
going through idtentry_exit() then lots of bad stuff would happen even
without your patches.

> Also is there any chance that the process could be getting scheduled and that
> is causing an issue?

Only from #PF, but after the fault has been resolved and the tasks is
scheduled in again then the task returns through idtentry_exit() to the
place where it took the fault. That's not guaranteed to be on the same
CPU. If schedule is not aware of the fact that the exception turned off
stuff then you surely get into trouble. So you really want to store it
in the task itself then the context switch code can actually see the
state and act accordingly.

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 8:15 p.m. UTC | #17
Thomas Gleixner <tglx@linutronix.de> writes:

> Ira Weiny <ira.weiny@intel.com> writes:
>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
>> I've been really digging into this today and I'm very concerned that I'm
>> completely missing something WRT idtentry_enter() and idtentry_exit().
>>
>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
>> with trace_printk()'s.
>>
>> With this debug code, I have found an instance where it seems like
>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
>> left the thread ref counter at 0 which results in very bad things happening
>> when __dev_access_disable() is called and the ref count goes negative.
>>
>> Effectively this seems to be happening:
>>
>> ...
>> 	// ref == 0
>> 	dev_access_enable()  // ref += 1 ==> disable protection
>> 		// exception  (which one I don't know)
>> 			idtentry_enter()
>> 				// ref = 0
>> 				_handler() // or whatever code...
>> 			// *_exit() not called [at least there is no trace_printk() output]...
>> 			// Regardless of trace output, the ref is left at 0
>> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
>> 	(Bad stuff is bound to happen now...)
>
> Well, if any exception which calls idtentry_enter() would return without
> going through idtentry_exit() then lots of bad stuff would happen even
> without your patches.
>
>> Also is there any chance that the process could be getting scheduled and that
>> is causing an issue?
>
> Only from #PF, but after the fault has been resolved and the tasks is
> scheduled in again then the task returns through idtentry_exit() to the
> place where it took the fault. That's not guaranteed to be on the same
> CPU. If schedule is not aware of the fact that the exception turned off
> stuff then you surely get into trouble. So you really want to store it
> in the task itself then the context switch code can actually see the
> state and act accordingly.

Actually thats nasty as well as you need a stack of PKRS values to
handle nested exceptions. But it might be still the most reasonable
thing to do. 7 PKRS values plus an index should be really sufficient,
that's 32bytes total, not that bad.

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 8:22 p.m. UTC | #18
Andy Lutomirski <luto@amacapital.net> writes:

> Suppose some kernel code (a syscall or kernel thread) changes PKRS
> then takes a page fault. The page fault handler needs a fresh
> PKRS. Then the page fault handler (say a VMA’s .fault handler) changes
> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
> PKRS and the page fault value needs to be saved somewhere.
>
> So we have more than one saved value per thread, and thread_struct
> isn’t going to solve this problem.

A stack of 7 entries and an index needs 32bytes total which is a
reasonable amount and solves the problem including scheduling from #PF
nicely. Make it 15 and it's still only 64 bytes.

> But idtentry_state is also not great for a couple reasons.  Not all
> entries have idtentry_state, and the unwinder can’t find it for
> debugging. For that matter, the page fault logic probably wants to
> know the previous PKRS, so it should either be stashed somewhere
> findable or it should be explicitly passed around.
>
> My suggestion is to enlarge pt_regs.  The save and restore logic can
> probably be in C, but pt_regs is the logical place to put a register
> that is saved and restored across all entries.

Kinda, but that still sucks because schedule from #PF will get it wrong
unless you do extra nasties.

> Whoever does this work will have the delightful job of figuring out
> whether BPF thinks that the layout of pt_regs is ABI and, if so,
> fixing the resulting mess.
>
> The fact the new fields will go at the beginning of pt_regs will make
> this an entertaining prospect.

Good luck with all of that.

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 8:23 p.m. UTC | #19
Dave Hansen <dave.hansen@intel.com> writes:

> On 7/23/20 10:08 AM, Andy Lutomirski wrote:
>> Suppose some kernel code (a syscall or kernel thread) changes PKRS
>> then takes a page fault. The page fault handler needs a fresh PKRS.
>> Then the page fault handler (say a VMA’s .fault handler) changes
>> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
>> PKRS and the page fault value needs to be saved somewhere.
>> 
>> So we have more than one saved value per thread, and thread_struct
>> isn’t going to solve this problem.
>
> Taking a step back...  This is all true only if we decide that we want
> protection keys to provide protection during exceptions and interrupts.
>  Right now, the code supports nesting:
>
> 	kmap(foo);
> 		kmap(bar);
> 		kunmap(bar);
> 	kunmap(foo);
>
> with a reference count.  So, the nested kmap() will see the count
> elevated and do nothing.

Hopefully with a big fat warning if the nested map requires a different
key than the outer one.
Andy Lutomirski July 23, 2020, 9:30 p.m. UTC | #20
> On Jul 23, 2020, at 1:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> Suppose some kernel code (a syscall or kernel thread) changes PKRS
>> then takes a page fault. The page fault handler needs a fresh
>> PKRS. Then the page fault handler (say a VMA’s .fault handler) changes
>> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
>> PKRS and the page fault value needs to be saved somewhere.
>>
>> So we have more than one saved value per thread, and thread_struct
>> isn’t going to solve this problem.
>
> A stack of 7 entries and an index needs 32bytes total which is a
> reasonable amount and solves the problem including scheduling from #PF
> nicely. Make it 15 and it's still only 64 bytes.
>
>> But idtentry_state is also not great for a couple reasons.  Not all
>> entries have idtentry_state, and the unwinder can’t find it for
>> debugging. For that matter, the page fault logic probably wants to
>> know the previous PKRS, so it should either be stashed somewhere
>> findable or it should be explicitly passed around.
>>
>> My suggestion is to enlarge pt_regs.  The save and restore logic can
>> probably be in C, but pt_regs is the logical place to put a register
>> that is saved and restored across all entries.
>
> Kinda, but that still sucks because schedule from #PF will get it wrong
> unless you do extra nasties.

This seems like we’re reinventing the wheel.  PKRS is not
fundamentally different from, say, RSP.  If we want to save it across
exceptions, we save it on entry and context-switch-out and restore it
on exit and context-switch-in.


>
>> Whoever does this work will have the delightful job of figuring out
>> whether BPF thinks that the layout of pt_regs is ABI and, if so,
>> fixing the resulting mess.
>>
>> The fact the new fields will go at the beginning of pt_regs will make
>> this an entertaining prospect.
>
> Good luck with all of that.

We can always cheat like this:

struct real_pt_regs {
  unsigned long pkrs;
  struct pt_regs regs;
};

and pass a pointer to regs around.  What BPF doesn't know about can't hurt it.
Ira Weiny July 23, 2020, 10:04 p.m. UTC | #21
On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> Ira,
> 
> ira.weiny@intel.com writes:
> 
> > ...
> > 	// ref == 0
> > 	dev_access_enable()  // ref += 1 ==> disable protection
> > 		irq()
> > 			// enable protection
> > 			// ref = 0
> > 			_handler()
> > 				dev_access_enable()   // ref += 1 ==> disable protection
> > 				dev_access_disable()  // ref -= 1 ==> enable protection
> > 			// WARN_ON(ref != 0)
> > 			// disable protection
> > 	do_pmem_thing()  // all good here
> > 	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
> 
> ...
> 
> > First I'm not sure if adding this state to idtentry_state and having
> > that state copied is the right way to go.
> 
> Adding the state to idtentry_state is fine at least for most interrupts
> and exceptions. Emphasis on most.
> 
> #PF does not work because #PF can schedule.


Merging with your other response:

<quote: https://lore.kernel.org/lkml/87o8o6vvt0.fsf@nanos.tec.linutronix.de/>
Only from #PF, but after the fault has been resolved and the tasks is
scheduled in again then the task returns through idtentry_exit() to the
place where it took the fault. That's not guaranteed to be on the same
CPU. If schedule is not aware of the fact that the exception turned off
stuff then you surely get into trouble. So you really want to store it
in the task itself then the context switch code can actually see the
state and act accordingly.
</quote>

I think, after fixing my code (see below), using idtentry_state could still
work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
should carry the state to the new cpu, correct?

The exception may have turned off stuff but it is a bug if it did not also turn
it back on.

Ie the irq should not be doing:

	kmap_atomic()
	return;

... without a kunmap_atomic().

That is why I put the WARN_ON_ONCE() to ensure the ref count has been properly
returned to 0 by the exception handler.

FWIW the fixed code is working for my tests...

> 
> > It seems like we should start passing this by reference instead of
> > value.  But for now this works as an RFC.  Comments?
> 
> Works as in compiles, right?
> 
> static void noinstr idt_save_pkrs(idtentry_state_t state)
> {
>         state.foo = 1;
> }
> 
> How is that supposed to change the caller state? C programming basics.

<sigh>  I am so stupid.  I was not looking at this particular case but you are
100% correct...  I can't believe I did not see this.

In the above statement I was only thinking about the extra overhead I was
adding to idtentry_enter() and the callers of it.

"C programming basics" indeed... Once again sorry...

> 
> > Second, I'm not 100% happy with having to save the reference count in
> > the exception handler.  It seems like a very ugly layering violation but
> > I don't see a way around it at the moment.
> 
> That state is strict per task, right? So why do you want to store it
> anywhere else that in task/thread storage. That solves your problem of
> #PF scheduling nicely.

The problem is with the kmap code.  If an exception handler calls kmap_atomic()
[more importantly if they nest kmap_atomic() calls] the kmap code will need to
keep track of that re-entrant call.  With a separate reference counter, the
kmap code would have to know it is in irq context or not.

Here I'm attempting to make the task 'current->dev_page_access_ref' counter
serve double duty so that the kmap code is agnostic to the context it is in.

> 
> > Third, this patch has gone through a couple of revisions as I've had
> > crashes which just don't make sense to me.  One particular issue I've
> > had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> > The code path was a pmem copy and the ref count should have been
> > elevated due to dev_access_enable() but why was
> > idtentry_enter()->idt_save_pkrs() not called I don't know.
> 
> Because #MC does not go through idtentry_enter(). Neither do #NMI, #DB, #BP.

And the above probably would work if I knew how to code in C...  :-/  I'm so
embarrassed.

> 
> > Finally, it looks like the entry/exit code is being refactored into
> > common code.  So likely this is best handled somewhat there.  Because
> > this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
> > in a generic fashion.  But that is a ways off I think.
> 
> The invocation of save/restore might be placed in generic code at least
> for the common exception and interrupt entries.

I think you are correct.  I'm testing now...

> 
> > +static void noinstr idt_save_pkrs(idtentry_state_t state)
> 
> *state. See above.

Yep...  :-/

> 
> > +#else
> > +/* Define as macros to prevent conflict of inline and noinstr */
> > +#define idt_save_pkrs(state)
> > +#define idt_restore_pkrs(state)
> 
> Empty inlines do not need noinstr because they are optimized out. If you
> want inlines in a noinstr section then use __always_inline

Peter made the same comment so I've changed to __always_inline.

> 
> >  /**
> >   * idtentry_enter - Handle state tracking on ordinary idtentries
> >   * @regs:	Pointer to pt_regs of interrupted context
> > @@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
> >  		return ret;
> >  	}
> >  
> > +	idt_save_pkrs(ret);
> 
> No. This really has no business to be called before the state is
> established. It's not something horribly urgent and write_pkrs() is NOT
> noinstr and invokes wrmsrl() which is subject to tracing.

I don't understand.  I'm not calling it within intrumentation_{begin,end}() so
does that mean I can remove the noinstr?  I think I can actually.

Or do you mean call it at the end of idtentry_enter()?  Like this:

@@ -672,8 +672,6 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
                return ret;
        }
 
-       idt_save_pkrs(ret);
-
        /*
         * If this entry hit the idle task invoke rcu_irq_enter() whether
         * RCU is watching or not.
@@ -710,7 +708,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
                instrumentation_end();
 
                ret.exit_rcu = true;
-               return ret;
+               goto done;
        }
 
        /*
@@ -725,6 +723,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
        trace_hardirqs_off();
        instrumentation_end();
 
+done:
+       idt_save_pkrs(&ret);
        return ret;
 }


> 
> > +
> > +	idt_restore_pkrs(state);
> 
> This one is placed correctly.
> 
> Thanks,

No thank you...  I'm really sorry for wasting every ones time on this one.

I'm still processing all your responses so forgive me if I've missed something.
And thank you so much for pointing out my mistake.  That seems to have fixed
all the problems I have seen thus far.  But I want to think on if there may be
more issues.

Thank you,
Ira
Thomas Gleixner July 23, 2020, 10:14 p.m. UTC | #22
Andy Lutomirski <luto@kernel.org> writes:
>> On Jul 23, 2020, at 1:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>> My suggestion is to enlarge pt_regs.  The save and restore logic can
>>> probably be in C, but pt_regs is the logical place to put a register
>>> that is saved and restored across all entries.
>>
>> Kinda, but that still sucks because schedule from #PF will get it wrong
>> unless you do extra nasties.
>
> This seems like we’re reinventing the wheel.  PKRS is not
> fundamentally different from, say, RSP.  If we want to save it across
> exceptions, we save it on entry and context-switch-out and restore it
> on exit and context-switch-in.

It's fundamentally different from RSP because it has state (refcount)
attached, which RSP clearly has not. If you get rid of the state then
yes.

>>> Whoever does this work will have the delightful job of figuring out
>>> whether BPF thinks that the layout of pt_regs is ABI and, if so,
>>> fixing the resulting mess.
>>>
>>> The fact the new fields will go at the beginning of pt_regs will make
>>> this an entertaining prospect.
>>
>> Good luck with all of that.
>
> We can always cheat like this:
>
> struct real_pt_regs {
>   unsigned long pkrs;
>   struct pt_regs regs;
> };
>
> and pass a pointer to regs around.  What BPF doesn't know about can't hurt it.

Yes, but that's the easy part of the problem :)

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 11:41 p.m. UTC | #23
Ira,

Ira Weiny <ira.weiny@intel.com> writes:
> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> I think, after fixing my code (see below), using idtentry_state could still
> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> should carry the state to the new cpu, correct?

I'm way too tired to think about that now. Will have a look tomorrow
with brain awake.

>> > It seems like we should start passing this by reference instead of
>> > value.  But for now this works as an RFC.  Comments?
>> 
>> Works as in compiles, right?
>> 
>> static void noinstr idt_save_pkrs(idtentry_state_t state)
>> {
>>         state.foo = 1;
>> }
>> 
>> How is that supposed to change the caller state? C programming basics.
>
> <sigh>  I am so stupid.  I was not looking at this particular case but you are
> 100% correct...  I can't believe I did not see this.
>
> In the above statement I was only thinking about the extra overhead I was
> adding to idtentry_enter() and the callers of it.

Fun. That statement immediately caught my attention and made me look at
that function.

> "C programming basics" indeed... Once again sorry...

Don't worry.

One interesting design bug of the human brain is that it tricks you into
seeing what you expect to see no matter how hard you try not to fall for
that. You can spend days staring at the obvious without seeing it. The
saying 'you can't see the forest for the trees' exists for a reason.

Yes, I know it's embarrassing, but that happens and it happens to all of
us no matter how experienced we are. Just search the LKML archives for
'brown paperbag'. You'll find amazing things.

If you show your problem to people who are not involved in that at all
there is a high propability that it immediately snaps for one of
them. But there is no guarantee, just look at this mail thread and the
number of people who did not notice.

Move on and accept the fact that it will happen again :)

Thanks,

        tglx
Ira Weiny July 24, 2020, 5:23 p.m. UTC | #24
On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> >>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> >> I've been really digging into this today and I'm very concerned that I'm
> >> completely missing something WRT idtentry_enter() and idtentry_exit().
> >>
> >> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> >> with trace_printk()'s.
> >>
> >> With this debug code, I have found an instance where it seems like
> >> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> >> left the thread ref counter at 0 which results in very bad things happening
> >> when __dev_access_disable() is called and the ref count goes negative.
> >>
> >> Effectively this seems to be happening:
> >>
> >> ...
> >> 	// ref == 0
> >> 	dev_access_enable()  // ref += 1 ==> disable protection
> >> 		// exception  (which one I don't know)
> >> 			idtentry_enter()
> >> 				// ref = 0
> >> 				_handler() // or whatever code...
> >> 			// *_exit() not called [at least there is no trace_printk() output]...
> >> 			// Regardless of trace output, the ref is left at 0
> >> 	dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> >> 	(Bad stuff is bound to happen now...)
> >
> > Well, if any exception which calls idtentry_enter() would return without
> > going through idtentry_exit() then lots of bad stuff would happen even
> > without your patches.
> >
> >> Also is there any chance that the process could be getting scheduled and that
> >> is causing an issue?
> >
> > Only from #PF, but after the fault has been resolved and the tasks is
> > scheduled in again then the task returns through idtentry_exit() to the
> > place where it took the fault. That's not guaranteed to be on the same
> > CPU. If schedule is not aware of the fact that the exception turned off
> > stuff then you surely get into trouble. So you really want to store it
> > in the task itself then the context switch code can actually see the
> > state and act accordingly.
> 
> Actually thats nasty as well as you need a stack of PKRS values to
> handle nested exceptions. But it might be still the most reasonable
> thing to do. 7 PKRS values plus an index should be really sufficient,
> that's 32bytes total, not that bad.

I've thought about this a bit more and unless I'm wrong I think the
idtentry_state provides for that because each nested exception has it's own
idtentry_state doesn't it?

Ira
Andy Lutomirski July 24, 2020, 5:29 p.m. UTC | #25
> On Jul 24, 2020, at 10:23 AM, Ira Weiny <ira.weiny@intel.com> wrote:
> 
> On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>> 
>>> Ira Weiny <ira.weiny@intel.com> writes:
>>>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
>>>>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
>>>>> I've been really digging into this today and I'm very concerned that I'm
>>>>> completely missing something WRT idtentry_enter() and idtentry_exit().
>>>>> 
>>>>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
>>>>> with trace_printk()'s.
>>>>> 
>>>>> With this debug code, I have found an instance where it seems like
>>>>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
>>>>> left the thread ref counter at 0 which results in very bad things happening
>>>>> when __dev_access_disable() is called and the ref count goes negative.
>>>>> 
>>>>> Effectively this seems to be happening:
>>>>> 
>>>>> ...
>>>>>    // ref == 0
>>>>>    dev_access_enable()  // ref += 1 ==> disable protection
>>>>>        // exception  (which one I don't know)
>>>>>            idtentry_enter()
>>>>>                // ref = 0
>>>>>                _handler() // or whatever code...
>>>>>            // *_exit() not called [at least there is no trace_printk() output]...
>>>>>            // Regardless of trace output, the ref is left at 0
>>>>>    dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
>>>>>    (Bad stuff is bound to happen now...)
>>> 
>>> Well, if any exception which calls idtentry_enter() would return without
>>> going through idtentry_exit() then lots of bad stuff would happen even
>>> without your patches.
>>> 
>>>> Also is there any chance that the process could be getting scheduled and that
>>>> is causing an issue?
>>> 
>>> Only from #PF, but after the fault has been resolved and the tasks is
>>> scheduled in again then the task returns through idtentry_exit() to the
>>> place where it took the fault. That's not guaranteed to be on the same
>>> CPU. If schedule is not aware of the fact that the exception turned off
>>> stuff then you surely get into trouble. So you really want to store it
>>> in the task itself then the context switch code can actually see the
>>> state and act accordingly.
>> 
>> Actually thats nasty as well as you need a stack of PKRS values to
>> handle nested exceptions. But it might be still the most reasonable
>> thing to do. 7 PKRS values plus an index should be really sufficient,
>> that's 32bytes total, not that bad.
> 
> I've thought about this a bit more and unless I'm wrong I think the
> idtentry_state provides for that because each nested exception has it's own
> idtentry_state doesn't it?

Only the ones that use idtentry_enter() instead of, say, nmi_enter().

> 
> Ira
Ira Weiny July 24, 2020, 7:43 p.m. UTC | #26
On Fri, Jul 24, 2020 at 10:29:23AM -0700, Andy Lutomirski wrote:
> 
> > On Jul 24, 2020, at 10:23 AM, Ira Weiny <ira.weiny@intel.com> wrote:
> > 
> > On Thu, Jul 23, 2020 at 10:15:17PM +0200, Thomas Gleixner wrote:
> >> Thomas Gleixner <tglx@linutronix.de> writes:
> >> 
> >>> Ira Weiny <ira.weiny@intel.com> writes:
> >>>> On Fri, Jul 17, 2020 at 12:06:10PM +0200, Peter Zijlstra wrote:
> >>>>>> On Fri, Jul 17, 2020 at 12:20:56AM -0700, ira.weiny@intel.com wrote:
> >>>>> I've been really digging into this today and I'm very concerned that I'm
> >>>>> completely missing something WRT idtentry_enter() and idtentry_exit().
> >>>>> 
> >>>>> I've instrumented idt_{save,restore}_pkrs(), and __dev_access_{en,dis}able()
> >>>>> with trace_printk()'s.
> >>>>> 
> >>>>> With this debug code, I have found an instance where it seems like
> >>>>> idtentry_enter() is called without a corresponding idtentry_exit().  This has
> >>>>> left the thread ref counter at 0 which results in very bad things happening
> >>>>> when __dev_access_disable() is called and the ref count goes negative.
> >>>>> 
> >>>>> Effectively this seems to be happening:
> >>>>> 
> >>>>> ...
> >>>>>    // ref == 0
> >>>>>    dev_access_enable()  // ref += 1 ==> disable protection
> >>>>>        // exception  (which one I don't know)
> >>>>>            idtentry_enter()
> >>>>>                // ref = 0
> >>>>>                _handler() // or whatever code...
> >>>>>            // *_exit() not called [at least there is no trace_printk() output]...
> >>>>>            // Regardless of trace output, the ref is left at 0
> >>>>>    dev_access_disable() // ref -= 1 ==> -1 ==> does not enable protection
> >>>>>    (Bad stuff is bound to happen now...)
> >>> 
> >>> Well, if any exception which calls idtentry_enter() would return without
> >>> going through idtentry_exit() then lots of bad stuff would happen even
> >>> without your patches.
> >>> 
> >>>> Also is there any chance that the process could be getting scheduled and that
> >>>> is causing an issue?
> >>> 
> >>> Only from #PF, but after the fault has been resolved and the tasks is
> >>> scheduled in again then the task returns through idtentry_exit() to the
> >>> place where it took the fault. That's not guaranteed to be on the same
> >>> CPU. If schedule is not aware of the fact that the exception turned off
> >>> stuff then you surely get into trouble. So you really want to store it
> >>> in the task itself then the context switch code can actually see the
> >>> state and act accordingly.
> >> 
> >> Actually thats nasty as well as you need a stack of PKRS values to
> >> handle nested exceptions. But it might be still the most reasonable
> >> thing to do. 7 PKRS values plus an index should be really sufficient,
> >> that's 32bytes total, not that bad.
> > 
> > I've thought about this a bit more and unless I'm wrong I think the
> > idtentry_state provides for that because each nested exception has it's own
> > idtentry_state doesn't it?
> 
> Only the ones that use idtentry_enter() instead of, say, nmi_enter().

Oh agreed...

But with this patch we are still better off than just preserving during context
switch.

I need to update the commit message here to make this clear though.

Thanks,
Ira
Thomas Gleixner July 24, 2020, 9:24 p.m. UTC | #27
Ira,

Thomas Gleixner <tglx@linutronix.de> writes:
> Ira Weiny <ira.weiny@intel.com> writes:
>> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
>> I think, after fixing my code (see below), using idtentry_state could still
>> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
>> should carry the state to the new cpu, correct?
>
> I'm way too tired to think about that now. Will have a look tomorrow
> with brain awake.

Not that I'm way more awake now, but at least I have the feeling that my
brain is not completely useless.

Let me summarize what I understood:

  1) A per CPU cache which shadows the current state of the MSR, i.e. the
     current valid key. You use that to avoid costly MSR writes if the
     key does not change.

  2) On idtentry you store the key on entry in idtentry_state, clear it
     in the MSR and shadow state if necessary and restore it on exit.

  3) On context switch out you save the per CPU cache value in the task
     and on context switch in you restore it from there.

Yes, that works (see below for #2) and sorry for my confusion yesterday
about storing this in task state.

#2 requires to handle the exceptions which do not go through
idtentry_enter/exit() seperately, but that's a manageable amount. It's
the ones which use IDTENTRY_RAW or a variant of it.

#BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
entries for those use nmi_enter()/exit(). So you just can create
wrappers around those. Somehting like this

static __always_inline idtentry_state_t idtentry_nmi_enter(void)
{
     	idtentry_state_t state = {};

        nmi_enter();
        instrumentation_begin();
        state.key = save_and_clear_key();
        instrumentation_end();
}

static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
{
        instrumentation_begin();
        restore_key(state.key);
        instrumentation_end();
        nmi_exit();
}

#UD and #PF are using the raw entry variant as well but still invoke
idtentry_enter()/exit(). #PF does not need any work. #UD handles
WARN/BUG without going through idtentry_enter() first, but I don't think
that's an issue unless a not 0 key would prevent writing to the console
device. You surely can figure that out.

Hope that helps.

Thanks,

        tglx
Thomas Gleixner July 24, 2020, 9:31 p.m. UTC | #28
Thomas Gleixner <tglx@linutronix.de> writes:
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>      	idtentry_state_t state = {};
>
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();

Clearly lacks a

        return state;

But I assume you already spotted it. Otherwise the compiler would have :)

Thanks,

        tglx
Andy Lutomirski July 25, 2020, 12:09 a.m. UTC | #29
On Fri, Jul 24, 2020 at 2:25 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ira,
>
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> >> I think, after fixing my code (see below), using idtentry_state could still
> >> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> >> should carry the state to the new cpu, correct?
> >
> > I'm way too tired to think about that now. Will have a look tomorrow
> > with brain awake.
>
> Not that I'm way more awake now, but at least I have the feeling that my
> brain is not completely useless.
>
> Let me summarize what I understood:
>
>   1) A per CPU cache which shadows the current state of the MSR, i.e. the
>      current valid key. You use that to avoid costly MSR writes if the
>      key does not change.
>
>   2) On idtentry you store the key on entry in idtentry_state, clear it
>      in the MSR and shadow state if necessary and restore it on exit.
>
>   3) On context switch out you save the per CPU cache value in the task
>      and on context switch in you restore it from there.
>
> Yes, that works (see below for #2) and sorry for my confusion yesterday
> about storing this in task state.
>
> #2 requires to handle the exceptions which do not go through
> idtentry_enter/exit() seperately, but that's a manageable amount. It's
> the ones which use IDTENTRY_RAW or a variant of it.
>
> #BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
> entries for those use nmi_enter()/exit(). So you just can create
> wrappers around those. Somehting like this
>
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>         idtentry_state_t state = {};
>
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();
> }
>
> static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
> {
>         instrumentation_begin();
>         restore_key(state.key);
>         instrumentation_end();
>         nmi_exit();
> }
>
> #UD and #PF are using the raw entry variant as well but still invoke
> idtentry_enter()/exit(). #PF does not need any work. #UD handles
> WARN/BUG without going through idtentry_enter() first, but I don't think
> that's an issue unless a not 0 key would prevent writing to the console
> device. You surely can figure that out.

Putting on my mm maintainer hat for a moment, I really think that we
want oopses to print PKRS along with all the other registers when we
inevitably oops due to a page fault.  And we probably also want it in
the nasty nested case where we get infinite page faults and eventually
double fault.

I'm sure it's *possible* to wire this up if we stick it in
idtentry_state, but it's trivial if we stick it in pt_regs.  I'm okay
with doing the save/restore in C (in fact, I prefer that), but I think
that either the value should be stuck in pt_regs or we should find a
way to teach the unwinder to locate idtentry_state.

And, if we go with idtentry_state, we should make a signature change
to nmi_enter() to also provide idtentry_state or some equivalent
object.

--Andy
Ira Weiny July 27, 2020, 8:59 p.m. UTC | #30
On Fri, Jul 24, 2020 at 11:24:58PM +0200, Thomas Gleixner wrote:
> Ira,
> 
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Ira Weiny <ira.weiny@intel.com> writes:
> >> On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> >> I think, after fixing my code (see below), using idtentry_state could still
> >> work.  If the per-cpu cache and the MSR is updated in idtentry_exit() that
> >> should carry the state to the new cpu, correct?
> >
> > I'm way too tired to think about that now. Will have a look tomorrow
> > with brain awake.
> 
> Not that I'm way more awake now, but at least I have the feeling that my
> brain is not completely useless.
> 
> Let me summarize what I understood:
> 
>   1) A per CPU cache which shadows the current state of the MSR, i.e. the
>      current valid key. You use that to avoid costly MSR writes if the
>      key does not change.

Yes

> 
>   2) On idtentry you store the key on entry in idtentry_state, clear it
>      in the MSR and shadow state if necessary and restore it on exit.

Yes, but I've subsequently found a bug here but yea that was the intention.
:-D

I also maintain the ref count of the number of nested calls to kmap to ensure
that kmap_atomic() is nestable during an exception independent of the number
of nested calls of the interrupted thread.

>   3) On context switch out you save the per CPU cache value in the task
>      and on context switch in you restore it from there.

yes

> 
> Yes, that works (see below for #2) and sorry for my confusion yesterday
> about storing this in task state.

No problem.

> 
> #2 requires to handle the exceptions which do not go through
> idtentry_enter/exit() seperately, but that's a manageable amount. It's
> the ones which use IDTENTRY_RAW or a variant of it.
> 
> #BP, #MC, #NMI, #DB, #DF need extra local storage as all the kernel
> entries for those use nmi_enter()/exit(). So you just can create
> wrappers around those. Somehting like this
> 
> static __always_inline idtentry_state_t idtentry_nmi_enter(void)
> {
>      	idtentry_state_t state = {};
> 
>         nmi_enter();
>         instrumentation_begin();
>         state.key = save_and_clear_key();
>         instrumentation_end();
> }
> 
> static __always_inline void idtentry_nmi_exit(idtentry_state_t state)
> {
>         instrumentation_begin();
>         restore_key(state.key);
>         instrumentation_end();
>         nmi_exit();
> }
> 

Thanks!

> #UD and #PF are using the raw entry variant as well but still invoke
> idtentry_enter()/exit(). #PF does not need any work. #UD handles
> WARN/BUG without going through idtentry_enter() first, but I don't think
> that's an issue unless a not 0 key would prevent writing to the console
> device. You surely can figure that out.
> 
> Hope that helps.

Yes it does thank you.  I'm also trying to simplify the API per Peters
comments while refactoring this.

Ira

Patch
diff mbox series

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0521546022cb..012bf7ecca0d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,6 +41,7 @@ 
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
+#include <asm/pkeys_internal.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -558,6 +559,72 @@  SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the default PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the access of the
+ * interrupted task.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access.  Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done.  Then restore it for the interrupted task.
+ */
+
+static void noinstr idt_save_pkrs(idtentry_state_t state)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state.pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
+	state.pkrs = this_cpu_read(pkrs_cache);
+	if (state.pkrs != INIT_PKRS_VALUE)
+		write_pkrs(INIT_PKRS_VALUE);
+}
+
+static void noinstr idt_restore_pkrs(idtentry_state_t state)
+{
+	u32 pkrs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	pkrs = this_cpu_read(pkrs_cache);
+	if (state.pkrs != pkrs)
+		write_pkrs(state.pkrs);
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state.pkrs_ref;
+#endif
+}
+#else
+/* Define as macros to prevent conflict of inline and noinstr */
+#define idt_save_pkrs(state)
+#define idt_restore_pkrs(state)
+#endif
+
+
 /**
  * idtentry_enter - Handle state tracking on ordinary idtentries
  * @regs:	Pointer to pt_regs of interrupted context
@@ -604,6 +671,8 @@  idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
 		return ret;
 	}
 
+	idt_save_pkrs(ret);
+
 	/*
 	 * If this entry hit the idle task invoke rcu_irq_enter() whether
 	 * RCU is watching or not.
@@ -694,7 +763,12 @@  void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		prepare_exit_to_usermode(regs);
-	} else if (regs->flags & X86_EFLAGS_IF) {
+		return;
+	}
+
+	idt_restore_pkrs(state);
+
+	if (regs->flags & X86_EFLAGS_IF) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -819,6 +893,8 @@  __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+		/* Normally called by idtentry_exit, we must restore pkrs here */
+		idt_restore_pkrs(state);
 		instrumentation_begin();
 		idtentry_exit_cond_resched(regs, true);
 		instrumentation_end();
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 7227225cf45d..92d5e43cda7f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -14,6 +14,8 @@  void idtentry_enter_user(struct pt_regs *regs);
 void idtentry_exit_user(struct pt_regs *regs);
 
 typedef struct idtentry_state {
+	unsigned int pkrs_ref;
+	u32 pkrs;
 	bool exit_rcu;
 } idtentry_state_t;
 
diff --git a/arch/x86/include/asm/pkeys_internal.h b/arch/x86/include/asm/pkeys_internal.h
index e34f380c66d1..60e7b55dd141 100644
--- a/arch/x86/include/asm/pkeys_internal.h
+++ b/arch/x86/include/asm/pkeys_internal.h
@@ -27,7 +27,8 @@ 
 #define        PKS_NUM_KEYS            16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 pkrs_val);
+DECLARE_PER_CPU(u32, pkrs_cache);
+void noinstr write_pkrs(u32 pkrs_val);
 #else
 static inline void write_pkrs(u32 pkrs_val) { }
 #endif
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d69250a7c1bf..85f0cbd5faa5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -194,7 +194,6 @@  int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
  * headers.
  */
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-DECLARE_PER_CPU(u32, pkrs_cache);
 static inline void pks_init_task(struct task_struct *tsk)
 {
 	/* New tasks get the most restrictive PKRS value */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e565fadd74d7..cf9915fed6c9 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -247,7 +247,7 @@  DEFINE_PER_CPU(u32, pkrs_cache);
  * disabled as it does not guarantee the atomicity of updating the pkrs_cache
  * and MSR on its own.
  */
-void write_pkrs(u32 pkrs_val)
+void noinstr write_pkrs(u32 pkrs_val)
 {
 	this_cpu_write(pkrs_cache, pkrs_val);
 	wrmsrl(MSR_IA32_PKRS, pkrs_val);