
[v22,16/24] x86/vdso: Add support for exception fixup in vDSO functions

Message ID 20190903142655.21943-17-jarkko.sakkinen@linux.intel.com (mailing list archive)
State New, archived
Series Intel SGX foundations

Commit Message

Jarkko Sakkinen Sept. 3, 2019, 2:26 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

The basic concept and implementation is very similar to the kernel's
exception fixup mechanism.  The key differences are that the kernel
handler is hardcoded and the fixup entry addresses are relative to
the overall table as opposed to individual entries.

Hardcoding the kernel handler avoids the need to figure out how to
get userspace code to point at a kernel function.  Given that the
expected usage is to propagate information to userspace, dumping all
fault information into registers is likely the desired behavior for
the vast majority of yet-to-be-created functions.  Use registers
DI, SI and DX to communicate fault information, which follows Linux's
ABI for register consumption and hopefully avoids conflict with
hardware features that might leverage the fixup capabilities, e.g.
register usage for SGX instructions was at least partially designed
with calling conventions in mind.
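
Concretely, the register convention boils down to the following sketch;
the helper name is made up here, the real logic being the tail end of
fixup_vdso_exception() in the diff below:

#include <asm/ptrace.h>

/*
 * Sketch of the fault-reporting convention: on a matched fixup entry,
 * execution resumes at the fixup target with the fault details handed
 * to the vDSO code in DI, SI and DX.
 */
static void vdso_report_fault(struct pt_regs *regs, unsigned long fixup_ip,
			      int trapnr, unsigned long error_code,
			      unsigned long fault_addr)
{
	regs->ip = fixup_ip;	/* resume at the fixup target, not the signal path */
	regs->di = trapnr;	/* exception vector, e.g. X86_TRAP_PF */
	regs->si = error_code;	/* hardware error code, if any */
	regs->dx = fault_addr;	/* faulting address (CR2 for #PF) */
}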

Making fixup addresses relative to the overall table allows the table
to be stripped from the final vDSO image (it's a kernel construct)
without complicating the offset logic, e.g. entry-relative addressing
would also need to account for the table's location relative to the
image.
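
In code, the table-relative scheme makes an entry match look like the
sketch below; the helper is illustrative and simply mirrors the lookup
in fixup_vdso_exception() further down:

#include <linux/types.h>

struct vdso_exception_table_entry {
	int insn, fixup;	/* both offsets from the start of __ex_table */
};

static bool vdso_extable_match(unsigned long vdso_base,
			       unsigned long extable_base,
			       const struct vdso_exception_table_entry *e,
			       unsigned long ip, unsigned long *fixup_ip)
{
	/* The table moves with the vDSO mapping; only its base is needed. */
	unsigned long table = vdso_base + extable_base;

	if (ip != table + e->insn)
		return false;

	*fixup_ip = table + e->fixup;
	return true;
}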

Regarding stripping the table, modify vdso2c to extract the table from
the raw, a.k.a. unstripped, data and dump it as a standalone byte array
in the resulting .c file.  The original base of the table, its length
and a pointer to the byte array are captured in struct vdso_image.
Alternatively, the table could be dumped directly into the struct,
but because the number of entries can vary per image, that would
require either hardcoding a max sized table into the struct definition
or defining the table as a flexible length array.  The flexible length
array approach has zero benefits, e.g. the base/size are still needed,
and prevents reusing the extraction code, while hardcoding the max size
adds ongoing maintenance just to avoid exporting the explicit size.
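
For illustration, the generated vdso-image-*.c for the 64-bit image then
looks roughly like this; the offsets, sizes and byte values are
placeholders and the generated includes are trimmed:

#include <asm/vdso.h>

static unsigned char raw_data[8192] __ro_after_init __aligned(PAGE_SIZE) = {
	/* stripped vDSO image bytes ... */
};

/* Kernel-only copy of __ex_table; one entry: insn = 0x10, fixup = 0x40. */
static const unsigned char extable[8] = {
	0x10, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00,
};

const struct vdso_image vdso_image_64 = {
	.data		= raw_data,
	.size		= 8192,
	/* ... */
	.extable_base	= 4096,	/* where __ex_table sat in the unstripped image */
	.extable_len	= 8,
	.extable	= extable,
};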

The immediate use case is for Intel Software Guard Extensions (SGX).
SGX introduces a new CPL3-only "enclave" mode that runs as a sort of
black box shared object that is hosted by an untrusted "normal" CPL3
process.

Entering an enclave can only be done through SGX-specific instructions,
EENTER and ERESUME, and is a non-trivial process.  Because of the
complexity of transitioning to/from an enclave, the vast majority of
enclaves are expected to utilize a library to handle the actual
transitions.  This is roughly analogous to how e.g. libc implementations
are used by most applications.
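
For reference, the entry itself is an ENCLU instruction with the leaf
selected in EAX (2 for EENTER, 3 for ERESUME, per the SDM). A bare-bones
sketch, ignoring the register save/restore and exit handling a real
library has to do:

/* ENCLU leaf numbers, per the SDM. */
#define EENTER	2
#define ERESUME	3

/*
 * Bare-bones sketch only: RBX carries the TCS address and RCX the AEP,
 * i.e. the address the CPU transfers to on an asynchronous enclave exit
 * (typically an ERESUME). The enclave may clobber most registers on
 * exit, which is part of why a library wrapper is expected.
 */
static inline void sgx_enclu(unsigned long leaf, void *tcs, void *aep)
{
	asm volatile("enclu"
		     : "+a" (leaf), "+b" (tcs), "+c" (aep)
		     :
		     : "memory", "cc");
}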

Another crucial characteristic of SGX enclaves is that they can generate
exceptions as part of their normal (at least as "normal" as SGX can be)
operation that need to be handled *in* the enclave and/or are unique
to SGX.

And because they are essentially fancy shared objects, a process can
host any number of enclaves, each of which can execute multiple threads
simultaneously.

Putting everything together, userspace enclaves will utilize a library
that must be prepared to handle any and (almost) all exceptions any time
at least one thread may be executing in an enclave.  Leveraging signals
to handle the enclave exceptions is unpleasant, to put it mildly, e.g.
the SGX library must constantly (un)register its signal handler based
on whether or not at least one thread is executing in an enclave, and
filter and forward exceptions that aren't related to its enclaves.  This
becomes particularly nasty when using multiple levels of libraries that
register signal handlers, e.g. running an enclave via cgo inside of the
Go runtime.

Enabling exception fixup in vDSO allows the kernel to provide a vDSO
function that wraps the low-level transitions to/from the enclave, i.e.
the EENTER and ERESUME instructions.  The vDSO function can intercept
exceptions that would otherwise generate a signal and return the fault
information directly to its caller, thus avoiding the need to juggle
signal handlers.
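
The vDSO function itself is introduced later in the series; purely to
illustrate the calling side, a hypothetical runtime wrapper might expose
the fault data like this (all names below are made up):

struct sgx_fault_info {
	int		trapnr;		/* exception vector, from RDI */
	unsigned long	error_code;	/* from RSI */
	unsigned long	addr;		/* faulting address, from RDX */
};

/* Hypothetical wrapper around the vDSO entry function. */
int sgx_enter_enclave(void *tcs, struct sgx_fault_info *fault);

static int run_enclave_once(void *tcs)
{
	struct sgx_fault_info fault;

	if (sgx_enter_enclave(tcs, &fault)) {
		/*
		 * No signal is delivered; the fault is reported right
		 * here via fault.trapnr, fault.error_code and fault.addr.
		 */
		return -1;
	}

	return 0;
}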

Note that unlike the kernel's _ASM_EXTABLE_HANDLE implementation, the
'C' version of _ASM_VDSO_EXTABLE_HANDLE doesn't use a pre-compiled
assembly macro.  Duplicating four lines of code is simpler than adding
the necessary infrastructure to generate pre-compiled assembly and the
intended benefit of massaging GCC's inlining algorithm is unlikely to be
realized in the vDSO any time soon, if ever.
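
A hypothetical 'C' user would wrap the faulting instruction in inline
assembly and mark it up with the macro, roughly like so:

#include "extable.h"

/*
 * Illustrative only, not part of this patch: if the load at label 1
 * faults, the kernel resumes execution at label 2 with the fault
 * details in DI, SI and DX. The compiler is not told about those
 * registers here, so this is purely a sketch of the mechanics.
 */
static unsigned long vdso_probe_word(const unsigned long *ptr)
{
	unsigned long val = 0;

	asm volatile("1: mov (%1), %0\n"
		     "2:\n"
		     _ASM_VDSO_EXTABLE_HANDLE(1b, 2b)
		     : "=r" (val)
		     : "r" (ptr)
		     : "memory");

	return val;
}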

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/entry/vdso/Makefile          |  6 +--
 arch/x86/entry/vdso/extable.c         | 46 +++++++++++++++++++++
 arch/x86/entry/vdso/extable.h         | 29 ++++++++++++++
 arch/x86/entry/vdso/vdso-layout.lds.S |  9 ++++-
 arch/x86/entry/vdso/vdso2c.h          | 58 +++++++++++++++++++++++----
 arch/x86/include/asm/vdso.h           |  5 +++
 6 files changed, 141 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/entry/vdso/extable.c
 create mode 100644 arch/x86/entry/vdso/extable.h

Comments

Jarkko Sakkinen Oct. 2, 2019, 11:18 p.m. UTC | #1
On Tue, Sep 03, 2019 at 05:26:47PM +0300, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> The basic concept and implementation is very similar to the kernel's
> exception fixup mechanism.  The key differences are that the kernel
> handler is hardcoded and the fixup entry addresses are relative to
> the overall table as opposed to individual entries.

The commit message is missing a description of what the commit does.
Please explain "what" before "why"; right now the "what" is completely
lacking. This paragraph starts as if there were an invisible paragraph
before it.

You should start by explaining briefly about this:

1. A brief description of what vdso2c is.
2. A brief description of what changes you do to vdso2c.
3. A brief description of what kernel change you do.
4. A brief description of the flow of how the exception gets delivered
   to user space.

All of this is completely missing.

> Hardcoding the kernel handler avoids the need to figure out how to
> get userspace code to point at a kernel function.  Given that the
> expected usage is to propagate information to userspace, dumping all
> fault information into registers is likely the desired behavior for
> the vast majority of yet-to-be-created functions.  Use registers
> DI, SI and DX to communicate fault information, which follows Linux's
> ABI for register consumption and hopefully avoids conflict with
> hardware features that might leverage the fixup capabilities, e.g.
> register usage for SGX instructions was at least partially designed
> with calling conventions in mind.

No description of what is stored in DI, SI and DX. Also there is two
space bars after *every* sentence. Your text editor is totally broken
somehow. Also, the DB/BP exception case is not described.

> Making fixup addresses relative to the overall table allows the table
> to be stripped from the final vDSO image (it's a kernel construct)
> without complicating the offset logic, e.g. entry-relative addressing
> would also need to account for the table's location relative to the
> image.
> 
> Regarding stripping the table, modify vdso2c to extract the table from
> the raw, a.k.a. unstripped, data and dump it as a standalone byte array
> in the resulting .c file.  The original base of the table, its length
> and a pointer to the byte array are captured in struct vdso_image.
> Alternatively, the table could be dumped directly into the struct,
> but because the number of entries can vary per image, that would
> require either hardcoding a max sized table into the struct definition
> or defining the table as a flexible length array.  The flexible length
> array approach has zero benefits, e.g. the base/size are still needed,
> and prevents reusing the extraction code, while hardcoding the max size
> adds ongoing maintenance just to avoid exporting the explicit size.
> 
> The immediate use case is for Intel Software Guard Extensions (SGX).
> SGX introduces a new CPL3-only "enclave" mode that runs as a sort of
> black box shared object that is hosted by an untrusted "normal" CPL3
> process.
> 
> Entering an enclave can only be done through SGX-specific instructions,
> EENTER and ERESUME, and is a non-trivial process.  Because of the
> complexity of transitioning to/from an enclave, the vast majority of
> enclaves are expected to utilize a library to handle the actual
> transitions.  This is roughly analogous to how e.g. libc implementations
> are used by most applications.
> 
> Another crucial characteristic of SGX enclaves is that they can generate
> exceptions as part of their normal (at least as "normal" as SGX can be)
> operation that need to be handled *in* the enclave and/or are unique
> to SGX.
> 
> And because they are essentially fancy shared objects, a process can
> host any number of enclaves, each of which can execute multiple threads
> simultaneously.
> 
> Putting everything together, userspace enclaves will utilize a library
> that must be prepared to handle any and (almost) all exceptions any time
> at least one thread may be executing in an enclave.  Leveraging signals
> to handle the enclave exceptions is unpleasant, to put it mildly, e.g.
> the SGX library must constantly (un)register its signal handler based
> on whether or not at least one thread is executing in an enclave, and
> filter and forward exceptions that aren't related to its enclaves.  This
> becomes particularly nasty when using multiple levels of libraries that
> register signal handlers, e.g. running an enclave via cgo inside of the
> Go runtime.
> 
> Enabling exception fixup in vDSO allows the kernel to provide a vDSO
> function that wraps the low-level transitions to/from the enclave, i.e.
> the EENTER and ERESUME instructions.  The vDSO function can intercept
> exceptions that would otherwise generate a signal and return the fault
> information directly to its caller, thus avoiding the need to juggle
> signal handlers.
> 
> Note that unlike the kernel's _ASM_EXTABLE_HANDLE implementation, the
> 'C' version of _ASM_VDSO_EXTABLE_HANDLE doesn't use a pre-compiled
> assembly macro.  Duplicating four lines of code is simpler than adding
> the necessary infrastructure to generate pre-compiled assembly and the
> intended benefit of massaging GCC's inlining algorithm is unlikely to be
> realized in the vDSO any time soon, if ever.

The rest of the story is a mess, with bits and pieces of "what" and
"why" mixed together. You could probably make the whole thing about 50%
smaller with better organization.

I never understood anything from this commit message. Only by looking
at the code change and completely ignoring the commit message could I
understand what the heck is going on. The commit message in its current
form makes me understand the code change *less*.

It would be even better to delete it completely than have it in the
current form. I would suggest that you do that and concentrate writing
steps 1-4 that I described above.

/Jarkko
Jarkko Sakkinen Oct. 2, 2019, 11:45 p.m. UTC | #2
On Thu, Oct 03, 2019 at 02:18:04AM +0300, Jarkko Sakkinen wrote:
> It would be even better to delete it completely than have it in the
> current form. I would suggest that you do that and concentrate writing
> steps 1-4 that I described above.

To compensate for my rather harsh (but correct) comments on the commit
message, the code change is something that I'm more than happy with.

It is only the commit message that sucks.

/Jarkko
Sean Christopherson Oct. 4, 2019, 12:03 a.m. UTC | #3
On Thu, Oct 03, 2019 at 02:18:04AM +0300, Jarkko Sakkinen wrote:
> Also there is two space bars after *every* sentence. Your text editor is
> totally broken somehow.

I completely misunderstood your earlier comment, I thought you were saying
there were random spaces at the end of lines.

It's not my editor, it's me.  I insert two spaces after full stops.  IMO,
people that use a single space are heathens. :-)

If it's a sticking point I'll make an effort to use a single space for SGX
comments and changelogs.
Sean Christopherson Oct. 4, 2019, 12:15 a.m. UTC | #4
I'll tackle this tomorrow.  I've been working on the feature control MSR
series and will get that sent out tomorrow as well.  I should also be able
to get you the multi-page EADD patch.
Jarkko Sakkinen Oct. 4, 2019, 6:49 p.m. UTC | #5
On Thu, Oct 03, 2019 at 05:03:48PM -0700, Sean Christopherson wrote:
> On Thu, Oct 03, 2019 at 02:18:04AM +0300, Jarkko Sakkinen wrote:
> > Also there is two space bars after *every* sentence. Your text editor is
> > totally broken somehow.
> 
> I completely misunderstood your earlier comment, I thought you were saying
> there were random spaces at the end of lines.
> 
> It's not my editor, it's me.  I insert two spaces after full stops.  IMO,
> people that use a single space are heathens. :-)
> 
> If it's a sticking point I'll make an effort to use a single space for SGX
> comments and changelogs.

Great.

/Jarkko
Jarkko Sakkinen Oct. 4, 2019, 6:52 p.m. UTC | #6
On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> I'll tackle this tomorrow.  I've been working on the feature control MSR
> series and will get that sent out tomorrow as well.  I should also be able
> to get you the multi-page EADD patch.

Great I'll compose the patch set during the weekend and take Monday off
so you have the full work day to get everything (probably send the patch
set on Sunday).

/Jarkko
Sean Christopherson Oct. 5, 2019, 3:54 p.m. UTC | #7
On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > series and will get that sent out tomorrow as well.  I should also be able
> > to get you the multi-page EADD patch.
> 
> Great I'll compose the patch set during the weekend and take Monday off
> so you have the full work day to get everything (probably send the patch
> set on Sunday).

Didn't get to the actual SGX stuff yesterday as the feature control series
took longer than expected to finish.  Working on the other items this
morning.
Sean Christopherson Oct. 5, 2019, 6:39 p.m. UTC | #8
On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > series and will get that sent out tomorrow as well.  I should also be able
> > to get you the multi-page EADD patch.
> 
> Great I'll compose the patch set during the weekend and take Monday off
> so you have the full work day to get everything (probably send the patch
> set on Sunday).

I wasn't able to finish everything this morning (not even close).  The
vDSO code and documentation was in rough shape.  I finished cleaning it
up, but still need to test and rewrite the changelog.

If you really want to send v23 this weekend I can work more tonight
and/or tomorrow morning.  My preference would be to just punt a few more
days.

My todo list for v23:

  - Test vDSO changes and craft proper patches
  - Rewrite vDSO changelog
  - Rewrite vDSO exception fixup changelog
  - Implement multi-page EADD

My todo list post-v23:
  - Write SGX programming model documentation (requested by Casey)
Jarkko Sakkinen Oct. 6, 2019, 11:38 p.m. UTC | #9
On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > series and will get that sent out tomorrow as well.  I should also be able
> > to get you the multi-page EADD patch.
> 
> Great I'll compose the patch set during the weekend and take Monday off
> so you have the full work day to get everything (probably send the patch
> set on Sunday).

I don't see why the multipage version could not be ioctl of its own and
ioctl's can then use the same internals. Having a single page version
does not cause any kind of bottleneck really.

Thus, I'm sending v23 now based on these conclusions.

/Jarkko
Jarkko Sakkinen Oct. 6, 2019, 11:40 p.m. UTC | #10
On Mon, Oct 07, 2019 at 02:38:17AM +0300, Jarkko Sakkinen wrote:
> On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > > series and will get that sent out tomorrow as well.  I should also be able
> > > to get you the multi-page EADD patch.
> > 
> > Great I'll compose the patch set during the weekend and take Monday off
> > so you have the full work day to get everything (probably send the patch
> > set on Sunday).
> 
> I don't see why the multipage version could not be ioctl of its own and
> ioctl's can then use the same internals. Having a single page version
> does not cause any kind of bottleneck really.
> 
> Thus, sending now v23 based on these conclusions.

Sure, you can argue it is redundant but I see it as a nice convenience
for simple stuff that does not really hurt at all.

/Jarkko
Jarkko Sakkinen Oct. 7, 2019, 7:57 a.m. UTC | #11
On Sat, Oct 05, 2019 at 08:54:13AM -0700, Sean Christopherson wrote:
> On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > > series and will get that sent out tomorrow as well.  I should also be able
> > > to get you the multi-page EADD patch.
> > 
> > Great I'll compose the patch set during the weekend and take Monday off
> > so you have the full work day to get everything (probably send the patch
> > set on Sunday).
> 
> Didn't get to the actual SGX stuff yesterday as the feature control series
> took longer than expected to finish.  Working on the other items this
> morning.

I anyway decided to wait for your patches.

I said in earlier email that two ioctl's would be great but I think the
following would be the API that I would actually appreciate the most:

struct sgx_enclave_add_page_desc {
	__u64	addr;
	__u64	src;
	__u64	secinfo;
	__u16	mrmask;
	__u8	reserved[6];
};

struct sgx_enclave_add_page {
	__u64	nr_pages;
	__u64	pages;
};

This will keep the same amount of control and give the performance
benefit. And it is trivial to use in the single-page case. Finally,
it follows the principle of minimal delta, i.e. we change the existing
API, which is somewhat proven, as little as possible to fulfill the
new requirement.

Can you use this model for the API? For the internals you can choose
whatever you think fits best.

/Jarkko
Jarkko Sakkinen Oct. 7, 2019, 8:01 a.m. UTC | #12
On Sat, Oct 05, 2019 at 11:39:39AM -0700, Sean Christopherson wrote:
> On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > > series and will get that sent out tomorrow as well.  I should also be able
> > > to get you the multi-page EADD patch.
> > 
> > Great I'll compose the patch set during the weekend and take Monday off
> > so you have the full work day to get everything (probably send the patch
> > set on Sunday).
> 
> I wasn't able to finish everything this morning (not even close).  The
> vDSO code and documentation was in rough shape.  I finished cleaning it
> up, but still need to test and rewrite the changelog.
> 
> If you really want to send v23 this weekend I can work more tonight
> and/or tomorrow morning.  My preference would be to just punt a few more
> days.
> 
> My todo list for v23:
> 
>   - Test vDSO changes and craft proper patches
>   - Rewrite vDSO changelog
>   - Rewrite vDSO exception fixup changelog
>   - Implement multi-page EADD

For a maintainer, commit messages are at least as important as
documentation in the Documentation folder, if not more so. If
they've been written well, they are useful when tracking back
through the history to fix bugs and so forth.

> 
> My todo list post-v23:
>   - Write SGX programming model documentation (requested by Casey)

I'll wait until I get your v23 changes. Try to get them to me as
fast as possible, but without sacrificing quality.

/Jarkko
Jarkko Sakkinen Oct. 7, 2019, 8:10 a.m. UTC | #13
On Mon, Oct 07, 2019 at 10:57:12AM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 05, 2019 at 08:54:13AM -0700, Sean Christopherson wrote:
> > On Fri, Oct 04, 2019 at 09:52:21PM +0300, Jarkko Sakkinen wrote:
> > > On Thu, Oct 03, 2019 at 05:15:00PM -0700, Sean Christopherson wrote:
> > > > I'll tackle this tomorrow.  I've been working on the feature control MSR
> > > > series and will get that sent out tomorrow as well.  I should also be able
> > > > to get you the multi-page EADD patch.
> > > 
> > > Great I'll compose the patch set during the weekend and take Monday off
> > > so you have the full work day to get everything (probably send the patch
> > > set on Sunday).
> > 
> > Didn't get to the actual SGX stuff yesterday as the feature control series
> > took longer than expected to finish.  Working on the other items this
> > morning.
> 
> I anyway decided to wait for your patches.
> 
> I said in earlier email that two ioctl's would be great but I think the
> following would be the API that I would actually appreciate the most:
> 
> struct sgx_enclave_add_page_desc {
> 	__u64	addr;
> 	__u64	src;
> 	__u64	secinfo;
> 	__u16	mrmask;
> 	__u8	reserved[6];
> };
> 
> struct sgx_enclave_add_page {
> 	__u64	nr_pages;
> 	__u64	pages;
> };

Actually, maybe like this:

struct sgx_enclave_add_page_desc {
	__u64	addr;
	__u64	offset;
	__u64	secinfo;
	__u16	mrmask;
	__u8	reserved[6];
};

struct sgx_enclave_add_page {
	__u64	src;
	__u64	nr_pages;
	__u64	pages;
};

I.e. it probably makes sense to have a single, fixed source for all
pages.

Also, I'm wondering if we should have a special case for adding zero
pages, i.e. when you set @src to NULL the ioctl would assume that
zero pages should be added?

E.g. I could use this in the selftest to create a variable-size data
segment.

/Jarkko
Jarkko Sakkinen Oct. 7, 2019, 12:04 p.m. UTC | #14
On Mon, Oct 07, 2019 at 11:10:24AM +0300, Jarkko Sakkinen wrote:
> Actually, maybe like this:
> 
> struct sgx_enclave_add_page_desc {
> 	__u64	addr;
> 	__u64	offset;
> 	__u64	secinfo;
> 	__u16	mrmask;
> 	__u8	reserved[6];
> };
> 
> struct sgx_enclave_add_page {
> 	__u64	src;
> 	__u64	nr_pages;
> 	__u64	pages;
> };

Of course we should remove @addr:

struct sgx_enclave_add_page_desc {
	__u64	offset;
	__u16	mrmask;
	__u8	reserved[6];
};

struct sgx_enclave_add_page {
	__u64	src;
	__u64	secinfo;
	__u64	nr_pages;
	__u64	pages;
};

That is something we have forgot to do. We should have started to use
offset instead of address when we moved to fd based API. Anyway I think
this kind of API where you give array of descriptors from one source
would be optimal.

Also, @secinfo is better to be out of the descriptor so that let say
LSM checks could be done with a single callback.
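
A minimal userspace usage sketch of that shape, with the uapi __uXX
types swapped for stdint ones and the ioctl request number left as a
parameter since it does not exist yet:

#include <stdint.h>
#include <sys/ioctl.h>

struct sgx_enclave_add_page_desc {
	uint64_t offset;
	uint16_t mrmask;
	uint8_t  reserved[6];
};

struct sgx_enclave_add_page {
	uint64_t src;
	uint64_t secinfo;
	uint64_t nr_pages;
	uint64_t pages;
};

static int add_pages(int enclave_fd, unsigned long request, const void *src,
		     const void *secinfo,
		     struct sgx_enclave_add_page_desc *descs, uint64_t nr_pages)
{
	struct sgx_enclave_add_page arg = {
		.src	  = (uintptr_t)src,	/* one source buffer for all pages */
		.secinfo  = (uintptr_t)secinfo,	/* shared, so one LSM check can cover all pages */
		.nr_pages = nr_pages,
		.pages	  = (uintptr_t)descs,	/* per-page offset/mrmask descriptors */
	};

	return ioctl(enclave_fd, request, &arg);
}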

/Jarkko
Sean Christopherson Oct. 8, 2019, 4:54 a.m. UTC | #15
On Mon, Oct 07, 2019 at 03:04:12PM +0300, Jarkko Sakkinen wrote:
> On Mon, Oct 07, 2019 at 11:10:24AM +0300, Jarkko Sakkinen wrote:
> > Actually, maybe like this:
> > 
> > struct sgx_enclave_add_page_desc {
> > 	__u64	addr;
> > 	__u64	offset;
> > 	__u64	secinfo;
> > 	__u16	mrmask;
> > 	__u8	reserved[6];
> > };
> > 
> > struct sgx_enclave_add_page {
> > 	__u64	src;
> > 	__u64	nr_pages;
> > 	__u64	pages;
> > };
> 
> Of course we should remove @addr:
> 
> struct sgx_enclave_add_page_desc {
> 	__u64	offset;
> 	__u16	mrmask;
> 	__u8	reserved[6];
> };
> 
> struct sgx_enclave_add_page {
> 	__u64	src;
> 	__u64	secinfo;
> 	__u64	nr_pages;
> 	__u64	pages;
> };
> 
> That is something we have forgot to do. We should have started to use
> offset instead of address when we moved to fd based API. Anyway I think
> this kind of API where you give array of descriptors from one source
> would be optimal.
> 
> Also, @secinfo is better to be out of the descriptor so that let say
> LSM checks could be done with a single callback.

Famous last words, but hopefully I can get this to you tomorrow, as well
as the vDSO changelog rewrite.

Patch

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 8df549138193..301e99d145a7 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -26,7 +26,7 @@  VDSO32-$(CONFIG_IA32_EMULATION)	:= y
 vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
 
 # files to link into kernel
-obj-y				+= vma.o
+obj-y				+= vma.o extable.o
 OBJECT_FILES_NON_STANDARD_vma.o	:= n
 
 # vDSO images to build
@@ -121,8 +121,8 @@  $(obj)/%-x32.o: $(obj)/%.o FORCE
 
 targets += vdsox32.lds $(vobjx32s-y)
 
-$(obj)/%.so: OBJCOPYFLAGS := -S
-$(obj)/%.so: $(obj)/%.so.dbg FORCE
+$(obj)/%.so: OBJCOPYFLAGS := -S --remove-section __ex_table
+$(obj)/%.so: $(obj)/%.so.dbg
 	$(call if_changed,objcopy)
 
 $(obj)/vdsox32.so.dbg: $(obj)/vdsox32.lds $(vobjx32s) FORCE
diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
new file mode 100644
index 000000000000..afcf5b65beef
--- /dev/null
+++ b/arch/x86/entry/vdso/extable.c
@@ -0,0 +1,46 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <asm/current.h>
+#include <asm/traps.h>
+#include <asm/vdso.h>
+
+struct vdso_exception_table_entry {
+	int insn, fixup;
+};
+
+bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+			  unsigned long error_code, unsigned long fault_addr)
+{
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	const struct vdso_exception_table_entry *extable;
+	unsigned int nr_entries, i;
+	unsigned long base;
+
+	/*
+	 * Do not attempt to fixup #DB or #BP.  It's impossible to identify
+	 * whether or not a #DB/#BP originated from within an SGX enclave and
+	 * SGX enclaves are currently the only use case for vDSO fixup.
+	 */
+	if (trapnr == X86_TRAP_DB || trapnr == X86_TRAP_BP)
+		return false;
+
+	if (!current->mm->context.vdso)
+		return false;
+
+	base =  (unsigned long)current->mm->context.vdso + image->extable_base;
+	nr_entries = image->extable_len / (sizeof(*extable));
+	extable = image->extable;
+
+	for (i = 0; i < nr_entries; i++) {
+		if (regs->ip == base + extable[i].insn) {
+			regs->ip = base + extable[i].fixup;
+			regs->di = trapnr;
+			regs->si = error_code;
+			regs->dx = fault_addr;
+			return true;
+		}
+	}
+
+	return false;
+}
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
new file mode 100644
index 000000000000..aafdac396948
--- /dev/null
+++ b/arch/x86/entry/vdso/extable.h
@@ -0,0 +1,29 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __VDSO_EXTABLE_H
+#define __VDSO_EXTABLE_H
+
+/*
+ * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
+ * vDSO uses a dedicated handler the addresses are relative to the overall
+ * exception table, not each individual entry.
+ */
+#ifdef __ASSEMBLY__
+#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
+	ASM_VDSO_EXTABLE_HANDLE from to
+
+.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
+	.pushsection __ex_table, "a"
+	.long (\from) - __ex_table
+	.long (\to) - __ex_table
+	.popsection
+.endm
+#else
+#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
+	".pushsection __ex_table, \"a\"\n"      \
+	".long (" #from ") - __ex_table\n"      \
+	".long (" #to ") - __ex_table\n"        \
+	".popsection\n"
+#endif
+
+#endif /* __VDSO_EXTABLE_H */
+
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index 93c6dc7812d0..8ef849064501 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -63,11 +63,18 @@  SECTIONS
 	 * stuff that isn't used at runtime in between.
 	 */
 
-	.text		: { *(.text*) }			:text	=0x90909090,
+	.text		: {
+		*(.text*)
+		*(.fixup)
+	}						:text	=0x90909090,
+
+
 
 	.altinstructions	: { *(.altinstructions) }	:text
 	.altinstr_replacement	: { *(.altinstr_replacement) }	:text
 
+	__ex_table		: { *(__ex_table) }		:text
+
 	/DISCARD/ : {
 		*(.discard)
 		*(.discard.*)
diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index a20b134de2a8..04d04e46c98c 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -5,6 +5,41 @@ 
  * are built for 32-bit userspace.
  */
 
+static void BITSFUNC(copy)(FILE *outfile, const unsigned char *data, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (i % 10 == 0)
+			fprintf(outfile, "\n\t");
+		fprintf(outfile, "0x%02X, ", (int)(data)[i]);
+	}
+}
+
+
+/*
+ * Extract a section from the input data into a standalone blob.  Used to
+ * capture kernel-only data that needs to persist indefinitely, e.g. the
+ * exception fixup tables, but only in the kernel, i.e. the section can
+ * be stripped from the final vDSO image.
+ */
+static void BITSFUNC(extract)(const unsigned char *data, size_t data_len,
+			      FILE *outfile, ELF(Shdr) *sec, const char *name)
+{
+	unsigned long offset;
+	size_t len;
+
+	offset = (unsigned long)GET_LE(&sec->sh_offset);
+	len = (size_t)GET_LE(&sec->sh_size);
+
+	if (offset + len > data_len)
+		fail("section to extract overruns input data");
+
+	fprintf(outfile, "static const unsigned char %s[%lu] = {", name, len);
+	BITSFUNC(copy)(outfile, data + offset, len);
+	fprintf(outfile, "\n};\n\n");
+}
+
 static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 			 void *stripped_addr, size_t stripped_len,
 			 FILE *outfile, const char *image_name)
@@ -14,9 +49,8 @@  static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 	unsigned long mapping_size;
 	ELF(Ehdr) *hdr = (ELF(Ehdr) *)raw_addr;
 	int i;
-	unsigned long j;
 	ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
-		*alt_sec = NULL;
+		*alt_sec = NULL, *extable_sec = NULL;
 	ELF(Dyn) *dyn = 0, *dyn_end = 0;
 	const char *secstrings;
 	INT_BITS syms[NSYMS] = {};
@@ -78,6 +112,8 @@  static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 		if (!strcmp(secstrings + GET_LE(&sh->sh_name),
 			    ".altinstructions"))
 			alt_sec = sh;
+		if (!strcmp(secstrings + GET_LE(&sh->sh_name), "__ex_table"))
+			extable_sec = sh;
 	}
 
 	if (!symtab_hdr)
@@ -150,13 +186,11 @@  static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 	fprintf(outfile,
 		"static unsigned char raw_data[%lu] __ro_after_init __aligned(PAGE_SIZE) = {",
 		mapping_size);
-	for (j = 0; j < stripped_len; j++) {
-		if (j % 10 == 0)
-			fprintf(outfile, "\n\t");
-		fprintf(outfile, "0x%02X, ",
-			(int)((unsigned char *)stripped_addr)[j]);
-	}
+	BITSFUNC(copy)(outfile, stripped_addr, stripped_len);
 	fprintf(outfile, "\n};\n\n");
+	if (extable_sec)
+		BITSFUNC(extract)(raw_addr, raw_len, outfile,
+				  extable_sec, "extable");
 
 	fprintf(outfile, "const struct vdso_image %s = {\n", image_name);
 	fprintf(outfile, "\t.data = raw_data,\n");
@@ -167,6 +201,14 @@  static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 		fprintf(outfile, "\t.alt_len = %lu,\n",
 			(unsigned long)GET_LE(&alt_sec->sh_size));
 	}
+	if (extable_sec) {
+		fprintf(outfile, "\t.extable_base = %lu,\n",
+			(unsigned long)GET_LE(&extable_sec->sh_offset));
+		fprintf(outfile, "\t.extable_len = %lu,\n",
+			(unsigned long)GET_LE(&extable_sec->sh_size));
+		fprintf(outfile, "\t.extable = extable,\n");
+	}
+
 	for (i = 0; i < NSYMS; i++) {
 		if (required_syms[i].export && syms[i])
 			fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n",
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 230474e2ddb5..745300a05f25 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -15,6 +15,8 @@  struct vdso_image {
 	unsigned long size;   /* Always a multiple of PAGE_SIZE */
 
 	unsigned long alt, alt_len;
+	unsigned long extable_base, extable_len;
+	const void *extable;
 
 	long sym_vvar_start;  /* Negative offset to the vvar area */
 
@@ -44,6 +46,9 @@  extern void __init init_vdso_image(const struct vdso_image *image);
 
 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
 
+extern bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+				 unsigned long error_code,
+				 unsigned long fault_addr);
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_X86_VDSO_H */