
[RFC,2/7] x86/sci: add core implementation for system call isolation

Message ID 1556228754-12996-3-git-send-email-rppt@linux.ibm.com (mailing list archive)
State New, archived
Series x86: introduce system calls address space isolation

Commit Message

Mike Rapoport April 25, 2019, 9:45 p.m. UTC
When enabled, system call isolation (SCI) allows execution of system
calls with reduced page tables. These page tables are almost identical
to the user page tables in PTI. The only addition is the code page
containing the system call entry function that will continue execution
after the context switch.

Unlike PTI page tables, there is no sharing at higher levels and all the
hierarchy for SCI page tables is cloned.

The SCI page tables are created when a system call that requires isolation
is executed for the first time.

Whenever a system call should be executed in the isolated environment, the
context is switched to the SCI page tables. Any further access to the
kernel memory will generate a page fault. The page fault handler can verify
that the access is safe and grant it or kill the task otherwise.

The initial SCI implementation allows access to any kernel data, but it
limits access to kernel code in the following way:
* calls and jumps to known code symbols without an offset are allowed
* calls and jumps into a known symbol with an offset are allowed only if
that symbol was already accessed and the offset is within the next page
* all other code accesses are blocked

After the isolated system call finishes, the mappings created during its
execution are cleared.

The entire SCI page table is lazily freed at task exit() time.
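
A minimal sketch of how a page fault handler could apply the code-access
rules above (illustrative only, not the implementation in this patch:
kallsyms_lookup_size_offset() is an existing kernel helper, while
sci_symbol_mapped() stands in for a hypothetical per-task lookup of the
symbols already faulted in during the current syscall, and the offset
check is simplified):

#include <linux/kallsyms.h>
#include <linux/mm.h>
#include <linux/sched.h>

static bool sci_code_access_allowed(struct task_struct *tsk, unsigned long addr)
{
	unsigned long size, offset;

	/* The target must lie inside a known kernel text symbol. */
	if (!kallsyms_lookup_size_offset(addr, &size, &offset))
		return false;

	/* Calls and jumps to the symbol start are always allowed. */
	if (!offset)
		return true;

	/*
	 * An offset into a symbol is allowed only if the symbol was already
	 * accessed and the target is no further than one page beyond what is
	 * already mapped.
	 */
	if (sci_symbol_mapped(tsk, addr - offset) && offset < 2 * PAGE_SIZE)
		return true;

	/* All other code accesses are blocked. */
	return false;
}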

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/include/asm/sci.h |  55 ++++
 arch/x86/mm/Makefile       |   1 +
 arch/x86/mm/init.c         |   2 +
 arch/x86/mm/sci.c          | 608 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h      |   5 +
 include/linux/sci.h        |  12 +
 6 files changed, 683 insertions(+)
 create mode 100644 arch/x86/include/asm/sci.h
 create mode 100644 arch/x86/mm/sci.c
 create mode 100644 include/linux/sci.h

Comments

Peter Zijlstra April 26, 2019, 7:49 a.m. UTC | #1
On Fri, Apr 26, 2019 at 12:45:49AM +0300, Mike Rapoport wrote:
> The initial SCI implementation allows access to any kernel data, but it
> limits access to the code in the following way:
> * calls and jumps to known code symbols without offset are allowed
> * calls and jumps into a known symbol with offset are allowed only if that
> symbol was already accessed and the offset is in the next page
> * all other code access are blocked

So if you have a large function and an in-function jump skips a page
you're toast.

Why not employ the instruction decoder we have and unconditionally allow
all direct JMP/CALL but verify indirect JMP/CALL and RET?
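
Something along these lines, perhaps (an untested sketch: insn_init() and
insn_get_opcode() are the existing decoder entry points, everything else
here is illustrative):

#include <asm/insn.h>

static bool branch_is_direct(void *ip)
{
	struct insn insn;

	/* Assumes MAX_INSN_SIZE bytes at ip are mapped and readable. */
	insn_init(&insn, ip, MAX_INSN_SIZE, 1);
	insn_get_opcode(&insn);

	switch (insn.opcode.bytes[0]) {
	case 0xe8:		/* CALL rel32 */
	case 0xe9:		/* JMP rel32 */
	case 0xeb:		/* JMP rel8 */
		return true;	/* direct transfer, allow unconditionally */
	default:
		/* 0xff (indirect CALL/JMP), 0xc3/0xc2 (RET), ...: verify */
		return false;
	}
}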

Anyway, I fear the overhead of this one; it cannot be fast.
Ingo Molnar April 26, 2019, 8:31 a.m. UTC | #2
* Mike Rapoport <rppt@linux.ibm.com> wrote:

> When enabled, the system call isolation (SCI) would allow execution of 
> the system calls with reduced page tables. These page tables are almost 
> identical to the user page tables in PTI. The only addition is the code 
> page containing system call entry function that will continue 
> exectution after the context switch.
> 
> Unlike PTI page tables, there is no sharing at higher levels and all 
> the hierarchy for SCI page tables is cloned.
> 
> The SCI page tables are created when a system call that requires 
> isolation is executed for the first time.
> 
> Whenever a system call should be executed in the isolated environment, 
> the context is switched to the SCI page tables. Any further access to 
> the kernel memory will generate a page fault. The page fault handler 
> can verify that the access is safe and grant it or kill the task 
> otherwise.
> 
> The initial SCI implementation allows access to any kernel data, but it
> limits access to the code in the following way:
> * calls and jumps to known code symbols without offset are allowed
> * calls and jumps into a known symbol with offset are allowed only if that
> symbol was already accessed and the offset is in the next page
> * all other code access are blocked
> 
> After the isolated system call finishes, the mappings created during its
> execution are cleared.
> 
> The entire SCI page table is lazily freed at task exit() time.

So this basically uses a similar mechanism to the horrendous PTI CR3 
switching overhead whenever a syscall seeks "protection", an overhead 
that is only somewhat mitigated by PCID.

This might work on PTI-encumbered CPUs.

AMD CPUs, meanwhile, don't need PTI - but they don't have PCID either.

So this feature hurts the CPU maker who didn't mess up, and hurts future 
CPUs that don't need PTI...

I really don't like where this is going. In a couple of years I really 
want to be able to think of PTI as a bad dream that is, fortunately, 
mostly over.

I have the feeling that compiler-level protection that avoids corrupting 
the stack in the first place is going to be lower overhead, and would 
work in a much broader range of environments. Do we have an analysis of 
what the compiler would have to do to prevent most ROP attacks, and what 
the runtime cost of that is?

I mean, C# and Java programs aren't able to corrupt the stack as long as 
the language runtime is correct. Has to be possible, right?

Thanks,

	Ingo
Ingo Molnar April 26, 2019, 9:58 a.m. UTC | #3
* Ingo Molnar <mingo@kernel.org> wrote:

> I really don't like it where this is going. In a couple of years I 
> really want to be able to think of PTI as a bad dream that is mostly 
> over fortunately.
> 
> I have the feeling that compiler level protection that avoids 
> corrupting the stack in the first place is going to be lower overhead, 
> and would work in a much broader range of environments. Do we have 
> analysis of what the compiler would have to do to prevent most ROP 
> attacks, and what the runtime cost of that is?
> 
> I mean, C# and Java programs aren't able to corrupt the stack as long 
> as the language runtime is corect. Has to be possible, right?

So if such a security feature is offered then I'm afraid distros would be 
strongly inclined to enable it - saying 'yes' to a kernel feature that 
can keep your product off CVE advisories is a strong force.

To phrase the argument in a bit more controversial form:

   If the price of Linux using an insecure C runtime is to slow down 
   system calls with immense PTI-alike runtime costs, then wouldn't it be 
   the right technical decision to write the kernel in a language runtime 
   that doesn't allow stack overflows and such?

I.e. if having Linux in C ends up being slower than having it in Java, 
then what's the performance argument in favor of using C to begin with? 
;-)

And no, I'm not arguing for Java or C#, but I am arguing for a saner 
version of C.

Thanks,

	Ingo
James Bottomley April 26, 2019, 2:44 p.m. UTC | #4
On Fri, 2019-04-26 at 10:31 +0200, Ingo Molnar wrote:
> * Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> > When enabled, the system call isolation (SCI) would allow execution
> > of the system calls with reduced page tables. These page tables are
> > almost identical to the user page tables in PTI. The only addition
> > is the code page containing system call entry function that will
> > continue exectution after the context switch.
> > 
> > Unlike PTI page tables, there is no sharing at higher levels and
> > all the hierarchy for SCI page tables is cloned.
> > 
> > The SCI page tables are created when a system call that requires 
> > isolation is executed for the first time.
> > 
> > Whenever a system call should be executed in the isolated
> > environment, the context is switched to the SCI page tables. Any
> > further access to the kernel memory will generate a page fault. The
> > page fault handler can verify that the access is safe and grant it
> > or kill the task otherwise.
> > 
> > The initial SCI implementation allows access to any kernel data,
> > but it limits access to the code in the following way:
> > * calls and jumps to known code symbols without offset are allowed
> > * calls and jumps into a known symbol with offset are allowed only
> > if that symbol was already accessed and the offset is in the next
> > page 
> > * all other code access are blocked
> > 
> > After the isolated system call finishes, the mappings created
> > during its execution are cleared.
> > 
> > The entire SCI page table is lazily freed at task exit() time.
> 
> So this basically uses a similar mechanism to the horrendous PTI CR3 
> switching overhead whenever a syscall seeks "protection", which
> overhead is only somewhat mitigated by PCID.
> 
> This might work on PTI-encumbered CPUs.
> 
> While AMD CPUs don't need PTI, nor do they have PCID.
> 
> So this feature is hurting the CPU maker who didn't mess up, and is 
> hurting future CPUs that don't need PTI ..
> 
> I really don't like it where this is going. In a couple of years I
> really  want to be able to think of PTI as a bad dream that is mostly
> over  fortunately.

Perhaps ROP gadgets were a bad first example.  The research objective of
the current patch set is really to investigate eliminating sandboxing
for containers.  As you know, current sandboxes like gVisor and Nabla
try to reduce the exposure to horizontal exploits (ability of an
untrusted tenant to exploit the shared kernel to attack another tenant)
by running significant chunks of kernel emulation code in userspace to
reduce exposure of the tenant to code in the shared kernel.  The price
paid for this is pretty horrendous in performance terms, but the
benefit is multi-tenant safety.

The question we were looking into is whether, if we used per-tenant
in-kernel address space isolation to improve the security of kernel
system calls such that either the exploit becomes detectable or its
consequences bounce back only on the tenant attempting it, we could
eliminate the emulation for that system call and instead pass it through
to the kernel, thus thinning out the sandbox layer without losing the
security benefits.

We are looking at other aspects as well, such as whether we can simply
run chunks of the kernel in the user's address space as the sandbox
emulation currently does, or whether we can hide a tenant's data objects
such that they're not easily accessible from an exploited kernel.

James
Dave Hansen April 26, 2019, 2:46 p.m. UTC | #5
On 4/25/19 2:45 PM, Mike Rapoport wrote:
> After the isolated system call finishes, the mappings created during its
> execution are cleared.

Yikes.  I guess that stops someone from calling write() a bunch of times
on every filesystem using every block device driver and all the DM code
to get a lot of code/data faulted in.  But, it also means not even
long-running processes will ever have a chance of behaving anything
close to normally.

Is this something you think can be rectified or is there something
fundamental that would keep SCI page tables from being cached across
different invocations of the same syscall?
James Bottomley April 26, 2019, 2:57 p.m. UTC | #6
On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > After the isolated system call finishes, the mappings created
> > during its execution are cleared.
> 
> Yikes.  I guess that stops someone from calling write() a bunch of
> times on every filesystem using every block device driver and all the
> DM code to get a lot of code/data faulted in.  But, it also means not
> even long-running processes will ever have a chance of behaving
> anything close to normally.
> 
> Is this something you think can be rectified or is there something
> fundamental that would keep SCI page tables from being cached across
> different invocations of the same syscall?

There is some work being done to look at pre-populating the isolated
address space with the expected execution footprint of the system call,
yes.  It lessens the ROP gadget protection slightly because you might
find a gadget in the pre-populated code, but it solves a lot of the
overhead problem.

James
Andy Lutomirski April 26, 2019, 3:07 p.m. UTC | #7
> On Apr 26, 2019, at 7:57 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
> 
>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>> After the isolated system call finishes, the mappings created
>>> during its execution are cleared.
>> 
>> Yikes.  I guess that stops someone from calling write() a bunch of
>> times on every filesystem using every block device driver and all the
>> DM code to get a lot of code/data faulted in.  But, it also means not
>> even long-running processes will ever have a chance of behaving
>> anything close to normally.
>> 
>> Is this something you think can be rectified or is there something
>> fundamental that would keep SCI page tables from being cached across
>> different invocations of the same syscall?
> 
> There is some work being done to look at pre-populating the isolated
> address space with the expected execution footprint of the system call,
> yes.  It lessens the ROP gadget protection slightly because you might
> find a gadget in the pre-populated code, but it solves a lot of the
> overhead problem.
> 

I’m not even remotely a ROP expert, but: what stops a ROP payload from using all the “fault-in” gadgets that exist — any function that can return on an error without doing too much will fault in the whole page containing the function.

To improve this, we would want something that would try to check whether the caller is actually supposed to call the callee, which is more or less the hard part of CFI.  So can’t we just do CFI and call it a day?

On top of that, a robust, maintainable implementation of this thing seems very complicated — for example, what happens if vfree() gets called?
James Bottomley April 26, 2019, 3:19 p.m. UTC | #8
On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
> > On Apr 26, 2019, at 7:57 AM, James Bottomley <James.Bottomley@hanse
> > npartnership.com> wrote:
> > 
> > > On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> > > > On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > > > After the isolated system call finishes, the mappings created
> > > > during its execution are cleared.
> > > 
> > > Yikes.  I guess that stops someone from calling write() a bunch
> > > of times on every filesystem using every block device driver and
> > > all the DM code to get a lot of code/data faulted in.  But, it
> > > also means not even long-running processes will ever have a
> > > chance of behaving anything close to normally.
> > > 
> > > Is this something you think can be rectified or is there
> > > something fundamental that would keep SCI page tables from being
> > > cached across different invocations of the same syscall?
> > 
> > There is some work being done to look at pre-populating the
> > isolated address space with the expected execution footprint of the
> > system call, yes.  It lessens the ROP gadget protection slightly
> > because you might find a gadget in the pre-populated code, but it
> > solves a lot of the overhead problem.
> > 
> 
> I’m not even remotely a ROP expert, but: what stops a ROP payload
> from using all the “fault-in” gadgets that exist — any function that
> can return on an error without doing to much will fault in the whole
> page containing the function.

The address space pre-population is still per syscall, so you don't get
access to the code footprint of a different syscall.  So the isolated
address space is created anew for every system call; it's just
pre-populated with that system call's expected footprint.

> To improve this, we would want some thing that would try to check
> whether the caller is actually supposed to call the callee, which is
> more or less the hard part of CFI.  So can’t we just do CFI and call
> it a day?

By CFI you mean control flow integrity?  In theory I believe so, yes,
but in practice doesn't it require a lot of semantic object information,
which is easy to get from higher-level languages like Java but a bit
more difficult for plain C?

> On top of that, a robust, maintainable implementation of this thing
> seems very complicated — for example, what happens if vfree() gets
> called?

Address-space-local vs. global object tracking is another thing on our
list.  What we'd probably do is verify that the global object was allowed
to be freed and then hand it off safely to the main kernel address space.

James
Andy Lutomirski April 26, 2019, 5:40 p.m. UTC | #9
> On Apr 26, 2019, at 8:19 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
> 
> On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
>>> On Apr 26, 2019, at 7:57 AM, James Bottomley <James.Bottomley@hanse
>>> npartnership.com> wrote:
>>> 
>>>>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>>>> After the isolated system call finishes, the mappings created
>>>>> during its execution are cleared.
>>>> 
>>>> Yikes.  I guess that stops someone from calling write() a bunch
>>>> of times on every filesystem using every block device driver and
>>>> all the DM code to get a lot of code/data faulted in.  But, it
>>>> also means not even long-running processes will ever have a
>>>> chance of behaving anything close to normally.
>>>> 
>>>> Is this something you think can be rectified or is there
>>>> something fundamental that would keep SCI page tables from being
>>>> cached across different invocations of the same syscall?
>>> 
>>> There is some work being done to look at pre-populating the
>>> isolated address space with the expected execution footprint of the
>>> system call, yes.  It lessens the ROP gadget protection slightly
>>> because you might find a gadget in the pre-populated code, but it
>>> solves a lot of the overhead problem.
>> 
>> I’m not even remotely a ROP expert, but: what stops a ROP payload
>> from using all the “fault-in” gadgets that exist — any function that
>> can return on an error without doing to much will fault in the whole
>> page containing the function.
> 
> The address space pre-population is still per syscall, so you don't get
> access to the code footprint of a different syscall.  So the isolated
> address space is created anew for every system call, it's just pre-
> populated with that system call's expected footprint.

That’s not what I mean. Suppose I want to use a ROP gadget in vmalloc(), but vmalloc isn’t in the page tables. Then first push vmalloc itself onto the stack. As long as RDI contains a sufficiently ridiculous value, it should just return without doing anything. And it can return right back into the ROP gadget, which is now available.

> 
>> To improve this, we would want some thing that would try to check
>> whether the caller is actually supposed to call the callee, which is
>> more or less the hard part of CFI.  So can’t we just do CFI and call
>> it a day?
> 
> By CFI you mean control flow integrity?  In theory I believe so, yes,
> but in practice doesn't it require a lot of semantic object information
> which is easy to get from higher level languages like java but a bit
> more difficult for plain C.

Yes. As I understand it, grsecurity instruments gcc to create some kind of hash of all function signatures. Then any indirect call can effectively verify that it’s calling a function of the right type. And every return verifies a cookie.
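
(A conceptual sketch of that kind of check, not grsecurity's actual code:
the compiler would emit a per-signature hash just before each function and
rewrite every indirect call site to check it, roughly like this.)

#include <linux/bug.h>

/* Hypothetical compiler-generated constant for the "long (*)(void *)" signature. */
#define CFI_HASH_HANDLER	0x7a3c19e5u

typedef long (*handler_t)(void *arg);

static long checked_indirect_call(handler_t fn, void *arg)
{
	/* The hash is assumed to sit immediately before the function entry point. */
	unsigned int target_hash = *((const unsigned int *)fn - 1);

	if (target_hash != CFI_HASH_HANDLER)
		BUG();		/* control-flow violation */

	return fn(arg);
}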

On CET CPUs, RET gets checked directly, and I don’t see the benefit of SCI.

> 
>> On top of that, a robust, maintainable implementation of this thing
>> seems very complicated — for example, what happens if vfree() gets
>> called?
> 
> Address space Local vs global object tracking is another thing on our
> list.  What we'd probably do is verify the global object was allowed to
> be freed and then hand it off safely to the main kernel address space.
> 
> 

This seems exceedingly complicated.
James Bottomley April 26, 2019, 6:49 p.m. UTC | #10
On Fri, 2019-04-26 at 10:40 -0700, Andy Lutomirski wrote:
> > On Apr 26, 2019, at 8:19 AM, James Bottomley <James.Bottomley@hanse
> > npartnership.com> wrote:
> > 
> > On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
> > > > On Apr 26, 2019, at 7:57 AM, James Bottomley
> > > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > 
> > > > > > On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> > > > > > On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > > > > > After the isolated system call finishes, the mappings
> > > > > > created during its execution are cleared.
> > > > > 
> > > > > Yikes.  I guess that stops someone from calling write() a
> > > > > bunch of times on every filesystem using every block device
> > > > > driver and all the DM code to get a lot of code/data faulted
> > > > > in.  But, it also means not even long-running processes will
> > > > > ever have a chance of behaving anything close to normally.
> > > > > 
> > > > > Is this something you think can be rectified or is there
> > > > > something fundamental that would keep SCI page tables from
> > > > > being cached across different invocations of the same
> > > > > syscall?
> > > > 
> > > > There is some work being done to look at pre-populating the
> > > > isolated address space with the expected execution footprint of
> > > > the system call, yes.  It lessens the ROP gadget protection
> > > > slightly because you might find a gadget in the pre-populated
> > > > code, but it solves a lot of the overhead problem.
> > > 
> > > I’m not even remotely a ROP expert, but: what stops a ROP payload
> > > from using all the “fault-in” gadgets that exist — any function
> > > that can return on an error without doing to much will fault in
> > > the whole page containing the function.
> > 
> > The address space pre-population is still per syscall, so you don't
> > get access to the code footprint of a different syscall.  So the
> > isolated address space is created anew for every system call, it's
> > just pre-populated with that system call's expected footprint.
> 
> That’s not what I mean. Suppose I want to use a ROP gadget in
> vmalloc(), but vmalloc isn’t in the page tables. Then first push
> vmalloc itself into the stack. As long as RDI contains a sufficiently
> ridiculous value, it should just return without doing anything. And
> it can return right back into the ROP gadget, which is now available.

Yes, it's not perfect, but stack space for a smashing attack is at a
premium, and now you need two stack frames for every gadget you chain
instead of one, so we've halved your ability to chain gadgets.

> > > To improve this, we would want some thing that would try to check
> > > whether the caller is actually supposed to call the callee, which
> > > is more or less the hard part of CFI.  So can’t we just do CFI
> > > and call it a day?
> > 
> > By CFI you mean control flow integrity?  In theory I believe so,
> > yes, but in practice doesn't it require a lot of semantic object
> > information which is easy to get from higher level languages like
> > java but a bit more difficult for plain C.
> 
> Yes. As I understand it, grsecurity instruments gcc to create some
> kind of hash of all function signatures. Then any indirect call can
> effectively verify that it’s calling a function of the right type.
> And every return verified a cookie.
> 
> On CET CPUs, RET gets checked directly, and I don’t see the benefit
> of SCI.

Presumably you know something I don't, but I thought CET CPUs had been
planned for release for ages but not actually released yet?

> > > On top of that, a robust, maintainable implementation of this
> > > thing seems very complicated — for example, what happens if
> > > vfree() gets called?
> > 
> > Address space Local vs global object tracking is another thing on
> > our list.  What we'd probably do is verify the global object was
> > allowed to be freed and then hand it off safely to the main kernel
> > address space.
> 
> This seems exceedingly complicated.

It's a research project: we're exploring what's possible so we can
choose the techniques that give the best security improvement for the
additional overhead.

James
Andy Lutomirski April 26, 2019, 7:22 p.m. UTC | #11
> On Apr 26, 2019, at 11:49 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
> 
> On Fri, 2019-04-26 at 10:40 -0700, Andy Lutomirski wrote:
>>> On Apr 26, 2019, at 8:19 AM, James Bottomley <James.Bottomley@hanse
>>> npartnership.com> wrote:
>>> 
>>> On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
>>>>> On Apr 26, 2019, at 7:57 AM, James Bottomley
>>>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>>> 
>>>>>>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>>>>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>>>>>> After the isolated system call finishes, the mappings
>>>>>>> created during its execution are cleared.
>>>>>> 
>>>>>> Yikes.  I guess that stops someone from calling write() a
>>>>>> bunch of times on every filesystem using every block device
>>>>>> driver and all the DM code to get a lot of code/data faulted
>>>>>> in.  But, it also means not even long-running processes will
>>>>>> ever have a chance of behaving anything close to normally.
>>>>>> 
>>>>>> Is this something you think can be rectified or is there
>>>>>> something fundamental that would keep SCI page tables from
>>>>>> being cached across different invocations of the same
>>>>>> syscall?
>>>>> 
>>>>> There is some work being done to look at pre-populating the
>>>>> isolated address space with the expected execution footprint of
>>>>> the system call, yes.  It lessens the ROP gadget protection
>>>>> slightly because you might find a gadget in the pre-populated
>>>>> code, but it solves a lot of the overhead problem.
>>>> 
>>>> I’m not even remotely a ROP expert, but: what stops a ROP payload
>>>> from using all the “fault-in” gadgets that exist — any function
>>>> that can return on an error without doing to much will fault in
>>>> the whole page containing the function.
>>> 
>>> The address space pre-population is still per syscall, so you don't
>>> get access to the code footprint of a different syscall.  So the
>>> isolated address space is created anew for every system call, it's
>>> just pre-populated with that system call's expected footprint.
>> 
>> That’s not what I mean. Suppose I want to use a ROP gadget in
>> vmalloc(), but vmalloc isn’t in the page tables. Then first push
>> vmalloc itself into the stack. As long as RDI contains a sufficiently
>> ridiculous value, it should just return without doing anything. And
>> it can return right back into the ROP gadget, which is now available.
> 
> Yes, it's not perfect, but stack space for a smashing attack is at a
> premium and now you need two stack frames for every gadget you chain
> instead of one so we've halved your ability to chain gadgets.
> 
>>>> To improve this, we would want some thing that would try to check
>>>> whether the caller is actually supposed to call the callee, which
>>>> is more or less the hard part of CFI.  So can’t we just do CFI
>>>> and call it a day?
>>> 
>>> By CFI you mean control flow integrity?  In theory I believe so,
>>> yes, but in practice doesn't it require a lot of semantic object
>>> information which is easy to get from higher level languages like
>>> java but a bit more difficult for plain C.
>> 
>> Yes. As I understand it, grsecurity instruments gcc to create some
>> kind of hash of all function signatures. Then any indirect call can
>> effectively verify that it’s calling a function of the right type.
>> And every return verified a cookie.
>> 
>> On CET CPUs, RET gets checked directly, and I don’t see the benefit
>> of SCI.
> 
> Presumably you know something I don't but I thought CET CPUs had been
> planned for release for ages, but not actually released yet?

I don’t know any secrets about this, but I don’t think it’s released. Last I checked, it didn’t even have a final public spec.

> 
>>>> On top of that, a robust, maintainable implementation of this
>>>> thing seems very complicated — for example, what happens if
>>>> vfree() gets called?
>>> 
>>> Address space Local vs global object tracking is another thing on
>>> our list.  What we'd probably do is verify the global object was
>>> allowed to be freed and then hand it off safely to the main kernel
>>> address space.
>> 
>> This seems exceedingly complicated.
> 
> It's a research project: we're exploring what's possible so we can
> choose the techniques that give the best security improvement for the
> additional overhead.
> 

:)
Andy Lutomirski April 26, 2019, 9:26 p.m. UTC | #12
> On Apr 26, 2019, at 2:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> I really don't like it where this is going. In a couple of years I
>> really want to be able to think of PTI as a bad dream that is mostly
>> over fortunately.
>>
>> I have the feeling that compiler level protection that avoids
>> corrupting the stack in the first place is going to be lower overhead,
>> and would work in a much broader range of environments. Do we have
>> analysis of what the compiler would have to do to prevent most ROP
>> attacks, and what the runtime cost of that is?
>>
>> I mean, C# and Java programs aren't able to corrupt the stack as long
>> as the language runtime is corect. Has to be possible, right?
>
> So if such security feature is offered then I'm afraid distros would be
> strongly inclined to enable it - saying 'yes' to a kernel feature that
> can keep your product off CVE advisories is a strong force.
>
> To phrase the argument in a bit more controversial form:
>
>   If the price of Linux using an insecure C runtime is to slow down
>   system calls with immense PTI-alike runtime costs, then wouldn't it be
>   the right technical decision to write the kernel in a language runtime
>   that doesn't allow stack overflows and such?
>
> I.e. if having Linux in C ends up being slower than having it in Java,
> then what's the performance argument in favor of using C to begin with?
> ;-)
>
> And no, I'm not arguing for Java or C#, but I am arguing for a saner
> version of C.
>
>

IMO there are three credible choices:

1. C with fairly strong CFI protection. Grsecurity has this (supposedly
— there’s a distinct lack of source code available), and clang is
gradually working on it.

2. A safe language for parts of the kernel, e.g. drivers and maybe
eventually filesystems.  Rust is probably the only credible candidate.
Actually creating a decent Rust wrapper around the core kernel
facilities would be quite a bit of work.  Things like sysfs would be
interesting in Rust, since AFAIK few or even no drivers actually get
the locking fully correct.  This means that naive users of the API
cannot port directly to safe Rust, because all the races won't compile
:)

3. A sandbox for parts of the kernel, e.g. drivers.  The obvious
candidates are eBPF and WASM.

#2 will give very good performance.  #3 gives potentially stronger
protection against a sandboxed component corrupting the kernel
overall, but it gives much weaker protection against a sandboxed
component corrupting itself.

In an ideal world, we could do #2 *and* #3.  Drivers could, for
example, be written in a language like Rust, compiled to WASM, and run
in the kernel.
Ingo Molnar April 27, 2019, 8:47 a.m. UTC | #13
* Andy Lutomirski <luto@kernel.org> wrote:

> > On Apr 26, 2019, at 2:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> >> I really don't like it where this is going. In a couple of years I
> >> really want to be able to think of PTI as a bad dream that is mostly
> >> over fortunately.
> >>
> >> I have the feeling that compiler level protection that avoids
> >> corrupting the stack in the first place is going to be lower overhead,
> >> and would work in a much broader range of environments. Do we have
> >> analysis of what the compiler would have to do to prevent most ROP
> >> attacks, and what the runtime cost of that is?
> >>
> >> I mean, C# and Java programs aren't able to corrupt the stack as long
> >> as the language runtime is corect. Has to be possible, right?
> >
> > So if such security feature is offered then I'm afraid distros would be
> > strongly inclined to enable it - saying 'yes' to a kernel feature that
> > can keep your product off CVE advisories is a strong force.
> >
> > To phrase the argument in a bit more controversial form:
> >
> >   If the price of Linux using an insecure C runtime is to slow down
> >   system calls with immense PTI-alike runtime costs, then wouldn't it be
> >   the right technical decision to write the kernel in a language runtime
> >   that doesn't allow stack overflows and such?
> >
> > I.e. if having Linux in C ends up being slower than having it in Java,
> > then what's the performance argument in favor of using C to begin with?
> > ;-)
> >
> > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > version of C.
> >
> >
> 
> IMO three are three credible choices:
> 
> 1. C with fairly strong CFI protection. Grsecurity has this (supposedly 
> — there’s a distinct lack of source code available), and clang is 
> gradually working on it.
> 
> 2. A safe language for parts of the kernel, e.g. drivers and maybe 
> eventually filesystems.  Rust is probably the only credible candidate. 
> Actually creating a decent Rust wrapper around the core kernel 
> facilities would be quite a bit of work.  Things like sysfs would be 
> interesting in Rust, since AFAIK few or even no drivers actually get 
> the locking fully correct.  This means that naive users of the API 
> cannot port directly to safe Rust, because all the races won't compile
> :)
> 
> 3. A sandbox for parts of the kernel, e.g. drivers.  The obvious 
> candidates are eBPF and WASM.
> 
> #2 will give very good performance.  #3 gives potentially stronger
> protection against a sandboxed component corrupting the kernel overall, 
> but it gives much weaker protection against a sandboxed component 
> corrupting itself.
> 
> In an ideal world, we could do #2 *and* #3.  Drivers could, for 
> example, be written in a language like Rust, compiled to WASM, and run 
> in the kernel.

So why not go for #1, which would still outperform #2/#3, right? Do we 
know what it would take, roughly, and what the runtime overhead looks 
like?

Thanks,

	Ingo
Ingo Molnar April 27, 2019, 10:46 a.m. UTC | #14
* Ingo Molnar <mingo@kernel.org> wrote:

> * Andy Lutomirski <luto@kernel.org> wrote:
> 
> > > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > > version of C.
> > 
> > IMO three are three credible choices:
> > 
> > 1. C with fairly strong CFI protection. Grsecurity has this (supposedly 
> > — there’s a distinct lack of source code available), and clang is 
> > gradually working on it.
> > 
> > 2. A safe language for parts of the kernel, e.g. drivers and maybe 
> > eventually filesystems.  Rust is probably the only credible candidate. 
> > Actually creating a decent Rust wrapper around the core kernel 
> > facilities would be quite a bit of work.  Things like sysfs would be 
> > interesting in Rust, since AFAIK few or even no drivers actually get 
> > the locking fully correct.  This means that naive users of the API 
> > cannot port directly to safe Rust, because all the races won't compile
> > :)
> > 
> > 3. A sandbox for parts of the kernel, e.g. drivers.  The obvious 
> > candidates are eBPF and WASM.
> > 
> > #2 will give very good performance.  #3 gives potentially stronger
> > protection against a sandboxed component corrupting the kernel overall, 
> > but it gives much weaker protection against a sandboxed component 
> > corrupting itself.
> > 
> > In an ideal world, we could do #2 *and* #3.  Drivers could, for 
> > example, be written in a language like Rust, compiled to WASM, and run 
> > in the kernel.
> 
> So why not go for #1, which would still outperform #2/#3, right? Do we 
> know what it would take, roughly, and how the runtime overhead looks 
> like?

BTW., CFI protection is in essence a compiler (or hardware) technique to 
detect stack frame or function pointer corruption after the fact.

So I'm wondering whether there's a 4th choice as well, which avoids 
control flow corruption *before* it happens:

 - A C language runtime that is a subset of current C syntax and 
   semantics used in the kernel, and which doesn't allow access outside 
   of existing objects and thus creates a strictly enforced separation 
   between memory used for data, and memory used for code and control 
   flow.

 - This would involve, at minimum:

    - tracking every type and object and its inherent length and valid 
      access patterns, and never losing track of its type.

    - being a lot more organized about initialization, i.e. no 
      uninitialized variables/fields.

    - being a lot more strict about type conversions and pointers in 
      general.

    - ... and a metric ton of other details.

 - If such a runtime could co-exist with regular C kernel code without 
   big complications, then we could convert particular pieces of C code 
   into this safe-C runtime step by step, and also allow compiling a 
   piece of code either as regular C or into the safe runtime.

 - If a particular function can be formally proven to be safe, it can be 
   compiled as C - otherwise it would be compiled as safe-C.

 - ... or something like this.

The advantage would be: data corruption could never be triggered by code 
itself, if the compiler and runtime are correct. Return addresses and 
stacks wouldn't have to be 'hardened' or 'checked', because they'd never 
be corrupted in the first place. WX memory wouldn't be an issue as kernel 
code could never jump into generated shellcode or ROP gadgets.

The disadvantage: the overhead of managing this, and any loss of 
flexibility on the kernel programming side.

Does this make sense, and if yes, does such a project exist already?
(And no, I don't mean Java or C#.)

Or would we in essence end up with a Java runtime, with C syntax?

Thanks,

	Ingo
Mike Rapoport April 28, 2019, 5:45 a.m. UTC | #15
On Fri, Apr 26, 2019 at 09:49:56AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 26, 2019 at 12:45:49AM +0300, Mike Rapoport wrote:
> > The initial SCI implementation allows access to any kernel data, but it
> > limits access to the code in the following way:
> > * calls and jumps to known code symbols without offset are allowed
> > * calls and jumps into a known symbol with offset are allowed only if that
> > symbol was already accessed and the offset is in the next page
> > * all other code access are blocked
> 
> So if you have a large function and an in-function jump skips a page
> you're toast.

Right :(
 
> Why not employ the instruction decoder we have and unconditionally allow
> all direct JMP/CALL but verify indirect JMP/CALL and RET ?

Apparently I didn't dig deep enough to find the instruction decoder :)
Surely I can use it.

> Anyway, I'm fearing the overhead of this one, this cannot be fast.

Well, I think that the verification itself is not what will slow things
down the most. IMHO, the major overhead comes from the CR3 switch.
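
For reference, a simplified sketch of what that switch boils down to
(illustrative only; PCID/NOFLUSH handling and the actual code in this
series are more involved):

#include <linux/mm.h>

static inline void sci_switch_cr3(pgd_t *sci_pgd)
{
	unsigned long cr3 = __pa(sci_pgd);

	/* A plain CR3 write serializes and, without PCID, flushes the TLB. */
	asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
}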
James Morris April 29, 2019, 6:26 p.m. UTC | #16
On Sat, 27 Apr 2019, Ingo Molnar wrote:

>  - A C language runtime that is a subset of current C syntax and 
>    semantics used in the kernel, and which doesn't allow access outside 
>    of existing objects and thus creates a strictly enforced separation 
>    between memory used for data, and memory used for code and control 
>    flow.

Might be better to start with Rust.
Andy Lutomirski April 29, 2019, 6:43 p.m. UTC | #17
On Mon, Apr 29, 2019 at 11:27 AM James Morris <jmorris@namei.org> wrote:
>
> On Sat, 27 Apr 2019, Ingo Molnar wrote:
>
> >  - A C language runtime that is a subset of current C syntax and
> >    semantics used in the kernel, and which doesn't allow access outside
> >    of existing objects and thus creates a strictly enforced separation
> >    between memory used for data, and memory used for code and control
> >    flow.
>
> Might be better to start with Rust.
>

I think that Rust would be the clear winner as measured by how fun it sounds :)
Andy Lutomirski April 29, 2019, 6:46 p.m. UTC | #18
On Sat, Apr 27, 2019 at 3:46 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> > > > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > > > version of C.
> > >
> > > IMO three are three credible choices:
> > >
> > > 1. C with fairly strong CFI protection. Grsecurity has this (supposedly
> > > — there’s a distinct lack of source code available), and clang is
> > > gradually working on it.
> > >
> > > 2. A safe language for parts of the kernel, e.g. drivers and maybe
> > > eventually filesystems.  Rust is probably the only credible candidate.
> > > Actually creating a decent Rust wrapper around the core kernel
> > > facilities would be quite a bit of work.  Things like sysfs would be
> > > interesting in Rust, since AFAIK few or even no drivers actually get
> > > the locking fully correct.  This means that naive users of the API
> > > cannot port directly to safe Rust, because all the races won't compile
> > > :)
> > >
> > > 3. A sandbox for parts of the kernel, e.g. drivers.  The obvious
> > > candidates are eBPF and WASM.
> > >
> > > #2 will give very good performance.  #3 gives potentially stronger
> > > protection against a sandboxed component corrupting the kernel overall,
> > > but it gives much weaker protection against a sandboxed component
> > > corrupting itself.
> > >
> > > In an ideal world, we could do #2 *and* #3.  Drivers could, for
> > > example, be written in a language like Rust, compiled to WASM, and run
> > > in the kernel.
> >
> > So why not go for #1, which would still outperform #2/#3, right? Do we
> > know what it would take, roughly, and how the runtime overhead looks
> > like?
>
> BTW., CFI protection is in essence a compiler (or hardware) technique to
> detect stack frame or function pointer corruption after the fact.
>
> So I'm wondering whether there's a 4th choice as well, which avoids
> control flow corruption *before* it happens:
>
>  - A C language runtime that is a subset of current C syntax and
>    semantics used in the kernel, and which doesn't allow access outside
>    of existing objects and thus creates a strictly enforced separation
>    between memory used for data, and memory used for code and control
>    flow.
>
>  - This would involve, at minimum:
>
>     - tracking every type and object and its inherent length and valid
>       access patterns, and never losing track of its type.
>
>     - being a lot more organized about initialization, i.e. no
>       uninitialized variables/fields.
>
>     - being a lot more strict about type conversions and pointers in
>       general.

You're not the only one to suggest this.  There are at least a few
things that make this extremely difficult if not impossible.  For
example, consider this code:

void maybe_buggy(void)
{
  int a, b;
  int *p = &a;
  int *q = (int *)some_function((unsigned long)p);
  *q = 1;
}

If some_function(&a) returns &a, then all is well.  But if
some_function(&a) returns &b or even a valid address of some unrelated
kernel object, then the code might be entirely valid and correct C,
but I don't see how the runtime checks are supposed to tell whether
the resulting address is valid or is a bug.  This type of code is, I
think, quite common in the kernel -- it happens in every data
structure where we have unions of pointers and integers or where we
steal some known-zero bits of a pointer to store something else.

--Andy
Ingo Molnar April 30, 2019, 5:03 a.m. UTC | #19
* Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Apr 27, 2019 at 3:46 AM Ingo Molnar <mingo@kernel.org> wrote:

> > So I'm wondering whether there's a 4th choice as well, which avoids
> > control flow corruption *before* it happens:
> >
> >  - A C language runtime that is a subset of current C syntax and
> >    semantics used in the kernel, and which doesn't allow access outside
> >    of existing objects and thus creates a strictly enforced separation
> >    between memory used for data, and memory used for code and control
> >    flow.
> >
> >  - This would involve, at minimum:
> >
> >     - tracking every type and object and its inherent length and valid
> >       access patterns, and never losing track of its type.
> >
> >     - being a lot more organized about initialization, i.e. no
> >       uninitialized variables/fields.
> >
> >     - being a lot more strict about type conversions and pointers in
> >       general.
> 
> You're not the only one to suggest this.  There are at least a few
> things that make this extremely difficult if not impossible.  For
> example, consider this code:
> 
> void maybe_buggy(void)
> {
>   int a, b;
>   int *p = &a;
>   int *q = (int *)some_function((unsigned long)p);
>   *q = 1;
> }
> 
> If some_function(&a) returns &a, then all is well.  But if
> some_function(&a) returns &b or even a valid address of some unrelated
> kernel object, then the code might be entirely valid and correct C,
> but I don't see how the runtime checks are supposed to tell whether
> the resulting address is valid or is a bug.  This type of code is, I
> think, quite common in the kernel -- it happens in every data
> structure where we have unions of pointers and integers or where we
> steal some known-zero bits of a pointer to store something else.

So the thing is, for the infinitely large state space of "valid C code" 
we already disallow infinitely many variants in the Linux kernel.

We have complicated rules that disallow certain C syntactical and 
semantical constructs, both on the tooling (build failure/warning) and on 
the review (style/taste) level.

So the question IMHO isn't whether it's "valid C", because we already 
have the Linux kernel's own C syntax variant and are enforcing it with 
varying degrees of success.

The question is whether the example you gave can be written in a strongly 
typed fashion, whether it makes sense to do so, and what the costs are.

I think it's evident that it can be written with strongly typed 
constructs, by separating pointers from embedded error codes - with 
negative side effects to code generation: for example it increases 
structure sizes and error return paths.

I think there are four main costs of converting such a pattern to strongly 
typed constructs:

 - memory/cache footprint:  there's a nonzero cost there.
 - performance:             this will hurt too.
 - code readability:        this will probably improve.
 - code robustness:         this will improve too.

So I think the proper question to ask is not whether there's common C 
syntax within the kernel that would have to be rewritten, but whether the 
total sum of memory and runtime overhead of strongly typed C programming 
(if it's possible/desirable) is larger than the total sum of a typical 
Linux distro enabling the various current and proposed kernel hardening 
features that have a runtime overhead:

 - the SMAP/SMEP overhead of STAC/CLAC for every single user copy

 - other usercopy hardening features

 - stackprotector

 - KASLR

 - compiler plugins against information leaks

 - proposed KASLR extension to implement module randomization and -PIE overhead

 - proposed function call integrity checks

 - proposed per system call kernel stack offset randomization

 - ( and I'm sure I forgot about a few more, and it's all still only 
     reactive security, not proactive security. )

That's death by a thousand cuts and CR3 switching during system calls is 
also throwing a hand grenade into the fight ;-)

So if people are also proposing to do CR3 switches in every system call, 
I'm pretty sure the answer is "yes, even a managed C runtime is probably 
faster than *THAT* sum of a performance mess" - at least with the current 
CR3 switching x86-uarch cost structure...

Thanks,

	Ingo
Peter Zijlstra April 30, 2019, 9:38 a.m. UTC | #20
On Tue, Apr 30, 2019 at 07:03:37AM +0200, Ingo Molnar wrote:
> So the question IMHO isn't whether it's "valid C", because we already 
> have the Linux kernel's own C syntax variant and are enforcing it with 
> varying degrees of success.

I'm not getting into the whole 'safe' fight here; but you're
underselling things. We don't have a C syntax, we have a full-blown C
language variant.

The 'Kernel C' that we write is very much not 'ANSI/ISO C' anymore in a
fair number of places. And if I can get my way, we'll only diverge
further from the standard.

And this is quite separate from us using every GCC extension under the
sun, which of course also doesn't help. It mostly has to do with us
treating C as a portable assembler and the C people not wanting to
commit to sensible things because they think C is a high-level language.
Ingo Molnar April 30, 2019, 11:05 a.m. UTC | #21
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Apr 30, 2019 at 07:03:37AM +0200, Ingo Molnar wrote:
> > So the question IMHO isn't whether it's "valid C", because we already 
> > have the Linux kernel's own C syntax variant and are enforcing it with 
> > varying degrees of success.
> 
> I'm not getting into the whole 'safe' fight here; but you're under
> selling things. We don't have a C syntax, we have a full blown C
> lanugeage variant.
> 
> The 'Kernel C' that we write is very much not 'ANSI/ISO C' anymore in a
> fair number of places. And if I can get my way, we'll only diverge
> further from the standard.

Yeah, but I think it would be fair to say that random style variations 
aside, in the kernel we still allow about 95%+ of 'sensible C'.

> And this is quite separate from us using every GCC extention under the 
> sun; which of course also doesn't help. It mostly has to do with us 
> treating C as a portable assembler and the C people not wanting to 
> commit to sensible things because they think C is a high-level 
> language.

Indeed, and also because there's arguably somewhat of a "if the spec 
allows it then performance first, common-sense semantics second" mindset. 
Which is an understandable social dynamic, as compiler developers tend to 
distinguish themselves via the optimizations they've authored.

Anyway, the main point I tried to make is that I think we'd still be able 
to allow 95%+ of "sensible C" even if executed in a "safe runtime", and 
we'd still be able to build and run without such strong runtime type 
enforcement, i.e. get kernel code close to what we have today, minus a 
handful of optimizations and data structures. (But the performance costs 
even in that case are nonzero - I'm not sugarcoating it.)

( Plus even that isn't a fully secure solution with deterministic 
  outcomes, due to parallelism and data races. )

Thanks,

	Ingo
Robert O'Callahan May 2, 2019, 11:35 a.m. UTC | #22
On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar <mingo@kernel.org> wrote:
>  - A C language runtime that is a subset of current C syntax and
>    semantics used in the kernel, and which doesn't allow access outside
>    of existing objects and thus creates a strictly enforced separation
>    between memory used for data, and memory used for code and control
>    flow.
>
>  - This would involve, at minimum:
>
>     - tracking every type and object and its inherent length and valid
>       access patterns, and never losing track of its type.
>
>     - being a lot more organized about initialization, i.e. no
>       uninitialized variables/fields.
>
>     - being a lot more strict about type conversions and pointers in
>       general.
>
>     - ... and a metric ton of other details.

Several research groups have tried to do this, and it is very
difficult to do. In particular this was almost exactly the goal of
C-Cured [1]. Much more recently, there's Microsoft's CheckedC [2] [3],
which is less ambitious. Check the references of the latter for lots
of relevant work. If anyone really pursues this they should talk
directly to researchers who've worked on this, e.g. George Necula; you
need to know what *didn't* work well, which is hard to glean from
papers. (Academic publishing is broken that way.)

One problem with adopting "safe C" or Rust in the kernel is that most
of your security mitigations (e.g. KASLR, CFI, other randomizations)
probably need to remain in place as long as there is a significant
amount of C in the kernel, which means the benefits from eliminating
them will be realized very far in the future, if ever, which makes the
whole exercise harder to justify.

Having said that, I think there's a good case to be made for writing
kernel code in Rust, e.g. sketchy drivers. The classes of bugs
prevented in Rust are significantly broader than your usual safe-C
dialect (e.g. data races).

[1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
[2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
[3] https://github.com/Microsoft/checkedc

Rob
Ingo Molnar May 2, 2019, 3:20 p.m. UTC | #23
* Robert O'Callahan <robert@ocallahan.org> wrote:

> On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar <mingo@kernel.org> wrote:
> >  - A C language runtime that is a subset of current C syntax and
> >    semantics used in the kernel, and which doesn't allow access outside
> >    of existing objects and thus creates a strictly enforced separation
> >    between memory used for data, and memory used for code and control
> >    flow.
> >
> >  - This would involve, at minimum:
> >
> >     - tracking every type and object and its inherent length and valid
> >       access patterns, and never losing track of its type.
> >
> >     - being a lot more organized about initialization, i.e. no
> >       uninitialized variables/fields.
> >
> >     - being a lot more strict about type conversions and pointers in
> >       general.
> >
> >     - ... and a metric ton of other details.
> 
> Several research groups have tried to do this, and it is very
> difficult to do. In particular this was almost exactly the goal of
> C-Cured [1]. Much more recently, there's Microsoft's CheckedC [2] [3],
> which is less ambitious. Check the references of the latter for lots
> of relevant work. If anyone really pursues this they should talk
> directly to researchers who've worked on this, e.g. George Necula; you
> need to know what *didn't* work well, which is hard to glean from
> papers. (Academic publishing is broken that way.)
> 
> One problem with adopting "safe C" or Rust in the kernel is that most
> of your security mitigations (e.g. KASLR, CFI, other randomizations)
> probably need to remain in place as long as there is a significant
> amount of C in the kernel, which means the benefits from eliminating
> them will be realized very far in the future, if ever, which makes the
> whole exercise harder to justify.
> 
> Having said that, I think there's a good case to be made for writing
> kernel code in Rust, e.g. sketchy drivers. The classes of bugs
> prevented in Rust are significantly broader than your usual safe-C
> dialect (e.g. data races).
> 
> [1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
> [2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
> [3] https://github.com/Microsoft/checkedc

So what might work better is if we defined a Rust dialect that used C 
syntax. I.e. the end result would be something like the 'c2rust' or 
'citrus' projects, where code like this would be directly translatable to 
Rust:

void gz_compress(FILE * in, gzFile out)
{
	char buf[BUFLEN];
	int len;
	int err;

	for (;;) {
		len = fread(buf, 1, sizeof(buf), in);
		if (ferror(in)) {
			perror("fread");
			exit(1);
		}
		if (len == 0)
			break;
		if (gzwrite(out, buf, (unsigned)len) != len)
			error(gzerror(out, &err));
	}
	fclose(in);

	if (gzclose(out) != Z_OK)
		error("failed gzclose");
}


#[no_mangle]
pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
    let mut buf: [i8; 16384];
    let mut len;
    let mut err;
    loop  {
        len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
        if ferror(in_) != 0 { perror("fread"); exit(1); }
        if len == 0 { break ; }
        if gzwrite(out, buf, len as c_uint) != len {
            error(gzerror(out, &mut err));
        };
    }
    fclose(in_);
    if gzclose(out) != Z_OK { error("failed gzclose"); };
}

Example taken from:

   https://gitlab.com/citrus-rs/citrus

Does this make sense?

Thanks,

	Ingo
Robert O'Callahan May 2, 2019, 9:07 p.m. UTC | #24
On Fri, May 3, 2019 at 3:20 AM Ingo Molnar <mingo@kernel.org> wrote:
> So what might work better is if we defined a Rust dialect that used C
> syntax. I.e. the end result would be something like the 'c2rust' or
> 'citrus' projects, where code like this would be directly translatable to
> Rust:
>
> void gz_compress(FILE * in, gzFile out)
> {
>         char buf[BUFLEN];
>         int len;
>         int err;
>
>         for (;;) {
>                 len = fread(buf, 1, sizeof(buf), in);
>                 if (ferror(in)) {
>                         perror("fread");
>                         exit(1);
>                 }
>                 if (len == 0)
>                         break;
>                 if (gzwrite(out, buf, (unsigned)len) != len)
>                         error(gzerror(out, &err));
>         }
>         fclose(in);
>
>         if (gzclose(out) != Z_OK)
>                 error("failed gzclose");
> }
>
>
> #[no_mangle]
> pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
>     let mut buf: [i8; 16384];
>     let mut len;
>     let mut err;
>     loop  {
>         len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
>         if ferror(in_) != 0 { perror("fread"); exit(1); }
>         if len == 0 { break ; }
>         if gzwrite(out, buf, len as c_uint) != len {
>             error(gzerror(out, &mut err));
>         };
>     }
>     fclose(in_);
>     if gzclose(out) != Z_OK { error("failed gzclose"); };
> }
>
> Example taken from:
>
>    https://gitlab.com/citrus-rs/citrus
>
> Does this make sense?

Are you saying you want a tool like c2rust/citrus that translates some
new "looks like C, but really Rust" language into actual Rust at build
time? I guess that might work, but I suspect your "looks like C"
language isn't going to end up being much like C (e.g. it's going to
need Rust-style enums-with-fields, Rust polymorphism, Rust traits, and
Rust lifetimes), so it may not be beneficial, because you've just
created a new language no-one knows, and that has some real downsides.
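
To make that concrete, here is a minimal, purely hypothetical sketch
(the names are invented, not taken from either mail): even a trivial
kernel-style "status plus payload" object, written so the compiler can
prove every access valid, already needs an enum with fields, a match
and a lifetime, none of which have a C spelling:

// Hypothetical example: what C would express as a status field plus a
// union becomes a checked enum-with-fields.
enum Packet<'a> {
    Empty,
    Data { buf: &'a [u8], flags: u32 },
}

fn payload_len(p: &Packet<'_>) -> usize {
    match p {
        Packet::Empty => 0,
        // `buf` is only nameable in the arm that proved it exists
        Packet::Data { buf, .. } => buf.len(),
    }
}

fn main() {
    let d = [1u8, 2, 3];
    let p = Packet::Data { buf: &d[..], flags: 0 };
    println!("{}", payload_len(&p)); // prints 3
}

Once those constructs appear, the C-like surface syntax does little to
keep the language familiar.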

If you're inspired by the dream of transitioning to safer languages,
then I think the first practical step would be to identify some part
of the kernel where the payoff of converting code would be highest.
This is probably something small, relatively isolated, that's not well
tested, generally suspicious, but still in use. Then do an experiment,
converting it to Rust (or something else) using off-the-shelf tools
and manual labor, and see where the pain points are and what benefits
accrue, if any. (Work like https://github.com/tsgates/rust.ko might be
a helpful starting point.) Then you'd have some data to start thinking
about how to reduce the costs, increase the benefits, and sell it to
the kernel community. If you reached out to the Rust community you
might find some volunteers to help with this.
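
For concreteness, a minimal and equally hypothetical sketch of the
boundary such a conversion has to keep (the function name is invented,
not taken from rust.ko or c2rust output): the converted code exports
C-callable entry points and confines the trust placed in its C callers
to explicit unsafe blocks, which is where the pain points of such an
experiment are likely to surface:

// Hypothetical example: one converted helper, callable from the
// remaining C as `uint32_t widget_checksum(const uint8_t *, size_t)`.
#[no_mangle]
pub extern "C" fn widget_checksum(data: *const u8, len: usize) -> u32 {
    // The unsafe block marks exactly where the C caller's contract
    // (valid pointer, correct length) is being trusted.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}

fn main() {
    // From Rust the same function is callable directly.
    let v = [1u8, 2, 3];
    println!("{}", widget_checksum(v.as_ptr(), v.len())); // prints 6
}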

Rob

Patch

diff --git a/arch/x86/include/asm/sci.h b/arch/x86/include/asm/sci.h
new file mode 100644
index 0000000..0b56200
--- /dev/null
+++ b/arch/x86/include/asm/sci.h
@@ -0,0 +1,55 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _ASM_X86_SCI_H
+#define _ASM_X86_SCI_H
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+
+struct sci_task_data {
+	pgd_t		*pgd;
+	unsigned long	cr3_offset;
+	unsigned long	backtrace_size;
+	unsigned long	*backtrace;
+	unsigned long	ptes_count;
+	pte_t		**ptes;
+};
+
+struct sci_percpu_data {
+	unsigned long		sci_syscall;
+	unsigned long		sci_cr3_offset;
+};
+
+DECLARE_PER_CPU_PAGE_ALIGNED(struct sci_percpu_data, cpu_sci);
+
+void sci_check_boottime_disable(void);
+
+int sci_init(struct task_struct *tsk);
+void sci_exit(struct task_struct *tsk);
+
+bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+			unsigned long hw_error_code);
+void sci_clear_data(void);
+
+static inline void sci_switch_to(struct task_struct *next)
+{
+	this_cpu_write(cpu_sci.sci_syscall, next->in_isolated_syscall);
+	if (next->sci)
+		this_cpu_write(cpu_sci.sci_cr3_offset, next->sci->cr3_offset);
+}
+
+#else /* CONFIG_SYSCALL_ISOLATION */
+
+static inline void sci_check_boottime_disable(void) {}
+
+static inline bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+				      unsigned long hw_error_code)
+{
+	return true;
+}
+
+static inline void sci_clear_data(void) {}
+
+static inline void sci_switch_to(struct task_struct *next) {}
+
+#endif /* CONFIG_SYSCALL_ISOLATION */
+
+#endif /* _ASM_X86_SCI_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd..9a728b7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,6 +49,7 @@  obj-$(CONFIG_X86_INTEL_MPX)			+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
+obj-$(CONFIG_SYSCALL_ISOLATION)			+= sci.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index f905a23..b6e2db4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -22,6 +22,7 @@ 
 #include <asm/hypervisor.h>
 #include <asm/cpufeature.h>
 #include <asm/pti.h>
+#include <asm/sci.h>
 
 /*
  * We need to define the tracepoints somewhere, and tlb.c
@@ -648,6 +649,7 @@  void __init init_mem_mapping(void)
 	unsigned long end;
 
 	pti_check_boottime_disable();
+	sci_check_boottime_disable();
 	probe_page_size_mask();
 	setup_pcid();
 
diff --git a/arch/x86/mm/sci.c b/arch/x86/mm/sci.c
new file mode 100644
index 0000000..e7ddec1
--- /dev/null
+++ b/arch/x86/mm/sci.c
@@ -0,0 +1,608 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2019 IBM Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ *
+ * This code is based on pti.c, see it for the original copyrights
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/kallsyms.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/sizes.h>
+#include <linux/sci.h>
+#include <linux/random.h>
+
+#include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/cmdline.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+#include <asm/sections.h>
+#include <asm/traps.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "SCI: " fmt
+
+#define SCI_MAX_PTES 256
+#define SCI_MAX_BACKTRACE 64
+
+__visible DEFINE_PER_CPU_PAGE_ALIGNED(struct sci_percpu_data, cpu_sci);
+
+/*
+ * Walk the shadow copy of the page tables down to PMD level,
+ * allocating page table pages on the way down as needed.
+ *
+ * Allocation failures are not handled here because the entire page
+ * table will be freed in sci_free_pagetable.
+ *
+ * Returns a pointer to a PMD on success, or NULL on failure.
+ */
+static pmd_t *sci_pagetable_walk_pmd(struct mm_struct *mm,
+				     pgd_t *pgd, unsigned long address)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+
+	p4d = p4d_alloc(mm, pgd, address);
+	if (!p4d)
+		return NULL;
+
+	pud = pud_alloc(mm, p4d, address);
+	if (!pud)
+		return NULL;
+
+	return pmd_alloc(mm, pud, address);
+}
+
+/*
+ * Walk the shadow copy of the page tables down to PTE level,
+ * allocating page table pages on the way down as needed.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *sci_pagetable_walk_pte(struct mm_struct *mm,
+				     pgd_t *pgd, unsigned long address)
+{
+	pmd_t *pmd = sci_pagetable_walk_pmd(mm, pgd, address);
+
+	if (!pmd)
+		return NULL;
+
+	if (__pte_alloc(mm, pmd))
+		return NULL;
+
+	return pte_offset_kernel(pmd, address);
+}
+
+/*
+ * Clone a single page mapping
+ *
+ * The new mapping in @target_pgdp is always created for a base
+ * page. If the original page table has the page at @addr mapped at
+ * PMD level, we still create a PTE in the target page table and map
+ * only PAGE_SIZE.
+ */
+static pte_t *sci_clone_page(struct mm_struct *mm,
+			     pgd_t *pgdp, pgd_t *target_pgdp,
+			     unsigned long addr)
+{
+	pte_t *pte, *target_pte, ptev;
+	pgd_t *pgd, *target_pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_pgd(pgdp, addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	target_pgd = pgd_offset_pgd(target_pgdp, addr);
+
+	if (pmd_large(*pmd)) {
+		pgprot_t flags;
+		unsigned long pfn;
+
+		/*
+		 * We map only PAGE_SIZE rather than the entire huge page.
+		 * The PTE will have the same pgprot bits as the original PMD.
+		 */
+		flags = pte_pgprot(pte_clrhuge(*(pte_t *)pmd));
+		pfn = pmd_pfn(*pmd) + pte_index(addr);
+		ptev = pfn_pte(pfn, flags);
+	} else {
+		pte = pte_offset_kernel(pmd, addr);
+		if (pte_none(*pte) || !(pte_flags(*pte) & _PAGE_PRESENT))
+			return NULL;
+
+		ptev = *pte;
+	}
+
+	target_pte = sci_pagetable_walk_pte(mm, target_pgd, addr);
+	if (!target_pte)
+		return NULL;
+
+	*target_pte = ptev;
+
+	return target_pte;
+}
+
+/*
+ * Clone a range keeping the same leaf mappings
+ *
+ * If the range has holes they are simply skipped
+ */
+static int sci_clone_range(struct mm_struct *mm,
+			   pgd_t *pgdp, pgd_t *target_pgdp,
+			   unsigned long start, unsigned long end)
+{
+	unsigned long addr;
+
+	/*
+	 * Clone the populated PMDs which cover start to end. These PMD areas
+	 * can have holes.
+	 */
+	for (addr = start; addr < end;) {
+		pte_t *pte, *target_pte;
+		pgd_t *pgd, *target_pgd;
+		pmd_t *pmd, *target_pmd;
+		p4d_t *p4d;
+		pud_t *pud;
+
+		/* Overflow check */
+		if (addr < start)
+			break;
+
+		pgd = pgd_offset_pgd(pgdp, addr);
+		if (pgd_none(*pgd))
+			return 0;
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d))
+			return 0;
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			addr += PUD_SIZE;
+			continue;
+		}
+
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd)) {
+			addr += PMD_SIZE;
+			continue;
+		}
+
+		target_pgd = pgd_offset_pgd(target_pgdp, addr);
+
+		if (pmd_large(*pmd)) {
+			target_pmd = sci_pagetable_walk_pmd(mm, target_pgd,
+							    addr);
+			if (!target_pmd)
+				return -ENOMEM;
+
+			*target_pmd = *pmd;
+
+			addr += PMD_SIZE;
+			continue;
+		} else {
+			pte = pte_offset_kernel(pmd, addr);
+			if (pte_none(*pte)) {
+				addr += PAGE_SIZE;
+				continue;
+			}
+
+			target_pte = sci_pagetable_walk_pte(mm, target_pgd,
+							    addr);
+			if (!target_pte)
+				return -ENOMEM;
+
+			*target_pte = *pte;
+
+			addr += PAGE_SIZE;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * We have to map the syscall entry because we will fault there after
+ * the CR3 switch, before the verifier is able to recognize this as a
+ * legitimate access.
+ */
+extern void do_syscall_64(unsigned long nr, struct pt_regs *regs);
+unsigned long syscall_entry_addr = (unsigned long)do_syscall_64;
+
+static void sci_reset_backtrace(struct sci_task_data *sci)
+{
+	memset(sci->backtrace, 0, sci->backtrace_size * sizeof(*sci->backtrace));
+	sci->backtrace[0] = syscall_entry_addr;
+	sci->backtrace_size = 1;
+}
+
+static inline void sci_sync_user_pagetable(struct task_struct *tsk)
+{
+	pgd_t *u_pgd = kernel_to_user_pgdp(tsk->mm->pgd);
+	pgd_t *sci_pgd = tsk->sci->pgd;
+
+	down_write(&tsk->mm->mmap_sem);
+	memcpy(sci_pgd, u_pgd, PGD_KERNEL_START * sizeof(pgd_t));
+	up_write(&tsk->mm->mmap_sem);
+}
+
+static int sci_free_pte_range(struct mm_struct *mm, pmd_t *pmd)
+{
+	pte_t *ptep = pte_offset_kernel(pmd, 0);
+
+	pmd_clear(pmd);
+	pte_free(mm, virt_to_page(ptep));
+	mm_dec_nr_ptes(mm);
+
+	return 0;
+}
+
+static int sci_free_pmd_range(struct mm_struct *mm, pud_t *pud)
+{
+	pmd_t *pmd, *pmdp;
+	int i;
+
+	pmdp = pmd_offset(pud, 0);
+
+	for (i = 0, pmd = pmdp; i < PTRS_PER_PMD; i++, pmd++)
+		if (!pmd_none(*pmd) && !pmd_large(*pmd))
+			sci_free_pte_range(mm, pmd);
+
+	pud_clear(pud);
+	pmd_free(mm, pmdp);
+	mm_dec_nr_pmds(mm);
+
+	return 0;
+}
+
+static int sci_free_pud_range(struct mm_struct *mm, p4d_t *p4d)
+{
+	pud_t *pud, *pudp;
+	int i;
+
+	pudp = pud_offset(p4d, 0);
+
+	for (i = 0, pud = pudp; i < PTRS_PER_PUD; i++, pud++)
+		if (!pud_none(*pud))
+			sci_free_pmd_range(mm, pud);
+
+	p4d_clear(p4d);
+	pud_free(mm, pudp);
+	mm_dec_nr_puds(mm);
+
+	return 0;
+}
+
+static int sci_free_p4d_range(struct mm_struct *mm, pgd_t *pgd)
+{
+	p4d_t *p4d, *p4dp;
+	int i;
+
+	p4dp = p4d_offset(pgd, 0);
+
+	for (i = 0, p4d = p4dp; i < PTRS_PER_P4D; i++, p4d++)
+		if (!p4d_none(*p4d))
+			sci_free_pud_range(mm, p4d);
+
+	pgd_clear(pgd);
+	p4d_free(mm, p4dp);
+
+	return 0;
+}
+
+static int sci_free_pagetable(struct task_struct *tsk, pgd_t *sci_pgd)
+{
+	struct mm_struct *mm = tsk->mm;
+	pgd_t *pgd, *pgdp = sci_pgd;
+
+#ifdef SCI_SHARED_PAGE_TABLES
+	int i;
+
+	for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++) {
+		if (i >= pgd_index(VMALLOC_START) &&
+		    i < pgd_index(__START_KERNEL_map))
+			continue;
+		pgd = pgdp + i;
+		sci_free_p4d_range(mm, pgd);
+	}
+#else
+	for (pgd = pgdp + KERNEL_PGD_BOUNDARY; pgd < pgdp + PTRS_PER_PGD; pgd++)
+		if (!pgd_none(*pgd))
+			sci_free_p4d_range(mm, pgd);
+#endif
+
+
+	return 0;
+}
+
+static int sci_pagetable_init(struct task_struct *tsk, pgd_t *sci_pgd)
+{
+	struct mm_struct *mm = tsk->mm;
+	pgd_t *k_pgd = mm->pgd;
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	unsigned long stack = (unsigned long)tsk->stack;
+	unsigned long addr;
+	unsigned int cpu;
+	pte_t *pte;
+	int ret;
+
+	/* copy the kernel part of the user-visible page table */
+	ret = sci_clone_range(mm, u_pgd, sci_pgd, CPU_ENTRY_AREA_BASE,
+			      CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE);
+	if (ret)
+		goto err_free_pagetable;
+
+	ret = sci_clone_range(mm, u_pgd, sci_pgd,
+			      (unsigned long) __entry_text_start,
+			      (unsigned long) __irqentry_text_end);
+	if (ret)
+		goto err_free_pagetable;
+
+	ret = sci_clone_range(mm, mm->pgd, sci_pgd,
+			      stack, stack + THREAD_SIZE);
+	if (ret)
+		goto err_free_pagetable;
+
+	ret = -ENOMEM;
+	for_each_possible_cpu(cpu) {
+		addr = (unsigned long)&per_cpu(cpu_sci, cpu);
+		pte = sci_clone_page(mm, k_pgd, sci_pgd, addr);
+		if (!pte)
+			goto err_free_pagetable;
+	}
+
+	/* plus do_syscall_64 */
+	pte = sci_clone_page(mm, k_pgd, sci_pgd, syscall_entry_addr);
+	if (!pte)
+		goto err_free_pagetable;
+
+	return 0;
+
+err_free_pagetable:
+	sci_free_pagetable(tsk, sci_pgd);
+	return ret;
+}
+
+static int sci_alloc(struct task_struct *tsk)
+{
+	struct sci_task_data *sci;
+	int err = -ENOMEM;
+
+	if (!static_cpu_has(X86_FEATURE_SCI))
+		return 0;
+
+	if (tsk->sci)
+		return 0;
+
+	sci = kzalloc(sizeof(*sci), GFP_KERNEL);
+	if (!sci)
+		return err;
+
+	sci->ptes = kcalloc(SCI_MAX_PTES, sizeof(*sci->ptes), GFP_KERNEL);
+	if (!sci->ptes)
+		goto free_sci;
+
+	sci->backtrace = kcalloc(SCI_MAX_BACKTRACE, sizeof(*sci->backtrace),
+				  GFP_KERNEL);
+	if (!sci->backtrace)
+		goto free_ptes;
+
+	sci->pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
+	if (!sci->pgd)
+		goto free_backtrace;
+
+	err = sci_pagetable_init(tsk, sci->pgd);
+	if (err)
+		goto free_pgd;
+
+	sci_reset_backtrace(sci);
+
+	tsk->sci = sci;
+
+	return 0;
+
+free_pgd:
+	free_page((unsigned long)sci->pgd);
+free_backtrace:
+	kfree(sci->backtrace);
+free_ptes:
+	kfree(sci->ptes);
+free_sci:
+	kfree(sci);
+	return err;
+}
+
+int sci_init(struct task_struct *tsk)
+{
+	if (!tsk->sci) {
+		int err = sci_alloc(tsk);
+
+		if (err)
+			return err;
+	}
+
+	sci_sync_user_pagetable(tsk);
+
+	return 0;
+}
+
+void sci_exit(struct task_struct *tsk)
+{
+	struct sci_task_data *sci = tsk->sci;
+
+	if (!static_cpu_has(X86_FEATURE_SCI))
+		return;
+
+	if (!sci)
+		return;
+
+	sci_free_pagetable(tsk, tsk->sci->pgd);
+	free_page((unsigned long)sci->pgd);
+	kfree(sci->backtrace);
+	kfree(sci->ptes);
+	kfree(sci);
+}
+
+void sci_clear_data(void)
+{
+	struct sci_task_data *sci = current->sci;
+	int i;
+
+	if (WARN_ON(!sci))
+		return;
+
+	for (i = 0; i < sci->ptes_count; i++)
+		pte_clear(NULL, 0, sci->ptes[i]);
+
+	memset(sci->ptes, 0, sci->ptes_count * sizeof(*sci->ptes));
+	sci->ptes_count = 0;
+
+	sci_reset_backtrace(sci);
+}
+
+static void sci_add_pte(struct sci_task_data *sci, pte_t *pte)
+{
+	int i;
+
+	for (i = sci->ptes_count - 1; i >= 0; i--)
+		if (pte == sci->ptes[i])
+			return;
+	sci->ptes[sci->ptes_count++] = pte;
+}
+
+static void sci_add_rip(struct sci_task_data *sci, unsigned long rip)
+{
+	int i;
+
+	for (i = sci->backtrace_size - 1; i >= 0; i--)
+		if (rip == sci->backtrace[i])
+			return;
+
+	sci->backtrace[sci->backtrace_size++] = rip;
+}
+
+static bool sci_verify_code_access(struct sci_task_data *sci,
+				   struct pt_regs *regs, unsigned long addr)
+{
+	char namebuf[KSYM_NAME_LEN];
+	unsigned long offset, size;
+	const char *symbol;
+	char *modname;
+
+
+	/* instruction fetch outside kernel or module text */
+	if (!(is_kernel_text(addr) || is_module_text_address(addr)))
+		return false;
+
+	/* no symbol matches the address */
+	symbol = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+	if (!symbol)
+		return false;
+
+	/* BPF or ftrace? */
+	if (symbol != namebuf)
+		return false;
+
+	/* access in the middle of a function */
+	if (offset) {
+		int i = 0;
+
+		for (i = sci->backtrace_size - 1; i >= 0; i--) {
+			unsigned long rip = sci->backtrace[i];
+
+			/* allow jumps into the page following an already mapped one */
+			if ((addr >> PAGE_SHIFT) == ((rip >> PAGE_SHIFT) + 1))
+				return true;
+		}
+
+		return false;
+	}
+
+	sci_add_rip(sci, regs->ip);
+
+	return true;
+}
+
+bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+			unsigned long hw_error_code)
+{
+	struct task_struct *tsk = current;
+	struct mm_struct *mm = tsk->mm;
+	struct sci_task_data *sci = tsk->sci;
+	pte_t *pte;
+
+	/* run out of room for metadata, can't grant access */
+	if (sci->ptes_count >= SCI_MAX_PTES ||
+	    sci->backtrace_size >= SCI_MAX_BACKTRACE)
+		return false;
+
+	/* only code access is checked */
+	if (hw_error_code & X86_PF_INSTR &&
+	    !sci_verify_code_access(sci, regs, addr))
+		return false;
+
+	pte = sci_clone_page(mm, mm->pgd, sci->pgd, addr);
+	if (!pte)
+		return false;
+
+	sci_add_pte(sci, pte);
+
+	return true;
+}
+
+void __init sci_check_boottime_disable(void)
+{
+	char arg[5];
+	int ret;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PCID)) {
+		pr_info("System call isolation requires PCID\n");
+		return;
+	}
+
+	/* Assume SCI is disabled unless explicitly overridden. */
+	ret = cmdline_find_option(boot_command_line, "sci", arg, sizeof(arg));
+	if (ret == 2 && !strncmp(arg, "on", 2)) {
+		setup_force_cpu_cap(X86_FEATURE_SCI);
+		pr_info("System call isolation is enabled\n");
+		return;
+	}
+
+	pr_info("System call isolation is disabled\n");
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9b43c9..cdcdb07 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1202,6 +1202,11 @@  struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_SYSCALL_ISOLATION
+	unsigned long			in_isolated_syscall;
+	struct sci_task_data		*sci;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/sci.h b/include/linux/sci.h
new file mode 100644
index 0000000..7a6beac
--- /dev/null
+++ b/include/linux/sci.h
@@ -0,0 +1,12 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_SCI_H
+#define _LINUX_SCI_H
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+#include <asm/sci.h>
+#else
+static inline int sci_init(struct task_struct *tsk) { return 0; }
+static inline void sci_exit(struct task_struct *tsk) {}
+#endif
+
+#endif /* _LINUX_SCI_H */