[V5,0/5] KVM: X86: Introducing ROE Protection Kernel Hardening

Message ID 20181026151223.16810-1-ahmedsoliman0x666@gmail.com

Ahmed Soliman Oct. 26, 2018, 3:12 p.m. UTC
This is the 5th version, which is the 4th version with minor fixes. ROE is a
hypercall that enables the host operating system to restrict a guest's access
to its own memory. This provides a hardening mechanism that can be used to stop
rootkits from manipulating static kernel data structures and code. Once a memory
region is protected, the guest kernel can't even request undoing the protection.

Memory protected by ROE should be non-swappable, because even if an ROE-protected
page got swapped out, it wouldn't be possible to write anything in its place.

The ROE hypercall should be capable of protecting either a whole memory frame or
parts of it. With these two, it should be possible for the guest kernel to protect
its memory and all the page table entries for that memory inside the page table.
I am still not sure whether this should be part of ROE's job or the guest's job.
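
For illustration only, the two granularities described above can be modelled in
plain user-space C. This is a sketch under stated assumptions, not code from the
patch series: the names roe_page, roe_protect_page, roe_protect_chunk and
roe_write_allowed are hypothetical, and the per-byte bitmap is just one possible
way to track a partially protected frame. Since hardware page tables work at page
granularity, a sub-page scheme would presumably have to trap writes to the page
and check them in software, which is what roe_write_allowed stands in for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical per-frame record; not a structure from the patch set. */
struct roe_page {
    bool full;                      /* whole frame is read-only */
    uint8_t ro_bits[PAGE_SIZE / 8]; /* one bit per byte: set = read-only */
};

/* Protect the whole frame. There is deliberately no "unprotect". */
static void roe_protect_page(struct roe_page *p)
{
    p->full = true;
}

/* Protect bytes [off, off + len) of the frame; -1 if out of range. */
static int roe_protect_chunk(struct roe_page *p, size_t off, size_t len)
{
    if (off > PAGE_SIZE || len > PAGE_SIZE - off)
        return -1;
    for (size_t i = off; i < off + len; i++)
        p->ro_bits[i / 8] |= (uint8_t)(1u << (i % 8));
    return 0;
}

/* Would a write of len bytes at offset off be allowed? */
static bool roe_write_allowed(const struct roe_page *p, size_t off, size_t len)
{
    if (off > PAGE_SIZE || len > PAGE_SIZE - off)
        return false;
    if (p->full)
        return false;
    for (size_t i = off; i < off + len; i++)
        if (p->ro_bits[i / 8] & (1u << (i % 8)))
            return false;
    return true;
}
```

In this model a write that touches even one protected byte is refused as a
whole, and protection can only grow, which mirrors the "cannot be undone"
property stated above.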


The reason why it would be better to implement this inside KVM instead of
(host) user space is the need to access SPTEs to modify the permissions. While
mprotect() from user space can work in theory, it would be a big performance
hit to vmexit and switch to user space mode on each fault. Having the
permissions handled by EPT, on the other hand, should give a considerable
performance gain.

Our model assumes that an attacker has gained full root access to a running guest
and that the goal is to manipulate kernel code/data (hook syscalls, overwrite the
IDT, etc.).

There is future work in progress to also put some sort of protection on the page
table register CR3 and other critical registers that can be intercepted by KVM.
This way it won't be possible for an attacker to manipulate any part of the
guest's page table.

V4->V5 change log:
- Fixed summary (it was reverted summary)
- Fixed inaccurate documentation in patch [4/5]

Summary:

 Documentation/virtual/kvm/hypercalls.txt |  40 +++++
 arch/x86/include/asm/kvm_host.h          |  11 +-
 arch/x86/kvm/Kconfig                     |   7 +
 arch/x86/kvm/mmu.c                       | 129 ++++++++++----
 arch/x86/kvm/vmx.c                       |   2 +-
 arch/x86/kvm/x86.c                       | 281 ++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h                 |  29 ++++
 include/uapi/linux/kvm_para.h            |   5 +
 virt/kvm/kvm_main.c                      | 119 +++++++++++--
 9 files changed, 572 insertions(+), 51 deletions(-)

Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>

Comments

Ingo Molnar Oct. 29, 2018, 6:46 a.m. UTC | #1
* Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com> wrote:

> This is the 5th version which is 4th version with minor fixes. ROE is a 
> hypercall that enables host operating system to restrict guest's access to its
> own memory. This will provide a hardening mechanism that can be used to stop
> rootkits from manipulating kernel static data structures and code. Once a memory
> region is protected the guest kernel can't even request undoing the protection.
> 
> Memory protected by ROE should be non-swapable because even if the ROE protected
> page got swapped out, It won't be possible to write anything in its place.
> 
> ROE hypercall should be capable of either protecting a whole memory frame or
> parts of it. With these two, it should be possible for guest kernel to protect
> its memory and all the page table entries for that memory inside the page table.
> I am still not sure whether this should be part of ROE job or the guest's job.
> 
> 
> The reason why it would be better to implement this from inside kvm: instead of
> (host) user space is the need to access SPTEs to modify the permissions, while
> mprotect() from user space can work in theory. It will become a big performance
> hit to vmexit and switch to user space mode on each fault, on the other hand,
> having the permission handled by EPT should make some remarkable performance
> gain.
> 
> Our model assumes that an attacker got full root access to a running guest and
> his goal is to manipulate kernel code/data (hook syscalls, overwrite IDT ..etc).
> 
> There is future work in progress to also put some sort of protection on the page
> table register CR3 and other critical registers that can be intercepted by KVM.
> This way it won't be possible for an attacker to manipulate any part of the
> guests page table.

BTW., transparent detection and trapping of attacks would also be nice: 
if ROE is active and something running on the guest still attempts to 
change the pagetables, the guest should be frozen and a syslog warning on 
the hypervisor side should be printed?

Also, the feature should probably be 'default y' to help spread it on the 
distro side. It's opt-in functionality from the guest side anyway, so 
there's no real cost on the host side other than some minor resident 
memory.

Thanks,

	Ingo
Igor Stoppa Oct. 29, 2018, 6:01 p.m. UTC | #2
Hi,

On 26/10/2018 16:12, Ahmed Abd El Mawgood wrote:

> This is the 5th version which is 4th version with minor fixes. ROE is a
> hypercall that enables host operating system to restrict guest's access to its
> own memory. This will provide a hardening mechanism that can be used to stop
> rootkits from manipulating kernel static data structures and code. Once a memory
> region is protected the guest kernel can't even request undoing the protection.

This is very interesting, because it seems a very good match to the work 
I'm doing, for supporting the creation of more targets for protection:

https://www.openwall.com/lists/kernel-hardening/2018/10/23/3

In my case the protection would extend also to write-rare type of data.
There is an open problem of identifying legitimate write-rare 
operations, however it should be possible to provide at least a certain 
degree of confidence.

--

igor
Ahmed Soliman Oct. 30, 2018, 4:49 p.m. UTC | #3
On Mon, 29 Oct 2018 at 08:46, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com> wrote:
>
> > This is the 5th version which is 4th version with minor fixes. ROE is a
> > hypercall that enables host operating system to restrict guest's access to its
> > own memory. This will provide a hardening mechanism that can be used to stop
> > rootkits from manipulating kernel static data structures and code. Once a memory
> > region is protected the guest kernel can't even request undoing the protection.
> >
> > Memory protected by ROE should be non-swapable because even if the ROE protected
> > page got swapped out, It won't be possible to write anything in its place.
> >
> > ROE hypercall should be capable of either protecting a whole memory frame or
> > parts of it. With these two, it should be possible for guest kernel to protect
> > its memory and all the page table entries for that memory inside the page table.
> > I am still not sure whether this should be part of ROE job or the guest's job.
> >
> >
> > The reason why it would be better to implement this from inside kvm: instead of
> > (host) user space is the need to access SPTEs to modify the permissions, while
> > mprotect() from user space can work in theory. It will become a big performance
> > hit to vmexit and switch to user space mode on each fault, on the other hand,
> > having the permission handled by EPT should make some remarkable performance
> > gain.
> >
> > Our model assumes that an attacker got full root access to a running guest and
> > his goal is to manipulate kernel code/data (hook syscalls, overwrite IDT ..etc).
> >
> > There is future work in progress to also put some sort of protection on the page
> > table register CR3 and other critical registers that can be intercepted by KVM.
> > This way it won't be possible for an attacker to manipulate any part of the
> > guests page table.
>
> BTW., transparent detection and trapping of attacks would also be nice:
> if ROE is active and something running on the guest still attempts to
> change the pagetables, the guest should be frozen and a syslog warning on
> the hypervisor side should be printed?

I was thinking about logging ROE violations to the host's kernel logs too,
and I will have it in the next version of the patch set. But I am not sure
if we should really freeze the guest once a violation happens; I wanted to
completely isolate the mechanism from the policy. In the current
implementation, the policy is left to the user space host process using
KVM. It will be notified about a memory I/O error in the guest, and the
guest will be waiting for the host to take action (vmexit). Of course the
simplest action is "do nothing", so the write operation fails but the
guest continues to run. Alternatively, it can freeze the guest, or maybe
do something advanced like cloning the VM, disconnecting the network
interface from the cloned VM, then enabling the write operation and
keeping track of what is going on somehow. Because of the many possible
reactions, we decided to keep them out of KVM and just let the user space
process decide how the situation should be handled.
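
To make the mechanism/policy split above concrete, here is a minimal sketch of
what the user space side could look like. It is an assumption-laden
illustration, not the interface from the patch set: roe_policy, vm_state,
struct vm and on_roe_violation are all hypothetical names, and a real VMM would
receive the notification through its normal KVM exit handling rather than a
direct function call.

```c
#include <stdio.h>

/* Hypothetical reactions a VMM could choose from; not a real KVM API. */
enum roe_policy { ROE_POLICY_IGNORE, ROE_POLICY_FREEZE };
enum vm_state { VM_RUNNING, VM_PAUSED };

/* Hypothetical VMM-side bookkeeping for one guest. */
struct vm {
    enum vm_state state;
    unsigned long violations;
};

/*
 * Called by the (hypothetical) VMM when the hypervisor reports a refused
 * write. The write has already failed; only the reaction is decided here.
 */
static enum vm_state on_roe_violation(struct vm *vm, enum roe_policy policy,
                                      unsigned long long gpa)
{
    vm->violations++;
    fprintf(stderr, "ROE violation #%lu at guest address 0x%llx\n",
            vm->violations, gpa);

    if (policy == ROE_POLICY_FREEZE)
        vm->state = VM_PAUSED;  /* stop scheduling the guest */
    /* ROE_POLICY_IGNORE: "do nothing", the guest keeps running. */

    return vm->state;
}
```

More elaborate reactions, such as cloning the VM, would be further cases in the
same handler, which is the point of keeping the decision outside KVM.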

> Also, the feature should probably be 'default y' to help spread it on the
> distro side. It's opt-in functionality from the guest side anyway, so
> there's no real cost on the host side other than some minor resident
> memory.

Noted, I will make it `default y` in the next patch set.

Thanks,
Ahmed
Christian Borntraeger Oct. 30, 2018, 5:31 p.m. UTC | #4
On 10/26/2018 05:12 PM, Ahmed Abd El Mawgood wrote:
> This is the 5th version which is 4th version with minor fixes. ROE is a 
> hypercall that enables host operating system to restrict guest's access to its
> own memory. This will provide a hardening mechanism that can be used to stop
> rootkits from manipulating kernel static data structures and code. Once a memory
> region is protected the guest kernel can't even request undoing the protection.

At the KVM Forum we considered this as something that could be implemented across
multiple architectures. Yes, there will be architecture-specific implementations,
and yes, some architectures might not be able to provide everything (e.g. sub-page
granularity).
But we should really check if we can come up with a guest interface or common guest
code that can be useful across architectures.
Ahmed Soliman Oct. 31, 2018, 11:21 p.m. UTC | #5
Hello Igor,
> This is very interesting, because it seems a very good match to the work
> I'm doing, for supporting the creation of more targets for protection:
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/3
>
> In my case the protection would extend also to write-rare type of data.
> There is an open problem of identifying legitimate write-rare
> operations, however it should be possible to provide at least a certain
> degree of confidence.

I have checked your patch set. In our work we were originally planning to do
something similar to write_rare, just so we can differentiate between memory
chunks that may be modified and those that will be set once and never modified.
I see you are planning to do a white paper too; actually, we are doing an
academic paper based on our work. If you would like to collaborate, so that
ROE and write_rare integrate well from the beginning, we will be glad to do so.

Thanks,
--
Ahmed
Junior Researcher , IoT and Cyber Security lab, SmartCI , Alexandria
University, & CIS @  VMI
Igor Stoppa Nov. 1, 2018, 3:56 p.m. UTC | #6
Hello Ahmed,

On 01/11/2018 01:21, Ahmed Soliman wrote:
> Hello Igor,
>> This is very interesting, because it seems a very good match to the work
>> I'm doing, for supporting the creation of more targets for protection:
>>
>> https://www.openwall.com/lists/kernel-hardening/2018/10/23/3
>>
>> In my case the protection would extend also to write-rare type of data.
>> There is an open problem of identifying legitimate write-rare
>> operations, however it should be possible to provide at least a certain
>> degree of confidence.
> 
> I have checked your patch set. In our work we were originally planning to do
> something similar to write_rare just so we can differentiate between memory
> chunks that may be modified and those that will be set once and never modify.
> I see you are planning to do a white paper too, actually we are doing
> an academic
> paper based on our work. If you would like to collaborate, so that ROE
> and write_rare
> would integrate well from the beginning, we will be glad to do so.

The offer is very kind, thanks a lot.
I will contact you in private.

--
igor