Message ID | 20190606200926.4029-4-yu-cheng.yu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Control-flow Enforcement: Branch Tracking, PTRACE | expand |
On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap > that allows execution of legacy, non-IBT compatible library by an > IBT-enabled application. When set, each bit in the bitmap indicates > one page of legacy code. > > The bitmap is allocated and setup from the application. > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) > +{ > + u64 r; > + > + if (!current->thread.cet.ibt_enabled) > + return -EINVAL; > + > + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) > + return -EINVAL; > + > + current->thread.cet.ibt_bitmap_addr = bitmap; > + current->thread.cet.ibt_bitmap_size = size; > + > + /* > + * Turn on IBT legacy bitmap. > + */ > + modify_fpu_regs_begin(); > + rdmsrl(MSR_IA32_U_CET, r); > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > + wrmsrl(MSR_IA32_U_CET, r); > + modify_fpu_regs_end(); > + > + return 0; > +} So you just program a random user supplied address into the hardware. What happens if there's not actually anything at that address or the user munmap()s the data after doing this?
On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote: > On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: > > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap > > that allows execution of legacy, non-IBT compatible library by an > > IBT-enabled application. When set, each bit in the bitmap indicates > > one page of legacy code. > > > > The bitmap is allocated and setup from the application. > > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) > > +{ > > + u64 r; > > + > > + if (!current->thread.cet.ibt_enabled) > > + return -EINVAL; > > + > > + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) > > + return -EINVAL; > > + > > + current->thread.cet.ibt_bitmap_addr = bitmap; > > + current->thread.cet.ibt_bitmap_size = size; > > + > > + /* > > + * Turn on IBT legacy bitmap. > > + */ > > + modify_fpu_regs_begin(); > > + rdmsrl(MSR_IA32_U_CET, r); > > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > > + wrmsrl(MSR_IA32_U_CET, r); > > + modify_fpu_regs_end(); > > + > > + return 0; > > +} > > So you just program a random user supplied address into the hardware. > What happens if there's not actually anything at that address or the > user munmap()s the data after doing this? This function checks the bitmap's alignment and size, and anything else is the app's responsibility. What else do you think the kernel should check? Yu-cheng
> On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: > >> On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote: >>> On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: >>> Indirect Branch Tracking (IBT) provides an optional legacy code bitmap >>> that allows execution of legacy, non-IBT compatible library by an >>> IBT-enabled application. When set, each bit in the bitmap indicates >>> one page of legacy code. >>> >>> The bitmap is allocated and setup from the application. >>> +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) >>> +{ >>> + u64 r; >>> + >>> + if (!current->thread.cet.ibt_enabled) >>> + return -EINVAL; >>> + >>> + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) >>> + return -EINVAL; >>> + >>> + current->thread.cet.ibt_bitmap_addr = bitmap; >>> + current->thread.cet.ibt_bitmap_size = size; >>> + >>> + /* >>> + * Turn on IBT legacy bitmap. >>> + */ >>> + modify_fpu_regs_begin(); >>> + rdmsrl(MSR_IA32_U_CET, r); >>> + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); >>> + wrmsrl(MSR_IA32_U_CET, r); >>> + modify_fpu_regs_end(); >>> + >>> + return 0; >>> +} >> >> So you just program a random user supplied address into the hardware. >> What happens if there's not actually anything at that address or the >> user munmap()s the data after doing this? > > This function checks the bitmap's alignment and size, and anything else is the > app's responsibility. What else do you think the kernel should check? > One might reasonably wonder why this state is privileged in the first place and, given that, why we’re allowing it to be written like this. Arguably we should have another prctl to lock these values (until exec) as a gardening measure.
On 6/7/19 9:35 AM, Andy Lutomirski wrote: > One might reasonably wonder why this state is privileged in the first > place and, given that, why we’re allowing it to be written like > this. I think it's generally a good architectural practice to make things like this privileged. They're infrequent so can survive the cost of a trip in/out of the kernel and are a great choke point to make sure the OS is involved. I wish we had the same for MPX or pkeys per-task "setup".
On Fri, 2019-06-07 at 09:35 -0700, Andy Lutomirski wrote: > > On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: > > > > > On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote: > > > > On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: > > > > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap > > > > that allows execution of legacy, non-IBT compatible library by an > > > > IBT-enabled application. When set, each bit in the bitmap indicates > > > > one page of legacy code. > > > > > > > > The bitmap is allocated and setup from the application. > > > > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) > > > > +{ > > > > + u64 r; > > > > + > > > > + if (!current->thread.cet.ibt_enabled) > > > > + return -EINVAL; > > > > + > > > > + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) > > > > + return -EINVAL; > > > > + > > > > + current->thread.cet.ibt_bitmap_addr = bitmap; > > > > + current->thread.cet.ibt_bitmap_size = size; > > > > + > > > > + /* > > > > + * Turn on IBT legacy bitmap. > > > > + */ > > > > + modify_fpu_regs_begin(); > > > > + rdmsrl(MSR_IA32_U_CET, r); > > > > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > > > > + wrmsrl(MSR_IA32_U_CET, r); > > > > + modify_fpu_regs_end(); > > > > + > > > > + return 0; > > > > +} > > > > > > So you just program a random user supplied address into the hardware. > > > What happens if there's not actually anything at that address or the > > > user munmap()s the data after doing this? > > > > This function checks the bitmap's alignment and size, and anything else is > > the > > app's responsibility. What else do you think the kernel should check? > > > > One might reasonably wonder why this state is privileged in the first place > and, given that, why we’re allowing it to be written like this. > > Arguably we should have another prctl to lock these values (until exec) as a > gardening measure. We can prevent the bitmap from being set more than once. I will test it. Yu-cheng
> On Jun 7, 2019, at 9:45 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: > > On Fri, 2019-06-07 at 09:35 -0700, Andy Lutomirski wrote: >>> On Jun 7, 2019, at 9:23 AM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: >>> >>>>> On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote: >>>>> On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: >>>>> Indirect Branch Tracking (IBT) provides an optional legacy code bitmap >>>>> that allows execution of legacy, non-IBT compatible library by an >>>>> IBT-enabled application. When set, each bit in the bitmap indicates >>>>> one page of legacy code. >>>>> >>>>> The bitmap is allocated and setup from the application. >>>>> +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) >>>>> +{ >>>>> + u64 r; >>>>> + >>>>> + if (!current->thread.cet.ibt_enabled) >>>>> + return -EINVAL; >>>>> + >>>>> + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) >>>>> + return -EINVAL; >>>>> + >>>>> + current->thread.cet.ibt_bitmap_addr = bitmap; >>>>> + current->thread.cet.ibt_bitmap_size = size; >>>>> + >>>>> + /* >>>>> + * Turn on IBT legacy bitmap. >>>>> + */ >>>>> + modify_fpu_regs_begin(); >>>>> + rdmsrl(MSR_IA32_U_CET, r); >>>>> + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); >>>>> + wrmsrl(MSR_IA32_U_CET, r); >>>>> + modify_fpu_regs_end(); >>>>> + >>>>> + return 0; >>>>> +} >>>> >>>> So you just program a random user supplied address into the hardware. >>>> What happens if there's not actually anything at that address or the >>>> user munmap()s the data after doing this? >>> >>> This function checks the bitmap's alignment and size, and anything else is >>> the >>> app's responsibility. What else do you think the kernel should check? >>> >> >> One might reasonably wonder why this state is privileged in the first place >> and, given that, why we’re allowing it to be written like this. >> >> Arguably we should have another prctl to lock these values (until exec) as a >> gardening measure. > > We can prevent the bitmap from being set more than once. I will test it. > I think it would be better to make locking an explicit opt-in.
On Fri, Jun 07, 2019 at 09:23:43AM -0700, Yu-cheng Yu wrote: > On Fri, 2019-06-07 at 10:08 +0200, Peter Zijlstra wrote: > > On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote: > > > Indirect Branch Tracking (IBT) provides an optional legacy code bitmap > > > that allows execution of legacy, non-IBT compatible library by an > > > IBT-enabled application. When set, each bit in the bitmap indicates > > > one page of legacy code. > > > > > > The bitmap is allocated and setup from the application. > > > +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) > > > +{ > > > + u64 r; > > > + > > > + if (!current->thread.cet.ibt_enabled) > > > + return -EINVAL; > > > + > > > + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) > > > + return -EINVAL; > > > + > > > + current->thread.cet.ibt_bitmap_addr = bitmap; > > > + current->thread.cet.ibt_bitmap_size = size; > > > + > > > + /* > > > + * Turn on IBT legacy bitmap. > > > + */ > > > + modify_fpu_regs_begin(); > > > + rdmsrl(MSR_IA32_U_CET, r); > > > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > > > + wrmsrl(MSR_IA32_U_CET, r); > > > + modify_fpu_regs_end(); > > > + > > > + return 0; > > > +} > > > > So you just program a random user supplied address into the hardware. > > What happens if there's not actually anything at that address or the > > user munmap()s the data after doing this? > > This function checks the bitmap's alignment and size, and anything else is the > app's responsibility. What else do you think the kernel should check? I've no idea what the kernel should do; since you failed to answer the question what happens when you point this to garbage. Does it then fault or what?
On 6/7/19 10:43 AM, Peter Zijlstra wrote: > I've no idea what the kernel should do; since you failed to answer the > question what happens when you point this to garbage. > > Does it then fault or what? Yeah, I think you'll fault with a rather mysterious CR2 value since you'll go look at the instruction that faulted and not see any references to the CR2 value. I think this new MSR probably needs to get included in oops output when CET is enabled. Why don't we require that a VMA be in place for the entire bitmap? Don't we need a "get" prctl function too in case something like a JIT is running and needs to find the location of this bitmap to set bits itself? Or, do we just go whole-hog and have the kernel manage the bitmap itself. Our interface here could be: prctl(PR_MARK_CODE_AS_LEGACY, start, size); and then have the kernel allocate and set the bitmap for those code locations.
> On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 6/7/19 10:43 AM, Peter Zijlstra wrote: >> I've no idea what the kernel should do; since you failed to answer the >> question what happens when you point this to garbage. >> >> Does it then fault or what? > > Yeah, I think you'll fault with a rather mysterious CR2 value since > you'll go look at the instruction that faulted and not see any > references to the CR2 value. > > I think this new MSR probably needs to get included in oops output when > CET is enabled. This shouldn’t be able to OOPS because it only happens at CPL 3, right? We should put it into core dumps, though. > > Why don't we require that a VMA be in place for the entire bitmap? > Don't we need a "get" prctl function too in case something like a JIT is > running and needs to find the location of this bitmap to set bits itself? > > Or, do we just go whole-hog and have the kernel manage the bitmap > itself. Our interface here could be: > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > and then have the kernel allocate and set the bitmap for those code > locations. Given that the format depends on the VA size, this might be a good idea. I bet we can reuse the special mapping infrastructure for this — the VMA could be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we can even make special rules to core dump it intelligently if needed. And we can make mremap() on it work correctly if anyone (CRIU?) cares. Hmm. Can we be creative and skip populating it with zeros? The CPU should only ever touch a page if we miss an ENDBR on it, so, in normal operation, we don’t need anything to be there. We could try to prevent anyone from *reading* it outside of ENDBR tracking if we want to avoid people accidentally wasting lots of memory by forcing it to be fully populated when the read it. The one downside is this forces it to be per-mm, but that seems like a generally reasonable model anyway. This also gives us an excellent opportunity to make it read-only as seen from userspace to prevent exploits from just poking it full of ones before redirecting execution.
On 6/7/19 11:29 AM, Andy Lutomirski wrote: ... >> I think this new MSR probably needs to get included in oops output when >> CET is enabled. > > This shouldn’t be able to OOPS because it only happens at CPL 3, > right? We should put it into core dumps, though. Good point. Yu-cheng, can you just confirm that the bitmap can't be referenced in ring-0, no matter what? We should also make sure that no funny business happens if we put an address in the bitmap that faults, or is non-canonical. Do we have any self-tests for that? Let's say userspace gets a fault on this. Do they have the introspection capability to figure out why they faulted, say in their signal handler? >> Why don't we require that a VMA be in place for the entire bitmap? >> Don't we need a "get" prctl function too in case something like a JIT is >> running and needs to find the location of this bitmap to set bits itself? >> >> Or, do we just go whole-hog and have the kernel manage the bitmap >> itself. Our interface here could be: >> >> prctl(PR_MARK_CODE_AS_LEGACY, start, size); >> >> and then have the kernel allocate and set the bitmap for those code >> locations. > > Given that the format depends on the VA size, this might be a good > idea. Yeah, making userspace know how large the address space is or could be is rather nasty, especially if we ever get any fancy CPU features that eat up address bits (a la ARM top-byte-ignore or SPARC ADI). > Hmm. Can we be creative and skip populating it with zeros? The CPU should only ever touch a page if we miss an ENDBR on it, so, in normal operation, we don’t need anything to be there. We could try to prevent anyone from *reading* it outside of ENDBR tracking if we want to avoid people accidentally wasting lots of memory by forcing it to be fully populated when the read it. Won't reads on a big, contiguous private mapping get the huge zero page anyway? > The one downside is this forces it to be per-mm, but that seems like > a generally reasonable model anyway. Yeah, practically, you could only make it shared if you shared the layout of all code in the address space. I'm sure the big database(s) do that cross-process, but I bet nobody else does. User ASLR practically guarantees that nobody can do this. > This also gives us an excellent opportunity to make it read-only as > seen from userspace to prevent exploits from just poking it full of > ones before redirecting execution. That would be fun.
On 6/6/19 1:09 PM, Yu-cheng Yu wrote: > + modify_fpu_regs_begin(); > + rdmsrl(MSR_IA32_U_CET, r); > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > + wrmsrl(MSR_IA32_U_CET, r); > + modify_fpu_regs_end(); Isn't there a bunch of other stuff in this MSR? It seems like the bitmap value would allow overwriting lots of bits in the MSR that have nothing to do with the bitmap... in a prctl() that's supposed to only be dealing with the bitmap.
On Fri, 2019-06-07 at 12:03 -0700, Dave Hansen wrote: > On 6/6/19 1:09 PM, Yu-cheng Yu wrote: > > + modify_fpu_regs_begin(); > > + rdmsrl(MSR_IA32_U_CET, r); > > + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); > > + wrmsrl(MSR_IA32_U_CET, r); > > + modify_fpu_regs_end(); > > Isn't there a bunch of other stuff in this MSR? It seems like the > bitmap value would allow overwriting lots of bits in the MSR that have > nothing to do with the bitmap... in a prctl() that's supposed to only be > dealing with the bitmap. Yes, the bitmap address should have been masked, although it is checked for page alignment (which has the same effect). I will fix it. Yu-cheng
On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote: > > On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > > > > On 6/7/19 10:43 AM, Peter Zijlstra wrote: > > > I've no idea what the kernel should do; since you failed to answer the > > > question what happens when you point this to garbage. > > > > > > Does it then fault or what? > > > > Yeah, I think you'll fault with a rather mysterious CR2 value since > > you'll go look at the instruction that faulted and not see any > > references to the CR2 value. > > > > I think this new MSR probably needs to get included in oops output when > > CET is enabled. > > This shouldn’t be able to OOPS because it only happens at CPL 3, right? We > should put it into core dumps, though. > > > > > Why don't we require that a VMA be in place for the entire bitmap? > > Don't we need a "get" prctl function too in case something like a JIT is > > running and needs to find the location of this bitmap to set bits itself? > > > > Or, do we just go whole-hog and have the kernel manage the bitmap > > itself. Our interface here could be: > > > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > > > and then have the kernel allocate and set the bitmap for those code > > locations. > > Given that the format depends on the VA size, this might be a good idea. I > bet we can reuse the special mapping infrastructure for this — the VMA could > be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we > can even make special rules to core dump it intelligently if needed. And we > can make mremap() on it work correctly if anyone (CRIU?) cares. > > Hmm. Can we be creative and skip populating it with zeros? The CPU should > only ever touch a page if we miss an ENDBR on it, so, in normal operation, we > don’t need anything to be there. We could try to prevent anyone from > *reading* it outside of ENDBR tracking if we want to avoid people accidentally > wasting lots of memory by forcing it to be fully populated when the read it. > > The one downside is this forces it to be per-mm, but that seems like a > generally reasonable model anyway. > > This also gives us an excellent opportunity to make it read-only as seen from > userspace to prevent exploits from just poking it full of ones before > redirecting execution. GLIBC sets bits only for legacy code, and then makes the bitmap read-only. That avoids most issues: To populate bitmap pages, mprotect() is required. Reading zero bitmap pages would not waste more physical memory, right? Yu-cheng
On Fri, 2019-06-07 at 11:58 -0700, Dave Hansen wrote: > On 6/7/19 11:29 AM, Andy Lutomirski wrote: > ... > > > I think this new MSR probably needs to get included in oops output when > > > CET is enabled. > > > > This shouldn’t be able to OOPS because it only happens at CPL 3, > > right? We should put it into core dumps, though. > > Good point. > > Yu-cheng, can you just confirm that the bitmap can't be referenced in > ring-0, no matter what? We should also make sure that no funny business > happens if we put an address in the bitmap that faults, or is > non-canonical. Do we have any self-tests for that? Yes, the bitmap is user memory, but the kernel can still get to it (e.g. copy_from_user()). We can do more check on the address. > > Let's say userspace gets a fault on this. Do they have the > introspection capability to figure out why they faulted, say in their > signal handler? The bitmap address is kept by the application; the kernel won't provide it again to user-space. In the signal handler, the app can find out from its own record. [...]
On 6/7/19 12:49 PM, Yu-cheng Yu wrote: >> >> This also gives us an excellent opportunity to make it read-only as seen from >> userspace to prevent exploits from just poking it full of ones before >> redirecting execution. > GLIBC sets bits only for legacy code, and then makes the bitmap read-only. That > avoids most issues: > > To populate bitmap pages, mprotect() is required. > Reading zero bitmap pages would not waste more physical memory, right? Huh, how does glibc know about all possible past and future legacy code in the application?
On Fri, 2019-06-07 at 13:00 -0700, Dave Hansen wrote: > On 6/7/19 12:49 PM, Yu-cheng Yu wrote: > > > > > > This also gives us an excellent opportunity to make it read-only as seen > > > from > > > userspace to prevent exploits from just poking it full of ones before > > > redirecting execution. > > > > GLIBC sets bits only for legacy code, and then makes the bitmap read- > > only. That > > avoids most issues: > > > > To populate bitmap pages, mprotect() is required. > > Reading zero bitmap pages would not waste more physical memory, right? > > Huh, how does glibc know about all possible past and future legacy code > in the application? When dlopen() gets a legacy binary and the policy allows that, it will manage the bitmap: If a bitmap has not been created, create one. Set bits for the legacy code being loaded. Yu-cheng
> On Jun 7, 2019, at 11:58 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 6/7/19 11:29 AM, Andy Lutomirski wrote: > ... >>> I think this new MSR probably needs to get included in oops output when >>> CET is enabled. >> >> This shouldn’t be able to OOPS because it only happens at CPL 3, >> right? We should put it into core dumps, though. > > Good point. > > Yu-cheng, can you just confirm that the bitmap can't be referenced in > ring-0, no matter what? We should also make sure that no funny business > happens if we put an address in the bitmap that faults, or is > non-canonical. Do we have any self-tests for that? > > Let's say userspace gets a fault on this. Do they have the > introspection capability to figure out why they faulted, say in their > signal handler? We need to stick the tracker state in the sigcontext somewhere. Did we end up defining a signal frame shadow stack token? > >>> Why don't we require that a VMA be in place for the entire bitmap? >>> Don't we need a "get" prctl function too in case something like a JIT is >>> running and needs to find the location of this bitmap to set bits itself? >>> >>> Or, do we just go whole-hog and have the kernel manage the bitmap >>> itself. Our interface here could be: >>> >>> prctl(PR_MARK_CODE_AS_LEGACY, start, size); >>> >>> and then have the kernel allocate and set the bitmap for those code >>> locations. >> >> Given that the format depends on the VA size, this might be a good >> idea. > > Yeah, making userspace know how large the address space is or could be > is rather nasty, especially if we ever get any fancy CPU features that > eat up address bits (a la ARM top-byte-ignore or SPARC ADI). That gets extra bad if we ever grow user code that uses it but is unaware. It could poke the wrong part of the bitmap. > >> Hmm. Can we be creative and skip populating it with zeros? The CPU > should only ever touch a page if we miss an ENDBR on it, so, in normal > operation, we don’t need anything to be there. We could try to prevent > anyone from *reading* it outside of ENDBR tracking if we want to avoid > people accidentally wasting lots of memory by forcing it to be fully > populated when the read it. > > Won't reads on a big, contiguous private mapping get the huge zero page > anyway? The zero pages may be free, but the page tables could be decently large. Does the core mm code use huge, immense, etc huge zero pages? Or can it synthesize them by reusing page table pages that map zeros? > >> The one downside is this forces it to be per-mm, but that seems like >> a generally reasonable model anyway. > > Yeah, practically, you could only make it shared if you shared the > layout of all code in the address space. I'm sure the big database(s) > do that cross-process, but I bet nobody else does. User ASLR > practically guarantees that nobody can do this. I meant per-mm instead of per-task.
> On Jun 7, 2019, at 12:49 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: > > On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote: >>> On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote: >>> >>>> On 6/7/19 10:43 AM, Peter Zijlstra wrote: >>>> I've no idea what the kernel should do; since you failed to answer the >>>> question what happens when you point this to garbage. >>>> >>>> Does it then fault or what? >>> >>> Yeah, I think you'll fault with a rather mysterious CR2 value since >>> you'll go look at the instruction that faulted and not see any >>> references to the CR2 value. >>> >>> I think this new MSR probably needs to get included in oops output when >>> CET is enabled. >> >> This shouldn’t be able to OOPS because it only happens at CPL 3, right? We >> should put it into core dumps, though. >> >>> >>> Why don't we require that a VMA be in place for the entire bitmap? >>> Don't we need a "get" prctl function too in case something like a JIT is >>> running and needs to find the location of this bitmap to set bits itself? >>> >>> Or, do we just go whole-hog and have the kernel manage the bitmap >>> itself. Our interface here could be: >>> >>> prctl(PR_MARK_CODE_AS_LEGACY, start, size); >>> >>> and then have the kernel allocate and set the bitmap for those code >>> locations. >> >> Given that the format depends on the VA size, this might be a good idea. I >> bet we can reuse the special mapping infrastructure for this — the VMA could >> be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and we >> can even make special rules to core dump it intelligently if needed. And we >> can make mremap() on it work correctly if anyone (CRIU?) cares. >> >> Hmm. Can we be creative and skip populating it with zeros? The CPU should >> only ever touch a page if we miss an ENDBR on it, so, in normal operation, we >> don’t need anything to be there. We could try to prevent anyone from >> *reading* it outside of ENDBR tracking if we want to avoid people accidentally >> wasting lots of memory by forcing it to be fully populated when the read it. >> >> The one downside is this forces it to be per-mm, but that seems like a >> generally reasonable model anyway. >> >> This also gives us an excellent opportunity to make it read-only as seen from >> userspace to prevent exploits from just poking it full of ones before >> redirecting execution. > > GLIBC sets bits only for legacy code, and then makes the bitmap read-only. That > avoids most issues: How does glibc know the linear address space size? We don’t want LA64 to break old binaries because the address calculation changed.
On 6/7/19 1:40 PM, Andy Lutomirski wrote: >>> Hmm. Can we be creative and skip populating it with zeros? The >>> CPU >> should only ever touch a page if we miss an ENDBR on it, so, in >> normal operation, we don’t need anything to be there. We could try >> to prevent anyone from *reading* it outside of ENDBR tracking if we >> want to avoid people accidentally wasting lots of memory by forcing >> it to be fully populated when the read it. >> >> Won't reads on a big, contiguous private mapping get the huge zero >> page anyway? > > The zero pages may be free, but the page tables could be decently large. Does the core mm code use huge, immense, etc huge zero pages? Or can it synthesize them by reusing page table pages that map zeros? IIRC, we only ever fill single PMDs, even though we could gang a pmd page up and do it for 1GB areas too. I guess the page table consumption could really suck if we had code all over the 57-bit address space and that code moved around and the process ran for a long long time. Pathologically, we need a ulong/pmd_t for each 2MB of address space which is 8*2^56-30=512GB per process. Yikes. Right now, we'd at least detect the memory consumption and OOM-kill the process(es) eventually. But, that's not really _this_ patch's problem. It's a general problem, and doesn't even require the zero page to be mapped all over. Longer-term, I'd much rather see us add some page table reclaim mechanism that new how to go after things like excessive page tables in MAP_NORESERVE areas.
On 6/7/19 1:06 PM, Yu-cheng Yu wrote: >> Huh, how does glibc know about all possible past and future legacy code >> in the application? > When dlopen() gets a legacy binary and the policy allows that, it will manage > the bitmap: > > If a bitmap has not been created, create one. > Set bits for the legacy code being loaded. I was thinking about code that doesn't go through GLIBC like JITs.
> On Jun 7, 2019, at 2:09 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 6/7/19 1:06 PM, Yu-cheng Yu wrote: >>> Huh, how does glibc know about all possible past and future legacy code >>> in the application? >> When dlopen() gets a legacy binary and the policy allows that, it will manage >> the bitmap: >> >> If a bitmap has not been created, create one. >> Set bits for the legacy code being loaded. > > I was thinking about code that doesn't go through GLIBC like JITs. CRIU is another consideration: it would be rather annoying if CET programs can’t migrate between LA57 and normal machines.
Hi! > > I've no idea what the kernel should do; since you failed to answer the > > question what happens when you point this to garbage. > > > > Does it then fault or what? > > Yeah, I think you'll fault with a rather mysterious CR2 value since > you'll go look at the instruction that faulted and not see any > references to the CR2 value. > > I think this new MSR probably needs to get included in oops output when > CET is enabled. > > Why don't we require that a VMA be in place for the entire bitmap? > Don't we need a "get" prctl function too in case something like a JIT is > running and needs to find the location of this bitmap to set bits itself? > > Or, do we just go whole-hog and have the kernel manage the bitmap > itself. Our interface here could be: > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > and then have the kernel allocate and set the bitmap for those code > locations. For the record, that sounds like a better interface than userspace knowing about the bitmap formats... Pavel
On Fri, 2019-06-07 at 13:43 -0700, Andy Lutomirski wrote: > > On Jun 7, 2019, at 12:49 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote: > > > > On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote: > > > > On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > > > > On 6/7/19 10:43 AM, Peter Zijlstra wrote: > > > > > I've no idea what the kernel should do; since you failed to answer the > > > > > question what happens when you point this to garbage. > > > > > > > > > > Does it then fault or what? > > > > > > > > Yeah, I think you'll fault with a rather mysterious CR2 value since > > > > you'll go look at the instruction that faulted and not see any > > > > references to the CR2 value. > > > > > > > > I think this new MSR probably needs to get included in oops output when > > > > CET is enabled. > > > > > > This shouldn’t be able to OOPS because it only happens at CPL 3, > > > right? We > > > should put it into core dumps, though. > > > > > > > > > > > Why don't we require that a VMA be in place for the entire bitmap? > > > > Don't we need a "get" prctl function too in case something like a JIT is > > > > running and needs to find the location of this bitmap to set bits > > > > itself? > > > > > > > > Or, do we just go whole-hog and have the kernel manage the bitmap > > > > itself. Our interface here could be: > > > > > > > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > > > > > > > and then have the kernel allocate and set the bitmap for those code > > > > locations. > > > > > > Given that the format depends on the VA size, this might be a good > > > idea. I > > > bet we can reuse the special mapping infrastructure for this — the VMA > > > could > > > be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and > > > we > > > can even make special rules to core dump it intelligently if needed. And > > > we > > > can make mremap() on it work correctly if anyone (CRIU?) cares. > > > > > > Hmm. Can we be creative and skip populating it with zeros? The CPU > > > should > > > only ever touch a page if we miss an ENDBR on it, so, in normal operation, > > > we > > > don’t need anything to be there. We could try to prevent anyone from > > > *reading* it outside of ENDBR tracking if we want to avoid people > > > accidentally > > > wasting lots of memory by forcing it to be fully populated when the read > > > it. > > > > > > The one downside is this forces it to be per-mm, but that seems like a > > > generally reasonable model anyway. > > > > > > This also gives us an excellent opportunity to make it read-only as seen > > > from > > > userspace to prevent exploits from just poking it full of ones before > > > redirecting execution. > > > > GLIBC sets bits only for legacy code, and then makes the bitmap read- > > only. That > > avoids most issues: > > How does glibc know the linear address space size? We don’t want LA64 to > break old binaries because the address calculation changed. When an application starts, its highest stack address is determined. It uses that as the maximum the bitmap needs to cover. Yu-cheng
On Sat, 2019-06-08 at 22:52 +0200, Pavel Machek wrote: > Hi! > > > > I've no idea what the kernel should do; since you failed to answer the > > > question what happens when you point this to garbage. > > > > > > Does it then fault or what? > > > > Yeah, I think you'll fault with a rather mysterious CR2 value since > > you'll go look at the instruction that faulted and not see any > > references to the CR2 value. > > > > I think this new MSR probably needs to get included in oops output when > > CET is enabled. > > > > Why don't we require that a VMA be in place for the entire bitmap? > > Don't we need a "get" prctl function too in case something like a JIT is > > running and needs to find the location of this bitmap to set bits itself? > > > > Or, do we just go whole-hog and have the kernel manage the bitmap > > itself. Our interface here could be: > > > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > > > and then have the kernel allocate and set the bitmap for those code > > locations. > > For the record, that sounds like a better interface than userspace knowing > about the bitmap formats... > Pavel Initially we implemented the bitmap that way. To manage the bitmap, every time the application issues a syscall for a .so it loads, and the kernel does copy_from_user() & copy_to_user() (or similar things). If a system has a few legacy .so files and every application does the same, it can take a long time to boot up. Yu-cheng
On Fri, 2019-06-07 at 15:27 -0700, Andy Lutomirski wrote: > > On Jun 7, 2019, at 2:09 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > > > On 6/7/19 1:06 PM, Yu-cheng Yu wrote: > > > > Huh, how does glibc know about all possible past and future legacy code > > > > in the application? > > > > > > When dlopen() gets a legacy binary and the policy allows that, it will > > > manage > > > the bitmap: > > > > > > If a bitmap has not been created, create one. > > > Set bits for the legacy code being loaded. > > > > I was thinking about code that doesn't go through GLIBC like JITs. > > CRIU is another consideration: it would be rather annoying if CET programs > can’t migrate between LA57 and normal machines. When a machine migrates, does its applications' addresses change? If no, then the bitmap should still work, right? Yu-cheng
On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote: > On 6/7/19 1:06 PM, Yu-cheng Yu wrote: > > > Huh, how does glibc know about all possible past and future legacy code > > > in the application? > > > > When dlopen() gets a legacy binary and the policy allows that, it will > > manage > > the bitmap: > > > > If a bitmap has not been created, create one. > > Set bits for the legacy code being loaded. > > I was thinking about code that doesn't go through GLIBC like JITs. If JIT manages the bitmap, it knows where it is. It can always read the bitmap again, right? Yu-cheng
* Yu-cheng Yu: > On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote: >> On 6/7/19 1:06 PM, Yu-cheng Yu wrote: >> > > Huh, how does glibc know about all possible past and future legacy code >> > > in the application? >> > >> > When dlopen() gets a legacy binary and the policy allows that, it will >> > manage >> > the bitmap: >> > >> > If a bitmap has not been created, create one. >> > Set bits for the legacy code being loaded. >> >> I was thinking about code that doesn't go through GLIBC like JITs. > > If JIT manages the bitmap, it knows where it is. > It can always read the bitmap again, right? The problem are JIT libraries without assembler code which can be marked non-CET, such as liborc. Our builds (e.g., orc-0.4.29-2.fc30.x86_64) currently carries the IBT and SHSTK flag, although the entry points into the generated code do not start with ENDBR, so that a jump to them will fault with the CET enabled. Thanks, Florian
On 6/10/19 9:05 AM, Yu-cheng Yu wrote: > On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote: >> On 6/7/19 1:06 PM, Yu-cheng Yu wrote: >>>> Huh, how does glibc know about all possible past and future legacy code >>>> in the application? >>> When dlopen() gets a legacy binary and the policy allows that, it will >>> manage >>> the bitmap: >>> >>> If a bitmap has not been created, create one. >>> Set bits for the legacy code being loaded. >> I was thinking about code that doesn't go through GLIBC like JITs. > If JIT manages the bitmap, it knows where it is. > It can always read the bitmap again, right? Let's just be clear: The design proposed here is that all code mappers (anybody wanting to get legacy non-CET code into the address space): 1. Know about CET 2. Know where the bitmap is, and identify the part that needs to be changed 3. Be able to mprotect() the bitmap to be writable (undoing glibc's PROT_READ) 4. Set the bits in the bitmap for the legacy code 5. mprotect() the bitmap back to PROT_READ Do the non-glibc code mappers have glibc interfaces for this? Otherwise, how could a bunch of JITs in a big multi-threaded application possibly coordinate the mprotect()s? Won't they race with each other?
On 6/10/19 8:22 AM, Yu-cheng Yu wrote: >> How does glibc know the linear address space size? We don’t want LA64 to >> break old binaries because the address calculation changed. > When an application starts, its highest stack address is determined. > It uses that as the maximum the bitmap needs to cover. Huh, I didn't think we ran code from the stack. ;) Especially given the way that we implemented the new 5-level-paging address space, I don't think that expecting code to be below the stack is a good universal expectation.
On Mon, 2019-06-10 at 11:02 -0700, Dave Hansen wrote: > On 6/10/19 8:22 AM, Yu-cheng Yu wrote: > > > How does glibc know the linear address space size? We don’t want LA64 to > > > break old binaries because the address calculation changed. > > > > When an application starts, its highest stack address is determined. > > It uses that as the maximum the bitmap needs to cover. > > Huh, I didn't think we ran code from the stack. ;) > > Especially given the way that we implemented the new 5-level-paging > address space, I don't think that expecting code to be below the stack > is a good universal expectation. Yes, you make a good point. However, allowing the application manage the bitmap is the most efficient and flexible. If the loader finds a legacy lib is beyond the bitmap can cover, it can deal with the problem by moving the lib to a lower address; or re-allocate the bitmap. If the loader cannot allocate a big bitmap to cover all 5-level address space (the bitmap will be large), it can put all legacy lib's at lower address. We cannot do these easily in the kernel. Yu-cheng
On 6/10/19 12:38 PM, Yu-cheng Yu wrote: >>> When an application starts, its highest stack address is determined. >>> It uses that as the maximum the bitmap needs to cover. >> Huh, I didn't think we ran code from the stack. ;) >> >> Especially given the way that we implemented the new 5-level-paging >> address space, I don't think that expecting code to be below the stack >> is a good universal expectation. > Yes, you make a good point. However, allowing the application manage the bitmap > is the most efficient and flexible. If the loader finds a legacy lib is beyond > the bitmap can cover, it can deal with the problem by moving the lib to a lower > address; or re-allocate the bitmap. How could the loader reallocate the bitmap and coordinate with other users of the bitmap? > If the loader cannot allocate a big bitmap to cover all 5-level > address space (the bitmap will be large), it can put all legacy lib's > at lower address. We cannot do these easily in the kernel. This is actually an argument to do it in the kernel. The kernel can always allocate the virtual space however it wants, no matter how large. If we hide the bitmap behind a kernel API then we can put it at high 5-level user addresses because we also don't have to worry about the high bits confusing userspace.
On Mon, Jun 10, 2019 at 12:52 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 6/10/19 12:38 PM, Yu-cheng Yu wrote: > >>> When an application starts, its highest stack address is determined. > >>> It uses that as the maximum the bitmap needs to cover. > >> Huh, I didn't think we ran code from the stack. ;) > >> > >> Especially given the way that we implemented the new 5-level-paging > >> address space, I don't think that expecting code to be below the stack > >> is a good universal expectation. > > Yes, you make a good point. However, allowing the application manage the bitmap > > is the most efficient and flexible. If the loader finds a legacy lib is beyond > > the bitmap can cover, it can deal with the problem by moving the lib to a lower > > address; or re-allocate the bitmap. > > How could the loader reallocate the bitmap and coordinate with other > users of the bitmap? > > > If the loader cannot allocate a big bitmap to cover all 5-level > > address space (the bitmap will be large), it can put all legacy lib's > > at lower address. We cannot do these easily in the kernel. > > This is actually an argument to do it in the kernel. The kernel can > always allocate the virtual space however it wants, no matter how large. > If we hide the bitmap behind a kernel API then we can put it at high > 5-level user addresses because we also don't have to worry about the > high bits confusing userspace. > That's a fairly compelling argument. The bitmap is one bit per page, right? So it's smaller than the address space by a factor of 8*2^12 == 2^15. This means that, if we ever get full 64-bit linear addresses reserved entirely for userspace (which could happen if my perennial request to Intel to split user and kernel addresses completely happens), then we'll need 2^48 bytes for the bitmap, which simply does not fit in the address space of a legacy application.
On Mon, 2019-06-10 at 12:52 -0700, Dave Hansen wrote: > On 6/10/19 12:38 PM, Yu-cheng Yu wrote: > > > > When an application starts, its highest stack address is determined. > > > > It uses that as the maximum the bitmap needs to cover. > > > > > > Huh, I didn't think we ran code from the stack. ;) > > > > > > Especially given the way that we implemented the new 5-level-paging > > > address space, I don't think that expecting code to be below the stack > > > is a good universal expectation. > > > > Yes, you make a good point. However, allowing the application manage the > > bitmap > > is the most efficient and flexible. If the loader finds a legacy lib is > > beyond > > the bitmap can cover, it can deal with the problem by moving the lib to a > > lower > > address; or re-allocate the bitmap. > > How could the loader reallocate the bitmap and coordinate with other > users of the bitmap? Assuming the loader actually chooses to re-allocate, it can copy the old bitmap over to the new before doing the switch. But, I agree, the other choice is easier; the loader can simply put the lib at lower address. AFAIK, the loader does not request high address in mmap(). > > > If the loader cannot allocate a big bitmap to cover all 5-level > > address space (the bitmap will be large), it can put all legacy lib's > > at lower address. We cannot do these easily in the kernel. > > This is actually an argument to do it in the kernel. The kernel can > always allocate the virtual space however it wants, no matter how large. > If we hide the bitmap behind a kernel API then we can put it at high > 5-level user addresses because we also don't have to worry about the > high bits confusing userspace. We actually tried this. The kernel needs to reserve the bitmap space in the beginning for every CET-enabled app, regardless of actual needs. On each memory request, the kernel then must consider a percentage of allocated space in its calculation, and on systems with less memory this quickly becomes a problem.
On 6/10/19 1:27 PM, Yu-cheng Yu wrote: >>> If the loader cannot allocate a big bitmap to cover all 5-level >>> address space (the bitmap will be large), it can put all legacy lib's >>> at lower address. We cannot do these easily in the kernel. >> This is actually an argument to do it in the kernel. The kernel can >> always allocate the virtual space however it wants, no matter how large. >> If we hide the bitmap behind a kernel API then we can put it at high >> 5-level user addresses because we also don't have to worry about the >> high bits confusing userspace. > We actually tried this. The kernel needs to reserve the bitmap space in the > beginning for every CET-enabled app, regardless of actual needs. I don't think this is a problem. In fact, I think reserving the space is actually the only sane behavior. If you don't reserve it, you fundamentally limit where future legacy instructions can go. One idea is that we always size the bitmap for the 48-bit addressing systems. Legacy code probably doesn't _need_ to go in the new address space, and if we do this we don't have to worry about the gigantic 57-bit address space bitmap. > On each memory request, the kernel then must consider a percentage of > allocated space in its calculation, and on systems with less memory > this quickly becomes a problem. I'm not sure what you're referring to here? Are you referring to our overcommit limits?
On Mon, 2019-06-10 at 13:43 -0700, Dave Hansen wrote: > On 6/10/19 1:27 PM, Yu-cheng Yu wrote: > > > > If the loader cannot allocate a big bitmap to cover all 5-level > > > > address space (the bitmap will be large), it can put all legacy lib's > > > > at lower address. We cannot do these easily in the kernel. > > > > > > This is actually an argument to do it in the kernel. The kernel can > > > always allocate the virtual space however it wants, no matter how large. > > > If we hide the bitmap behind a kernel API then we can put it at high > > > 5-level user addresses because we also don't have to worry about the > > > high bits confusing userspace. > > > > We actually tried this. The kernel needs to reserve the bitmap space in the > > beginning for every CET-enabled app, regardless of actual needs. > > I don't think this is a problem. In fact, I think reserving the space > is actually the only sane behavior. If you don't reserve it, you > fundamentally limit where future legacy instructions can go. > > One idea is that we always size the bitmap for the 48-bit addressing > systems. Legacy code probably doesn't _need_ to go in the new address > space, and if we do this we don't have to worry about the gigantic > 57-bit address space bitmap. > > > On each memory request, the kernel then must consider a percentage of > > allocated space in its calculation, and on systems with less memory > > this quickly becomes a problem. > > I'm not sure what you're referring to here? Are you referring to our > overcommit limits? Yes.
On 6/10/19 1:58 PM, Yu-cheng Yu wrote: >>> On each memory request, the kernel then must consider a percentage of >>> allocated space in its calculation, and on systems with less memory >>> this quickly becomes a problem. >> I'm not sure what you're referring to here? Are you referring to our >> overcommit limits? > Yes. My assumption has always been that these large, potentially sparse hardware tables *must* be mmap()'d with MAP_NORESERVE specified. That should keep them from being problematic with respect to overcommit.
On Mon, 2019-06-10 at 15:02 -0700, Dave Hansen wrote: > On 6/10/19 1:58 PM, Yu-cheng Yu wrote: > > > > On each memory request, the kernel then must consider a percentage of > > > > allocated space in its calculation, and on systems with less memory > > > > this quickly becomes a problem. > > > > > > I'm not sure what you're referring to here? Are you referring to our > > > overcommit limits? > > > > Yes. > > My assumption has always been that these large, potentially sparse > hardware tables *must* be mmap()'d with MAP_NORESERVE specified. That > should keep them from being problematic with respect to overcommit. Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and VM_DONTDUMP. The bitmap will cover only 48-bit address space. We then create PR_MARK_CODE_AS_LEGACY. The kernel will set the bitmap, but it is going to be slow. Perhaps we still let the app fill the bitmap? Yu-cheng
On 6/10/19 3:40 PM, Yu-cheng Yu wrote: > Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and > VM_DONTDUMP. The bitmap will cover only 48-bit address space. Could you make sure to discuss the downsides of only doing a 48-bit address space? What are the reasons behind and implications of VM_DONTDUMP? > We then create PR_MARK_CODE_AS_LEGACY. The kernel will set the bitmap, but it > is going to be slow. Slow compared to what? We're effectively adding one (quick) system call to a path that, today, has at *least* half a dozen syscalls and probably a bunch of page faults. Heck, we can probably avoid the actual page fault to populate the bitmap if we're careful. That alone would put a syscall on equal footing with any other approach. If the bit setting crossed a page boundary it would probably win. > Perhaps we still let the app fill the bitmap? I think I'd want to see some performance data on it first.
On Mon, Jun 10, 2019 at 3:59 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > We then create PR_MARK_CODE_AS_LEGACY. The kernel will set the bitmap, but it > > is going to be slow. > > Slow compared to what? We're effectively adding one (quick) system call > to a path that, today, has at *least* half a dozen syscalls and probably > a bunch of page faults. Heck, we can probably avoid the actual page > fault to populate the bitmap if we're careful. That alone would put a > syscall on equal footing with any other approach. If the bit setting > crossed a page boundary it would probably win. > > > Perhaps we still let the app fill the bitmap? > > I think I'd want to see some performance data on it first. Updating legacy bitmap in user space from kernel requires long q; get_user(q, ...); q |= mask; put_user(q, ...); instead of *p |= mask; get_user + put_user was quite slow when we tried before.
On 6/10/19 4:20 PM, H.J. Lu wrote: >>> Perhaps we still let the app fill the bitmap? >> I think I'd want to see some performance data on it first. > Updating legacy bitmap in user space from kernel requires > > long q; > > get_user(q, ...); > q |= mask; > put_user(q, ...); > > instead of > > *p |= mask; > > get_user + put_user was quite slow when we tried before. Numbers, please. There are *lots* of ways to speed something like that up if you have actual issues with it. For instance, you can skip the get_user() for whole bytes. You can write bits with 0's for unallocated address space. You can do user_access_begin/end() to avoid bunches of STAC/CLACs... The list goes on and on. :)
> On Jun 10, 2019, at 3:59 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 6/10/19 3:40 PM, Yu-cheng Yu wrote: >> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and >> VM_DONTDUMP. The bitmap will cover only 48-bit address space. > > Could you make sure to discuss the downsides of only doing a 48-bit > address space? > > What are the reasons behind and implications of VM_DONTDUMP? > >> We then create PR_MARK_CODE_AS_LEGACY. The kernel will set the bitmap, but it >> is going to be slow. > > Slow compared to what? We're effectively adding one (quick) system call > to a path that, today, has at *least* half a dozen syscalls and probably > a bunch of page faults. Heck, we can probably avoid the actual page > fault to populate the bitmap if we're careful. That alone would put a > syscall on equal footing with any other approach. If the bit setting > crossed a page boundary it would probably win. > >> Perhaps we still let the app fill the bitmap? > > I think I'd want to see some performance data on it first. Trying to summarize: If we manage the whole thing in user space, we are basically committing to only covering 48 bits — otherwise the whole model falls apart in quite a few ways. We gain some simplicity in the kernel. If we do it in the kernel, we still have to decide how much address space to cover. We get to play games like allocating the bitmap above 2^48, but then we might have CRIU issues if we migrate to a system with fewer BA bits. I doubt that the performance matters much one way or another. I just don’t expect any of this to be a bottleneck. Another benefit of kernel management: we could plausibly auto-clear the bits corresponding to munmapped regions. Is this worth it? And a maybe-silly benefit: if we manage it in the kernel, we could optimize the inevitable case where the bitmap contains pages that are all ones :). If it’s in userspace, KSM could do the, but that will be inefficient at best.
On 6/10/19 4:54 PM, Andy Lutomirski wrote: > Another benefit of kernel management: we could plausibly auto-clear > the bits corresponding to munmapped regions. Is this worth it? I did it for MPX. I think I even went to the trouble of zapping the whole pages that got unused. But, MPX tables took 80% of the address space, worst-case. This takes 0.003% :) The only case it would really matter would be a task was long-running, used legacy executables/JITs, and was mapping/unmapping text all over the address space. That seems rather unlikely.
> On Jun 10, 2019, at 5:08 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 6/10/19 4:54 PM, Andy Lutomirski wrote: >> Another benefit of kernel management: we could plausibly auto-clear >> the bits corresponding to munmapped regions. Is this worth it? > > I did it for MPX. I think I even went to the trouble of zapping the > whole pages that got unused. > > But, MPX tables took 80% of the address space, worst-case. This takes > 0.003% :) The only case it would really matter would be a task was > long-running, used legacy executables/JITs, and was mapping/unmapping > text all over the address space. That seems rather unlikely. Every wasted page still costs 4K plus page table overhead. The worst case is a JIT that doesn’t clean up and leaks legacy bitmap memory all over. We can blame the JIT, but the actual attribution could be complicated. It also matters when you unmap one thing, map something else, and are sad when the legacy bits are still set. Admittedly, it’s a bit hard to imagine the exploit that takes advantage of this.
* Dave Hansen: > My assumption has always been that these large, potentially sparse > hardware tables *must* be mmap()'d with MAP_NORESERVE specified. That > should keep them from being problematic with respect to overcommit. MAP_NORESERVE pages still count towards the commit limit. The flag only disables checks at allocation time, for this particular allocation. (At least this was the behavior the last time I looked into this, I believe.) Not sure if this makes a difference here. Thanks, Florian
On Mon 2019-06-10 08:47:45, Yu-cheng Yu wrote: > On Sat, 2019-06-08 at 22:52 +0200, Pavel Machek wrote: > > Hi! > > > > > > I've no idea what the kernel should do; since you failed to answer the > > > > question what happens when you point this to garbage. > > > > > > > > Does it then fault or what? > > > > > > Yeah, I think you'll fault with a rather mysterious CR2 value since > > > you'll go look at the instruction that faulted and not see any > > > references to the CR2 value. > > > > > > I think this new MSR probably needs to get included in oops output when > > > CET is enabled. > > > > > > Why don't we require that a VMA be in place for the entire bitmap? > > > Don't we need a "get" prctl function too in case something like a JIT is > > > running and needs to find the location of this bitmap to set bits itself? > > > > > > Or, do we just go whole-hog and have the kernel manage the bitmap > > > itself. Our interface here could be: > > > > > > prctl(PR_MARK_CODE_AS_LEGACY, start, size); > > > > > > and then have the kernel allocate and set the bitmap for those code > > > locations. > > > > For the record, that sounds like a better interface than userspace knowing > > about the bitmap formats... > > Pavel > > Initially we implemented the bitmap that way. To manage the bitmap, every time > the application issues a syscall for a .so it loads, and the kernel does > copy_from_user() & copy_to_user() (or similar things). If a system has a few > legacy .so files and every application does the same, it can take a long time to > boot up. Loading .so is already many syscalls, I'd not expect measurable performance there. Are you sure? Pavel
On Mon, 2019-06-10 at 15:59 -0700, Dave Hansen wrote: > On 6/10/19 3:40 PM, Yu-cheng Yu wrote: > > Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and > > VM_DONTDUMP. The bitmap will cover only 48-bit address space. > > Could you make sure to discuss the downsides of only doing a 48-bit > address space? The downside is that we cannot load legacy lib's above 48-bit address space, but currently ld-linux does not do that. Should ld-linux do that in the future, dlopen() fails. Considering CRIU migration, we probably need to do this anyway? > What are the reasons behind and implications of VM_DONTDUMP? The bitmap is very big. In GDB, it should be easy to tell why a control-protection fault occurred without the bitmap. Yu-cheng
On 6/14/19 8:25 AM, Yu-cheng Yu wrote: > On Mon, 2019-06-10 at 15:59 -0700, Dave Hansen wrote: >> On 6/10/19 3:40 PM, Yu-cheng Yu wrote: >>> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and >>> VM_DONTDUMP. The bitmap will cover only 48-bit address space. >> >> Could you make sure to discuss the downsides of only doing a 48-bit >> address space? > > The downside is that we cannot load legacy lib's above 48-bit address space, but > currently ld-linux does not do that. Should ld-linux do that in the future, > dlopen() fails. Considering CRIU migration, we probably need to do this anyway? Again, I was thinking about JITs. Please remember that not all code in the system is from files on the disk. Please. We need to be really, really sure that we don't addle this implementation by being narrow minded about this. Please don't forget about JITs. >> What are the reasons behind and implications of VM_DONTDUMP? > > The bitmap is very big. Really? It's actually, what, 8*4096=32k, so 1/32,768th of the size of the libraries legacy libraries you load? Do our crash dumps really not know how to represent or deal with sparse mappings? > In GDB, it should be easy to tell why a control-protection fault occurred > without the bitmap. How about why one didn't happen?
On Fri, 2019-06-14 at 09:13 -0700, Dave Hansen wrote: > On 6/14/19 8:25 AM, Yu-cheng Yu wrote: > > On Mon, 2019-06-10 at 15:59 -0700, Dave Hansen wrote: > > > On 6/10/19 3:40 PM, Yu-cheng Yu wrote: > > > > Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and > > > > VM_DONTDUMP. The bitmap will cover only 48-bit address space. > > > > > > Could you make sure to discuss the downsides of only doing a 48-bit > > > address space? > > > > The downside is that we cannot load legacy lib's above 48-bit address space, > > but > > currently ld-linux does not do that. Should ld-linux do that in the future, > > dlopen() fails. Considering CRIU migration, we probably need to do this > > anyway? > > Again, I was thinking about JITs. Please remember that not all code in > the system is from files on the disk. Please. We need to be really, > really sure that we don't addle this implementation by being narrow > minded about this. > > Please don't forget about JITs. > > > > What are the reasons behind and implications of VM_DONTDUMP? > > > > The bitmap is very big. > > Really? It's actually, what, 8*4096=32k, so 1/32,768th of the size of > the libraries legacy libraries you load? Do our crash dumps really not > know how to represent or deal with sparse mappings? Ok, even the core dump is not physically big, its size still looks odd, right? Could this also affect how much time for GDB to load it. We will only mmap the bitmap when the first time the bitmap prctl is called. I have a related question: Do we allow the application to read the bitmap, or any fault from the application on bitmap pages? We populate a page only when bits are set from a prctl. Any other fault means either the application tries to find out an address range's status or it executes legacy code that has not been marked in the bitmap. > > > In GDB, it should be easy to tell why a control-protection fault occurred > > without the bitmap. > > How about why one didn't happen? We'll dump the bitmap if it is allocated. Yu-cheng
On 6/14/19 10:13 AM, Yu-cheng Yu wrote: > On Fri, 2019-06-14 at 09:13 -0700, Dave Hansen wrote: >> On 6/14/19 8:25 AM, Yu-cheng Yu wrote: >>> The bitmap is very big. >> >> Really? It's actually, what, 8*4096=32k, so 1/32,768th of the size of >> the libraries legacy libraries you load? Do our crash dumps really not >> know how to represent or deal with sparse mappings? > > Ok, even the core dump is not physically big, its size still looks odd, right? Hell if I know. Could you please go try this in practice so that we're designing this thing fixing real actual problems instead of phantoms that we're anticipating? > Could this also affect how much time for GDB to load it. I don't know. Can you go find out for sure, please? > I have a related question: > > Do we allow the application to read the bitmap, or any fault from the > application on bitmap pages? We have to allow apps to read it. Otherwise they can't execute instructions. We don't have to allow them to (popuating) fault on it. But, if we don't, we need some kind of kernel interface to avoid the faults.
On Fri, 2019-06-14 at 13:57 -0700, Dave Hansen wrote: > On 6/14/19 10:13 AM, Yu-cheng Yu wrote: > > On Fri, 2019-06-14 at 09:13 -0700, Dave Hansen wrote: > > > On 6/14/19 8:25 AM, Yu-cheng Yu wrote: > > > > The bitmap is very big. > > > > > > Really? It's actually, what, 8*4096=32k, so 1/32,768th of the size of > > > the libraries legacy libraries you load? Do our crash dumps really not > > > know how to represent or deal with sparse mappings? > > > > Ok, even the core dump is not physically big, its size still looks odd, > > right? > > Hell if I know. > > Could you please go try this in practice so that we're designing this > thing fixing real actual problems instead of phantoms that we're > anticipating? > > > Could this also affect how much time for GDB to load it. > > I don't know. Can you go find out for sure, please? OK! > > > I have a related question: > > > > Do we allow the application to read the bitmap, or any fault from the > > application on bitmap pages? > > We have to allow apps to read it. Otherwise they can't execute > instructions. What I meant was, if an app executes some legacy code that results in bitmap lookup, but the bitmap page is not yet populated, and if we then populate that page with all-zero, a #CP should follow. So do we even populate that zero page at all? I think we should; a #CP is more obvious to the user at least. > > We don't have to allow them to (popuating) fault on it. But, if we > don't, we need some kind of kernel interface to avoid the faults. The plan is: * Move STACK_TOP (and vdso) down to give space to the bitmap. * Reserve the bitmap space from (mm->start_stack + PAGE_SIZE) to cover a code size of TASK_SIZE_LOW, which is (TASK_SIZE_LOW / PAGE_SIZE / 8). * Mmap the space only when the app issues the first mark-legacy prctl. This avoids the core-dump issue for most apps and the accounting problem that MAP_NORESERVE probably won't solve completely. * The bitmap is read-only. The kernel sets the bitmap with get_user_pages_fast(FOLL_WRITE) and user_access_begin()/user_addess_end(). I will send out a RFC patch. Yu-cheng
On 6/14/19 2:34 PM, Yu-cheng Yu wrote: > On Fri, 2019-06-14 at 13:57 -0700, Dave Hansen wrote: >>> I have a related question: >>> >>> Do we allow the application to read the bitmap, or any fault from the >>> application on bitmap pages? >> >> We have to allow apps to read it. Otherwise they can't execute >> instructions. > > What I meant was, if an app executes some legacy code that results in bitmap > lookup, but the bitmap page is not yet populated, and if we then populate that > page with all-zero, a #CP should follow. So do we even populate that zero page > at all? > > I think we should; a #CP is more obvious to the user at least. Please make an effort to un-Intel-ificate your messages as much as possible. I'd really prefer that folks say "missing end branch fault" rather than #CP. I had to Google "#CP". I *think* you are saying that: The *only* lookups to this bitmap are on "missing end branch" conditions. Normal, proper-functioning code execution that has ENDBR instructions in it will never even look at the bitmap. The only case when we reference the bitmap locations is when the processor is about do do a "missing end branch fault" so that it can be suppressed. Any population with the zero page would be done when code had already encountered a "missing end branch" condition, and populating with a zero-filled page will guarantee that a "missing end branch fault" will result. You're arguing that we should just figure this out at fault time and not ever reach the "missing end branch fault" at all. Is that right? If so, that's an architecture subtlety that I missed until now and which went entirely unmentioned in the changelog and discussion up to this point. Let's make sure that nobody else has to walk that path by improving our changelog, please. In any case, I don't think this is worth special-casing our zero-fill code, FWIW. It's not performance critical and not worth the complexity. If apps want to handle the signals and abuse this to fill space up with boring page table contents, they're welcome to. There are much easier ways to consume a lot of memory. >> We don't have to allow them to (popuating) fault on it. But, if we >> don't, we need some kind of kernel interface to avoid the faults. > > The plan is: > > * Move STACK_TOP (and vdso) down to give space to the bitmap. Even for apps with 57-bit address spaces? > * Reserve the bitmap space from (mm->start_stack + PAGE_SIZE) to cover a code > size of TASK_SIZE_LOW, which is (TASK_SIZE_LOW / PAGE_SIZE / 8). The bitmap size is determined by CR4.LA57, not the app. If you place the bitmap here, won't references to it for high addresses go into the high address space? Specifically, on a CR4.LA57=0 system, we have 48 bits of address space, so 128TB for apps. You are proposing sticking the bitmap above the stack which is near the top of that 128TB address space. But on a 5-level paging system with CR4.LA57=1, there could be valid data at 129GB. Is there something keeping that data from being mistaken for being part of the bitmap? Also, if you're limiting it to TASK_SIZE_LOW, please don't forget that this is yet another thing that probably won't work with the vsyscall page. Please make sure you consider it and mention it in your next post. > * Mmap the space only when the app issues the first mark-legacy prctl. This > avoids the core-dump issue for most apps and the accounting problem that > MAP_NORESERVE probably won't solve completely. What is this accounting problem?
> On Jun 14, 2019, at 3:06 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 6/14/19 2:34 PM, Yu-cheng Yu wrote: >> On Fri, 2019-06-14 at 13:57 -0700, Dave Hansen wrote: >>>> I have a related question: >>>> >>>> Do we allow the application to read the bitmap, or any fault from the >>>> application on bitmap pages? >>> >>> We have to allow apps to read it. Otherwise they can't execute >>> instructions. >> >> What I meant was, if an app executes some legacy code that results in bitmap >> lookup, but the bitmap page is not yet populated, and if we then populate that >> page with all-zero, a #CP should follow. So do we even populate that zero page >> at all? >> >> I think we should; a #CP is more obvious to the user at least. > > Please make an effort to un-Intel-ificate your messages as much as > possible. I'd really prefer that folks say "missing end branch fault" > rather than #CP. I had to Google "#CP". > > I *think* you are saying that: The *only* lookups to this bitmap are on > "missing end branch" conditions. Normal, proper-functioning code > execution that has ENDBR instructions in it will never even look at the > bitmap. The only case when we reference the bitmap locations is when > the processor is about do do a "missing end branch fault" so that it can > be suppressed. Any population with the zero page would be done when > code had already encountered a "missing end branch" condition, and > populating with a zero-filled page will guarantee that a "missing end > branch fault" will result. You're arguing that we should just figure > this out at fault time and not ever reach the "missing end branch fault" > at all. > > Is that right? > > If so, that's an architecture subtlety that I missed until now and which > went entirely unmentioned in the changelog and discussion up to this > point. Let's make sure that nobody else has to walk that path by > improving our changelog, please. > > In any case, I don't think this is worth special-casing our zero-fill > code, FWIW. It's not performance critical and not worth the complexity. > If apps want to handle the signals and abuse this to fill space up with > boring page table contents, they're welcome to. There are much easier > ways to consume a lot of memory. Isn’t it a special case either way? Either we look at CR2 and populate a page, or we look at CR2 and the “tracker” state and send a different signal. Admittedly the former is very common in the kernel. > >>> We don't have to allow them to (popuating) fault on it. But, if we >>> don't, we need some kind of kernel interface to avoid the faults. >> >> The plan is: >> >> * Move STACK_TOP (and vdso) down to give space to the bitmap. > > Even for apps with 57-bit address spaces? > >> * Reserve the bitmap space from (mm->start_stack + PAGE_SIZE) to cover a code >> size of TASK_SIZE_LOW, which is (TASK_SIZE_LOW / PAGE_SIZE / 8). > > The bitmap size is determined by CR4.LA57, not the app. If you place > the bitmap here, won't references to it for high addresses go into the > high address space? > > Specifically, on a CR4.LA57=0 system, we have 48 bits of address space, > so 128TB for apps. You are proposing sticking the bitmap above the > stack which is near the top of that 128TB address space. But on a > 5-level paging system with CR4.LA57=1, there could be valid data at > 129GB. Is there something keeping that data from being mistaken for > being part of the bitmap? > I think we need to make the vma be full sized — it should cover the entire range that the CPU might access. If that means it spans the 48-bit boundary, so be it. > Also, if you're limiting it to TASK_SIZE_LOW, please don't forget that > this is yet another thing that probably won't work with the vsyscall > page. Please make sure you consider it and mention it in your next post. Why not? The vsyscall page is at a negative address. > >> * Mmap the space only when the app issues the first mark-legacy prctl. This >> avoids the core-dump issue for most apps and the accounting problem that >> MAP_NORESERVE probably won't solve What happens if there’s another VMA there by the time you map it?
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h index 89330e4159a9..9e613a6598c9 100644 --- a/arch/x86/include/asm/cet.h +++ b/arch/x86/include/asm/cet.h @@ -31,6 +31,7 @@ void cet_disable_free_shstk(struct task_struct *p); int cet_restore_signal(bool ia32, struct sc_ext *sc); int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc); int cet_setup_ibt(void); +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size); void cet_disable_ibt(void); #else static inline int prctl_cet(int option, unsigned long arg2) { return -EINVAL; } diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c index 14ad25b8ff21..e0ef996d3148 100644 --- a/arch/x86/kernel/cet.c +++ b/arch/x86/kernel/cet.c @@ -22,6 +22,7 @@ #include <asm/fpu/types.h> #include <asm/cet.h> #include <asm/special_insns.h> +#include <asm/elf.h> #include <uapi/asm/sigcontext.h> static int set_shstk_ptr(unsigned long addr) @@ -361,3 +362,28 @@ void cet_disable_ibt(void) current->thread.cet.ibt_enabled = 0; } + +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size) +{ + u64 r; + + if (!current->thread.cet.ibt_enabled) + return -EINVAL; + + if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX)) + return -EINVAL; + + current->thread.cet.ibt_bitmap_addr = bitmap; + current->thread.cet.ibt_bitmap_size = size; + + /* + * Turn on IBT legacy bitmap. + */ + modify_fpu_regs_begin(); + rdmsrl(MSR_IA32_U_CET, r); + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap); + wrmsrl(MSR_IA32_U_CET, r); + modify_fpu_regs_end(); + + return 0; +}
Indirect Branch Tracking (IBT) provides an optional legacy code bitmap that allows execution of legacy, non-IBT compatible library by an IBT-enabled application. When set, each bit in the bitmap indicates one page of legacy code. The bitmap is allocated and setup from the application. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> --- arch/x86/include/asm/cet.h | 1 + arch/x86/kernel/cet.c | 26 ++++++++++++++++++++++++++ 2 files changed, 27 insertions(+)