diff mbox series

[10/17] prmem: documentation

Message ID 20181023213504.28905-11-igor.stoppa@huawei.com (mailing list archive)
State New, archived
Headers show
Series prmem: protected memory | expand

Commit Message

Igor Stoppa Oct. 23, 2018, 9:34 p.m. UTC
Documentation for protected memory.

Topics covered:
* static memory allocation
* dynamic memory allocation
* write-rare

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Randy Dunlap <rdunlap@infradead.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
 MAINTAINERS                      |   1 +
 3 files changed, 174 insertions(+)
 create mode 100644 Documentation/core-api/prmem.rst

Comments

Randy Dunlap Oct. 24, 2018, 3:48 a.m. UTC | #1
Hi,

On 10/23/18 2:34 PM, Igor Stoppa wrote:
> Documentation for protected memory.
> 
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Randy Dunlap <rdunlap@infradead.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
>  MAINTAINERS                      |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/core-api/prmem.rst


> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.

                                               transience.

> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.
> +
> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause

no comma.

> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after
> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at
> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition to a less
> +   permissive write status the memory currently associated to the pool.
> +   Once memory has become read-only, it the only valid operation, beside
> +   reading, is to released it, by destroying the pool it belongs to.
> +
> +
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.
> +        - The functionality is provided as tradeoff between hardening and speed.
> +        - Its usefulness depends on the specific use case at hand

end above sentence with a period, please, like all of the others above it.

> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.

Please try to keep the line above and a few below to < 80 characters in length.
(because some of us read rst files as text files, with a text editor, and line
wrap is ugly)

> +
> +- **Protecting the pool**
> +   This is achieved with **pmalloc_protect_pool()**
> +    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> +    - Any residual space still available from the current vmap_area is lost, as the area is protected.
> +    - **protecting a pool after every allocation will likely be very wasteful**
> +    - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> +   This is achieved with **pmalloc_make_pool_ro()**
> +    - it turns the present content of a write-rare pool into read-only
> +    - can be useful when the content of the memory has settled
> +
> +
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> +  system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> +  limited, on 32-bit systems. However it shouldn't be an issue, since not
> +  much data is expected to be dynamically allocated and turned into
> +  write-protected.
> +
> +- Regarding SMP systems, changing state of pages and altering mappings
> +  requires performing cross-processor synchronizations of page tables.
> +  This is an additional reason for limiting the use of write rare.
> +
> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.
> +
> +- The users of rare write must take care of ensuring the atomicity of the

s/rare write/write rare/ ?

> +  action, respect to the way they use the data being altered; for example,

  This ..   "respect to the way" is awkward, but I don't know what to
change it to.

> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation

It seemed to me that "write-rare" (or write rare) was the going name, but now
it's being called "rare write" (or rare-write).  Just be consistent, please.


> +  that can incur into races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> +  tlb entries. It still does a better job of it, compared to invoking

     TLB

> +  vmalloc for each allocation, but it is undeniably less optimized wrt to

s/wrt/with respect to/

> +  TLB use than using the physmap directly, through kmalloc or similar.
> +
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h


Thanks for the documentation.
Igor Stoppa Oct. 24, 2018, 2:30 p.m. UTC | #2
Hi,

On 24/10/18 06:48, Randy Dunlap wrote:
> Hi,
> 
> On 10/23/18 2:34 PM, Igor Stoppa wrote:

[...]

>> +- The present document doesn't address such transient.
> 
>                                                 transience.

ok

[...]

>> +   are attempted after the write protection is in place, will cause
> 
> no comma.

ok

[...]

>> +        - Its usefulness depends on the specific use case at hand
> 
> end above sentence with a period, please, like all of the others above it.

ok


>> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
> 
> Please try to keep the line above and a few below to < 80 characters in length.
> (because some of us read rst files as text files, with a text editor, and line
> wrap is ugly)

ok, I still have to master .rst :-(

[...]

>> +- The users of rare write must take care of ensuring the atomicity of the
> 
> s/rare write/write rare/ ?

thanks

>> +  action, respect to the way they use the data being altered; for example,
> 
>    This ..   "respect to the way" is awkward, but I don't know what to
> change it to.
> 
>> +  take a lock before making a copy of the value to modify (if it's
>> +  relevant), then alter it, issue the call to rare write and finally
>> +  release the lock. Some special scenario might be exempt from the need
>> +  for locking, but in general rare-write must be treated as an operation
> 
> It seemed to me that "write-rare" (or write rare) was the going name, but now
> it's being called "rare write" (or rare-write).  Just be consistent, please.


write-rare it is, because it can be shortened as wr_xxx

rare_write becomes rw_xxx

which wrongly hints at read/write, which it definitely is not

>> +  tlb entries. It still does a better job of it, compared to invoking
> 
>       TLB

ok

>> +  vmalloc for each allocation, but it is undeniably less optimized wrt to
> 
> s/wrt/with respect to/

yes

> Thanks for the documentation.

thanks for the review :-)

--
igor
Mike Rapoport Oct. 24, 2018, 11:04 p.m. UTC | #3
Hi Igor,

On Wed, Oct 24, 2018 at 12:34:57AM +0300, Igor Stoppa wrote:
> Documentation for protected memory.
> 
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Randy Dunlap <rdunlap@infradead.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++

Thanks for having docs a part of the patchset!

>  MAINTAINERS                      |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/core-api/prmem.rst
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
> 
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.
> +

I feel that foreword should include a sentence or two saying why we need
the memory protection and when it can/should be used.

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after

                                                         no comma ^

> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at

                    __wr_aFter_init

> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition to a less
> +   permissive write status the memory currently associated to the pool.
> +   Once memory has become read-only, it the only valid operation, beside

... become read-only, the only valid operation

> +   reading, is to released it, by destroying the pool it belongs to.
> +
> +
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*

For me this part is completely blind. Maybe arranging this as a table would
make the states more clearly visible.

> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.
> +        - The functionality is provided as tradeoff between hardening and speed.
> +        - Its usefulness depends on the specific use case at hand
> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
> +
> +- **Protecting the pool**
> +   This is achieved with **pmalloc_protect_pool()**
> +    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> +    - Any residual space still available from the current vmap_area is lost, as the area is protected.
> +    - **protecting a pool after every allocation will likely be very wasteful**
> +    - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> +   This is achieved with **pmalloc_make_pool_ro()**
> +    - it turns the present content of a write-rare pool into read-only
> +    - can be useful when the content of the memory has settled
> +
> +
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> +  system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> +  limited, on 32-bit systems. However it shouldn't be an issue, since not

   no comma ^

> +  much data is expected to be dynamically allocated and turned into
> +  write-protected.
> +
> +- Regarding SMP systems, changing state of pages and altering mappings
> +  requires performing cross-processor synchronizations of page tables.
> +  This is an additional reason for limiting the use of write rare.
> +
> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.
> +
> +- The users of rare write must take care of ensuring the atomicity of the
> +  action, respect to the way they use the data being altered; for example,
> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation
> +  that can incur into races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> +  tlb entries. It still does a better job of it, compared to invoking
> +  vmalloc for each allocation, but it is undeniably less optimized wrt to
> +  TLB use than using the physmap directly, through kmalloc or similar.
> +
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea979a5a9ec9..246b1a1cc8bb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9463,6 +9463,7 @@ F:	include/linux/prmemextra.h
>  F:	mm/prmem.c
>  F:	mm/test_write_rare.c
>  F:	mm/test_pmalloc.c
> +F:	Documentation/core-api/prmem.rst

I think the MAINTAINERS update can go in one chunk as the last patch in the
series.
 
>  MEMORY MANAGEMENT
>  L:	linux-mm@kvack.org
> -- 
> 2.17.1
>
Peter Zijlstra Oct. 26, 2018, 9:26 a.m. UTC | #4
Jon,

So the below document is a prime example for why I think RST sucks. As a
text document readability is greatly diminished by all the markup
nonsense.

This stuff should not become write-only content like html and other
gunk. The actual text file is still the primary means of reading this.

> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
>  
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.

What does this even mean?

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.

Again; what does this really want to say?

> +- The protection has page-level granularity.

I don't think Linux supports non-paging MMUs.

> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**

WTH is with all those ** ?

> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after
> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at
> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition to a less
> +   permissive write status the memory currently associated to the pool.
> +   Once memory has become read-only, it the only valid operation, beside
> +   reading, is to released it, by destroying the pool it belongs to.

Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
and whitespace are more good.

Also, I really don't like how you differentiate between static and
dynamic wr.

> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*

That's just unreadable gibberish from here. Also what?

We already have RO, why do you need more RO?

> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.

Surely that is a problem?

> +        - The functionality is provided as tradeoff between hardening and speed.

Which you fail to explain.

> +        - Its usefulness depends on the specific use case at hand

How about you write sensible text inside the option descriptions
instead?

This is not a presentation; less bullets, more content.

> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.

I still don't really understand the whole write-rare thing; how does it
really help? If we can write in kernel memory, we can write to
page-tables too.

And I don't think this document even begins to explain _why_ you're
doing any of this. How does it help?

> +- The users of rare write must take care of ensuring the atomicity of the
> +  action, respect to the way they use the data being altered; for example,
> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation
> +  that can incur into races.

What?!
Matthew Wilcox Oct. 26, 2018, 10:20 a.m. UTC | #5
On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote:
> Jon,
> 
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.
> 
> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.

I think Igor neglected to read doc-guide/sphinx.rst:

Specific guidelines for the kernel documentation
------------------------------------------------

Here are some specific guidelines for the kernel documentation:

* Please don't go overboard with reStructuredText markup. Keep it
  simple. For the most part the documentation should be plain text with
  just enough consistency in formatting that it can be converted to
  other formats.

I agree that it's overly marked up.
Kees Cook Oct. 26, 2018, 10:46 a.m. UTC | #6
On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.

I can speak to the general goal. The specific implementation here may
not fit all the needs, but it's a good starting point for discussion.

One aspect of hardening the kernel against attack is reducing the
internal attack surface. Not all flaws are created equal, so there is
variation in what limitations an attacker may have when exploiting
flaws (not many flaws end up being a fully controlled "write anything,
anywhere, at any time"). By making the more sensitive data structures
of the kernel read-only, we reduce the risk of an attacker finding a
path to manipulating the kernel's behavior in a significant way.

Examples of typical sensitive targets are function pointers, security
policy, and page tables. Having these "read only at rest" makes them
much harder to control by an attacker using memory integrity flaws. We
already have the .rodata section for "const" things. Those are
trivially read-only. For example there is a long history of making
sure that function pointer tables are marked "const". However, some
things need to stay writable. Some of those in .data aren't changed
after __init, so we added the __ro_after_init which made them
read-only for the life of the system after __init. However, we're
still left with a lot of sensitive structures either in .data or
dynamically allocated, that need a way to be made read-only for most
of the time, but get written to during very specific times.

The "write rarely" name itself may not sufficiently describe what is
wanted either (I'll take the blame for the inaccurate name), so I'm
open to new ideas there. The implementation requirements for the
"sensitive data read-only at rest" feature are rather tricky:

- allow writes only from specific places in the kernel
- keep those locations inline to avoid making them trivial ROP targets
- keep the writeability window open only to a single uninterruptable CPU
- fast enough to deal with page table updates

The proposal I made a while back only covered .data things (and used
x86-specific features). Igor's proposal builds on this by including a
way to do this with dynamic allocation too, which greatly expands the
scope of structures that can be protected. Given that the x86-only
method of write-window creation was firmly rejected, this is a new
proposal for how to do it (vmap window). Using switch_mm() has also
been suggested, etc.

We need to find a good way to do the write-windowing that works well
for static and dynamic structures _and_ for the page tables... this
continues to be tricky.

Making it resilient against ROP-style targets makes it difficult to
deal with certain data structures (like list manipulation). In my
earlier RFC, I tried to provide enough examples of where this could
get used to let people see some of the complexity[1]. Igor's series
expands this to even more examples using dynamic allocation.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely
Markus Heiser Oct. 26, 2018, 11:09 a.m. UTC | #7
Am 26.10.18 um 11:26 schrieb Peter Zijlstra:
 >> Jon,
 >>
 >> So the below document is a prime example for why I think RST sucks. As a
 >> text document readability is greatly diminished by all the markup
 >> nonsense.
 >>
 >> This stuff should not become write-only content like html and other
 >> gunk. The actual text file is still the primary means of reading this.
 >
 > I think Igor neglected to read doc-guide/sphinx.rst:
 >
 > Specific guidelines for the kernel documentation
 > ------------------------------------------------
 >
 > Here are some specific guidelines for the kernel documentation:
 >
 > * Please don't go overboard with reStructuredText markup. Keep it
 >    simple. For the most part the documentation should be plain text with
 >    just enough consistency in formatting that it can be converted to
 >    other formats.
 >
 > I agree that it's overly marked up.

To add my two cent ..

 > WTH is with all those ** ?

I guess Igor was looking for a definition list ...

 >> +Available protections for kernel data
 >> +-------------------------------------
 >> +
 >> +- **constant**
 >> +   Labelled as **const**, the data is never supposed to be altered.
 >> +   It is statically allocated - if it has any memory footprint at all.
 >> +   The compiler can even optimize it away, where possible, by replacing
 >> +   references to a **const** with its actual value.


+Available protections for kernel data
+-------------------------------------
+
+constant
+   Labelled as const, the data is never supposed to be altered.
+   It is statically allocated - if it has any memory footprint at all.
+   The compiler can even optimize it away, where possible, by replacing
+   references to a const with its actual value.

see "Lists and Quote-like blocks" [1]

[1] 
http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks

-- Markus --
Jonathan Corbet Oct. 26, 2018, 3:05 p.m. UTC | #8
On Fri, 26 Oct 2018 11:26:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Jon,
> 
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.

Please don't confused RST with the uses of it.  The fact that one can do
amusing things with the C preprocessor doesn't, on its own, make C
suck... 
> 
> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.

I agree that the text file comes first, and I agree that the markup in
this particular document is excessive.

Igor, just like #ifdef's make code hard to read, going overboard with RST
markup makes text hard to read.  It's a common trap to fall into, because
it lets you make slick-looking web pages, but that is not our primary
goal here.  Many thanks for writing documentation for this feature, that
already puts you way ahead of a discouraging number of contributors.  But
could I ask you, please, to make a pass over it and reduce the markup to
a minimum?  Using lists as suggested by Markus would help here.

Thanks,

jon
Peter Zijlstra Oct. 28, 2018, 6:31 p.m. UTC | #9
On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > I still don't really understand the whole write-rare thing; how does it
> > really help? If we can write in kernel memory, we can write to
> > page-tables too.

> One aspect of hardening the kernel against attack is reducing the
> internal attack surface. Not all flaws are created equal, so there is
> variation in what limitations an attacker may have when exploiting
> flaws (not many flaws end up being a fully controlled "write anything,
> anywhere, at any time"). By making the more sensitive data structures
> of the kernel read-only, we reduce the risk of an attacker finding a
> path to manipulating the kernel's behavior in a significant way.
> 
> Examples of typical sensitive targets are function pointers, security
> policy, and page tables. Having these "read only at rest" makes them
> much harder to control by an attacker using memory integrity flaws.

Because 'write-anywhere' exploits are easier than (and the typical first
step to) arbitrary code execution thingies?

> The "write rarely" name itself may not sufficiently describe what is
> wanted either (I'll take the blame for the inaccurate name), so I'm
> open to new ideas there. The implementation requirements for the
> "sensitive data read-only at rest" feature are rather tricky:
> 
> - allow writes only from specific places in the kernel
> - keep those locations inline to avoid making them trivial ROP targets
> - keep the writeability window open only to a single uninterruptable CPU

The current patch set does not achieve that because it uses a global
address space for the alias mapping (vmap) which is equally accessible
from all CPUs.

> - fast enough to deal with page table updates

The proposed implementation needs page-tables for the alias; I don't see
how you could ever do R/O page-tables when you need page-tables to
modify your page-tables.

And this is entirely irrespective of performance.

> The proposal I made a while back only covered .data things (and used
> x86-specific features).

Oh, right, that CR0.WP stuff.

> Igor's proposal builds on this by including a
> way to do this with dynamic allocation too, which greatly expands the
> scope of structures that can be protected. Given that the x86-only
> method of write-window creation was firmly rejected, this is a new
> proposal for how to do it (vmap window). Using switch_mm() has also
> been suggested, etc.

Right... /me goes find the patches we did for text_poke. Hmm, those
never seem to have made it:

  https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com

like that. That approach will in fact work and not be a completely
broken mess like this thing.

> We need to find a good way to do the write-windowing that works well
> for static and dynamic structures _and_ for the page tables... this
> continues to be tricky.
> 
> Making it resilient against ROP-style targets makes it difficult to
> deal with certain data structures (like list manipulation). In my
> earlier RFC, I tried to provide enough examples of where this could
> get used to let people see some of the complexity[1]. Igor's series
> expands this to even more examples using dynamic allocation.

Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
fast enough for much of anything.

And I don't suppose we can take the WP fault and then fix up from there,
because if we're doing R/O page-tables, that'll incrase the fault depth
and we'll double fault all the time, and tripple fault where we
currently double fault. And we all know how _awesome_ tripple faults
are.

But duplicating (and wrapping in gunk) whole APIs is just not going to
work.
Igor Stoppa Oct. 29, 2018, 7:05 p.m. UTC | #10
Hello,

On 25/10/2018 00:04, Mike Rapoport wrote:

> I feel that foreword should include a sentence or two saying why we need
> the memory protection and when it can/should be used.

I was somewhat lost about the actual content of this sort of document.

In past reviews of older version of the docs, I go the impression that 
this should focus more on the use of the API and how to use it.

And it seem that different people expect to find here different type of 
information.

What is the best place where to put a more extensive description of the 
feature?

--
igor
Igor Stoppa Oct. 29, 2018, 7:28 p.m. UTC | #11
On 26/10/2018 11:20, Matthew Wilcox wrote:
> On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote:
>> Jon,
>>
>> So the below document is a prime example for why I think RST sucks. As a
>> text document readability is greatly diminished by all the markup
>> nonsense.
>>
>> This stuff should not become write-only content like html and other
>> gunk. The actual text file is still the primary means of reading this.
> 
> I think Igor neglected to read doc-guide/sphinx.rst:

Guilty as charged :-/

> Specific guidelines for the kernel documentation
> ------------------------------------------------
> 
> Here are some specific guidelines for the kernel documentation:
> 
> * Please don't go overboard with reStructuredText markup. Keep it
>    simple. For the most part the documentation should be plain text with
>    just enough consistency in formatting that it can be converted to
>    other formats.

> I agree that it's overly marked up.

I'll fix it

--
igor
Igor Stoppa Oct. 29, 2018, 7:35 p.m. UTC | #12
On 26/10/2018 12:09, Markus Heiser wrote:

> I guess Igor was looking for a definition list ...

It was meant to be bold.
Even after reading the guideline "keep it simple", what exactly is 
simple can be subjective.

If certain rst constructs are not acceptable and can be detected 
unequivocally, maybe it would help to teach them to checkpatch.pl

[...]

> see "Lists and Quote-like blocks" [1]
> 
> [1] 
> http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks 

thank you

--
igor
Igor Stoppa Oct. 29, 2018, 7:38 p.m. UTC | #13
On 26/10/2018 16:05, Jonathan Corbet wrote:
> But
> could I ask you, please, to make a pass over it and reduce the markup to
> a minimum?  Using lists as suggested by Markus would help here.

sure, it's even easier for me to maintain the doc :-)
as i just wrote in a related reply, it might be worth having checkpatch 
to detect formatting constructs that are not welcome

--
igor
Igor Stoppa Oct. 29, 2018, 8:35 p.m. UTC | #14
On 26/10/2018 10:26, Peter Zijlstra wrote:

>> +- Kernel code is protected at system level and, unlike data, it doesn't
>> +  require special attention.
> 
> What does this even mean?

I was trying to convey the notion that the pages containing kernel code 
do not require any special handling by the author of a generic kernel 
component, for example a kernel driver.

Pages containing either statically or dynamically allocated data, 
instead, are not automatically protected.

But yes, the sentence is far from being clear.

> 
>> +Protection mechanism
>> +--------------------
>> +
>> +- When available, the MMU can write protect memory pages that would be
>> +  otherwise writable.
> 
> Again; what does this really want to say?

That it is possible to use the MMU also for write-protecting pages 
containing data which was not declared as constant.
> 
>> +- The protection has page-level granularity.
> 
> I don't think Linux supports non-paging MMUs.

This probably came from a model I had in mind where a separate execution 
environment, such as an hypervisor, could trap and filter writes to a 
certain parts of a page, rejecting some and performing others, 
effectively emulating sub-page granularity.

> 
>> +- An attempt to overwrite a protected page will trigger an exception.
>> +- **Write protected data must go exclusively to write protected pages**
>> +- **Writable data must go exclusively to writable pages**
> 
> WTH is with all those ** ?

[...]

> Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
> and whitespace are more good.

Yes

> Also, I really don't like how you differentiate between static and
> dynamic wr.

ok, but why? what would you suggest, instead?

[...]

> We already have RO, why do you need more RO?

I can explain, but I'm at loss of what is the best place: I was under 
the impression that this sort of document should focus mostly on the API 
and its use. I was even considering removing most of the explanation and 
instead put it in a separate document.

> 
>> +
>> +   **Remarks:**
>> +    - The "AUTO" modes perform automatic protection of the content, whenever
>> +       the current vmap_area is used up and a new one is allocated.
>> +        - At that point, the vmap_area being phased out is protected.
>> +        - The size of the vmap_area depends on various parameters.
>> +        - It might not be possible to know for sure *when* certain data will
>> +          be protected.
> 
> Surely that is a problem?
> 
>> +        - The functionality is provided as tradeoff between hardening and speed.
> 
> Which you fail to explain.
> 
>> +        - Its usefulness depends on the specific use case at hand
> 
> How about you write sensible text inside the option descriptions
> instead?
> 
> This is not a presentation; less bullets, more content.

I tried to say something, without saying too much, but it was clearly a 
bad choice.

Where should i put a thoroughly detailed explanation?
Here or in a separate document?

> 
>> +- Not only the pmalloc memory must be protected, but also any reference to
>> +  it that might become the target for an attack. The attack would replace
>> +  a reference to the protected memory with a reference to some other,
>> +  unprotected, memory.
> 
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.

It has already been answered by Kees, but I'll provide also my answer:

the exploits used to write in kernel memory are specific to certain 
products and SW builds, so it's not possible to generalize too much, 
however there might be some limitation in the reach of a specific 
vulnerability. For example, if a memory location is referred to as 
offset from a base address, the maximum size of the offset limits the 
scope of the attack. Which might make it impossible to use that specific 
vulnerability for writing directly to the page table. But something 
else, say a statically allocated variable, might be within reach.

said this, there is also the almost orthogonal use case of providing 
increased robustness by trapping accidental modifications, caused by bugs.

> And I don't think this document even begins to explain _why_ you're
> doing any of this. How does it help?

ok, point taken

>> +- The users of rare write must take care of ensuring the atomicity of the
>> +  action, respect to the way they use the data being altered; for example,
>> +  take a lock before making a copy of the value to modify (if it's
>> +  relevant), then alter it, issue the call to rare write and finally
>> +  release the lock. Some special scenario might be exempt from the need
>> +  for locking, but in general rare-write must be treated as an operation
>> +  that can incur into races.
> 
> What?!

Probably something along the lines of:

"users of write-rare are responsible of using mechanisms that allow 
reading/writing data in a consistent way"

and if it seems obvious, I can just drop it.

One of the problems I have faced is to decide what level of knowledge or 
understanding I should expect from the reader of such document: what I 
can take for granted and what I should explain.

--
igor
Igor Stoppa Oct. 29, 2018, 9:04 p.m. UTC | #15
On 28/10/2018 18:31, Peter Zijlstra wrote:
> On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
>> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> I still don't really understand the whole write-rare thing; how does it
>>> really help? If we can write in kernel memory, we can write to
>>> page-tables too.
> 
>> One aspect of hardening the kernel against attack is reducing the
>> internal attack surface. Not all flaws are created equal, so there is
>> variation in what limitations an attacker may have when exploiting
>> flaws (not many flaws end up being a fully controlled "write anything,
>> anywhere, at any time"). By making the more sensitive data structures
>> of the kernel read-only, we reduce the risk of an attacker finding a
>> path to manipulating the kernel's behavior in a significant way.
>>
>> Examples of typical sensitive targets are function pointers, security
>> policy, and page tables. Having these "read only at rest" makes them
>> much harder to control by an attacker using memory integrity flaws.
> 
> Because 'write-anywhere' exploits are easier than (and the typical first
> step to) arbitrary code execution thingies?
> 
>> The "write rarely" name itself may not sufficiently describe what is
>> wanted either (I'll take the blame for the inaccurate name), so I'm
>> open to new ideas there. The implementation requirements for the
>> "sensitive data read-only at rest" feature are rather tricky:
>>
>> - allow writes only from specific places in the kernel
>> - keep those locations inline to avoid making them trivial ROP targets
>> - keep the writeability window open only to a single uninterruptable CPU
> 
> The current patch set does not achieve that because it uses a global
> address space for the alias mapping (vmap) which is equally accessible
> from all CPUs.

I never claimed to achieve 100% resilience to attacks.
While it's true that the address space is accessible to all the CPUs, it 
is also true that the access has a limited duration in time, and the 
location where the access can be performed is not fixed.

Iow, assuming that the CPUs not involved in the write-rare operations 
are compromised and that they are trying to perform a concurrent access 
to the data in the writable page, they have a limited window of opportunity.

Said this, this I have posted is just a tentative implementation.
My primary intent was to at least give an idea of what I'd like to do: 
alter some data in a way that is not easily exploitable.

>> - fast enough to deal with page table updates
> 
> The proposed implementation needs page-tables for the alias; I don't see
> how you could ever do R/O page-tables when you need page-tables to
> modify your page-tables.

It's not all-or-nothing.
I hope we agree at least on the reasoning that having only a limited 
amount of address space directly attackable, instead of the whole set of 
pages containing exploitable data, is  reducing the attack surface.

Furthermore, if we think about possible limitations that the attack 
might have (maximum reach), the level of protection might be even 
higher. I have to use "might" because I cannot foresee the vulnerability.

Furthermore, taking a different angle: your average attacker is not 
necessarily very social and inclined to share the vulnerability found.
It is safe to assume that in most cases each attacker has to identify 
the attack strategy autonomously.
Reducing the amount of individual who can perform an attack, by 
increasing the expertise required is also a way of doing damage control.

> And this is entirely irrespective of performance.

I have not completely given up on performance, but, being write-rare, I 
see improved performance as just a way of widening the range of possible 
recipients for the hardening.

>> The proposal I made a while back only covered .data things (and used
>> x86-specific features).
> 
> Oh, right, that CR0.WP stuff.
> 
>> Igor's proposal builds on this by including a
>> way to do this with dynamic allocation too, which greatly expands the
>> scope of structures that can be protected. Given that the x86-only
>> method of write-window creation was firmly rejected, this is a new
>> proposal for how to do it (vmap window). Using switch_mm() has also
>> been suggested, etc.
> 
> Right... /me goes find the patches we did for text_poke. Hmm, those
> never seem to have made it:
> 
>    https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> 
> like that. That approach will in fact work and not be a completely
> broken mess like this thing.

That approach is x86 specific.
I preferred to start with something that would work on a broader set of 
architectures, even if with less resilience.

It can be done so that individual architectures can implement their own 
specific way and obtain better results.

>> We need to find a good way to do the write-windowing that works well
>> for static and dynamic structures _and_ for the page tables... this
>> continues to be tricky.
>>
>> Making it resilient against ROP-style targets makes it difficult to
>> deal with certain data structures (like list manipulation). In my
>> earlier RFC, I tried to provide enough examples of where this could
>> get used to let people see some of the complexity[1]. Igor's series
>> expands this to even more examples using dynamic allocation.
> 
> Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> fast enough for much of anything.

Part of the reason for the duplication is that, even if the wr_API I 
came up with, can be collapsed with the regular API, the implementation 
needs to be different, to be faster.

Ex:
* force the list_head structure to be aligned so that it will always be 
fully contained by a single page
* create alternate mapping for all the pages involved (max 1 page for 
each list_head)
* do the lsit operation on the remapped list_heads
* destroy all the mappings

I can try to introduce some wrapper similar to kmap_atomic(), as 
suggested by Dave Hansen, which can improve the coding, but it will not 
change the actual set of operations performed.


> And I don't suppose we can take the WP fault and then fix up from there,
> because if we're doing R/O page-tables, that'll incrase the fault depth
> and we'll double fault all the time, and tripple fault where we
> currently double fault. And we all know how _awesome_ tripple faults
> are.
> 
> But duplicating (and wrapping in gunk) whole APIs is just not going to
> work.

Would something like kmap_atomic() be acceptable?
Do you have some better proposal, now that (I hope) it should be more 
clear what I'm trying to do and why?

--
igor
Peter Zijlstra Oct. 30, 2018, 3:26 p.m. UTC | #16
On Mon, Oct 29, 2018 at 11:04:01PM +0200, Igor Stoppa wrote:
> On 28/10/2018 18:31, Peter Zijlstra wrote:
> > On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> > > On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I still don't really understand the whole write-rare thing; how does it
> > > > really help? If we can write in kernel memory, we can write to
> > > > page-tables too.
> > 
> > > One aspect of hardening the kernel against attack is reducing the
> > > internal attack surface. Not all flaws are created equal, so there is
> > > variation in what limitations an attacker may have when exploiting
> > > flaws (not many flaws end up being a fully controlled "write anything,
> > > anywhere, at any time"). By making the more sensitive data structures
> > > of the kernel read-only, we reduce the risk of an attacker finding a
> > > path to manipulating the kernel's behavior in a significant way.
> > > 
> > > Examples of typical sensitive targets are function pointers, security
> > > policy, and page tables. Having these "read only at rest" makes them
> > > much harder to control by an attacker using memory integrity flaws.
> > 
> > Because 'write-anywhere' exploits are easier than (and the typical first
> > step to) arbitrary code execution thingies?
> > 
> > > The "write rarely" name itself may not sufficiently describe what is
> > > wanted either (I'll take the blame for the inaccurate name), so I'm
> > > open to new ideas there. The implementation requirements for the
> > > "sensitive data read-only at rest" feature are rather tricky:
> > > 
> > > - allow writes only from specific places in the kernel
> > > - keep those locations inline to avoid making them trivial ROP targets
> > > - keep the writeability window open only to a single uninterruptable CPU
> > 
> > The current patch set does not achieve that because it uses a global
> > address space for the alias mapping (vmap) which is equally accessible
> > from all CPUs.
> 
> I never claimed to achieve 100% resilience to attacks.

You never even begun explaining what you were defending against. Let
alone how well it achieved its stated goals.

> While it's true that the address space is accessible to all the CPUs, it is
> also true that the access has a limited duration in time, and the location
> where the access can be performed is not fixed.
> 
> Iow, assuming that the CPUs not involved in the write-rare operations are
> compromised and that they are trying to perform a concurrent access to the
> data in the writable page, they have a limited window of opportunity.

That sounds like security through obscurity. Sure it makes it harder,
but it doesn't stop anything.

> Said this, this I have posted is just a tentative implementation.
> My primary intent was to at least give an idea of what I'd like to do: alter
> some data in a way that is not easily exploitable.

Since you need to modify page-tables in order to achieve this, the
page-tables are also there for the attacker to write to.

> > > - fast enough to deal with page table updates
> > 
> > The proposed implementation needs page-tables for the alias; I don't see
> > how you could ever do R/O page-tables when you need page-tables to
> > modify your page-tables.
> 
> It's not all-or-nothing.

I really am still strugging with what it is this thing is supposed to
do. As it stands, I see very little value. Yes it makes a few things a
little harder, but it doesn't really do away with things.

> I hope we agree at least on the reasoning that having only a limited amount
> of address space directly attackable, instead of the whole set of pages
> containing exploitable data, is  reducing the attack surface.

I do not in fact agree. Most pages are not interesting for an attack at
all. So simply reducing the set of pages you can write to isn't
sufficient.

IOW. removing all noninteresting pages from the writable set will
satisfy your criteria, but it avoids exactly 0 exploits.

What you need to argue is that we remove common exploit targets, and
you've utterly failed to do so.

Kees mentioned: function pointers, page-tables and a few other things.
You mentioned nothing.

> Furthermore, if we think about possible limitations that the attack might
> have (maximum reach), the level of protection might be even higher. I have
> to use "might" because I cannot foresee the vulnerability.

That's just a bunch of words together that don't really say anything
afaict.

> Furthermore, taking a different angle: your average attacker is not
> necessarily very social and inclined to share the vulnerability found.
> It is safe to assume that in most cases each attacker has to identify the
> attack strategy autonomously.
> Reducing the amount of individual who can perform an attack, by increasing
> the expertise required is also a way of doing damage control.

If you make RO all noninteresting pages, you in fact increase the
density of interesting targets and make things easier.

> > And this is entirely irrespective of performance.
> 
> I have not completely given up on performance, but, being write-rare, I see
> improved performance as just a way of widening the range of possible
> recipients for the hardening.

What?! Are you saying you don't care about performance?

> > Right... /me goes find the patches we did for text_poke. Hmm, those
> > never seem to have made it:
> > 
> >    https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> > 
> > like that. That approach will in fact work and not be a completely
> > broken mess like this thing.
> 
> That approach is x86 specific.

It is not. Every architecture does switch_mm() with IRQs disabled, as
that is exactly what the scheduler does.

And keeping a second init_mm around also isn't very architecture
specific I feel.

> > > We need to find a good way to do the write-windowing that works well
> > > for static and dynamic structures _and_ for the page tables... this
> > > continues to be tricky.
> > > 
> > > Making it resilient against ROP-style targets makes it difficult to
> > > deal with certain data structures (like list manipulation). In my
> > > earlier RFC, I tried to provide enough examples of where this could
> > > get used to let people see some of the complexity[1]. Igor's series
> > > expands this to even more examples using dynamic allocation.
> > 
> > Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> > fast enough for much of anything.
> 
> Part of the reason for the duplication is that, even if the wr_API I came up
> with, can be collapsed with the regular API, the implementation needs to be
> different, to be faster.
> 
> Ex:
> * force the list_head structure to be aligned so that it will always be
> fully contained by a single page
> * create alternate mapping for all the pages involved (max 1 page for each
> list_head)
> * do the lsit operation on the remapped list_heads
> * destroy all the mappings

Those constraints are due to single page aliases.

> I can try to introduce some wrapper similar to kmap_atomic(), as suggested
> by Dave Hansen, which can improve the coding, but it will not change the
> actual set of operations performed.

See below, I don't think kmap_atomic() is either correct or workable.

One thing that might be interesting is teaching objtool about the
write-enable write-disable markers and making it guarantee we reach a
disable after every enable. IOW, ensure we never leak this state.

I think that was part of why we hated on that initial thing Kees did --
well that an disabling all WP is of course completely insane ;-).

> > And I don't suppose we can take the WP fault and then fix up from there,
> > because if we're doing R/O page-tables, that'll incrase the fault depth
> > and we'll double fault all the time, and tripple fault where we
> > currently double fault. And we all know how _awesome_ tripple faults
> > are.
> > 
> > But duplicating (and wrapping in gunk) whole APIs is just not going to
> > work.
> 
> Would something like kmap_atomic() be acceptable?

Don't think so. kmap_atomic() on x86_32 (64bit doesn't have it at all)
only does the TLB invalidate on the one CPU, which we've established is
incorrect.

Also, kmap_atomic is still page-table based, which means not all
page-tables can be read-only.

> Do you have some better proposal, now that (I hope) it should be more clear
> what I'm trying to do and why?

You've still not talked about any actual attack vectors and how they're
mitigated by these patches.

I suppose the 'normal' attack goes like:

 1) find buffer-overrun / bound check failure
 2) use that to write to 'interesting' location
 3) that write results arbitrary code execution
 4) win

Of course, if the store of 2 is to the current cred structure, and
simply sets the effective uid to 0, we can skip 3.

Which seems to suggest all cred structures should be made r/o like this.
But I'm not sure I remember these patches doing that.


Also, there is an inverse situation with all this. If you make
everything R/O, then you need this allow-write for everything you do,
which then is about to include a case with an overflow / bound check
fail, and we're back to square 1.

What are you doing to avoid that?
Kees Cook Oct. 30, 2018, 4:37 p.m. UTC | #17
On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I suppose the 'normal' attack goes like:
>
>  1) find buffer-overrun / bound check failure
>  2) use that to write to 'interesting' location
>  3) that write results arbitrary code execution
>  4) win
>
> Of course, if the store of 2 is to the current cred structure, and
> simply sets the effective uid to 0, we can skip 3.

In most cases, yes, gaining root is game over. However, I don't want
to discount other threat models: some systems have been designed not
to trust root, so a cred attack doesn't always get an attacker full
control (e.g. lockdown series, signed modules, encrypted VMs, etc).

> Which seems to suggest all cred structures should be made r/o like this.
> But I'm not sure I remember these patches doing that.

There are things that attempt to protect cred (and other things, like
page tables) via hypervisors (see Samsung KNOX) or via syscall
boundary checking (see Linux Kernel Runtime Guard). They're pretty
interesting, but I'm not sure if there is a clear way forward on it
working in upstream, but that's why I think these discussions are
useful.

> Also, there is an inverse situation with all this. If you make
> everything R/O, then you need this allow-write for everything you do,
> which then is about to include a case with an overflow / bound check
> fail, and we're back to square 1.

Sure -- this is the fine line in trying to build these defenses. The
point is to narrow the scope of attack. Stupid metaphor follows: right
now we have only a couple walls; if we add walls we can focus on make
sure the doors and windows are safe. If we make the relatively
easy-to-find-in-memory page tables read-only-at-rest then a whole
class of very powerful exploits that depend on page table attacks go
away.

As part of all of this is the observation that there are two types of
things clearly worth protecting: that which is updated rarely (no need
to leave it writable for so much of its lifetime), and that which is
especially sensitive (page tables, security policy, function pointers,
etc). Finding a general purpose way to deal with these (like we have
for other data-lifetime cases like const and __ro_after_init) would be
very nice. I don't think there is a slippery slope here.

-Kees
Andy Lutomirski Oct. 30, 2018, 5:06 p.m. UTC | #18
> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> 
>> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>> 
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results arbitrary code execution
>> 4) win
>> 
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
> 
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
> 
>> Which seems to suggest all cred structures should be made r/o like this.
>> But I'm not sure I remember these patches doing that.
> 
> There are things that attempt to protect cred (and other things, like
> page tables) via hypervisors (see Samsung KNOX) or via syscall
> boundary checking (see Linux Kernel Runtime Guard). They're pretty
> interesting, but I'm not sure if there is a clear way forward on it
> working in upstream, but that's why I think these discussions are
> useful.
> 
>> Also, there is an inverse situation with all this. If you make
>> everything R/O, then you need this allow-write for everything you do,
>> which then is about to include a case with an overflow / bound check
>> fail, and we're back to square 1.
> 
> Sure -- this is the fine line in trying to build these defenses. The
> point is to narrow the scope of attack. Stupid metaphor follows: right
> now we have only a couple walls; if we add walls we can focus on make
> sure the doors and windows are safe. If we make the relatively
> easy-to-find-in-memory page tables read-only-at-rest then a whole
> class of very powerful exploits that depend on page table attacks go
> away.
> 
> As part of all of this is the observation that there are two types of
> things clearly worth protecting: that which is updated rarely (no need
> to leave it writable for so much of its lifetime), and that which is
> especially sensitive (page tables, security policy, function pointers,
> etc). Finding a general purpose way to deal with these (like we have
> for other data-lifetime cases like const and __ro_after_init) would be
> very nice. I don't think there is a slippery slope here.
> 
> 

Since I wasn’t cc’d on this series:

I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.

If anyone wants to use CR0.WP instead, I’ll remind them that they have to fix up the entry code and justify the added complexity. And make performance not suck in a VM (i.e. CR0 reads on entry are probably off the table).  None of these will be easy.

If anyone wants to use kmap_atomic-like tricks, I’ll point out that we already have enough problems with dangling TLB entries due to SMP issues. The last thing we need is more of them. If someone proposes a viable solution that doesn’t involve CR3 fiddling, I’ll be surprised.

Keep in mind that switch_mm() is actually decently fast on modern CPUs.  It’s probably considerably faster than writing CR0, although I haven’t benchmarked it. It’s certainly faster than writing CR4.  It’s also faster than INVPCID, surprisingly, which means that it will be quite hard to get better performance using any sort of trickery.

Nadav’s patch set would be an excellent starting point.

P.S. EFI is sort of grandfathered in as a hackish alternate page table hierarchy. We’re not adding another one.
Matthew Wilcox Oct. 30, 2018, 5:58 p.m. UTC | #19
On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> I support the addition of a rare-write mechanism to the upstream kernel.
> And I think that there is only one sane way to implement it: using an
> mm_struct. That mm_struct, just like any sane mm_struct, should only
> differ from init_mm in that it has extra mappings in the *user* region.

I'd like to understand this approach a little better.  In a syscall path,
we run with the user task's mm.  What you're proposing is that when we
want to modify rare data, we switch to rare_mm which contains a
writable mapping to all the kernel data which is rare-write.

So the API might look something like this:

	void *p = rare_alloc(...);	/* writable pointer */
	p->a = x;
	q = rare_protect(p);		/* read-only pointer */

To subsequently modify q,

	p = rare_modify(q);
	q->a = y;
	rare_protect(p);

Under the covers, rare_modify() would switch to the rare_mm and return
(void *)((unsigned long)q + ARCH_RARE_OFFSET).  All of the rare data
would then be modifiable, although you don't have any other pointers
to it.  rare_protect() would switch back to the previous mm and return
(p - ARCH_RARE_OFFSET).

Does this satisfy Igor's requirements?  We wouldn't be able to
copy_to/from_user() while rare_mm was active.  I think that's a feature
though!  It certainly satisfies my interests (kernel code be able to
mark things as dynamically-allocated-and-read-only-after-initialisation)
Dave Hansen Oct. 30, 2018, 6:03 p.m. UTC | #20
On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> Does this satisfy Igor's requirements?  We wouldn't be able to
> copy_to/from_user() while rare_mm was active.  I think that's a feature
> though!  It certainly satisfies my interests (kernel code be able to
> mark things as dynamically-allocated-and-read-only-after-initialisation)

It has to be more than copy_to/from_user(), though, I think.

rare_modify(q) either has to preempt_disable(), or we need to teach the
context-switching code when and how to switch in/out of the rare_mm.
preempt_disable() would also keep us from sleeping.
Tycho Andersen Oct. 30, 2018, 6:28 p.m. UTC | #21
On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
> 	void *p = rare_alloc(...);	/* writable pointer */
> 	p->a = x;
> 	q = rare_protect(p);		/* read-only pointer */
> 
> To subsequently modify q,
> 
> 	p = rare_modify(q);
> 	q->a = y;

Do you mean

    p->a = y;

here? I assume the intent is that q isn't writable ever, but that's
the one we have in the structure at rest.

Tycho

> 	rare_protect(p);
> 
> Under the covers, rare_modify() would switch to the rare_mm and return
> (void *)((unsigned long)q + ARCH_RARE_OFFSET).  All of the rare data
> would then be modifiable, although you don't have any other pointers
> to it.  rare_protect() would switch back to the previous mm and return
> (p - ARCH_RARE_OFFSET).
> 
> Does this satisfy Igor's requirements?  We wouldn't be able to
> copy_to/from_user() while rare_mm was active.  I think that's a feature
> though!  It certainly satisfies my interests (kernel code be able to
> mark things as dynamically-allocated-and-read-only-after-initialisation)
Andy Lutomirski Oct. 30, 2018, 6:51 p.m. UTC | #22
> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>> I support the addition of a rare-write mechanism to the upstream kernel.
>> And I think that there is only one sane way to implement it: using an
>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>> differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
>    void *p = rare_alloc(...);    /* writable pointer */
>    p->a = x;
>    q = rare_protect(p);        /* read-only pointer */
> 
> To subsequently modify q,
> 
>    p = rare_modify(q);
>    q->a = y;
>    rare_protect(p);

How about:

rare_write(&q->a, y);

Or, for big writes:

rare_write_copy(&q, local_q);

This avoids a whole ton of issues.  In practice, actually running with a special mm requires preemption disabled as well as some other stuff, which Nadav carefully dealt with.

Also, can we maybe focus on getting something merged for statically allocated data first?

Finally, one issue: rare_alloc() is going to utterly suck performance-wise due to the global IPI when the region gets zapped out of the direct map or otherwise made RO.  This is the same issue that makes all existing XPO efforts so painful. We need to either optimize the crap out of it somehow or we need to make sure it’s not called except during rare events like device enumeration.

Nadav, want to resubmit your series?  IIRC the only thing wrong with it was that it was a big change and we wanted a simpler fix to backport. But that’s all done now, and I, at least, rather liked your code. :)
Kees Cook Oct. 30, 2018, 7:14 p.m. UTC | #23
On Tue, Oct 30, 2018 at 11:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>>
>> I'd like to understand this approach a little better.  In a syscall path,
>> we run with the user task's mm.  What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>>
>> So the API might look something like this:
>>
>>    void *p = rare_alloc(...);    /* writable pointer */
>>    p->a = x;
>>    q = rare_protect(p);        /* read-only pointer */
>>
>> To subsequently modify q,
>>
>>    p = rare_modify(q);
>>    q->a = y;
>>    rare_protect(p);
>
> How about:
>
> rare_write(&q->a, y);
>
> Or, for big writes:
>
> rare_write_copy(&q, local_q);
>
> This avoids a whole ton of issues.  In practice, actually running with a special mm requires preemption disabled as well as some other stuff, which Nadav carefully dealt with.

This is what I had before, yes:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=9ab0cb2618ebbc51f830ceaa06b7d2182fe1a52d

It just needs the switch_mm() backend.
Matthew Wilcox Oct. 30, 2018, 7:20 p.m. UTC | #24
On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> > 
> > I'd like to understand this approach a little better.  In a syscall path,
> > we run with the user task's mm.  What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> > 
> > So the API might look something like this:
> > 
> > 	void *p = rare_alloc(...);	/* writable pointer */
> > 	p->a = x;
> > 	q = rare_protect(p);		/* read-only pointer */
> > 
> > To subsequently modify q,
> > 
> > 	p = rare_modify(q);
> > 	q->a = y;
> 
> Do you mean
> 
>     p->a = y;
> 
> here? I assume the intent is that q isn't writable ever, but that's
> the one we have in the structure at rest.

Yes, that was my intent, thanks.

To handle the list case that Igor has pointed out, you might want to
do something like this:

	list_for_each_entry(x, &xs, entry) {
		struct foo *writable = rare_modify(entry);
		kref_get(&writable->ref);
		rare_protect(writable);
	}

but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
Igor Stoppa Oct. 30, 2018, 8:43 p.m. UTC | #25
On 30/10/2018 21:20, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>> And I think that there is only one sane way to implement it: using an
>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>
>>> I'd like to understand this approach a little better.  In a syscall path,
>>> we run with the user task's mm.  What you're proposing is that when we
>>> want to modify rare data, we switch to rare_mm which contains a
>>> writable mapping to all the kernel data which is rare-write.
>>>
>>> So the API might look something like this:
>>>
>>> 	void *p = rare_alloc(...);	/* writable pointer */
>>> 	p->a = x;
>>> 	q = rare_protect(p);		/* read-only pointer */

With pools and memory allocated from vmap_areas, I was able to say

protect(pool)

and that would do a swipe on all the pages currently in use.
In the SELinux policyDB, for example, one doesn't really want to 
individually protect each allocation.

The loading phase happens usually at boot, when the system can be 
assumed to be sane (one might even preload a bare-bone set of rules from 
initramfs and then replace it later on, with the full blown set).

There is no need to process each of these tens of thousands allocations 
and initialization as write-rare.

Would it be possible to do the same here?

>>>
>>> To subsequently modify q,
>>>
>>> 	p = rare_modify(q);
>>> 	q->a = y;
>>
>> Do you mean
>>
>>      p->a = y;
>>
>> here? I assume the intent is that q isn't writable ever, but that's
>> the one we have in the structure at rest.
> 
> Yes, that was my intent, thanks.
> 
> To handle the list case that Igor has pointed out, you might want to
> do something like this:
> 
> 	list_for_each_entry(x, &xs, entry) {
> 		struct foo *writable = rare_modify(entry);

Would this mapping be impossible to spoof by other cores?

I'm asking this because, from what I understand, local interrupts are 
enabled here, so an attack could freeze the core performing the 
write-rare operation, while another scrapes the memory.

But blocking interrupts for the entire body of the loop would make RT 
latency unpredictable.

> 		kref_get(&writable->ref);
> 		rare_protect(writable);
> 	}
> 
> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.

This seems suspiciously close to the duplication of kernel interfaces 
that I was roasted for :-)

--
igor
Andy Lutomirski Oct. 30, 2018, 9:02 p.m. UTC | #26
> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>> And I think that there is only one sane way to implement it: using an
>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>> 
>>>> I'd like to understand this approach a little better.  In a syscall path,
>>>> we run with the user task's mm.  What you're proposing is that when we
>>>> want to modify rare data, we switch to rare_mm which contains a
>>>> writable mapping to all the kernel data which is rare-write.
>>>> 
>>>> So the API might look something like this:
>>>> 
>>>>    void *p = rare_alloc(...);    /* writable pointer */
>>>>    p->a = x;
>>>>    q = rare_protect(p);        /* read-only pointer */
> 
> With pools and memory allocated from vmap_areas, I was able to say
> 
> protect(pool)
> 
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
> 
> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
> 
> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
> 
> Would it be possible to do the same here?

I don’t see why not, although getting the API right will be a tad complicated.

> 
>>>> 
>>>> To subsequently modify q,
>>>> 
>>>>    p = rare_modify(q);
>>>>    q->a = y;
>>> 
>>> Do you mean
>>> 
>>>     p->a = y;
>>> 
>>> here? I assume the intent is that q isn't writable ever, but that's
>>> the one we have in the structure at rest.
>> Yes, that was my intent, thanks.
>> To handle the list case that Igor has pointed out, you might want to
>> do something like this:
>>    list_for_each_entry(x, &xs, entry) {
>>        struct foo *writable = rare_modify(entry);
> 
> Would this mapping be impossible to spoof by other cores?
> 

Indeed. Only the core with the special mm loaded could see it.

But I dislike allowing regular writes in the protected region. We really only need four write primitives:

1. Just write one value.  Call at any time (except NMI).

2. Just copy some bytes. Same as (1) but any number of bytes.

3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.

Actually getting a modifiable pointer should be disallowed for two reasons:

1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.

2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.

We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.
Kees Cook Oct. 30, 2018, 9:07 p.m. UTC | #27
On Tue, Oct 30, 2018 at 2:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>>> And I think that there is only one sane way to implement it: using an
>>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>>
>>>>> I'd like to understand this approach a little better.  In a syscall path,
>>>>> we run with the user task's mm.  What you're proposing is that when we
>>>>> want to modify rare data, we switch to rare_mm which contains a
>>>>> writable mapping to all the kernel data which is rare-write.
>>>>>
>>>>> So the API might look something like this:
>>>>>
>>>>>    void *p = rare_alloc(...);    /* writable pointer */
>>>>>    p->a = x;
>>>>>    q = rare_protect(p);        /* read-only pointer */
>>
>> With pools and memory allocated from vmap_areas, I was able to say
>>
>> protect(pool)
>>
>> and that would do a swipe on all the pages currently in use.
>> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
>>
>> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
>>
>> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad complicated.
>
>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>>    p = rare_modify(q);
>>>>>    q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>>     p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>>    list_for_each_entry(x, &xs, entry) {
>>>        struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
>
> 1. Just write one value.  Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two reasons:
>
> 1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.
>
> We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.

We still have to deal with certain structures under the write-rare
window. For example, see:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
They are wrappers to non-inline functions that have the same sanity-checking.
Igor Stoppa Oct. 30, 2018, 9:25 p.m. UTC | #28
On 30/10/2018 23:07, Kees Cook wrote:

> We still have to deal with certain structures under the write-rare
> window. For example, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
> They are wrappers to non-inline functions that have the same sanity-checking.

Even if I also did something similar, it was not intended to be final.
Just a stop-gap solution till the write-rare mechanism is identified.

If the size of the whole list_head is used as alignment, then the whole 
list_head structure can be given an alternate mapping and the plain list 
function can be used on this alternate mapping.

It can halve the overhead or more.
The typical case is when not only one list_head is contained in one 
page, but also the other, like when allocating and queuing multiple 
items from the same pool.
One single temporary mapping would be enough.

But it becomes tricky to do it, without generating code that is
almost-but-not-quite-identical.

--
igor
Matthew Wilcox Oct. 30, 2018, 9:25 p.m. UTC | #29
On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> Finally, one issue: rare_alloc() is going to utterly suck
> performance-wise due to the global IPI when the region gets zapped out
> of the direct map or otherwise made RO.  This is the same issue that
> makes all existing XPO efforts so painful. We need to either optimize
> the crap out of it somehow or we need to make sure it’s not called
> except during rare events like device enumeration.

Batching operations is kind of the whole point of the VM ;-)  Either
this rare memory gets used a lot, in which case we'll want to create slab
caches for it, make it a MM zone and the whole nine yeards, or it's not
used very much in which case it doesn't matter that performance sucks.

For now, I'd suggest allocating 2MB chunks as needed, and having a
shrinker to hand back any unused pieces.
Matthew Wilcox Oct. 30, 2018, 9:35 p.m. UTC | #30
On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > So the API might look something like this:
> > > > 
> > > > 	void *p = rare_alloc(...);	/* writable pointer */
> > > > 	p->a = x;
> > > > 	q = rare_protect(p);		/* read-only pointer */
> 
> With pools and memory allocated from vmap_areas, I was able to say
> 
> protect(pool)
> 
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to
> individually protect each allocation.
> 
> The loading phase happens usually at boot, when the system can be assumed to
> be sane (one might even preload a bare-bone set of rules from initramfs and
> then replace it later on, with the full blown set).
> 
> There is no need to process each of these tens of thousands allocations and
> initialization as write-rare.
> 
> Would it be possible to do the same here?

What Andy is proposing effectively puts all rare allocations into
one pool.  Although I suppose it could be generalised to multiple pools
... one mm_struct per pool.  Andy, what do you think to doing that?

> > but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
> 
> This seems suspiciously close to the duplication of kernel interfaces that I
> was roasted for :-)

Can you not see the difference between adding one syntactic sugar function
and duplicating the entire infrastructure?
Igor Stoppa Oct. 30, 2018, 9:49 p.m. UTC | #31
On 30/10/2018 23:35, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:


>> Would it be possible to do the same here?
> 
> What Andy is proposing effectively puts all rare allocations into
> one pool.  Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool.  Andy, what do you think to doing that?

The reason to have pools is that, continuing the SELinux example, it 
supports reloading the policyDB.

In this case it seems to me that it would be faster to drop the entire 
pool in one go, and create a new one when re-initializing the rules.

Or maybe the pool could be flushed, without destroying the metadata.
One more reason for having pools is to assign certain property to the 
pool and then rely on it to be applied to every subsequent allocation.

I've also been wondering if pools can be expected to have some well 
defined property.

One might be that they do not need to be created on the fly.
The number of pools should be known at compilation time.
At least the meta-data could be static, but I do not know how this would 
work with modules.

>>> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>>
>> This seems suspiciously close to the duplication of kernel interfaces that I
>> was roasted for :-)
> 
> Can you not see the difference between adding one syntactic sugar function
> and duplicating the entire infrastructure?

And list_add()? or any of the other list related functions?
Don't they end up receiving a similar treatment?

I might have misunderstood your proposal, but it seems to me that they 
too will need the same type of pairs rare_modify/rare_protect. No?

And same for hlist, including the _rcu variants.

--
igor
Igor Stoppa Oct. 30, 2018, 9:55 p.m. UTC | #32
On 30/10/2018 23:25, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
>> Finally, one issue: rare_alloc() is going to utterly suck
>> performance-wise due to the global IPI when the region gets zapped out
>> of the direct map or otherwise made RO.  This is the same issue that
>> makes all existing XPO efforts so painful. We need to either optimize
>> the crap out of it somehow or we need to make sure it’s not called
>> except during rare events like device enumeration.
> 
> Batching operations is kind of the whole point of the VM ;-)  Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yeards, or it's not
> used very much in which case it doesn't matter that performance sucks.
> 
> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.

One of the prime candidates for this sort of protection is IMA.
In the IMA case, there are ever-growing lists which are populated when 
accessing files.
It's something that ends up on the critical path of any usual 
performance critical use case, when accessing files for the first time, 
like at boot/application startup.

Also the SELinux AVC is based on lists. It uses an object cache, but it 
is still something that grows and is on the critical path of evaluating 
the callbacks from the LSM hooks. A lot of them.

These are the main two reasons, so far, for me advocating an 
optimization of the write-rare version of the (h)list.

--
igor
Matthew Wilcox Oct. 30, 2018, 10:08 p.m. UTC | #33
On Tue, Oct 30, 2018 at 11:55:46PM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:25, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > > Finally, one issue: rare_alloc() is going to utterly suck
> > > performance-wise due to the global IPI when the region gets zapped out
> > > of the direct map or otherwise made RO.  This is the same issue that
> > > makes all existing XPO efforts so painful. We need to either optimize
> > > the crap out of it somehow or we need to make sure it’s not called
> > > except during rare events like device enumeration.
> > 
> > Batching operations is kind of the whole point of the VM ;-)  Either
> > this rare memory gets used a lot, in which case we'll want to create slab
> > caches for it, make it a MM zone and the whole nine yeards, or it's not
> > used very much in which case it doesn't matter that performance sucks.
> > 
> > For now, I'd suggest allocating 2MB chunks as needed, and having a
> > shrinker to hand back any unused pieces.
> 
> One of the prime candidates for this sort of protection is IMA.
> In the IMA case, there are ever-growing lists which are populated when
> accessing files.
> It's something that ends up on the critical path of any usual performance
> critical use case, when accessing files for the first time, like at
> boot/application startup.
> 
> Also the SELinux AVC is based on lists. It uses an object cache, but it is
> still something that grows and is on the critical path of evaluating the
> callbacks from the LSM hooks. A lot of them.
> 
> These are the main two reasons, so far, for me advocating an optimization of
> the write-rare version of the (h)list.

I think these are both great examples of why doubly-linked lists _suck_.
You have to modify three cachelines to add an entry to a list.  Walking a
linked list is an exercise in cache misses.  Far better to use an XArray /
IDR for this purpose.
Igor Stoppa Oct. 30, 2018, 10:15 p.m. UTC | #34
On 30/10/2018 23:02, Andy Lutomirski wrote:
> 
> 
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:


>> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
>>
>> Would it be possible to do the same here?
> 
> I don’t see why not, although getting the API right will be a tad complicated.

yes, I have some first-hand experience with this :-/

>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>>     p = rare_modify(q);
>>>>>     q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>>      p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>>     list_for_each_entry(x, &xs, entry) {
>>>         struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
> 
> Indeed. Only the core with the special mm loaded could see it.
> 
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
> 
> 1. Just write one value.  Call at any time (except NMI).
> 
> 2. Just copy some bytes. Same as (1) but any number of bytes.
> 
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.

Atomic? RCU?

Yes, they are technically just memory writes, but shouldn't the "normal" 
operation be executed on the writable mapping?

It is possible to sandwich any call between a rare_modify/rare_protect, 
however isn't that pretty close to having a write-rare version of these 
plain function.

--
igor
Nadav Amit Oct. 30, 2018, 11:18 p.m. UTC | #35
From: Andy Lutomirski
Sent: October 30, 2018 at 6:51:17 PM GMT
> To: Matthew Wilcox <willy@infradead.org>, nadav.amit@gmail.com
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> 
> 
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>> 
>> I'd like to understand this approach a little better.  In a syscall path,
>> we run with the user task's mm.  What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>> 
>> So the API might look something like this:
>> 
>>   void *p = rare_alloc(...);    /* writable pointer */
>>   p->a = x;
>>   q = rare_protect(p);        /* read-only pointer */
>> 
>> To subsequently modify q,
>> 
>>   p = rare_modify(q);
>>   q->a = y;
>>   rare_protect(p);
> 
> How about:
> 
> rare_write(&q->a, y);
> 
> Or, for big writes:
> 
> rare_write_copy(&q, local_q);
> 
> This avoids a whole ton of issues. In practice, actually running with a
> special mm requires preemption disabled as well as some other stuff, which
> Nadav carefully dealt with.
> 
> Also, can we maybe focus on getting something merged for statically
> allocated data first?
> 
> Finally, one issue: rare_alloc() is going to utterly suck performance-wise
> due to the global IPI when the region gets zapped out of the direct map or
> otherwise made RO. This is the same issue that makes all existing XPO
> efforts so painful. We need to either optimize the crap out of it somehow
> or we need to make sure it’s not called except during rare events like
> device enumeration.
> 
> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> that it was a big change and we wanted a simpler fix to backport. But
> that’s all done now, and I, at least, rather liked your code. :)

I guess since it was based on your ideas…

Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
make kgdb use of poke_text() less disgusting. I still don’t know exactly
how to deal with it.

Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
v2.
Andy Lutomirski Oct. 31, 2018, 4:41 a.m. UTC | #36
On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> > On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > > So the API might look something like this:
> > > > >
> > > > >         void *p = rare_alloc(...);      /* writable pointer */
> > > > >         p->a = x;
> > > > >         q = rare_protect(p);            /* read-only pointer */
> >
> > With pools and memory allocated from vmap_areas, I was able to say
> >
> > protect(pool)
> >
> > and that would do a swipe on all the pages currently in use.
> > In the SELinux policyDB, for example, one doesn't really want to
> > individually protect each allocation.
> >
> > The loading phase happens usually at boot, when the system can be assumed to
> > be sane (one might even preload a bare-bone set of rules from initramfs and
> > then replace it later on, with the full blown set).
> >
> > There is no need to process each of these tens of thousands allocations and
> > initialization as write-rare.
> >
> > Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool.  Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool.  Andy, what do you think to doing that?

Hmm.  Let's see.

To clarify some of this thread, I think that the fact that rare_write
uses an mm_struct and alias mappings under the hood should be
completely invisible to users of the API.  No one should ever be
handed a writable pointer to rare_write memory (except perhaps during
bootup or when initializing a large complex data structure that will
be rare_write but isn't yet, e.g. the policy db).

For example, there could easily be architectures where having a
writable alias is problematic.  On such architectures, an entirely
different mechanism might work better.  And, if a tool like KNOX ever
becomes a *part* of the Linux kernel (hint hint!)

If you have multiple pools and one mm_struct per pool, you'll need a
way to find the mm_struct from a given allocation.  Regardless of how
the mm_structs are set up, changing rare_write memory to normal memory
or vice versa will require a global TLB flush (all ASIDs and global
pages) on all CPUs, so having extra mm_structs doesn't seem to buy
much.

(It's just possible that changing rare_write back to normal might be
able to avoid the flush if the spurious faults can be handled
reliably.)
Peter Zijlstra Oct. 31, 2018, 9:08 a.m. UTC | #37
On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
> > Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> > that it was a big change and we wanted a simpler fix to backport. But
> > that’s all done now, and I, at least, rather liked your code. :)
> 
> I guess since it was based on your ideas…
> 
> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
> make kgdb use of poke_text() less disgusting. I still don’t know exactly
> how to deal with it.
> 
> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
> v2.

Oh man, I completely forgot about kgdb :/

Also, would it be possible to entirely remove that kmap fallback path?
Igor Stoppa Oct. 31, 2018, 9:08 a.m. UTC | #38
Adding SELinux folks and the SELinux ml

I think it's better if they participate in this discussion.

On 31/10/2018 06:41, Andy Lutomirski wrote:
> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>> So the API might look something like this:
>>>>>>
>>>>>>          void *p = rare_alloc(...);      /* writable pointer */
>>>>>>          p->a = x;
>>>>>>          q = rare_protect(p);            /* read-only pointer */
>>>
>>> With pools and memory allocated from vmap_areas, I was able to say
>>>
>>> protect(pool)
>>>
>>> and that would do a swipe on all the pages currently in use.
>>> In the SELinux policyDB, for example, one doesn't really want to
>>> individually protect each allocation.
>>>
>>> The loading phase happens usually at boot, when the system can be assumed to
>>> be sane (one might even preload a bare-bone set of rules from initramfs and
>>> then replace it later on, with the full blown set).
>>>
>>> There is no need to process each of these tens of thousands allocations and
>>> initialization as write-rare.
>>>
>>> Would it be possible to do the same here?
>>
>> What Andy is proposing effectively puts all rare allocations into
>> one pool.  Although I suppose it could be generalised to multiple pools
>> ... one mm_struct per pool.  Andy, what do you think to doing that?
> 
> Hmm.  Let's see.
> 
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.  

I agree.

> No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

The policy db doesn't need to be write rare.
Actually, it really shouldn't be write rare.

Maybe it's just a matter of wording, but effectively the policyDB can be 
trated with this sequence:

1) allocate various data structures in writable form

2) initialize them

3) go back to 1 as needed

4) lock down everything that has been allocated, as Read-Only
The reason why I stress ReadOnly is that differentiating what is really 
ReadOnly from what is WriteRare provides an extra edge against attacks, 
because attempts to alter ReadOnly data through a WriteRare API could be 
detected

5) read any part of the policyDB during regular operations

6) in case of update, create a temporary new version, using steps 1..3

7) if update successful, use the new one and destroy the old one

8) if the update failed, destroy the new one

The destruction at points 7 and 8 is not so much a write operation, as 
it is a release of the memory.

So, we might have a bit different interpretation of what write-rare 
means wrt destroying the memory and its content.

To clarify: I've been using write-rare to indicate primarily small 
operations that one would otherwise achieve with "=", memcpy or memset 
or more complex variants, like atomic ops, rcu pointer assignment, etc.

Tearing down an entire set of allocations like the policyDB doesn't fit 
very well with that model.

The only part which _needs_ to be write rare, in the policyDB, is the 
set of pointers which are used to access all the dynamically allocated 
data set.

These pointers must be updated when a new policyDB is allocated.

> For example, there could easily be architectures where having a
> writable alias is problematic.  On such architectures, an entirely
> different mechanism might work better.  And, if a tool like KNOX ever
> becomes a *part* of the Linux kernel (hint hint!)

Something related, albeit not identical is going on here [1]
Eventually, it could be expanded to deal also with write rare.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation. 

Indeed. In my patchset, based on vmas, I do the following:
* a private field from the page struct points to the vma using that page
* inside the vma there is alist_head used only during deletion
   - one pointer is used to chain vmas fro mthe same pool
   - one pointer points to the pool struct
* the pool struct has the property to use for all the associated 
allocations: is it write-rare, read-only, does it auto protect, etc.

> Regardless of how
> the mm_structs are set up, changing rare_write memory to normal memory
> or vice versa will require a global TLB flush (all ASIDs and global
> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
> much.

1) it supports differnt levels of protection:
   temporarily unprotected vs read-only vs write-rare

2) the change of write permission should be possible only toward more 
restrictive rules (writable -> write-rare -> read-only) and only to the 
point that was specified while creating the pool, to avoid DOS attacks, 
where a write-rare is flipped into read-only and further updates fail
(ex: prevent IMA from registering modifications to a file, by not 
letting it store new information - I'm not 100% sure this would work, 
but it gives the idea, I think)

3) being able to track all the allocations related to a pool would allow 
to perform mass operations, like reducing the writabilty or destroying 
all the allocations.

> (It's just possible that changing rare_write back to normal might be
> able to avoid the flush if the spurious faults can be handled
> reliably.)

I do not see the need for such a case of degrading the write permissions 
of an allocation, unless it refers to the release of a pool of 
allocations (see updating the SELinux policy DB)

[1] https://www.openwall.com/lists/kernel-hardening/2018/10/26/11

--
igor
Peter Zijlstra Oct. 31, 2018, 9:18 a.m. UTC | #39
On Tue, Oct 30, 2018 at 11:03:51AM -0700, Dave Hansen wrote:
> On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> > Does this satisfy Igor's requirements?  We wouldn't be able to
> > copy_to/from_user() while rare_mm was active.  I think that's a feature
> > though!  It certainly satisfies my interests (kernel code be able to
> > mark things as dynamically-allocated-and-read-only-after-initialisation)
> 
> It has to be more than copy_to/from_user(), though, I think.
> 
> rare_modify(q) either has to preempt_disable(), or we need to teach the
> context-switching code when and how to switch in/out of the rare_mm.
> preempt_disable() would also keep us from sleeping.

Yes, I think we want to have preempt disable at the very least. We could
indeed make the context switch code smart and teach is about this state,
but I think allowing preemption in such section is a bad idea. We want
to keep these sections short and simple (and bounded), such that they
can be analyzed for correctness.

Once you allow preemption, things tend to grow large and complex.

Ideally we'd even disabled interrupts over this, to further limit what
code runs in the rare_mm context. NMIs need special care anyway.
Igor Stoppa Oct. 31, 2018, 9:27 a.m. UTC | #40
On 30/10/2018 18:37, Kees Cook wrote:
> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>>   1) find buffer-overrun / bound check failure
>>   2) use that to write to 'interesting' location
>>   3) that write results arbitrary code execution
>>   4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
> 
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).

Reading these points I see where I was not clear.

Maybe I can fix that. SELinux is a good example of safeguard against a 
takeover of root, so it is a primary target. Unfortunately it contains 
some state variables that can be used to disable it.

Other attack that comes to my mind, for executing code within the 
kernel, without sweating too much with ROP is the following:

1) find the rabbit hole to write kernel data, whatever it might be
2) spray the keyring with own key
3) use the module loading infrastructure to load own module, signed with 
the key at point 2)

Just to say that direct arbitrary code execution is becoming harder to 
perform, but there are ways around it which rely more on overwriting 
unprotected data.

They are a lower hanging fruit for an attacker.

--
igor
Peter Zijlstra Oct. 31, 2018, 9:29 a.m. UTC | #41
On Tue, Oct 30, 2018 at 02:25:51PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > Finally, one issue: rare_alloc() is going to utterly suck
> > performance-wise due to the global IPI when the region gets zapped out
> > of the direct map or otherwise made RO.  This is the same issue that
> > makes all existing XPO efforts so painful. We need to either optimize
> > the crap out of it somehow or we need to make sure it’s not called
> > except during rare events like device enumeration.
> 
> Batching operations is kind of the whole point of the VM ;-)  Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yeards, or it's not
> used very much in which case it doesn't matter that performance sucks.

Yes, for the dynamic case something along those lines would be needed.
If we have a single rare zone, we could even have __GFP_RARE or whatever
that manages this.

The page allocator would have to grow a rare memblock type, and every
rare alloc would allocate from a rare memblock, when none is available,
creation of a rare block would set up the mappings etc..

> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.

Something like the percpu allocator?
Peter Zijlstra Oct. 31, 2018, 9:36 a.m. UTC | #42
On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
> 	void *p = rare_alloc(...);	/* writable pointer */
> 	p->a = x;
> 	q = rare_protect(p);		/* read-only pointer */
> 
> To subsequently modify q,
> 
> 	p = rare_modify(q);
> 	q->a = y;
> 	rare_protect(p);

Why would you have rare_alloc() imply rare_modify() ? Would you have the
allocator meta data inside the rare section?
Peter Zijlstra Oct. 31, 2018, 9:45 a.m. UTC | #43
On Tue, Oct 30, 2018 at 02:02:12PM -0700, Andy Lutomirski wrote:
> But I dislike allowing regular writes in the protected region. We
> really only need four write primitives:
> 
> 1. Just write one value.  Call at any time (except NMI).
> 
> 2. Just copy some bytes. Same as (1) but any number of bytes.

Given the !preempt/!IRQ contraints I'd certainly put an upper limit on
the number of bytes there.

> 3,4: Same as 1 and 2 but must be called inside a special rare write
> region. This is purely an optimization.
> 
> Actually getting a modifiable pointer should be disallowed for two
> reasons:
> 
> 1. Some architectures may want to use a special
> write-different-address-space operation.

You're thinking of s390 ? :-)

> Heck, x86 could, too: make
> the actual offset be a secret and shove the offset into FSBASE or
> similar. Then %fs-prefixed writes would do the rare writes.

> 2. Alternatively, x86 could set the U bit. Then the actual writes
> would use the uaccess helpers, giving extra protection via SMAP.

Cute, and yes, something like that would be nice.

> We don’t really want a situation where an unchecked pointer in the
> rare write region completely defeats the mechanism.

Agreed.
Peter Zijlstra Oct. 31, 2018, 10:02 a.m. UTC | #44
On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.  No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

Being able to use pointers would make it far easier to do atomics and
other things though.

> For example, there could easily be architectures where having a
> writable alias is problematic.

Mostly we'd just have to be careful of cache aliases, alignment should
be able to sort that I think.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.

Or keep track of it externally. For example by context. If you modify
page-tables you pick the page-table pool, if you modify selinux state,
you pick the selinux pool.

> Regardless of how the mm_structs are set up, changing rare_write
> memory to normal memory or vice versa will require a global TLB flush
> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
> doesn't seem to buy much.

The way I understand it, the point is that if you stick page-tables and
selinux state in different pools, a stray write in one will never affect
the other.
Peter Zijlstra Oct. 31, 2018, 10:11 a.m. UTC | #45
On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:02, Andy Lutomirski wrote:

> > But I dislike allowing regular writes in the protected region. We
> > really only need four write primitives:
> > 
> > 1. Just write one value.  Call at any time (except NMI).
> > 
> > 2. Just copy some bytes. Same as (1) but any number of bytes.
> > 
> > 3,4: Same as 1 and 2 but must be called inside a special rare write
> > region. This is purely an optimization.
> 
> Atomic? RCU?

RCU can be done, that's not really a problem. Atomics otoh are a
problem. Having pointers makes them just work.

Andy; I understand your reason for not wanting them, but I really don't
want to duplicate everything. Is there something we can do with static
analysis to make you more comfortable with the pointer thing?
Matthew Wilcox Oct. 31, 2018, 11:33 a.m. UTC | #46
On Wed, Oct 31, 2018 at 10:36:59AM +0100, Peter Zijlstra wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> > 
> > I'd like to understand this approach a little better.  In a syscall path,
> > we run with the user task's mm.  What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> > 
> > So the API might look something like this:
> > 
> > 	void *p = rare_alloc(...);	/* writable pointer */
> > 	p->a = x;
> > 	q = rare_protect(p);		/* read-only pointer */
> > 
> > To subsequently modify q,
> > 
> > 	p = rare_modify(q);
> > 	q->a = y;
> > 	rare_protect(p);
> 
> Why would you have rare_alloc() imply rare_modify() ? Would you have the
> allocator meta data inside the rare section?

Normally when I allocate some memory I need to initialise it before
doing anything else with it ;-)

I mean, you could do:

	ro = rare_alloc(..);
	rare = rare_modify(ro);
	rare->a = x;
	rare_protect(rare);

but that's more typing.
Igor Stoppa Oct. 31, 2018, 7:38 p.m. UTC | #47
On 31/10/2018 11:08, Igor Stoppa wrote:
> Adding SELinux folks and the SELinux ml
> 
> I think it's better if they participate in this discussion.
> 
> On 31/10/2018 06:41, Andy Lutomirski wrote:
>> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> 
>> wrote:
>>>
>>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>>> So the API might look something like this:
>>>>>>>
>>>>>>>          void *p = rare_alloc(...);      /* writable pointer */
>>>>>>>          p->a = x;
>>>>>>>          q = rare_protect(p);            /* read-only pointer */
>>>>
>>>> With pools and memory allocated from vmap_areas, I was able to say
>>>>
>>>> protect(pool)
>>>>
>>>> and that would do a swipe on all the pages currently in use.
>>>> In the SELinux policyDB, for example, one doesn't really want to
>>>> individually protect each allocation.
>>>>
>>>> The loading phase happens usually at boot, when the system can be 
>>>> assumed to
>>>> be sane (one might even preload a bare-bone set of rules from 
>>>> initramfs and
>>>> then replace it later on, with the full blown set).
>>>>
>>>> There is no need to process each of these tens of thousands 
>>>> allocations and
>>>> initialization as write-rare.
>>>>
>>>> Would it be possible to do the same here?
>>>
>>> What Andy is proposing effectively puts all rare allocations into
>>> one pool.  Although I suppose it could be generalised to multiple pools
>>> ... one mm_struct per pool.  Andy, what do you think to doing that?
>>
>> Hmm.  Let's see.
>>
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API. 
> 
> I agree.
> 
>> No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
> 
> The policy db doesn't need to be write rare.
> Actually, it really shouldn't be write rare.
> 
> Maybe it's just a matter of wording, but effectively the policyDB can be 
> trated with this sequence:
> 
> 1) allocate various data structures in writable form
> 
> 2) initialize them
> 
> 3) go back to 1 as needed
> 
> 4) lock down everything that has been allocated, as Read-Only
> The reason why I stress ReadOnly is that differentiating what is really 
> ReadOnly from what is WriteRare provides an extra edge against attacks, 
> because attempts to alter ReadOnly data through a WriteRare API could be 
> detected
> 
> 5) read any part of the policyDB during regular operations
> 
> 6) in case of update, create a temporary new version, using steps 1..3
> 
> 7) if update successful, use the new one and destroy the old one
> 
> 8) if the update failed, destroy the new one
> 
> The destruction at points 7 and 8 is not so much a write operation, as 
> it is a release of the memory.
> 
> So, we might have a bit different interpretation of what write-rare 
> means wrt destroying the memory and its content.
> 
> To clarify: I've been using write-rare to indicate primarily small 
> operations that one would otherwise achieve with "=", memcpy or memset 
> or more complex variants, like atomic ops, rcu pointer assignment, etc.
> 
> Tearing down an entire set of allocations like the policyDB doesn't fit 
> very well with that model.
> 
> The only part which _needs_ to be write rare, in the policyDB, is the 
> set of pointers which are used to access all the dynamically allocated 
> data set.
> 
> These pointers must be updated when a new policyDB is allocated.
> 
>> For example, there could easily be architectures where having a
>> writable alias is problematic.  On such architectures, an entirely
>> different mechanism might work better.  And, if a tool like KNOX ever
>> becomes a *part* of the Linux kernel (hint hint!)
> 
> Something related, albeit not identical is going on here [1]
> Eventually, it could be expanded to deal also with write rare.
> 
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation. 
> 
> Indeed. In my patchset, based on vmas, I do the following:
> * a private field from the page struct points to the vma using that page
> * inside the vma there is alist_head used only during deletion
>    - one pointer is used to chain vmas fro mthe same pool
>    - one pointer points to the pool struct
> * the pool struct has the property to use for all the associated 
> allocations: is it write-rare, read-only, does it auto protect, etc.
> 
>> Regardless of how
>> the mm_structs are set up, changing rare_write memory to normal memory
>> or vice versa will require a global TLB flush (all ASIDs and global
>> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
>> much.
> 
> 1) it supports differnt levels of protection:
>    temporarily unprotected vs read-only vs write-rare
> 
> 2) the change of write permission should be possible only toward more 
> restrictive rules (writable -> write-rare -> read-only) and only to the 
> point that was specified while creating the pool, to avoid DOS attacks, 
> where a write-rare is flipped into read-only and further updates fail
> (ex: prevent IMA from registering modifications to a file, by not 
> letting it store new information - I'm not 100% sure this would work, 
> but it gives the idea, I think)
> 
> 3) being able to track all the allocations related to a pool would allow 
> to perform mass operations, like reducing the writabilty or destroying 
> all the allocations.
> 
>> (It's just possible that changing rare_write back to normal might be
>> able to avoid the flush if the spurious faults can be handled
>> reliably.)
> 
> I do not see the need for such a case of degrading the write permissions 
> of an allocation, unless it refers to the release of a pool of 
> allocations (see updating the SELinux policy DB)

A few more thoughts about pools. Not sure if they are all correct.
Note: I stick to "pool", instead of mm_struct, because what I'll say is 
mostly independent from the implementation.

- As Peter Zijlstra wrote earlier, protecting a target moves the focus 
of the attack to something else. In this case, probably, the "something 
else" would be the metadata of the pool(s).

- The amount of pools needed should be known at compilation time, so the 
metadata used for pools could be statically allocated and any change to 
it could be treated as write-rare.

- only certain fields of a pool structure would be writable, even as 
write-rare, after the pool is initialized.
In case the pool is a mm_struct or a superset (containing also 
additional properties, like the type of writability: RO or WR), the field

struct vm_area_struct *mmap;

is an example of what could be protected. It should be alterable only 
when creating/destroying the pool and making the first initialization.

- to speed up and also improve the validation of the target of a 
write-rare operation, it would be really desirable if the target had 
some intrinsic property which clearly differentiates it from 
non-write-rare memory. Its address, for example. The amount of write 
rare data needed by a full kernel should not exceed a few tens of 
megabytes. On a 64 bit system it shouldn't be so bad to reserve an 
address range maybe one order of magnitude lager than that.
It could even become a parameter for the creation of a pool.
SELinux, for example, should fit within 10-20MB. Or it could be a 
command line parameter.

- even if an hypervisor was present, it would be preferable to use it 
exclusively as extra protection, which triggers an exception only when 
something abnormal happens. The hypervisor should not become aware of 
the actual meaning of kernel (meta)data. Ideally, it would be mostly 
used for trapping unexpected writes to pages which are not supposed to 
be modified.

- one more reason for using pools is that, if each pool was also acting 
as memory cache for its users, attacks relying on use-after-free would 
not have access to possible vulnerability, because the memory and 
addresses associated with a pool would stay with it.

--
igor
Andy Lutomirski Oct. 31, 2018, 8:36 p.m. UTC | #48
> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API.  No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
> 
> Being able to use pointers would make it far easier to do atomics and
> other things though.

This stuff is called *rare* write for a reason. Do we really want to allow atomics beyond just store-release?  Taking a big lock and then writing in the right order should cover everything, no?

> 
>> For example, there could easily be architectures where having a
>> writable alias is problematic.
> 
> Mostly we'd just have to be careful of cache aliases, alignment should
> be able to sort that I think.
> 
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
> 
> Or keep track of it externally. For example by context. If you modify
> page-tables you pick the page-table pool, if you modify selinux state,
> you pick the selinux pool.
> 
>> Regardless of how the mm_structs are set up, changing rare_write
>> memory to normal memory or vice versa will require a global TLB flush
>> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
>> doesn't seem to buy much.
> 
> The way I understand it, the point is that if you stick page-tables and
> selinux state in different pools, a stray write in one will never affect
> the other.
> 

Hmm. That’s not totally crazy, but the API would need to be carefully designed. And some argument would have to be made as to why it’s better to use a different address space as opposed to checking in software along the lines of the uaccess checking.
Andy Lutomirski Oct. 31, 2018, 8:38 p.m. UTC | #49
> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>> On 30/10/2018 23:02, Andy Lutomirski wrote:
> 
>>> But I dislike allowing regular writes in the protected region. We
>>> really only need four write primitives:
>>> 
>>> 1. Just write one value.  Call at any time (except NMI).
>>> 
>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>> 
>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>> region. This is purely an optimization.
>> 
>> Atomic? RCU?
> 
> RCU can be done, that's not really a problem. Atomics otoh are a
> problem. Having pointers makes them just work.
> 
> Andy; I understand your reason for not wanting them, but I really don't
> want to duplicate everything. Is there something we can do with static
> analysis to make you more comfortable with the pointer thing?

I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.

And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?
Andy Lutomirski Oct. 31, 2018, 8:53 p.m. UTC | #50
> On Oct 31, 2018, at 1:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>>> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>> 
>>>> But I dislike allowing regular writes in the protected region. We
>>>> really only need four write primitives:
>>>> 
>>>> 1. Just write one value.  Call at any time (except NMI).
>>>> 
>>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>> 
>>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>>> region. This is purely an optimization.
>>> 
>>> Atomic? RCU?
>> 
>> RCU can be done, that's not really a problem. Atomics otoh are a
>> problem. Having pointers makes them just work.
>> 
>> Andy; I understand your reason for not wanting them, but I really don't
>> want to duplicate everything. Is there something we can do with static
>> analysis to make you more comfortable with the pointer thing?
> 
> I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
> 
> And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?

Also, if we’re using a hypervisor, then there are a couple ways it could be done:

1. VMFUNC.  Pointers work fine.  This is stronger than any amount of CR3 trickery because it can’t be defeated by page table attacks.

2. A hypercall to do the write. No pointers.

Basically, I think that if we can get away without writable pointers, we get more flexibility and less need for fancy static analysis. If we do need pointers, then so be it.
Peter Zijlstra Oct. 31, 2018, 9 p.m. UTC | #51
On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
> 
> > On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> >> To clarify some of this thread, I think that the fact that rare_write
> >> uses an mm_struct and alias mappings under the hood should be
> >> completely invisible to users of the API.  No one should ever be
> >> handed a writable pointer to rare_write memory (except perhaps during
> >> bootup or when initializing a large complex data structure that will
> >> be rare_write but isn't yet, e.g. the policy db).
> > 
> > Being able to use pointers would make it far easier to do atomics and
> > other things though.
> 
> This stuff is called *rare* write for a reason. Do we really want to
> allow atomics beyond just store-release?  Taking a big lock and then
> writing in the right order should cover everything, no?

Ah, so no. That naming is very misleading.

We modify page-tables a _lot_. The point is that only a few sanctioned
sites are allowed writing to it, not everybody.

I _think_ the use-case for atomics is updating the reference counts of
objects that are in this write-rare domain. But I'm not entirely clear
on that myself either. I just really want to avoid duplicating that
stuff.
Andy Lutomirski Oct. 31, 2018, 10:57 p.m. UTC | #52
> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>> 
>>>> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>>>> To clarify some of this thread, I think that the fact that rare_write
>>>> uses an mm_struct and alias mappings under the hood should be
>>>> completely invisible to users of the API.  No one should ever be
>>>> handed a writable pointer to rare_write memory (except perhaps during
>>>> bootup or when initializing a large complex data structure that will
>>>> be rare_write but isn't yet, e.g. the policy db).
>>> 
>>> Being able to use pointers would make it far easier to do atomics and
>>> other things though.
>> 
>> This stuff is called *rare* write for a reason. Do we really want to
>> allow atomics beyond just store-release?  Taking a big lock and then
>> writing in the right order should cover everything, no?
> 
> Ah, so no. That naming is very misleading.
> 
> We modify page-tables a _lot_. The point is that only a few sanctioned
> sites are allowed writing to it, not everybody.
> 
> I _think_ the use-case for atomics is updating the reference counts of
> objects that are in this write-rare domain. But I'm not entirely clear
> on that myself either. I just really want to avoid duplicating that
> stuff.

Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.

Can we see a *real* use case before we over complicate the API?
Igor Stoppa Oct. 31, 2018, 11:10 p.m. UTC | #53
On 01/11/2018 00:57, Andy Lutomirski wrote:
> 
> 
>> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:



>> I _think_ the use-case for atomics is updating the reference counts of
>> objects that are in this write-rare domain. But I'm not entirely clear
>> on that myself either. I just really want to avoid duplicating that
>> stuff.
> 
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
> 
> Can we see a *real* use case before we over complicate the API?
> 


Does patch #14 of this set not qualify? ima_htable.len ?

https://www.openwall.com/lists/kernel-hardening/2018/10/23/20

--
igor
Andy Lutomirski Oct. 31, 2018, 11:19 p.m. UTC | #54
On Wed, Oct 31, 2018 at 4:10 PM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>
>
> On 01/11/2018 00:57, Andy Lutomirski wrote:
> >
> >
> >> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
>
> >> I _think_ the use-case for atomics is updating the reference counts of
> >> objects that are in this write-rare domain. But I'm not entirely clear
> >> on that myself either. I just really want to avoid duplicating that
> >> stuff.
> >
> > Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
> >
> > Can we see a *real* use case before we over complicate the API?
> >
>
>
> Does patch #14 of this set not qualify? ima_htable.len ?
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/20
>

Do you mean this (sorry for whitespace damage):

+ pratomic_long_inc(&ima_htable.len);

- atomic_long_inc(&ima_htable.len);
  if (update_htable) {
    key = ima_hash_key(entry->digest);
-   hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+   prhlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
  }

ISTM you don't need that atomic operation -- you could take a spinlock
and then just add one directly to the variable.

--Andy
Igor Stoppa Oct. 31, 2018, 11:26 p.m. UTC | #55
On 01/11/2018 01:19, Andy Lutomirski wrote:

> ISTM you don't need that atomic operation -- you could take a spinlock
> and then just add one directly to the variable.

It was my intention to provide a 1:1 conversion of existing code, as it 
should be easier to verify the correctness of the conversion, as long as 
there isn't any significant degradation in performance.

The rework could be done afterward.

--
igor
Thomas Gleixner Nov. 1, 2018, 8:21 a.m. UTC | #56
Igor,

On Thu, 1 Nov 2018, Igor Stoppa wrote:
> On 01/11/2018 01:19, Andy Lutomirski wrote:
> 
> > ISTM you don't need that atomic operation -- you could take a spinlock
> > and then just add one directly to the variable.
> 
> It was my intention to provide a 1:1 conversion of existing code, as it should
> be easier to verify the correctness of the conversion, as long as there isn't
> any significant degradation in performance.
> 
> The rework could be done afterward.

Please don't go there. The usual approach is to

  1) Rework existing code in a way that the new functionality can be added
     with minimal effort afterwards and without creating API horrors.

  2) Verify correctness of the rework

  3) Add the new functionality
  
That avoids creation of odd functionality and APIs in the first place, so
they won't be used in other places and does not leave half cleaned up code
around which will stick for a long time.

Thanks,

	tglx
Igor Stoppa Nov. 1, 2018, 3:58 p.m. UTC | #57
On 01/11/2018 10:21, Thomas Gleixner wrote:
> On Thu, 1 Nov 2018, Igor Stoppa wrote:

>> The rework could be done afterward.
> 
> Please don't go there. The usual approach is to
> 
>    1) Rework existing code in a way that the new functionality can be added
>       with minimal effort afterwards and without creating API horrors.


Thanks a lot for the advice.
It makes things even easier for me, as I can start the rework, while the 
discussion on the actual write-rare mechanism continues.

--
igor
Nadav Amit Nov. 1, 2018, 4:31 p.m. UTC | #58
From: Peter Zijlstra
Sent: October 31, 2018 at 9:08:35 AM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>> that it was a big change and we wanted a simpler fix to backport. But
>>> that’s all done now, and I, at least, rather liked your code. :)
>> 
>> I guess since it was based on your ideas…
>> 
>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>> how to deal with it.
>> 
>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>> v2.
> 
> Oh man, I completely forgot about kgdb :/
> 
> Also, would it be possible to entirely remove that kmap fallback path?

Let me see what I can do about it.
Peter Zijlstra Nov. 1, 2018, 5:08 p.m. UTC | #59
On Wed, Oct 31, 2018 at 03:57:21PM -0700, Andy Lutomirski wrote:
> > On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:

> >> This stuff is called *rare* write for a reason. Do we really want to
> >> allow atomics beyond just store-release?  Taking a big lock and then
> >> writing in the right order should cover everything, no?
> > 
> > Ah, so no. That naming is very misleading.
> > 
> > We modify page-tables a _lot_. The point is that only a few sanctioned
> > sites are allowed writing to it, not everybody.
> > 
> > I _think_ the use-case for atomics is updating the reference counts of
> > objects that are in this write-rare domain. But I'm not entirely clear
> > on that myself either. I just really want to avoid duplicating that
> > stuff.
> 
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best.

Yes, which is why I'm somewhat sceptical of the whole endeavour.
Nadav Amit Nov. 2, 2018, 9:11 p.m. UTC | #60
From: Nadav Amit
Sent: November 1, 2018 at 4:31:59 PM GMT
> To: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> From: Peter Zijlstra
> Sent: October 31, 2018 at 9:08:35 AM GMT
>> To: Nadav Amit <nadav.amit@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
>> Subject: Re: [PATCH 10/17] prmem: documentation
>> 
>> 
>> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>>> that it was a big change and we wanted a simpler fix to backport. But
>>>> that’s all done now, and I, at least, rather liked your code. :)
>>> 
>>> I guess since it was based on your ideas…
>>> 
>>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>>> how to deal with it.
>>> 
>>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>>> v2.
>> 
>> Oh man, I completely forgot about kgdb :/
>> 
>> Also, would it be possible to entirely remove that kmap fallback path?
> 
> Let me see what I can do about it.

My patches had several embarrassing bugs. I’m fixing and will resend later,
hopefully today.
Igor Stoppa Nov. 13, 2018, 2:25 p.m. UTC | #61
Hello,
I've been studying v4 of the patch-set [1] that Nadav has been working on.
Incidentally, I think it would be useful to cc also the 
security/hardening ml.
The patch-set seems to be close to final, so I am resuming this discussion.

On 30/10/2018 19:06, Andy Lutomirski wrote:

> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.

After reading the code, I see what you meant.
I think I can work with it.

But I have a couple of questions wrt the use of this mechanism, in the 
context of write rare.


1) mm_struct.

Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code 
(live patch?), which seems to happen sequentially and in a relatively 
standardized way, like replacing the NOPs specifically placed in the 
functions that need patching.

This is a bit different from the more generic write-rare case, applied 
to data.

As example, I have in mind a system where both IMA and SELinux are in use.

In this system, a file is accessed for the first time.

That would trigger 2 things:
- evaluation of the SELinux rules and probably update of the AVC cache
- IMA measurement and update of the measurements

Both of them could be write protected, meaning that they would both have 
to be modified through the write rare mechanism.

While the events, for 1 specific file, would be sequential, it's not 
difficult to imagine that multiple files could be accessed at the same time.

If the update of the data structures in both IMA and SELinux must use 
the same mm_struct, that would have to be somehow regulated and it would 
introduce an unnecessary (imho) dependency.

How about having one mm_struct for each writer (core or thread)?



2) Iiuc, the purpose of the 2 pages being remapped is that the target of 
the patch might spill across the page boundary, however if I deal with 
the modification of generic data, I shouldn't (shouldn't I?) assume that 
the data will not span across multiple pages.

If the data spans across multiple pages, in unknown amount, I suppose 
that I should not keep interrupts disabled for an unknown time, as it 
would hurt preemption.

What I thought, in my initial patch-set, was to iterate over each page 
that must be written to, in a loop, re-enabling interrupts in-between 
iterations, to give pending interrupts a chance to be served.

This would mean that the data being written to would not be consistent, 
but it's a problem that would have to be addressed anyways, since it can 
be still read by other cores, while the write is ongoing.


Is this a valid concern/approach?


--
igor


[1] https://lkml.org/lkml/2018/11/11/110
Andy Lutomirski Nov. 13, 2018, 5:16 p.m. UTC | #62
On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hello,
> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> Incidentally, I think it would be useful to cc also the
> security/hardening ml.
> The patch-set seems to be close to final, so I am resuming this discussion.
>
> On 30/10/2018 19:06, Andy Lutomirski wrote:
>
> > I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>
> After reading the code, I see what you meant.
> I think I can work with it.
>
> But I have a couple of questions wrt the use of this mechanism, in the
> context of write rare.
>
>
> 1) mm_struct.
>
> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> (live patch?), which seems to happen sequentially and in a relatively
> standardized way, like replacing the NOPs specifically placed in the
> functions that need patching.
>
> This is a bit different from the more generic write-rare case, applied
> to data.
>
> As example, I have in mind a system where both IMA and SELinux are in use.
>
> In this system, a file is accessed for the first time.
>
> That would trigger 2 things:
> - evaluation of the SELinux rules and probably update of the AVC cache
> - IMA measurement and update of the measurements
>
> Both of them could be write protected, meaning that they would both have
> to be modified through the write rare mechanism.
>
> While the events, for 1 specific file, would be sequential, it's not
> difficult to imagine that multiple files could be accessed at the same time.
>
> If the update of the data structures in both IMA and SELinux must use
> the same mm_struct, that would have to be somehow regulated and it would
> introduce an unnecessary (imho) dependency.
>
> How about having one mm_struct for each writer (core or thread)?
>

I don't think that helps anything.  I think the mm_struct used for
prmem (or rare_write or whatever you want to call it) should be
entirely abstracted away by an appropriate API, so neither SELinux nor
IMA need to be aware that there's an mm_struct involved.  It's also
entirely possible that some architectures won't even use an mm_struct
behind the scenes -- x86, for example, could have avoided it if there
were a kernel equivalent of PKRU.  Sadly, there isn't.

>
>
> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> the patch might spill across the page boundary, however if I deal with
> the modification of generic data, I shouldn't (shouldn't I?) assume that
> the data will not span across multiple pages.

The reason for the particular architecture of text_poke() is to avoid
memory allocation to get it working.  i think that prmem/rare_write
should have each rare-writable kernel address map to a unique user
address, possibly just by offsetting everything by a constant.  For
rare_write, you don't actually need it to work as such until fairly
late in boot, since the rare_writable data will just be writable early
on.

>
> If the data spans across multiple pages, in unknown amount, I suppose
> that I should not keep interrupts disabled for an unknown time, as it
> would hurt preemption.
>
> What I thought, in my initial patch-set, was to iterate over each page
> that must be written to, in a loop, re-enabling interrupts in-between
> iterations, to give pending interrupts a chance to be served.
>
> This would mean that the data being written to would not be consistent,
> but it's a problem that would have to be addressed anyways, since it can
> be still read by other cores, while the write is ongoing.

This probably makes sense, except that enabling and disabling
interrupts means you also need to restore the original mm_struct (most
likely), which is slow.  I don't think there's a generic way to check
whether in interrupt is pending without turning interrupts on.
Nadav Amit Nov. 13, 2018, 5:43 p.m. UTC | #63
From: Andy Lutomirski
Sent: November 13, 2018 at 5:16:09 PM GMT
> To: Igor Stoppa <igor.stoppa@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> Hello,
>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>> Incidentally, I think it would be useful to cc also the
>> security/hardening ml.
>> The patch-set seems to be close to final, so I am resuming this discussion.
>> 
>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>> 
>>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>> 
>> After reading the code, I see what you meant.
>> I think I can work with it.
>> 
>> But I have a couple of questions wrt the use of this mechanism, in the
>> context of write rare.
>> 
>> 
>> 1) mm_struct.
>> 
>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>> (live patch?), which seems to happen sequentially and in a relatively
>> standardized way, like replacing the NOPs specifically placed in the
>> functions that need patching.
>> 
>> This is a bit different from the more generic write-rare case, applied
>> to data.
>> 
>> As example, I have in mind a system where both IMA and SELinux are in use.
>> 
>> In this system, a file is accessed for the first time.
>> 
>> That would trigger 2 things:
>> - evaluation of the SELinux rules and probably update of the AVC cache
>> - IMA measurement and update of the measurements
>> 
>> Both of them could be write protected, meaning that they would both have
>> to be modified through the write rare mechanism.
>> 
>> While the events, for 1 specific file, would be sequential, it's not
>> difficult to imagine that multiple files could be accessed at the same time.
>> 
>> If the update of the data structures in both IMA and SELinux must use
>> the same mm_struct, that would have to be somehow regulated and it would
>> introduce an unnecessary (imho) dependency.
>> 
>> How about having one mm_struct for each writer (core or thread)?
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it) should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.
> 
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
> 
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>> 
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>> 
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.

I guess that enabling IRQs might break some hidden assumptions in the code,
but is there a fundamental reason that IRQs need to be disabled? use_mm()
got them enabled, although it is only suitable for kernel threads.
Andy Lutomirski Nov. 13, 2018, 5:47 p.m. UTC | #64
On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Andy Lutomirski
> Sent: November 13, 2018 at 5:16:09 PM GMT
> > To: Igor Stoppa <igor.stoppa@gmail.com>
> > Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> > Subject: Re: [PATCH 10/17] prmem: documentation
> >
> >
> > On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
> >> Hello,
> >> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> >> Incidentally, I think it would be useful to cc also the
> >> security/hardening ml.
> >> The patch-set seems to be close to final, so I am resuming this discussion.
> >>
> >> On 30/10/2018 19:06, Andy Lutomirski wrote:
> >>
> >>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
> >>
> >> After reading the code, I see what you meant.
> >> I think I can work with it.
> >>
> >> But I have a couple of questions wrt the use of this mechanism, in the
> >> context of write rare.
> >>
> >>
> >> 1) mm_struct.
> >>
> >> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> >> (live patch?), which seems to happen sequentially and in a relatively
> >> standardized way, like replacing the NOPs specifically placed in the
> >> functions that need patching.
> >>
> >> This is a bit different from the more generic write-rare case, applied
> >> to data.
> >>
> >> As example, I have in mind a system where both IMA and SELinux are in use.
> >>
> >> In this system, a file is accessed for the first time.
> >>
> >> That would trigger 2 things:
> >> - evaluation of the SELinux rules and probably update of the AVC cache
> >> - IMA measurement and update of the measurements
> >>
> >> Both of them could be write protected, meaning that they would both have
> >> to be modified through the write rare mechanism.
> >>
> >> While the events, for 1 specific file, would be sequential, it's not
> >> difficult to imagine that multiple files could be accessed at the same time.
> >>
> >> If the update of the data structures in both IMA and SELinux must use
> >> the same mm_struct, that would have to be somehow regulated and it would
> >> introduce an unnecessary (imho) dependency.
> >>
> >> How about having one mm_struct for each writer (core or thread)?
> >
> > I don't think that helps anything.  I think the mm_struct used for
> > prmem (or rare_write or whatever you want to call it) should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
> >
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  i think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
> >
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether in interrupt is pending without turning interrupts on.
>
> I guess that enabling IRQs might break some hidden assumptions in the code,
> but is there a fundamental reason that IRQs need to be disabled? use_mm()
> got them enabled, although it is only suitable for kernel threads.
>

For general rare-writish stuff, I don't think we want IRQs running
with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

--Andy
Nadav Amit Nov. 13, 2018, 6:06 p.m. UTC | #65
From: Andy Lutomirski
Sent: November 13, 2018 at 5:47:16 PM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Igor Stoppa <igor.stoppa@gmail.com>, Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Andy Lutomirski
>> Sent: November 13, 2018 at 5:16:09 PM GMT
>>> To: Igor Stoppa <igor.stoppa@gmail.com>
>>> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: [PATCH 10/17] prmem: documentation
>>> 
>>> 
>>> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>>> Hello,
>>>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>>>> Incidentally, I think it would be useful to cc also the
>>>> security/hardening ml.
>>>> The patch-set seems to be close to final, so I am resuming this discussion.
>>>> 
>>>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>>> 
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>>> 
>>>> After reading the code, I see what you meant.
>>>> I think I can work with it.
>>>> 
>>>> But I have a couple of questions wrt the use of this mechanism, in the
>>>> context of write rare.
>>>> 
>>>> 
>>>> 1) mm_struct.
>>>> 
>>>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>>>> (live patch?), which seems to happen sequentially and in a relatively
>>>> standardized way, like replacing the NOPs specifically placed in the
>>>> functions that need patching.
>>>> 
>>>> This is a bit different from the more generic write-rare case, applied
>>>> to data.
>>>> 
>>>> As example, I have in mind a system where both IMA and SELinux are in use.
>>>> 
>>>> In this system, a file is accessed for the first time.
>>>> 
>>>> That would trigger 2 things:
>>>> - evaluation of the SELinux rules and probably update of the AVC cache
>>>> - IMA measurement and update of the measurements
>>>> 
>>>> Both of them could be write protected, meaning that they would both have
>>>> to be modified through the write rare mechanism.
>>>> 
>>>> While the events, for 1 specific file, would be sequential, it's not
>>>> difficult to imagine that multiple files could be accessed at the same time.
>>>> 
>>>> If the update of the data structures in both IMA and SELinux must use
>>>> the same mm_struct, that would have to be somehow regulated and it would
>>>> introduce an unnecessary (imho) dependency.
>>>> 
>>>> How about having one mm_struct for each writer (core or thread)?
>>> 
>>> I don't think that helps anything.  I think the mm_struct used for
>>> prmem (or rare_write or whatever you want to call it) should be
>>> entirely abstracted away by an appropriate API, so neither SELinux nor
>>> IMA need to be aware that there's an mm_struct involved.  It's also
>>> entirely possible that some architectures won't even use an mm_struct
>>> behind the scenes -- x86, for example, could have avoided it if there
>>> were a kernel equivalent of PKRU.  Sadly, there isn't.
>>> 
>>>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>>>> the patch might spill across the page boundary, however if I deal with
>>>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>>>> the data will not span across multiple pages.
>>> 
>>> The reason for the particular architecture of text_poke() is to avoid
>>> memory allocation to get it working.  i think that prmem/rare_write
>>> should have each rare-writable kernel address map to a unique user
>>> address, possibly just by offsetting everything by a constant.  For
>>> rare_write, you don't actually need it to work as such until fairly
>>> late in boot, since the rare_writable data will just be writable early
>>> on.
>>> 
>>>> If the data spans across multiple pages, in unknown amount, I suppose
>>>> that I should not keep interrupts disabled for an unknown time, as it
>>>> would hurt preemption.
>>>> 
>>>> What I thought, in my initial patch-set, was to iterate over each page
>>>> that must be written to, in a loop, re-enabling interrupts in-between
>>>> iterations, to give pending interrupts a chance to be served.
>>>> 
>>>> This would mean that the data being written to would not be consistent,
>>>> but it's a problem that would have to be addressed anyways, since it can
>>>> be still read by other cores, while the write is ongoing.
>>> 
>>> This probably makes sense, except that enabling and disabling
>>> interrupts means you also need to restore the original mm_struct (most
>>> likely), which is slow.  I don't think there's a generic way to check
>>> whether in interrupt is pending without turning interrupts on.
>> 
>> I guess that enabling IRQs might break some hidden assumptions in the code,
>> but is there a fundamental reason that IRQs need to be disabled? use_mm()
>> got them enabled, although it is only suitable for kernel threads.
> 
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

Oh.. Of course. It is sort of a measure to prevent a single malicious/faulty
write from corrupting the sensitive memory. Doing so limits the code that
can compromise the security by a write to the protected data-structures
(rephrasing for myself).

I think I should add it as a comment in your patch.
Igor Stoppa Nov. 13, 2018, 6:26 p.m. UTC | #66
On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> How about having one mm_struct for each writer (core or thread)?
>>
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)

write_rare / rarely can be shortened to wr_  which is kinda less
confusing than rare_write, since it would become rw_ and easier to
confuse with R/W

Any advice for better naming is welcome.

> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.

Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.

The high level API could be something like:

wr_memcpy(void *src, void *dst, uint_t size)

>  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.

The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.

But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.

So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?

>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.

Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered
ok. Even user-space code run from initrd should be ok, since it can be
bundled (and signed) as a single binary with the kernel.

Modules loaded from a regular filesystem are a bit more risky, because
an attack might inject a rogue key in the key-ring and use it to load
malicious modules.

>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.

The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch
of mm_struct - could be somehow tied to the latency configuration?

If preemption is disabled, the expectations on the system latency are
anyway more relaxed.

But I'm not sure how it would work against I/O.

--
igor
Igor Stoppa Nov. 13, 2018, 6:31 p.m. UTC | #67
On 13/11/2018 19:47, Andy Lutomirski wrote:

> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

Why would these be less sensitive?

But I see a big difference between my initial implementation and this one.

In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.

If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.

To access that mapping, the attack should be performed from the ISR, I
think.

--
igor
Igor Stoppa Nov. 13, 2018, 6:33 p.m. UTC | #68
I forgot one sentence :-(

On 13/11/2018 20:31, Igor Stoppa wrote:
> On 13/11/2018 19:47, Andy Lutomirski wrote:
> 
>> For general rare-writish stuff, I don't think we want IRQs running
>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> 
> Why would these be less sensitive?
> 
> But I see a big difference between my initial implementation and this one.
> 
> In my case, by using a shared mapping, visible to all cores, freezing
> the core that is performing the write would have exposed the writable
> mapping to a potential attack run from another core.
> 
> If the mapping is private to the core performing the write, even if it
> is frozen, it's much harder to figure out what it had mapped and where,
> from another core.
> 
> To access that mapping, the attack should be performed from the ISR, I
> think.

Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?

--
igor
Andy Lutomirski Nov. 13, 2018, 6:35 p.m. UTC | #69
On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:16, Andy Lutomirski wrote:
>
> > should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.
>
> Yes, that is fine. In my proposal I was thinking about tying it to the
> core/thread that performs the actual write.
>
> The high level API could be something like:
>
> wr_memcpy(void *src, void *dst, uint_t size)
>
> >  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
>
> The mm_struct - or whatever is the means to do the write on that
> architecture - can be kept hidden from the API.
>
> But the reason why I was proposing to have one mm_struct per writer is
> that, iiuc, the secondary mapping is created in the secondary mm_struct
> for each writer using it.
>
> So the updating of IMA measurements would have, theoretically, also
> write access to the SELinux AVC. Which I was trying to avoid.
> And similarly any other write rare updater. Is this correct?

If you call a wr_memcpy() function with the signature you suggested,
then you can overwrite any memory of this type.  Having a different
mm_struct under the hood makes no difference.  As far as I'm
concerned, for *dynamically allocated* rare-writable memory, you might
as well allocate all the paging structures at the same time, so the
mm_struct will always contain the mappings.  If there are serious bugs
in wr_memcpy() that cause it to write to the wrong place, we have
bigger problems.

I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
but that can wait for v2.  And it still doesn't obviously need
multiple mm_structs.

>
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  i think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
>
> Yes, that is true. I think it's safe to assume, from an attack pattern,
> that as long as user space is not started, the system can be considered
> ok. Even user-space code run from initrd should be ok, since it can be
> bundled (and signed) as a single binary with the kernel.
>
> Modules loaded from a regular filesystem are a bit more risky, because
> an attack might inject a rogue key in the key-ring and use it to load
> malicious modules.

If a malicious module is loaded, the game is over.

>
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether in interrupt is pending without turning interrupts on.
>
> The only "excuse" I have is that write_rare is opt-in and is "rare".
> Maybe the enabling/disabling of interrupts - and the consequent switch
> of mm_struct - could be somehow tied to the latency configuration?
>
> If preemption is disabled, the expectations on the system latency are
> anyway more relaxed.
>
> But I'm not sure how it would work against I/O.

I think it's entirely reasonable for the API to internally break up
very large memcpys.
Andy Lutomirski Nov. 13, 2018, 6:36 p.m. UTC | #70
On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> I forgot one sentence :-(
>
> On 13/11/2018 20:31, Igor Stoppa wrote:
> > On 13/11/2018 19:47, Andy Lutomirski wrote:
> >
> >> For general rare-writish stuff, I don't think we want IRQs running
> >> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> >
> > Why would these be less sensitive?
> >
> > But I see a big difference between my initial implementation and this one.
> >
> > In my case, by using a shared mapping, visible to all cores, freezing
> > the core that is performing the write would have exposed the writable
> > mapping to a potential attack run from another core.
> >
> > If the mapping is private to the core performing the write, even if it
> > is frozen, it's much harder to figure out what it had mapped and where,
> > from another core.
> >
> > To access that mapping, the attack should be performed from the ISR, I
> > think.
>
> Unless the secondary mapping is also available to other cores, through
> the shared mm_struct ?
>

I don't think this matters much.  The other cores will only be able to
use that mapping when they're doing a rare write.
Andy Lutomirski Nov. 13, 2018, 6:48 p.m. UTC | #71
On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
> > For general rare-writish stuff, I don't think we want IRQs running
> > with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?

I'm not really saying they're less sensitive so much as that the
considerations are different.  I think the original rare-write code is
based on ideas from grsecurity, and it was intended to protect static
data like structs full of function pointers.   Those targets have some
different properties:

 - Static targets are at addresses that are much more guessable, so
they're easier targets for most attacks.  (Not spraying attacks like
the ones you're interested in, though.)

 - Static targets are higher value.  No offense to IMA or AVC, but
outright execution of shellcode, hijacking of control flow, or compete
disablement of core security features is higher impact than bypassing
SELinux or IMA.  Why would you bother corrupting the AVC if you could
instead just set enforcing=0?  (I suppose that corrupting the AVC is
less likely to be noticed by monitoring tools.)

 - Static targets are small.  This means that the interrupt latency
would be negligible, especially in comparison to the latency of
replacing the entire SELinux policy object.

Anyway, I'm not all that familiar with SELinux under the hood, but I'm
wondering if a different approach to thinks like the policy database
might be appropriate.  When the policy is changed, rather than
allocating rare-write memory and writing to it, what if we instead
allocated normal memory, wrote to it, write-protected it, and then
used the rare-write infrastructure to do a much smaller write to
replace the pointer?

Admittedly, this creates a window where another core could corrupt the
data as it's being written.  That may not matter so much if an
attacker can't force a policy update.  Alternatively, the update code
could re-verify the policy after write-protecting it, or there could
be a fancy API to allocate some temporarily-writable memory (by
creating a whole new mm_struct, mapping the memory writable just in
that mm_struct, and activating it) so that only the actual policy
loader could touch the memory.  But I'm mostly speculating here, since
I'm not familiar with the code in question.

Anyway, I tend to think that the right way to approach mainlining all
this is to first get the basic rare write support for static data into
place and then to build on that.  I think it's great that you're
pushing this effort, but doing this for SELinux and IMA is a bigger
project than doing it for static data, and it might make sense to do
it in bite-sized pieces.

Does any of this make sense?
Igor Stoppa Nov. 13, 2018, 7:01 p.m. UTC | #72
On 13/11/2018 20:35, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

>> The high level API could be something like:
>>
>> wr_memcpy(void *src, void *dst, uint_t size)

[...]

> If you call a wr_memcpy() function with the signature you suggested,
> then you can overwrite any memory of this type.  Having a different
> mm_struct under the hood makes no difference.  As far as I'm
> concerned, for *dynamically allocated* rare-writable memory, you might
> as well allocate all the paging structures at the same time, so the
> mm_struct will always contain the mappings.  If there are serious bugs
> in wr_memcpy() that cause it to write to the wrong place, we have
> bigger problems.

Beside bugs, I'm also thinking about possible vulnerability.
It might be overthinking, though.

I do not think it's possible to really protect against control flow
attacks, unless there is some support from the HW and/or the compiler.

What is left, are data-based attacks. In this case, it would be an
attacker using one existing wr_ invocation with doctored parameters.

However, there is always the objection that it would be possible to come
up with a "writing kit" for plowing through the page tables and
unprotect anything that might be of value.

Ideally, that should be the only type of data-based attack left.

In practice, it might just be an excess of paranoia from my side.

> I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
> but that can wait for v2.  And it still doesn't obviously need
> multiple mm_structs.

I left that out, for now, but yes, typing would play some role here.

[...]

> I think it's entirely reasonable for the API to internally break up
> very large memcpys.

The criteria for deciding if/how to break it down is not clear to me,
though. The single page was easy, but might be (probably is) too much.

--
igor
Igor Stoppa Nov. 13, 2018, 7:03 p.m. UTC | #73
On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
> 
> I don't think this matters much.  The other cores will only be able to
> use that mapping when they're doing a rare write.

Which has now been promoted to target for attacks :-)

--
igor
Igor Stoppa Nov. 13, 2018, 7:35 p.m. UTC | #74
On 13/11/2018 20:48, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>
>>> For general rare-writish stuff, I don't think we want IRQs running
>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>
>> Why would these be less sensitive?
> 
> I'm not really saying they're less sensitive so much as that the
> considerations are different.  I think the original rare-write code is
> based on ideas from grsecurity, and it was intended to protect static
> data like structs full of function pointers.   Those targets have some
> different properties:
> 
>  - Static targets are at addresses that are much more guessable, so
> they're easier targets for most attacks.  (Not spraying attacks like
> the ones you're interested in, though.)
> 
>  - Static targets are higher value.  No offense to IMA or AVC, but
> outright execution of shellcode, hijacking of control flow, or compete
> disablement of core security features is higher impact than bypassing
> SELinux or IMA.  Why would you bother corrupting the AVC if you could
> instead just set enforcing=0?  (I suppose that corrupting the AVC is
> less likely to be noticed by monitoring tools.)
> 
>  - Static targets are small.  This means that the interrupt latency
> would be negligible, especially in comparison to the latency of
> replacing the entire SELinux policy object.

Your analysis is correct.
In my case, having already taken care of those, I was going *also* after
the next target in line.
Admittedly, flipping a bit located at a fixed offset is way easier than
spraying dynamically allocated data structured.

However, once the bit is not easily writable, the only options are to
either find another way to flip it (unprotect it or subvert something
that can write it) or to identify another target that is still writable.

AVC and policyDB fit the latter description.

> Anyway, I'm not all that familiar with SELinux under the hood, but I'm
> wondering if a different approach to thinks like the policy database
> might be appropriate.  When the policy is changed, rather than
> allocating rare-write memory and writing to it, what if we instead
> allocated normal memory, wrote to it, write-protected it, and then
> used the rare-write infrastructure to do a much smaller write to
> replace the pointer?

Actually, that's exactly what I did.

I did not want to overload this discussion, but since you brought it up,
I'm not sure write rare is enough.

* write_rare is for stuff that sometimes changes all the time, ex: AVC

* dynamic read only is for stuff that at some point should not be
modified anymore, but could still be destroyed. Ex: policyDB

I think it would be good to differentiate, at runtime, between the two,
to minimize the chance that a write_rare function is used against some
read_only data.

Releasing dynamically allocated protected memory is also a big topic.
In some cases it's allocated and released continuously, like in the AVC.
Maybe it can be optimized, or maybe it can be turned into an object
cache of protected object.

But for releasing, it would be good, I think, to have a mechanism for
freeing all the memory in one loop, like having a pool containing all
the memory that was allocated for a specific use (ex: policyDB)

> Admittedly, this creates a window where another core could corrupt the
> data as it's being written.  That may not matter so much if an
> attacker can't force a policy update.  Alternatively, the update code
> could re-verify the policy after write-protecting it, or there could
> be a fancy API to allocate some temporarily-writable memory (by
> creating a whole new mm_struct, mapping the memory writable just in
> that mm_struct, and activating it) so that only the actual policy
> loader could touch the memory.  But I'm mostly speculating here, since
> I'm not familiar with the code in question.

They are all corner cases ... possible but unlikely.
Another, maybe more critical, one is that the policyDB is not available
at boot.
There is a window of opportunity, before it's loaded. But it could be
mitigated by loading a barebone set of rules, either from initrd or even
as "firmware".

> Anyway, I tend to think that the right way to approach mainlining all
> this is to first get the basic rare write support for static data into
> place and then to build on that.  I think it's great that you're
> pushing this effort, but doing this for SELinux and IMA is a bigger
> project than doing it for static data, and it might make sense to do
> it in bite-sized pieces.
> 
> Does any of this make sense?

Yes, sure.

I *have* to do SELinux, but I do not necessarily have to wait for the
final version to be merged upstream. And anyways Android is on a
different kernel.

However, I think both SELinux and IMA have a value in being sufficiently
complex cases to be used for validating the design as it evolves.

Each of them has static data that could be the first target for
protection, in a smaller patch.

Lists of write rare data are probably the next big thing, in terms of
defining the API.

But I could start with introducing __wr_after_init.

--
igor
Igor Stoppa Nov. 21, 2018, 4:34 p.m. UTC | #75
Hi,

On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> I forgot one sentence :-(
>>
>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>
>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>
>>> Why would these be less sensitive?
>>>
>>> But I see a big difference between my initial implementation and this one.
>>>
>>> In my case, by using a shared mapping, visible to all cores, freezing
>>> the core that is performing the write would have exposed the writable
>>> mapping to a potential attack run from another core.
>>>
>>> If the mapping is private to the core performing the write, even if it
>>> is frozen, it's much harder to figure out what it had mapped and where,
>>> from another core.
>>>
>>> To access that mapping, the attack should be performed from the ISR, I
>>> think.
>>
>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
> 
> I don't think this matters much.  The other cores will only be able to
> use that mapping when they're doing a rare write.


I'm still mulling over this.
There might be other reasons for replicating the mm_struct.

If I understand correctly how the text patching works, it happens 
sequentially, because of the text_mutex used by arch_jump_label_transform

Which might be fine for this specific case, but I think I shouldn't 
introduce a global mutex, when it comes to data.
Most likely, if two or more cores want to perform a write rare 
operation, there is no correlation between them, they could proceed in 
parallel. And if there really is, then the user of the API should 
introduce own locking, for that specific case.

A bit unrelated question, related to text patching: I see that each 
patching operation is validated, but wouldn't it be more robust to first 
validate  all of then, and only after they are all found to be 
compliant, to proceed with the actual modifications?

And about the actual implementation of the write rare for the statically 
allocated variables, is it expected that I use Nadav's function?
Or that I refactor the code?

The name, referring to text would definitely not be ok for data.
And I would have to also generalize it, to deal with larger amounts of data.

I would find it easier, as first cut, to replicate its behavior and 
refactor only later, once it has stabilized and possibly Nadav's patches 
have been acked.

--
igor
Nadav Amit Nov. 21, 2018, 5:36 p.m. UTC | #76
> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
> Hi,
> 
> On 13/11/2018 20:36, Andy Lutomirski wrote:
>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> I forgot one sentence :-(
>>> 
>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>> 
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>> 
>>>> Why would these be less sensitive?
>>>> 
>>>> But I see a big difference between my initial implementation and this one.
>>>> 
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>> 
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>> 
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>> 
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much.  The other cores will only be able to
>> use that mapping when they're doing a rare write.
> 
> 
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
> 
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
> 
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.

I think that if you create per-CPU temporary mms as you proposed, you do not
need a global lock. You would need to protect against module unloading, and
maybe refactor the code. Specifically, I’m not sure whether protection
against IRQs is something that you need. I’m also not familiar with this
patch-set so I’m not sure what atomicity guarantees do you need.

> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
> 
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?

It’s not “my” function. ;-)

IMHO the code is in relatively good and stable state. The last couple of
versions were due to me being afraid to add BUG_ONs as Peter asked me to.

The code is rather simple, but there are a couple of pitfalls that hopefully
I avoided correctly.

Regards,
Nadav
Igor Stoppa Nov. 21, 2018, 6:01 p.m. UTC | #77
On 21/11/2018 19:36, Nadav Amit wrote:
>> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.
> 
> I think that if you create per-CPU temporary mms as you proposed, you do not
> need a global lock. You would need to protect against module unloading,

yes, it's unlikely to happen and probably a bug in the module, if it 
tries to write while being unloaded, but I can do it

> and
> maybe refactor the code. Specifically, I’m not sure whether protection
> against IRQs is something that you need.

With the initial way I used to do write rare, which was done by creating 
a temporary mapping visible to every core, disabling IRQs was meant to 
prevent that the "writer" core would be frozen and then the mappings 
scrubbed for the page in writable state.

Without shared mapping of the page, the only way to attack it should be 
to generate an interrupt on the "writer" core, while the writing is 
ongoing, and to perform the attack from the interrupt itself, because it 
is on the same core that has the writable mapping.

Maybe it's possible, but it seems to have become quite a corner case.

> I’m also not familiar with this
> patch-set so I’m not sure what atomicity guarantees do you need.

At the very least, I think I need to ensure that pointers are updated 
atomically, like with WRITE_ONCE() And spinlocks.
Maybe atomic types can be left out.

>> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
>>
>> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
> 
> It’s not “my” function. ;-)

:-P

ok, what I meant is that the signature of the __text_poke() function is 
quite specific to what it's meant to do.

I do not rule out that it might be eventually refactored as a special 
case of a more generic __write_rare() function, that would operate on 
different targets, but I'd rather do the refactoring after I have a 
clear understanding of how to alter write-protected data.

The refactoring could be the last patch of the write rare patchset.

> IMHO the code is in relatively good and stable state. The last couple of
> versions were due to me being afraid to add BUG_ONs as Peter asked me to.
> 
> The code is rather simple, but there are a couple of pitfalls that hopefully
> I avoided correctly.

Yes, I did not mean to question the quality of the code, but I'd prefer 
to not have to carry also this patchset, while it's not yet merged.

I actually hope it gets merged soon :-)
--
igor
Andy Lutomirski Nov. 21, 2018, 6:15 p.m. UTC | #78
> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
> Hi,
> 
>>> On 13/11/2018 20:36, Andy Lutomirski wrote:
>>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> 
>>> I forgot one sentence :-(
>>> 
>>>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>> 
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>> 
>>>> Why would these be less sensitive?
>>>> 
>>>> But I see a big difference between my initial implementation and this one.
>>>> 
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>> 
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>> 
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>> 
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much.  The other cores will only be able to
>> use that mapping when they're doing a rare write.
> 
> 
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
> 
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
> 
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.

Text patching uses the same VA for different physical addresses, so it need a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA.  You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.

> 
> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
> 
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
> Or that I refactor the code?

I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.

> 
> The name, referring to text would definitely not be ok for data.
> And I would have to also generalize it, to deal with larger amounts of data.
> 
> I would find it easier, as first cut, to replicate its behavior and refactor only later, once it has stabilized and possibly Nadav's patches have been acked.
> 

Sounds reasonable. You’ll still want Nadav’s code for setting up the mm in the first place, though.

> --
> igor
Igor Stoppa Nov. 22, 2018, 7:27 p.m. UTC | #79
On 21/11/2018 20:15, Andy Lutomirski wrote:

>> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.
> 
> Text patching uses the same VA for different physical addresses, so it need a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA.  You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.

I have studied the code involved with Nadav's patchset.
I am perplexed about these sentences you wrote.

More to the point (to the best of my understanding):

poking_init()
-------------
   1. it gets one random poking address and ensures to have at least 2
      consecutive PTEs from the same PMD
   2. it then proceeds to map/unmap an address from the first of the 2
      consecutive PTEs, so that, later on, there will be no need to
      allocate pages, which might fail, if poking from atomic context.
   3. at this point, the page tables are populated, for the address that
      was obtained at point 1, and this is ok, because the address is fixed

write_rare
----------
   4. it can happen on any available core / thread at any time, therefore
      each of them needs a different address
   5. CPUs support hotplug, but from what I have read, I might get away
      with having up to nr_cpus different addresses (determined at init)
      and I would follow the same technique used by Nadav, of forcing the
      mapping of 1 or 2 (1 could be enough, I have to loop anyway, at
      some point) pages at each address, to ensure the population of the
      page tables

so far, so good, but ...

   6. the addresses used by each CPU are fixed
   7. I do not understand the reference you make to
      "allocation and freeing of rare-writable ranges", because if I
      treat the range as such, then there is a risk that I need to
      populate more entries in the page table, which would have problems
      with the atomic context, unless write_rare from atomic is ruled
      out. If write_rare from atomic can be avoided, then I can also have
      one-use randomized addresses at each write-rare operation, instead
      of fixed ones, like in point 6.

and, apologies for being dense: the following is still not clear to me:

   8. you wrote:
      > You’ll still need a mutex (mmap_sem) to synchronize allocation
      > and freeing of rare-writable ranges, but that shouldn’t have much
      > contention.
      What causes the contention? It's the fact that the various cores
      are using the same mm, if I understood correctly.
      However, if there was one mm for each core, wouldn't that make it
      unnecessary to have any mutex?
      I feel there must be some obvious reason why multiple mms are not a
      good idea, yet I cannot grasp it :-(

      switch_mm_irqs_off() seems to have lots of references to
      "this_cpu_something"; if there is any optimization from having the
      same next across multiple cores, I'm missing it

[...]

> I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.

yes

[...]


> You’ll still want Nadav’s code for setting up the mm in the first place, though.

yes

--
thanks, igor
Matthew Wilcox Nov. 22, 2018, 8:04 p.m. UTC | #80
On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> I have studied the code involved with Nadav's patchset.
> I am perplexed about these sentences you wrote.
> 
> More to the point (to the best of my understanding):
> 
> poking_init()
> -------------
>   1. it gets one random poking address and ensures to have at least 2
>      consecutive PTEs from the same PMD
>   2. it then proceeds to map/unmap an address from the first of the 2
>      consecutive PTEs, so that, later on, there will be no need to
>      allocate pages, which might fail, if poking from atomic context.
>   3. at this point, the page tables are populated, for the address that
>      was obtained at point 1, and this is ok, because the address is fixed
> 
> write_rare
> ----------
>   4. it can happen on any available core / thread at any time, therefore
>      each of them needs a different address

No?  Each CPU has its own CR3 (eg each CPU might be running a different
user task).  If you have _one_ address for each allocation, it may or
may not be mapped on other CPUs at the same time -- you simply don't care.

The writable address can even be a simple formula to calculate from
the read-only address, you don't have to allocate an address in the
writable mapping space.
Andy Lutomirski Nov. 22, 2018, 8:53 p.m. UTC | #81
On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> > I have studied the code involved with Nadav's patchset.
> > I am perplexed about these sentences you wrote.
> >
> > More to the point (to the best of my understanding):
> >
> > poking_init()
> > -------------
> >   1. it gets one random poking address and ensures to have at least 2
> >      consecutive PTEs from the same PMD
> >   2. it then proceeds to map/unmap an address from the first of the 2
> >      consecutive PTEs, so that, later on, there will be no need to
> >      allocate pages, which might fail, if poking from atomic context.
> >   3. at this point, the page tables are populated, for the address that
> >      was obtained at point 1, and this is ok, because the address is fixed
> >
> > write_rare
> > ----------
> >   4. it can happen on any available core / thread at any time, therefore
> >      each of them needs a different address
>
> No?  Each CPU has its own CR3 (eg each CPU might be running a different
> user task).  If you have _one_ address for each allocation, it may or
> may not be mapped on other CPUs at the same time -- you simply don't care.
>
> The writable address can even be a simple formula to calculate from
> the read-only address, you don't have to allocate an address in the
> writable mapping space.
>

Agreed.  I suggest the formula:

writable_address = readable_address - rare_write_offset.  For
starters, rare_write_offset can just be a constant.  If we want to get
fancy later on, it can be randomized.

If we do it like this, then we don't need to modify any pagetables at
all when we do a rare write.  Instead we can set up the mapping at
boot or when we allocate the rare write space, and the actual rare
write code can just switch mms and do the write.
Igor Stoppa Dec. 4, 2018, 12:34 p.m. UTC | #82
Hello,
apologies for the delayed answer.
Please find my reply to both last mails in the thread, below.

On 22/11/2018 22:53, Andy Lutomirski wrote:
> On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
>>> I have studied the code involved with Nadav's patchset.
>>> I am perplexed about these sentences you wrote.
>>>
>>> More to the point (to the best of my understanding):
>>>
>>> poking_init()
>>> -------------
>>>    1. it gets one random poking address and ensures to have at least 2
>>>       consecutive PTEs from the same PMD
>>>    2. it then proceeds to map/unmap an address from the first of the 2
>>>       consecutive PTEs, so that, later on, there will be no need to
>>>       allocate pages, which might fail, if poking from atomic context.
>>>    3. at this point, the page tables are populated, for the address that
>>>       was obtained at point 1, and this is ok, because the address is fixed
>>>
>>> write_rare
>>> ----------
>>>    4. it can happen on any available core / thread at any time, therefore
>>>       each of them needs a different address
>>
>> No?  Each CPU has its own CR3 (eg each CPU might be running a different
>> user task).  If you have _one_ address for each allocation, it may or
>> may not be mapped on other CPUs at the same time -- you simply don't care.

Yes, somehow I lost track of that aspect.

>> The writable address can even be a simple formula to calculate from
>> the read-only address, you don't have to allocate an address in the
>> writable mapping space.
>>
> 
> Agreed.  I suggest the formula:
> 
> writable_address = readable_address - rare_write_offset.  For
> starters, rare_write_offset can just be a constant.  If we want to get
> fancy later on, it can be randomized.

ok, I hope I captured it here [1]

> If we do it like this, then we don't need to modify any pagetables at
> all when we do a rare write.  Instead we can set up the mapping at
> boot or when we allocate the rare write space, and the actual rare
> write code can just switch mms and do the write.

I did it. I have little feeling about the actual amount of data 
involved, but there is a (probably very remote) chance that the remap 
wouldn't work, at least in the current implementation.

It's a bit different from what I had in mind initially, since I was 
thinking to have one single approach to both statically allocated memory 
(is there a better way to describe it) and what is provided from the 
allocator that would come next.

As I wrote, I do not particularly like the way I implemented multiple 
functionality vs remapping, but I couldn't figure out any better way to 
do it, so eventually I kept this one, hoping to get some advice on how 
to improve it.

I did not provide yet an example, yet, but IMA has some flags that are 
probably very suitable, since they depend on policy reloading, which can 
happen multiple times, but could be used to disable it.

[1] https://www.openwall.com/lists/kernel-hardening/2018/12/04/3

--
igor
diff mbox series

Patch

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 26b735cefb93..1a90fa878d8d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@  Core utilities
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
+   prmem
 
 Interfaces for kernel debugging
 ===============================
diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
new file mode 100644
index 000000000000..16d7edfe327a
--- /dev/null
+++ b/Documentation/core-api/prmem.rst
@@ -0,0 +1,172 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _prmem:
+
+Memory Protection
+=================
+
+:Date: October 2018
+:Author: Igor Stoppa <igor.stoppa@huawei.com>
+
+Foreword
+--------
+- In a typical system using some sort of RAM as execution environment,
+  **all** memory is initially writable.
+
+- It must be initialized with the appropriate content, be it code or data.
+
+- Said content typically undergoes modifications, i.e. relocations or
+  relocation-induced changes.
+
+- The present document doesn't address such transient.
+
+- Kernel code is protected at system level and, unlike data, it doesn't
+  require special attention.
+
+Protection mechanism
+--------------------
+
+- When available, the MMU can write protect memory pages that would be
+  otherwise writable.
+
+- The protection has page-level granularity.
+
+- An attempt to overwrite a protected page will trigger an exception.
+- **Write protected data must go exclusively to write protected pages**
+- **Writable data must go exclusively to writable pages**
+
+Available protections for kernel data
+-------------------------------------
+
+- **constant**
+   Labelled as **const**, the data is never supposed to be altered.
+   It is statically allocated - if it has any memory footprint at all.
+   The compiler can even optimize it away, where possible, by replacing
+   references to a **const** with its actual value.
+
+- **read only after init**
+   By tagging an otherwise ordinary statically allocated variable with
+   **__ro_after_init**, it is placed in a special segment that will
+   become write protected, at the end of the kernel init phase.
+   The compiler has no notion of this restriction and it will treat any
+   write operation on such variable as legal. However, assignments that
+   are attempted after the write protection is in place, will cause
+   exceptions.
+
+- **write rare after init**
+   This can be seen as variant of read only after init, which uses the
+   tag **__wr_after_init**. It is also limited to statically allocated
+   memory. It is still possible to alter this type of variables, after
+   the kernel init phase is complete, however it can be done exclusively
+   with special functions, instead of the assignment operator. Using the
+   assignment operator after conclusion of the init phase will still
+   trigger an exception. It is not possible to transition a certain
+   variable from __wr_ater_init to a permanent read-only status, at
+   runtime.
+
+- **dynamically allocated write-rare / read-only**
+   After defining a pool, memory can be obtained through it, primarily
+   through the **pmalloc()** allocator. The exact writability state of the
+   memory obtained from **pmalloc()** and friends can be configured when
+   creating the pool. At any point it is possible to transition to a less
+   permissive write status the memory currently associated to the pool.
+   Once memory has become read-only, it the only valid operation, beside
+   reading, is to released it, by destroying the pool it belongs to.
+
+
+Protecting dynamically allocated memory
+---------------------------------------
+
+When dealing with dynamically allocated memory, three options are
+ available for configuring its writability state:
+
+- **Options selected when creating a pool**
+   When creating the pool, it is possible to choose one of the following:
+    - **PMALLOC_MODE_RO**
+       - Writability at allocation time: *WRITABLE*
+       - Writability at protection time: *NONE*
+    - **PMALLOC_MODE_WR**
+       - Writability at allocation time: *WRITABLE*
+       - Writability at protection time: *WRITE-RARE*
+    - **PMALLOC_MODE_AUTO_RO**
+       - Writability at allocation time:
+           - the latest allocation: *WRITABLE*
+           - every other allocation: *NONE*
+       - Writability at protection time: *NONE*
+    - **PMALLOC_MODE_AUTO_WR**
+       - Writability at allocation time:
+           - the latest allocation: *WRITABLE*
+           - every other allocation: *WRITE-RARE*
+       - Writability at protection time: *WRITE-RARE*
+    - **PMALLOC_MODE_START_WR**
+       - Writability at allocation time: *WRITE-RARE*
+       - Writability at protection time: *WRITE-RARE*
+
+   **Remarks:**
+    - The "AUTO" modes perform automatic protection of the content, whenever
+       the current vmap_area is used up and a new one is allocated.
+        - At that point, the vmap_area being phased out is protected.
+        - The size of the vmap_area depends on various parameters.
+        - It might not be possible to know for sure *when* certain data will
+          be protected.
+        - The functionality is provided as tradeoff between hardening and speed.
+        - Its usefulness depends on the specific use case at hand
+    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
+
+- **Protecting the pool**
+   This is achieved with **pmalloc_protect_pool()**
+    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
+    - Any residual space still available from the current vmap_area is lost, as the area is protected.
+    - **protecting a pool after every allocation will likely be very wasteful**
+    - Using PMALLOC_MODE_START_WR is likely a better choice.
+
+- **Upgrading the protection level**
+   This is achieved with **pmalloc_make_pool_ro()**
+    - it turns the present content of a write-rare pool into read-only
+    - can be useful when the content of the memory has settled
+
+
+Caveats
+-------
+- Freeing of memory is not supported. Pages will be returned to the
+  system upon destruction of their memory pool.
+
+- The address range available for vmalloc (and thus for pmalloc too) is
+  limited, on 32-bit systems. However it shouldn't be an issue, since not
+  much data is expected to be dynamically allocated and turned into
+  write-protected.
+
+- Regarding SMP systems, changing state of pages and altering mappings
+  requires performing cross-processor synchronizations of page tables.
+  This is an additional reason for limiting the use of write rare.
+
+- Not only the pmalloc memory must be protected, but also any reference to
+  it that might become the target for an attack. The attack would replace
+  a reference to the protected memory with a reference to some other,
+  unprotected, memory.
+
+- The users of rare write must take care of ensuring the atomicity of the
+  action, respect to the way they use the data being altered; for example,
+  take a lock before making a copy of the value to modify (if it's
+  relevant), then alter it, issue the call to rare write and finally
+  release the lock. Some special scenario might be exempt from the need
+  for locking, but in general rare-write must be treated as an operation
+  that can incur into races.
+
+- pmalloc relies on virtual memory areas and will therefore use more
+  tlb entries. It still does a better job of it, compared to invoking
+  vmalloc for each allocation, but it is undeniably less optimized wrt to
+  TLB use than using the physmap directly, through kmalloc or similar.
+
+
+Utilization
+-----------
+
+**add examples here**
+
+API
+---
+
+.. kernel-doc:: include/linux/prmem.h
+.. kernel-doc:: mm/prmem.c
+.. kernel-doc:: include/linux/prmemextra.h
diff --git a/MAINTAINERS b/MAINTAINERS
index ea979a5a9ec9..246b1a1cc8bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9463,6 +9463,7 @@  F:	include/linux/prmemextra.h
 F:	mm/prmem.c
 F:	mm/test_write_rare.c
 F:	mm/test_pmalloc.c
+F:	Documentation/core-api/prmem.rst
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org